Academic Publications

Data Science / Bioinformatics / Machine Learning Approaches

Tarca AL, Pataki BÁ, Romero R, Sirota M, Guan Y, Kutum R, Gomez-Lopez N, Done B, Bhatti G, Yu T, Andreoletti G, Chaiworapongsa T; DREAM Preterm Birth Prediction Challenge Consortium; Hassan SS, Hsu CD, Aghaeepour N, Stolovitzky G, Csabai I, Costello JC. Crowdsourcing assessment of maternal blood multi-omics for predicting gestational age and preterm birth. Cell Rep Med. 2021 Jun 15;2(6):100323. doi: 10.1016/j.xcrm.2021.100323. PMID: 34195686.

In the above publication I organized a team of fellow grad students at South Dakota State University that competed in this biological machine learning competition. We used both R and Python for feature engineering, feature selection, and modeling with XGBoost and neural networks on this high-dimensional data to predict risk of preterm birth and gestational age.

Liu J, Liu S, Zheng K, Tang M, Gu L, Young J, Wang Z, Qiu Y, Dong J, Gu S, Xiong L, Zhou R, Nie L. Chromosome-level genome assembly of the Chinese three-keeled pond turtle (Mauremys reevesii) provides insights into freshwater adaptation. Mol Ecol Resour. 2022 May;22(4):1596-1605. doi: 10.1111/1755-0998.13563. Epub 2021 Dec 9. PMID: 34845835.

In the above publication I contributed to the genome assembly and synteny analysis, which were run on the SDSU HPC.

Manuscript In Submission 2023: Young J, Gu L, Zhou R. Secondary Metabolites Are Highly Predictive of Diazotrophic Cyanobacteria Strains

In the above manuscript we mined secondary metabolites from CyanoDB, mapped them to evidence for the presence or absence of diazotrophic capabilities, and validated a model using leave-one-out cross-validation. We achieved an AUC of ~0.98 on the holdout, which gives us high confidence in the model. Its usefulness lies in prioritizing research on organisms based on their associated metabolites, for a higher hit rate on diazotrophic phenotypes.

Manuscript In Submission 2023: Young J, Gu L, Zhou R. Predicting Cyanobacterial FOX Genes with A Data-Centric Machine Learning Approach

In the above manuscript I merged proteomic, transcriptomic, and genome-level (promoter) data to build a model that is predictive of FOX genes (necessary for oxygen-tolerant nitrogen fixation in Cyanobacteria). The model is validated on experimentally determined FOX genes and non-essential genes, with a holdout AUC in the 0.85 range. The usefulness of this model is its ability to prioritize genes of unknown diazotrophic importance for further investigation, with an expected higher hit rate.

Wet-Lab Approaches

Young, J, Gu, L, Gibbons, W, Zhou, R. (2021). Harnessing Solar-Powered Oxic N2-fixing Cyanobacteria for the BioNitrogen Economy. Cyanobacteria Biotechnology (eds J. Nielsen, S. Lee, G. Stephanopoulos and P. Hudson).

In the above chapter I laid out the landscape of nitrogen fixation research, synthesized current trends across disciplines, and used calculations to represent the economic opportunity for industrial scale biological nitrogen fixation.

Young, J, Gu, L, Hildreth, M, Zhou, R. Unicellular Cyanobacteria Exhibit Light-Driven, Oxygen-Tolerant, Constitutive Nitrogenase Activity Under Continuous Illumination. bioRxiv 619353; doi:

In the above publication I found that a unique nitrogen-fixation phenotype that tolerates net gains of oxygen is dependent on light adaptation, photosystems, and rapid protein production. I used tools including GCMS, Oxygraph, controlled growth chambers, inhibition assays, proteomics, and network analysis. This manuscript claims a very odd and counter-intuitive phenotype, so our lab has been pursuing additional work on it (mainly time-series proteomics) before submitting for publication.


In the above publication I used microscopy to quantify cell-surface fluorescence of Cyanothece under different lectin treatments to identify cell envelope polysaccharides. Cell envelope development is an important aspect of nitrogen-fixation capabilities in Cyanobacteria.
