Applied Machine Learning and Bioinformatics Methods for Advanced Single-Cell Metabolic and Taxonomic Analysis for Environmental Applications
Access to this document is restricted. Some items have been embargoed at the request of the author, but will be made publicly available after the "No Access Until" date.
Single-cell technology has emerged as a promising tool for high-resolution, fundamental studies in environmental microbiology, surpassing traditional cultivation-based and bulk-measurement methods. With advances in machine learning, the taxonomic, metabolic, and functional analysis of datasets generated by single-cell technology has reached unprecedented scale and efficiency. This dissertation proposes several methods and pipelines that integrate machine learning and bioinformatics to improve sampling size assessment, taxonomic analysis, and metabolic function analysis of environmental datasets obtained through single-cell technologies. First, a sampling size assessment protocol was developed that requires no prior knowledge of population sizes and is designed specifically for single-cell-based sampling from large communities such as environmental microbial communities. This protocol aims to standardize sampling size assessment across all single-cell technologies, replacing conventional empirical estimation. Second, a standardized pipeline for single-cell Raman spectroscopy (SCRS) classification, suitable for environmental applications, was developed to reveal biochemical fingerprints linked to taxonomic and cell-age differentiation. Third, an improved agent-based metabolic simulation model was developed that incorporates cell-state heterogeneity, metabolic pathway switching, and metabolic phenotypes, providing unprecedented resolution for investigating phenotype-based microbial interactions. The model was validated using single-cell phenotypic survey datasets such as SCRS data. Fourth, a pipeline was applied to investigate microdiversity in 16S rRNA amplicon sequencing datasets by resolving operational taxonomic units (OTUs) into subgenus-level taxa. These resolved taxa offer better resolution for single-cell technologies and reveal co-occurrence patterns among similar environments, enhancing our understanding of microbial interactions. Lastly, metagenomic analysis was applied to compare taxonomic, functional, and core marker gene distinctions between Enhanced Biological Phosphorus Removal (EBPR) and Side-Stream EBPR systems. The results indicated that fine-scale microdiversity is more crucial than overall functional profiling and highlighted knowledge gaps regarding novel species of core functional organisms.