Statistical Methods For Genome Variant Calling And Population Genetic Inference From Next-Generation Sequencing Data
Next-generation sequencing (NGS) technology has been widely adopted as a platform for detecting DNA sequence variation; accurate and rapid detection of genome variation from NGS data is therefore critical for population genetic analyses. In my dissertation, I present three models that I developed to detect genome variation with high accuracy. In Chapter 2, I analyzed sequence data from orang-utans. The two orang-utan species, Pongo pygmaeus (Bornean) and Pongo abelii (Sumatran), are great apes found on the islands of Borneo and Sumatra. Populations on both islands share a common ancestry but were isolated after the split. Owing to recent deforestation on both islands, these species are critically endangered. Knowing their demographic history will not only help us better protect them but will also provide a higher-resolution evolutionary map for primates. It will also give us a powerful perspective on hominid biology, because orang-utans are the most phylogenetically distant great apes from humans. In this study, we sampled five wild-caught orang-utans from each of the two populations. One individual was sequenced to 20X coverage; the rest have median coverages of 6-8X. I developed a Bayesian population genomic variation detection tool that not only captures the population structure between these two populations but also pools allele frequency information across all individuals within the same population to boost the power of variation detection in low-coverage individuals. Our analysis revealed that, compared to other primates, the orang-utan genome has many unique features. From the population perspective, both Pongo species are deeply diverse; however, Sumatran individuals possess greater diversity than their Bornean counterparts, and more species-specific variation.
Our estimate of the Bornean/Sumatran speciation time, 400k years ago (ya), is more recent than in most previous studies and underscores the complexity of the orang-utan speciation process. Despite a smaller modern census population size, the Sumatran effective population size (Ne) expanded exponentially relative to the ancestral Ne after the split, while Bornean Ne declined over the same period with greater accumulation of deleterious mutations. Despite some evidence for stronger negative selection in Sumatran orang-utans, fitting different selection models to nonsynonymous SNPs on top of the baseline demographic model using ∂a∂i showed that the distribution of selection forces is actually similar to that in humans, with roughly 80% of mutations having a selection coefficient more negative than s ≈ 3 × 10⁻⁵. In Chapter 3, I undertook a second project aimed at understanding the molecular mechanisms that lead to mutation variation in yeast. This work is likely to provide insights not only into molecular evolution but also into human disease progression. To analyze, with limited bias, genomic features associated with DNA polymerase errors, we performed a genome-wide analysis of mutations that accumulate in mismatch repair (MMR) deficient diploid lines of Saccharomyces cerevisiae. These lines were derived from a common ancestor and were grown for 160 generations, with bottlenecks reducing the population to one cell every twenty generations. We sequenced one wild-type and three mutator lines at coverages between eight- and twenty-fold using Illumina Solexa 36-bp single reads. Using an experimentally aware Bayesian genotype caller developed to pool experimental data across sequencing runs for all strains, we detected 28 heterozygous single-nucleotide polymorphisms (SNPs) and 48 single-nucleotide (nt) insertions/deletions (indels) in the data set.
This method was evaluated on simulated data sets and found to have a very low false positive rate (~6 × 10⁻⁵) and a false negative rate of 0.08 within the unique (i.e., non-repetitive) mapping regions of the genome that had at least sevenfold coverage. The heterozygous mutations identified by the Bayesian genotype caller were confirmed by Sanger sequencing. Our findings are of interest because frameshift mutations in homopolymer (HP) tracts, which are present at high levels in the yeast genome (> 77,400 for five to twenty nt HP tracts), are likely to disrupt gene function, and they further demonstrate that the mutation pattern seen previously in mismatch repair defective strains using a limited number of reporters holds true for the entire genome. In Chapter 4, I present an analysis of mutation hotspots in yeast deficient in DNA mismatch repair (MMR). Classical evolutionary theory assumes that mutations occur randomly in the genome; however, studies performed in a variety of organisms indicate the existence of context-dependent mutational biases. All of these biases involve local sequence context (e.g., increased rate of cytosine deamination at methylated CpGs in mammals), but sources of variation in mutagenesis across larger genomic contexts (e.g., tens or hundreds of bases) have not been identified. We therefore used high-coverage whole-genome sequencing (>200X coverage) of progenitor and derived conditional MMR mutant lines of diploid yeast to confidently identify, using a log-likelihood ratio test, 92 mutations that accumulated after 160 generations of vegetative growth. We found that the 73 single- and double-bp insertion/deletion mutations accumulate much more frequently in homopolymeric poly-A and poly-T tracts, with all mutations occurring at sites with homopolymeric runs of at least 5 nt.
Surprisingly, we demonstrated that the likelihood of an indel mutation in a given poly(dA:dT) homopolymeric tract is increased by the presence of nearby poly(dA:dT) tracts within a region of up to 1000 bp centered on the given tract. Furthermore, we identified nine positions that were mutated independently in at least two replicate lines; these all occurred at sites with homopolymeric runs of at least 8 nt, suggesting greater instability for poly A(n) or poly T(n) sites with higher n. Our work suggests that specific mutation hotspots can contribute disproportionately to the genetic variation that is introduced into populations, and it provides the first long-range genomic sequence context that contributes to mutagenesis.
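The idea of pooling allele frequency information within a population to strengthen genotype calls in low-coverage individuals, described above for Chapter 2, can be illustrated with a minimal Python sketch. This is not the dissertation's model. It assumes biallelic sites, a fixed per-base error rate, a Hardy-Weinberg genotype prior, and a simple EM estimate of the allele frequency; all function names and the starting value are hypothetical.

```python
import math

def genotype_log_likelihoods(ref_count, alt_count, error_rate=0.01):
    """Log-likelihood of the read counts for genotypes with 0, 1, 2 alt alleles.

    The binomial coefficient is omitted because it is the same for all genotypes.
    """
    logliks = []
    for g in (0, 1, 2):
        # Probability that a single read shows the alt allele given genotype g.
        p_alt = (g / 2) * (1 - error_rate) + (1 - g / 2) * error_rate
        logliks.append(alt_count * math.log(p_alt)
                       + ref_count * math.log(1 - p_alt))
    return logliks

def hwe_prior(freq):
    """Hardy-Weinberg genotype prior for alt allele frequency `freq` (assumption)."""
    return [(1 - freq) ** 2, 2 * freq * (1 - freq), freq ** 2]

def posteriors(logliks, prior):
    """Normalized genotype posterior probabilities."""
    m = max(logliks)  # subtract the maximum for numerical stability
    weights = [math.exp(l - m) * p for l, p in zip(logliks, prior)]
    total = sum(weights)
    return [w / total for w in weights]

def estimate_allele_frequency(counts, error_rate=0.01, iterations=50):
    """EM estimate of the alt allele frequency pooled across individuals.

    `counts` is a list of (ref_count, alt_count) pairs, one per individual.
    """
    freq = 0.1  # hypothetical starting value
    logliks = [genotype_log_likelihoods(r, a, error_rate) for r, a in counts]
    for _ in range(iterations):
        expected_alt = 0.0
        for ll in logliks:
            post = posteriors(ll, hwe_prior(freq))
            expected_alt += post[1] + 2 * post[2]  # expected alt allele count
        freq = expected_alt / (2 * len(counts))
        freq = min(max(freq, 1e-6), 1 - 1e-6)  # keep the prior away from 0 and 1
    return freq

if __name__ == "__main__":
    # Made-up read counts at one site for five individuals of one population.
    counts = [(6, 0), (3, 3), (0, 7), (5, 1), (4, 0)]
    freq = estimate_allele_frequency(counts)
    for ref, alt in counts:
        post = posteriors(genotype_log_likelihoods(ref, alt), hwe_prior(freq))
        print(ref, alt, [round(p, 3) for p in post])
```

In this sketch, the individual with counts (5, 1) has too few reads to be called confidently alone; the pooled frequency estimate shifts the prior so that evidence from the other individuals informs the call.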
Next Generation Sequencing; SNP Calling Algorithm; Bayesian Model; Statistics Genomics; Population Genetics
Bustamante, Carlos D.
Alani, Eric; Wells, Martin Timothy
Ph.D. of Statistics
Doctor of Philosophy
dissertation or thesis