Population Genetic Inference When Mutation Rates are Context-Dependent
Population genetic studies often analyze patterns of single nucleotide polymorphisms (SNPs) to gain insight into the evolutionary history of a population. One summary statistic that has proved invaluable in these efforts is the frequency distribution of derived mutations (i.e., the site-frequency spectrum, or SFS). To generate the SFS, orthologous sequences from closely related outgroup species are frequently used to distinguish the ancestral and derived alleles at each SNP, under the parsimony assumption that the ancestral allele is the one that matches the outgroup. In a series of studies, I test the robustness of this parsimony assumption under a more realistic finite-sites model of context-dependent mutation biases inferred along the human lineage. I show (using both simulations and a theoretical model) that enough unobserved substitutions could have occurred since the divergence of human and chimpanzee to cause a shift in the SFS. The shifted SFS induced by misidentifying the ancestral states of some SNPs can lead to poorly fitting demographic models and cause many statistical tests to spuriously reject neutrality in favor of models with positive selection. A novel model of the context-dependent mutation process allows polymorphism data to be corrected for the effect of ancestral misidentification. Using this correction, statistical tests return to their proper rejection rates, allowing for more accurate inference of both demographic events and the strength and abundance of natural selection. This correction is used to better understand the evolution of GC-content in the human genome, and to perform accurate demographic inference in two populations of the biomedically important rhesus macaque. Finally, I present a new forward simulation program, SFS_CODE, that can simulate several populations under a Wright-Fisher-style island model.
This program is highly flexible, allowing the user to simulate several loci (with or without linkage), where each locus can be annotated as coding or non-coding, sex-linked or autosomal, and selected or neutral. In addition to providing the source code for our program, we have also developed a web server that allows users to perform simulations using the high performance computing resources of the Computational Biology Service Unit at Cornell University (http://cbsuapps.tc.cornell.edu/sfscode.aspx).
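The shift in the SFS described above can be illustrated with a minimal sketch. It is not the context-dependent correction developed in the dissertation: it assumes, purely for illustration, that every SNP has the same misidentification probability `p` and that the true SFS follows the standard neutral expectation (proportional to 1/i). The function names and parameter values are hypothetical. When the ancestral state of a SNP is misidentified, a derived allele observed i times in a sample of n chromosomes is instead counted as occurring n - i times, so the observed SFS is a mixture of the true SFS and its mirror image.

```python
def neutral_sfs(n):
    """Expected proportion of SNPs at derived-allele counts 1..n-1
    under the standard neutral model (proportional to 1/i)."""
    weights = [1.0 / i for i in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]


def misidentified_sfs(sfs, p):
    """Mix an SFS with its reflection.

    sfs[k] is the proportion of SNPs with derived count k + 1.
    p is an assumed, uniform probability that the ancestral state
    of a SNP is misidentified (a simplification for illustration).
    A misidentified SNP at count i is recorded at count n - i,
    which corresponds to reversing the list.
    """
    flipped = sfs[::-1]
    return [(1 - p) * s + p * f for s, f in zip(sfs, flipped)]


n = 10    # hypothetical sample size (number of chromosomes)
p = 0.05  # hypothetical misidentification probability

true_sfs = neutral_sfs(n)
observed_sfs = misidentified_sfs(true_sfs, p)

for count, (t, o) in enumerate(zip(true_sfs, observed_sfs), start=1):
    print(f"derived count {count}: true {t:.4f}, observed {o:.4f}")
```

Even a small `p` moves SNPs from the abundant low-frequency classes into the sparse high-frequency classes. The result is an apparent excess of high-frequency derived alleles, the kind of distortion that the abstract says can cause tests to spuriously reject neutrality.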
Special Committee Chair: Carlos D. Bustamante
Special Committee Members: Andrew G. Clark, Richard T. Durrett
population genetics; statistical inference; ancestral misidentification; forward simulation; single nucleotide polymorphism
dissertation or thesis