A Statistical Model Of Transcription Factor Binding Site Gain And Loss
Gene regulation is a critical determinant of an organism's phenotype. Furthermore, there is mounting evidence that gene regulation, or rather gradual changes in gene regulation as a result of regulatory sequence turnover, has played an important role in evolution and speciation. Given the remarkable phenotypic diversity among mammals, it is surprising to learn that all share roughly the same set of around 20,000 genes, most of which are highly conserved. Indeed, the degree of divergence at the gene level fails to explain the diversity observed among mammals, suggesting that the balance of the phenotypic changes is explained not by differences in the genes themselves but by when and where genes are used in different species. The spatiotemporal expression patterns of genes are intricately controlled through transcriptional regulation, a multifactorial process involving interactions between a host of regulatory sequences, DNA-binding proteins and cofactors, signaling pathways and epigenetic factors. Transcription factor binding sites (TFBSs) are an important class of regulatory sequences involved in gene regulation, and TFBSs are known to be frequently gained and lost in mammalian genomes. This is consistent with an important role for TFBSs in gene regulatory evolution. However, little is known about the TFBS turnover process and its relationship to gene regulatory evolution and, by extension, to phenotypic change and adaptive evolution. To gain insight into TFBS turnover, it is necessary to reliably identify TFBSs that have been gained or lost in a lineage-specific manner over the course of regulatory evolution. Here I present a phylogenetic hidden Markov model (phylo-HMM) that describes the process of lineage-specific TFBS gain and loss, and I test its performance on simulated and biological datasets using two methods: a Viterbi algorithm implementation and a Gibbs sampler.
In both contexts, the model performs well on simulated data but does not appear robust to violations of the model assumptions that are present in biological datasets. With further refinement, the model and methods may yield better performance on real data. However, key limitations include large memory and computational requirements and a need to simplify the model and restrict the dataset size to ensure tractability. These shortcomings increase the user inputs required to apply the methods and complicate data interpretation and generalization, thus limiting the utility of the methods.
dissertation or thesis