Contributions To The Statistical And Computational Modeling Of DNA Transcription Regulation
Transcription is a fundamental and tightly regulated process in living cells and a key step in the expression of the information contained in DNA. A wide variety of experimental assays have been developed that enable genome-wide analysis of the features of transcription and transcription regulation. We present statistical analyses that combine large existing datasets with new experimental assays to explore three aspects of transcription regulation: (i) determinants of transcription factor binding intensity, (ii) characterization of transcription initiation regions at both promoters and enhancers, and (iii) unsupervised identification of transcription units.
Transcription factor binding intensity is affected by both DNA sequence and the local chromatin landscape. We aimed to disentangle these influences by combining PB-seq (a new experimental approach developed by Michael Guertin) with existing modENCODE data in a study of Drosophila Heat Shock Factor (HSF). PB-seq enabled the estimation of the genome-wide binding energy landscape in the absence of chromatin. It further allowed the development of a statistical model that predicts the departure of in vivo binding intensities (from ChIP-seq) from the naked-DNA binding intensities (from PB-seq), based on covariates describing the local chromatin environment before heat shock. We found that DNase I hypersensitivity and tetra-acetylation of H4 were the most influential covariates. Furthermore, DNase I hypersensitivity could itself be largely recapitulated from the remaining covariates. Lastly, PB-seq data were used to develop an unbiased model of HSF binding sequences, which revealed distinct biophysical properties of the HSF/HSE interaction and a previously unrecognized substructure within the HSE.
Transcription initiation regions at promoters and enhancers have conventionally been treated separately, although they share many features in mammals. We examined all transcription initiation sites, for both stable and unstable transcripts, using GRO-cap (a new experimental assay developed by Leighton Core). Statistical modeling and analysis of these data, and their contrast with existing ENCODE datasets, reveal a common architecture of initiation at both promoters and enhancers. This common architecture features tightly spaced (110 bp) divergent initiation with similar frequencies of core-promoter sequence elements, highly positioned flanking nucleosomes, and two modes of TF binding. Transcript elongation stability, a feature determined after transcription initiation, provides a more fundamental distinction between promoters and enhancers than the relative abundance of histone modifications or the presence of TFs or co-activators. These results support a unified model of transcription initiation at both promoters and enhancers.
Finally, we turn to the identification of transcription units from nascent RNA assays (GRO-seq and PRO-seq). Although existing annotations focus on stable RNA transcripts, which end at the cleavage and polyadenylation point, transcription extends beyond the cleavage site. As such, the transcription process can potentially influence surrounding regions. We improve on previous work on the detection of transcription units by developing an unsupervised method that does not depend on RNA product annotations. We use these results to examine transcription extending past the polyadenylation site and cross-strand RNA polymerase collision effects.
Transcription Regulation; Genomics; Probabilistic Sequence Analysis
Siepel, Adam Charles
Hooker, Giles J.; Lis, John T.
Ph. D., Computational Biology
Doctor of Philosophy
dissertation or thesis