SINGLE NUCLEOTIDE RESOLUTION MODELING OF TRANSCRIPTION INITIATION
Deciphering how cis-regulatory DNA sequences encode transcriptional regulation, and how noncoding genetic variation disrupts these processes, is a central challenge in human genetics. This thesis develops deep learning models to characterize the sequence basis of transcriptional regulation and highlights methods for improving variant effect prediction. First, I present CLIPNET, a base-pair resolution model of transcription initiation trained on population-scale PRO-cap data. CLIPNET reveals a combinatorial grammar in which transcriptional activators and core promoter motifs act synergistically to determine transcription initiation, with activators primarily modulating initiation levels and core promoter elements specifying initiation sites. The model further identifies DPR motifs and AT-rich candidate TFIID-binding sequences as prevalent determinants of transcription initiation in TATA-less promoters. Next, I show that training sequence-to-function models on functional genomic data with matched personal genomes substantially improves prediction of the molecular impact of genetic variants. Variant effect representations learned in this framework transfer across cell types and experimental readouts. Collectively, this work advances understanding of the cis-regulatory code of transcription and establishes strategies for improving variant effect prediction.