Statistical Significance For DNA Motif Discovery

dc.contributor.author: Ng, Patrick (en_US)
dc.contributor.chair: Keich, Uri (en_US)
dc.contributor.committeeMember: Booth, James (en_US)
dc.contributor.committeeMember: Friedman, Eric J. (en_US)
dc.description.abstract: The identification of transcription factor binding sites, and of cis-regulatory elements in general, is an important step in understanding the regulation of gene expression. To address this need, many motif-finding tools have been described that can find short sequence motifs given only an input set of sequences. In this dissertation, we begin by discussing why a reliable significance evaluation should be considered an essential component of any motif finder. We introduce a biologically realistic method to estimate the statistical significance of a reported motif, based on a novel 3-Gamma approximation scheme. Furthermore, we show how the reliability of the significance evaluation can be improved by incorporating local base composition information into its null model. We then demonstrate its reliability by applying GIMSAN/MOTISAN, a de novo motif-finding tool that incorporates this novel significance evaluation technique, to a well-studied set of Saccharomyces cerevisiae motif input data. Our results also reveal that an ensemble method based on our significance evaluation can substantially improve the actual motif-finding task. Finally, we present the ALICO (Alignment Constrained) null set generator: a framework to generate randomized versions of an input multiple sequence alignment that preserve some of its crucial features, including its dependence structure. In particular, we show that, on average, ALICO samples approximately preserve the PIDs (percent identities) between every pair of input sequences as well as the average Markov model composition. We demonstrate its utility in phylogenetic motif finders (motif-finding tools that leverage conservation information) in terms of both the reliability of the statistical significance and the improvement of the motif-finding task through an ensemble method. (en_US)
dc.identifier.other: bibid: 7745353
dc.subject: motif discovery (en_US)
dc.subject: sequence analysis (en_US)
dc.subject: computational biology (en_US)
dc.title: Statistical Significance For DNA Motif Discovery (en_US)
dc.type: dissertation or thesis (en_US)
Degree: Doctor of Philosophy (Ph. D.), Computer Science


Original bundle: 1 file (Adobe Portable Document Format, 2.31 MB)