Statistical Significance for DNA Motif Discovery
The identification of transcription factor binding sites, and of cis-regulatory elements in general, is an important step in understanding the regulation of gene expression. To address this need, many motif-finding tools have been described that can find short sequence motifs given only an input set of sequences. In this dissertation, we will begin by discussing why a reliable significance evaluation should be considered an essential component of any motif finder. We will introduce a biologically realistic method to estimate a reported motif's statistical significance based on a novel 3-Gamma approximation scheme. Furthermore, we show how the reliability of the significance evaluation can be improved by incorporating local base composition information into its null model. We then demonstrate its reliability by applying GIMSAN/MOTISAN, a de novo motif-finding tool that incorporates this novel significance evaluation technique, to a well-studied set of Saccharomyces cerevisiae motif input data. Our results also reveal that an ensemble method based on our significance evaluation can substantially improve the actual motif-finding task. Finally, we will present the ALICO (Alignment Constrained) null set generator: a framework to generate randomized versions of an input multiple sequence alignment that preserve some of its crucial features, including its dependence structure. In particular, we will show that, on average, ALICO samples approximately preserve the PIDs (percent identities) between every pair of input sequences as well as the average Markov model composition. We will demonstrate its utility in phylogenetic motif finders (motif-finding tools that leverage conservation information) in terms of both the reliability of the statistical significance and the improvement of the motif-finding task through an ensemble method.
motif discovery; sequence analysis; computational biology
Booth, James; Friedman, Eric J.
Ph.D., Computer Science
Doctor of Philosophy
dissertation or thesis