Statistical Significance for DNA Motif Discovery

The identification of transcription factor binding sites, and of cis-regulatory elements in general, is an important step in understanding the regulation of gene expression. To address this need, many motif-finding tools have been described that can find short sequence motifs given only an input set of sequences. In this dissertation, we begin by discussing why a reliable significance evaluation should be considered an essential component of any motif finder. We introduce a biologically realistic method to estimate the statistical significance of a reported motif, based on a novel 3-Gamma approximation scheme. Furthermore, we show how the reliability of the significance evaluation can be further improved by incorporating local base composition information into its null model. We then demonstrate its reliability by applying GIMSAN/MOTISAN, a de novo motif-finding tool that incorporates this novel significance evaluation technique, to a well-studied set of Saccharomyces cerevisiae motif input data. Our results also reveal that an ensemble method based on our significance evaluation can substantially improve the actual motif-finding task. Finally, we present the ALICO (Alignment Constrained) null set generator: a framework for generating randomized versions of an input multiple sequence alignment that preserve some of its crucial features, including its dependence structure. In particular, we show that, on average, ALICO samples approximately preserve the PIDs (percent identities) between every pair of input sequences as well as the average Markov model composition. We demonstrate its utility in phylogenetic motif finders, that is, motif-finding tools that leverage conservation information, in terms of both the reliability of the statistical significance evaluation and the improvement of the motif-finding task through an ensemble method.

motif discovery; sequence analysis; computational biology


Committee Chair

Keich, Uri

Committee Member

Booth, James
Friedman, Eric J.

Degree Discipline

Computer Science

Degree Name

Ph. D., Computer Science

Degree Level

Doctor of Philosophy

dissertation or thesis
