dc.contributor.author: Logsdon, Benjamin (en_US)
dc.date.accessioned: 2012-06-28T20:56:40Z
dc.date.available: 2016-06-01T06:15:43Z
dc.date.issued: 2011-01-31 (en_US)
dc.identifier.other: bibid: 7745053
dc.identifier.uri: https://hdl.handle.net/1813/29227
dc.description.abstract: High-throughput sequencing and expression characterization have led to an explosion of phenotypic and genotypic molecular data underlying both experimental studies and outbred populations. We develop a novel class of algorithms to reconstruct sparse models among these molecular phenotypes (e.g. expression products) and genotypes (e.g. single nucleotide polymorphisms), via both a Bayesian hierarchical model, for when the sample size is much smaller than the model dimension (i.e. p ≫ n), and the well-characterized adaptive lasso algorithm. Specifically, we propose novel approaches to the problems of increasing power to detect additional loci in genome-wide association studies using our variational algorithm, efficiently learning directed cyclic graphs from expression and genotype data using the adaptive lasso, and constructing genome-wide undirected graphs among genotype, expression and downstream phenotype data using an extension of the variational feature selection algorithm. The Bayesian hierarchical model is derived for a parametric multiple regression model with a mixture prior of a point mass and normal distribution for each regression coefficient, and appropriate priors for the set of hyperparameters. When combined with a probabilistic consistency bound on the model dimension, this approach leads to very sparse solutions without the need for cross-validation. We use a variational Bayes approximate inference approach in our algorithm, where we impose a complete factorization across all parameters for the approximate posterior distribution, and then minimize the Kullback-Leibler divergence between the approximate and true posterior distributions. Since the prior distribution is non-convex, we restart the algorithm many times to find multiple posterior modes, and combine information across all discovered modes in an approximate Bayesian model averaging framework, to reduce the variance of the posterior probability estimates.
We analyze three major publicly available data sets: the HapMap 2 genotype and expression data collected on immortalized lymphoblastoid cell lines, the genome-wide gene expression and genetic marker data collected for a yeast intercross, and genome-wide gene expression, genetic marker, and downstream phenotypes related to weight in a mouse F2 intercross. Based on both simulations and data analysis, we show that our algorithms can outperform other state-of-the-art model selection procedures when including thousands to hundreds of thousands of genotypes and expression traits, in terms of aggressively controlling false discovery rate, and generating rich simultaneous statistical models. (en_US)
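For orientation, the regression model and variational objective described in the abstract can be written roughly as follows. This is only a sketch: the symbols (y, X, \beta_j, \pi, \sigma^2, \sigma_\beta^2, \theta, q) are notation assumed here for illustration and are not taken from the thesis, and the hyperpriors are left out because the abstract does not specify them.

```latex
% Multiple regression with n samples and p predictors, where p \gg n
\[
  y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)
\]

% Mixture prior on each coefficient: a point mass at zero plus a normal component
\[
  \beta_j \mid \pi, \sigma_\beta^2 \;\sim\; (1 - \pi)\,\delta_0 + \pi\,\mathcal{N}(0, \sigma_\beta^2),
  \qquad j = 1, \dots, p
\]

% Fully factorized approximate posterior over all parameters \theta,
% chosen to minimize the Kullback-Leibler divergence to the true posterior
\[
  q(\theta) = \prod_i q_i(\theta_i), \qquad
  q^{*} = \arg\min_{q} \, \mathrm{KL}\bigl( q(\theta) \,\|\, p(\theta \mid y, X) \bigr)
\]
```

Because the objective can have several local optima, different starting points may converge to different q^{*}, which is consistent with the restarts and approximate model averaging the abstract describes.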
dc.language.iso: en_US (en_US)
dc.subject: Variational Bayes (en_US)
dc.subject: Gene expression network reconstruction (en_US)
dc.subject: Graphical models (en_US)
dc.title: Sparse Model Building From Genome-Wide Variation With Graphical Models (en_US)
dc.type: dissertation or thesis (en_US)
thesis.degree.discipline: Computational Biology
thesis.degree.grantor: Cornell University (en_US)
thesis.degree.level: Doctor of Philosophy
thesis.degree.name: Ph. D., Computational Biology
dc.contributor.chair: Mezey, Jason G. (en_US)
dc.contributor.committeeMember: Clark, Andrew (en_US)
dc.contributor.committeeMember: Bustamante, Carlos D. (en_US)
dc.contributor.committeeMember: Wells, Martin Timothy (en_US)

