Modeling Biological Processes In Genome-Wide Association Studies Using Regularized Regression
Genome-wide association studies (GWAS) have become a widely adopted approach to identify genetic variation that produces variation in complex phenotypes. Standard statistical methods are able to identify strong associations in these datasets, but more sophisticated statistical methods that model complex aspects of the biological data can identify weaker associations and further elucidate the underlying molecular biology. We develop and apply statistical methods that explicitly model two aspects of GWAS data using two complementary forms of regularized regression. First, we model the polygenic architecture of complex phenotypes using feature selection methods in a penalized regression framework. We propose novel algorithmic, computational and heuristic approaches to produce a method that scales to high-dimensional GWAS data and increases power to detect weak associations that are not detectable by standard tests. Second, we model the covariance between individuals due to kinship and population structure using a linear mixed model that regularizes the statistical contribution of a metric of ancestry. Linear mixed models have been widely adopted for analysis of GWAS data, but their theoretical properties have not been examined in this context. We formalize the statistical properties of the linear mixed model, develop a novel interpretation in relation to population genetics, and propose a novel low-rank linear mixed model that learns the dimensionality of the correction for kinship and population structure from the data. Finally, we combine these two complementary regularized regression models into a penalized linear mixed model. We develop a unified model incorporating a novel algorithm with novel approaches to tuning nonconvex penalties and determining the optimal stopping point in the regularization path.
Leveraging recent work on assessing significance of selected features, we produce a well-principled and scalable statistical method applicable to feature selection, hypothesis testing and prediction in many contexts.
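As a rough illustration of the penalized-regression idea summarized in the abstract, the sketch below fits a plain LASSO by cyclic coordinate descent on simulated genotype data. It is not the dissertation's algorithm: the convex L1 penalty stands in for the nonconvex penalties mentioned above, there is no mixed-model correction for kinship or population structure, and every name (`lasso_cd`, `soft_threshold`), the simulation settings, and the penalty value `lam=0.2` are assumptions chosen for illustration only.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator: shrink z toward zero by lam."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # per-column scale x_j'x_j / n
    resid = y - X @ beta                   # current residual
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue                   # skip monomorphic (constant) markers
            # Correlation of marker j with the partial residual (marker j excluded)
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)   # keep residual in sync
            beta[j] = new
    return beta

# Hypothetical simulation: 200 individuals, 50 markers coded 0/1/2,
# two causal markers (indices 3 and 17) with equal effects.
rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(200, 50)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)   # standardize each marker
true_beta = np.zeros(50)
true_beta[[3, 17]] = 1.0
y = X @ true_beta + rng.normal(scale=0.5, size=200)
y = y - y.mean()                           # center the phenotype

beta_hat = lasso_cd(X, y, lam=0.2)
selected = np.flatnonzero(beta_hat)        # indices of markers kept by the penalty
```

In this toy setting the penalty sets most marker effects exactly to zero, which is what makes penalized regression usable as a feature selection method; a larger `lam` selects fewer markers, and tracing solutions over a grid of `lam` values gives the regularization path referred to in the abstract.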
genome-wide association study; GWAS; LASSO; regularized regression
Mezey, Jason G.
Siepel, Adam Charles; Clark, Andrew
Ph. D., Genetics
Doctor of Philosophy
dissertation or thesis