Show simple item record

dc.contributor.advisor: Mezey, Jason
dc.contributor.author: Ju, Jin Hyun
dc.description.abstract: Expression quantitative trait loci (eQTL) have become an attractive research topic in the past decade, assisted by technical advances in next-generation sequencing (NGS) and high-throughput gene expression measurements. eQTL discoveries have provided researchers with new insights into genetic regulatory mechanisms and are crucial in establishing functional links in genome-wide association study (GWAS) results. A powerful aspect of these studies is that the simultaneous genome-wide measurements of gene expression values and sequence variants make it possible to detect associations independent of prior knowledge. However, the high dimensionality of the data also creates multiple challenges in the analysis process. Population structure in genotype data can induce significant inflation in the results, leading to false positive findings, and confounding factors in gene expression measurements, such as technical batch effects and environmental differences, can lower the power to detect small genetic effects. The focus of this thesis is on the challenges in analyzing high-dimensional gene expression data to increase the accuracy of eQTL discovery. A central problem in developing confounding factor correction methods for eQTL analysis is to account for non-genetic confounding factors while preventing broad-impact genetic effects from being modeled as non-genetic variation. To address this issue, we developed a novel method, CONFETI: CONfounding Factor Estimation Through Independent component analysis. CONFETI is based on a linear mixed model framework and uses independent component analysis (ICA) to estimate statistically independent generative sources from the observed gene expression profiles. Using the estimated independent components, candidate genetic effects are excluded from the correction to maximize the discovery of broad-impact eQTL.
We evaluated our framework by comparing its performance to other published confounding factor correction methods using both simulated and real human data. In the analysis of simulated data, we show that CONFETI most accurately recovered simulated eQTL results in the presence of confounding factors by distinguishing genetic effects from non-genetic variance. We then analyzed matched twin pair datasets from the Multiple Tissue Human Expression Resource (MuTHER) consortium and datasets consisting of similar tissue pairs from the Genotype-Tissue Expression (GTEx) consortium. To assess the performance of each method in human data, we investigated the replication of cis- and trans-eQTL identified in each dataset. We found that accounting for confounding factors greatly increased both the number of cis-eQTL identified in each dataset and the number of cis-eQTL replicating between twin pairs and similar tissue types. The number of identified trans-eQTL increased as well; however, most of the findings were specific to each dataset, and the replication rate remained significantly lower than that of cis-eQTL. While the use of confounding factor correction methods increased the power of the analysis, we found that removing candidate genetic effects prior to correction made little difference in identifying replicating cis- and trans-eQTL in human data.
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.subject: confounding factor
dc.subject: independent component analysis
dc.subject: linear mixed model
dc.title: Confounding Factor Correction For Accurate Expression Quantitative Trait Loci Discovery
dc.type: dissertation or thesis, Biophysics & Systems Biology Cornell Graduate School of Medical Sciences of Philosophy


Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivatives 4.0 International