Variable Selection for High-Dimensional Longitudinal Omics Data with a Continuous or Misclassified Binary Outcome
Large collections of omics data, such as metabolomic and proteomic abundances, are rapidly becoming ubiquitous in clinical and observational studies. Cross-sectional problems like differential expression/abundance have been extensively studied, with a wide range of statistical and machine learning methods both well established and newly developing. Longitudinal omics data, which hold valuable short-term dynamic information, present additional challenges not seen in cross-sectional data, and the corresponding methodologies are not as comprehensive. In particular, the literature studying relationships between high-dimensional omics data and clinical outcomes of interest contains key gaps. This dissertation contributes to this area by presenting variable selection and inference frameworks for longitudinal omics variables paired with longitudinal clinical outcomes as well as cross-sectional misclassified binary outcomes. First, we discuss two key concepts that are consistent across the three main chapters in this dissertation: first-differencing and graph-Laplacian penalization using empirical absolute correlation. First-differencing longitudinal data, particularly when both the outcome and predictors are longitudinal with the same time points, reduces unwanted correlation and simplifies the model parameters. The graph-Laplacian is commonly used for data with a prior graph structure, like gene networks. We argue that smoothing model coefficients over graphs induced by empirical dependence, like absolute correlation, provides both numerical and theoretical benefits at the cost of solving one additional convex optimization problem for associated hyper-parameters. Chapter 2 describes the development of a model, PROLONG, that performs variable selection for high-dimensional (p > n) longitudinal omics data with a continuous clinical outcome. This model combines the least-squares error with the group lasso and the graph-Laplacian penalties.
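The two shared concepts can be illustrated with a minimal sketch. It is not the PROLONG implementation: the array shapes, the pooling of differenced observations across subjects, the unnormalized Laplacian L = D - W, and all function names are assumptions made for illustration only.

```python
import numpy as np

def first_difference(a):
    """Difference consecutive time points along the last axis."""
    return np.diff(a, axis=-1)

def abs_correlation_laplacian(X):
    """Unnormalized graph Laplacian L = D - W, with W the absolute
    correlation between columns of X and a zero diagonal."""
    W = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1)) - W

def laplacian_penalty(beta, L):
    """Quadratic smoothness penalty beta' L beta."""
    return float(beta @ L @ beta)

rng = np.random.default_rng(0)
n, p, T = 15, 6, 4                      # toy sizes: subjects, variables, time points
X = rng.normal(size=(n, p, T))          # hypothetical omics array
y = rng.normal(size=(n, T))             # hypothetical longitudinal outcome

dX = first_difference(X)                # shape (n, p, T - 1)
dy = first_difference(y)                # shape (n, T - 1)

# Assumption: pool differenced observations over subjects and time
# to estimate the between-variable correlations.
pooled = dX.transpose(0, 2, 1).reshape(-1, p)   # shape (n * (T - 1), p)
L = abs_correlation_laplacian(pooled)

beta = rng.normal(size=p)               # hypothetical coefficient vector
penalty = laplacian_penalty(beta, L)
```

The penalty equals half the sum over variable pairs of W_ij (beta_i - beta_j)^2, so coefficients of strongly correlated variables are pulled toward each other.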
We must grapple with low signal in the data as well as the challenges that come with both longitudinal models and high-dimensional variable selection. This method smooths coefficients over time as well as over a graph that connects all variables, with edge weights given by the between-variable pairwise correlations. The work is particularly motivated by tuberculosis (TB) mycobacterial load measurements paired with a large set of metabolite abundances, both measured over 4 time points across 15 subjects, with the objective being to identify metabolites that co-vary over time with the outcome. More generally, the method applies to any continuous longitudinal clinical outcome paired with longitudinal omics variables. We apply this method to our motivating data as well as a set of simulated data scenarios that mimic characteristics of the real data. In Chapter 3 we extend PROLONG to the case with multiple treatment groups and provide a framework for uncertainty quantification and inference, with a particular focus on FDR control. Our motivating data is the same as in Chapter 2, with an additional 19 subjects from a different treatment group with much lower signal. The method can be generalized to data with two or more treatment groups, and makes no assumptions on consistent sparsity or dependence patterns across treatment groups. After fitting PROLONG, we apply a subgradient-based one-step correction that de-biases the coefficients enough to achieve asymptotic normality and quantify uncertainty in our finite samples. We also provide a framework for inference that allows model hyper-parameters to vary across treatment groups, which is key in our data, where one group has much less signal and would normally suppress the other group's coefficients. We evaluate our inference procedure on the motivating data as well as simulations, some of which again mirror the real data and others inspired by setups used in recent literature.
We change our focus in Chapter 4 from a longitudinal continuous outcome to a cross-sectional binary outcome that may have covariate-related, i.e., structural, misclassification. In this setting, we need to overcome bias resulting from structural misclassification while capturing the relationship between our variables and the latent clinical outcome that we cannot measure with full accuracy. We consider two kinds of motivating data: fully cross-sectional clinical and/or demographic data, and data that includes longitudinal omics. In the first case, we prioritize accurate coefficient estimation with minimal bias, and only use the graph-Laplacian penalization. In the latter case, to achieve variable selection we again use a group lasso penalty. In both cases, we apply a penalized EM algorithm that iteratively updates both the coefficients of interest and the coefficients corresponding to the covariate-related misclassification, with separate terms relating to sensitivity and specificity. We evaluate our results on real cross-sectional data, as well as simulations imitating longitudinal omics variables.
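As a rough illustration of the expectation step in an EM scheme of this kind, the sketch below computes the posterior probability of the latent outcome given the observed, possibly misclassified, label. The logistic forms for the outcome, sensitivity, and specificity, and every name and coefficient value in it, are assumptions for illustration rather than the dissertation's specification; the penalized maximization step is omitted.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def e_step(y_obs, p_true, sens, spec):
    """Posterior P(latent = 1 | observed label) for each subject.

    p_true: P(latent = 1) from the outcome model
    sens:   P(observed = 1 | latent = 1), may vary by subject
    spec:   P(observed = 0 | latent = 0), may vary by subject
    """
    joint_pos = sens * p_true                              # P(obs = 1, latent = 1)
    marg_pos = joint_pos + (1.0 - spec) * (1.0 - p_true)   # P(obs = 1)
    joint_neg = (1.0 - sens) * p_true                      # P(obs = 0, latent = 1)
    marg_neg = joint_neg + spec * (1.0 - p_true)           # P(obs = 0)
    return np.where(y_obs == 1, joint_pos / marg_pos, joint_neg / marg_neg)

rng = np.random.default_rng(1)
n = 8
X = rng.normal(size=(n, 3))     # hypothetical predictors of the latent outcome
Z = rng.normal(size=(n, 2))     # hypothetical covariates driving misclassification

# Hypothetical current parameter values inside an EM iteration.
beta = np.array([0.5, -1.0, 0.25])
gamma_sens = np.array([0.3, -0.2])
gamma_spec = np.array([-0.1, 0.4])

p_true = sigmoid(X @ beta)
sens = sigmoid(1.5 + Z @ gamma_sens)
spec = sigmoid(2.0 + Z @ gamma_spec)
y_obs = np.array([1, 0, 1, 1, 0, 0, 1, 0])

w = e_step(y_obs, p_true, sens, spec)   # weights for the M-step
```

Under these assumptions the weights w would serve as soft labels when the outcome, sensitivity, and specificity coefficients are updated in the penalized M-step.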