Model-Based Classification With Applications To High-Dimensional Data In Bioinformatics
In recent years, sparse classification problems have emerged in many fields of study. Finite mixture models have been developed to facilitate Bayesian inference where parameter sparsity is substantial, and shrinkage estimation allows strength to be borrowed across features in light of the parallel nature of multiple hypothesis tests. Important examples that combine shrinkage estimation with finite mixture models for sparse classification include the hierarchical model of Smyth (2004) and the explicit mixture model of Bar et al. (2010) for Bayesian microarray analysis. Classification with finite mixture models is based on the posterior expectations of latent indicator variables, which are typically estimated using the expectation-maximization (EM) algorithm in an empirical Bayes approach or Markov chain Monte Carlo (MCMC) in a fully Bayesian approach. MCMC is of limited applicability to high-dimensional data because its sampling-based nature leads to slow computation and hard-to-monitor convergence. In a fully Bayesian framework, we investigate the feasibility and performance of variational Bayes (VB) approximation, applying the VB approach to fully Bayesian versions of several finite mixture models that have been proposed in bioinformatics. We find that VB achieves desirable speed and accuracy in sparse classification with hierarchical mixture models for high-dimensional data. Another sparse classification problem in bioinformatics that is amenable to model-based approaches is expression quantitative trait loci (eQTL) detection, in which deciding whether the association between a gene and a given single nucleotide polymorphism (SNP) is significant amounts to classifying genes as null or non-null with respect to that SNP. The high dimensionality of the data not only makes computation difficult but also makes the confounding impact of unwanted variation impossible to ignore.
Model-based approaches that account for unwanted variation by incorporating a factor-analysis term, representing hidden factors and their effects, have been adopted in applications such as differential analysis and eQTL detection. HEFT (Gao et al., 2014) is a fast approach to model-based eQTL identification that simultaneously learns hidden effects. We develop a hierarchical mixture model-based empirical Bayes approach for sparse classification that simultaneously accounts for unwanted variation, together with a family of simplified model-based approaches designed for computational efficiency. We investigate the feasibility and performance of these approaches in comparison with HEFT using several real data examples in bioinformatics.
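The factor-analysis idea can be sketched as follows: regress each gene on the observed covariate, extract leading singular vectors of the residual matrix as estimated hidden factors, and re-fit with those factors as extra covariates. This is an illustrative residual-SVD sketch under simple linear-model assumptions, not HEFT's algorithm or the thesis's hierarchical approach; the function name and all simulation settings are hypothetical.

```python
import numpy as np

def residual_factors(Y, x, k=2):
    """Illustrative sketch (not HEFT): estimate k hidden factors from the
    residuals of gene-wise regressions on a known covariate x, then re-fit
    each gene with the factors as additional covariates.

    Y: genes-by-samples expression matrix; x: length-n covariate (e.g. a SNP
    genotype coded 0/1/2). Returns adjusted per-gene effects and the factors.
    """
    n = Y.shape[1]
    X = np.column_stack([np.ones(n), x])
    # Gene-wise least squares on the observed covariate only
    B, *_ = np.linalg.lstsq(X, Y.T, rcond=None)
    R = Y.T - X @ B                      # residuals: samples x genes
    # Leading singular vectors of the residuals serve as hidden factors
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    F = U[:, :k] * s[:k]                 # samples-by-k factor scores
    # Re-fit including the estimated factors to adjust the effect estimates
    Xa = np.column_stack([X, F])
    Ba, *_ = np.linalg.lstsq(Xa, Y.T, rcond=None)
    return Ba[1], F                      # adjusted covariate effects, factors
```

Classifying genes as null or non-null with respect to the SNP can then proceed on the adjusted effects, e.g. by feeding them into a mixture model of the kind described earlier.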
Bayesian inference; Linear mixed models; Bioinformatics
Hooker, Giles J.; Wells, Martin Timothy
Ph. D., Statistics
Doctor of Philosophy
dissertation or thesis