Latent Structure in Linear Prediction and Corpora Comparison

This work first studies the finite-sample properties of the risk of the minimum-norm interpolating predictor in high-dimensional regression models. If the effective rank of the covariance matrix of the p regression features is much larger than the sample size n, we show that the minimum-norm interpolating predictor is not desirable, as its risk approaches the risk of trivially predicting the response by 0. However, our detailed finite-sample analysis reveals, surprisingly, that this behavior is not present when the regression response and the features are jointly low-dimensional, following a widely used factor regression model. Within this popular model class, and when the effective rank of the covariance matrix is smaller than n, while still allowing for p >> n, both the bias and the variance terms of the excess risk can be controlled, and the risk of the minimum-norm interpolating predictor approaches optimal benchmarks. Moreover, through a detailed analysis of the bias term, we exhibit model classes under which our upper bound on the excess risk approaches zero, while the corresponding upper bound in the recent work [arXiv:1906.11300] diverges. Furthermore, we show that the minimum-norm interpolating predictor analyzed under the factor regression model, despite being model-agnostic and devoid of tuning parameters, can have similar risk to predictors based on principal components regression and ridge regression, and can improve over LASSO-based predictors, in the high-dimensional regime.
The second part of this work extends the analysis of the minimum-norm interpolating predictor to a larger class of linear predictors of a real-valued response Y. Our primary contribution is to establish finite-sample risk bounds for prediction with the ubiquitous Principal Component Regression (PCR) method, under the factor regression model, with the number of principal components adaptively selected from the data, a form of theoretical guarantee that is surprisingly lacking from the PCR literature. To accomplish this, we prove a master theorem that establishes a risk bound for a large class of predictors, including the PCR predictor as a special case. This approach has the benefit of providing a unified framework for the analysis of a wide range of linear prediction methods under the factor regression setting. In particular, we use our main theorem to recover the risk bounds for the minimum-norm interpolating predictor and for a prediction method tailored to a subclass of factor regression models with identifiable parameters. This model-tailored method can be interpreted as prediction via clusters with latent centers. To address the problem of selecting among a set of candidate predictors, we analyze a simple model selection procedure based on data-splitting, providing an oracle inequality under the factor model to prove that the performance of the selected predictor is close to that of the optimal candidate.
In the third part of this work, we shift from the latent factor model to developing methodology in the context of topic models, which also rely on latent structure. We provide a new, principled construction of a distance between two ensembles of independent, but not identically distributed, discrete samples, when each ensemble follows a topic model. Our proposal is a hierarchical Wasserstein distance that can be used for the comparison of corpora of documents, or any other data sets following topic models. We define the distance by representing a corpus as a discrete measure theta over a set of clusters corresponding to topics. To a cluster we associate its center, which is itself a discrete measure over topics. This allows for summarizing both the relative weight of each topic in the corpus (represented by the components of theta) and the topic heterogeneity within the corpus in a single probabilistic representation. The distance between two corpora then follows naturally as a hierarchical Wasserstein distance between the probabilistic representations of the two corpora. We demonstrate that this distance captures differences in the content of the topics between two corpora and their relative coverage. We provide computationally tractable estimates of the distance, as well as accompanying finite-sample error bounds relative to their population counterparts. We demonstrate the use of the distance with an application to the comparison of news sources.
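As a rough illustration of the two-level construction described in the abstract, the following Python sketch computes a hierarchical Wasserstein distance between two corpora, each given as a weight vector theta over clusters together with one center (a probability vector) per cluster. It is a minimal sketch under stated assumptions, not the estimator developed in the thesis: the function names are hypothetical, the inner distance between centers is taken to be total variation (which equals the 1-Wasserstein distance under the 0/1 ground metric), and the outer optimal transport problem is solved as a generic linear program with SciPy.

```python
import numpy as np
from scipy.optimize import linprog


def total_variation(p, q):
    """Total variation distance between two probability vectors on a common support.

    Assumption: used here as the inner distance between cluster centers.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()


def wasserstein(a, b, cost):
    """1-Wasserstein distance between weight vectors a (length m) and b (length n)
    for an m x n ground-cost matrix, solved as a transport linear program."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cost = np.asarray(cost, dtype=float)
    m, n = cost.shape
    # The coupling pi is flattened row-major, so pi[i, j] sits at index i * n + j.
    row_sums = np.kron(np.eye(m), np.ones((1, n)))  # sum_j pi[i, j] = a[i]
    col_sums = np.kron(np.ones((1, m)), np.eye(n))  # sum_i pi[i, j] = b[j]
    res = linprog(
        cost.ravel(),
        A_eq=np.vstack([row_sums, col_sums]),
        b_eq=np.concatenate([a, b]),
        bounds=(0, None),
        method="highs",
    )
    if not res.success:
        raise RuntimeError(res.message)
    return res.fun


def hierarchical_wasserstein(theta1, centers1, theta2, centers2):
    """Outer Wasserstein distance between cluster weights, with the ground cost
    given by the inner distance between the corresponding cluster centers."""
    cost = np.array(
        [[total_variation(c1, c2) for c2 in centers2] for c1 in centers1]
    )
    return wasserstein(theta1, theta2, cost)


# Hypothetical toy corpora: two clusters each, centers on a two-point support.
theta_a = [0.5, 0.5]
centers_a = [[1.0, 0.0], [0.0, 1.0]]
theta_b = [0.5, 0.5]
centers_b = [[1.0, 0.0], [0.5, 0.5]]
distance = hierarchical_wasserstein(theta_a, centers_a, theta_b, centers_b)
```

In this toy example the first clusters coincide and the second clusters differ by a total variation of 0.5, so half of the mass moves at cost 0.5 and the distance is 0.25. Differences in the weights theta (relative coverage) and differences in the centers (content) both enter the same number, which is the qualitative behavior the abstract describes.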
202 pages
Factor Model; High dimensional; Latent; Wasserstein
Committee Chair
Wegkamp, Marten H.
Bunea, Florentina
Committee Member
Sridharan, Karthik
Degree Name
Ph. D., Statistics
Degree Level
Doctor of Philosophy
Attribution-NonCommercial-NoDerivatives 4.0 International
dissertation or thesis