Statistical Learning for Structural Patterns with Trees
In achieving structural patterns in parameters, we focus on two challenging cases in which (1) hierarchical sparsity pattern is desired such that one group of parameters is set to zero whenever another is set to zero; and (2) many features that are counts of rarely occurring events are present, and appropriate aggregation of the rare features may lead to better estimation. In either case, the methods under consideration use a tree or a directed acyclic graph (DAG) that encodes relations among parameters as side information. For achieving hierarchical sparsity patterns in parameters, we investigate the differences between group lasso (GL) and latent overlapping group lasso (LOG) in terms of their statistical properties and computational efficiency. We highlight a phenomenon of GL in which parameters embedded deep within the DAG are more aggressively regularized than those that are less deeply embedded. By contrast, we show that using LOG fulfills our goal without any additional complication and performs, both in practice and in theory, very similarly to the GL penalty that is modified to curb its over-aggressiveness. In terms of computation, we derive a finite-step algorithm for the proximal operator of LOG in the case of the DAG being a directed path graph; we later exploit this efficiency to propose a novel path-based block coordinate descent scheme. Finally, we compare the two frameworks in estimating banded covariance matrix, where we introduce a new sparsely-banded estimator using LOG, which we show achieves the statistical advantages of an existing GL-based method but is simpler to express and more efficient to compute. Another kind of sparsity we care about is sparsity in the data itself. It is prevalent to have many highly sparse features for counting frequency of rare events in diverse areas, ranging from natural language processing (e.g., rare words) to biology (e.g., rare species). We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis. We propose a tree-guided framework for aggregating rare features into denser ones through solving a convex optimization problem. The tree, which encodes feature similarity information on the leaves, comes from prior knowledge or data sources external to the current problem and is used as side information in aggregation. In our proposal, aggregating rare features is equivalent as enforcing equal coefficients within each group learned from solving the convex problem, resulting in another case of structural pattern in parameters. We apply our method on two data sets: a TripAdvisor hotel review data set, in which we predict the numerical rating of a hotel based on the text of an associated review; and a microbiome data set from the American Gut project that measures microbial species abundance from fecal samples, in which we predict the one's BMI based on both microbiome and non-microbiome features. In both applications, our method achieves high accuracy by making effective use of rare features and yields more interpretable results.
Convex optimization; Statistics; hierarchical sparse modeling; l1 penalization; rare feature; statistical learning
Wells, Martin Timothy; Joachims, Thorsten
Ph. D., Statistics
Doctor of Philosophy
dissertation or thesis