Statistical Inference for Machine Learning: Feature Importance, Uncertainty Quantification and Interpretation Stability
Machine learning has become ubiquitous across many areas, including high-stakes applications such as autonomous driving, financial forecasting, and clinical decision making. However, many models are complex in nature and act as ``black boxes'', providing predictions but little insight into how those predictions were reached. In this thesis, we present work from three different perspectives toward a better understanding of several types of machine learning models through the lens of statistical inference. Tree-based methods, including decision trees, random forests, and gradient boosting machines, are a popular class of nonparametric statistical models, widely used owing to their flexibility and strong predictive performance. Many practitioners rely on some form of feature importance measure to examine model behavior. Split-improvement variable importance measures, however, have been shown to be biased toward features with more potential split points. We propose a modification for random forests and other tree-based methods that corrects this bias: by appropriately measuring split-improvement on out-of-sample data, we obtain better summaries and screening tools. Our next study focuses on understanding statistical properties of, and quantifying uncertainty for, ensemble models. Tree-based ensembles such as random forests remain a popular option for which several important theoretical advances have been made in recent years by drawing on a connection between their natural subsampled structure and the classical theory of U-statistics. Unfortunately, the procedures for estimating predictive variance resulting from these studies are plagued by severe bias and extreme computational overhead. Here, we argue that the root of these problems lies in the structure of the resamples themselves.
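The idea of measuring split-improvement on out-of-sample data can be illustrated with a minimal sketch (this is an illustration, not the thesis implementation, and the function name `oos_split_importance` is our own): held-out points are routed through a fitted scikit-learn decision tree, and each internal node's variance reduction, computed on those held-out responses, is credited to its split feature. Noise features with many potential splits then receive near-zero (or even negative) importance instead of spuriously positive scores.

```python
# Illustrative sketch: out-of-sample split-improvement importance for a
# fitted sklearn regression tree. Not the thesis implementation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def oos_split_importance(tree, X, y):
    """Credit each feature with the held-out variance reduction of its splits.

    X, y are out-of-sample data not used to fit `tree`.
    Returns an array of per-feature importances (may be negative).
    """
    t = tree.tree_
    imp = np.zeros(X.shape[1])
    total = len(X)
    # node_samples[i] holds indices of held-out rows reaching node i
    node_samples = {0: np.arange(total)}
    for node in range(t.node_count):
        idx = node_samples.get(node, np.array([], dtype=int))
        if t.children_left[node] == -1 or len(idx) == 0:
            continue  # leaf, or no held-out data reaches this node
        f, thr = t.feature[node], t.threshold[node]
        left = idx[X[idx, f] <= thr]
        right = idx[X[idx, f] > thr]
        node_samples[t.children_left[node]] = left
        node_samples[t.children_right[node]] = right
        # weighted variance reduction, measured on held-out responses
        def wvar(i):
            return len(i) * np.var(y[i]) if len(i) else 0.0
        imp[f] += (wvar(idx) - wvar(left) - wvar(right)) / total
    return imp
```

In a small experiment where the response depends only on the first feature, the held-out importance of an uninformative second feature stays close to zero, while the in-sample (impurity-based) importance would still assign it positive credit.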
We develop a general framework for analyzing the asymptotic behavior of V-statistics, demonstrating asymptotic normality under precise regularity conditions and establishing previously unreported connections to U-statistics. Importantly, these findings allow us to produce a natural and efficient means of estimating the variance of a conditional expectation, a problem of wide interest across multiple scientific domains that also lies at the heart of uncertainty quantification for supervised learning ensembles. As an application, we use this result to design a stopping rule for determining the ideal tree depth in model distillation. Lastly, we investigate the stability of model explanations. Post hoc explanations based on perturbations are widely used to interpret a machine learning model after it has been built. This class of methods has been shown to exhibit large instability, posing serious challenges to the effectiveness of the methods themselves and harming user trust. We propose a new algorithm, S-LIME, which uses a hypothesis testing framework based on the central limit theorem to determine the number of perturbation points needed to guarantee the stability of the resulting explanation. Experiments on both simulated and real-world data sets demonstrate the effectiveness of our method.
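The flavor of a CLT-based stopping rule can be sketched as follows. This is a simplified caricature rather than the S-LIME algorithm itself, and `stable_top_feature` together with its `sample_scores` interface are hypothetical names introduced here for illustration: perturbation samples are drawn in batches, and sampling stops once a normal-theory one-sided test separates the two leading features' estimated importances, so that the reported top feature is unlikely to flip under resampling.

```python
# Simplified caricature of a CLT-based stopping rule for explanation
# stability. Not the S-LIME algorithm; names here are illustrative.
import numpy as np
from scipy import stats

def stable_top_feature(sample_scores, alpha=0.05, batch=100, max_n=10000, seed=0):
    """sample_scores(rng, n) -> (n, p) array of per-perturbation feature
    scores. Returns (index of top feature, number of samples used)."""
    rng = np.random.default_rng(seed)
    scores = sample_scores(rng, batch)
    while len(scores) < max_n:
        mean = scores.mean(axis=0)
        order = np.argsort(mean)[::-1]
        top, runner = order[0], order[1]
        # paired differences between the two current leaders
        diff = scores[:, top] - scores[:, runner]
        se = diff.std(ddof=1) / np.sqrt(len(diff))
        # one-sided z-test: is the gap between the leaders significant?
        z = diff.mean() / se if se > 0 else np.inf
        if z > stats.norm.ppf(1 - alpha):
            return top, len(scores)
        scores = np.vstack([scores, sample_scores(rng, batch)])
    return int(np.argmax(scores.mean(axis=0))), len(scores)
```

When the gap between the two leading features is large relative to the sampling noise, the rule stops after one or two batches; as the gap shrinks, the required number of perturbation points grows, which is exactly the trade-off a stability guarantee must manage.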
Decision Trees; Ensembles; Feature Importance; Model Interpretation; Stability; Uncertainty Quantification
Hooker, Giles J.
Weinberger, Kilian Quirin; Udell, Madeleine Richards
Ph. D., Statistics
Doctor of Philosophy
Attribution 4.0 International
dissertation or thesis