Ensemble Trees and CLTs: Statistical Inference in Machine Learning
As data grows in size and complexity, scientists are relying more heavily on learning algorithms that can adapt to underlying relationships in the data without imposing a formal model structure. These learning algorithms can produce very accurate predictions, but they act as something of a black box and are thus difficult to analyze. Classical statistical models, on the other hand, insist on a more rigid structure but are intuitive and easy to interpret. The fundamental goal of this work is to bridge these approaches by developing limiting distributions and formal statistical inference procedures for broad classes of ensemble learning methods. This is accomplished by drawing a connection between the structure of subsampled ensembles and U-statistics. In particular, we extend the existing theory of U-statistics to include infinite-order and random-kernel cases and develop the relevant asymptotic theory for these new classes of estimators. This allows us to produce confidence intervals for predictions generated by supervised learning ensembles such as bagged trees and random forests. We also develop formal testing procedures for feature significance and extend these to produce hypothesis tests for additivity. When a large number of test points is required or the additive structure is particularly complex, we employ random projections and utilize recent theoretical developments. Finally, we extend these ideas further and propose an alternative permutation scheme to address the problem of variable selection with random forests.
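The connection between subsampled ensembles and U-statistics described above can be illustrated with a minimal sketch. Here the "base learner" kernel is deliberately simplified to the subsample mean (a stand-in for a tree's prediction at a fixed query point), and the confidence interval uses a naive variance estimate that treats the B kernel evaluations as nearly independent, which is reasonable only when B is small relative to the number of possible subsamples; the names (`kernel_vals`, `u_hat`) and the toy data are assumptions for illustration, not the dissertation's actual estimator, which derives the proper limiting variance accounting for shared observations across subsamples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: i.i.d. responses with true mean 2.0.
n, k, B = 500, 25, 200
y = rng.normal(loc=2.0, scale=1.0, size=n)

# Incomplete U-statistic: average a base-learner "kernel" over B random
# size-k subsamples drawn without replacement. The kernel here is just the
# subsample mean, standing in for a tree's prediction at a query point.
kernel_vals = np.array([
    y[rng.choice(n, size=k, replace=False)].mean() for _ in range(B)
])
u_hat = kernel_vals.mean()

# Naive CLT-based 95% interval: when B is much smaller than C(n, k), the
# kernel evaluations are nearly independent, so Var(u_hat) ~ Var(kernel)/B.
# (The dissertation's theory replaces this with the correct asymptotic
# variance, which accounts for overlap among subsamples.)
se = kernel_vals.std(ddof=1) / np.sqrt(B)
ci = (u_hat - 1.96 * se, u_hat + 1.96 * se)
print(u_hat, ci)
```

Swapping the subsample mean for an actual fitted tree leaves the U-statistic structure, and hence the route to a limiting normal distribution, unchanged; that is the sense in which ensemble predictions admit formal confidence intervals.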
U-statistics; Random Forests; Bagging
Wegkamp, Marten H.; Wells, Martin Timothy
Ph.D., Statistics
Doctor of Philosophy
dissertation or thesis