DATA SUBSAMPLING FOR MODEL SELECTION IN AUTOML FRAMEWORKS
This project studies methods of using data subsampling to perform model selection. Most commonly used methods for model selection require training all models on the entire training data several times in order to pick the best one. This is often one of the most computationally expensive aspects of model selection. It would therefore be valuable to understand how resources can be better allocated to pick the best model for a given dataset. This project explores this question of how to optimize resource allocation for model selection by subsampling data. We try three different approaches to model selection starting with (1) a randomized multi-armed bandit approach, (2) subsampling using influence functions and finally (3) a new boosting based method that can be called iterative boosting. The first method uses 10 tabular datasets while the following two approaches use MNIST and CIFAR-10 image datasets and deep learning models. Each of these approaches uses a unique set of assumptions which provide some pros and cons for the intended task of model selection. Analysis of these three methods is done to better understand how subsampling can be better approached in order to take meaningful subsets of data to accurately estimate a model’s relative test performance. The hyperband method for subsampling seems to be the most effective in terms of computational complexity as well as getting good relative model performance. The iterative boosting method shows some promise on MNIST but requires more work in order to make it significantly better than random subsampling for more complex datasets like CIFAR-10.
Artificial Intelligence; AutoML; Data; Machine Learning; Model Selection; Subsampling
Udell, Madeleine Richards
M.S., Computer Science
Master of Science
dissertation or thesis