Cornell University Library

eCommons
DATA SUBSAMPLING FOR MODEL SELECTION IN AUTOML FRAMEWORKS

File(s)
Nayar_cornell_0058O_11145.pdf (3.31 MB)
Permanent Link(s)
https://doi.org/10.7298/vwhw-cv46
https://hdl.handle.net/1813/109675
Collections
Cornell Theses and Dissertations
Author
Nayar, Nandini
Abstract

This project studies methods of using data subsampling to perform model selection. The most commonly used methods for model selection require training every candidate model on the entire training set, often several times, in order to pick the best one; this is frequently the most computationally expensive part of model selection. It would therefore be valuable to understand how resources can be better allocated to pick the best model for a given dataset. This project explores how to optimize resource allocation for model selection by subsampling data. We try three approaches: (1) a randomized multi-armed bandit approach, (2) subsampling using influence functions, and (3) a new boosting-based method we call iterative boosting. The first method is evaluated on 10 tabular datasets, while the latter two use the MNIST and CIFAR-10 image datasets with deep learning models. Each approach rests on a distinct set of assumptions that carries its own pros and cons for the task of model selection. We analyze all three methods to understand how to take meaningful subsets of data that accurately estimate a model's relative test performance. The Hyperband method for subsampling appears to be the most effective, both in computational cost and in recovering good relative model performance. The iterative boosting method shows some promise on MNIST but requires more work to make it significantly better than random subsampling on more complex datasets like CIFAR-10.
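The randomized multi-armed bandit approach the abstract refers to is in the Hyperband family, whose core subroutine is successive halving over resource budgets. As a rough illustration only (not the thesis's implementation), a minimal sketch of successive halving where the "budget" is the data-subsample size might look like this; the candidate names, the `noisy_eval` scorer, and the budget schedule are all hypothetical:

```python
import random

def successive_halving(candidates, evaluate, min_budget=100, max_budget=1600):
    """Successive halving over data-subsample sizes (core of Hyperband).

    candidates : list of model identifiers.
    evaluate   : evaluate(model, n_samples) -> validation score (higher is
                 better), where the model is trained on n_samples points.
    Each round, every surviving model is evaluated at the current subsample
    size; the worse half is dropped and the subsample size doubles.
    """
    survivors = list(candidates)
    budget = min_budget
    while len(survivors) > 1 and budget <= max_budget:
        scores = {m: evaluate(m, budget) for m in survivors}
        survivors.sort(key=lambda m: scores[m], reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]
        budget *= 2
    return survivors[0]

# Toy demo: each "model" has a hidden asymptotic accuracy, and larger
# subsamples give less noisy estimates of it.
random.seed(0)
true_acc = {"A": 0.70, "B": 0.85, "C": 0.80, "D": 0.60}

def noisy_eval(model, n_samples):
    # Noise shrinks as the subsample grows, mimicking more reliable
    # validation estimates at larger training-set sizes.
    return true_acc[model] + random.gauss(0, 1.0 / n_samples ** 0.5)

best = successive_halving(list(true_acc), noisy_eval)
```

The appeal for model selection is that weak candidates are eliminated after training on only small subsamples, so the full dataset is spent almost entirely on the strongest contenders.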

Description
42 pages
Date Issued
2021-05
Keywords
Artificial Intelligence • AutoML • Data • Machine Learning • Model Selection • Subsampling
Committee Chair
Udell, Madeleine Richards
Committee Member
Hariharan, Bharath
Degree Discipline
Computer Science
Degree Name
M.S., Computer Science
Degree Level
Master of Science
Type
dissertation or thesis
Link(s) to Catalog Record
https://newcatalog.library.cornell.edu/catalog/15049503
