Adaptive Learning: Algorithms and Complexity
Foster, Dylan James
Recent empirical success in machine learning has led to major breakthroughs in application domains including computer vision, robotics, and natural language processing. There is a chasm between theory and practice here. Many of the most impressive practical advances in learning rely heavily on parameter tuning and domain-specific heuristics, and the development effort required to deploy these methods in new domains places a great burden on practitioners. On the other hand, mathematical theory of learning has excelled at producing broadly applicable algorithmic principles (stochastic gradient methods, boosting, SVMs), but tends to lag behind in state-of-the-art performance, and may miss out on practitioners' intuition. Can we distill our collective knowledge of "what works" into learning procedures that are general-purpose, yet readily adapt to problem structure in new domains? We propose to bridge the gap and get the best of both worlds through adaptive learning: Learning procedures that go beyond the worst case and automatically exploit favorable properties of real-world instances to get improved performance. The aim of this thesis is to develop adaptive algorithms and investigate their limits, and to do so in the face of real-world considerations such as computation, interactivity, and robustness. In more detail, we: 1) introduce formalism to evaluate and assert optimality of adaptive learning procedures. 2) develop tools to prove fundamental limits on adaptivity. 3) provide efficient and adaptive algorithms to achieve these limits. In classical statistical decision theory, learning procedures are evaluated by their worst-case performance (e.g., prediction accuracy) across all problem instances. Adaptive learning evaluates performance not just worst case, but in the best case and in between. This necessitates the development of new statistical and information-theoretic ideas to provide instance-dependent performance guarantees, as well as new algorithmic and computational principles to derive efficient and adaptive algorithms. The first major contribution this thesis makes concerns sequential prediction, or online learning. We prove the equivalence of adaptive algorithms, probabilistic objects called martingale inequalities, and geometric objects called Burkholder functions. We leverage the equivalence to provide: 1) a theory of learnability for adaptive online learning. 2) a unified algorithm design principle for adaptive online learning. The equivalence extends the classical Vapnik-Chervonenkis theory of (worst-case) statistical learning to adaptive online learning. It allows us to derive new learning procedures that efficiently adapt to problem structure, and serves as our starting point for investigating adaptivity in real-world settings. In many modern applications, we are faced with data that may be streaming, non-i.i.d., or simply too large to fit in memory. In others, we may interact with and influence the data generating process through sequential decisions. Developing adaptive algorithms for these challenges leads to fascinating new questions. Must we sacrifice adaptivity to process and make predictions from data as it arrives in a stream? Can we adapt while balancing exploration and exploitation? Major contributions this thesis makes toward these questions include: * We introduce a notion of "sufficient statistics" for online learning and show that this definition leads to adaptive algorithms with low memory requirements. * We develop large scale optimization algorithms for learning that adapt to problem structure via automatic parameter tuning, and characterize their limits. * We give adaptive algorithms for interactive learning/sequential decision making in contextual bandits, a simple reinforcement learning setting. Our main result here is a new margin theory paralleling that of classical statistical learning. * We provide robust sequential prediction algorithms that obtain optimal instance dependent performance guarantees for statistical learning, yet make no assumptions on the data generating process. We then characterize their limits. * We design algorithms that adapt to model misspecification in the ubiquitous statistical task of logistic regression. Here we give a new improper learning algorithm that attains a doubly-exponential improvement over sample complexity lower bounds for proper learning. This resolves a COLT open problem of McMahan and Streeter (2012), as well as two open problems related to adaptivity in bandit multiclass classification (Abernethy and Rakhlin, 2009) and online boosting (Beygelzimer et al., 2015).
Statistics; adaptivity; bandits; statistical learning; Optimization; Computer science; Online learning; machine learning
Tardos, Eva; Kleinberg, Robert David; Weinberger, Kilian Quirin
Ph.D., Computer Science
Doctor of Philosophy
dissertation or thesis