DISCOVERING AND EXPLOITING STRUCTURE FOR GAUSSIAN PROCESSES A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy by Jacob Ross Gardner May 2018 ©c 2018 Jacob Ross Gardner ALL RIGHTS RESERVED DISCOVERING AND EXPLOITING STRUCTURE FOR GAUSSIAN PROCESSES Jacob Ross Gardner, Ph.D. Cornell University 2018 Gaussian processes have emerged as a powerful tool for modeling complex and noisy functions. They have found wide applicability in personalized medicine, time series analysis, prediction tasks in the physical sciences, and recently black- box optimization. Their success is in large part due to two fundamental advan- tages they enjoy over many other models. First, they provide convenient and often well-calibrated uncertainty estimates. Machine learning models make mis- takes, and by offering users full probabilistic predictions, Gaussian processes allow for more informed decision making in the face of uncertainty. Second, they allow users to encode and incorporate prior knowledge about their mod- elling task through the use of flexible and composable covariance functions or kernels. This aspect of Gaussian processes is particularly critical when faced with functions that are expensive or burdensome to evaluate, as they allow users to develop models that will generalize even with very limited data–in some cases, even before the first training example is collected. Despite these two clear advantages, some of the most popular applications of Gaussian processes have focused on exploiting the first advantage of GPs, and very little on exploiting the latter. As an example, in Bayesian optimization, off-the-shelf implementations often use the most generic and flexible covariance functions available. While a priori this ensures the generality of Bayesian opti- mization, it can significantly increase the number of function evaluations requred to perform optimization. In this thesis, we will demonstrate by way of application that the second advantage can be just as critical as the first. By leveraging expert medical knowl- edge, we develop a GP model that exploits basic facts about human hearing to dramatically improve both the quality and speed of modern audiometric testing. By automatically discovering independence structure in an objective function, we can leverage recent work on additive structure in Bayesian optimization to achieve exponentially lower sample complexities. Finally, by exploiting the prod- uct structure inherent to the RBF kernel–arguably the most common covariance function in usage–we will develop approximations for the GP marginal log like- lihood that result in exponential improvement to the running time complexity of Gaussian process inference compared to existing inducing point methods, resulting in the fastest asymptotic time complexity for training we are aware of. For my wife and best friend, Katie. I could not function without your love, support, and unbounded patience. P.S. Have you seen my wallet? 5 BIOGRAPHICAL SKETCH Jacob Gardner was born in 1990 in Raleigh, North Carolina where he attended Ravenscroft for kindergarten through highschool. He received his B.S. in 2013 from Washington University in St. Louis in an independently designed major focusing on computational biology. He started his Ph.D. in Computer Science at Washington University in 2013 advised by Kilian Weinberger, before mov- ing with the lab after to Cornell University in 2015. After graduating, he will be starting as a postdoctoral associate in Operations Research and Information Engineering. iii ACKNOWLEDGEMENTS The work I have done over the last five years would not have been possibly without the guidance and help of so many amazing and talented people along the way. I would like to thank my advisor, Kilian Weinberger. I’ve had a lot of people teach me a lot of things in graduate school, but Kilian taught me what you are truly supposed to learn in graduate school: how to be a scientist. He taught me how to design experiments that validate a claim or truly get to the bottom of a problem; the importance of always trying the simple solution first; and the value of clear and effective communication. I would also like to thank my other committee members, Kavita Bala and Karthik Sridharan, for their always helpful comments, insight, and dedication to my success as a graduate student. This thesis deals with Gaussian processes and Bayesian machine learning. These are tools that I found to be conceptually very difficult to grasp when I started my degree. Fortunately, I have had the pleasure of working with many exceptional experts in the area over the past five years, including John Cunning- ham, Roman Garnett, and Andrew Wilson. I would like to thank them for their guidance on everything Bayesian. The last chapter of this thesis and my recent work since has had a large focus on numerical linear algebra techniques, and I’d like to thank David Bindel for an excellent class and useful discussions in this area. Finally, I’d like to thank the outstanding labmates I’ve had throughout the years. Thanks to Stephen Tyree and Eddie Xu for being the realist foils to my usually excessive optimisim; to Matt Kusner for being a paragon of work ethic; to Geoff Pleiss for his guidance in software development; to Chuan Guo for always having the principled point of view; and to Felix, for finding the simplest solution for any problem. iv TABLE OF CONTENTS Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii 1 Introduction 1 1.1 Discovering covariance structure . . . . . . . . . . . . . . . . . . . 2 1.2 Exploiting covariance structure . . . . . . . . . . . . . . . . . . . . 3 1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Mathematical Background 6 2.1 Gaussian processes . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.1 Hyperparameter learning . . . . . . . . . . . . . . . . . . . 7 2.2 Covariance functions . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 Bayesian optimization . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.1 Bayesian active learning . . . . . . . . . . . . . . . . . . . . 12 2.3.2 Bayesian Model Selection . . . . . . . . . . . . . . . . . . . 13 2.4 Inference techniques based on matrix-vector multiplies . . . . . . 14 2.5 Structured kernel interpolation . . . . . . . . . . . . . . . . . . . . 15 3 Psychophysical Detection Testing with Bayesian Active Learning 17 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.3.1 Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.3.2 Observation Model . . . . . . . . . . . . . . . . . . . . . . . 24 3.3.3 Multiple Tones . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.3.4 Query Selection . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.3.5 Or-channel Analysis . . . . . . . . . . . . . . . . . . . . . . 28 3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4 Bayesian Active Model Selection 39 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.1.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.2 Active Bayesian model selection . . . . . . . . . . . . . . . . . . . 44 4.3 Active model selection for Gaussian processes . . . . . . . . . . . 45 4.3.1 Approximating the model evidence and hyperparameter posterior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.3.2 Approximating the predictive distribution . . . . . . . . . 47 v 4.3.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.4 Computational details for common observation likelihoods . . . . 49 4.4.1 Regression with Gaussian noise . . . . . . . . . . . . . . . 50 4.4.2 Probit regression . . . . . . . . . . . . . . . . . . . . . . . . 51 4.5 Audiometric threshold testing . . . . . . . . . . . . . . . . . . . . . 51 4.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.6.1 Diagnosing NIHL . . . . . . . . . . . . . . . . . . . . . . . . 55 4.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5 Discovering and Epxloiting Additive Structure for Bayesian Optimiza- tion 60 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.3 Model selection via MCMC . . . . . . . . . . . . . . . . . . . . . . 64 5.4 Additive structure discovery . . . . . . . . . . . . . . . . . . . . . 67 5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.5.1 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.5.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.5.3 Optimization of Benchmark Functions . . . . . . . . . . . . 71 5.5.4 Determining Cosmological Parameters . . . . . . . . . . . 74 5.5.5 Matrix completion . . . . . . . . . . . . . . . . . . . . . . . 76 5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6 Exploiting Product Structure for Scalable Gaussian Processes 80 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6.2 Matrix-vector multiplication with product kernels . . . . . . . . . 84 6.2.1 Structured kernel interpolation for products (SKIP) . . . . 88 6.3 Proofs for key results . . . . . . . . . . . . . . . . . . . . . . . . . . 89 6.3.1 Proof of Lemma 6.2.1 . . . . . . . . . . . . . . . . . . . . . . 89 6.3.2 Proof of Theorem 6.2.3 . . . . . . . . . . . . . . . . . . . . . 90 6.4 Evaluating MVM accuracy . . . . . . . . . . . . . . . . . . . . . . . 91 6.5 An exponential improvement to SKI . . . . . . . . . . . . . . . . . 92 6.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 7 Discussion and future directions 97 Bibliography 100 vi LIST OF TABLES 6.1 Asymptotic complexities of a single inference step with n training examples and m inducing points, using k Lanczos iterations for SKIP and p CG iterations for SKIP and KISS-GP. . . . . . . . . . . 91 6.2 Comparison of SKIP and other methods on higher dimensional datasets. In this table, m is the total number of inducing points, rather than number of inducing points per dimension. (*We use m = 100 for SKIP on all datasets except precipitation, where we use m = 120K.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 vii LIST OF FIGURES 1.1 A simple toy regression problem (left), along with four possible GP models given the six input datapoints (right). Each model corresponds to a different choice of covariance function. “Periodic x Linear” corresponds to a composition of two elementary co- variance functions. Composing covariance functions is discussed further in Section 2.2. . . . . . . . . . . . . . . . . . . . . . . . . . 3 3.1 Difference in mutual information I2 − I1 between a paired query and a single query: (a) discrete distribution with two atoms (α, β) = 0.05, 0.65, and corresponding ᾱ = 1 − (1 − α)2 ≈ 0.1, β̄ ≈ 0.88). Here I2 − I1 ≈ 0.18; (b) I2 − I1 as a function of α, β (white cross denotes the specific example of panel a); (c) the nor- mally distributed latent input case. I2 − I1 is shown as a function of the mean µ and correlation ρ. Colorbar at right is for both panel b and c. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.2 Standard grid search audiogram with tones played at every oc- tave from 250 to 8000 Hz, and every 5 dB HL from -10 dB to 80 dB, compared to a multi-tone GP audiogram with 60 iterations (and therefore 119 “samples”). . . . . . . . . . . . . . . . . . . . . 31 3.3 Comparison of multi-tone and single-tone GP audiometrics. Left: A GP trained on 100 single tones. Blue circles denote tones de- tected by the subject, and red crosses denote tones that were not detected. The posterior probabilities are shown as color contours. Right: Log likelihood of random presentation of tones (no ac- tive learning, shown in gray), active learning presentation of sin- gle tones (shown in blue), and active learning with paired tones (shown in red), under the ground truth audiometric function from the left figure. Log likelihood is plotted as a function of iterations in each audiometric testing strategy. Shaded areas de- note standard error. . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.4 The posterior probability of detection within the frequency / intensity space during a GP audiometric test on a human sub- ject. Panels show the learned GP after 1, 15, 30, and 60 itera- tions. Queries consist of a single or a paired tone (as selected by the model). Blue circles indicate a positive outcome (sound was heard), red crosses indicate a negative outcome. Paired tones with positive outcome (at least one of the two tones was heard) are connected by a blue line. Almost all queries are close to the fi- nal audible threshold (0.5 posterior detection probability), which is well approximated even after only 15 iterations. . . . . . . . . 33 viii 4.1 Distinguishing two models with active model selection. Given only the input data (black dots), deciding whether the red model or the blue model best describes the true function is a difficult task. However, with a single additional training input at the dashed black line, the models are immediately distinguished: the potential blue point has very low probability under the red model, and the potential red point has very low probability under the blue model. Choosing the next point that maximizes the disagree- ment among all candidate models is achieved by maximizing the mutual information. . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.2 Samples selected by B A M S (red) and the method from Chapter 3(white) when run on the normal-hearing ground truth and the N I H L model ground truth. Contours denote probability of de- tection at 10% intervals. Circles indicate presentations that were heard by the simulated patient; exes indicate presentations that were not heard by the simulated patient. Left: Normal hearing model ground truth. Right: Notch model ground truth. . . . . . 57 4.3 Posterior probability of the correct model as a function of iter- ation number. Left: Notch model ground truth. Right: Normal hearing model ground truth. . . . . . . . . . . . . . . . . . . . . . 58 5.1 An illustration of the proposal distribution g(M′ | M) on a simple input model with three sub-partitions, M = [1, 3][2][4]. Splits with dice represent choices performed uniformly at ran- dom (without replacement). See text for details. . . . . . . . . . . 66 5.2 Plot of the relative distance to the true loglikelihood as a function of the number of model evaluations (see text). For MCMC, this means the number of samples drawn in the Markov chain. For bag of models, this means the number of models in the selected bag. Error bars are standard error averaged over 10 runs. . . . . . 69 5.3 Optimizing the 10d Styblinksi-Tang and Michalewicz func- tions. Shaded error bars are 2 standard errors over 10 runs. BayesOpt+Model MCMC converges faster than BOM in both cases. 71 5.4 Optimizing the 10d transformed Styblinksi-Tang function. Shaded error bars are 2 standard errors over 10 runs. BayesOpt+Model MCMC converges to the same final solution as BOM much faster, and significantly outperforms it in terms of final objective value. . . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.5 Setting values of various cosmological constants to match exper- imental data using BayesOpt. Shaded regions correspond to 2 standard errors over 10 runs. BayesOpt+Model MCMC matches BOM 38% faster, and then outperforms it. . . . . . . . . . . . . . . 75 ix 5.6 Optimizing the reconstruction error of images with matrix com- pletion. Corner: Example of image reconstruction on the peppers image using hyperparameters found by both the baseline and by BayesOpt with additive structure. The additive BayesOpt param- eters produce a significantly sharper image. . . . . . . . . . . . . 76 6.1 Computing fast matrix-vector multiplies (MVMs) with the prod- uct kernel (1) ◦ (2)KXX KXX . 1: Rewrite the element-wise product as the diagonal ∆(·) of a product of matrices. 2: Compute the rank-k Lanczos decomposition of (1) (2)KXX and KXX . . . . . . . . . . . . . . 83 6.2 Left: Relative error of MVMs computed using SKIP compared to the exact value Kv. Right: Training time as a function of the number of inducing points per dimension on the Power dataset. KISS-GP scales badly here, because the required total number of inducing points scales exponentially with the number of dimen- sions. On the power dataset d = 4. . . . . . . . . . . . . . . . . . . 87 x CHAPTER 1 INTRODUCTION Gaussian processes (GPs) have emerged as a powerful probabilistic tool for modeling complex and noisy functions. They have found wide applicability in personalized medicine, time series analysis, prediction tasks in the physical sciences, and recently blackbox optimization. Their success is in large part due to two fundamental advantages they enjoy over many other models. First, they provide convenient and often well-calibrated uncertainty estimates. Machine learning models make mistakes, and by offering users full probabilistic predic- tions, Gaussian processes allow for more informed decision making in the face of uncertainty. Second, they allow users to encode and incorporate prior knowl- edge about their modelling task through the use of flexible and composable covariance functions or kernels. This aspect of Gaussian processes is particularly critical when faced with functions that are expensive or burdensome to evaluate, as they allow users to develop models that will generalize even with very limited data–in some cases, even before the first training example is collected. This thesis focuses on this second aspect of Gaussian processes: taking advan- tage of the structure imposed on the model by the choice of covariance function. In particular, we seek to answer two fundamental questions (1) how can we in- telligently select a prior (i.e., covariance function) even when faced with little or no data at all?, and (2) once a prior has been selected, how do we best exploit the structure encoded in that prior? In the next two section, we will make these questions more concrete and provide motivation for them. 1 1.1 Discovering covariance structure The ability to encode prior knowledge about a function through the covariance function is one of the defining aspects and key advantages of the Gaussian pro- cess. When strong domain knowledge about a modeling task is available, se- lecting an appropriate covariance function can allow GPs to accurately model a function with very little data. This fact is arguably at the heart of the GP’s success in the low data regime. In Figure 1.1, we illustrate this point on a simple toy regression task for clarity. Six training examples are sampled from a true function, and a Gaussian process is then trained with four different covariance functions. The lower right plot depicts a Gaussian process model with a covariance function that exactly encodes the class of functions that the true function belongs to. As a result, it is able to exactly recover the true function, despite being limited to only six training examples. Choosing a strong covariance function to encode prior knowledge has one major caveat: covariance functions that encode strong prior knowledge are often not flexible or general purpose. They significantly narrow the hypothesis class that can be used to explain the training data. For example, the periodic kernel discussed later can only give rise to models that exactly repeat. As a result, care must be taken to ensure that the chosen covariance function encodes valid as- sumptions, or the model’s generalization ability will suffer greatly. This can be burdensome on users, and in practice–despite the rich and powerful language of covariance functions available–standard practice in off the shelf Gaussian processes is to use overly generic kernels. This motivates techniques that select 2 Figure 1.1: A simple toy regression problem (left), along with four possible GP models given the six input datapoints (right). Each model corresponds to a dif- ferent choice of covariance function. “Periodic x Linear” corresponds to a compo- sition of two elementary covariance functions. Composing covariance functions is discussed further in Section 2.2. covariance functions automatically. Recently proposed methods exist for auto- matically and effectively discovering covariance structure [19]. However, these techniques typically assume a somewhat large pre-existing set of data, which precludes their application in the limited data setting. 1.2 Exploiting covariance structure The procedures necessary to train Gaussian processes are often prohibitively expensive, and prevent their application to tasks with more than a few thou- sand data points or a handful of input dimensions without significant losses in performance. Traditional techniques for training Gaussian processes involve computing linear solves and log determinants with an n×n positive definite and symmetric kernel matrix K–operations that require O (n3) time. Recent work on so-called inducing point methods construct a small set of m < n inducing points 3 that allow for the construction of a low-rank approximation of K. This allows inference to be performed in O (nm2) time [97, 89]. Structured kernel interpolation (SKI) [113] is a recently proposed inducing point method that, given a regular grid ofm inducing points, allows for inference to be performed in O (n+m logm) time–an impressive improvement over other inducing point methods. Unfortunately, to achieve this impressive running time, the number of inducing points m grows exponentially with the dimensionality of the inputs, limiting the applicability of SKI to problems with fewer than 5 input dimensions. In this thesis, we will demonstrate that by exploiting product structure displayed by many popular covariance functions, we can reduce this exponential dependence on input dimensionality to linear time. 1.3 Outline This thesis details the following contributions for dealing with these two ques- tions: 1. In Chapter 2, we will describe the basic mathematical background that this thesis builds upon. We give an overview of Gaussian processes, including covariance function composition and model selection. We review Bayesian op- timization and Bayseian active learning, two key applications that will appear later in the thesis. We additionally review techniques for scalable Gaussian pro- cess inference, including recent work on inducing point methods. 2. In Chapter 3, we develop a novel audiometric detection (hearing) test based on active learning with Gaussian processes. This is a real-world case study 4 motivating the need for intelligent covariance function selection in the small data regime. By choosing a likelihood and covariance function using expert domain knowledge, we are able to develop a hearing test that is significantly faster and more accurate than state-of-the-art, yielding accurate test results typically with fewer than 20 tones presented to a patient. 3. In Chapter 4, we extend the active learning techniques from Chapter 2 to not only learn the patient’s hearing response function, but also actively select an appropriate covariance function. The idea of active learning for model selection will provide us with a tool to quickly choose an appropriate prior for a task in the regime where data collection is expensive. 4. In Chapter 5, we leverage an efficient MCMC sampling strategy to scale the active model selection technique discussed above to deal with choosing from large classes of priors. We apply this to the problem of Bayesian optimiza- tion for high dimensional functions, and develop a technique that–by automati- cally discovering whether a function decomposes additively and exploiting this structure–is able to significantly outperform standard Bayesian optimization techniques when faced with more than three or four input dimensions. 5. In Chapter 6, we describe how to leverage covariance function structure for efficient inference. We describe a technique based on structured kernel inter- polation (SKI) [113] that can exploit structure existing in certain types of covari- ance function compositions. This technique results in an exponential improvement in the running time complexity of SKI when faced with high dimensional data. 5 CHAPTER 2 MATHEMATICAL BACKGROUND 2.1 Gaussian processes A Gaussian process generalizes the multivariate normal distribution to define a distribution over functions. Analogous to the mean vector and covariance matrix of the multivariate normal distribution, a GP is specified by a prior mean function and prior covariance function f(x) ∼ GP (µ(x), k(x,x′)). Gaussian processes by definition have the property that the function values at any finite set of inputs [x1, ...,xN ] are jointly Gaussian distributed: f = [f(x1), ..., f(xN)] ∼ N (µX , KXX) where µX = [µ(x1), ..., µ(xN)]> and KXX = [k(xi,xj)]Ni,j=1. Throughout, we will denote by KAB the kernel matrix achieved by applying the kernel function to all pairs of points with one from A and one from B, (ai,bj). While µ(·) can be chosen to be essentially any function, the covariance function has a typical restriction: in the case where A = B, K must be a valid covariance matrix (i.e., it must be symmetric positive definite). For practical purposes, this restriction is often relaxed–some covariance functions may result in K having some zero eigenvalues. It is common in Gaussian process regression to assume an additive noise model relating function values to observed measurements. That is, we assume an observation yi is related to the observed features xi according to yi = f(xi)+, where  ∼ N (0, σ2). Given a training dataset D = {(x1, y1), ..., (xn, yn)}, the goal of inference is to infer the posterior distribution over the latent function, p(f | D). 6 Define K̂XX = [K 2XX + σ I] for notational simplicity. Under the Gaussian noise observation model this distribution is again a Gaussian process: ( ) p(f(x∗) | D ′) ∼ GP µ (x∗f |D ), k ∗ ∗f |D(x ,x ) , µf |D(x) = µ(x ∗) +K K̂−1x∗X XXy, (2.1) k ∗ ∗ −1 >f |D(x ,x ) = Kx∗x∗ −Kx∗XK̂XXKx∗X , (2.2) These equations can be derived using standard Gaussian conditioning and sum rules. This follows directly from the definition of the Gaussian process above: the observed training labels y1, ..., yn are the (noisy) observed function values for x1, ...,xn. Given some set of test points X∗, the definition of a GP immediately implies that f(X∗) = [f(x∗1), ..., f(x∗t )] and [y1, ..., yn] are jointly Gaussian dis- tributed:       y   µ (y)   KXX + σ2I KXX∗∼ N ,  . (2.3) f(X∗) µ(f(X∗)) K>XX∗ KX∗X∗ This allows us to immediately derive equations (2.1) and (2.2) using standard properties of Gaussian processes. 2.1.1 Hyperparameter learning All kernel matrices implicitly depend on hyperparameters θ. The log marginal likelihood of the data, conditioned only on these hyperparameters, is given by | −1log p(y θ) = y>K̂−1 1XXy − log |K̂XX |+ c , (2.4)2 2 All prior hyperparameters can be treated automatically, either by maximizing or by choosing a prior p(θ) and sampling from p(θ | D) ∝ p(y|θ)p(θ). 7 2.2 Covariance functions As discussed above, the two entities that must be specified to describe a Gaussian process prior are the prior mean function and prior covariance function. In principle, any function k(x,x′) that results in KXX being positive definite and symmetric is a valid choice of covariance function. In machine learning in particular, such functions have often been studied in the context of kernel methods (e.g., [88]), and the functions themselves are typically referred to as kernels. In the context of Gaussian processes, kernels specify the covariance (and hence, roughly the degree of correlation) between the function values of two inputs, x and x′. Beyond this, kernels specify what functions are likely a priori. There are in some sense two “views” of how kernels can best be chosen and utilized depending on how much data is available. Composing simple kernels. The first option is to choose or compose relatively simple commonly used kernels. There are a variety of ways in which kernels can be combined to perform new kernels, but in this thesis we will primarily focus on two: addition and mul- tiplication. In particular, if ka(x,x′) and k (x,x′b ) are kernels, then the following are also kernels: k ′ ′a+b = ka(x,x ) + kb(x,x ) (2.5) ka×b = ka(x,x ′)× kb(x,x′) (2.6) This follows immediately from the fact that the set of positive definite matrices is closed under addition and elementwise multiplication. In particular, letting 8 A ◦B denote the elementwise matrix product, the training covariance matrices resulting from the above kernels, Ka+b = Ka +Kb (2.7) Ka◦b = Ka ◦Kb (2.8) are both positive definite. This strategy often allows practioners to encode prior knowledge about their modeling function in their choice of Gaussian process prior, which can allow for learning even in the presence of very little data. As an example, in Figure 1.1, data for a simple toy regression problem and the true underlying function is plotted, as well as four possible Gaussian process models. Each model corresponds to a different covariance function–three very common choices of covariance function, as well as a composition of two covariance functions. The model resulting from a product of a linear kernel and a periodic kernel results in a very accurate model of the underlying function, even though only six data points are available. 2.3 Bayesian optimization Gradient-free optimization of expensive black-box functions is a ubiquitous task in a many fields. Applications range from sensor placement [27], to robotic con- trol [10], to hyperparameter tuning for complex machine learning algorithms such as deep convolutional neural networks [90]. These optimization tasks are challenging, because we typically have little knowledge about the objective func- tion beyond the ability to evaluate it at a chosen location, and because the func- tion is expensive to evaluate. For example, each lab experiment costs time and 9 money, and evaluating a given set of hyperparameters for a deep neural network may take days or even weeks of training time. Bayesian optimization (BayesOpt) [64, 7] is a widely used technique for op- timizing expensive black-box functions. Broadly, BayesOpt operates by main- taining a probabilistic belief about the objective function. We place a prior distribution—most often a Gaussian process—over the objective function, which we condition on data as we observe it. We use this belief to drive the optimization process by sequentially evaluating the objective. At each iteration, the posterior distribution of the function is used to suggest queries to evaluate next, typically trading off exploration (i.e., querying areas with high posterior uncertainty) and exploitation (i.e., querying areas with low posterior mean). More formally, Bayesian optimization (BayesOpt) is a technique for optimiz- ing an expensive black-box function [64]: min f(x) x∈X It has become popular in the machine learning community in recent years for its application to hyperparameter tuning [90], but as a strong global optimization algorithm with many theoretical results [92, 8], it has been successfully applied in many other areas as well [27, 10, 58, 12]. For a comprehensive overview of BayesOpt see [7]; we give a brief overview of the main components below. The core idea of BayesOpt is to deal with the fact that f(x) is expensive by constructing a model of f given samples seen so far, and then using this model to decide where to sample next. This new sample reveals new information about the function, and our model of it can therefore be updated; the process then repeats. Intuitively, if the model of f is perfect and can be optimized cheaply, then the function itself can be optimized cheaply. However, when we are faced 10 with very few and noisy samples the model will typically be far from perfect. It is therefore critical that the model have well-calibrated uncertainty estimates. An implementation of Bayesian optimization requires specifying two entities: the model to use for f , and an acquisition function that evaluates or estimates how useful a point would be to sample next. In this thesis, we will exclusively consider Gaussian processes as the choice to model f , as they naturally provide uncertainty estimates with their predictions, and are by far the most common model employed for this task. Acquisition functions. After conditioning on data, we use the predictive dis- tribution (equations (2.1) and (2.2)) to evaluate how promising each candidate x∗ is using some acquisition function. One of the most-popular acquisition functions used in BayesOpt is the expected improvement. Let fbest be the best function value observed so far. Define I(f(x∗)) as the improvement obtained by sampling at x∗: ( ) ( ) I f(x∗) = max 0, fbest − f(x∗) The EI acquisition function evaluates the expectation of the above function over the predictive posterior EI(x∗ [ ( )] ) = ∫E I f((x ∗) p(f(x∗)|D,x∗) fbest ) = fbest − f(x∗) p(f(x∗) | D,x∗) df(x∗) −∞ The expected improvement may be evaluated analytically. Further, the form of the expected improvement has two terms that can naturally be interpreted as encouraging both exploitation, i.e. points that our model suggest have low pre- dictive mean µf |D(x), and exploration, i.e. points with high posterior variance σf |D(x,x). To maximize the acquisition function, it is typically evaluated on a 11 fine grid of candidate points within a pre-defind hyper-cube. Although the eval- uation of a single candidate point is very fast, it is important to emphasize that this search scales exponentially with the number of dimensions and can become prohibitively slow in regimes of 10 or more dimensions. 2.3.1 Bayesian active learning The goal of Bayesian active learning is to sequentially choose samples so as to accurately model an unknown function f(x) with as few samples as possible on some domain x ∈ X . From a Bayesian perspective, we would like for our predictive posterior belief p(y∗|X,y,x∗) to match f(x∗) as well as possible and as confidently as possible. A natural metric that captures how valuable a single candidate point (xt, yt) is when added to an existing dataset X,y is the mutual information, proposed for this purpose by [40]: I(f , yt|xt)=H [f |X,y]−E [H[f |X,y, yt]]p(yt|X,y,x ) (2.9)t This is the expected reduction in differential entropy for the latent function, H[f ] after observing the new point xt. Much as in the Bayesian optimization setting with expected improvement, Bayesian active learning proceeds by sequentially choosing selecting points xt that maximize mutual information: xt = arg max I(f , y|x) (2.10) x In general, computing Equation 2.9 as presented is intractable. However, [40] present a method for approximating it both efficiently and effectively. A re- derivation of this method will be presented in Chapter 3. 12 2.3.2 Bayesian Model Selection When using Gaussian process regression to model a function f , the choice of covariance function k is critical, as it allows the data scientist to encode informa- tion about the structure of f . However, doing so manually may require expert knowledge about both the function being modeled as well as about kernel meth- ods in general. Bayesian model selection alleviates this need by providing a mechanism to perform inference over the covariance function from a set of can- didate kernels. This technique has become popular in recent years for automatic covariance function selection [19]. Suppose we are given a set of observed data D = (X,y) and a set of kernel functions to choose from,M1, ...,Mm, with corresponding hyperparameters θi. The first object of interest in Bayesian model selection is the model evidence of a model, p(y | Mi). It represents the likelihood of having drawn the datasetD from a Gaussian process with kernel KM . Computing the model evidence exactly isi in general not analytically tractable for Gaussian processes. One option is to use the MLE hyperparameters: log p(y | Mi) ≈ p(y | θ̂i,Mi). (2.11) The log marginal likelihood for a modelMi and hyperparameters θi can be computed analytically: ( ) log p(y | M , θ ) = −1y>K−1 1i i M y − log (2π)n |K | .2 i 2 Mi Because the log marginal likelihood does not penalize model complexity, the Bayesian information criterion (BIC) is often used to penalize the number of hyper- parameters |θi| and better approximate the model evidence: −1BIC = log p(y | Mi, θ 1i)− |θi| log(|D|). (2.12)2 2 13 In our experiments, we found that the difference in log marginal likelihood be- tween “correct” additive mod(els a)nd “incorrect” additive models was so large that the penalty term−1 |θi| log |D| did not have a sigificant impact. Given some2 prior distribution over the possible kernels, p(M), applying Bayes’ theorem to the above, and marginalizing out θi, results in the model posterior, p(y | Mi)p(Mi) p(Mi | D) = ∑ , (2.13) j p(y | Mj)p(Mj) a probability distribution over the possible models given the data. The model posterior provides us with a direct measure of how well each model in the set explains the data. It allows us to generalize eqs. (2.1) and (2.2) to compute a predictive distribution marginalized over all possible models: ∑ E [p(f(x∗) | D,x∗,M)]p(M|D) (2.14) = p(f(x∗) | D,x∗,Mi)p(Mi | D). (2.15) i 2.4 Inference techniques based on matrix-vector multiplies In order to compute the predictive mean in (2.1), the predictive covariance in (2.2), and the marginal log likelihood in (2.1.1), we need to perform linear solves (i.e. [K 2 −1XX + σ I] v) and log determinants (i.e. log |KXX + σ2I|). Traditionally, these operations are achieved using the Cholesky decompo- sition of KXX [73]. Computing this decomposition requires O (n3) operations and storing the result requires O (n2) space. Given the Cholesky decomposition, linear solves can be computed inO (n2) time and log determinants in O (n) time. There exist alternative approaches [113] that require only matrix-vector mul- tiplies (MVMs) with [K 2XX + σ I]. To compute linear solves, one can use use the 14 method of conjugate gradients (CG). This technique exploits that the solution to Ax = b is the unique minimizer of the quadratic function 1x>Ax− x>b, which 2 can be found by iterating a simple three term recurrence. Each iteration requires a single MVM with the matrix A [85]. Letting µ(A) denote the time complexity of an MVM with A, k iterations of CG requires O(kµ(A)) time. If A is n× n, then CG is exact when k = n. However, the linear solve can often be approximated by k < n iterations, since the magnitude of the residual r = Ax − b often de- cays exponentially. In practice the value of k required for convergence to high precision is a small constant that depends on the conditioning of A rather than n [28]. A similar technique known as stochastic Lanczos quadrature exists for ap- proximating log determinants in O(kµ(A)) time [17, 99]. In short, inference and learning for GP regression can be done inO (kµ(KXX)) time using these iterative approaches. Critically, if the kernel matrices admit fast MVMs – either through the struc- ture of the data [77, 14] or the structure of a general purpose kernel approxima- tion [113] – this iterative approach offers massive scalability gains over conven- tional Cholesky-based methods. 2.5 Structured kernel interpolation Structured kernel interpolation (SKI) [113] replaces a user-specified kernel k(x,x′) with an approximate kernel that affords very fast matrix-vector mul- tiplies. Assume we are given a set of m inducing points U that we will use to approximate kernel values. Instead of computing kernel values between data points directly, SKI com- 15 putes kernel values between inducing points and interpolates these kernel values to approximate the true data kernel values. This leads to the approximate SKI kernel: k(x, z) ≈ w K w>x UU z , (2.16) where wx is a sparse vector that contains interpolation weights. For example, when using local cubic interpolation [46], wx contains four nonzero elements. Applying this approximation for all data points in the training set, we see that: KXX ≈ WXKUUW>X (2.17) With arbitrary inducing points U , matrix-vector multiplies with [WK >UUW ]v require O (n+m2) time. In one dimension, we can reduce this running time by instead choosing U to be a regular grid, which results in KUU being Toeplitz. In higher dimensions, a multi-dimensional grid results in KUU being the Kronecker product of Toeplitz matrices. This results in the ability to perform matrix-vector multiplies in at most O (n+m logm) time, and O (n+m) storage. 16 CHAPTER 3 PSYCHOPHYSICAL DETECTION TESTING WITH BAYESIAN ACTIVE LEARNING 3.1 Introduction In this chapter, we demonstrate the effectiveness of exploiting problem structure when it exists using the real-world task of psychophysical detection testing. We develop a new model for psychophysical detection, guided by experts in the field. While this chapter deals primarily with the technical details and modelling choices made, we note that the primary techniques and results discussed have been thoroughly verified clinically as well [91]. The work in this chapter was published in UAI 2015 [25]. In a psychophysical detection test, a test operator presents n sensory stim- uli to a subject, and ask for n binary reports as to whether each stimulus was detected or not. Detection tests exist for vision [79], pain [13], and many other settings. Perhaps the most common example is audiometry [11, 16, 41]. In the standard audiometric detection test, a subject is presented with a sequence of n tones xt ∀t= 1, . . . , n, where each tone x ∈R2t is a pure tone with a specific frequency (pitch) and intensity (volume). The subject reports an observation yt = 1 if they heard the tone (for example, by raising their hand), and a yt = 0 is concluded in the absence of a positive report. The purpose of the test is to infer, from this sequence of observations, the underlying audiometric function g(x). An audiometric function is unique to a subject, and describes how likely the subject is to hear sounds over the domain of typical frequencies and intensities. There is substantial variability in each person’s audiogram, particularly for those with 17 partial, selective, or degenerative hearing loss [30, 75, 80]. Accurate estimates of audiograms are thus essential to understanding human audition, and to all medical studies and treatments of various forms of hearing loss. A standard auditory detection test is carried out by playing an n-length se- quence of pure tones on a pre-defined grid in frequency-intensity space. This approach, while simple, has several salient drawbacks that lead to an unneces- sarily large n. First, a given tone is played multiple times, even if it is highly audible or highly inaudible. Second, information is not shared between previous outcomes. For example, human audition is monotonically increasing in intensity, but in the standard test, even if a particular frequency of sound is heard at a given intensity, tones with the same frequency but higher intensity will still be tested. Finally, owing to limitations on the size of sequence n, a standard detec- tion test probes only six discrete frequencies [57]. The coarseness of this grid can cause significant errors, as human hearing loss can span a range narrow enough to be entirely missed by these six frequencies [44, 115, 114]. All of these issues, combined with the impracticality and burden to human subjects of a large n sequence, motivate an active learning approach. We will take take a more modern view of audiometric testing as an active learning problem in personalized medicine [47, 78, 4], extending and adapting recent work on active learning with Gaussian processes (GPs) as discussed in Section 2.3.1 [40, 43, 26]. This approach addresses all the drawbacks of grid- sampling by performing Bayesian active learning of the audiometric function g(x). We use this model to sequentially sample at each time step t the most infor- mative next tones conditioned on the previous t−1 observations y1, ..., yt−1. This model significantly enhances the accuracy and efficiency of learning audiograms. 18 These advancements stem from two major contributions that we developed after discussions with expert audiologists: 1. We extend and adapt existing work on Bayesian optimization and active learning to the setting of psychophysical detection tests. We present a model that incorporates strong prior knowledge about the auditory stim- ulus space, and we present experimental results demonstrating the effec- tiveness of a Bayesian active learning approach. 2. We develop a novel ‘OR-channel’ likelihood that allows the query of mul- tiple tones simultaneously. We analyze this likelihood in the active learning context, clarifying the non-obvious intuition for why and when such an approach can outperform single-tone queries. We evaluate our algorithm on both simulated and real audiometric detection tests. Our active learning approach obtains finer grained estimates of the audio- gram g(x) with substantially fewer stimuli queries (lower n). We note that, in the remainder of this chapter (notably our experiments), we will continue to use the example and nomenclature of audiometry, though our algorithm is precisely equivalent for other psychophysical detection tests as well. 3.2 Related work A number of papers have been recently published on Bayesian active learning. Many papers have considered Bayesian active learning using mutual informa- tion in the regression setting [92, 33, 48]. However, the computation of mutual information is significantly less tractable in the classification setting. To our 19 knowledge, [40] is the first paper to leverage the rewriting of mutual informa- tion in (3.7), allowing for tractable computation of mutual information with the Bernoulli observation model. This paper is most similar to ours, as the Bernoulli observation model is identical to our single tone audiometric algorithm. In the conext of our application, we make two primary contributions over [40]. First, we develop a novel OR-channel likelihood, that allows for simultaneous querying of multiple tones. Second, we extend the basic mutual information derivation to this new likelihood, allowing us to compute the mutual information of an entire set of tones to be queried simultaneously. A number of other, orthogonal applications and extensions of this method have since been published [43, 26]. There has also been an enormous quantity of work on the topic of active learn- ing outside the Bayesian active learning framework as well (e.g., [98, 51, 83, 15]). Many active learning methods even approach the problem with a similar intu- ition as Bayesian active learning: the strategy is to define some notion of uncer- tainty, and then query unlabled examples that maximize this notion of uncer- tainty; this notion of uncertainty is commonly based on the unlabeled example’s distance from the margin. Indeed, in the Bayesian active learning framework, mutual information is also maximized when an unlabeled example is either close to the “decision boundary” p(y | D) = 0.5 or has high posterior variance. While Bayesian active learning has been shown to work quite competitively with non- Bayesian appraoches, the main advantage of the Bayesian framework we focus on here is the ability to encode expert knowledge about human hearing via the mean and covariance functions. Furthermore, the Bayesian framework gives us a natural way to handle querying multiple tones simultaneously via a novel OR-channel likelhood, a concept that is not easily adapted to standard empirical risk minimization. 20 Alternative approaches exist for active learning as well, in particular based on expected loss reduction (e.g., [76, 34]). In these approaches, the idea is to select unlabeled examples that, in expectation over the classifier’s prediction of the label, maximally reduce the loss if added to the dataset. This can in a sense be viewed as most directly analogous to the mutual information calculation used in Gaussian processes, which focus on directly reducing the entropy of the model in expectation over the label. However, using the rewriting of mutual information in (3.7), Gaussian processes enjoy the distinct advantange of being able to compute mutual information in closed form without retraining the model. In the audiometric setting this distinction is crucial, as we have only a small number of seconds each iteration to select a new tone, and retraining the model for each candidate tone is entirely infeasible. Alternative techniques for estimating audiograms have existed for many years. Sweep-based audiometry, such as Bekesy audiometry and Audioscan, are able to produce a more continuous estimate of the audiogram that can often detect notches, but with the disadvantage of a particularly time- and attention- demanding task [44, 62]. Several Bayesian audiogram estimation techniques, such as parameter estimation by sequential testing (PEST) and maximum likeli- hood methods also exist, although most do not simultaneously estimate multiple frequencies [32, 54, 67, 69, 96]. More recent advances in audiometric testing have focused on improving the accessibility of hearing screening by distribution over telephone, Internet, or mobile devices [87, 93, 100, 104, 107]. 21 3.3 Method In this section, we discuss our model and approach to psychophysical detection testing using Gaussian processes. As a running example, we will use audiometry. In an audiometric detection test, a patient is presented with tones of varying frequency and intensity. The patient is asked to respond (e.g., by pressing a button) if he/she hears the sound. In the absence of a timely reaction the tone is assumed to be inaudible to the patient. The delay between tones is sufficiently randomized to prevent patients from responding to predictable patterns [30]. At time step t, we choose a tone xt = (ω, i) with frequency ω and intensity i to present to the subject. In return, we receive a response yt ∈ {0, 1}, where yt=1 indicates that the patient heard the sound and yt=0 indicates that he/she did not. There is inherent observation noise in patient responses. When patients become uncertain when presented with sounds very close to their threshold (i.e., the sounds become faint and hard to hear). Patients do not have perfect detection boundaries, and only hear tones near their hearing threshold with some probability. This uncertainty is observed in reality for a number of reasons. First, patient attention may waver, or they may be unable to distinguish between tones near their hearing threshold and slight background noise. Alternatively, this uncertainty may derive from physical sources. For example, if a tone is faint enough, a patient may be able to hear that tone between–but not during–heart beats. Our goal is therefore to predict the probability that a patient is able to hear a given sound. 22 3.3.1 Prior In the case of audiometric testing, we have valuable prior knowledge about a patient’s audiometric function that we can encode in our GP model. In particular, the probability that a patient hears a sound (ω, i) is monotonically increasing in the intensity i. In other words, if a tone is audible to a patient, then an even louder tone is more likely to be audible. Furthermore, audition is a smooth function with respect to the frequency ω. Human nerves that detect similar frequencies are co- located in the cochlea and, as a result, a partial loss of hearing in one frequency is likely to cause a loss of hearing in nearby frequencies. A GP prior can encode both properties naturally through its covariance function. A combination of a linear kernel in intensity and a squared exponential kernel in frequency ensures the monotonicity and smoothness properties{: } 1 k ((ω, i), (ω′, i′)) = ii′ + exp − ‖ω − ω′‖22 . (3.1)` Here, ` regulates the smoothness (characteristic length-scale) w.r.t. frequency. Note that a GP prior is technically incapable of supporting only monotonically increasing functions. However, we only need that the posterior probability of detection be monotonic, which is generally true after a few tones are sampled (for example, see figure 3.4). For the mean function µ0, we note that intensity is typically measured in dB HL, which is an empirical unit of measurement normalized based on population data so that at each frequency the typical human hearing threshold is around 0 dB HL. As a result we choose a constant mean function. 23 3.3.2 Observation Model This mean function, µ0(·), and covariance function, k(·, ·), define a prior over real-valued latent functions f ∼ GP(µ0(·), k(·, ·)). Our goal is to predict the prob- ability (i.e. within [0, 1]) that a patient hears a tone with a specified frequency and intensity. We can never observe these probabilities directly. For any tone, we can instead only observe the outcome of a Bernoulli trial with the true probability. This setting is akin to Gaussian Process classification [52] and similarly we use a Bernoulli likelihood, where Pr [y = 1|f ] = Φ(f) and Φ(·) denotes the standard normal cumulative density function (CDF). The linear component of the kernel in (3.1) results in a function that, after being warped by Φ(·), is sigmoidal in the intensity dimension: after the slope is fixed (by conditioning on the first few points), the posterior belief about Φ(f) will tend to 0 as the intensity decreases and 1 as the intensity increases. This reflects our prior knowledge that tones of extremely low intensity are unlikely to be heard, whereas tones of high intensity are more likely to be audible. Predictions. Once we have collected data, we can use the predictive distribu- tion p(y∗|X,y x∗) to summarize our belief about whether the patient will hear a test tone x∗. As our likelihood is non-Gaussian, the posterior p(f ∗|X,y,x∗) has no closed form solution. However, an approximate Gaussian posterior over f ∗ can be obtained with the standard Laplace approximation to the likelihood [52, 74]. 24 3.3.3 Multiple Tones An interesting property of audiometry (that may also be common to other psy- chophysical domains, e.g. visual or touch sensory tests), is that multiple tone stimuli can be presented to a patient simultaneously by overlaying tones. In this setting however, we can still only query whether the patient heard the overlaid tones. A negative response to a multi-tone sample indicates that the patient did not hear any of the overlaid tones; a positive response indicates that the patient heard at least one of them. OR-Channel. Presenting a patient with k tones leads to a novel extension to the standard Bernoulli likelihood used in classification. We present the patient with k tones x1, ...,xk. The patient hearing the individual tone xi is still the outcome of a Bernoulli trial with ¶yi|fi = Φ(fi), as the individual trials are independent conditioned on f . However, we cannot directly observe any individual yi. Rather, we record them through an OR-channel, that is we observe ȳ, which is 1 if the patient hears at least one of the k tones presented, and is 0 otherwise. This leads to the OR-channel likelihood: ∏ Pr [ȳ = 1|f1..k] = 1− (1− Φ(fj)) ∏j = 1− Φ(−fj) (3.2) j Note when k=1, eq. (3.2) reduces to the standard Bernoulli likelihood for single tones, Pr [ȳ = 1|f1] = Φ(f1). 25 3.3.4 Query Selection In iteration t we present the subject with a query set of overlaid tones qt = [{x1, ...,xk}] and query the response ȳt. To select qt we pick the point set that maximizes the expected decrease in posterior entropy, analogous to eq. (2.9). Single tone mutual information. We first consider the setting of picking a single tone, i.e. where qt = [{xt}]. [40] derive an analytical approximation to the mutual information, eq. (2.10), when using a Bernoulli likelihood. These results directly apply when picking a single tone xt. When ft is known, the entropy of the Bernoulli variable yt is given by h(Φ(ft)), where h(p) = −p log p− (1−p) log(1−p), is the Bernoulli entropy function. We can rephrase the entropy in eq. (2.9) as I(f , yt|qt) = H [yt|X,y]− E [H [yt|f ]]p(f |X,y) , (3.3) and rewrite both terms on the right hand side through h. If f is unknown and yt is conditioned on X,y, the entropy can be expressed in terms of the expectation over the posterior for ft: H [yt|X,y] = h (E [Φ(ft)]) . (3.4) If ft is known we have Pr [y|f ] = Φ(ft), yielding H [yt|f ] = h (Φ(ft)) . (3.5) Substituting eqs. (3.4), (3.5) into (3.3) leads us to the following expression for the mutual information between f and yt in the single tone scenario: I1(f , yt|qt) = h (E [Φ(ft)])− E [h (Φ(ft))] . (3.6) 26 The computation of I1 involves an intractable integral, which can be approx- imated through numerical integration. This approach is very fast in practice as the integral is only one dimensional and can be computed efficiently using quadrature. Multiple tone mutual information The above results can be extended to com- pute the mutual information when sampling multiple tones qt = [{x1, ...,xk}]. In particular, the probability of observing ȳt = 1 changes from Φ(ft) to the OR- channel probability, (3.2)(. Thu∏s, when f1,)..., fk are known, the entropy of the Bernoulli variable ȳt is h 1− ki=1 Φ(−fi) . To simplify notation, let us define p̄1 = Pr [ȳ = 1|f1..k] as defined in (3.2). Sub- stituting p̄1 for Φ(ft) in (3.6) gives the mutual information of paired tone sample qt after observing the outcome ȳt: Ik(f , ȳt|qt)=h((E [p̄[1∏])−E [h(p̄1]))] [ (∏ )] = h E jΦ(−fj) −E h jΦ(−fj) (3.7) where the second equality holds by the linearity of expectation and because h(p) is a concave function that is symmetric about p = 0.5 (i.e. h(p) = 1−h(p)). The last term leads again to an intractable integral. However, similar to the one tone scenario, Ik can also be evaluated efficiently using numerical integration, as k is relatively small. Computational Considerations Finding a set of k ≤ K((to)n)es (k)qt to maximize Ik(f , ȳt|qt) from a candidate set X of size S requires O S considerations. Ink order to ensure that patients do not have to wait for a lengthy duration between sounds are played, we construct a set of multiple tones to play greedily. We select 27 a b 1 c 1 0.1 B 1 h(·) Single Paired I 01 I2 −1 0 −2 0 0 −3 −0.1 0αᾱ β β̄ 1 0 α 1 0 correlation ρ 1 Figure 3.1: Difference in mutual information I2 − I1 between a paired query and a single query: (a) discrete distribution with two atoms (α, β) = 0.05, 0.65, and corresponding ᾱ = 1− (1− α)2 ≈ 0.1, β̄ ≈ 0.88). Here I2 − I1 ≈ 0.18; (b) I2 − I1 as a function of α, β (white cross denotes the specific example of panel a); (c) the normally distributed latent input case. I2− I1 is shown as a function of the mean µ and correlation ρ. Colorbar at right is for both panel b and c. the best single tone by exhaustively searching X . Then, to select the best set of size k, we exhaustively add each ∈ X to the best set of size − , (k−1)x̂ k 1 qt and compute the expected decrease in posterior entropy of (k−1)qt ∪ x̂. This greedy selection procedure reduces the computational complexity of considering tone sets of up to size k to O (Sk), and in practice requires only a few seconds of computation time. 3.3.5 Or-channel Analysis We first investigate the OR-channel likelihood of eq. (3.2), as it is unclear if this elaboration can provide any benefit over a standard Bernoulli likelihood. Intu- itively, the result of ȳ = 0 from an OR-channel is quite informative: all inputs into that channel must have been 0 (in the auditory example, no sounds were heard). On the other hand, the result of ȳ = 1 is much less informative than in the Bernoulli channel, as it means only that one or more of the inputs were 1 (some sound or sounds were heard), but there is no information about which. Here 28 β mean µ I2−I1 we analyze simple models that support the use of the OR-channel likelihood. We compare a single input, corresponding to the standard Bernoulli likelihood, to a paired input, corresponding to an OR-channel likelihood with two inputs. That is, with inputs {f1, f2} and output y ∈ {0, 1} as above, our quantities of interest are I1 := I(y, f1) and I2 := I(y, {f1, f2}), and we seek to understand if more information about the inputs can exist in the paired-input query, than in the single-input query. OR-channel Inputs With Discrete Support The simplest case involves perfectly correlated inputs f1 = f2, and further, a discrete distribution on f1 with two atoms of equal mass. The implied prob- ability φ(f1) will then have the same discrete distribution, which we write as p (φ(f )) = 11 δ(φ(f ) = α) + 1 1 δ(φ(f1) = β), for some atoms α and β. Then, the2 2 mutual information of the single query is: I1 = H(y)−H(y|f1) = h ((Ef [φ (f1)]))− Ef [h (φ (f1))] (3.8) 1 1 = h (α + β) − (h(α) + h(β)) , 2 2 where Ef is the expectation under the distribution on f . The OR-channel likeli- hood for two terms is similarly p(y = 1|{f1, f2}) = 1 − (1− φ(f1)) (1− φ(f2)) = 1−(1− φ(f1))2. The mutua(l informatio)n of a paired-input query becomes 1 ( ) ( ( )) I2 = h ᾱ + β̄ − 1 h (ᾱ) + h β̄ , (3.9) 2 2 where ᾱ=1− (1−α)2 and β̄=1− (1−β)2. I2 and I1 offer a convenient geometric interpretation by viewing mutual information as the Jensen’s inequality gap of h (eqs. (3.8) and (3.9)). With this simple discrete distribution, α and β can be chosen 29 such that I2 − I1 will be positive or negative. We show the critical case I2 > I1 in Figure 3.1a, where the blue line segment connects (α,h(α)) to (β,h(β)) with (α, β) = (0.05, 0.65), and the red line segment is then implied by those choices of α, β (that is, (ᾱ, β̄) ≈ (0.10, 0.88) in the figure). Here the difference is I2−I1 = 0.18 bits. The contours of I2 − I1 as a function of (α, β) is shown in Figure 3.1b. OR-channel Inputs With Normal Densities We next analyze the OR-channel likelihood with two latent factors f1 = f(x1) and f2 = f(x2), which are jointly Gaussian according to the GP model of Section 3.3: [f1, f2] ∼ N (m,S). We calculate I2− I1 numerically using eq. (3.6) (note that, compared to the previous example, only the expectationover f has changed). µ 1 ρ We simplify the parameter space with  m =  and S =   (but note that µ ρ 1 the function I2 − I1 is not invariant to either of these simplifications). We plot the contours of I2 − I1 as a function of correlation ρ and mean µ in Figure 3.1c, which indeed has substantial regions of both positive and negative mass. In summary, though intuitively non-obvious, the above analyses clarify that the OR-channel likelihood can, but need not, increase mutual information be- tween the input distribution and the binary outcome y. This finding offers a critical takeaway: the OR-channel can be used effectively, but only in the set- ting where a judicious choice of input distribution can be made. Indeed, this is exactly what our framework will achieve: it will choose pairs of input points (paired sounds) to learn more about the underlying audiogram than a single point alone. Thus, the OR-channel likelihood offers benefit beyond this scheme, which we already expect to outperform a naive approach to learning these la- 30 Standard Audiogram, 114 Samples Multi-tone GP Audiogram, 60 Iterations 80 80 1 70 70 60 60 0.8 50 50 40 40 0.6 30 30 0.8 0.4 20 20 0.60.4 .80 0.8 .6 10 10 0.2 0 0.6 .40 0.4 0.2 0.2 0 0 0.2 -10 -10 0 8 9 10 11 12 13 8 9 10 11 12 13 Frequency (Log Hz) Figure 3.2: Standard grid search audiogram with tones played at every octave from 250 to 8000 Hz, and every 5 dB HL from -10 dB to 80 dB, compared to a multi-tone GP audiogram with 60 iterations (and therefore 119 “samples”). tent functions. In this work we only consider paired inputs; a future question for study is how the information gain distribution changes with increasing numbers of inputs. 3.4 Results In this section, we empirically evaluate our proposed algorithms for psychophys- ical detection. We focus on our application to audiometry, and seek to evaluate the merits of using Gaussian processes for audiometry in general, as well as to compare single-tone and multi-tone audiometry, focusing on the machine learn- ing aspects of our algorithms. We have also evaluated our method further in a clinical trial [91], which we discuss further in the discussion section. To begin, we compare the audiograms found by a standard grid audiometric test and by our multi-tone GP model. In both cases we run the same human sub- ject in the same audiometric setting. The only differences are the tones presented 31 Intensity (dB HL) Posterior Prob. of Detection 80 1 70 60 0.8 50 40 0.6 30 0.4 20 0.8 .6 10 .8 0 0.8 0 .4 .6 0 0.2 0.20. 06 0.40 0.4 0.2 -10 0 8 9 10 11 12 13 Frequency (Log Hz) Figure 3.3: Comparison of multi-tone and single-tone GP audiometrics. Left: A GP trained on 100 single tones. Blue circles denote tones detected by the subject, and red crosses denote tones that were not detected. The posterior probabilities are shown as color contours. Right: Log likelihood of random presentation of tones (no active learning, shown in gray), active learning presentation of single tones (shown in blue), and active learning with paired tones (shown in red), under the ground truth audiometric function from the left figure. Log likelihood is plotted as a function of iterations in each audiometric testing strategy. Shaded areas denote standard error. and the method used to infer the audiometric function. All audiometric tests were run in accordance with an approved IRB. In the standard setting, tones from this grid are presented in a pre-determined order, typically ascending in frequency and decreasing in intensity. In the GP model, pairs of tones were ac- tively selected given all previous pairs of tones and the responses to those tones. A random delay of up to 3 seconds was inserted between tone presentations to prevent subjects from memorizing a pattern in the test. Figure 3.2 shows the resulting data and inferred audiograms plotted in frequency-intensity space (left panel: standard audiometric test; right panel: GP method). For both the standard and GP experiments, tones that were detected by the patient are plotted as blue circles, and tones that were not detected are plotted as red crosses. For the paired- tone GP test (right panel), paired samples that were detected are plotted as blue 32 Intensity (dB HL) 0.2 35 1 Iteration 35 15 Iterations 30 30 25 25 20 20 15 15 0.8 .8 10 0.6 0.6 10 0 0 0.2 .6 5 0.6 5 0.6 35 0.40 0.4 0.90 0.4 0.4 0.4 30 0.2-5 -5 0.2 0.8 -10 -10 25 0.7 8 9 10 11 12 13 8 20 9 10 11 12 103.8 0.6 15 0.8 0.6 35 30 Iterations 35 0.510 60 Iterations 0. 0.46 30 30 5 0 0.2 0.4.4 25 25 0 0.2 0.3 20 20 -5 0.8 0.8 0.2 15 00.8 15 -10 .6 0. 68 0. 0.1 0.8 10 10 8 0.49 10 11 120.4 130.6 0.6 0.6 5 05 ..4 2 20 0.4 0. 0.4 0 0.2 0 0.2 0.2 -5 -5 -10 -10 8 9 10 11 12 13 8 9 10 11 12 13 Frequency (Log Hz) Figure 3.4: The posterior probability of detection within the frequency / inten- sity space during a GP audiometric test on a human subject. Panels show the learned GP after 1, 15, 30, and 60 iterations. Queries consist of a single or a paired tone (as selected by the model). Blue circles indicate a positive outcome (sound was heard), red crosses indicate a negative outcome. Paired tones with positive outcome (at least one of the two tones was heard) are connected by a blue line. Almost all queries are close to the final audible threshold (0.5 posterior detection probability), which is well approximated even after only 15 iterations. circles connected by a blue line (recall that, due to the OR-channel likelihood, we do not know which tone was heard). Paired tones that were not detected are again plotted as individual red crosses, as these data are functionally equivalent to two single-tone samples that were not detected (again due to the OR-channel observation model). In the standard audiometric test, the inferred audiogram is simply an “au- dible threshold” that is the piecewise linear function connecting the detection 33 Intensity (dB HL) 0.2 0.4 0.6 0.8 0.2 0.2 0.4 0.6 0.8 0. 04 .6 0.8 0.2 0.4 0.6 0.8 Posterior Prob. of Detection 0.8 0.6 0.4 0.2 0.2 4 0.8 0. 6 0. .8 0.6 0.40 0. 2 0.8 threshold at each frequency. This threshold is depicted as a black line in the left panel of Figure 3.2. In the GP case, we infer a full posterior distribution on the detection threshold. We plot contours of the posterior detection probability in the right panel of Figure 3.2, with a solid black line at 50% posterior detection probability. This confirmatory comparison offers several key points of interpretation. First, the tests agree with each other: the 50% posterior detection probability in the GP case is within 5dB of the standard audiogram, giving confidence to the general sensibility of this model. Second, perhaps most importantly to the active learning goal, the GP active learning model presents approximately half as many iterations (60 actively learned paired tones compared to 114 single tones preselected from a grid). Thus the GP model is able to explore substantially more of the frequency space than the standard grid test, and it does so in many fewer overall iterations, reducing the burden of these tests. Third, note that the GP model does not explore uninformative regions of tone space: above a certain intensity (at which the model is confident that tones are certainly heard), there are no tones queried. This observation differs sharply from the standard test, which squanders numerous samples at intensities well above this subject’s au- dible threshold, where little to no information is available. Fourth, by design our GP model offers a full posterior distribution over tone space, and thus pro- duces a richer and more descriptive audiogram than the piecewise linear audible threshold function in the standard test. Finally, it is worth noting that, though the paired tones in the right panel of Figure 3.2 appear to be sampled at very similar frequencies in log-space, the differences were often nontrivial, up to four or five half steps in an octave. 34 Next, Figure 3.4 investigates the convergence of our GP model after 1, 15, 30, 60 iterations of our paired-tone GP audiometric algorithm. The poste- rior after a single iteration (upper left panel) reflects primarily the prior mean and the covariance of the model, which incorporates our knowledge about the general shape of human audiograms. As the active learning procedure contin- ues (other panels), the GP posterior quickly converges to the audiogram of this particular subject. After only 30 iterations, the GP model has already captured the audiogram shape, and subsequent changes are very minor. To investigate the performance of our GP active learning method in greater detail, we construct a synthetic data set with known ground truth (a known audiometric function). We begin by training a GP on 100 single tones and the detection of those tones reported by a second human subject. The tones sampled and the inferred audiogram are presented in Figure 3.4. We use this posterior GP as the true audiogram of a simulated subject. This ground truth audiometric function allows for the critical assessment of performance shown in Figure 3.4. We compare three strategies of data pre- sentation: random presentation of tones (no active learning, shown in gray), active learning presentation of single tones (shown in blue), and active learning with paired tones (shown in red). For each strategy, at each iteration (tone pre- sentation), we infer the GP posterior mean, which is the MAP estimate of the audiometric function, given each stream of data. We evaluate the log likelihood of each strategy’s GP posterior mean under the ground truth GP from Figure 3.4. This step offers a quantitative assessment of how closely each strategy has ap- proximated the true audiometric function. The maroon dashed line depicts the log likelihood of the ground truth GP itself, which is thus the maximum achiev- 35 able performance of any strategy. All three strategies (random, single tone active learning, paired tone active learning) should, with enough iterations, converge to ground truth. Thus, the essential question of this work, and indeed of any active learning method, is how much more quickly a particular strategy approaches the ground truth than competing strategies. We ran the single and paired tone active learning methods ten times each, and standard errors are plotted as shaded regions. Because of the very high standard error of the random tone audiogram, these results were averaged over 100 runs. Figure 3.4 has a few key findings. Both the single and paired tone active learning strategies significantly outperform random sampling. Thus our strong prior rapidly learns that large portions of the tone space are either very likely or very unlikely to be heard, and is able to quickly learn to sample in regions of high information. After 80-90 iterations the paired tone algorithm matches the ground truth model very closely. This result is in significant contrast to ran- domly choosing tones, which not only has very large standard error, but also rarely converges to a good model. Finally, we observe that the paired tone active learning strategy significantly outperforms the single tone strategy. In fact, the paired tone strategy requires only half as many iterations to achieve the same level of likelihood. Compared to random sampling, paired tone active learning reduces the number of iterations by 85%. 36 3.5 Discussion In this chapter, we explored the problem of adapting Bayesian active learning to psychophysical testing, and improving upon standard techniques used in audiometric testing. The success of our method is due primarily to the incorpo- ration of problem specific information in the Gaussian process model. By using a linear kernel to model the simple fact that human hearing is monotonic with volume, our audiometric test avoids exploring large portions of the audiometric test space corresponding to tones louder than ones already heard by the patient. Additionally, we developed a novel OR-channel likelihood that allows us to present multiple tones to a subject simultaneously, leading to an audiometric testing strategy that not only yields good audiogram estimation using signifi- cantly fewer samples, but also leads to much better coverage of the frequency dimension. We demonstrate a non-obvious result, that multiple tones played through an OR-channel can, but do not have to, yield more information than a single tone. Clinical results. As a consequence of the promising results for the methods discussed in this paper, our co-authors have conducted a clinical trial of these methods in [91]. In that paper, these methods were evaluated in 21 participants between the ages of 18 and 90 years old, with varying degrees of hearing capabil- ity. It was found that the methods in this paper do indeed provide comparable or better audiometric testing results to the standard audiometric test, but with sig- nificantly fewer tones presented to the patient. However, matching the standard audiometric test in some sense is only the most basic advantage of our method. In particular, the fully probabilistic treatment of human hearing is novel to this 37 paper, and allows practioners to more precisely reason about patient hearing. Pure tone audiometry is inherently noisy and prone to subject error, particularly when presenting tones around the patient’s hearing threshold. Adoption of our method. This work was done in collaboration in part with Dennis Barbour, a medical expert specializing in part in audiometry. He has made substantial improvements to the work since its publication, particularly in terms of its ease of use by medical experts rather than machine learning experts, and these methods are deployed in a number of real-world settings where they are used to get fast and accurate audiograms. Future directions. From a purely application-oriented stand point, one of the most obvious future directions is the application of these techniques to other psychophysical detection tests than the audiometric test. In particular, there is strong reason to believe that existing methods used for certain types of visual and touch-based tests could be dramatically improved both by incorporating strong prior knowledge about patients. Furthermore, the ability to present patients with multiple stimuli simultaneously is inherent to many types of psychophysical tests. For example, when testing a patient for loss of sense of touch, the OR- channel model applies directly: a patient can be presented with multiple touch stimuli, and can easily respond whether at least one stimulus was felt. 38 CHAPTER 4 BAYESIAN ACTIVE MODEL SELECTION 4.1 Introduction In the previous chapter, we detailed a case study of a real world application where incorporating strong prior knowledge using Gaussian processes has a dramatic impact on performance. By encoding a priori in to the model some simple facts about human hearing–that people are more likely to hear loud tones than soft ones, and that hearing loss in a particular frequency strongly implies hearing loss in nearby frequencies–we were able to dramatically improve the standard audiometric detection test. The work presented in Chapter 3 additionally poses the following question: can we automate the process of encoding prior knowledge? This process can be quite difficult, as it requires expert knowledge in both the problem domain and in Gaussian processes in order to construct a prior that best models the problem. When sufficient data is available, techniques based on Bayesian model selection (e.g., [19]) offer a natural solution to this problem: the best prior is the one with the highest model posterior after conditioning on the data–or even better, an average over all available models weighted by the model posterior. However, in many scenarios like the audiometric setting, the problem by definition afford limited or no data at all: the goal is to actively acquire data to learn the function of interest (for example, the patient’s audiometric function) as quickly as possible. This leads naturally to the question: can we additionally actively acquire data to learn which prior or model best explains the function of 39 Figure 4.1: Distinguishing two models with active model selection. Given only the input data (black dots), deciding whether the red model or the blue model best describes the true function is a difficult task. However, with a single ad- ditional training input at the dashed black line, the models are immediately distinguished: the potential blue point has very low probability under the red model, and the potential red point has very low probability under the blue model. Choosing the next point that maximizes the disagreement among all candidate models is achieved by maximizing the mutual information. interest? A particular, clarifying example that motivates this work is noise-induced hear- ing loss (N I H L), a prevalent disorder affecting 26 million working-age adults in the United States alone [84] and affecting over half of workers in particular occupations such as mining and construction. Most tragically, N I H L is entirely preventable with simple, low-cost solutions (e.g., earplugs). The critical require- ment for prevention is effective early diagnosis. 40 To be tested for N I H L, patients must complete a time-consuming audiomet- ric exam that presents a series of tones at various frequencies and intensities; at each tone the patient indicates whether he/she hears the tone [11, 41, 16]. From the responses, the clinician infers the patient’s audible threshold on a set of discrete frequencies (the audiogram); this process requires the delivery of up to hundreds of tones. Audiologists scan the audiogram for a hearing deficit with a characteristic notch shape—a narrow band that can be anywhere in the frequency domain that is indicative of N I H L. Unfortunately, at early stages of the disorder, notches can be small enough that they are undetectable in a standard audiogram, leaving many cases undiagnosed until the condition has become severe. Increas- ing audiogram resolution would require higher sample counts (more presented tones) and thus only lengthen an already burdensome procedure. We present here a better approach. Note that the N I H L diagnostic challenge is not one of feature selection (choos- ing the next test to run and classifying the result), but rather of model selection: is this patient’s hearing better described by a normal hearing model, or a notched N I H L model? Here we propose a novel active model selection algorithm to make the N I H L diagnosis in as few tones as possible, which directly reflects the time and personnel resources required to make accurate diagnoses in large popula- tions. We note that this is a model-selection problem in the truest sense: a diag- nosis corresponds to selecting between two or more sets of indexed probability distributions (models), rather than the more-common misnomer of choosing an index from within a model (i.e., hyperparameter optimization). In the N I H L case this distinction is critical. We are choosing between two models, the set of possible N I H L hearing functions and the set of normal hearing functions. This approach suggests a very different and more direct algorithm than first learning 41 the most likely N I H L function and then accepting or rejecting it as different from normal, the standard approach. We make the following contributions: first, we design a completely general active-model-selection method based on maximizing the mutual information between the response to a tone and the posterior on the model class. Critically, we develop an analytical approximation of this criterion for Gaussian process (G P) models with arbitrary observation likelihoods, enabling active structure learning for G Ps. Second, we extend the work of [25] (which uses active learning to speed up audiogram inference) to the broader question of identifying which model—normal or N I H L—best fits a given patient. Finally, we develop a novel G P prior mean that parameterizes notched hearing loss for N I H L patients. To our knowledge, this is the first publication with an active model-selection ap- proach that does not require updating each model for every candidate point, allowing audiometric diagnosis of N I H L to be performed in real time. Finally, using patient data from a clinical trial, we show empirically that our method typically automatically detects simulated noise-induced hearing loss with fewer than 15 query tones. This is vastly fewer than the number required to infer a conventional audiogram or even an actively learned audiogram [25], highlight- ing the importance of both the active-learning approach and our focus on model selection. 4.1.1 Related work Although active learning and model selection have been widely investigated, active model selection has received comparatively less attention. [1] proposed 42 an active learning model selection method that requires leav(e-two-out cr)oss vali- dation when evaluating each candidate mx∗, requiring O B2M |mX∗| model updates per iteration, where B is the total budget. [50] also considered an information-theoretic approach to active model selection, suggesting maximiz- ing the expected cross entropy between the current model posterior p(M | D) and the updated distribu(tion p(M) | D′). This approach also requires extensive model retraining, withO M |mX∗| model updates per iteration, to estimate this expectation for each candidate. These approaches become prohibitively expen- sive for real-time applications with large number of candidates. In our audiomet- ric experiments, for example, we consider 10 000 candidate points, expending 1–2 seconds per iteration, whereas these mentioned techniques would take several hours to selected the next point to query. An alternative approach to dealing with uncertainty about what model (ker- nel) should be used is to choose an extremely flexible, highly parametric kernel and learn the structure of data via gradient descent. This is an approach taken by multiple kernel learning [29] and recently using the spectral mixture kernel [111] or deep kernel learning [112, 110]. The primary advantage of composing simple kernels as done in for example [19] is that these highly parametric kernels typically require a large number of training examples to learn useful kernels. In the active learning regime, the goal is to specifically finish learning before collect- ing enough data for these techniques to be useful. As a result, we focus only on kernels with a small number of hyperparameters. 43 4.2 Active Bayesian model selection Suppose that we have a mechanism for actively selecting new data—choosing x∗ ∈ X and observing y∗ = y(x∗)—to add to our dataset D = (X,y), in order to better distinguish the candidate models {Mi}.{After m}aking this observation, we will form an augmented dataset D′ = D ∪ (x∗,y∗) , from which we can recompute a new model posterior p(M | D′). An approach motivated by information theory is to select the location max- imizing the mutual information between the observation value y∗ and the un- known model: I(y∗ [ ] ;M | x∗,D) = H[M | D]− Ey∗ H[[M | D′] ] (4.1) = H[y∗ | x∗,D]− EM H[y∗ | x∗,D,M] , (4.2) where H indicates (differential) entropy. Whereas Equation (4.1) is computation- ally problematic (involving costly model retraining), the equivalent expression (4.2) is typically more tractable, has been applied fruitfully in various active- learning settings [40, 26, 25, 38, 39], and requires only computing the differential entropy of the model-marginal predictive distribution: ∑M p(y∗ | x∗,D) = p(y∗ | x∗,D,Mi)p(Mi | D) (4.3) i=1 { } and the model-conditional predictive distributions p(y∗ | x∗,D,Mi) with all models trained with the currently available data. In contrast to (4.1), this does not involve any retraining cost. Although computing the entropy in (4.3) might be problematic, we note that this is a one-dimensional integral that can easily be resolved with quadrature. Our proposed approach, which we call Bayesian active model selection (B A M S) is then to compute, for each candidate location x∗, 44 the mutual information between y∗ and the unknown model, and query where this is maximized: arg max I(y∗;M | x∗,D). (4.4) x∗ 4.3 Active model selection for Gaussian processes In the previous section, we proposed a general framework for performing se- quential active Bayesian model selection, without making any assumptions about the forms of the models {Mi}. Here we will discuss specific details of our proposal when these models represent alternative structures for Gaussian process priors on a latent function. We assume that our observations are generated via a latent function f : X → R with a known observation model p(y | f), where fi = f(xi). We place a Gaus- sian process (G P) prior distribution on f , p(f) = GP(f ;µ,K) As discussed in Chapter 2, the structural (not hyperparameter) choices made in the mean function µ and covariance function K themselves are typically done by selecting (often blindly!) from several off-the-shelf solutions (see, for exam- ple, [18, 72]; though also see [20, 109]), and this choice has substantial bearing on the resulting functions f we can model. Indeed, in many settings, choosing the nature of plausible functions is precisely the problem of model selection; for example, to decide whether the function has periodic structure, exhibits nonsta- tionarity, etc. Our goal is to automatically and actively decide these structural choices during G P modeling through intelligent sampling. To connect to our active learning formulation, let {Mi} be a set of Gaussian 45 process models for the latent function f . Each model comprises a mean function µi, covariance function Ki, and associated hyperparameters θi. Our approach outlined in Section 4.2 requires the computation of three quantities that are not typically encountered in G P modeling and inference: the hyperparameter poste- rior p(θ | D,M), the model evidence p(y | X,M), and the predictive distribution p(y∗ | x∗,D,M), where we have marginalized over θ in the latter two quanti- ties. The most-common approaches to G P inference are maximum likelihood–II (M L E) or maximum a posteriori–II (M A P) estimation, where we maximize the hyperparameter posterior [106, 72]:1 θ̂ = arg max log p(θ | D,M) = arg max log p(θ | M) + log(y | X, θ,M). (4.5) θ θ Typically, predictive distributions and other desired quantities are then reported at the M L E / M A P hyperparameters, implicitly making the assumption that p(θ | D,M) ≈ δ(θ̂). Although a computationally convenient choice, this does not account for uncertainty in the hyperparameters, which can be nontrivial with small datasets [56]. Furthermore, accounting correctly for model parameter uncertainty is crucial to model selection, where it naturally introduces a model- complexity penalty. We discuss less-drastic approximations to these quantities below. 4.3.1 Approximating the model evidence and hyperparameter posterior The model evidence p(y | X,M) and hyperparameter posterior distribution p(θ | D,M) are in general intractable for G Ps, as there is no conjugate prior 1Using a noninformative prior p(θ | M) ∝ 1 in the case of maximum likelihood. 46 distribution p(θ | M) available. Instead, we will use a Laplace approximation, where we make a second-order Taylor expansion of log p(θ | D,M) around its mode θ̂ (4.5). The result is a multivariate Gaussian approximation: ∣ p(θ | D,M) ≈ N (θ; θ̂,Σ); Σ−1 = −∇2 log p(θ | D,M)∣ . (4.6) θ=θ̂ The Laplace approximation also results in an approximation to the model evi- dence: log p(y | X,M) ≈ log p(y | X, θ̂,M)+log p(θ̂ | M)− 1 log det Σ−1+ d log 2π, (4.7) 2 2 where d is the dimension of θ [71, 49]. The Laplace approximation to the model evidence can be interpreted as rewarding explaining the data well while penal- izing model complexity. Note that the Bayesian information criterion (B I C), com- monly used for model selection, can be seen as an approximation to the Laplace approximation [81, 65]. 4.3.2 Approximating the predictive distribution We next consider the∫predictive d∫istribution: p(y∗ | x∗,D,M) = p(y∗ | f ∗) ︸ p(f ∗ | x∗,D, θ,︷M︷ )p(θ | D,M) dθ︸ df ∗. (4.8) p(f∗|x∗,D,M) The posterior p(f ∗ | x∗,D, θ,M) in (4.8) is typically a known Gaussian distribu- tion, derived analytically for Gaussian observation likelihoods or approximately using standard approximate G P inference techniques [52, 63]. However, the in- tegral over θ in (4.8) is intractible, even with a Gaussian approximation to the hyperparameter posterior as in (4.6). 47 Garnett et al., 2013 introduced a mechanism for approximately marginaliz- ing G P hyperparameters (called the M G P), which we will adopt here due to its strong empirical performance. The M G P assumes that we have a Gaussian approximation to the hyperparameter posterior, p(θ | D,M) ≈ N (θ; θ̂,Σ).2 We define the posterior predictive mean and variance functions as µ∗(θ) = E[f ∗ | x∗,D, θ,M]; ν∗(θ) = Var [[] f ∗ | x∗,D, θ,M]. The M G P works by making an expansion of the predictive distribution around the posterior mean hyperparameters θ̂. The nature of this expansion is chosen so as to match various derivatives of the true predictive distribution; see [26] for details. The posterior distribution of f ∗ is approximated by p(f ∗ |mx∗ ( ,D,M) ≈ N f ∗;µ∗ ) (θ̂), σ2M G P , (4.9) where [ ] [ ] [ ] [ ] σ2 = 4 ∗ ∇ ∗ > >M G P ν (θ̂) + µ (θ̂) Σ ∇µ∗(θ̂) + 1∗ ∇ν∗(θ̂) Σ ∇ν∗(θ̂) . (4.10)3 3ν (θ̂) The M G P thus inflates the predictive variance from the the posterior mean hyper- parameters θ̂ by a term that is commensurate with the uncertainty in θ, measured by the posterior covariance Σ, and the dependence of the latent predictive mean and variance on θ, measured by the gradients ∇µ∗ and ∇ν∗. With the Gaussian approximation in (4.9), the integral in (4.8) now reduces to integrating the obser- vation likelihood against a univariate Gaussian. This integral is often analytic [72] and at worse requires one-dimensional quadrature. 2This is arbitrary and need not be the Laplace approximation in (4.6), so this is a slight abuse of notation. 48 4.3.3 Implementation Given the development above, we may now efficiently compute an approxi- mation to the B A M S criterion for active G P model selection. Given currently observed data D, for each of our candidate modelsMi, we first find the Laplace approximation to the hyperparameter posterior (4.6) and model evidence (4.7). Given the approximations to the model evidence, we may compute an approxi- mation to the model posterior (2.13). Suppose we have a set of candidate points X∗ from which we may select our next point. For each of o{ur models, we com} - pute the M G P approximation (4.9) to the latent posteriors p(f∗ | X∗,D,Mi) , from w{ hich we use stand}ard techniques to compute the predictive distribu- tions p(y∗ | X∗,D,Mi) . Finally, with the ability to compute the differen- tial entropies of these model-conditional predictive distributions, as well as the marginal predictive distribution (4.3), we may compute the mutual information of each candidate in parallel. 4.4 Computational details for common observation likelihoods Here we give further details for computing Equation (4.2) with common obser- vation likelihoods. 49 4.4.1 Regression with Gaussian noise For regression problems with zero-mean homoskedastic Gaussian noise with variance σ2n, we have p(y | f) = N (y; f, σ2n). We may integrate this against a Gaussian distribution on the latent value f (such as that resulting from the M G P approximation (4.9)) to find the predictive distri- bution. Suppose p(f |mx,D) = N (f ;µ, σ2). Then p(y |mx,D) = N (y;µ, σ2 + σ2n). { } The model-conditional predictive distributions p(y∗ | X∗,D,Mi) are thus Gaussian under the M G P approximation: ( ( ) p(y∗ |mx∗,D,M ) = N y∗ ∗i ;µi , ν∗M G P)i + σ2n . The differential entropy of a one-dimensional Gaussian is [ ] H N (y;µ, σ2) = 1 log(2πeσ2); 2 with these results and an approximation to the model posterior, we may compute the second term in (4.2). The marginal predictive distribution p(y∗ | mx∗,D) is now a mixture of Gaussians weighted by the approximate model posterior. The differential entropy of a mixture of Gaussians does not have a closed form, so for the first term in (4.2), we must resort to (one-dimensional) quadrature. Note that we may also treat the noise variance σ2n as a model-dependent hyperparameter and use the M G P to approximately marginalize it as well, which changes the above only slightly. An extension to heteroskedastic noise is also trivial. 50 4.4.2 Probit regression For binary classification problems with a probit likelihood, we have Pr [(] y = 1 | f) = Φ(f), where Φ is the univariate standard normal C D F . We may integrate this against a Gaussian distribution on the latent value f to find the predictive distribution. Suppose p(f |mx,D) = N (f ;µ,∫σ2). Then ( ) µ Pr [(] y = 1 |mx,D) = Φ(f)N (f ;µ, σ2){df = Φ √ .1 + σ2 } The model-conditional predictive distributions p(y∗ | X∗,D,Mi) are thus Bernoulli distributions under the M G P a(pp(roximation:√ ∗ ))p(y∗ |mx∗,D,Mi) = B µΦ i . 1 + (ν∗M G P)i The differential entropy of a Bernoulli distribution with success probability p is given by the Berno[ulli en] tropy function h: H B(p) = h(p) = −p log p− (1− p) log(1− p); with these results and an approximation to the model posterior, we may compute the second term in (4.2). The marginal predictive distribution p(y∗ | mx∗,D) is now a mixture of Bernoullis weighted by the approximate model posterior. Bernoulli distributions are closed under taking mixtures, and therefore in this case we may compute both terms in (4.2) without resorting to quadrature. 4.5 Audiometric threshold testing Noise-induced hearing loss occurs when an otherwise healthy individual is ha- bitually subjected to high-intensity sound [61]. This differs substantially from the 51 standard audiometric setting discussed in Chapter 3 because this type of hearing loss can present as a sharp, notch-shaped hearing loss in a narrow (sometimes less than one octave) frequency range. From a technical perspective of the model presented in Chapter 3, this violates the smoothness-in-frequency assumption made by using the RBF kernel over frequency. Furthermore, the standard grid based approach to testing has difficulty diagnosing noise induced hearing loss because a frequency–intensity grid must be very fine to ensure that a notch is detected. We cast the detection of noise-induced hearing loss as an active model se- lection problem. We will describe two Gaussian process models of audiometric functions: a baseline model of normal human hearing, and a model reflecting N I H L. We then use the B A M S framework introduced above to, as rapidly as possible for a given patient, determine which model best describes his or her hearing. Normal-patient model. To model a healthy patient’s audiometric function, we use the model described in [25]. The G P prior proposed in that work com- bines a constant prior mean µhealthy = c (modeling a frequency-independent natural threshold) with a kernel taken to be the sum of two components: a lin- ear covariance in intensity and a squared-exponential covariance in frequency. Let [i, φ] represent a tone stimulus, with i representing its intensity and φ its frequency. We define: ( ) K [i, φ], [i′, φ′] = αii′ ( ) + β exp − 1 ′ 22 |φ− φ | , (4.11)2` where α, β > 0 weight each component and ` > 0 is a length scale of frequency- dependent random deviations from a constant hearing threshold. This kernel en- codes two fundamental properties of human audiologic response. First, hearing 52 is monotonic in intensity. The linear contribution αii′ ensures that the posterior probability of detecting a fixed frequency will be monotonically increasing after conditioning on a few tones. Second, human hearing ability is locally smooth in frequency, because nearby locations in the cochlea are mechanically coupled. The combination of µhealthy with K specifies our healthy model Mhealthy, with parameters θhealthy = [c, α, β, `]>. Noise-induced hearing loss model. We extend the model above to create a second G P model reflecting a localized, notch-shaped hearing deficit characteris- tic of N I H L. We create a novel, flexible prior mean function for this purpose, the parameters of which specify the exact nature of the hearing loss. Our proposed notch mean is: µ ′ 2N I H L(i, φ) = c− dN (φ; ν, w ), (4.12) whereN ′(φ; ν, w) denotes the unnormalized normal probability density function with mean ν and standard deviation w, which we scale by a depth parameter d > 0 to reflect the prominence of the hearing loss. This contribution results in a localized subtractive notch feature with tunable center, width, and height. We retain a constant offset c to revert to the normal-hearing model outside the vicinity of the localized hearing deficit. Note that we completely model the ef- fect of N I H L on patient responses with this mean notch function; the kernel K above remains appropriate. The combination of µN I H L with K specifies our N I H L modelMN I H L with, in addition to the parameters of our healthy model, the additional parameters θN I H L = [ν, w, d]>. 53 4.6 Results To test B A M S on our N I H L detection task, we evaluate our algorithm using audiometric data, comparing to several baselines. From the results of the clini- cal trial discussed in Chapter 3, we have examples of high-fidelity audiometric functions inferred for several human patients. We may use these to simulate audiometric examinations of healthy patients using different methods to select tone presentations. We simulate patients with N I H L by adjusting ground truth inferred from nine healthy patients with in-model samples from our notch mean prior. Recall that high-resolution audiogram data is extremely scarce. We first took a thorough pure-tone audiometric test of each of nine patients from our trial with normal hearing using 100 samples selected using the algo- rithm in Chapter 3 on the domain X = [250, 8000] Hz× [−10, 80] dB HL,3 typical ranges for audiometric testing [41]. We inferred the audiometric function over the entire domain from the measured responses, using the healthy-patient G P modelMhealthy with parameters learned via M L E – I I inference. The observation model was p(y = 1 | f) = Φ(f), where Φ is the standard normal C D F, and ap- proximate G P inference was performed via a Laplace approximation. We then used the approximate G P posterior p(f | D, θ̂,Mhealthy) for this patient as ground- truth for simulating a healthy patient’s responses. The posterior probability of tone detection learned from one patient is shown in the background of Figure 4.2. We simulated a healthy patient’s response to a given query tone mx∗ = [i∗, φ∗] by sampling a conditionally independent Bernoulli random variable with parame- ter p(y∗ = 1 |mx∗,D, θ̂,Mhealthy). We simulated a patient with N I H L by then drawing notch parameters (the 3Inference was done in log-frequency domain. 54 parameters of (4.12)) from an expert-informed prior, adding the corresponding notch to the learned healthy ground-truth latent mean, recomputing the detec- tion probabilities, and proceeding as above. Example N I H L ground-truth detec- tion probabilities generated in this manner are depicted in the background of Figure 4.2. 4.6.1 Diagnosing NIHL To test our active model-selection approach to diagnosing N I H L, we simulated a series of audiometric tests, selecting tones using three alternatives: B A M S , the active learning audiometric algorithm presented in Chapter 3, and random sampling.4 Each algorithm shared a candidate set of 10 000 quasirandom tones X∗ generated using a scrambled Halton set so as to densely cover the two- dimensional search space. We use data from nine healthy patients and simulate a total of 27 patients exhibiting a range of N I H L presentations, using independent draws from our notch mean prior in the latter case. For each audiometric test simulation, we initialized with five random tones, then allowed each algorithm to actively select a maximum of 25 additional tones, a very small fraction of the hundreds typically used in a regular audiometric test. We repeated this proce- dure for each of our nine healthy patients using the normal-patient ground-truth model. We further simulated, for each patient, three separate presentations of N I H L as described above. We plot the posterior probability of the correct model after each iteration for each method in Figure 4.3. In all runs with both ground-truth models, B A M S was able to rapidly achieve 4We also compared with uncertainty sampling and query by committee (Q B C); the perfor- mance was comparable to random sampling and is omitted for clarity. 55 greater than 99% confidence in the correct model without expending the entire budget. Although all methods correctly inferred high healthy posterior proba- bility for the healthy patient, B A M S was more confident. For the N I H L patients, neither baseline inferred the correct model, whereas B A M S rarely required more than 15 actively chosen samples to confidently make the correct diagnosis. Note that, when B A M S was used on N I H L patients, there was often an initial period during which the healthy model was favored, followed by a rapid shift towards the correct model. This is because our method penalizes the increased complexity of the notch model until sufficient evidence for a notch is acquired. Figure 4.2 shows the samples selected by B A M S for typical healthy and N I H L patients. The fundamental strategy employed by B A M S in this application is logical: it samples in a row of relatively high-intensity tones. The intuition for this design is that failure to recognize a normally heard, high-intensity sound is strong evidence of a notch deficit. Once the notch has been found (Figure 4.2), B A M S continues to sample within the notch to confirm its existence and rule out the possibility of the miss (tone not heard) being due to the stochasticity of the process. Once satisfied, the B A M S approach then samples on the periphery of the notch to further solidify its belief. The B A M S algorithm sequentially makes observations where the healthy and N I H L model disagree the most, typically in the top-center of the M A P notch location. The exact intensity at which B A M S samples is determined by the prior over the notch-depth parameter d. When we changed the notch depth prior to support shallower or deeper notches (data not shown), B A M S sampled at lower or higher intensities, respectively, to continue to maximize model disagreement. Similarly, the spacing between samples is controlled by the prior over the notch- 56 80 80 1 60 60 0.8 0.6 40 40 0.4 20 20 0.2 0 0 0 8 9 10 11 12 13 8 9 10 11 12 13 frequency (log2 Hz) frequency (log2 Hz) Figure 4.2: Samples selected by B A M S (red) and the method from Chapter 3(white) when run on the normal-hearing ground truth and the N I H L model ground truth. Contours denote probability of detection at 10% intervals. Circles indicate presentations that were heard by the simulated patient; exes indicate presentations that were not heard by the simulated patient. Left: Normal hearing model ground truth. Right: Notch model ground truth. width parameter w. Finally, it is worth emphasizing the stark difference between the sampling pattern of B A M S and the audiometric test from Chapter 3; see Figure 4.2. Indeed, when the goal is learning the patient’s audiometric function, the audiometric testing algorithm proposed in that work typically has a very good estimate after 20 samples. However, when using B A M S , the primary goal is to detect or rule out N I H L. As a result, the samples selected by B A M S reveal little about the nuances of the patient’s audiometric function, while being highly informative about the correct model to explain the data. This is precisely the tradeoff one seeks in a large-scale diagnostic setting, highlighting the critical importance of focusing on the model-selection problem directly. 57 intensity (dB HL) BAMS I(y∗;f |D,M healthy) random 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 10 20 30 10 20 30 iteration iteration Figure 4.3: Posterior probability of the correct model as a function of iteration number. Left: Notch model ground truth. Right: Normal hearing model ground truth. 4.7 Discussion We have now seen a novel information-theoretic approach for active model selec- tion, Bayesian active model selection, and successfully applied it to rapid screening for noise-induced hearing loss. Our method for active model selection does not require model retraining to evaluate candidate points, making it more feasible than previous approaches. Further, we provided an effective and efficient ana- lytic approximation to our criterion that can be used for automatically learning the model class of Gaussian processes with arbitrary observation likelihoods, a rich and commonly used class of potential models. Impact. Beyond audiometric testing, one of the key methodological contribu- tions of this paper is an extension of covariance function selection techniques based on Bayesian model selection to the active learning setting, and in particular the small data setting. Indeed, one of the broader take-aways from this work is the simple example in Figure 4.3: sufficiently different covariance functions can 58 Pr(correctmodel |D) often be distinguished almost perfectly with a very small number of examples. On the application side, it is our hope that this work will eventually be adopted and supplant existing techniques used for noise-induced hearing loss screening. It is our understanding that standard practice for this problem is to give a subject a full audiometric test; however, as we’ve demonstrated, this can potentially be quite slow or even fail to discover particularly sharp notches entirely. Future work. As a consequence of this insight, one of the most obvious future directions for this work is to use these techniques for more popular applications of Gaussian processes, and in particular Bayesian optimization. The ability to choose significantly more informative priors with a small number of training examples is an interesting prospect in the Bayesian optimization framework in particular. Indeeed in the next chapter, we will focus on one specific setting where covariance function learning in the limited data regime dramatically im- proves convergence. 59 CHAPTER 5 DISCOVERING AND EPXLOITING ADDITIVE STRUCTURE FOR BAYESIAN OPTIMIZATION 5.1 Introduction In the previous chapter, we developed a technique that extends Bayesian model selection to the active learning setting and allows for automatically selecting an appropriate covariance function for the modeling task at hand, often after only a handful of examples. While this approach is extremely effective when the space of models to choose from is small, it does have the drawback of needing to train a Gaussian process on all observed data with each candidate covariance function. While this is not problematic when the space of candidate models is reasonable, this becomes untenable if the candidate model space becomes gigantic, even when using scalable GP techniques. In this chapter, we turn to a second application of model selection with lim- ited data, but one that must select from a combinatorially large set of covariance functions. In particular, we will investigate whether model selection during Bayesian optimization can be used to more rapidly optimize the target blackbox function than simply using a standard RBF or Matern kernel. It is worth noting that a typical BayesOpt algorithm has only two key factors that must be selected by the user: the acquisition function—how new queries are selected given the posterior belief—and the choice of prior. Whereas the former has been studied extensively, and many reasonable options exist, the choice of prior has received very little attention. Almost all off-the-shelf BayesOpt solvers 60 use general-purpose kernels, and do little in the way of discovering and exploit- ing structure that may underlie a particular objective function. With poor or overly general choices of the kernel, BayesOpt may converge very slowly on complex functions, especially in moderate-to-high dimension. Hence, exploiting structure can have a substantial impact on optimization perfor- mance. For example, [90] argue that for hyperparameter optimization a Matérn- 5/2 kernel provides better results in general than the squared exponential kernel, because the latter models unrealistically smooth functions. Choosing (or learn- ing) a kernel that exploits low-dimensional structure in the function can signif- icantly improve performance when confronted with 10s or 100s of parameters [102, 26]. In this chapter, we will seek to apply model selection to discover and exploit such additive structure: cases where the setting of some variables in the design space does not affect the optimal setting of others. For example, in hyperparam- eter tuning, some groups of parameters may have well-known interactions (e.g., learning rate and momentum), whereas others do not strongly interact (e.g., mo- mentum and regularization weight). Intuitively, if different groups of parameters interact only additively, then they can be optimized independently. This intu- ition has been formalized many times in the context of kernels [3], generalized additive models [35], and various extensions [101]. In the context of Bayesian op- timization specifically, [45] recently showed that knowing the additive structure of a function ahead of time gives exponential reductions in sample complexity. However, it is challenging in general to reason about the structure of arbi- trary black-box functions a priori. The core contribution of our work is an effi- cient mechanism for discovering the underlying structure of the function while 61 BayesOpt is executed. Although in this chapter we focus on additive structure, we believe this technique could be used more broadly in a wide variety of set- tings. Rather than fixing a model a priori, we make use Bayesian model selection to, at each iteration, sample an additive structure consistent with the data observed so far. We deal with the large number of possible additive structures by using a Metropolis–Hastings scheme to sample from the model posterior, and demon- strate that this mechanism quickly samples additive structures that explain the data well. We evaluate our method both in-model and on several benchmark optimiza- tion problems with varying additive structure. The experiments validate empiri- cally that our approach significantly outperforms both standard BayesOpt and the random-bag-of-models exploration scheme introduced by [45]. On synthetic test functions, we show that our approach discovers the correct structure even for functions in a moderately large number of dimensions and is substantially faster than existing structure-discovery techniques. Finally, we evaluate the efficacy of our proposed algorithm on two real-world applications: a matrix completion task and an astrophysics simulation experi- ment to estimate physical constants to high accuracy. While these real-world settings were not known to have any form of additive structure a priori, our method nevertheless converges faster and finds better optimal solutions than prior work. As considering additive structure is a strict generalization of using the fully dependent structure, it is our hope that the methods discussed in this Chapter can be broadly applied: in all of our testing, we did not notice a setting where attempting to discover additive structure was detrimental to performance. 62 5.2 Related Work The idea that modeling functions can be done more efficiently by exploiting un- derlying additive structure is well known in general [3]. Generalized additive models (GAMs) are linear models that explicitly model a response variable as a linear combination of univariate functions [35]. These results have been ex- tended to the setting where some component functions depend on more than one input variable, allowing for general additive structure [101]. In the context of Gaussian process regression in machine learning, several papers exist that exploit general forms of additive structure in the literature. [22] introduced an additional generaliztion of GAMs that allows for “higher-order” additive kernels (GAMs corresponding to first-order additive kernels). [45] demonstrated that Bayesian optimization using additive Gaussian pro- cesses achieves lower regret than fully dependent models. They showed that Bayesian optimization using a prior that exploits additive structure in the objec- tive function has theoretically significantly lower sample complexity than with a kernel that does not exploit this structure. In their paper, the authors proposed evaluating a number of randomly selected additive structures—which we will call the bag-of-models approach—and choosing the structure that explains any observed data the best. The primary goal of our work is to show that this ran- dom search is inefficient and can be improved dramatically by performing an informed search in model space in tandem with the optimization procedure. The problem of choosing covariance functions that result in better models than basic kernels has been well studied in general. One approach to this is to compose simple kernels (for example, the RBF kernel, the linear kernel, etc) using 63 simple operations like addition and multiplication [74]. [19] proposed to search over possible compositions by constructing a grammar of possible kernel compo- sitions, starting with basic base kernels and producing more complex kernels via composition. This tree is then greedily searched over to construct a kernel with high model evidence. [59] extended this idea, replacing the greedy search with Bayesian optimization in model space, defining a novel “kernel kernel” to rea- son about the similarity between data explanations offered by different kernels and speed up the search. Rather than constructing complex models from simple ones, [108] took a different approach, and introduced a highly flexible kernel by using Bochner’s theorem to write any stationary kernel as the Fourier transform of a finite measure, and parameterizing a class of stationary kernels by using a Gaussian mixture as the spectral density. Similar to our setting, [24] consid- ered actively discovering the structure of a function by sequentially querying as few points as possible. However, the techniques in that paper are not used for optimization. Choosing generally more informative priors in Bayesian optimization has received some attention as well. For example, [95] considered using information gained from functions related to the objective to guide optimization. This occurs, for example, in hyperparameter tuning when hyperparameters on subsampled data can be used to inform about the validation error on the full dataset. 5.3 Model selection via MCMC The number of possible additive decomposition models in d dimensions is given by the dth Bell number [105], which satisfy the following recurrence relation (for 64 d > 1) : ∑d−1 ( − )d 1 ∑n ( )n Bd = Bk, Bn+1 = Bk, k k k=0 k=0 with B0 =B1 =1. This grows super-exponentially, and for d=10 already reaches B10 = 115 975. This makes computing the model posterior in eq. (2.13) pro- hibitively expensive, as it involves conditioning a Gaussian process with each of these models to compute the model evidences. However, computing a small (i.e., not super-exponential) number of model evidences is tractable. A natural alter- native to computing the model posterior is to sample from it using Metropolis– Hastings, which requires only one additional model evidence computation per sample. Proposal distribution. As we focus on additive structure throughout this sec- tion, a model is uniquely defined by its partitioning of the underlying dimen- sions, and with a slight abuse of notation we refer to the partitioning and the resulting model both as M. The primary component of Metropolis–Hastings that needs to be specified is the proposal distribution g(M′ | M). Given an ad- ditive structure partitioning M, there are two natural operations that can be performed. First, an existing element of the partition can be split in two. For example, if [1, 2, 3] ∈ M, we can form [1][2, 3], [2][1, 3], [3][1, 2] by splitting. In general, a component of size k can be split in 2k−1 different ways. Second, two existing elements of the partition may be merged. For example, [1, 4] and [2, 3] ca(n)be merged to form [1, 2, 3, 4]. In general, a partition with k components has k possible merges. 2 With these operations in mind, we construct a proposal distribution as illus- 65 M =[1,3] [2] [4] split or merge? split merge pick 1 sub-partition: pick 2 sub-partitions: [1,3] [2] [4] [1,3] [1,3] [2][2] [4] [4] perform split perform merge M0=[1][2][3][4] M0= [1,3,2] [1,3,4] [1,3][4] [2] [2,4] Figure 5.1: An illustration of the proposal distribution g(M′ | M) on a simple input model with three sub-partitions,M = [1, 3][2][4]. Splits with dice repre- sent choices performed uniformly at random (without replacement). See text for details. trated in Figure 5.1. Since there may be more splits than merges or vice versa de- pending on the current state of the Markov chain, we first choose whether to split or merge, each with 50% probability. Next, if we split (left sub-tree), we choose a component of the existing partitionM uniformly at random and split it in two (again uniformly at random). If we instead merge (right sub-tree), we choose two components of M uniformly without replacement and merge them. Given this proposal mechanism, sampling from the proposal distribution g(M′ | M) is straightforward and efficient. 66 5.4 Additive structure discovery The goal of our work is to discover the additive model structure underlying the objective function f(x), while simultaneously exploiting it for BayesOpt. At each iteration i of BayesOpt, we will have collected some dataset of func- tion evaluations Di = {Xi,yi}. BayesOpt, as described in section 2.3, uses this data to update the posterior p(f | Di) and then to identify the new point x∗, which maximizes EI(x∗) and for which f(x∗) should be evaluated next. The expected improvement EI(x∗) is a function of p(f(x∗) | D,x∗). As we are considering multiple models, each providing us with a different posterior over f , we approximately marginalize out the model. We sample k modelsM1, . . . ,Mk from p(M | Di) to obtain ∑k1 p(f(x∗) | D,x∗) ≈ p(f(x∗) | D,x∗,Mj). (5.1) k j=1 MCMC. To obtain these samples, we use the Metropois-Hastings algorithm with the proposal distribution g(M |M′). First observe that the model posterior is proportional to the model evidence p(y | X,M), by eq. (2.13), which can be efficiently evaluated for single models. We can therefore sample k models as follows: Given the current modelMj (initializingM0 to the final model found in the previous iteration), we sample a proposed modelM′ from the proposal distribution g(M′ | Mj). Next, we compute the model evidence forM′ and use this to compute the Metropolis-H(astings acceptance probability): p(yi | Xi,M′)g(M |M′j ) A(M′ | Mj) = min 1, . p(yi | Xi,Mj)g(M′ | Mj) Finally, we update the current state toM′ with probability A(M′ | Mi). 67 Candidate selection. As pointed out at the end of section 2.3, finding the point x∗i to maximize EI(x∗) has exponential sample complexity with the number of dimensions and can be slow. This originates from the fact that the acquisition function is evaluated on a fine grid in the d dimensional space, or is optimized using methods that scale poorly with dimensionality (for example, gradient de- scent with random restarts or DIRECT). In the presence of additive structure, this search can be decomposed and performed for each component individually— leading to exponential speed-ups. To do this, we start with some initial x∗i and iteratively optimize EI in each component of the partition. For example, if the first component of the partition is [1,3,5], we first optimize the first, third, and fifth dimensions of x∗i , holding the other dimensions fixed. If the next component is [4], we optimize EI by varying the 4th dimension (holding other dimensions fixed), and so on. As each modelMi has its own additive structure, we identify the best x∗i for each of the k models, x∗i = arg max EI(xi | Mi). (5.2) x We consider the resulting points x1, . . .xk as candidate points and set x∗ = xi for the particular i that maximizes the marginalized acquisition function EI(x∗i ) using eq. (5.1). 5.5 Results For all the experiments, we used the Python GP toolbox GPy [31]. 68 10d GP Function Model Selection 0.8 Bag of Models Model MCMC 0.6 0.4 0.2 0 50 100 150 200 Number of Model Evaluations Figure 5.2: Plot of the relative distance to the true loglikelihood as a function of the number of model evaluations (see text). For MCMC, this means the number of samples drawn in the Markov chain. For bag of models, this means the number of models in the selected bag. Error bars are standard error averaged over 10 runs. 5.5.1 Model Selection Here we demonstrate the effectiveness of MCMC search in model space. We sample a random partition P of {1, . . . , d} and sample observations from a Gaus- sian process with additive structure corresponding to P . In particular, we take N = 50 points from a scrambled Halton sequence on [0, 1]d, and sample cor- responding function values f(x) from the Gaussian process. We then perform MCMC model search using these N observations. The initial kernel is the fully dependent kernel. Figure 5.2 shows plots of the Markov chain length versus model evidence achieved with that chain length for d = 10. We compare with the bag-of-models (BOM) proposed by [45]. For each method we plot, as a func- tion of the number of models evaluated, the relative distance to the true model. 69 Relative Distance to True Log Likelihood This is given by 1 − log p(y|X,Mi,θi) ∗ log p(y|X,M∗ ∗ , where M is the true model andMi is the,θ ) model evaluated at iteration i. MCMC consistently finds additive decompositions within 15–17% of the ground-truth log likelihood. In 8 out of 10 runs, MCMC samples the ground- truth model exactly. BOM consistently fails to find models with a relative log likelihood distance of less than 30%. This experiment demonstrates that MCMC is an effective technique for exploring the space of additive structures, at least when the underlying function is well modeled by a Gaussian process. Specif- ically, we can expect it to outperform the BOM when used as a subroutine in Bayesian optimization when additive structure is critical to the optimization task. 5.5.2 Optimization We next incorporate MCMC model search into the BayesOpt framework. In all experiments below, we sample k = 50 models using the MCMC techniques discussed above at each iteration. For the Bag of Models (BOM) baseline, we use a bag of 50 models. Thus, the MCMC approach and the bag of models approach perform the same number of MLE optimizations of the log likeli- hood log p(y | X,Mi, θi). Instead of performing burn-in—discarding initial low- likelihood samples—we initialize the Markov chain at each iteration with the fi- nal model sampled at the previous iteration. Therefore, the MCMC and BOM ap- proaches add virtually identical amounts of overhead to the baseline BayesOpt algorithm, and the wallclock time required for each iteration is essentially iden- tical for both methods. 70 10d Styblinski-Tang Bayesopt 10d Michalewicz Bayesopt −100 Fully dependent 0 Fully dependent Bag of Models Bag of Models BayesOpt+Model MCMC − BayesOpt+Model MCMC2 −200 Oracle Oracle −4 1.36x − −6300 1.58x −8 −400 −10 50 100 150 200 100 200 300 400 Number of Iterations Number of Iterations Figure 5.3: Optimizing the 10d Styblinksi-Tang and Michalewicz functions. Shaded error bars are 2 standard errors over 10 runs. BayesOpt+Model MCMC converges faster than BOM in both cases. We do note that the amount of overhead added by performing model selec- tion is not inconsequential for cheap functions f . Indeed, considering k models at each iteration results in roughly a factor k increase in the overhead added by BayesOpt. However, in many real world settings (like the cosmological constants experiment and the matrix factorization experiment below), the overhead of run- ning BayesOpt is insignificant compared to evaluating the objective function f) as long as only a relatively small number of optimization iterations are run. In all experiments, the squared exponential (SE) kernel is used as the base kernel. 5.5.3 Optimization of Benchmark Functions We first consider two standard optimization benchmark functions that have addi- tive structure: the Styblinski–Tang function, and the Michalewicz function. The 71 Minimum Function Value Minimum Function Value d-dimensional Styblinski–Tang function is defined as 1 ∑d Stybtang(x) = x4i − 16x2i + 5xi.2 i=1 The global minimum is approximately −39.166d and is attained at point x∗ ≈ (−2.9, . . . ,−2.9). We restrict the domain of the function to [−4, 4]d for the opti- mization. The d-dimensional Michalewicz fun∑ction is definedd (as )ix Michalewicz − i(x) = sin(x ) sin2mi . π i=1 Here, m controls the steepness of valleys and ridges. For this experiment, we set m = 10. For d = 10, the global minimum is approximately −9.66 over the domain [0, π]d. In addition to these two functions, we extend the Styblinski–Tang function, which fully additively decomposes, to a transformed Styblinski–Tang function: We sample a random partition P and for each part i of P , we sample a random orthonormal matrix Qi over the dimensions of part i. If Q is the block diagonal matrix formed by placing each Qi on a diagonal, then Stybtang(Qx) is no longer fully additive, but instead is additive across the components of P . We use this to investigate performance when the true function is not fully additive across individual dimensions. We compare against three baselines. The fully dependent model uses a sin- gle d-dimensional kernel as the model. This is the approach used by popular BayesOpt packages such as Spearmint [90]. The oracle model uses a sum of squared-exponential kernels whose parts capture the true additive structure of the underlying function. For all experiments, we expect the oracle model to outperform all other methods since it has prior knowledge of the true function 72 10d Transformed Styblinski-Tang Bayesopt −100 Fully dependent Bag of Models BayesOpt+Model MCMC −200 Oracle 1.82x −300 −400 50 100 150 200 Number of Iterations Figure 5.4: Optimizing the 10d transformed Styblinksi-Tang function. Shaded er- ror bars are 2 standard errors over 10 runs. BayesOpt+Model MCMC converges to the same final solution as BOM much faster, and significantly outperforms it in terms of final objective value. structure. Our third baseline is BOM [45]. In all experiments, BayesOpt+Model MCMC significantly outperforms the non-oracle approaches, both in terms of convergence speed and ultimate objective value. Figure 5.3 left shows the plot of optimizing the Styblinski–Tang function. Each curve shows the mean cumulative minimum function evaluation up to the current iteration across 10 runs, while the shaded region shows the standard error. It can be seen that our method attains the true global minimum as quickly as the oracle model, while outperforming BOM. The fully dependent model does not converge within 200 iterations. Figure 5.3 right shows the plot of optimizing the Michalewicz function. Since the function has many steep valleys, it is much more difficult to model with a 73 Minimum Function Value Gaussian process than the Styblinski–Tang function. Nevertheless, both the ora- cle model and our method converge to the global optimum within 300 iterations while BOM has not yet converged after 400 iterations. Again, the fully dependent model shows no sign of convergence. In the transformed Styblinski-Tang experiment (Figure 5.4), the more diffi- cult model selection problem poses a significant challenge to the BOM method, which does not significantly outperform the baseline fully dependent model. In contrast, Model MCMC successfully discovers appropriate additive structure and performs significantly better. Of particular interest is the first 50 iterations where Model MCMC outperforms the baseline. This happens in cases where the ground truth model has large non-additive components, which makes the EI optimization problem significantly more difficult for the oracle method. The Model MCMC method averages over many additive structures which, while incorrect, have much simpler structure and for which EI can be more efficiently optimized. 5.5.4 Determining Cosmological Parameters In this section, we test our method on the task of determining the values of various cosmological constants (e.g., Hubble’s constant, the density of baryonic matter in the universe, etc). These cosmological constants are values that are required in standard physical models of the universe, but their exact values must be determined experimentally. To do this, scientists can run simulations with various settings of these cosmological constants and evaluate how well these simulations model experimentally observed data. 74 Cosmological Constants Experiment 26 Baseline BayesOpt Bag of Models 25 BayesOpt+Model MCMC 24 1.38x 23 22 0 50 100 150 200 Iteration Number Figure 5.5: Setting values of various cosmological constants to match experimen- tal data using BayesOpt. Shaded regions correspond to 2 standard errors over 10 runs. BayesOpt+Model MCMC matches BOM 38% faster, and then outperforms it. We use software released by NASA1 that, given settings of these cosmological constants, computes via simulation the likelihood of a set of experimental data released by the Sloan Digital Sky Survey. As in [45], we tune the 9 parameters that this software takes as input and produces the negative log likelihood. However, unlike [45], we do not introduce placeholder dimensions that have no effect on the objective function value. To set ranges for each of the parameters, we take the values set in a parameter file shipped with the software, and optimize over a range of 75%− 125% of this default value for each parameter. The results of this experiment are in Figure 5.5. The parameter values shipped with the software achieve an objective func- 1https://lambda.gsfc.nasa.gov/toolbox/lrgdr/ 75 Negative Log Likelihood Matrix Completion Tuning 0.4 Original Noise Added BayesOpt BayesOpt (Additive) Baseline BayesOpt Bag of Models BayesOpt+Model MCMC 0.3 Original Noise Added BayesOpt BayesOpt (Additive) 0.2 Original Noise Added BayesOpt BayesOpt (Additive) 0.1 0 10 20 30 40 50 60 Iteration Number Figure 5.6: Optimizing the reconstruction error of images with matrix comple- tion. Corner: Example of image reconstruction on the peppers image using hy- perparameters found by both the baseline and by BayesOpt with additive struc- ture. The additive BayesOpt parameters produce a significantly sharper image. tion value of 23.7. All three methods find statistically significantly better solu- tions by 200 iterations, with the BayesOpt+Model MCMC approach performing the best, consistently converging to an optimal negative log likelihood of 22.17. Furthermore, BayesOpt+Model MCMC converges much faster, achieving the global solution found by the bag of models method nearly 40% faster. 5.5.5 Matrix completion Many of the experiments in [45] and our paper here have focused on functions with 10 or more dimensions. However, there are also many machine learning algorithms with only a small number (3 or fewer) of hyperparameters. Although with these few dimensions, employing the machinery of MCMC to explore the 76 Reconstruction Error model space is not necessarily critical (indeed, BOM reduces to evaluating all possible additive decompositions and choosing the best), we seek to demonstrate two things: First, exploiting additive structure can still have a substantial impact on performance. Second, even in a scenario where BOM is optimal (i.e. the best additive structure is chosen each iteration), our sampling approach matches its performance. We evaluate our method on the task of choosing hyperparameters for a matrix completion algorithm in [9]: a singular-value decomposition-based method for recovering a matrix with missing entries which can be used for image denoising, recommendation, and data interpolation, among other applications. We set the following hyperparameters using standard BayesOpt, our method, and BOM: the regularization trade-off τ , the step-size δ, and the noise-constraint ε. A plot of the average validation reconstruction error on an image reconstruction task as a function of the number of BayesOpt iterations is displayed in figure 5.6. Both additive structure mechanisms significantly outperform the standard BayesOpt baseline, both in terms of convergence speed and in terms of final ob- jective function value. To assess whether the hyperparameters made a difference in terms of image reconstruction quality, we use the hyperparameters found by each method to recover the benchmark “peppers” image [103] after noise has destroyed 50% of the pixels. The corrupted image as well as the reconstructions are in figure 5.6. The image reconstructed using the additive Bayesian optimiza- tion methods is significantly sharper than the image reconstructed using the non-additive approach, with less blurring and much sharper edges. 77 5.6 Discussion Problem structure for Bayesian optimization. It’s important to emphasize that the choice of prior has rarely been considered for Bayesian optimization historically. Largely, this has been due to the fact that it is difficult for an algo- rithm designed to optimize arbitrary unknown functions to take advantage of problem structure. It would be difficult to a priori suppose that montonicity or periodicity exists in a blackbox function, and therefore implementations have largely defaulted to general purpose covariance functions that have universal approximation properties. This can result in very high sample complexities, but makes it more likely that Bayesian optimization will eventually succeed. Beyond additive structure. The primary reason that the methods in this chap- ter are so effective is that additive structure is an extremely strong prior to im- pose on the blackbox function, particularly for optimization: if two variables of the optimization problem exist only in different additive components, then the covariance function only models functions in which the variables do not “interact” at all. In general, one could imagine the set of parameters as a graph, with an edge between two parameters if they exist in the same component. The standard RBF kernel would correspond here to the fully connected graph, while the fully additive model corresponds to the graph with no edges. In general, one could consider arbitrary graphs, with the reasonable expectation that fewer edges correspond to increasingly easier optimization tasks but increasingly more restrictive priors. It would be interesting to see what can be shown for more gen- eral structures. For example, is a function that decomposes in to two connected components with a single edge between them easier to optimize than the fully 78 connected structure? Conclusion. In this chapter we introduced an integrated solution to discover and exploit additive structure while performing BayesOpt. Our algorithm is based on MCMC sampling of model candidates and in practice converges sur- prisingly fast for plausible and useful model decompositions. Given the drastic speed-ups that can be achieved through exploitation of additive structure, we hope that our algorithm will become a standard component of Bayesian opti- mization. 79 CHAPTER 6 EXPLOITING PRODUCT STRUCTURE FOR SCALABLE GAUSSIAN PROCESSES 6.1 Introduction In this chapter, we will look at how covariance function structure can be usefully exploited to dramatically increase the scalability of Gaussian processes in many common settings. Historically, one of the key limitations of Gaussian process regression has been the computational intractability of inference when dealing with more than a few thousand data points. This complexity stems from the need to solve linear systems and compute log determinants involving an n× n sym- metric positive definite covariance matrix K. This task is commonly performed by computing the Cholesky decomposition of K [73], incurring O (n3) complexity. To reduce this complexity, inducing point methods make use of a small set ofm < n points to form a rank m approximation of K [70, 89, 36, 97]. Using the matrix inversion and determinant lemmas, inference can be performed in O (nm2) time [89]. Recently, however, an alternative class of inference techniques for Gaussian processes have emerged based on iterative numerical linear algebra techniques [113, 17]. Rather than explicitly decomposing the full covariance matrix, these methods leverage Krylov subspace methods [28] to perform linear solves and log determinants using only matrix-vector multiples (MVMs) with the covari- ance matrix. Letting µ(K) denote the time complexity of computing Kv given a vector v, these methods provide excellent approximations to linear solves and 80 log determinants in O (kµ(K)) time, where k is typically some small constant.1 This approach has led to scalable GP methods that differ radically from previous approaches – the goal shifts from computing efficient Cholesky decompositions to computing efficient MVMs. Structured kernel interpolation (SKI) [113] is a re- cently proposed inducing point method that, given a regular grid of m inducing points, allows for MVMs to be performed in an impressive O (n+m logm) time. These MVM approaches have two fundamental drawbacks. First, in order for SKI to take advantage of fast MVMs, the number of inducing points m grows exponentially with the dimensionality of the inputs, limiting the applicability of SKI to problems with fewer than 5 input dimensions. Second, the computational benefits of iterative MVM inference methods come at the cost of reduced modularity. If all we know about a kernel is that it decom- poses as K = K1 ◦K2, it is not obvious how to efficiently perform MVMs with K, even if we have access to fast MVMs with both K1 and K2. As we’ve seen throughout this thesis, the ability to compose covariance functions using addi- tion and multiplication is a critical feature of Gaussian processes, as it allows practioniers to exploit problem specific structure for sample efficient learning. In order for MVM inference to truly support the expressive expressive mod- eling choices offered by composing covariance functions, we should be able to perform inference equipped with nothing but the ability to perform MVMs with K. One of the most common kernel compositions is the element-wise product of kernels. This composition can encode different functional properties for each input dimension [73, 29, 21, 111], or express correlations between outputs in multi-task settings [55, 6, 2]. Moreover, the RBF and ARD kernels – arguably the 1In practice, k depends on the conditioning of K, but is independent of n. 81 most popular kernels in use – decompose into product kernels. In this chapter, we will address both of these limitations of iterative meth- ods – improving modularity while simultaneously alleviating the curse of di- mensionality. This is possible by once again exploiting structure inherent in the covariance function: our approach to alleviating the exponential dependence on dimension crucially relies on the fact that many popular kernels decompose as a product of one dimensional covariance functions. We will exploit this inherent product structure for fast MVMs. In particular: 1. We demonstrate that MVMs with product kernels can be approximated effi- ciently by computing the Lanczos decomposition of each component kernel. If MVMs with a kernelK can be performed inO (µ(K)) time, then MVMs with the element-wise product of d kernels can be approximated inO (dkµ(K) + k3n log d) time, where k is typically a very small constant. 2. Our fast product-kernel MVM algorithm, entitled SKIP, enables the use of structured kernel interpolation with product kernels without resorting to the exponential complexity of Kronecker products. SKIP can be applied even when the product kernels use different interpolation grids, and enables GP inference and learning in O (dn+ dm logm) for products of d kernels. 3. We apply SKIP to high-dimensional regression problems by expressing d- dimensional kernels as the product of d one-dimensional kernels. Despite this expression being exact for many popular kernels, this formula(tion affords an) exponential improvement over the standard SKI complexity of O n+ dmd logm , and achieving state of the art performance over popular inducing point methods [36, 97]. 82 ( K(1) (2)XX KXX ) v n⇥ n =n⇥ n n ( K(1) (2)XX v KXX ) n⇥ n n⇥ n n⇥ n ⇡ ( Q⇡(1) T (1) Q(1)> v Q(2) T (2) Q(2)>k ⇥ k k ⇥ n k ⇥ k k ⇥ n ) n⇥ k n⇥ n n⇥ k M (1,2) Figure 6.1: Computing fast matrix-vector multiplies (MVMs) with the product kernel (1)KXX ◦ (2) KXX . 1: Rewrite the element-wise product as the diagonal ∆(·) of a product of matrices. 2: Compute the rank-k Lanczos decomposition of (1)KXX and (2)KXX . 4. We demonstrate that SKIP can reduce the complexity of multi-task GPs (MT- GPs) to O (n+m logm+ s) for a problem with s tasks. We exploit this fast infer- ence, developing a model that discovers cluster of tasks using Gibbs sampling. 83 6.2 Matrix-vector multiplication with product kernels Suppose a kernel separates as a produ∏ct as follows:d k(x,x′) = k(i)(x,x′). (6.1) i=1 Given a training data set X = [x1, . . .xn], the kernel matrix K resulting from the product of kernels in (6.1) can be expressed as (1)K = KXX ◦ · · · ◦ (d) KXX , where ◦ represents element-wise multiplication. In other words: [ ] [ ] [ ] (1) ◦ (2) (1) (2)KXX KXX = KXX KXX . (6.2) ij ij ij The key limitation we must deal with is that, unlike a sum of matrices, vector multiplication does not distribute over the elementwise product: ( ) ( ) ( ) K(1) ◦K(2) v 6= K(1)v ◦ K(2)v . (6.3) We will assume we have access to fast MVMs for each component kernel matrix K(i). Without fast MVMs, there is a trivial solution to computing the elementwise matrix vector product: explicitly compute the kernel matrix K in O (dn2) time and then compute Kv. We further assume that K(i) admits a low rank approximation, following prior work on inducing point methods [89, 97, 113, 36]. A naive algorithm for a two-kernel product. We initially assume for simplicity that there are only d = 2 components kernels in the product. We will then show how to extend the two kernel case to arbitrarily sized product kernels. We seek to perform matrix vector multiplies: (1) (KXX ◦ (2) KXX)v (6.4) 84 Eq. (6.4) may be expressed in terms of matrix-matrix multiplication using the following identity: (1) (2) (1) (2)> Kv = (KXX ◦KXX)v = ∆(KXX Dv KXX ), (6.5) where Dv is a diagonal matrix whose elements are v (6.1), and ∆(M) denotes the diagonal of M . Because Dv is an n × n matrix, computing the entries of Kv naively requires n matrix-vector multip(lies with (1)KXX and (2)KX)X . The time complexity to compute (6.5) is therefore O (1) (2)nµ(KXX) + nµ(KXX) . This refor- mulation does not naively offer any time savings. Exploiting low-rank structure. Suppose however that we have access to rank- k approximations of (1)KXX and (2) KXX : (1) ≈ (1) (1) (1)> (2)K Q T Q , K ≈ Q(2)T (2)XX XX Q(2)>, where Q(1), Q(2) are n× k and T (1), T (2) are k × k (6.1). This rank decomposition makes the MVM significantly cheaper to compute. Plugging these decomposi- tions in to (6.5), we derive: ( Kv = ∆ Q(1)T (1)Q(1)> D Q(2)T (2)Q(2)> ) v . (6.6) We prove the following key lemma in about (6.6) in Section 6.3: Lemma 6.2.1. Suppose that (1) (2)KXX = Q (1)T (1)Q(1)> and K = Q(1)T (1)Q(1)>XX , where Q(1) and Q(2) are n× k matrices and T (1) and T (2) are × (1) (2)k k. Then (KXX ◦KXX)v can be computed with (6.6) in O (k2n) time. Therefore, if we can efficiently compute low-rank decompositions ofK(1) and K(2), then we immediately apply 6.2.1 to perform fast MVMs. 85 Computing low-rank structure. With 6.2.1, we have reduced the problem of computing MVMs with K to that of constructing low-rank decompositions for (1) KXX and (2) KXX . Since we are assuming we can take fast MVMs with these kernel matrices, we now turn to the Lanczos decomposition [53, 68]. The Lanczos decom- position is an iterative algorithm that takes a symmetric matrix A and probe vector b and returns Q and T such that A = QTQ>, with Q orthogonal. This decomposition is exact after n iterations, i.e. A = QTQ>. However, sup- pose we were to only compute k < n columns of Q. Then, QkT >kQk is an effective low-rank approximation of A [66, 86]. Unlike standard low rank approximations (such as the singular value decomposition), the algorithm for computing the Lanczos decomposition (i)K = Q(i)T (i)Q(i)>XX requires only k MVMs, leading to the following lemma: ( ) Lemma 6.2.2. Suppose that MVMs with (i)KXX can be computed inO (i) µ(KXX) time. Th(en the rank)-k Lanczos decomposition (i) ≈ (i) (i) (i)>KXX Qk Tk Qk can be computed in O (i)kµ(KXX) time. The above discussion motivates the following algorithm for computing (1) (2) (KXX ·KXX)v, which is summarized by 6.1: First, compute the rank-k Lanczos decomposition of each m(atrix; then, apply (6.6). Lem)mas 6.2.2 and 6.2.1 together imply that this takes O (1) (2)kµ(K 2XX) + kµ(KXX) + k n time. Extending to product kernels with three components. Now consider a kernel that decomposes as the product of three components, k(x,x′) = k(1)(x,x′)k(2)(x,x′)k(3)(x,x′). An MVM with this kernel is given by Kv = (1) ◦ (2) ◦ (3) Define (1) (1) (2) (2) (3)(KXX KXX KXX)v. K̃XX = KXX ◦KXX and K̃XX = KXX . Then (1) (2) Kv = (K̃XX ◦ K̃XX)v , (6.7) 86 Accuracy of Lanczos Decomposition Scaling with Inducing Points 104 SKIP 10−1 SGPR 1% error 3 KISS-GP10 0.1% error 10−3 10−5 12 terms 10 2 8 terms 4 terms 10−7 101 0 20 40 60 80 100 101 102 103 104 Number of Lanczos iterations (rank) Number of inducing points Figure 6.2: Left: Relative error of MVMs computed using SKIP compared to the exact value Kv. Right: Training time as a function of the number of inducing points per dimension on the Power dataset. KISS-GP scales badly here, because the required total number of inducing points scales exponentially with the number of dimensions. On the power dataset d = 4. reducing the three component problem back to two components. To compute the Lanczos decomposition of (1)K̃XX , we use the method described above for computing MVMs with (1) (2)KXX ◦KXX . Extending to product kernels with many components. The approach for the three component setting leads naturally to a divide and conquer strategy. Given a kernel matrix (1)K = KXX ◦ · · · ◦ (d) KXX we define d (1) (1) K̃XX = KXX ◦ · · · ◦ ( ) K 2XX (6.8) d (2) ( +1) K̃XX = K 2 XX ◦ · · · ◦ (d) KXX , (6.9) which lets us rewrite (1) (2)K = K̃XX ◦ K̃XX . By applying this splitting recursively, we can compute matrix-vector multiplies with K, leading to the following running time complexity, which we prove in Section 6.3: Theorem 6.2.3. Suppose that (1)(◦ · · · ◦ )(d)K = KXX KXX , and that computing a matrix- vector multiply wit(h any (i)KXX requiresO (1µ(KX))X) operations. Computing an MVM with K requiresO dkµ(K(i)) + k3n log d+ k2n time, where k is the rank of the Lanc- zos decomposition used. 87 Average MVM error Training time (s) Sequential MVMs. If we are computing many MVMs with the same matrix, then we can further(reduce this complexit)y by caching the Lanczos decompo- sition. The terms O (i)dkµ(K 3XX) + k n log d represent the time to construct the Lanczos decomposition. However, note that this decomposition is not dependent on the vector that we wish to multiply with. Therefore, if we save the decompo- sition for future computation, we have the following corollary: Corollary 6.2.4. Any subsequent MVMs with K require O (k2n) time. If matrix-vector multiplications with (i)KXX can be performed with signifi- cantly fewer than n2 operations, this results in a significant complexity improve- ment over explicitly computing the full kernel matrix K. 6.2.1 Structured kernel interpolation for products (SKIP) So far we have supposed access to fast MVMs with each constituent kernel matrix of an elementwise (Hadamard) product: (1) (d)K = KXX ◦ · · · ◦KXX . To achieve fast MVMs, we apply the SKI approximation (Section 2.5) to each component as: (i) K = W (i)K W (i)>XX UU . (6.10) When using SKI approximations, the running time of our product kernel infer- ence technique with p iterations of CG becomesO (dk(n+m logm) + k3n log d+ pk2n). The running time of SKIP is compared to that of other inference techniques in 6.1. 88 6.3 Proofs for key results In this section, we briefly provide proofs for the two key results above, Lemma 6.2.1 and Theorem 6.2.3. 6.3.1 Proof of Lemma 6.2.1 Letting (1) denote the th row of (1) and (2)qi i Q qi denote the i th row of Q(2), we can express the ith entry Kv, [Kv]i as: (1) [Kv] = q T (1)Q(1)> (2)> i i D (2) v Q T (2)qi To evaluate this for all i, we first once compute the k × k matrix: M (1,2) = T (1)Q(1)> D Q(2)T (2)v . This can be done in O(nk2) time. T (1)Q(1)> and Q(2)T (2) can each be computed in O(nk2) time, as theQmatrices are n×k and the T matrices are n×k. Multiplying one of the results by Dv takes O (nk) time as it is diagonal. Finally, multiplying the two resulting n× k matrices together takes O (nk2) time. After computing M (1,2), we can compute each element of the matrix-vector multiply as: (1) [Kv] = q M (1,2) (2)> i i qi . Because M (1,2) is k × k, each of these takes O (k) time to compute. Since there are n entries to evaluate in the MVM Kv in total, the total time requirement after computing M (1,2) is O (kn) time. Thus, given low rank structure, we can compute Kv in O (k2n) time total. 89 6.3.2 Proof of Theorem 6.2.3 Given the Lanczos decompositions of (1) (1) ◦· · ·◦ (a)K̃ = KXX KXX and (2) (a+1) K̃ = KXX ◦ · · · ◦ (d)K , we can compute matrix-vector multiplies with K̃(1) ◦ K̃(2)XX in O (k2n) time each. This lets us compute the Lanczos decomposition of K̃(1) ◦ K̃(2) in O (k3n) time. For clarity, suppose first that d = 3, i.e., (1)K = KXX ◦ (2) ◦ (3)KXX KXX . We first Lanczos decompose (1) , (2) and (3)KXX KXX KXX . Assuming for simp(licty that M)VMs with each matrix take the same amount of time, This takes O (i)kµ(KXX) time total. We then use these Lanczos decompositions to compute matrix-vector mul- tiples with (1)K̃ in O (k2XX n)time each. This allows us to Lanczos decompose it in O (k3n) time total. We can then compute matrix-vector multiplications Kv in O (k2n) time. In the most general setting where (1) (d)K =(KXX ◦ · · ·)◦KXX , we first Lanczos de- compose the d component matrices in O dkµ(K(i)) and then perform O (log d) merges as described above, each of which takes O (k3n) time. After computing all necessary Lanczos decompositions, matrix-vector multiplications with K can be performed in O (k2n) time. ( ) As a result, a single matrix-vector multiply withK takesO dkµ(K(i)) + k3n log d+ k2n time. With the Lanczos decompositions precomputed, multiple MVMs in a row can be performed significantly (faster. For example, running )p iterations of conju- gate gradients with K takes O dkµ(K(i)) + k3n log d+ pk2n time. 90 Table 6.1: Asymptotic complexities of a single inference step with n training examples and m inducing points, using k Lanczos iterations for SKIP and p CG iterations for SKIP and KISS-GP. Method Complexity of 1 Inference Step GP (Chol) O (n3) GP (MVM) O (pn2) SVGP O ((nm2 +m3 + dnm)) KISS-GP O pn+ pdmd logm SKIP O (dkn+ dkm logm+ k3n log d+ pk2n) 6.4 Evaluating MVM accuracy We first evaluate the accuracy of our proposed approach with product kernels in a controlled synthetic setting. We draw 2500 data points in d dimensions from N (0, I) and compute an RBF kernel matrix with lengthscale 1 over these data points. We evaluate the relative error of SKIP compared to exact MVMs as a function of k – the number of Lanczos iterators. We perform this test for 4, 8, and 12 dimensional data, resulting in a product kernel with 4, 8, and 12 components respectively. The results, averaged over 100 trials, are shown in 6.2 (left). Even in the 12 dimensional setting, an extremely small value of k is sufficient to get very accurate MVMs: less than 1% error is achieved when k = 30. For a discussion of increasing error with dimensionality, see 6.6. In future experiments, we set the maximum number of Lanczos iterations to 100, but note that the convergence criteria is typically met far sooner. 91 6.5 An exponential improvement to SKI [113] use a Kronecker decomposition of KUU to apply SKI for d > 1 dimensions, which requires a fully connected multi-dimensional grid of inducing points U . Thus if we wish to have m distinct inducing point values for each dimension, the grid require(s md inducing)points – i.e. MVMs with the SKI approximate KXX require O n+ dmd logm time. It is therefore computationally infeasible to apply SKI in more than about five dimensions. However, using the proposed SKIP method of Section 6.2, we can re- du(ce the running) time complexity of SKI in d dimensions from exponential O n+ dmd logm to linearO (dn+ dm logm)! If we express a d-dimensional ker- nel as the product of d one-dimensional kernels, then each component kernel requires only m grid points, rather than md. For the RBF and ARD kernels, de- composing the kernel in this way yields the same kernel function. Datasets. We evaluate SKIP on six benchmark datasets. The precipitation dataset contains hourly rainfall measurements from hundreds of stations around the country. The remaining datasets are taken from the UCI machine learning dataset repository. KISS-GP is not applicable when d > 5, and the full GP is not applicable on the four largest datasets. Methods. We compare against the popular sparse variational Gaussian pro- cesses (SGPR) [97, 36] implemented in GPflow [60]. We also compare to our GPU implementation of KISS-GP where possible, as well as our GPU implemen- tation of the full GP on the two smallest datasets. All experiments were run on an NVIDIA Titan Xp. We evaluate SGPR using 200, 400 and 800 inducing points. 92 Table 6.2: Comparison of SKIP and other methods on higher dimensional datasets. In this table, m is the total number of inducing points, rather than num- ber of inducing points per dimension. (*We use m = 100 for SKIP on all datasets except precipitation, where we use m = 120K.) Dataset Metric Full GP SGPR SGPR SGPR KISS-GP SKIP(m = 200) (m = 400) (m = 800) (m = 120K) (m = 100)* Pumadyn Test MAE 0.721 0.766 0.766 0.766 – 0.766 (n = 8192, d = 32) Train Time (s) 4 28 67 235 – 65 Elevators Test MAE 0.072 0.157 0.157 0.157 – 0.072 (n = 16599, d = 18) Train Time (s) 12 46 122 425 – 23 Precipitation Test MAE – 14.79 – – 9.81 14.08 (n = 628474, d = 3) Train Time (s) – 1432 – – 615 34.16 KEGG Test MAE – 0.101 0.093 0.087 – 0.065 (n = 48827, d = 22) Train Time (s) – 116 299 9926 – 66 Protein Test MAE – 7.219 4.97 4.72 – 1.97 (n = 45730, d = 9) Train Time (s) – 139 397 1296 – 35 Video Test MAE – 6.836 6.463 6.270 – 5.621 (n = 68784, d = 16) Train Time (s) – 113 334 1125 – 57 All models use the RBF kernel and a constant prior mean function. We optimize hyperparameters with ADAM using default optimization parameters. Discussion. The results of our experiments are shown in Table 6.2. On the two smallest datasets, the Full GP model outperforms all other methods in terms of speed. This is due to the overhead added by inducing point methods signifi- cantly outweighing simple solves with conjugate gradients with such little data. SKIP is able to match the error of the full GP model on elevators, and all methods have comparable error on the Pumadyn dataset. On the precipitation dataset, inference with standard KISS-GP is still tractable due to the low dimensionality, and KISS-GP is both fast and accurate. Using SKIP results in higher error than KISS-GP, because we were able to use significantly fewer Lanczos iterations for our approximate MVMs than on other datasets due to the space complexity. We discuss the space complexity limitation further in the discussion section. Nevertheless, SKIP still performs better than SGPR. SGPR results with 400 and 800 inducing points are unavailable due to GPU memory 93 constraints. On the remaining datasets, SKIP is able to achieve comparable or better over- all error than SGPR, but with a significantly lower runtime. There are two im- portant things to note from these results. First, because SKIP interpolates each dimension independently, far fewer inducing points are needed overall: SKIP does not need to cover the full d dimensional space with inducing points. Sec- ond, while SKIP significantly improves on the curse of dimensionality suffered by SKI, the running time of SKIP still depends linearly on the dimensionality, a fact we discuss further later in the chapter. We notice that, while SKIP still performs better on the highest dimensional dataset (KEGG), its running time is significantly larger than for the other datasets. 6.6 Discussion Beyond the exponential improvement to the running time complexity of SKI and the extension of MVM-based inference methods to product kernels, this chapter also raises the following foundational question about scalable GP inference: given the ability to compute Av and Bv quickly for matrices A and B, how do we compute (A ◦ B)v quickly? We have shown an answer to this question can dramatically improve the scalability and general applicability of MVM-based methods for fast Gaussian processes. In particular, we have introduced a general technique for fast MVMs exploiting product kernel structure, enabling SKI inference that scales linearly instead of exponentially with dimension. We conclude the chapter with a discussion of some practical limitations and related work. 94 Other inducing point methods. A variety of inducing point methods other than SKI exist [89, 97]. Like SKI, many of these methods involve approximating the kernel using a set of inducing points. For example, the popular subset of regressors (SoR) method (used, for example, in the SGPR method we compare to above) approximates KSoR −1XX′ = KXUKUUKUX′ . These methods can be efficiently implemented using iterative MVM strategies: calculating [K −1XUKUUKUX′ ]v takes O (nm+m2) time after initial work to decomposeKUU . However, applying SKIP as an improvement over SKI results in both the most dramatic improvement in running time (from exponential in d to linear), and also results in the fastest overall running time complexity we are aware of for d dimensional Gaussian process regression. Recent work extends inducing point methods to be compatible with stochas- tic gradient methods [36, 37]. As demonstrated in [110], these stochastic varia- tional methods are fully compatible with the iterative MVM approach and SKI, and we anticipate these approaches can be used to further accelerate SKIP as well. Stochastic diagonal estimation and small lengthscales. Our method relies pri- marily on quickly computing the diagonal in Equation (6.5). Techniques exist for stochastic diagonal estimation [23, 42, 82, 5]. We found that these methods con- verge slower than the Lanczos approach described here. However, the Lanczos decomposition may provide a poor approximation in cases where the kernel ma- trix is high rank, for example with extremely small lengthscales or nonsmooth kernels. Thus, stochastic diagonal estimation may be more appropriate for eval- uating (6.5) in some cases. 95 Higher-order product kernels. A fundamental property of the Hadamard product is that rank(A ◦ B) ≤ rank(A)rank(B) suggesting that if we apply the Lanczos decomposition to the product of a large number of matrices, we may need more iterations to derive accurate approximations. In the limit, the SKI approximation WK >UUW can be used in place of the Lanczos decomposition in equation (6.5), resulting in an exact algorithm withO (dnm+ dm2 logm) runtime. In practice this adaptation is rarely necessary, as the increasing lengthscale with dimensionality counteracts the rank increase of K. Space complexity. To perform the matrix-vector multiplication algorithm de- scribed above, we must store the Lanczos decomposition of each component kernel matrix and intermediate matrices in the merge step for O (dkn) storage. This is better storage than the O (n2) storage required in full GP regression, or O (nm) storage for standard inducing point methods, but worse than the linear storage requirements of SKI. 96 CHAPTER 7 DISCUSSION AND FUTURE DIRECTIONS The goal of this thesis has been to explore the application of covariance structure to modeling functions, especially in the limited data regime. In particular, select- ing the right covariance function for the task (whether manually or actively) can dramatically reduce sample complexity. Exploiting covariance structure allows for audiometric testing in a fraction of the time required by the standard test, and can lead to exponentially lower sample complexity for Bayesian optimization. By exploiting the product structure inherent in the RBF kernel, we can achieve exponential speed-ups for training Gaussian processes. This last result is a partic- ularly compelling use of covariance function structure, as it leads to the fastest asymptotic complexity for Gaussian process training we are aware of. Large-scale Learning. This thesis primarily deals with restrictive covariance functions: kernels that impose significant structure on the model and signifi- cantly reducing the size of the hypothesis class modeled by the Gaussian process. This is useful or even mandatory when dealing with very small amounts of data. The audiometric testing setting and the Bayesian optimization setting necessar- ily fit this setting, as the cost of acquiring each training example is assumed to be very high. Alongside this thesis, an alternative approach to covariance function selec- tion is kernel learning, in which a highly flexible (i.e., highly parameterized) co- variance function is used. Recent examples of this include the spectral mixture kernel [108] and deep kernel learning [112, 110]. Unlike the types of structure largely dealt with in this thesis, these techniques greatly expand the flexiblity of 97 Gaussian processes, and can be used to absorb large amounts of data. Because most of the applications discussed in this thesis fall in the small data regime, scalability has not been an issue, and we have used standard Gaussian process inference throughout. However, to take full advantage of kernel learning, inference and predictions with massive datasets must be made tractable. SKIP takes a significant step in this direction for product kernels, reducing the run- ning time to linear in both the number of training examples n (as was the case with standard SKI) and the number of features d (instead of exponential). The predictive mean for a single test point can be computed in constant time with SKI, but computing predictive variances efficiently remains an open problem. As present and future work address these scalability concerns and bring the time costs for training and predictions using Gaussian processes in line with other modern machine learning techniques, they become an attractive choice for probabilistic modeling. In low data settings such as active learning and Bayesian optimization, GPs have found significant applicability in large part due to the availabilty of model uncertainty estimates that allow for careful decision-making. If the running time cost of using GPs becomes negligible, this aspect of GP re- gression could prove to be an immensely useful augmentation of existing deep learning architectures by allowing for calibrated model uncertainty estimates without sacrificing accuracy or speed. Small-scale learning. Despite the promise of large scale learning and the grow- ing availability of large datasets, the ability to learn in the presence of very small datasets will quite likely to continue to be important. Bayesian optimization and the active learning setting discussed in our audiometric test are examples of set- 98 tings where learning must start by definition with no data at all, as each instance deals with an entirely new objective function or patient. While work exists that attempts to use multi-task Gaussian processes to transfer knowledge from one optimization task to another (e.g, [94]), this approach is largely applicable only to optimization problems within the same domain: in other words, it would be difficult to transfer knowledge from hyperparameter tuning to experimental design. In these settings, highly parameterized models may not be as useful, as it can be difficult to generalize a very flexible model with only a handful of datasets. As a result, careful thought about more restrictive and informative priors can be greatly advantageous. It has largely been common practice with Bayesian optimization to assume that, because the function is unknown, a general pur- pose covariance function must be used. A central theme of this thesis is that covariance functions can be selected actively. While the function is unknown at first, after 20, 40, or 100 iterations of Bayesian optimization or active learning we may know a great deal about the function: enough to choose a better model. By intelligently discovering and exploiting this structure–even in the limited data setting–we can construct models that are exponentially better than otherwise possible. 99 BIBLIOGRAPHY [1] A. Ali, R. Caruana, and A. Kapoor. Active Learning with Model Selection. In AAAI, 2014. [2] Mauricio A Álvarez and Neil D Lawrence. Computationally efficient con- volved multiple output Gaussian processes. Journal of Machine Learning Research, 12(May):1459–1500, 2011. [3] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American mathematical society, 68(3):337–404, 1950. [4] Thomas C Bailey, Yixin Chen, Yi Mao, Chenyang Lu, Gregory Hackmann, Scott T Micek, Kevin M Heard, Kelly M Faulkner, and Marin H Kollef. A Trial of a Real-Time Alert for Clinical Deterioration in Patients Hospital- ized on General Medical Wards. Journal of Hospital Medicine, 8(5):236–242, 2013. [5] Costas Bekas, Effrosyni Kokiopoulou, and Yousef Saad. An estimator for the diagonal of a matrix. Applied numerical mathematics, 57(11):1214–1229, 2007. [6] Edwin V Bonilla, Kian M Chai, and Christopher Williams. Multi-task Gaussian process prediction. In NIPS, pages 153–160, 2008. [7] E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on Bayesian optimiza- tion of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010. [8] Adam D Bull. Convergence rates of efficient global optimization algo- rithms. The Journal of Machine Learning Research, 12:2879–2904, 2011. [9] Jian-Feng Cai, Emmanuel J Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimiza- tion, 20(4):1956–1982, 2010. [10] Roberto Calandra, Jan Peters, Andre Seyfarth, and Marc P Deisenroth. An experimental evaluation of bayesian optimization on bipedal locomotion. In International Conference on Robotics and Automation, 2014. 100 [11] Raymond Carhart and James Jerger. Preferred Method for Clinical Deter- mination of Pure-Tone Thresholds. Journal of Speech and Hearing Disorders, 24(4):330–345, 1959. [12] Shane Carr, Roman Garnett, and Cynthia Lo. BASC: Applying Bayesian Optimization to the Search for Global Minima on Potential Energy Sur- faces. In Proceedings of the 33rd International Conference on Machine Learning, pages 898–907, 2016. [13] Matt Carter and Jennifer C Shieh. Guide to research techniques in neuroscience. Academic Press, 2009. [14] John P Cunningham, Krishna V Shenoy, and Maneesh Sahani. Fast gaus- sian process methods for point process intensity estimation. In Proceedings of the 25th international conference on Machine learning, pages 192–199. ACM, 2008. [15] Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learn- ing. In Proceedings of the 25th international conference on Machine learning, pages 208–215. ACM, 2008. [16] Manuel Don, Jos J Eggermont, and Derald E Brackmann. Reconstruction of the audiogram using brain stem responses and high-pass noise masking. Annals of Otology, Rhinology and Laryngology., 88(3 Part 2, Supplement 57):1– 20, 1979. [17] Kun Dong, David Eriksson, Hannes Nickisch, David Bindel, and An- drew Gordon Wilson. Scalable log determinants for gaussian process kernel learning. In NIPS, 2017. [18] David Duvenaud. Automatic Model Construction with Gaussian Processes. PhD thesis, Computational and Biological Learning Laboratory, University of Cambridge, 2014. [19] David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenen- baum, and Zoubin Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. In Proceedings of the 30th International Conference on Machine Learning, pages 1166–1174, 2013. [20] David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenen- baum, and Zoubin Ghahramani. Structure Discovery in Nonparametric Regression through Compositional Kernel Search. In ICML, pages 1166– 1174, 2013. 101 [21] David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenen- baum, and Zoubin Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. In Proceedings of the 30th International Conference on Machine Learning, pages 1166–1174, 2013. [22] David K Duvenaud, Hannes Nickisch, and Carl E Rasmussen. Additive gaussian processes. In NIPS, 2011. [23] Jack K Fitzsimons, Michael A Osborne, Stephen J Roberts, and Joseph F Fitzsimons. Improved stochastic trace estimation using mutually unbiased bases. arXiv preprint arXiv:1608.00117, 2016. [24] Jacob Gardner, Gustavo Malkomes, Roman Garnett, Kilian Q Weinberger, Dennis Barbour, and John P Cunningham. Bayesian active model selec- tion with an application to automated audiometry. In Advances in Neural Information Processing Systems, pages 2377–2385, 2015. [25] Jacob R Gardner, Xinyu Song, Kilian Q Weinberger, Dennis L Barbour, and John P Cunningham. Psychophysical detection testing with bayesian active learning. In UAI, pages 286–295, 2015. [26] Roman Garnett, Michael A Osborne, and Philipp Hennig. Active Learning of Linear Embeddings for Gaussian Processes. In UAI, pages 230–239, 2014. [27] Roman Garnett, Michael A Osborne, and Stephen J Roberts. Bayesian optimization for sensor set selection. In Proceedings of the 9th ACM/IEEE International Conference on Information Processing in Sensor Networks, pages 209–219. ACM, 2010. [28] Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU Press, 2012. [29] Mehmet Gönen and Ethem Alpaydın. Multiple kernel learning algorithms. Journal of machine learning research, 12(Jul):2211–2268, 2011. [30] Rudolph E Gosztonyi Jr, Lawrence A Vassallo, and Joseph Sataloff. Audio- metric reliability in industry. Archives of Environmental Health: An Interna- tional Journal, 22(1):113–118, 1971. [31] GPy. GPy: A gaussian process framework in python, since 2012. 102 [32] David M Green. A maximum-likelihood method for estimating thresholds in a yes–no task. The Journal of the Acoustical Society of America, 93(4):2096– 2105, 1993. [33] Carlos Guestrin, Andreas Krause, and Ajit Paul Singh. Near-optimal sen- sor placements in gaussian processes. In ICML, 2005. [34] Yuhong Guo and Russell Greiner. Optimistic active-learning using mutual information. In IJCAI, volume 7, pages 823–829, 2007. [35] Trevor J Hastie and Robert J Tibshirani. Generalized additive models, vol- ume 43. CRC press, 1990. [36] James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013. [37] James Hensman, Alexander G de G Matthews, and Zoubin Ghahramani. Scalable variational gaussian process classification. In AISTATS. JMLR, 2015. [38] José Miguel Hernández-Lobato, Matthew W Hoffman, and Zoubin Ghahramani. Predictive Entropy Search for Efficient Global Optimization of Black-box Functions. In NIPS, pages 918–926, 2014. [39] Neil Houlsby, José M Hernández-Lobato, and Zoubin Ghahramani. Cold- start Active Learning with Robust Ordinal Matrix Factorization. In ICML, pages 766–774, 2014. [40] Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011. [41] Waiter Hughson and Harold Westlake. Manual for program outline for re- habilitation of aural casualties both military and civilian. Transactions of the American Academy of Ophthalmology and Otolaryngology, 48(Supplement):1– 15, 1944. [42] Michael F Hutchinson. A stochastic estimator of the trace of the influ- ence matrix for laplacian smoothing splines. Communications in Statistics- Simulation and Computation, 19(2):433–450, 1990. [43] Tomoharu Iwata, Neil Houlsby, and Zoubin Ghahramani. Active learning 103 for interactive visualization. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, pages 342–350, 2013. [44] James Jerger. Bekesy audiometry in analysis of auditory disorders. Journal of Speech, Language, and Hearing Research, 3(3):275–287, 1960. [45] Kirthevasan Kandasamy, Jeff Schneider, and Barnabas Poczos. High di- mensional bayesian optimisation and bandits via additive models. In David Blei and Francis Bach, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 295–304. JMLR Workshop and Conference Proceedings, 2015. [46] Robert Keys. Cubic convolution interpolation for digital image processing. IEEE transactions on acoustics, speech, and signal processing, 29(6):1153–1160, 1981. [47] Igor Kononenko. Machine Learning for Medical Diagnosis: History, State of the Art and Perspective. Artificial Intelligence in Medicine, 23(1):89–109, 2001. [48] Andreas Krause and Carlos Guestrin. Nonmyopic active learning of gaus- sian processes: an exploration-exploitation approach. In ICML 24, 2007. [49] Jouni Kuha. AIC and BIC: Comparisons of Assumptions and Performance. Sociological Methods and Research, 33(2):188–229, 2004. [50] Johannes Kulick, Robert Lieck, and Marc Toussaint. Active Learning of Hyperparameters: An Expected Cross Entropy Criterion for Active Model Selection. CoRR, abs/1409.7552, 2014. [51] S. R. Kulkarni, S. K. Mitter, and J. N. Tsitsiklis. Active learning using arbitrary binary valued queries. Machine Learning, 11:23–35, 1993. [52] Malte Kuss and Carl Edward Rasmussen. Assessing Approximate Infer- ence for Binary Gaussian Process Classification. Journal of Machine Learning Research, 6:1679–1704, 2005. [53] Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. United States Governm. Press Office Los Angeles, CA, 1950. 104 [54] Marjorie R Leek, Judy R Dubno, Ning-ji He, and Jayne B Ahlstrom. Expe- rience with a yes–no single-interval maximum-likelihood procedure. The Journal of the Acoustical Society of America, 107(5):2674–2684, 2000. [55] D. J. C. MacKay. Introduction to Gaussian processes. In C. M. Bishop, edi- tor, Neural Networks and Machine Learning, pages 133–165. Springer, Berlin, 1998. [56] David JC MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003. [57] Ted Madison et al. Guidelines for manual pure-tone threshold audiometry. 2005. [58] Nimalan Mahendran, Ziyu Wang, Firas Hamze, and Nando D Freitas. Adaptive mcmc with bayesian optimization. In International Conference on Artificial Intelligence and Statistics, pages 751–760, 2012. [59] Gustavo Malkomes, Chip Schaff, and Roman Garnett. Bayesian optimiza- tion for automated model selection. In Advances in Neural Information Processing Systems 29, 2016. [60] Alexander G de G Matthews, Mark van der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas, Pablo León-Villagrá, Zoubin Ghahramani, and James Hensman. Gpflow: A gaussian process library using tensorflow. Journal of Machine Learning Research, 18(40):1–6, 2017. [61] D McBride and S Williams. Audiometric notch as a sign of noise induced hearing loss. Occupational Environmental Medicine, 58(1):46–51, 2001. [62] Christian Meyer-Bisch. Audioscan: a high-definition audiometry tech- nique based on constant-level frequency sweeps-a new method with new hearing indicators. International Journal of Audiology, 35(2):63–72, 1996. [63] Thomas P Minka. Expectation Propagation for Approximate Bayesian Inference. In UAI, pages 362–369, 2001. [64] J. Mockus, V. Tiesis, and A. Zilinskas. The application of bayesian methods for seeking the extremum. Towards Global Optimization, 2:117–129, 1978. [65] Kevin P Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012. 105 [66] Hannes Nickisch, Rolf Pohmann, Bernhard Schölkopf, and Matthias Seeger. Bayesian experimental design of magnetic resonance imaging sequences. In NIPS, pages 1441–1448, 2009. [67] Özcan Özdamar, Rebecca E Eilers, Edward Miskiel, and Judith Widen. Classification of audiograms by sequential testing using a dynamic bayesian procedure. The Journal of the Acoustical Society of America, 88(5):2171–2179, 1990. [68] Christopher C Paige. Computational variants of the lanczos method for the eigenproblem. IMA Journal of Applied Mathematics, 10(3):373–381, 1972. [69] Alex Pentland. Maximum likelihood estimation: The best pest. Attention, Perception, & Psychophysics, 28(4):377–379, 1980. [70] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005. [71] Adrian E Raftery. Approximate Bayes Factors and Accounting for Model Uncertainty in Generalised Linear Models. Biometrika, 83(2):251–266, 1996. [72] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. [73] Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning, volume 1. MIT press Cambridge, 2006. [74] C.E. Rasmussen and C.K.I. Williams. Gaussian processes for machine learning. MIT Press, 2006. [75] DW Robinson. Long-term repeatability of the pure-tone hearing threshold and its relation to noise exposure. British journal of audiology, 25(4):219–235, 1991. [76] Nicholas Roy and Andrew McCallum. Toward optimal active learning through monte carlo estimation of error reduction. ICML, Williamstown, pages 441–448, 2001. [77] Yunus Saatçi. Scalable inference for structured Gaussian process models. PhD thesis, University of Cambridge, 2012. 106 [78] Suchi Saria, Anand K. Rajani, Jeffrey Gould, Daphne L. Koller, and Anna A Penn. Integration of Early Physiological Responses Predicts Later Illness Severity in Preterm Infants. Science Translational Medicine, 2(48):48ra65, 2010. [79] U Schiefer, J Pätzold, and F Dannheim. Konventionelle perimetrie. Der Ophthalmologe, 102(6):627–646, 2005. [80] Nicolas Schmuziger, Rudolf Probst, and Jacek Smurzynski. Test-retest reliability of pure-tone thresholds from 0.5 to 16 khz using sennheiser hda 200 and etymotic research er-2 earphones. Ear and hearing, 25(2):127–132, 2004. [81] Gideon Schwarz. Estimating the Dimension of a Model. Annals of Statistics, 6(2):461–464, 1978. [82] Marco Selig, Niels Oppermann, and Torsten A Enßlin. Improving stochas- tic estimates with inference methods: Calculating matrix diagonals. Physi- cal Review E, 85(2):021134, 2012. [83] Burr Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012. [84] Josef Shargorodsky, Sharon G Curhan, Gary C Curhan, and Roland Eavey. Change in Prevalence of Hearing Loss in US Adolescents. Journal of the American Medical Association, 304(7):772–778, 2010. [85] Jonathan Richard Shewchuk et al. An introduction to the conjugate gradi- ent method without the agonizing pain, 1994. [86] Horst D Simon and Hongyuan Zha. Low-rank matrix approximation us- ing the lanczos bidiagonalization process with applications. SIAM Journal on Scientific Computing, 21(6):2257–2274, 2000. [87] Cas Smits, Theo S Kapteyn, and Tammo Houtgast. Development and validation of an automatic speech-in-noise screening test by telephone. International journal of audiology, 43(1):15–28, 2004. [88] Alex J Smola and Bernhard Schölkopf. Learning with kernels. GMD- Forschungszentrum Informationstechnik, 1998. 107 [89] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In NIPS, pages 1257–1264, 2006. [90] J. Snoek, H. Larochelle, and R. P. Adams. Practical bayesian optimization of machine learning algorithms. In NIPS 2012, pages 2960–2968. 2012. [91] X. D. Song, B. M. Wallace, J. R. Gardner, N. M. Ledbetter, K. Q. Weinberger, and D. L Barbour. Fast, continuous audiogram estimation using machine learning. Ear and Hearing, 2015. [92] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experi- mental design. arXiv preprint arXiv:0912.3995, 2009. [93] De Wet Swanepoel, Hermanus C Myburgh, David M Howe, Faheema Ma- homed, and Robert H Eikelboom. Smartphone hearing screening with integrated quality control and data management. International journal of audiology, 53(12):841–849, 2014. [94] K. Swersky, J. Snoek, and R. P. Adams. Multi-task bayesian optimization. In NIPS 2013, pages 2004–2012. 2013. [95] Kevin Swersky, Jasper Snoek, and Ryan P Adams. Multi-task Bayesian optimization. In NIPS, pages 2004–2012, 2013. [96] MiM Taylor and C Douglas Creelman. Pest: Efficient estimates on probabil- ity functions. The Journal of the Acoustical Society of America, 41(4A):782–787, 1967. [97] Michalis K Titsias. Variational learning of inducing variables in sparse gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 567–574, 2009. [98] Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of machine learning research, 2(Nov):45–66, 2001. [99] Shashanka Ubaru, Jie Chen, and Yousef Saad. Fast estimation of tr (f (a)) via stochastic lanczos quadrature. SIAM Journal on Matrix Analysis and Applications, 38(4):1075–1099, 2017. [100] Marcel SMG Vlaming, Robert C MacKinnon, Marije Jansen, and David R 108 Moore. Automated screening for high-frequency hearing loss. Ear and hearing, 35(6):667, 2014. [101] Grace Wahba. Spline models for observational data. SIAM, 1990. [102] Z. Wang, M. Zoghi, F. Hutter, D. Matheson, and N. de Freitas. Bayesian op- timization in a billion dimensions via random embeddings. arXiv preprint arXiv:1301.1942, 2013. [103] Zheng Wang, Ming-Jun Lai, Zhaosong Lu, Wei Fan, Hasan Davulcu, and Jieping Ye. Orthogonal rank-one matrix pursuit for low rank matrix com- pletion. SIAM Journal on Scientific Computing, 37(1):A488–A514, 2015. [104] Charles S Watson, Gary R Kidd, James D Miller, Cas Smits, and Larry E Humes. Telephone screening tests for functionally impaired hearing: Cur- rent use in seven countries and development of a us version. Journal of the American Academy of Audiology, 23(10):757–767, 2012. [105] Herbert S. Wilf. Generatingfunctionology. A. K. Peters, Ltd., Natick, MA, USA, 2006. [106] C. Williams and C. Rasmussen. Gaussian processes for regression. In NIPS, 1996. [107] Victoria Williams-Sanchez, Rachel A McArdle, Richard H Wilson, Gary R Kidd, Charles S Watson, and Andrea L Bourne. Validation of a screen- ing test of auditory function using the telephone. Journal of the American Academy of Audiology, 25(10):937–951, 2014. [108] Andrew Wilson and Ryan Adams. Gaussian process kernels for pattern discovery and extrapolation. In ICML, pages 1067–1075, 2013. [109] Andrew G Wilson, Elad Gilboa, Arye Nehorai, and John P Cunningham. Fast Kernel Learning for Multidimensional Pattern Extrapolation. In NIPS, pages 3626–3634, 2014. [110] Andrew G Wilson, Zhiting Hu, Ruslan R Salakhutdinov, and Eric P Xing. Stochastic variational deep kernel learning. In Advances in Neural Informa- tion Processing Systems, pages 2586–2594, 2016. [111] Andrew Gordon Wilson. Covariance kernels for fast automatic pattern dis- 109 covery and extrapolation with gaussian processes. University of Cambridge, 2014. [112] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016. [113] Andrew Gordon Wilson and Hannes Nickisch. Kernel interpolation for scalable structured gaussian processes (kiss-gp). In ICML, pages 1775– 1784, 2015. [114] F Zhao, D Stephens, and C Meyer-Bisch. The audioscan: a high frequency resolution audiometric technique and its clinical applications. Clinical Otolaryngology & Allied Sciences, 27(1):4–10, 2002. [115] Fei Zhao and Dafydd Stephens. Analyses of notches in audioscan and dpoaes in subjects with normal hearing. International Journal of Audiology, 37(6):335–343, 1998. 110