Expressive Models In Online Learning
We study the online learning model: a widely applicable model for making repeated choices in an interactive environment. In standard online learning model (or an online learning problem), the decision-maker is provided with a set of alternatives, and selects one alternative in each of the T sequential trials, deriving a reward for each selection. After T trials, the total reward of the decision-maker is compared with the best "single-arm" strategy which has the maximum reward in hindsight. The difference between the reward of the best single-arm strategy and that of the algorithm is called the regret , and one seeks decision-making algorithms whose regret is sublinear in T and running time is polynomial in the problem size. In this thesis, we extend the basic online learning model in two important ways. In the first extension, we model sponsored search auctions as a multi-armed bandit problem (a type of online learning problem), and allow the alternatives (or advertisers in this case) to be strategic which can report possibly wrong rewards (in order to make personal gains). We seek to provide incentives to advertisers so as to get good solutions (socially efficient solutions). We prove that any socially efficient solution that provides right incentives to advertisers (being dominant strategy truthful) must suffer much higher regret than the regret suffered by algorithms for multi-armed problem without incentive issues. In the second extension, also motivated by sponsored search and resource selection in distributed systems, we allow the set of available alternatives to vary over time, provide a natural way to define the regret, and give policies for the decision-maker that suffer low regret. We also prove that the regret suffered by our policies is information-theoretically the lowest possible.
dissertation or thesis