SEQUENTIAL RESOURCE ALLOCATION UNDER UNCERTAINTY: AN INDEX POLICY APPROACH A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy by Weici Hu August 2017 ©c 2017 Weici Hu ALL RIGHTS RESERVED SEQUENTIAL RESOURCE ALLOCATION UNDER UNCERTAINTY: AN INDEX POLICY APPROACH Weici Hu, Ph.D. Cornell University 2017 We consider a class of stochastic sequential allocation problems - restless multi-armed bandits (RMAB) with a finite horizon and multiple pulls per period. Leveraging the La- grangian relaxation of the problem, we propose an index-based policy that uses the opti- mal Lagrange multipliers to index individual arms, and prove that the policy is asymptot- ically optimal as the number of arms tends to infinity. We also demonstrate numerically that this index-based policy outperforms state-of-the-art heuristics in several instances of RMAB. In addition, we study two other applications of sequential resource allocation problems which are extensions of the RMAB problem, and demonstrate how our index policy can be adapted to these settings. BIOGRAPHICAL SKETCH Born as a mainlander, Weici Hu spent most of her childhood in a southern city of China. She then went further south to Singapore to complete her secondary school and high school education. Tired of the tropical climate, she spent the past 8 years acquiring a college degree and trying to acquire a PhD degree in the Northeast of the United States. iii This document is dedicated to my mother. iv ACKNOWLEDGEMENTS I would like to first thank my advisor for his guidance and extreme patience and for not kicking me out of the program. I would also like to thank my parents for their non- interference policy which has allowed me to experience life in its truest states. Lastly I would like to thank two of my best pals in graduate school: Wei Qian and Kenneth Chong, for handing me water and energy bars during my Marathon race. v TABLE OF CONTENTS Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix 1 Introduction 1 2 An Asymptotically optimal Index policy for RMAB 7 2.1 Problem Description and Notation . . . . . . . . . . . . . . . . . . . . 7 2.2 Lagrangian Relaxation and Upper Bounds . . . . . . . . . . . . . . . . 9 2.3 An Index Based Heuristic Policy . . . . . . . . . . . . . . . . . . . . . 11 2.3.1 Pre-computations . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.2 Index policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.4 Proof of Asymptotic Optimality . . . . . . . . . . . . . . . . . . . . . 15 2.5 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.5.1 Multi-armed bandit . . . . . . . . . . . . . . . . . . . . . . . . 24 2.5.2 Project assignment problem . . . . . . . . . . . . . . . . . . . 26 2.5.3 Subset selection problem . . . . . . . . . . . . . . . . . . . . . 27 2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3 Parallel Bayesian Policies For Finite Multiple Comparisons With a Known Standard 30 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.3 Upper Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.4 Index Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.5 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4 Bayes-Optimal Effort Allocation in Crowdsourcing: Bounds and Index Policies 50 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.4 Dynamic Programming Formulation . . . . . . . . . . . . . . . . . . . 56 4.5 Upper Bound on the Bayes-Optimal Policy . . . . . . . . . . . . . . . 61 4.6 Index Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.7 Numerical Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.7.1 Simulation using simulated data . . . . . . . . . . . . . . . . . 68 4.7.2 Simulation using real data . . . . . . . . . . . . . . . . . . . . 71 4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 vi A 73 A.1 Notation of Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 A.2 Upper Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 A.3 Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 A.4 Show arg infλ∈RT P(λ) is non-empty . . . . . . . . . . . . . . . . . . . . 76 A.5 Proof of the existence of π∗∗ . . . . . . . . . . . . . . . . . . . . . . . 77 A.6 Proof of T∗maxs,a,t rt(s, a) upper bounds βt(s) . . . . . . . . . . . . . . 78 A.7 A result that justifies using bisection . . . . . . . . . . . . . . . . . . . 79 A.8 Proof of Lemma 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 A.9 Proof of Lemma 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 A.10 Lemma 14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 A.11 Proof of Lemma 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 A.12 Proof of Lemma 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 A.13 Proof of Lemma 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 A.14 Bellman’s recursion for T = ∞ . . . . . . . . . . . . . . . . . . . . . . 87 Bibliography 88 vii LIST OF TABLES A.1 List of notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 viii LIST OF FIGURES 2.1 Upper bound and simulation results of MAB . . . . . . . . . . . . . . 25 2.2 Upper bound and simulation result of project assignment problem . . . 27 2.3 Upper bound and simulation result of subset selection . . . . . . . . . 29 3.1 This figure illustrates how zn is chosen by the index-based policy with k = 2 systems, m = 2 parallel computing resources, and N = 20 sim- ulation batches. Figure (a) plots zλ2,2, the optimal number of samples to take in batch 3 from system 1 when it is in state (2,2), against the value of λ; Figure (b) plots zλ3,3, the optimal number of samples to take in batch 3 from system 2 when it is in state (3,3). Figure (c) plots zλ λ2,2 + z3,3, the optimal total number of samples to take across both sys- tems. The dashed line in (c) shows the constraint m = 2, and λ∗ will be the left endpoint of the solid line overlapping this dashed line. The ∗ ∗ number of samples taken from each system will be zλ λ2,2 = 1 and z3,3 = 1 respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.2 This figure shows the upper bound on the performance of the optimal policy for the MCS problem (dashed line with squares) normalized by dividing by k, as well as the estimated performance of two sub-optimal policies: the index policy from Section 4.6 (thinner lines and dots); and the equal allocation policy (thicker lines and dots). The setting pictured uses m = k, dx = 0.2, α0x = β0x = 1, cx = 0. We use 10,000 independent replications to estimate the value of the index policy, and 50,000 for the equal allocation policy. The plot shows that the index policy is substan- tially better than equal allocation, and is statistically indistinguishable from optimal given the number of replications performed. . . . . . . . 48 4.1 Semi-log plot of K against average per task reward (R/K) for K = 10, 102, 103. 69 4.2 Histogram of number of workers assigned to a task . . . . . . . . . . . 70 4.3 Semi-log plot of K against accuracy score for K = 10, 100, 750. . . . . . . . 72 A.1 Plot of Rounding-c . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 ix CHAPTER 1 INTRODUCTION We consider a general class of dynamic resource allocation problems with average-case criteria. Such problems enjoy a variety of applications in a wide range of industries. Examples include: • Facebook displays ads in the suggested posts section every time its users browse their personal pages. Among the ads that have been shown, some are known to attract more clicks than others. But there are also many ads which have yet to be shown and they may attract even more clicks. Given that the slots for display are limited, a policy is required to select ads to maximize total clicks. • In a multi-stage clinical trial, a medical group starts with a number of new treat- ments and an existing treatment with reliable performance. In each stage, a few treatments are selected from the pool to test, with the goal to identifying the new treatments that perform better than the existing one with high confidence. A strat- egy is required to select which treatments to test at every stage to most effectively support their judgment at the end of the trial. • A data analyst wishes to label a large number of images using crowdsourced effort from low-cost but potentially inaccurate workers. Each label given by the crowd- workers comes with a cost and the analyst has limited budget. Hence she needs to carefully assign tasks to workers so as to maximize the likelihood of correct labeling. We formulate such problems as instances of the restless multiarmed bandit (RMAB) problem [48] with a finite horizon and multiple pulls per period, which, in turn, is closely related to a broader class of problem called Weakly Coupled Dynamic Programs 1 (WCDP) [25]. In the RMAB, we have a collection of “arms”, each of which is endowed with a state that evolves independently. If the arm is “pulled” or “engaged’ in a time period then it advances stochastically according to one transition kernel, and if not then it advances according to a different kernel. Rewards are generated with each transition, and our goal is to maximize the expected total reward over a finite horizon, subject to a constraint on the number of arms pulled in each time period. The RMAB forms a generalization of the more famous multi-armed bandit (MAB) [41] by allowing arms that are not engaged to change state and multiple pulls per period. Theoretically an optimal solution of a RMAB can be obtained by leveraging the Bell- man equation and solving it as a dynamic program (DP) [40]. However, this approach becomes computationally infeasible when the state space grows large. In particular, the state space grows exponentially with the number of arms in a RMAB, and the number of arms are often large in practice. This approach therefore suffers the so-called “curse of dimensionality” [39]. Much research has been dedicated to efficiently finding “good” solutions. Gittin in 1979 proposed a tractable-to-compute optimal policy for the infinite horizon MAB with one pull per time period, which is famously known as the Gittins index policy [21]. This policy is appealing because it can be computed by considering the state space for only a single arm, making it computationally tractable for problems with many arms. This policy loses its optimality properties, however, when modifying the problem in any problem dimension: when allowing arms that are not engaged to change state; when moving to a finite horizon [6]; or when allowing multiple pulls per period. Thus, the Gittins index does not apply to our problem setting. While the RMAB is not known to have a computable optimal policy, [48] proposed a heuristic called the Whittle index for the infinite-horizon RMAB with multiple pulls per period, which is well-defined when arms satisfy an indexability condition. This policy 2 is derived by considering a Lagrangian relaxtion of the RMAB in which the constraint on the number of arms pulled is replaced by a penalty paid for pulling an arm. An arm’s Whittle index is then the penalty that makes a rational player indifferent between pulling and not pulling that arm. The Whittle index policy then pulls those arms with the highest Whittle indices. Appealingly, the Whittle index and the Gittins index are identical when applied to the MAB problem with a single pull per period. [48] further conjectured that if the number of arms and the number of pulls in each time period go to infinity at the same rate in an infinite-horizon RMAB, then the Whittle index policy is asymptotically optimal when arms are indexable. [46, 47] gave a proof to Whittle’s conjecture with a difficult-to-verify condition: that the fluid approximation has a globally asymptotically stable equilibrium point. This condition was shown to hold when each arm’s state space has at most 3 states, but this condition does not hold in general and [46] provides a counterexample with 4 states. Our contribution in this dissertation is to (1) create an index policy for finite horizon RMABs with multiple pulls per period, and (2) show that it is asymptotically optimal in the same limit considered by Whittle. Like the Whittle index, our approach is com- putationally appealing because it requires considering the state space for only a single arm, and its computational complexity does not grow with the number of arms. Un- like the Whitle index, our index policy does not require an indexability condition to be well-defined, and in contrast with [46, 47] our proof of asymptotic optimality holds re- gardless of the number of states, and does not depend on hard-to-verify conditions. We further demonstrate our index policy numerically on problems from the literature that can be formulated as finite-horizon RMABs, and show that it provides finite-sample performance that improves over the state-of-the-art. We also use our framework to de- velop policies for two major RMAB-like applied problems: Multiple comparison with a standard (MCS) in the field of simulation and crowdsourcing in the field of artificial 3 intelligence, and demonstrate numerically that our index-based policies out-perform the state-of-art using both real and synthetic data. In addition to building on [48, 46, 47], our work builds on the literature in weakly coupled dynamic programs (WCDP), that itself builds on RMABs. Indeed, at the end of his paper, Whittle pointed out that his relaxation technique can be applied to a more gen- eral class of problems in which sub-problems are linked by constraints on actions, but are otherwise independent. Hawkins in his thesis [25] formally termed these problems (but with a more general type of constraints) as WCDPs and proposed a general decou- pling technique. Moreover, he proposed a minimal-lambda policy for infinite horizon WCDPs which, like the Whittle index policy, is derived by considering a Lagrangian relaxation of the WCDP. The minimal-lambda policy finds the smallest Lagrange mul- tiplier so that the current optimal decision for the Lagrangian relaxation is also feasible for the original WCDP. The minimal-lambda policy then pulls arms according to this optimal decision. In the case of RMAB, the minimal-lambda policy is equivalent to the Whittle index with the smallest of the indices of all pulled arms for the former being the same as the smallest Lagrange multiplier to attain a feasible solution for the latter. Another major work in WCDP is [1] which shows that the ADP relaxation is tighter than the Lagrangian relaxation but is also computationally more expensive. It gives necessary and sufficient conditions for the Lagrangian relaxation to be tight and proves that the optimality gap is bounded by a constant when the Lagrange multipliers are allowed to be state dependent. The last result that the optimality gap is bounded by a constant implies that the per arm gap goes to zero as the number of arms grows. We achieve a similar result in the dissertation by showing the per arm reward of our index- based heuristic policy goes to the per-arm reward of the Lagrangian bound, despite that our Lagrange multipliers not being state-dependent. While there are similarities between 4 the two works, the focus differs: while our work focuses on offering an asymptotically optimal heuristic policy, [1] examines the ordering and tightness of different bounds. The heuristic proposed in [1] is based on an ADP technique, and is different from our index-based policy. Other work on WCDP also includes [52] who proposes an even tighter bound by incorporating an information relaxation on the non-anticipative constraints in addition to the existing relaxation methods. [24] considers two classes of large-scale WCDPs in which the state and action space in each sub-problem also grows exponentially and uses an ADP technique to approximate the value functions of individual sub-MDPs in addition to employing a Lagrangian relaxation for the overall problem. In this thesis, we use Chapter 2 to present the main theoretical work on RMAB and propose an index-based policy. Chapter 3 and 4 present applied problems that are similar to RMAB but with some variations, and index-based policies for these problems. Below is a brief outline of the chapters in the dissertation: • In Chapter 2, we consider the RMAB with a finite horizon and multiple pulls per period. Leveraging a Lagrangian relaxation, we approximate the RMAB with a problem that can be decomposed into a collection of single arm problems. We then propose an index-based policy that uses optimal solutions of the single arm problems to index individual arms, and offer a proof that it is asymptotically op- timal as the number of arms tends to infinity. We also use simulation to show that this index-based policy performs better than the state-of-art heuristics in various problem settings. • In Chapter 3, we consider the problem of multiple comparisons with a known stan- dard, in which we wish to allocate simulation effort efficiently across a finite num- ber of simulated systems, to determine which systems have mean performance 5 exceeding a known threshold. We suppose that parallel computing resources are available, and that we are given a fixed simulation budget. We consider this prob- lem in a Bayesian setting, and formulate it as a stochastic dynamic program. The set-up of the problem is the same as a RMAB except that every simulated sys- tem (the arms) is allowed to have multiple computing resources (the pulls) in a time period. For simplicity, we focus on Bernoulli sampling, with a linear loss function. Using links to restless multi-armed bandits, we provide a computation- ally tractable upper bound on the value of the Bayes-optimal policy, and an index policy motivated by these upper bounds. This chapter has been published in the proceedings of the 2014 Winter Simulation Conference. • In Chapter 4, we consider effort allocation in crowdsourcing, where we wish to as- sign labeling tasks to imperfect homogeneous crowd workers to maximize overall accuracy in a continuous-time Bayesian setting, subject to budget and time con- straints. The Bayes-optimal policy for this problem is the solution to a partially observable Markov decision process, but the curse of dimensionality renders the computation infeasible. Based on the Lagrangian Relaxation technique in [1], we provide a computationally tractable instance-specific upper bound on the value of this Bayes-optimal policy, which can in turn be used to bound the optimality gap of any other sub-optimal policy. In an approach similar in spirit to the Whittle index for restless multi-armed bandits, we provide an index policy for effort allo- cation in crowdsourcing and demonstrate numerically that it outperforms the state of the art and is near-optimal. This chapter has been published in the proceedings of the 2016 Artificial Intelligence and Statistics Conference. 6 CHAPTER 2 AN ASYMPTOTICALLY OPTIMAL INDEX POLICY FOR RMAB Chapter 2 presents the major theoretical work of this dissertation. In this chapter, Section 2.1 formulates the problem, Section 2.2 discusses the Lagrangian relaxation of the problem, Section 2.3 states our index-based policy and provides computational methods, Section 2.4 gives a proof of asymptotic optimality, Section 2.5 numerically evaluates our index policy, and Section 7 concludes. 2.1 Problem Description and Notation We consider an MDP (SK ,AK ,P,R) which is created by a collection of K sub-processes (S,A, P, r). The sub-processes are independent of each other except that the actions taken by each sub-process have to jointly satisfy some constraints at each time step. These sub-processes are also referred to as arms in the bandit literature and shall be indexed by x ∈ {1, ...,K}. Following a standard construction for MDPs, both the larger joint MDP and the sub-processes will be constructed on the same measurable space (Ω,F). Random variables on this measurable space will correspond to states, actions, rewards, and each policy will induce a probability measure over this space. We describe the MDP to consider formally as comprising: • The time horizon T < ∞. • The state space SK is the cross product of K sub-processes’ state space S, which is assumed to be finite. We use s = (s1, ..., sK) to denote an element in SK and S when the state is random. We also use St to emphasize that the state is at time t. Likewise, we use s to denote an element in S, and S or S t when it is random. 7 • The action space AK is the cross product of K sub-processes’ action space A = {0, 1}. We use a to denote a generic element of A, and A when it is random. We use a = (a1, a2, ..., aK) to denote a generic element in AK and A when it is random. In the context of bandit problems, a = 1 is called “pulling” an arm (sub-process). • ∑The reward function Rt : SK × AK →7 R for each 1 ≤ t ≤ T . Rt(s, a) =K x=1 rt(sx, ax), where rt(sx, ax) is the reward obtained by a sub-process when ac- tion ax is taken in state sx at time t. We assume rewards are non-negative and finite. ∏ • The transition kernel Pa(s′, s) = K ax ′x=1 P (sx, sx), where Pa(s′, s) is the probabil- ity of a sub-process transitioning from s′ to s if action a is taken, i.e., P(s|s′, a). The product implies that the K sub-processes evolve independently. RMAB differ from MAB in that MABs require P0(s, s) = 1 while RMABs allows P0(s, s) < 1. Since we are considering both cases, we do not restrict the value of P0(s, s). Next we describe the set of policies for our MDP problem. Since the state and action space defined above are finite, it is sufficient to consider the set of Markov policies Π [40]. Define a policy π ∈ Π as a function SK × AK × {1, ...,T } → [0, 1] that determines th∑e probability of choosing action a in state s at time t. Subse- quently we have Ka∈AK π(s, a, t) = 1, ∀s ∈ S ,∀1 ≤ t ≤ T . A policy π and the transition kernel P·(·, ·) together defines a probability distribution Pπ on all possible paths of the process {s1a1...sT : s K Kt ∈ S , at ∈ A }. Starting at a fixed state s1, i.e., Pπ(S1 = s1) = 1, we have the conditional distributions of St and At defined recursively by Pπ(S = s′t+1 |St = s,At = a) = Pa(s, s′) and Pπ(At = a|St = s) = π(s, a, t). The MDP we are considering allows exactly m sub-processes to be set active at each time step. Hence a feasible policy, π ∈ Π, has to satisfy Pπ(|At| = m) = 1, ∀t ∈ {1, ...,T }. Here we use | · | as an operator that sums all the elements in a vector. 8 The objective of our MDP is: maximize Eπ ∑T ( )Rt St,At ∈   π Π t=1 (2.1) subject to Pπ(|At| = mt) = 1, ∀1 ≤ t ≤ T. Since we will discuss other MDPs in the process of solving this one, (2.1) will be re- ferred to as the original MDP in the rest of the chapter to avoid confusion. For conve- nience, we summarize our notation in Appendix A.1. The original MDP (2.1) suffers from the “curse of dimensionality”, and hence solv- ing it is computationally intractable. In the remainder of the chapter we build a compu- tationally feasible index-based heuristics with a performance guarantee. 2.2 Lagrangian Relaxation and Upper Bounds In this section we discuss the Lagrangian relaxation of the original MDP and the cor- responding single process problem. These single process problems together with the Lagrange multipliers form the building blocks of our index-based policy, which will be formally introduced in Section 2.3. The Lagrangian relaxation considers an uncon- strained problem whose objective is obtained by augmenting the objective of (2.1):∑Tπ ( )P(λ) = maxE R S ,A  t t t − Eπ ∑T λt(|At| − m ∈ t) , (2.2)π Π t=1 t=1 for any λ = {λ1, ..., λT } ∈ RT . This unconstrained problem has the following property: Lemma 1. For any λ ∈ RT , P(λ) is an upper bound to the optimal value of the original MDP. [1] gave a proof to Lemma 1 using the Bellman equation. We provide a more 9 straightforward proof by viewing P(λ) as the Lagrange dual function of a relaxed prob- lem of the original MDP; see Appendix A.2. This Lagrangian relaxation then decomposes into K smaller MDPs, which we can easily solve to optimality. To elaborate on this idea of decomposition, we construct a sub-MDP problem based on tuple (S,A, P·(·, ·), r(·, ·)). Again we consider only the set of Markov policies, Π, for this problem. Similarly a policy π ∈ Π is a function that determines the probability of choosing action a in state s at time t, i.e., π : S × A × {1, ...,T } → [0, 1]. The sub-MDP starts at a fixed state s1. Subsequently we can define distributions of S πt and At under P in a similar manner as we did for St and At in the previous section. The objective of the sub-MDP is: π ∑T − Q(λ) = maxE rt(S t, At) λtAt . (2.3) π∈Π t=1 We are now ready to present the decomposition of the Lagrangian relaxation. Lemma 2. The optimal value of the relaxed problem satisfies ∑ P(λ) = KQ(λ) + mtλt, (2.4) t [1] also gave a proof to Lemma 2, and we again provide a different proof in Appendix A.3. Since the state space of the sub-MDP is much smaller, we can solve it directly by using backward induction and the optimality equation. The existence of such an optimal Markov deterministic policy follows from that the state and action spaces of the sub- MDP being finite [40]. Let Π∗(λ) be the set of optimal Markov deterministic policies of the sub-MDP for a given λ. The relaxed problem can be solved by combining the solutions of individual sub-MDPs, that is,∏we can construct an optimal policy of the relaxed problem πλ by setting πλ(s, a, t) = K λ λx=1 π (sx, ax, t), where π is an element in Π∗(λ). Moreover, P(λ) is convex and piecewise linear in λ [1]. 10 2.3 An Index Based Heuristic Policy Our index based heuristic policy assigns an index to each sub-process, based upon its state and current time. At each time step, we set active the m sub-processes with the highest indices. Before carrying out the process of sequential decision-making, our index policy calls for pre-computation of 1) λ∗ ∈ arg infλ P(λ), as defined in Section 2.2; 2) a set of indices, β, that will later be used for decision-making at every time step; 3) an optimal policy π∗∗ for the sub-MDP problem in (2.3). In the first part of this section we discuss how we carry out such computations. 2.3.1 Pre-computations Dual optimal λ∗ We use subgradient descent to solve infλ P(λ), which converges to its solution λ∗ by convexity of λ 7→ P(λ) (Theorem 7.4 in [42]). By (2.3) and (2.4), a sub-gradient of P(λ) λ with respect to λ is given by (−KEπ [A λt] + m : 1 ≤ t ≤ T ), where π is any policy in Π∗(λ). To compute this sub-gradient, we compute a policy πλ in Π∗(λ) and then use exact λ computation or simulation with a large number of replications to compute Eπ [At]. To compute a policy in Π∗(λ), we first compute the value function Vλ : S × {1, ...,T } 7→ R of sub-MDP Q(λ). We accomplish this using backward induction [40]: Vλ(s, t) = maxa∈A{rT (s, a) − aλT } ∑ if t = T , (2.5)maxa∈A{rt(s, a) − aλ + Pat s′∈S (s, s′)Vλ(s′, t + 1)} otherwise. Recalling that Π∗(λ) includes only deterministic policies, a policy πλ in Π∗(λ) are constructed by determining for each s and t the action a whose one-step lookahead value 11 ∑ rt(s, a)−aλ at+ s′∈S P (s, s′)Vλ(s′, t+1) is equal to Vλ(s, t), and then setting πλ(s, a, t) = 1 for this a. For those s and t for which both actions a have one-step lookahead values equal to Vλ(s, t), one may set πλ(s, a, t) = 1 for either such action. Thus, the cardinality of Π∗(λ) is 2 raised to the power of the number of s, t for which the one-step lookahead values for playing and not playing are tied. When we construct a policy in Π∗(λ) for the purpose of computing a sub-gradient of P(λ), we choose to play in those s, t with tied one-step lookahead values. While our subgradient descent algorithm would converge for other choices, making this choice better supports computation of indices in section 2.3.1. Indices βt(s) Define the vector v[a, t] to be v + (a − vt) ∗ et, that is, the vector v with the tth element replaced by a ∈ R. We define the index of state s ∈ S at time t as βt(s) = sup{β : ∃ π ∈ Π∗(λ∗[β, t]) s.t. π(s, 1, t) = 1}. (2.6) Instead of computing the entire set Π∗(λ∗[β, t]), we only need to compute a policy in Π∗(λ∗[β, t]) using the method discussed in section 2.3.1, i.e., always choose the active action when there are ties. Intuitively, this index is the maximum price we are willing to pay to set a sub-process active in state s at t. By leveraging the monotonicity of optimal actions with respect to rewards, as shown in Lemma 13 in Appendix A.7, we compute βt(s) via bisection search in the interval [0,U], where U upper bounds the largest possible value of βt(s). For example, we can set U as T ∗ maxs,a,t rt(s, a) when λ∗ ≥ 0 (we show in Appendix A.6 that βt(s) cannot be greater than this value in this case). We pre-compute the set β = {βt(s) : s ∈ S, 1 ≤ t ≤ T } before running the actual algorithm. 12 Occupation measure ρ∗ and its corresponding optimal policy π** Our tie-breaking policy involves constructing an optimal Markov policy π∗∗ for the sub- ∗ π∗∗MDP Q(λ ) such that E [At] = mK , ∀1 ≤ t ≤ T . The existence of π∗∗ is shown in Appendix A.5. To compute π∗∗, we borrow the idea of the occupation measure [18]. Define the occupation measure, ρ(s, a, t), induced by a policy π to be the probability of being in state s and taking action a at time t under π. Subsequently π∗∗ can be computed by solving the following linear program (LP): ∑T ∑∑ max ρ(s, a, t)rt(s, a) {ρ(s,a,t):y∈S,a∈A,t∈{1,...,T }} ∑t=1 a∈A s∈S m subject to ρ(s, 1 t, t) = ,∀t = 1. . . . ,T ∑∈ Ks S ∑∑ ρ(s, a, t) − ρ(s′, a, t − 1)Pa(s′, s) = 0, a∈A a∈A s′∈ (2.7)S ∑ ∀s ∈ S, 2 ≤ t ≤ T ρ(s, a, 1) = 1(s = s1) ∀s ∈ S a∈A ρ(s, a, t) ≥ 0, ∀s ∈ S, a ∈ A, t = 1, . . . ,T, ∗∗ The first constraint ensures that Eπ [At] = mK . The second constraint ensures flow bal- ance. The third∑constraint shows that we start at state s1. The second and third constraint together imply a∈A,s∈S ρ(s, a, t) = 1, i.e., that (ρ(s, a, t) : a ∈ A, s ∈ S) is a probability distribution for each t. The fourth and fifth constraints ensure that ρ is a valid probability measure. Let ρ∗ be an optimal solution to (2.7). π∗∗ can then be constructed by ∑ ρ∗ (s,a,t) ∑  ∗ , if ρ ∗(s, a, t) > 0  a∈A ρ (s,a,t) ∑ a∈Aπ∗∗(s, a, t) = 1(a = 1), if ∗ ∑a∈A ρ (s, a, t) = 0 and β (s) ≥ λ ∗ t (2.8)  t 1(a = 0), if ∗a∈A ρ (s, a, t) = 0 and βt(s) < λ∗t , 13 for all s ∈ S, a ∈ A, 1 ≤ t ≤ T. Here we also make an observation thatλ∗ ∈ arg min P(λ∗) is the optimal dual variable corresponding to the first constraint in 2.7. 2.3.2 Index policy Let {βt(S t,x) : x ∈ {1, ...,K}} be the indices associated with the K sub-processes at time t. We define β̄t(St) to be the largest value β in {βt(S t,x) : x ∈ {1, ...,K}} such that at least m sub-processes have indices of at least β. Our index policy then sets the actions of sub- processes with indices strictly greater than β̄t to 1 (active), and those with indices strictly less than β̄t to 0 (inactive). When more than m sub-processes have indices greater than or equal to β̄t, a tie-breaking a rule is needed. For simplicity, in the following discussion we use the term remaining resources to refer to the remaining number of the arms to be set active after we activate all the arms with indices greater than β̄t(St), and use It = {S t,x : 1 ≤ x ≤ K, βt(S t,x) = β̄t(St)} to denote the set of states occupied by the sub-processes with tied indices. Our tie-breaking rule allocates the remaining resources across It according to the probability distribution induced by π∗∗ over S at time t. More specifically, we allocate  ∗ ∑ρ (S t,x,1,t) ∑ ∗ ′  s′∈ ∗ ′I ρ (s ,1,t) , if s′∈I ρ (s , 1, t) > 0 t tqt(S t,x) =  (2.9)∑ Nt(S t,x)′ N (s′) , otherwises ∈It t fraction of the remaining resources to each of the state in It, where Nt(s) denote the number of sub-processes in state s at time t. We then use the function Rounding(total, frac, avail) in Algorithm 2 to deal with situations where the products between the desired fractions and the remaining resources 14 are not integers. Here total represents the number of remaining resources, f rac is a vector of the fractions of the remaining resources to be approximated allocated to each tied state, and avail is a vector of the number of sub-processes in each tied state. The function also allows the number of sub-processes in a tied state s to be less than the number of resources we would like to assign to s according to the fraction in (2.9). We note the following property of this function Rounding, which we will rely on in our proof in Section 6. Remark 1. When total, avail, frac satisfy availi ≥ total ∗ fraci, the output vector b = Rounding(total, frac, avail) satisfies |bi − total ∗ fraci| < 1 for all i. This tie-breaking ensures asymptotic optimality of the index policy as it enforces that the fraction of sub-processes in each state s is equal to the distribution induced by π∗∗ in the limit. This idea shall become clear in Section 2.4 where the proof of asymptotic optimality is presented. We formally present our index policy in Algorithms 1 and 2. 2.4 Proof of Asymptotic Optimality Our index policy π̂ achieves asymptotic optimality when we let the number of sub- processes K go to infinity, while holding α mtt = K constant for all t. Let Z(π,m,K) to denote the expected reward of the original MDP obtained by policy π with K sub- processes and m = (m1, ...,mT ) constraints at each time. We use Πm,K to denote the set of all feasible Markov policies for the such an MDP. Lastly, it should be understood that whenever we use π̂ to denote our index policy there is a dependency of π̂ on m and K that is not explicitly stated. We are now ready to state the main result of this chapter, 15 Algorithm 1: Index Policy π̂ Pre-compute: λ∗; β; ρ∗. (Refer to section 2.3.1 for computational details) for t = 1, ...,T do Let βt,[i] be the ith largest element in the list βt(S t,1), ..., βt(S t,K), so βt,[1] ≥ . . . ≥ βt,[K]. Let β̄t = βt,[m] Let It = {s : βt(s) = β̄t and s = S t,x for some x} Let Nt(s) = |{x : S t,x = s}|, for all s. For s ∈ It, let  ∑ ρ∗(s,1,t) ∑, if ρ∗ ′ ′ ∗(s′ 1 t) s′∈I (s , 1, t) > 0ρ , , tqt(s) =  s ∈It ∑ Nt(s) ∑ ′s′∈It Nt(s ) , otherwise Let b = Rounding(m − ′s′:βt(s′)>β̄ Nt(s ), (qt(s) : s ∈ It), (Nt(s) : s ∈ It))t for all s do If βt(s) > β̄t, set all Nt(s) sub-processes in s active. If βt(s) = β̄t, set b(s) sub-processes in s active. If βt(s) < β̄t, set 0 sub-processes in s active. end for end for which shows that the per arm gap between the upper bound and the index policy goes to zero under the limit assumption.: Theorem 1. For any α ∈((0, 1) T , ) 1 lim Z(π̂, bαKc,K) − max Z(π, bαKc,K) = 0, (2.10) K→∞ K π∈ΠbαKc,K where bαKc = (bα1Kc, ..., bαT Kc) 16 Algorithm 2: Rounding(total, frac, avail) ∑ Inputs: total (a scalar), frac (a vector∑satisfying i fraci = 1), avail (a vector of the same length as frac satisfying total ≤ i availi) ∑ Output: b (a vector of the same length as the inputs satisfying i bi = total, bi ≤ availi) Let n = length(frac) Let bi = min{availi, btotal ∗ fracic}, for i = 1, ..., n. Let j = 1 ∑ while total > ni=1 bi do Let b j = b j + 1(avail j > b j) Let j = ( j mod n) + 1 end while return b We first point out that the optimal solutions of maxπ∈Πb Kc K Z(π, bαKc,K) are trivialα , when α = 0 or 1. So we do not include these two cases when we consider convergence of the index policy π̂. To formalize the notations that will be used throughout the proofs, we augment P(λ) to P(λ,m,K) to indicate the values of m and K assumed in the Lagrangian relaxation. We use λ∗ to denote one and any element in arg infλ P(K,αK, λ) and let π∗∗ π∗∗be the optimal policy constructed in (2.8) using mt = αtK, which satisfies E (At) = αt for all t. Note λ∗ and π∗∗ depend on only α (not on K). As before, we let Nt(s) be the number of sub-processes in state s at time t under π̂. We additionally define Mt(s) to be the number of sub-processes in state s at time t that are set active by π̂. These quantities depend on K and m, but for simplicity we do not include this dependence in the notation. We always assume m = bαKc and we rely on context to make clear the value of K assumed. We also define Vt(s) to be the set of states with the same index value as s, including s, and Ut(s) to be the set of states with index 17 value greater than that of s, for each time t. These quantities depend on α but not on K or m. We prove Theorem 1 by first demonstrating below in Theorem 2 that for each time t, the proportion of the sub-processes that are in state s under our index policy π̂, Nt(s)K , approaches Pt(s) as K → ∞. In other words, our index policy π̂ recreates the behavior of π∗∗ in the large K limit. Theorem 2. For every s ∈ S and 1 ≤ t ≤ T , Nt(s)lim = Pt(s), Pπ̂ − a.s., (2.11) K→∞ K and M (s) lim t = P (s) ∗ π∗∗t (s, 1, t), Pπ̂ − a.s., (2.12) K→∞ K Before proving Theorem 2, we first present two intermediate results, whose proofs are given in Appendix A.8 and A.9. Lemma 3. At time 1 ≤ t ≤ T , for all s ∈ S , we have (1) If βt(s) > λ∗t , then π ∗∗(s, 1, t) = 1. (2) If βt(s) < λ∗t , then π ∗∗(s, 1, t) = 0. Lemma 4. For any state s ∈ S and time 1 ≤ t ≤ T , ∑ (1) If αt −∑s′∈U (s)∪V (s) Pt(s ′) ≥ 0, then π∗∗(s, 1, t) = 1. t t (2) If αt − ′s′∈U (s) Pt(s ) ≤ 0, then π∗∗(s, 1, t) = 0.t Now we are ready to prove Theorem 2. 18 Proof. We prove (2.11) and (2.12) simultaneously via induction over the time periods. When t = 1, all sub-processes starts in state s1, and we have N1(s) Klim = lim = 1 = P1(s) if s = s1, K→∞ K K→∞ K N1(s) 0lim = lim = 0 = P1(s) otherwise. K→∞ K K→∞ K By the set-up of the original MDP, M1(s) = bα ∗ Kc, and we have M1(s) bα ∗ Kclim = lim = α = π∗∗(s, 1, t) = P1(s) ∗ π∗∗(s, 1, t), if s = s1, K→∞ K K→∞ K M (s) 0 lim 1 = = 0 = P (s) ∗ π∗∗1 (s, 1, t), otherwise, K→∞ K K so we have proved the base case of the induction. Now assume (2.11) and (2.12) hold up until time t. Fix a state s ∈ S and time 1 ≤ t ≤ T , define Y ′t(s , s) to be the number of sub-processes set active by π̂ in s′ at time t which transition to state s at time t + 1, and Xt(s′, s) to be the number of sub-processes set inactive by π̂ in s′ at time t which transition to s at time t + 1. Note that Y ′t(s , s) and X ′t(s , s) also depend on K. We can subsequently express Nt+1(s) as∑ N ′ ′t+1(s) = Yt(s , s) + Xt(s , s). s′∈S Dividing both sides by K, and taking K to a limit, we get Nt+1(s) ∑ 1 ∑ 1lim = lim Y ′t(s , s) + lim Xt(s′, s). (2.13) K→∞ K K→∞ ′ K K→∞∈ ′∈ Ks S s S Note Y (s′, s) is a binomial random variable with M (s′t t ) trials and success probability P1(s′, s). Similarly, X ′t(s , s) is a binomial random variable with Nt(s′)−Mt(s′) trials and success probability P0(s′, s). We can rewrite the RHS of (2.13) by applying Lemma 14, 19 which is stated in Appendix∑A.10:N ′ lim t+1 (s) Mt(s ) = lim ∗ P1(s′, s) (2.14) K→∞ K ∑ K→∞ Ks′∈S N (s′t ) − M ′lim t(s )+ ∗ P0(s′, s) s∑′ K→∞ K x∈S = Pt(s′) ∗ π∗∗(s′, 1, t + 1) ∗ P1(s′, s) (2.15) s′∈S∑ + Pt(s′)(1 − π∗∗(s′, 1, t + 1)) ∗ P0x(s′, s) a.s. s′∈S = Pt+1(s). a.s. (2.16) The last equality follows as we have exhausted all the ways of getting to s at time t + 1. Hence we have shown (2.11) holds for time t + 1. Next we show (2.12) holds for time t + 1. We define sets Pt = {Pt(s) : s ∈ S}, and Nt = {Nt(s) : s ∈ S}. Recall Vt(s) is the set of states with the same index value as s, including∑s, and Ut(s) is the set of states with index value greater than that of s, we let N+(s) = N (s′ ∑ t s′∈U (s) t ) and N = t (s) = ′ s′∈V (s) Nt(s ). We use notation Nt K for the set whicht t consists of all elements in Nt divided by K. Lastly we use the function fs(Nt, bαtKc) to represent the number of sub-processes set active at time t in state s: fs(Nt, bαtKc) =1([bαtKc−N+t (s)]+≥N=t (s)) ∗ Nt(s) +1([bα Kc−N+(s)]+ ni=1 bi do ∑ Let t = max{t ≥ 0 : b + ti |L| ≤ availi, ∀i ∈ L and n i=1(b + t i |L| ) ≤ total} Let b = {b ti + 1(i∈L) |L| : 1 ≤ i ≤ n} Let L = {i : 1 ≤ i ≤ n, bi < availi} end while return b Moreover, we use f̄s(Nt, bαtKc) =1([bα + =tKc−N + ∗ Nt (s)] ≥Nt (s)) t(s) +1([bα Kc−N+t (s)]+ 1. 43 Hence we can compute∑(3.14). Finally, since B(S0, λ) is the sum of the single-system values Vλ N0,x(S 0,x) and m n=1 λn, we have the following subgradient of B(S0, λ): ∑k ∂Vλ0,x(S 0,x) + m ∈ ∂B(S0, λ) (3.16) ∂λ x=1 n Equation (3.16) allows us to compute the upper bound (3.13) by first-order convex opti- mization. 3.4 Index Policy In this section, we describe an index-based policy based on the same decomposition used to develop the upper bound. The intuition behind this policy is based on an unproven conjecture, but the policy is well-defined whether or not this conjecture is true. We demonstrate in numerical experiments in Section 2.5 that this policy performs well. This index-based policy considers the relaxed problem (3.2) with a Lagrange mul- tiplier λ = λe, where e = (1, . . . , 1) is the vector of all 1s and λ is a real number. This index-based policy is based on the intuition that, for any state S n,x, as we increase λ and thus λ we should see that an optimal policy corresponding to λ should take fewer samples, because the samples are more expensive. While we conjecture that this is true, and our numerical experiments support it, we have not confirmed it theoretically. To calculate the number of samples taken in a given state, our index-based policy varies λ until we find a value in which an optimal policy for the relaxed problem takes m samples (or, if no such λ exists, it should take as many samples as possible without taking more than m). We then sample according to the optimal policy for this λ. When we get a new state, we repeat this process, finding a new λ vector, and a new sampling allocation. 44 We define this index-based policy more formally as follows. We first introduce some notation. Let Qλ be the set of single-arm policies that are optimal for (3.3) at the given value of λ. Let zπn,x(S n−1,x) be the number of samples taken under a single-arm policy π at time n in state S n−1,x. At each time step n = 1, . . . ,N, this policy computes zn based on Sn−1 in the follow- ing way: { } 1. Let zλ π λn,x(S n−1,x) ∈ zn,x(S n−1,x) : π ∈ Q be the number of samples taken under an optimal single-arm policy with the given set of Lagrange multipliers λ, breaking ties arbitrarily. 2. Let λ∗ { ∑ } = inf λ : x zλn,x(S n−1,x) ≤ m, λ = λe . 3. Set λ∗ = λ∗e. ∗ ∑ 4. Let z πn,x ∈ {zn,x(S ) : π ∈ Qλn−1,x }, so as to satisfy kx=1 zn,x ≤ m, breaking ties arbitrarily between different allocations zn that satisfy this constraint. This index-based policy leaves free the tie-breaking rule used when choosing zλn,x(S n−1,x), and also when choosing zn,x in the final step. We conjecture that the first tie-breaking rule has no impact on the value for λ∗, although we have not confirmed this theoretically. We conjecture that the best tie-breaking rule used∑to choose zn,x in the final step is one that minimizes the num∑ber of unused cores, m − kx=1 zn,x, and that it is always possible to find one with m = kx=1 zn,x, but again we have not confirmed this theoretically. Figure 3.1 illustrates the above procedure with two systems and two parallel re- sources. Figure 3.1(a) and 3.1(b) show how the optimal number of samples varies with the value of λ for each system, given their current states (2, 2) and (3, 3). Figure 3.1(c) 45 3 3 2 2 1 1 0 0 −1 −1 0 0.1 0.2 0 0.1 0.2 λ λ (a) (b) 5 4 3 2 1 0 −1 0 0.1 0.2 λ (c) Figure 3.1: This figure illustrates how zn is chosen by the index-based policy with k = 2 systems, m = 2 parallel computing resources, and N = 20 simulation batches. Figure (a) plots zλ2,2, the optimal number of sam- ples to take in batch 3 from system 1 when it is in state (2,2), against the value of λ; Figure (b) plots zλ3,3, the optimal number of samples to take in batch 3 from system 2 when it is in state (3,3). Figure (c) plots zλ2,2 +z λ 3,3, the optimal total number of samples to take across both systems. The dashed line in (c) shows the constraint m = 2, and λ∗ will be the left endpoint of the solid line overlapping this dashed line. ∗ The number of samples taken from each system will be zλ2,2 = 1 and ∗ zλ3,3 = 1 respectively. shows the optimal total number of samples across both systems. Since there are two parallel resources, the constraint is m = 2. Hence in this case λ∗ is the left end point of the interval at height zλ2,2 + z λ 3,3 = 2. 46 zλ 2,2 zλ +zλ 3,3 2,2 zλ 3,3 3.5 Numerical Results In this section, we present numerical results illustrating the upper bound and the index- based policy, as well as the baseline equal allocation policy. We consider 4 different value of k: k = 2, 4, 8, 16. For each value of k, we set the time horizon N = 5, the threshold value dx = 0.2 for all x ∈ {1, ..., k}, and the number of parallel computing resources m = k. We set the initial state to be the same for every system: S 0,x = (1, 1). We first calculated the upper bounds according to Theorem 3 for each value of k. The squared dots in Figure 3.2 represents the values of the upper bounds (shown on a scale in which we divide by k). We then simulated the index- based policy described in Section 4.6 for 10000 iterations respectively for k = 2, 4, 8, 16 respectively. The thinner lines in Figure 3.2 show 95% confidence intervals for the mean performance of this policy. As a baseline, we also simulate the equal allocation policy in which the m parallel computing resources are distributed equally to the k systems. The equal allocation policy is simulated for 50, 000 replications for each value of k, and 95% confidence intervals for the mean performance are shown as the thicker lines in Figure 3.2. Confidence intervals for the index policy are wider than those for the equal allocation policy because fewer samples are taken when estimating the expected value of the index policy. This is because each simulation of the index policy takes a substantial amount of time in our current implementation. For the same reason, the index policy’s confidence intervals still overlap the upper bounds. If we took more samples, we expect that the upper limit of the confidence interval would eventually fall below the upper bound, because we do not think that our policy is optimal. Nevertheless, our results show that the index policy performs substantially better than the equal allocation policy in the 47 0.334 Upper Bound Index Policy 0.332 Equal Allocation Policy 0.33 0.328 0.326 0.324 0.322 0.32 0.318 0 2 4 6 8 10 12 14 16 18 Number of Alternatives (k) Figure 3.2: This figure shows the upper bound on the performance of the optimal policy for the MCS problem (dashed line with squares) normalized by dividing by k, as well as the estimated performance of two sub- optimal policies: the index policy from Section 4.6 (thinner lines and dots); and the equal allocation policy (thicker lines and dots). The setting pictured uses m = k, dx = 0.2, α0x = β0x = 1, cx = 0. We use 10,000 independent replications to estimate the value of the index policy, and 50,000 for the equal allocation policy. The plot shows that the index policy is substantially better than equal allocation, and is statistically indistinguishable from optimal given the number of repli- cations performed. setting studied. In the figure, the upper bound, divided by k, initially increases, and then levels out. This can be understood as follows. Because our initial state S 0,x[is identical for e∑ach sys]- tem, our upper bound 3.13 can be rewritten as UB(S0) = inf λ Nλ≥0 kV0,x(S 0,x) +[ k n=1 λn . w∑here x] is any arbitrary x. Dividing by k provides 1k UB(S0) = infλ≥0 Vλ0,x(S 0,x) +N n=1 λn . Vλ0,x does not depend on k directly, but it does depend on m which is the constraint on the maximum number of samples that can be taken from system x in any batch, and 48 Expected Total Reward / k we set m = k in our experiments. Thus, when we increase k, we loosen this constraint. We believe this is why 1k UB(S0) increases initially with k. Then, as k grows large, this constraint is no longer binding, as the optimal value of λ causes us to take less than m = k samples in each batch. We believe this is why 1k UB(S0) levels off as k becomes large. 3.6 Conclusion We offered a computationally feasible way to obtain the upper bound on the total ex- pected reward of the finite-horizon MCS problem through Lagrangian relaxation. We then proposed an index-based policy using this Lagrangian relaxation. Using the upper bound as a reference, we showed this index policy performs close to optimal by running numerical experiments on a specific set of parameters. 49 CHAPTER 4 BAYES-OPTIMAL EFFORT ALLOCATION IN CROWDSOURCING: BOUNDS AND INDEX POLICIES In this chapter, we consider another RMAB-like problem: classifications with crowdsourcing. In this problem, one has a group of classification tasks and wishes to hire crowd-workers to determine, for each among a finite pool of simulatable systems, which ones have an expected output measure that exceeds a known threshold. In this chapter, we use Bayesian statistics and dynamic programming to study how one should allocate simulate effort in the MCS problem, so as to best support this final determi- nation. This chapter is organized as follows: In Section 4.1 and 4.2, we introduce the background of the problem and give a literature review; In Section 4.3 and 4.4, we for- mally state the problem and formulate the problem as a dynamic program; In Section 4.5, we provide a computationally tractable upper bound on the value of a Bayes-optimal procedure; In Section 4.6 we present an index-based heuristic policy derived from the computation of the upper bound, and which is similar to the general index-based policy proposed in Chapter 2; In Section 4.7, we present numerical results in which we demon- strate that this index-based policy performs close to an optimal policy and outperforms the state-of-art; Lastly we make a conclusion in Sections 4.8. 4.1 Introduction Crowdsourcing can accomplish large-volume tasks such as image classification or doc- ument relevance assessment by using large pool of amateur workers at much less ex- pense than is possible by hiring experts or by developing an automatic machine learn- ing method [29]. Moreover, online platforms such as Amazon Mechanical Turk make crowdsourcing service widely accessible by providing a marketplace in which requesters 50 may post tasks, which crowd-workers may complete in exchange for money. These fac- tors are making crowdsourcing increasingly important. Although crowdsourcing is less expensive than hiring experts, the number of images or other tasks that a requester can correctly label or process is nonetheless limited by his or her budget. This fact is compounded by the noise and variability inherent to crowd- workers’ responses, which typically requires a single item to be processed independently several times by multiple workers. In this part of the dissertation, our goal is to find a sequential allocation of workers to tasks that most accurately supports a correct aggregated label for each task, subject to a limited budget (which in turn limits the number of workers that a requester can hire) and a limited time horizon. In this chapter we focus on binary labeling tasks, but our approach can also be extended to multi-class labeling. Intuitively, much can be accomplished through a sophisticated allocation of worker effort: When budgets are large relative to the overall difficulty of the tasks to be ac- complished, a good scheme should allocating more workers to those tasks that are more difficult, so that uniform quality can be ensured. When budgets are small, however, those most difficult tasks should be abandoned so that the bulk of the budget can be used to ensure that at least those easy tasks are done correctly. We adopt a Bayesian approach, which is natural in crowdsourcing because: 1) It allows us to leverage prior information about the tasks to be accomplished, which may be learned in the crowdsourcing setting from features associated with each task and the typically large collections of historical data collected in previous crowdsourcing cam- paigns; 2) It seeks to maximize average-case performance with respect to the prior distri- bution, which is natural in crowdsourcing where requesters typically tolerate some vari- 51 ability in quality, and are most interested in maximizing aggregate performance across a large volume of tasks, rather than ensuring robustness to some worst-case distribution over task characteristics, or studying asymptotic behaviors that do not become relevant until the number of workers working on each task grows large. Within this Bayesian framework, we formulate and study sequential effort allocation as a partially observable Markov decision process, using tools from dynamic program- ming. While the curse of dimensionality [38] prevents solving this dynamic program to optimality, we provide a computationally tractable upper bound on the expected per- formance under any Bayes-optimal effort allocation policy. Upper bounds are useful because they allow evaluating the optimality gap for any given heuristic on any prob- lem instance, simply by simulating the heuristic and comparing its performance to the bound. The technique we use to obtain such upper bound is the Lagrangian Relaxation on weakly coupled dynamic programs discussed in [1] and [25]. The proofs we present in Section 4.5 are very similar in spirit to [1], but while Adelman based his proof on the value functions of the DP formulation in a infinite horizon setting, we offer a proof based on the initial objective function of the problem in a finite horizon setting. Nonetheless, our crowdsourcing model is a specific application of the more general formulation in [1] and [25]. Then, using Lagrange multipliers that appear in this upper bound, we de- rive an index-based heuristic policy that is similar in spirit to the Gittins index policy for multi-armed bandits [22] and the Whittle index policy for restless bandits [48]. We then show that this index policy has performance close to the upper bound in numerical experiments, and also outperforms other state-of-art policies for resource allocation . Although the primary novelty and contribution of our work is that it is the first to characterize the performance of the Bayes-optimal policy for effort allocation in crowd- sourcing, and to develop Bayesian bandit-style index policies, our work is also novel 52 is modeling the asynchronous nature of crowd-work in a continuous-time setting, in contrast with previous work on effort allocation in crowdsourcing that assumed instant completion of tasks [51], [11], [28]. This model is inspired by how crowd-workers are employed on Amazon Mechanical Turk; allowing an asynchronous process thus gives a closer proximity to the real situations. 4.2 Related Work There are two major strands of former works to which our work is related. The first is the work on effort allocation and crowd labeling. Much of this work adopts a fre- quentist viewpoint and focuses on error bounds for inference [30, 20, 29, 45, 27]. [30] proposed an allocation algorithm based on a random graph, and while its performance asymptotically order-optimal, one needs a very large number of workers to make this relevant. [45] incorporates a limited budget, but lacks the notion of optimality. None of the work above considers a finite time horizon. There is also work with more of a Bayesian flavor([51, 4]). While they focused on the efficiency of allocation, they did not consider an optimal solution. Among the work that adopt a Bayesian framework, our work is similar to [10] in that we both form an optimal policy in the form of a stochastic dynamic program. Although they also provide a well-motivated heuristic policy, our work pushes further by deriving an upper bound based on this formulation of optimal policy. The second strand resides in the literature of Multi-armed bandit (MAB) and stochastic dynamic programming. The formulation a Bayesian-optimal procedure as a dynamic program is considered in [34, 35]. Our use of Lagrangian relaxation is an application of the relaxation method of weakly coupled dynamic program discussed in 53 [1]. The setting in this work differs from the previous works by that only one task is to be assigned when a worker enters and the completion of task is not instant. The index-based policy proposed in this work, which uses Lagrangian Multipliers to assign indices, draws inspiration from [48]. 4.3 Problem Statement We consider a requester of crowdsourcing service with K independent binary labeling tasks. Due to a budget constraint, the requester allows a maximum of U workers to work on these tasks, and requires all work to be completed by a time horizon T . We model the arrival of workers to the crowdsourcing system by a Poisson process with rate r. (Our model can be generalized to non-homogeneous Poisson processes with little additional effort.) As each worker enters the system, the requester selects one of the K tasks for the worker to label. We let z` ∈ {1, . . . ,K} indicate the task assigned to the `th worker. (We use [K] to denote {1, . . . ,K} for the rest of the paper.) The worker spends a random Exponential(µ) amount of time on the task x, independent of all else, and then provides a binary label y`. Workers do not always give the correct label because the task may be ambiguous and thus hard to categorize, or workers may be careless or lack background information when they conduct the labeling process. We suppose that workers are “homogeneous” (a term used in [11]), and give noisy but unbiased labels. More specifically, each task x has an associated unknown value θx ∈ [0, 1], which is the underlying probability that it will be labeled as positive by a worker. The distribution of the label generated by the `th 54 worker given θ1, ...θK and z` is y`|θ1:K , z` ∼ Bernoulli(θz ). (4.1)` We set a known threshold value dx, and consider the label for task x being positive if θx > dx. We let B = {x : θx > dx} be the set of tasks whose correct label is positive. Note B is unknown as θx are unknown. For analytical convenience we use the Beta distribution, which is the conjugate prior of the Bernoulli distribution, as the prior for each θx independent across all x. θx ∼ Beta(α0,x, β0,x). With the assumption of this independent beta prior on each θx, and the conditionally independent Bernoulli responses as in (4.1), the posterior on θx after some number of workers have provided responses will remain beta-distributed, with first parameter equal to the sum of α0,x and the number of positive responses, and the second parameter equal to the sum of β0,x and the number of negative responses. In practice, one can estimate appropriate values for the parameters α0,x and β0,x from historical data on tasks pre- viously labeled by the crowd. We discuss this further in section 4.7 where numerical experiments are performed. Note the assumption of a Beta distribution can be relaxed without a great deal of dif- ficulty, as the posterior distribution will remain in an exponential family parameterized by the number of positive and negative labels observed for the instance. The assumption of independence cannot be easily generalized, as it is necessary for the decomposition in our Lagrangian relaxation, without which the upper bound in Section 4.5 much more challenging to compute. Thus, after the worker budget U has been exhausted or the time horizon T has 55 elapsed, the requester will have a posterior distribution on each θx which remains beta- distributed. Let α′x, β ′ x be the posterior parameter for this time. At this time, we model the requester as choosing, for each task x, an estimated label based on the responses of the crowd-workers, and then receiving a reward of 1 for each correctly labeled task, and 0 for the incorrectly labeled tasks. (Our approach can be easily generalized to other reward or loss structures that are additive across tasks, and depend only on θx and some task-specific estimate based on the crowd’s feedback.) The expected reward under the posterior that the requester will obtain is P(θx > dx|α′x, β′x), if s/he chooses a positive label, and P(θ ′ ′x < dx|αx, βx) if s/he chooses a neg- ative label (θx has a density, and so θx = dx with a posterior probability of 0). Thus, the requester chooses the label giving the larger reward, and achieves a reward whose expected value under the posterior {is, } R(α′x, β ′ x) = max P[θ ′ x>dx|αx, β′x],P[θx 0, we obtain Theorem 4.  Calculating the supremum term in Theorem 4 directly by dynamic programming is again computationally infeasible, because the state space of this dynamic program again is over all of S, which has 3K + 1 dimensions. We avoid this issue by decomposing this 63 supremum term into the sum of the optimal values for K dynamic programs, one for each task, each of which has a much more manageable 4 dimensions. To support this decomposition, we write the state S n ∈ S for the whole system (K tasks) as S n = (S n,1, . . . , S n,K), where S n,x is the state for task x when n events have occurred, and includes αx, βx, t,wx, and the global counter `. We let S(x) = R × R × R × N × N be the set of possible values for this single-task state S n,x. Following a development identical to that in Section 4.4, but for the single task x, we may define a space of policies π(x) that map the single-task state S n,x and the elapsed time since the last event (x)∆n (counting worker arrivals over the whole system, and completions of task x only) onto a binary decision of whether or not to allocate an incoming worker to task x, so that π(x)(S n,∆n) ∈ {0, 1}. Following this development, we construct K independent Markov chains, one for each task, where each one is controlled by its respective single-task policy π(x). We define N as before, to be the first time that the time horizon elapses, or our worker budget has been exhausted and all outstanding workers have completed their work. We then let Rx = R(αx(S N), βx(S N)) be the reward obtained from this the single task at this time. The following theorem shows that the bound in Theorem 4 can be re-written in terms of the sum of solutions of single-task dynamic programming problems, where each obtains the reward Rx, and is penalized for assigning workers to its task. Theorem 5. ∑K ∑U (x)[ ] ∑U inf sup Eπ Rx − λ`a≥ `,x + λ` (4.13)λ 0 x=1 π (x)∈Π(x) `=1 `=1 forms an upper bound on R0. 64 Proof. For any λ ≥ 0: ∑U ∑ supEπ[R − (λ` a`,x)] π∈Π ∑ `=1 ∑xK K ∑U = supEπ[ Rx − λ`a`,x] π∈Π [ x∑=1 ( x=K ∑1 `=1U )] = supEπ Rx − λ`a`,x π∈Π ∑ x=1K [ ∑`=1U ] = sup Eπ Rx − λ`a`,x . π∈Π x=1 `=1 This is bounded above by, ∑K [ ∑U ] supEπ Rx − λ`a`,x ∑x=1 π∈Π `=1K ∑U π(x) [ ] = sup E Rx − λ`a`,x . (4.14) π(x)∈Π(x)x=1 [ `=1 ∑ ] The equality at (4.14) is because sup ππ∈Π E R − Ux `=1 λ`a`,x depends only on (αt,x, β (x)t,x,wt,x : 0 ≤ t ≤ T ), which is in turn governed by π . By Theorem 4, for any λ ≥ 0, ∑K [ ∑U(x) ] ∑U sup Eπ Rx − λ`a`,x + λ` (x) (x) x=1 π ∈Π `=1 `=1 forms an upper bound on R0. This hold for any λ ≥ 0, and so we have thus proved Theorem 5.  Since the state space is much smaller for a single-task system, we can use dynamic programming to solve for the supremum term (x)[ ∑U ] sup Eπ Rx − λ`a`,x , (4.15) π(x)∈Π(x) `=1 for any λ value. What remains in the computing of the upper bound is to solve for the infimum in Theorem 5. We ex∑plore the convexity[prope∑rty of the ]prob∑lem follow by a(x) binary search. Define B(λ) = Kx=1 sup π U U π(x)∈Π(x) E0 Rx − `=1 λ`a`,x + `=1 λ`, which is 65 the upper bound derived in Theorem 5 without the infimum. First we prove that B(λ) is convex in λ. Lemma 12. B(λ) is convex in λ. ∑ [ ](x) ∑ Proof. First note U`=1 λ` is convex in λ. To prove sup π U π(x)∈Π(x) E Rx − l=1 λlal,x is convex in λ, pick any λ1, λ2 ≥ 0 and t ∈ [0, 1] . Let∑U ′ (x) [ ] π = sup Eπ Rx − (tλ1,l + (1 − t)λ2,l)al,x (4.16) π(x)∈Π(x) l=1 We have ∑U π(x) [ ] t sup E Rx − λ1,lal,x π(x)∈Π(x) [l=1 ∑U(x) ] +(1 − t) sup Eπ Rx − λ2,lal,x [ π(x)∈∑Π(x) ] l=1U [ ∑U ] ≥tEπ′ R − λ a + (1 − t)Eπ′x 1,l l,x Rx − λ2,lal,x [ ∑l=1 ] l=1U′ =Eπ Rx − (tλ1,l + (1 − t)λ2,l)al,x l= (x)[1 ∑U ] = sup Eπ Rx − (tλ1,l + (1 − t)λ2,l)al,x . π([x)∈Π(x) ∑ l]=1(x) Hence sup π Uπ(x)∈Π(x) E [Rx − l=1 λlal,x ]is convex in λ for any x ∈ [K], subsequently the sum of sup Eπ (x) − ∑R Uπ(x)∈Π(x) x l=1 λlal,x across x is also convex in λ. Thus we complete the proof that B(λ) is convex in λ.  With the convexity in λ, we approximate λ′ that achieves the infimum by setting λ′ = λ′ × (1, , , 1), and use a Fibonacci search to find the infimum. Here we constrain all the units of λ′ to be the same for simpler computation, a tighter bound can be obtained by allowing each unit of λ′ to vary and use sub-gradient descent to locate the infimum. 66 4.6 Index Policy We introduce an index-based heuristic policy built on the Lagrangian relaxation we used in proving the upper bound. In this policy, we compute some λ∗x for each task x based on its state S ∗n,x, such that λx is the greatest value of λ that the optimal policy will decide to hire the worker on state S n,x when solving (4.15) with λ = λ1, 1 = (1, . . . , 1). We then assign the incoming worker to the task with the highest λ∗. The intuition behind this policy is that in a single-task problem described by (4.15), we view λ` as a cost of employing the `th worker. As λ` increases, our decision switches from hiring the worker to not hiring, where the switching point is at λ∗x. Hence tasks with a high λ ∗ x are the tasks that are worth hiring more workers to work on. Below we present the algorithm in a more formal way. A useful technique to reduce the amount of computation is to Algorithm 4: Index Policy 1: while ` < U do 2: For each x ∈ {1, . . . ,K}, compute λ∗x = inf{λ ∈ R+ : aλ`,x = 1}, where aλx,` is the optimal decision from (4.15) when λ = λ1. 3: Let x∗ = arg maxx λ∗x. Break tie arbitrarily. 4: Assign task x th∗ to the ` worker. 5: end while put a cap on the total number of workers that can be assigned to a task, for this reduces the size of the state space of the dynamic program involved in solving for (4.15). This additional cap does not affect the decision made by the Index Policy. Intuitively it is unlikely for any reasonable policy to assign all the U number of workers to one task, so it is unlike for any task to get more than a certain number of workers assigned. One can check the validity of the cap by running simulations with the capped index policy, and 67 see whether there are tasks that use all the workers that the cap allows. We show in section 4.7 that this policy’s performance is close to optimal. 4.7 Numerical Experiment For numerical experiment we concentrate on the case in which T = ∞. In this case we stop when we reach the maximum number of workers the budget allows. The Bellman’s recursion to compute 4.15 in the computation of upper bound for this special case is given in the supplement. In the first set of simulation study, we compare the perfor- mances of different policies on simulated data against the corresponding upper bounds. In the second set of simulation study, we use a real dataset for simulation. 4.7.1 Simulation using simulated data In the first set of simulations we evaluate the performance of the Index Policy using simulated data, and compared the total reward given in (4.2) generated by the Index Policy to the upper bound 4.11. We also compare the performance of the Index Policy to Optimistic Knowledge Gradient(OKG) method [10], which is a state-of-art Bayesian allocation policy. A round of simulated process includes generating either an arrival of worker or a completion of task based on the arrival rate r = 0.1 and and completion rate µ = 0.4 with distributions specified in Section 4.3. If it is a completion of task, we generate a label based on the posterior parameters. The process stops when we exhaust all the budget, i.e., the number of workers that are allowed to hire, and we get a reward as in (4.2). We vary the number of tasks to be K = 10, 100, 1000, and set the budget to be U = 1.2K. We use a non-informative prior with α = 1 and β = 1. We use a threshold 68 dx = 0.5 for all the tasks. For each value of K = 10, 100, 1000, we simulate the process 5000 times, and obtain a 95% confidence interval for the simulated total reward. In Figure 4.1 we show a Semi-log plot of the number of tasks K against the average reward per task with the corresponding confidence intervals. 0.760 0.755 0.750 Upper bound 0.745 Index Policy Okg Policy 101 102 103 K Figure 4.1: Semi-log plot of K against average per task reward (R/K) for K = 10, 102, 103. We would like to make a note that in this numerical study, we truncate the time horizon of the single-task DP from 1.2K to 6 to reduce computational complexity. This truncation preserves the original dynamics as it is very unlikely for a task to be assigned 6 workers when the total number of workers is just 1.2 times the number of tasks. We use simulation to demonstrate that 6 is still a loose cap numerically. We run a simulation with simulated data with 1000 replications for number of workers K = 10 and 100, and U = 1.2K and count the number of tasks that uses 0 worker, 1 worker and up to 6 workers. The results are shown in Figure 4.2. The results show that a task uses at most 69 total reward/K 4 workers, hence set a cap at 6 does not affect the performance of the index policy. 4500 60000 4000 50000 3500 3000 40000 2500 30000 2000 1500 20000 1000 10000 500 0 0 0 1 2 3 4 5 6 0 1 2 3 4 5 6 No. of workers assigned to a task No. of workers assigned to a task Figure 4.2: Histogram of number of workers assigned to a task The performance of the index policy is consistently better than the OKG policy, especially for smaller number of tasks. Moreover, the gap between the upper bound and the simulated value gets smaller as K increases, which demonstrates both that the upper bound is tight and the Index Policy performs close to an optimal policy as the number of tasks gets larger. We emphasize that the improved performance over OKG offered by our Index Policy is only one aspect of the contribution of this work. The other aspect is the tightness of the upper bound, especially for problem instances with many tasks. This tight upper bound for K=1200 shows that the index policy is within 0.03% of optimal, and that continued algorithmic development will not provide significantly increased performance for large- scale crowdsourcing problems with characteristics matching this particular simulated dataset. The ability to bound the improvement from continued algorithmic development for a particular problem instance, or class of problem instances, is useful for managers at companies that use crowdsourcing and wish to allocate engineering\R&D effort. 70 No. of tasks No. of tasks 4.7.2 Simulation using real data This set of simulations uses a real dataset, PASCAL RTE-1[43], which consists of 800 tasks, each comes with 10 labels obtained from crowdworkers and a gold standard label. (A gold standard label of a classification task is its true label.) We evaluate the perfor- mance of the Index Policy against the OKG policy, the Thompson Sampling [8] and a widely used frequentist approach - the upper confidence bound(UCB) policy [3]. (The specific version of UCB used here is UCB1-tuned.) The metric used to evaluate the performance is the accuracy score. More specifically, it is the number of correctly pre- dicted labels over the total number of tasks. In each round of simulation process, we still simulate events (either arrival or completion) the same way as we did in Section 4.7.1. If the event is a completion, we read the most recent label for that task in the dataset. At the end of each process, predicted labels are compared against the gold standard labels. dx is still 0.5 for all tasks. For each value of K = 10, 100, 750, we simulate the process 5000 times, and obtain a 95% confidence interval for the simulated total reward. Before simulating, we use the remaining 50 tasks from the RTE dataset as the ‘his- torical data’ to estimate the parameters of the Beta prior, which is set to be the same across all tasks. This comes with an assumption that all tasks are homogeneous, hence a subset of them are representative of a larger population. We first obtain an estimate of θx for each of the 45 tasks, then use these empirical values of θx to fit a Beta distri- bution by Method of moments. In Figure 4.3 we show a Semi-log plot of the number of tasks K against the accuracy score with the corresponding confidence intervals. All the Bayesian policies see an smaller optimality gap when K gets larger. Index Policy performing consistently the best among all the policies. It is proven in [10] that OKG policy is consistent: it achieve 100% accuracy almost surely when number of work- ers goes to infinity. We demonstrate numerically that the Index Policy performs better 71 0.80 0.79 0.78 0.77 0.76 Upper bound 0.75 Index Policy Okg Policy 0.74 UCB Thompson 0.73 101 102 103 K Figure 4.3: Semi-log plot of K against accuracy score for K = 10, 100, 750. than the OKG policy. It is thus reasonable to anticipate that the Index Policy is not only consistent, but is asymptotically optimal when both the number of workers and the number of tasks goes to infinity, while keeping the ratio of the number of workers and the number of tasks constant. 4.8 Conclusion We formulated the effort-allocation problem in crowdsourcing in a continuous time set- ting with budget constraint and time constraint. We also provide a computationally feasible upper bound on value of the Bayes-optimal policy using Lagrangian relaxation. Using the Lagrange multiplier used in proving the upper bound, we also derived an index-based policy and showed in numerical experiments that it performs close to opti- mal. 72 Accuracy Score APPENDIX A A.1 Notation of Chapter 2 SK ,AK ,P·(·|·),R(·, ·) State space, action space, transition kernal and reward function of the original MDP. S,A, P·(·, ·), r(·, ·) State space, action space, transition kernal and reward function of the sub-processes of the original MDP. s, S Generic element and random element of SK . s, S Generic element and random element of S. K Number of sub-processes. T Time horizon mt Number of sub-processes to be set active at time step t Π Set of all Markov policies of the original MDP π(s, a, t) the probability of choosing action a in state s under policy π at time t. Π Set of all Markov policies of the sub-MDP π(s, a, t) the probability of choosing action a in state s under policy π at time t. Π∗(λ) Set of Markov deterministic optimal policies for sub-MDP Q(λ), given λ ∈ RT . πλ An element in Πλ. πλ A deterministic optimal policy for the relaxed problem which ob- tained by the decomposition method in Lemma 2, given λ ∈ RT . P(λ) Optimal value of the relaxed problem, given λ ∈ RT . λ∗ An value that attains infλ P(λ) 73 π∗∗ An optimal markov policy for the sub-MDP which satisfies π∗∗E [A mt] = K , ∀1 ≤ t ≤ T . π̂ The index based policy proposed by this paper β̄t Indices of the tied sub-processes. It The set of states occupied by the tied sub-processes. Nt(s) The number of sub-processes in state s at time t under index policy π̂. Pt(s) The probability of an individual sub-process landing in state s at ∗∗ time t under π∗∗. Pt(s) = Pπ [S t = s] α The ratio between the number of sub-processes set active, m, and the total number of sub-processes K. Mt(s) The number of sub-processes in state s at time t that are set active under our index policy π̂. Yt(s′, s) The number of sub-processes set active by π̂ in s′ at time t which transition to state s at time t + 1. X ′t(s , s) The number of sub-processes set inactive by π̂ in s′ at time t which transition to s at time t + 1. Ut(s) The set of states whose indices are greater than the index of state s at time t. Ut(s) = {s′′ ∈ S : βt(s′′) > βt(s)}. Vt(s) The set of states whose indices are equal to that of s. V (s) = {s′′t ∈ S : βt(s′′) = βt(s)}. |v| An operation that sums all the elements in vector v. H1, H2 random variables due to the rounding rules in Algorithm 2 Z(π,m,K) the expected reward of the original MDP obtained by policy π Table A.1: List of notation 74 A.2 Upper Bound Proof. Proof of Lemma 1 Let ΠP = {π ∈ Π : Pπ(|At| = mt) = 1, ∀1 ≤ t ≤ T }. Let ΠE = {π ∈ Π : Eπ[|At|] = m Tt, ∀1 ≤ t ≤ T }. For any λ ∈ R , we have P(λ) = maxEπ  ∑T ( ) ∑ Rt St,A  − Eπ t   λt(|At| − m∈ t)  π Π ∑ t=1 t T ≥ ( )maxEπ R S ,A  − Eπ ∑ λ (|A | − m )t t t t t t  π∈ΠE t=1 t = maxEπ ∈  ∑T ( )Rt St,At  π ΠE ∑t=1T  ≥  ( )maxEπ  Rt St,A ∈ t  ,π ΠP t=1 which is the optimal value of the original MDP. The first inequality is due to ΠE ⊆ Π. The first equality is due to the fact that any policy π in ΠE satisfies Eπ[At|] = mt. The last inequality is due to ΠP ⊆ ΠE.  75 A.3 Decomposition Proof. Proof of Lemma 2: π ∑T ( ) ∑T maxE R S ,A − Eπ ( t t t  λt |A | − ) m  ∈ t t π Π ∑t=1 ( ) t=1T  ∑T = maxEπ  R S ,A − λ |A | + m ∈  t t t t t   tλtπ Π t=1 t=1 = maxEπ ∑T ∑K − r (S , A ) λ A  ∑T+ m ∈   t t,x t,x t t,x tλt∑π Π t=1∑x=1 t=1K T = maxEπ  rt(S t, At) − λtAt ∑T+ mtλt π∈Π x=1 t=1 t=1  The first equality is due to linearity of expectation. The second equality is obtained by the definition of rt(·, ·) and | · |. The third equality is obtained by the independence of the process under policies in Π. A.4 Show arg infλ∈RT P(λ) is non-empty ∑ ∑ Proof. When λ ≥ 0, P(λ) = x Rx(λ) + m t λt ≥ 0 + 0 = 0. Rx(λ) is bounded below by 0 since a policy of not playing at all gives a total reward of 0. When λ < 0, the cost of play∑ing is negative, ∑an optimal policy will always play at all time steps. Hence P(λ) ≥ m ( t(0 − λt)) + m t λt = 0. For the case in which λ contains both positive and negative entries, writing λ as a convex combination of λ1 > 0 and λ2 < 0 and we have that P(λ) is still bounded below by zero, since P(λ) is convex in λ. Hence we 76 can conclude that infλ∈RT P(λ) exists (note here we make no claim about whether this infimum is attained by any finite λ) and denote this value by h∗. Recall we have assumed in the setup that all the rewards are bounded and non- negative, let r̄ be an upper bound for all the reward values. For any λ with λt ≥ T r̄, the corresponding optimal policies for a single-arm problem will be not play at time t, for T r̄ is at least the maximum reward∑obtainable by the single-a∑rm problem.∑Hence P(λ) ≥ 0+ mTr̄. For any λ ≥ 0, P = mE[ t rt,x(S t,x, 1) − λt|s1,x] + m t λt = mE[ t rt,x(S t,x, 1)|s1,x], which is independent of λ. Hence the infimum is attained on the set H = {λ : λt ≥ ∀t and maxt λt ≤ tr̄}. Since H is compact, there exists a λ∗ ∈ H s.t. P(λ∗) = h∗. Hence arg infλ∈RT P(λ) is non-empty.  A.5 Proof of the existence of π∗∗ Let L(π,λ) be the Lagrangian of the sub-problem ∑T maximize Eπ[ rt(S t, At)] π t=1 m subject to Eπ[At] t = , ∀1 ≤ t ≤ T. K Then ∑T ∑T m ∑T m max L(π,λ) = maxEπ[ r (S , A )] − λ (Eπt t t t [At] − t ) = Q( ) tλ + λt . π π K K t=1 t=1 t=1 77 Let π∗∗ be a policy that attains Q(λ∗). Hence ∑T Q(λ∗ m ) t+ λ∗ =L(π∗∗, λ∗t )K t=1 = inf max L(π,λ) λ π = max inf L(π,λ) π λ ∑T ∑T = max inf Eπ m [ r π tt(S t, At)] − λt(E [At] − ). π λ K t=1 t=1 We are allowed to switch inf and max in the third equality because L(π,λ) satisfies Slater’s condition, which is implied by the existence of a feasible po∑licy of Q(λ). The fourth equality is just a re-write of what L(π,λ) is. Since Q(λ∗) + T λ∗mtt=1 t K is finite, maxπ infλ L(π,λ) is finite. Any policy π that attains maxπ infλ L(π,λ) has to satisfy that Eπ[At] − mtK is zero for all t, otherwise infλ L(π,λ) will be negative infinity. Hence ∗∗ Eπ [At] = mtK . A.6 Proof of T∗maxs,a,t rt(s, a) upper bounds βt(s) It is sufficient to show that for any λ ≥ 0 with λ λt > T ∗maxs,a,t rt(s, a), V (s, t) is attained by choosing a = 0. When t = T , rT (s, 1) − λT < rT (s, 1) − T ∗ maxs,a,t rt(s, a) ≤ 0. On the other hand rT (s, 0) ≥ 0 as all rewards are non-negative by the set- ting of our or∑iginal MDP. Hence it is optimal to choose a = 0. When t < T , rt(s, 1) − λ a ′t + s′∈S P (s, s )Vλ(s′, t + 1) < maxs,a,t rt(s, a) − T ∗ maxs,a,t rt(s, a) + (T − t) ∗ maxs,a,t rt(s, a) ≤ 0 ≤ rt(s, 0). Hence it is also optimal to choose a = 0. Therefore βt(s) ≤ T ∗maxs,a,t rt(s, a) for all s, t. 78 A.7 A result that justifies using bisection Lemma 13. If there exists an optimal policy π that takes action a = 1 in state s ∈ S at time t for a sub-MDP (S,A, r, P·), and satisfies Pπ[S t = s] > 0, then a = 1 is strictly optimal in state s at time t under a modified sub-MDP (S,A, r′, P·) with r′t (s, 1) > rt(s, 1), and r′t equals rt otherwise. Proof. Proof of Lemma 13 We prove by contradiction. Let V(π, r) denote the total expected reward obtained by policy π with reward function r. Assume that there exists an optimal policy π′ for sub-MDP (S,A, r′(·, ·), P·(·, ·)) such that π′(s, 0, t) = 1. Since neither r (s, 1) nor r′t t (s, 1) contributes to the total expected reward, V(π ′, r) = V(π′, r′). Let π be an optimal policy for (S,A, r(·, ·), P·(·, ·)). Then we have V(π, r) ≥ V(π′, r). On the other hand, V(π, r′) is greater than V(π, r) by (r′t (s, 1) − r πt(s, 1))P [S t = s] > 0. Hence we get that V(π, r′) > V(π, r) ≥ V(π′, r) = V(π′, r′) contradicting that π′ is an optimal policy of sub-MDP (S,A, r′(·, ·), P·(·, ·)).   A.8 Proof of Lemma 3 Proof. Proof of Lemma 3 To prove (1), when β ∗t(s) > λt , by definition of the index in (2.6), there exists an  > 0 such that there is a π ∈ Π∗(λ∗[λ∗t + , t]) and π(s, 1, t) = ∗ λ∗[λ∗1. Recall how we construct set Π (λ) in Section 2.3.1, the value function V t +,t] corresponding to sub-MDP Q(λ∗[λ∗t + , t]) has to satisfy∑ r (s, 1) − λ∗ − ∗ + Vλ [λ∗t +,t](s′, t + 1)P1t t (s, s′)∑ s′∈S ≥ λ∗r (s, 0) + V [λ∗t +,t] ′t (s , t + 1)P0(s, s′). s′∈S 79 Since λ∗[λ∗t + , t] and λ ∗ share the same elements from the (t + 1)th position onwards, Vλ ∗[λ∗ ∗t +,t](s, t′) = Vλ (s, t′), for all s ∈ S and t′ ≥ t + 1. Hence ∑ ∑ ∗ r (s, 1) − λ∗ + Vλ (s′, t + 1)P1 ∗t t (s, s′) > rt(s, 0) + Vλ (s′, t + 1)P0(s, s′). (A.1) s′∈S s′∈S Next we consider two separate cases: 1) State s is visited with positive probability under ∗∗ ∗∗ π∗∗, that is, Pπ (S t = s) > 0; 2) State s is visited with zero probability, i.e., Pπ (S t = ∗∗ s) = 0. If 1) Pπ (S t = s) > 0, since π∗∗ is an optimal policy∑for the unconstrained sub- MDP Q(λ∗ ∗ ) in (2.3), and a = 1 attains max{r ∗t(s, a) − aλt + s′∈S Pa(s, s′)Vλ (s′, t + 1)} ∗∗ alone, hence π∗∗(s, 1, t) = 1. If 2) Pπ (S t = s) = 0, we get π∗∗(s, 1, t) = 1 directly from the construction of π∗∗ in (2.8). Statement (2) can be proven using a similar argument. We therefore skip the proof to avoid redundancy.   A.9 Proof of Lemma 4 Proof. Proof of Lemma 4 To prove (1), suppose, for the sake of contradiction, that π∗∗(s, 1, t) < 1. By Lemma 3, we have βt(s) ≤ λ∗t . Therefore Ut(s) ∪ Vt(s) forms a superset to the set of states with indices of at least λ∗t . We also know that π ∗∗ takes active action with probability α at time t. Hence we can write α as the sum of the probabilities of taking the active action in all states s′ with π∗∗(s′, 1, t) = 1 and the probabilities of taking the active action in all states s′ with 0 < π∗∗(s′, 1, t) < 1: ∑ ∑ α = P (s′) ∗ 1 + P (s′) ∗ π∗∗(s′t t , 1, t) s′∈{s′′:π∑∗∗(s′′,1,t)=1} s′∈{s∑′′:0<π∗∗(s′′,1,t)<1} < P (s′t ) + P ′t(s ). (A.2) s′∈{s′′:π∗∗(s′′,1,t)=1} s′∈{s′′:0<π∗∗(s′′,1,t)<1} 80 Taking the contrapositives of both statements in Lemma 3, we get if 0 < π∗∗(s′, 1, t) < 1 then βt(s′) = λ∗t . Hence ∑ ∑ (A.2) = P (s′t ) ≤ Pt(s′) ≤ α s′∈{s′′:β (s′′)≥1} s′t ∈Ut(s)∪Vt(s) We get α < α, which is a contradiction, as desired. To prove (2), we again use contradiction. Assume π∗∗(s, 1, t) > 0; by the contraposi- tive of the second statement of Lemma 3 we know βt(s) ≥ λ∗t . Then Ut+1(s) is a subset of {s′ : β (s′t ) > λ∗t }, wh∑ich in turn is a subset of {s′ :∑π∗∗(s′, 1, t) = 1} by Lemma 3. Hence by the fact that α = ′ ′ ∗∗ ′s′∈{s′′:π∗∗(s′′,1,t)=1} Pt(s ) ∗ 1 + s′∈{s′′:0<π∗∗(s′′,1,t)<1} Pt(s ) ∗ π (s , 1, t), we must have either 1) ∑ ∑ α > P ′t(s ) ≥ Pt(s′) s′∈{s′′:π∗∗(s′′,1,t)=1} s′∈Ut(s) when there exists some s′ ∈ {s′′ : 0 < π∗∗(s′′, 1, t) < 1} such that Pt(s′) > 0, or 2)∑ ∑ α ≥ Pt(s′) > P (s′t ) s′∈{s′′:π∗∗(s′′,1,t)=1} s′∈Ut(s) ∑otherwise, as we must have that π∗∗(s, 1, t) = 1. In either case we get α >′ s′∈U (s)=1 Pt(s ), which again forms a contradiction.  t A.10 Lemma 14 Lemma 14. Let X(k) be a sequence of non-negative random variables such that (k) lim 1 X(k)k→∞ k = γ, a.s.. If Y (k)|X(k) ∼ Bin(X(k), p), then lim Yk→∞ k = γp, a.s.. 81 Proof. Proof of Lemma 14 We consider two cases: 1) X(k) → ∞; 2) X(K) is bounded. When 1) X(k) → ∞, we have Y (k) lim k→∞ k Y (k) X(k) = lim k→∞ X(k) k Y (k) X(k) = lim (k) limk→∞ X k→∞ k =p ∗ γ If 2) X(K) is bounded, then γ = 0. Y (k) is also bounded, so limk→∞ = 0.   A.11 Proof of Lemma 5 Proof. Proof of Lemma 5 Before attempting the proof, we make some observations on the properties of function Rounding-c. These observations will also be useful for proofs that come later in Appendix A.12. Based on how Rounding-c works in Algorithm 3, ∑Figure A.11 plots the number of sub-processes allocated to the ith state b∑i against total− j,i min{total ∗ frac j, avail j}. bi is a piecewise linear function of total − j,i min{total ∗ frac j, avail j} with the changes of gradients occur when the value of some avail j drops to total∗ frac j. The function line intersect with y-axis at min{total∗ fraci, availi} when there is no left-over after assigning each state min{total ∗ frac j, avail j}. The line ends when it reaches min{total, availi} as that is the largest possible assignment to state i. To give an explicit form, we first sort for j , i and j ∈ {1, ..., n} by c j = avail j − total ∗ frac j. Let c′ , ..., c′1 ` be the sorted values of c js from the smallest to the largest with no repetition, and let n j denote the number of states with c′j. Let x denote the value 82 Figure A.1: Plot of Rounding-c ∑ represented on the x-axis total − j,i min{total ∗ frac j, avail j}, we can write bi asmin{total ∗ fraci, availi} + x n n 1 for 0 ≤ x ≤ n c′ 1+...+ + 1 ` 1  ′ ′ min{total ∗ fraci, avail } c1 x−n1c1 i + n1 n + n for n c ′ ≤ x ≤ n c′  +...+ ` 2+...+n +1 1 1 2` 2bi =  (A.3) . .. min{total ∗ frac avail } c′1 x−n ′`ci, i + n1 + ... + ` ′+...+n` n`+1 for n`c` ≤ x ≤ total Let bi be the ith component of the output by Rounding(total, frac, avail) and bci be the ith component of the output by Rounding-c(total, frac, avail). We first show that |bi − bci| ≤ 2 by comparing the algorithms of Rounding and Rounding-c. If min{total ∗ fraci, availi} is attained by availi, then bi = bci.∑If availi is larger than btotal ∗ fracic (hence also larger than total ∗ fraci), and total = j min{btotal ∗ frac jc, avail j} (meaning no residue left to distribute in Rounding), then the difference between bi and bci is no 83 ∑ more than 1. When total > j min{btotal ∗ frac j, avail jc} and there is residue left to be distributed, the amount distributed to the ith state in Rounding may be smaller or larger than the that in Rounding-c, depending on the position of i in an array of 1 to n. This together with the possible difference between btotal ∗ frac jc and total ∗ frac j gives a total difference of at most 2. Subsequently we have |bs(Nt, bαtKc) − b̄s(Nt, bαtKc)| ≤ 2, and | fs(Nt, bαtKc) − f̄ Nt bαtKc)s(Nt, bαtKc)| = | fs(Nt, bαtKc) − K ∗ f̄s( K , K |.  A.12 Proof of Lemma 6 Proof. Proof of Lemma 6 First note that b̄s(Nt, bαtKc) is continuous as illustrated in the proof of Lemma 5 in Appendix A.11. Hence f̄s(Nt, bαtKc) is continuous. This is because when bαtKc − N+t (s) , N=t (s), f̄s(Nt, bαtKc) is continuous as Nt(s) and b̄s(Nt, bαtKc) are continuous. When bα + =tKc − Nt (s) = Nt (s), we have Nt(s) = b̄s(Nt, bαtKc). By the continuous mapping theorem, and the fact that lim Nt(s)K→∞ K = Pt(s) almost surely and lim bαtKcK→∞ K = αt, we have Nt bαlim f̄ ( tKc N bα Kcs , ) = f̄s( lim t t, lim ) = f̄s(P→∞ K K →∞ K →∞ K t, αt), a.s.K K K  84 A.13 Proof of Lemma 7 Proof. Proof of Lemma 7 We first prove that f̄s(Nt, bαtKc) equals to  ∑min{Pt(s), [α − ′ P (s ′)]+ ∑ ρ(s,1,t) },  s ∈Ut(s) t s′∈ ′Vt (s) ρ(s ,1,t) gs(Pt) =  ∑ if s′∈V ρ(s′, 1, t) > 0t(s)  ∑ (A.4) min{P (s), [α − P (s′)]+ ∑ Pt(s) ∑t s′∈U (s) t ′ P (s′) }, t s ∈Vt (s) t if s′∈V (s) ρ(s′, 1, t) = 0,t then we show gs(Pt) = Pt(s)π∗∗(s, 1, t). To prove the former, We divide our discussion into two main cases. • When ∑ ∑ [α − P ′ + ′t(s )] − Pt(s ) ≥ 0, (A.5)∑ s′∈Ut(s) s′∈Vt(s) if α − s′∈U (s) Pt(s′) ≤ 0, then gs(Pt) = 0 in both cases. Moreover, (A.5) impliest Pt(s) =∑0 by Lemma 4, hence fs(Pt, α) = 0. If α − ′s′∈U (s) Pt(s ) > 0, (A.5) implies that Pt(s) = 1 by Lemma 4.t ∑ ∑ ∑ α − Pt(s′) ≥ P (s′t ) = ρ(s′, 1, t) ≥ Pt(s). s′∈Ut(s) s′∈Vt(s) s′∈Vt(s) Hence Pt(s) attains the minimum in both cases of gs(Pt), and fs(Pt, α) = Pt(s) = gs(Pt). • When ∑ ∑ [α − P (s′)]+t − P (s′t ) < 0, (A.6)∑ s′∈Ut(s) s′∈Vt(s) if α − ′∑s′∈U (s) Pt(s ) ≤ 0, then both fs(Pt, α) and gs(Pt) are zero.t If α − s′∑∈U (s) Pt(s′) > 0, fs(Pt, α) is eithert min{(α − s′∈U (s) Pt(s′))∑ ρ(s,1,t)t s′∈Vt (s)ρ(s′ , Pt(s)} or,1,t) 85 ∑ min{(α − P (s′))∑ Pt(s)s′∈U (s) t ′ ′ , Pt(s)} ∑depending on whether s′∈V (s) ρ(s′, 1, t)t s ∈Vt (s)Pt (s ) t is greater than or equal to zero, which matches exactly the two cases in gs(Pt). ∑We first consider the case when ρ(s′s′∈V (s) , 1, t) > 0. This can be further divided into two sub-cases,t • Case 1:∑When gs(Pt) = Pt(s), if α − P (s′) ≤ 0, by Lemma 4, π∗∗s′∈U (s) t (s, 1, t) = 0. Since Pt(s) attains thet minimu∑m in this case, Pt(s) = 0. Hence gs(Pt) = 0 = Pt(s)π ∗∗(s, 1, t). If α − s′∈U (s) Pt(s′) > 0, this leads to two cases by Lemma 4: π∗∗(s, 1, t) = 1t or 0 < π∗∗(s, 1, t) < 1. If∑it is the latter, we have the second term in gs(Pt) be- comes ρ(s, 1, t), since α− ′s′∈U (s) Pt(s ) cancels with the denominator. ρ(s, 1, t) =t Pt(s)π∗∗(s, 1, t) < Pt(s), which contradicts that Pt(s) attains the minimum. Hence π∗∗(s, 1, t) = 1. We have gs(P∑t) = Pt(s) = Pt(s) ∗ π ∗∗(s, 1, t). • Case 2:∑When gs(P ) ρ(s,1,t)t = [α − ′ +s′∈Ut(s) Pt(s )] ∑ ′ ,s ∈Vt (s)ρ(s′ ,1,t) if α − ′s′∈U (s) Pt(s ) ≤ 0, again by Lemma 4, π∗∗(s, 1, t) = 0, gt s(Pt) = 0 = Pt(s)π∗∗∑(s, 1, t). If α − ′ ∗∗s′∈U (s) Pt(s ) > 0, again we have two ∑cases: π (s, 1, t)∑= 1 or 0 0: Vx(αx, βx,wx, l) αx = Vx(αx + 1, βx,wx − 1, l)+ αx + βx βx Vx(αx, βx + 1,wx − 1, l); (A.7b) αx + βx If l < U: Vx(αx, βx,wx, l) r = ma[x {Vx(αx, βx,wx + al,x, l + 1) − λq l+1al,x}x al,x∈{0,1}µwx αx + Vx(αq x + 1, βx,wx − 1, l) x αx + βx β ]x + Vx(αx, βx + 1,wx − 1, l) . αx + βx 87 BIBLIOGRAPHY [1] D Adelman and A Mersereau. Relaxations of Weakly Coupled Stochastic Dynamic Programs. Operations Research, 2008. [2] S Andradóttir and S H Kim. Fully Sequential Procedures for Comparing Con- strained Systems via Simulation. Naval Research Logistics, 57(5):403–421, 2010. [3] P Auer, N Cesa-Bianchi, and P Fischer. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 2002. [4] Y Bacharach, T Grapple, T Minka, and J Guiver. How To Grade A Test Without Knowing the Answers - A Bayesian Graphical Model for Adaptive Crowdsourcing And Aptitude Testing. ICML, 2013. [5] D Batur and S H Kim. Finding Feasible Systems in the Presence of Constraints on Multiple Performance Measures. ACM Transactions on Modeling and Computer Simulation, 20(3):13:1–13:26, 2010. [6] D A Berry and B Fristedt. Bandit problems: sequential allocation of experiments (Monographs on statistics and applied probability). Springer, 1985. [7] E Bofinger and G J Lewis. Two-Stage Procedures for Multiple Comparisons with a Control. American Journal of Mathematical and Management Sciences, 12(4):253–275, 1992. [8] O Chapelle and L Li. An Empirical Evaluation of Thompson Sampling. NIPS, 2002. [9] C Chen, D He, M Fu, and L Lee. Efficient Simulation Budget Allocation for Selecting an Optimal Subset. Informs Journal on Computing, 2008. [10] X Chen, Q Lin, and D Zhou. Statistical Decision Making for Optimal Budget Allocation in Crowd Labeling. arXiv, 2011. [11] Xi Chen, Qihang Lin, and D Y Zhou. Optimistic Knowledge Gradient Policy for Optimal Budget Allocation in Crowdsourcing. ICML, 2013. [12] Xi Chen, Qihang Lin, and Dengyong Zhou. Optimistic Knowledge Gradient Pol- icy for Optimal Budget Allocation in Crowdsourcing. In Proceedings of the 30th International Conference on Machine Learning, pages 64–72, 2013. 88 [13] H Damerdji and M K Nakayama. Two-Stage Procedures for Multiple Comparisons with a Control in Steady-State Simulations. In Proceedings of the 1996 Winter Simulation Conference, pages 372–375, Piscataway, New Jersey, 1996. Institute of Electrical and Electronics Engineers, Inc. [14] M H DeGroot. Optimal Statistical Decisions. McGraw Hill, New York, 1970. [15] E J Dudewicz and S R Dalal. Multiple Comparisons with a Control When Vari- ances Are Unknown and Unequal. American Journal of Mathematics and Man- agement Sciences, 3(4):275–295, 1983. [16] C W Dunnett. A Multiple Comparison Procedure for Comparing Several Treatments with a Control. Journal of the American Statistical Association, 50(272):1096–1121, 1955. [17] E B Dynkin and A A Yushkevich. Controlled Markov Processes. Springer, New York, 1979. [18] E.B Dynkin and Yushkevish A.A. Controlled Markov Processes. Springer, 1979. [19] P I Frazier. Learning with dynamic programming. In Wiley Encyclopedia of Op- erations Research and Management Science, pages 1–13, Hoboken, New Jersey, 2011. Wiley. [20] A Ghosh, S Kale, and P McAfee. Who moderates the moderators? Crowdsourcing Abuse detection in user-generated content. ACM, 2011. [21] C Gittins, J. Bandit Processes and Dynamic Allocation Indices. Journal of the Royal Statistical Society. Series B, 1979. [22] J C Gittins. Multi-Armed Bandit Allocation Indices. John Wiley and Sons, New York, 1989. [23] John Gittins, Kevin Glazebrook, and Richard Weber. Multi-armed Bandit Alloca- tion Indices. Wiley, Hoboken, New Jersey, 2nd edition, 2011. [24] Y Gocgun and A Ghatey. Lagrangian Relaxation and Constraint Generation For Allocation And Advanced Scheduling. Computer and Operations Research, 2011. [25] J Hawkins. A Lagrangian decomposition approach to weakly coupled dynamic optimization problems and its applications. Ph.D Thesis, Center of Operations Research, MIT, 2002. 89 [26] Christopher Healey, Sigrún Andradóttir, and Seong-Hee Kim. Selection proce- dures for simulations with multiple constraints under independent and correlated sampling. ACM Transactions on Modeling and Computer Simulation, 24(3):14:1– 14:25, 2014. [27] CJ Ho, S Jabbari, and J W Vaughan. Adaptive Task Assignment For Crowdsourced classification. ICML, 2013. [28] D Karger, S Oh, and D Shah. Budget-Optimal Task Allocation for Reliable Crowd- sourcing Systems. Operations Research, 2013. [29] D R Karger, S Oh, and D Shah. Iterative Learning for Reliable Crowdsourcing System. NIPS, 2011. [30] D R Karger, S Oh, and D Shah. Efficient crowdsourcing for multi-class labeling. ACM, 2013. [31] S H Kim. Comparison with a Standard via Fully Sequential Procedures. ACM Transactions on Modeling and Computer Simulation, 15(2):155–174, 2005. [32] S H Kim and B L Nelson. Recent Advances in Ranking and Selection. In Pro- ceedings of the 2007 Winter Simulation Conference, pages 162–172, Piscataway, New Jersey, 2007. Institute of Electrical and Electronics Engineers, Inc. [33] L. W. Koenig and A. M. Law. A procedure for selecting a sub- set of size m con- taining the l best of k independent normal populations. Comm. Statist.Simulation Comm. 14 719734, 1985. [34] W S Lovejoy. A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28(1):47–65, 1991. [35] G E Monahan. A survey of partially observable Markov decision processes: The- ory, models, and algorithms. Management Science, 28(1):1–16, 1982. [36] B L Nelson and D Goldsman. Comparisons with a Standard in Simulation Exper- iments. Management Science, 47(3):449–463, 2001. [37] E Paulson. On the Comparison of Several Experimental Categories with a Control. The Annals of Mathematical Statistics, 23(2):239–246, 1952. [38] W B Powell. Approximate Dynamic Programming: Solving the curses of dimen- sionality. John Wiley and Sons, New York, 2007. 90 [39] Warren B Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley, 2011. [40] M.L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Pro- gramming. Wiley, 2008. [41] H Robbins. A note on gambling systems and birth statistics. American Mathemat- ical Monthly, 1952. [42] A Ruszczynski. Nonlinear Optimization. Princeton University Press, 206. [43] R Snow, B OConnor, D Jurafsky, and A Ng. Cheap and Fast But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. EMNLP, 2008. [44] R Szechtman and E Yücesan. A New Perspective on Feasibility Determination. In Proceedings of the 2008 Winter Simulation Conference, pages 273–280, Piscat- away, New Jersey, 2008. Institute of Electrical and Electronics Engineers, Inc. [45] T L Thanh, M Venanzi, A Rogers, and N R Jennings. Efficient budget allocation with accuracy guarantees for crowdsourcing classification tasks. ACM, 2013. [46] R Weber and G Weiss. On An Index Policy For Restless Bandits. Journal of Applied Probability, 1990. [47] R Weber, R and G Weiss. Addendum to ’On An Index Policy For Restless Bandit’. Advances in Applied Probability, 1991. [48] P. Whittle. Restless Bandits: Activity Allocation in a Changing World. Journal of Applied Probability, 1988. [49] J Xie and P I Frazier. Sequential Bayes-Optimal Policies for Multiple Comparisons with a Known Standard. Operations Research, 61(3):1174–1189, 2013. [50] J. Xie and P.I. Frazier. Upper Bounds on the Bayes-Optimal Procedure for Ranking & Selection with Independent Normal Priors. Proceedings of the 2013 Winter Simulation Conference, 2013. [51] Y Yan, R Rosales, G Fung, and J Dy. Active Learning From crowds. ICML, 2011. [52] F Ye, H Zhu, and E Zhou. Weakly Coupled Dynamic Program: Information and Lagrangian Relaxations. arXiv preprint arXiv:1405.3363, 2014. 91