SEQUENTIAL RESOURCE ALLOCATION UNDER
UNCERTAINTY: AN INDEX POLICY APPROACH
A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
by
Weici Hu
August 2017
©c 2017 Weici Hu
ALL RIGHTS RESERVED
SEQUENTIAL RESOURCE ALLOCATION UNDER UNCERTAINTY: AN INDEX
POLICY APPROACH
Weici Hu, Ph.D.
Cornell University 2017
We consider a class of stochastic sequential allocation problems - restless multi-armed
bandits (RMAB) with a finite horizon and multiple pulls per period. Leveraging the La-
grangian relaxation of the problem, we propose an index-based policy that uses the opti-
mal Lagrange multipliers to index individual arms, and prove that the policy is asymptot-
ically optimal as the number of arms tends to infinity. We also demonstrate numerically
that this index-based policy outperforms state-of-the-art heuristics in several instances
of RMAB. In addition, we study two other applications of sequential resource allocation
problems which are extensions of the RMAB problem, and demonstrate how our index
policy can be adapted to these settings.
BIOGRAPHICAL SKETCH
Born as a mainlander, Weici Hu spent most of her childhood in a southern city of China.
She then went further south to Singapore to complete her secondary school and high
school education. Tired of the tropical climate, she spent the past 8 years acquiring a
college degree and trying to acquire a PhD degree in the Northeast of the United States.
iii
This document is dedicated to my mother.
iv
ACKNOWLEDGEMENTS
I would like to first thank my advisor for his guidance and extreme patience and for not
kicking me out of the program. I would also like to thank my parents for their non-
interference policy which has allowed me to experience life in its truest states. Lastly
I would like to thank two of my best pals in graduate school: Wei Qian and Kenneth
Chong, for handing me water and energy bars during my Marathon race.
v
TABLE OF CONTENTS
Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1 Introduction 1
2 An Asymptotically optimal Index policy for RMAB 7
2.1 Problem Description and Notation . . . . . . . . . . . . . . . . . . . . 7
2.2 Lagrangian Relaxation and Upper Bounds . . . . . . . . . . . . . . . . 9
2.3 An Index Based Heuristic Policy . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Pre-computations . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.2 Index policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Proof of Asymptotic Optimality . . . . . . . . . . . . . . . . . . . . . 15
2.5 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5.1 Multi-armed bandit . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5.2 Project assignment problem . . . . . . . . . . . . . . . . . . . 26
2.5.3 Subset selection problem . . . . . . . . . . . . . . . . . . . . . 27
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3 Parallel Bayesian Policies For Finite Multiple Comparisons With a Known
Standard 30
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Upper Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Index Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4 Bayes-Optimal Effort Allocation in Crowdsourcing: Bounds and Index
Policies 50
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4 Dynamic Programming Formulation . . . . . . . . . . . . . . . . . . . 56
4.5 Upper Bound on the Bayes-Optimal Policy . . . . . . . . . . . . . . . 61
4.6 Index Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.7 Numerical Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.7.1 Simulation using simulated data . . . . . . . . . . . . . . . . . 68
4.7.2 Simulation using real data . . . . . . . . . . . . . . . . . . . . 71
4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
vi
A 73
A.1 Notation of Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
A.2 Upper Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
A.3 Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
A.4 Show arg infλ∈RT P(λ) is non-empty . . . . . . . . . . . . . . . . . . . . 76
A.5 Proof of the existence of π∗∗ . . . . . . . . . . . . . . . . . . . . . . . 77
A.6 Proof of T∗maxs,a,t rt(s, a) upper bounds βt(s) . . . . . . . . . . . . . . 78
A.7 A result that justifies using bisection . . . . . . . . . . . . . . . . . . . 79
A.8 Proof of Lemma 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
A.9 Proof of Lemma 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
A.10 Lemma 14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
A.11 Proof of Lemma 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
A.12 Proof of Lemma 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
A.13 Proof of Lemma 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
A.14 Bellman’s recursion for T = ∞ . . . . . . . . . . . . . . . . . . . . . . 87
Bibliography 88
vii
LIST OF TABLES
A.1 List of notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
viii
LIST OF FIGURES
2.1 Upper bound and simulation results of MAB . . . . . . . . . . . . . . 25
2.2 Upper bound and simulation result of project assignment problem . . . 27
2.3 Upper bound and simulation result of subset selection . . . . . . . . . 29
3.1 This figure illustrates how zn is chosen by the index-based policy with
k = 2 systems, m = 2 parallel computing resources, and N = 20 sim-
ulation batches. Figure (a) plots zλ2,2, the optimal number of samples
to take in batch 3 from system 1 when it is in state (2,2), against the
value of λ; Figure (b) plots zλ3,3, the optimal number of samples to take
in batch 3 from system 2 when it is in state (3,3). Figure (c) plots
zλ λ2,2 + z3,3, the optimal total number of samples to take across both sys-
tems. The dashed line in (c) shows the constraint m = 2, and λ∗ will
be the left endpoint of the solid line overlapping this dashed line. The
∗ ∗
number of samples taken from each system will be zλ λ2,2 = 1 and z3,3 = 1
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 This figure shows the upper bound on the performance of the optimal
policy for the MCS problem (dashed line with squares) normalized by
dividing by k, as well as the estimated performance of two sub-optimal
policies: the index policy from Section 4.6 (thinner lines and dots); and
the equal allocation policy (thicker lines and dots). The setting pictured
uses m = k, dx = 0.2, α0x = β0x = 1, cx = 0. We use 10,000 independent
replications to estimate the value of the index policy, and 50,000 for the
equal allocation policy. The plot shows that the index policy is substan-
tially better than equal allocation, and is statistically indistinguishable
from optimal given the number of replications performed. . . . . . . . 48
4.1 Semi-log plot of K against average per task reward (R/K) for K = 10, 102, 103. 69
4.2 Histogram of number of workers assigned to a task . . . . . . . . . . . 70
4.3 Semi-log plot of K against accuracy score for K = 10, 100, 750. . . . . . . . 72
A.1 Plot of Rounding-c . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
ix
CHAPTER 1
INTRODUCTION
We consider a general class of dynamic resource allocation problems with average-case
criteria. Such problems enjoy a variety of applications in a wide range of industries.
Examples include:
• Facebook displays ads in the suggested posts section every time its users browse
their personal pages. Among the ads that have been shown, some are known to
attract more clicks than others. But there are also many ads which have yet to be
shown and they may attract even more clicks. Given that the slots for display are
limited, a policy is required to select ads to maximize total clicks.
• In a multi-stage clinical trial, a medical group starts with a number of new treat-
ments and an existing treatment with reliable performance. In each stage, a few
treatments are selected from the pool to test, with the goal to identifying the new
treatments that perform better than the existing one with high confidence. A strat-
egy is required to select which treatments to test at every stage to most effectively
support their judgment at the end of the trial.
• A data analyst wishes to label a large number of images using crowdsourced effort
from low-cost but potentially inaccurate workers. Each label given by the crowd-
workers comes with a cost and the analyst has limited budget. Hence she needs
to carefully assign tasks to workers so as to maximize the likelihood of correct
labeling.
We formulate such problems as instances of the restless multiarmed bandit (RMAB)
problem [48] with a finite horizon and multiple pulls per period, which, in turn, is
closely related to a broader class of problem called Weakly Coupled Dynamic Programs
1
(WCDP) [25]. In the RMAB, we have a collection of “arms”, each of which is endowed
with a state that evolves independently. If the arm is “pulled” or “engaged’ in a time
period then it advances stochastically according to one transition kernel, and if not then
it advances according to a different kernel. Rewards are generated with each transition,
and our goal is to maximize the expected total reward over a finite horizon, subject to
a constraint on the number of arms pulled in each time period. The RMAB forms a
generalization of the more famous multi-armed bandit (MAB) [41] by allowing arms
that are not engaged to change state and multiple pulls per period.
Theoretically an optimal solution of a RMAB can be obtained by leveraging the Bell-
man equation and solving it as a dynamic program (DP) [40]. However, this approach
becomes computationally infeasible when the state space grows large. In particular, the
state space grows exponentially with the number of arms in a RMAB, and the number
of arms are often large in practice. This approach therefore suffers the so-called “curse
of dimensionality” [39]. Much research has been dedicated to efficiently finding “good”
solutions. Gittin in 1979 proposed a tractable-to-compute optimal policy for the infinite
horizon MAB with one pull per time period, which is famously known as the Gittins
index policy [21]. This policy is appealing because it can be computed by considering
the state space for only a single arm, making it computationally tractable for problems
with many arms. This policy loses its optimality properties, however, when modifying
the problem in any problem dimension: when allowing arms that are not engaged to
change state; when moving to a finite horizon [6]; or when allowing multiple pulls per
period. Thus, the Gittins index does not apply to our problem setting.
While the RMAB is not known to have a computable optimal policy, [48] proposed a
heuristic called the Whittle index for the infinite-horizon RMAB with multiple pulls per
period, which is well-defined when arms satisfy an indexability condition. This policy
2
is derived by considering a Lagrangian relaxtion of the RMAB in which the constraint
on the number of arms pulled is replaced by a penalty paid for pulling an arm. An arm’s
Whittle index is then the penalty that makes a rational player indifferent between pulling
and not pulling that arm. The Whittle index policy then pulls those arms with the highest
Whittle indices. Appealingly, the Whittle index and the Gittins index are identical when
applied to the MAB problem with a single pull per period. [48] further conjectured that
if the number of arms and the number of pulls in each time period go to infinity at the
same rate in an infinite-horizon RMAB, then the Whittle index policy is asymptotically
optimal when arms are indexable. [46, 47] gave a proof to Whittle’s conjecture with a
difficult-to-verify condition: that the fluid approximation has a globally asymptotically
stable equilibrium point. This condition was shown to hold when each arm’s state space
has at most 3 states, but this condition does not hold in general and [46] provides a
counterexample with 4 states.
Our contribution in this dissertation is to (1) create an index policy for finite horizon
RMABs with multiple pulls per period, and (2) show that it is asymptotically optimal
in the same limit considered by Whittle. Like the Whittle index, our approach is com-
putationally appealing because it requires considering the state space for only a single
arm, and its computational complexity does not grow with the number of arms. Un-
like the Whitle index, our index policy does not require an indexability condition to be
well-defined, and in contrast with [46, 47] our proof of asymptotic optimality holds re-
gardless of the number of states, and does not depend on hard-to-verify conditions. We
further demonstrate our index policy numerically on problems from the literature that
can be formulated as finite-horizon RMABs, and show that it provides finite-sample
performance that improves over the state-of-the-art. We also use our framework to de-
velop policies for two major RMAB-like applied problems: Multiple comparison with
a standard (MCS) in the field of simulation and crowdsourcing in the field of artificial
3
intelligence, and demonstrate numerically that our index-based policies out-perform the
state-of-art using both real and synthetic data.
In addition to building on [48, 46, 47], our work builds on the literature in weakly
coupled dynamic programs (WCDP), that itself builds on RMABs. Indeed, at the end of
his paper, Whittle pointed out that his relaxation technique can be applied to a more gen-
eral class of problems in which sub-problems are linked by constraints on actions, but
are otherwise independent. Hawkins in his thesis [25] formally termed these problems
(but with a more general type of constraints) as WCDPs and proposed a general decou-
pling technique. Moreover, he proposed a minimal-lambda policy for infinite horizon
WCDPs which, like the Whittle index policy, is derived by considering a Lagrangian
relaxation of the WCDP. The minimal-lambda policy finds the smallest Lagrange mul-
tiplier so that the current optimal decision for the Lagrangian relaxation is also feasible
for the original WCDP. The minimal-lambda policy then pulls arms according to this
optimal decision. In the case of RMAB, the minimal-lambda policy is equivalent to the
Whittle index with the smallest of the indices of all pulled arms for the former being the
same as the smallest Lagrange multiplier to attain a feasible solution for the latter.
Another major work in WCDP is [1] which shows that the ADP relaxation is tighter
than the Lagrangian relaxation but is also computationally more expensive. It gives
necessary and sufficient conditions for the Lagrangian relaxation to be tight and proves
that the optimality gap is bounded by a constant when the Lagrange multipliers are
allowed to be state dependent. The last result that the optimality gap is bounded by a
constant implies that the per arm gap goes to zero as the number of arms grows. We
achieve a similar result in the dissertation by showing the per arm reward of our index-
based heuristic policy goes to the per-arm reward of the Lagrangian bound, despite that
our Lagrange multipliers not being state-dependent. While there are similarities between
4
the two works, the focus differs: while our work focuses on offering an asymptotically
optimal heuristic policy, [1] examines the ordering and tightness of different bounds.
The heuristic proposed in [1] is based on an ADP technique, and is different from our
index-based policy.
Other work on WCDP also includes [52] who proposes an even tighter bound by
incorporating an information relaxation on the non-anticipative constraints in addition
to the existing relaxation methods. [24] considers two classes of large-scale WCDPs
in which the state and action space in each sub-problem also grows exponentially and
uses an ADP technique to approximate the value functions of individual sub-MDPs in
addition to employing a Lagrangian relaxation for the overall problem.
In this thesis, we use Chapter 2 to present the main theoretical work on RMAB and
propose an index-based policy. Chapter 3 and 4 present applied problems that are similar
to RMAB but with some variations, and index-based policies for these problems. Below
is a brief outline of the chapters in the dissertation:
• In Chapter 2, we consider the RMAB with a finite horizon and multiple pulls per
period. Leveraging a Lagrangian relaxation, we approximate the RMAB with a
problem that can be decomposed into a collection of single arm problems. We
then propose an index-based policy that uses optimal solutions of the single arm
problems to index individual arms, and offer a proof that it is asymptotically op-
timal as the number of arms tends to infinity. We also use simulation to show that
this index-based policy performs better than the state-of-art heuristics in various
problem settings.
• In Chapter 3, we consider the problem of multiple comparisons with a known stan-
dard, in which we wish to allocate simulation effort efficiently across a finite num-
ber of simulated systems, to determine which systems have mean performance
5
exceeding a known threshold. We suppose that parallel computing resources are
available, and that we are given a fixed simulation budget. We consider this prob-
lem in a Bayesian setting, and formulate it as a stochastic dynamic program. The
set-up of the problem is the same as a RMAB except that every simulated sys-
tem (the arms) is allowed to have multiple computing resources (the pulls) in a
time period. For simplicity, we focus on Bernoulli sampling, with a linear loss
function. Using links to restless multi-armed bandits, we provide a computation-
ally tractable upper bound on the value of the Bayes-optimal policy, and an index
policy motivated by these upper bounds. This chapter has been published in the
proceedings of the 2014 Winter Simulation Conference.
• In Chapter 4, we consider effort allocation in crowdsourcing, where we wish to as-
sign labeling tasks to imperfect homogeneous crowd workers to maximize overall
accuracy in a continuous-time Bayesian setting, subject to budget and time con-
straints. The Bayes-optimal policy for this problem is the solution to a partially
observable Markov decision process, but the curse of dimensionality renders the
computation infeasible. Based on the Lagrangian Relaxation technique in [1], we
provide a computationally tractable instance-specific upper bound on the value of
this Bayes-optimal policy, which can in turn be used to bound the optimality gap
of any other sub-optimal policy. In an approach similar in spirit to the Whittle
index for restless multi-armed bandits, we provide an index policy for effort allo-
cation in crowdsourcing and demonstrate numerically that it outperforms the state
of the art and is near-optimal. This chapter has been published in the proceedings
of the 2016 Artificial Intelligence and Statistics Conference.
6
CHAPTER 2
AN ASYMPTOTICALLY OPTIMAL INDEX POLICY FOR RMAB
Chapter 2 presents the major theoretical work of this dissertation. In this chapter,
Section 2.1 formulates the problem, Section 2.2 discusses the Lagrangian relaxation
of the problem, Section 2.3 states our index-based policy and provides computational
methods, Section 2.4 gives a proof of asymptotic optimality, Section 2.5 numerically
evaluates our index policy, and Section 7 concludes.
2.1 Problem Description and Notation
We consider an MDP (SK ,AK ,P,R) which is created by a collection of K sub-processes
(S,A, P, r). The sub-processes are independent of each other except that the actions
taken by each sub-process have to jointly satisfy some constraints at each time step.
These sub-processes are also referred to as arms in the bandit literature and shall be
indexed by x ∈ {1, ...,K}. Following a standard construction for MDPs, both the larger
joint MDP and the sub-processes will be constructed on the same measurable space
(Ω,F). Random variables on this measurable space will correspond to states, actions,
rewards, and each policy will induce a probability measure over this space.
We describe the MDP to consider formally as comprising:
• The time horizon T < ∞.
• The state space SK is the cross product of K sub-processes’ state space S, which
is assumed to be finite. We use s = (s1, ..., sK) to denote an element in SK and S
when the state is random. We also use St to emphasize that the state is at time t.
Likewise, we use s to denote an element in S, and S or S t when it is random.
7
• The action space AK is the cross product of K sub-processes’ action space A =
{0, 1}. We use a to denote a generic element of A, and A when it is random. We
use a = (a1, a2, ..., aK) to denote a generic element in AK and A when it is random.
In the context of bandit problems, a = 1 is called “pulling” an arm (sub-process).
• ∑The reward function Rt : SK × AK →7 R for each 1 ≤ t ≤ T . Rt(s, a) =K
x=1 rt(sx, ax), where rt(sx, ax) is the reward obtained by a sub-process when ac-
tion ax is taken in state sx at time t. We assume rewards are non-negative and
finite. ∏
• The transition kernel Pa(s′, s) = K ax ′x=1 P (sx, sx), where Pa(s′, s) is the probabil-
ity of a sub-process transitioning from s′ to s if action a is taken, i.e., P(s|s′, a).
The product implies that the K sub-processes evolve independently. RMAB differ
from MAB in that MABs require P0(s, s) = 1 while RMABs allows P0(s, s) < 1.
Since we are considering both cases, we do not restrict the value of P0(s, s).
Next we describe the set of policies for our MDP problem. Since the state and
action space defined above are finite, it is sufficient to consider the set of Markov
policies Π [40]. Define a policy π ∈ Π as a function SK × AK × {1, ...,T } → [0, 1]
that determines th∑e probability of choosing action a in state s at time t. Subse-
quently we have Ka∈AK π(s, a, t) = 1, ∀s ∈ S ,∀1 ≤ t ≤ T . A policy π and the
transition kernel P·(·, ·) together defines a probability distribution Pπ on all possible
paths of the process {s1a1...sT : s K Kt ∈ S , at ∈ A }. Starting at a fixed state s1, i.e.,
Pπ(S1 = s1) = 1, we have the conditional distributions of St and At defined recursively
by Pπ(S = s′t+1 |St = s,At = a) = Pa(s, s′) and Pπ(At = a|St = s) = π(s, a, t).
The MDP we are considering allows exactly m sub-processes to be set active at each
time step. Hence a feasible policy, π ∈ Π, has to satisfy Pπ(|At| = m) = 1, ∀t ∈ {1, ...,T }.
Here we use | · | as an operator that sums all the elements in a vector.
8
The objective of our MDP is:
maximize Eπ ∑T ( )Rt St,At ∈ 

π Π
t=1 (2.1)
subject to Pπ(|At| = mt) = 1, ∀1 ≤ t ≤ T.
Since we will discuss other MDPs in the process of solving this one, (2.1) will be re-
ferred to as the original MDP in the rest of the chapter to avoid confusion. For conve-
nience, we summarize our notation in Appendix A.1.
The original MDP (2.1) suffers from the “curse of dimensionality”, and hence solv-
ing it is computationally intractable. In the remainder of the chapter we build a compu-
tationally feasible index-based heuristics with a performance guarantee.
2.2 Lagrangian Relaxation and Upper Bounds
In this section we discuss the Lagrangian relaxation of the original MDP and the cor-
responding single process problem. These single process problems together with the
Lagrange multipliers form the building blocks of our index-based policy, which will
be formally introduced in Section 2.3. The Lagrangian relaxation considers an uncon-
strained problem whose objective is obtained by augmenting the objective of (2.1):∑Tπ ( )P(λ) = maxE R S ,A  t t t − Eπ ∑T λt(|At| − m ∈ t) , (2.2)π Π
t=1 t=1
for any λ = {λ1, ..., λT } ∈ RT . This unconstrained problem has the following property:
Lemma 1. For any λ ∈ RT , P(λ) is an upper bound to the optimal value of the original
MDP.
[1] gave a proof to Lemma 1 using the Bellman equation. We provide a more
9
straightforward proof by viewing P(λ) as the Lagrange dual function of a relaxed prob-
lem of the original MDP; see Appendix A.2.
This Lagrangian relaxation then decomposes into K smaller MDPs, which we can
easily solve to optimality. To elaborate on this idea of decomposition, we construct a
sub-MDP problem based on tuple (S,A, P·(·, ·), r(·, ·)). Again we consider only the set
of Markov policies, Π, for this problem. Similarly a policy π ∈ Π is a function that
determines the probability of choosing action a in state s at time t, i.e., π : S × A ×
{1, ...,T } → [0, 1]. The sub-MDP starts at a fixed state s1. Subsequently we can define
distributions of S πt and At under P in a similar manner as we did for St and At in the
previous section. The objective of the sub-MDP is:
π ∑T − Q(λ) = maxE rt(S t, At) λtAt . (2.3)
π∈Π
t=1
We are now ready to present the decomposition of the Lagrangian relaxation.
Lemma 2. The optimal value of the relaxed problem satisfies
∑
P(λ) = KQ(λ) + mtλt, (2.4)
t
[1] also gave a proof to Lemma 2, and we again provide a different proof in Appendix
A.3. Since the state space of the sub-MDP is much smaller, we can solve it directly by
using backward induction and the optimality equation. The existence of such an optimal
Markov deterministic policy follows from that the state and action spaces of the sub-
MDP being finite [40]. Let Π∗(λ) be the set of optimal Markov deterministic policies
of the sub-MDP for a given λ. The relaxed problem can be solved by combining the
solutions of individual sub-MDPs, that is,∏we can construct an optimal policy of the
relaxed problem πλ by setting πλ(s, a, t) = K λ λx=1 π (sx, ax, t), where π is an element in
Π∗(λ). Moreover, P(λ) is convex and piecewise linear in λ [1].
10
2.3 An Index Based Heuristic Policy
Our index based heuristic policy assigns an index to each sub-process, based upon its
state and current time. At each time step, we set active the m sub-processes with the
highest indices. Before carrying out the process of sequential decision-making, our
index policy calls for pre-computation of 1) λ∗ ∈ arg infλ P(λ), as defined in Section 2.2;
2) a set of indices, β, that will later be used for decision-making at every time step; 3)
an optimal policy π∗∗ for the sub-MDP problem in (2.3). In the first part of this section
we discuss how we carry out such computations.
2.3.1 Pre-computations
Dual optimal λ∗
We use subgradient descent to solve infλ P(λ), which converges to its solution λ∗ by
convexity of λ 7→ P(λ) (Theorem 7.4 in [42]). By (2.3) and (2.4), a sub-gradient of P(λ)
λ
with respect to λ is given by (−KEπ [A λt] + m : 1 ≤ t ≤ T ), where π is any policy in
Π∗(λ).
To compute this sub-gradient, we compute a policy πλ in Π∗(λ) and then use exact
λ
computation or simulation with a large number of replications to compute Eπ [At]. To
compute a policy in Π∗(λ), we first compute the value function Vλ : S × {1, ...,T } 7→ R
of sub-MDP Q(λ). We accomplish this using backward induction [40]:
Vλ(s, t) = maxa∈A{rT (s, a) − aλT } ∑ if t = T , (2.5)maxa∈A{rt(s, a) − aλ + Pat s′∈S (s, s′)Vλ(s′, t + 1)} otherwise.
Recalling that Π∗(λ) includes only deterministic policies, a policy πλ in Π∗(λ) are
constructed by determining for each s and t the action a whose one-step lookahead value
11
∑
rt(s, a)−aλ at+ s′∈S P (s, s′)Vλ(s′, t+1) is equal to Vλ(s, t), and then setting πλ(s, a, t) = 1
for this a. For those s and t for which both actions a have one-step lookahead values
equal to Vλ(s, t), one may set πλ(s, a, t) = 1 for either such action. Thus, the cardinality
of Π∗(λ) is 2 raised to the power of the number of s, t for which the one-step lookahead
values for playing and not playing are tied.
When we construct a policy in Π∗(λ) for the purpose of computing a sub-gradient
of P(λ), we choose to play in those s, t with tied one-step lookahead values. While
our subgradient descent algorithm would converge for other choices, making this choice
better supports computation of indices in section 2.3.1.
Indices βt(s)
Define the vector v[a, t] to be v + (a − vt) ∗ et, that is, the vector v with the tth element
replaced by a ∈ R. We define the index of state s ∈ S at time t as
βt(s) = sup{β : ∃ π ∈ Π∗(λ∗[β, t]) s.t. π(s, 1, t) = 1}. (2.6)
Instead of computing the entire set Π∗(λ∗[β, t]), we only need to compute a policy in
Π∗(λ∗[β, t]) using the method discussed in section 2.3.1, i.e., always choose the active
action when there are ties. Intuitively, this index is the maximum price we are willing
to pay to set a sub-process active in state s at t. By leveraging the monotonicity of
optimal actions with respect to rewards, as shown in Lemma 13 in Appendix A.7, we
compute βt(s) via bisection search in the interval [0,U], where U upper bounds the
largest possible value of βt(s). For example, we can set U as T ∗ maxs,a,t rt(s, a) when
λ∗ ≥ 0 (we show in Appendix A.6 that βt(s) cannot be greater than this value in this
case). We pre-compute the set β = {βt(s) : s ∈ S, 1 ≤ t ≤ T } before running the actual
algorithm.
12
Occupation measure ρ∗ and its corresponding optimal policy π**
Our tie-breaking policy involves constructing an optimal Markov policy π∗∗ for the sub-
∗ π∗∗MDP Q(λ ) such that E [At] = mK , ∀1 ≤ t ≤ T . The existence of π∗∗ is shown in
Appendix A.5. To compute π∗∗, we borrow the idea of the occupation measure [18].
Define the occupation measure, ρ(s, a, t), induced by a policy π to be the probability of
being in state s and taking action a at time t under π. Subsequently π∗∗ can be computed
by solving the following linear program (LP):
∑T ∑∑
max ρ(s, a, t)rt(s, a)
{ρ(s,a,t):y∈S,a∈A,t∈{1,...,T }} ∑t=1 a∈A s∈S m
subject to ρ(s, 1 t, t) = ,∀t = 1. . . . ,T
∑∈ Ks S ∑∑
ρ(s, a, t) − ρ(s′, a, t − 1)Pa(s′, s) = 0,
a∈A a∈A s′∈ (2.7)S
∑ ∀s ∈ S, 2 ≤ t ≤ T
ρ(s, a, 1) = 1(s = s1) ∀s ∈ S
a∈A
ρ(s, a, t) ≥ 0, ∀s ∈ S, a ∈ A, t = 1, . . . ,T,
∗∗
The first constraint ensures that Eπ [At] = mK . The second constraint ensures flow bal-
ance. The third∑constraint shows that we start at state s1. The second and third constraint
together imply a∈A,s∈S ρ(s, a, t) = 1, i.e., that (ρ(s, a, t) : a ∈ A, s ∈ S) is a probability
distribution for each t. The fourth and fifth constraints ensure that ρ is a valid probability
measure.
Let ρ∗ be an optimal solution to (2.7). π∗∗ can then be constructed by
∑ ρ∗ (s,a,t)
∑
 ∗ , if ρ
∗(s, a, t) > 0
 a∈A
ρ (s,a,t) ∑ a∈Aπ∗∗(s, a, t) =
1(a = 1), if
∗
∑a∈A ρ (s, a, t) = 0 and β (s) ≥ λ
∗
t (2.8)

t
1(a = 0), if ∗a∈A ρ (s, a, t) = 0 and βt(s) < λ∗t ,
13
for all s ∈ S, a ∈ A, 1 ≤ t ≤ T.
Here we also make an observation thatλ∗ ∈ arg min P(λ∗) is the optimal dual variable
corresponding to the first constraint in 2.7.
2.3.2 Index policy
Let {βt(S t,x) : x ∈ {1, ...,K}} be the indices associated with the K sub-processes at time t.
We define β̄t(St) to be the largest value β in {βt(S t,x) : x ∈ {1, ...,K}} such that at least m
sub-processes have indices of at least β. Our index policy then sets the actions of sub-
processes with indices strictly greater than β̄t to 1 (active), and those with indices strictly
less than β̄t to 0 (inactive). When more than m sub-processes have indices greater than
or equal to β̄t, a tie-breaking a rule is needed. For simplicity, in the following discussion
we use the term remaining resources to refer to the remaining number of the arms to
be set active after we activate all the arms with indices greater than β̄t(St), and use
It = {S t,x : 1 ≤ x ≤ K, βt(S t,x) = β̄t(St)} to denote the set of states occupied by the
sub-processes with tied indices. Our tie-breaking rule allocates the remaining resources
across It according to the probability distribution induced by π∗∗ over S at time t. More
specifically, we allocate
 ∗
∑ρ (S t,x,1,t)
∑ ∗ ′
 s′∈ ∗ ′I ρ (s ,1,t)
, if s′∈I ρ (s , 1, t) > 0
t tqt(S t,x) =  (2.9)∑ Nt(S t,x)′ N (s′) , otherwises ∈It t
fraction of the remaining resources to each of the state in It, where Nt(s) denote the
number of sub-processes in state s at time t.
We then use the function Rounding(total, frac, avail) in Algorithm 2 to deal with
situations where the products between the desired fractions and the remaining resources
14
are not integers. Here total represents the number of remaining resources, f rac is a
vector of the fractions of the remaining resources to be approximated allocated to each
tied state, and avail is a vector of the number of sub-processes in each tied state. The
function also allows the number of sub-processes in a tied state s to be less than the
number of resources we would like to assign to s according to the fraction in (2.9). We
note the following property of this function Rounding, which we will rely on in our
proof in Section 6.
Remark 1. When total, avail, frac satisfy availi ≥ total ∗ fraci, the output vector b =
Rounding(total, frac, avail) satisfies |bi − total ∗ fraci| < 1 for all i.
This tie-breaking ensures asymptotic optimality of the index policy as it enforces that
the fraction of sub-processes in each state s is equal to the distribution induced by π∗∗
in the limit. This idea shall become clear in Section 2.4 where the proof of asymptotic
optimality is presented.
We formally present our index policy in Algorithms 1 and 2.
2.4 Proof of Asymptotic Optimality
Our index policy π̂ achieves asymptotic optimality when we let the number of sub-
processes K go to infinity, while holding α mtt = K constant for all t. Let Z(π,m,K)
to denote the expected reward of the original MDP obtained by policy π with K sub-
processes and m = (m1, ...,mT ) constraints at each time. We use Πm,K to denote the set
of all feasible Markov policies for the such an MDP. Lastly, it should be understood that
whenever we use π̂ to denote our index policy there is a dependency of π̂ on m and K
that is not explicitly stated. We are now ready to state the main result of this chapter,
15
Algorithm 1: Index Policy π̂
Pre-compute: λ∗; β; ρ∗. (Refer to section 2.3.1 for computational details)
for t = 1, ...,T do
Let βt,[i] be the ith largest element in the list βt(S t,1), ..., βt(S t,K), so βt,[1] ≥ . . . ≥
βt,[K].
Let β̄t = βt,[m]
Let It = {s : βt(s) = β̄t and s = S t,x for some x}
Let Nt(s) = |{x : S t,x = s}|, for all s.
For s ∈ It, let

∑ ρ∗(s,1,t) ∑, if ρ∗ ′ ′ ∗(s′ 1 t) s′∈I (s , 1, t) > 0ρ , , tqt(s) = 
s ∈It
∑ Nt(s)
∑ ′s′∈It Nt(s ) , otherwise
Let b = Rounding(m − ′s′:βt(s′)>β̄ Nt(s ), (qt(s) : s ∈ It), (Nt(s) : s ∈ It))t
for all s do
If βt(s) > β̄t, set all Nt(s) sub-processes in s active.
If βt(s) = β̄t, set b(s) sub-processes in s active.
If βt(s) < β̄t, set 0 sub-processes in s active.
end for
end for
which shows that the per arm gap between the upper bound and the index policy goes to
zero under the limit assumption.:
Theorem 1. For any α ∈((0, 1)
T , )
1
lim Z(π̂, bαKc,K) − max Z(π, bαKc,K) = 0, (2.10)
K→∞ K π∈ΠbαKc,K
where bαKc = (bα1Kc, ..., bαT Kc)
16
Algorithm 2: Rounding(total, frac, avail)
∑
Inputs: total (a scalar), frac (a vector∑satisfying i fraci = 1), avail (a vector of the
same length as frac satisfying total ≤ i availi) ∑
Output: b (a vector of the same length as the inputs satisfying i bi = total, bi ≤ availi)
Let n = length(frac)
Let bi = min{availi, btotal ∗ fracic}, for i = 1, ..., n.
Let j = 1 ∑
while total > ni=1 bi do
Let b j = b j + 1(avail j > b j)
Let j = ( j mod n) + 1
end while
return b
We first point out that the optimal solutions of maxπ∈Πb Kc K Z(π, bαKc,K) are trivialα ,
when α = 0 or 1. So we do not include these two cases when we consider convergence of
the index policy π̂. To formalize the notations that will be used throughout the proofs, we
augment P(λ) to P(λ,m,K) to indicate the values of m and K assumed in the Lagrangian
relaxation. We use λ∗ to denote one and any element in arg infλ P(K,αK, λ) and let π∗∗
π∗∗be the optimal policy constructed in (2.8) using mt = αtK, which satisfies E (At) = αt
for all t. Note λ∗ and π∗∗ depend on only α (not on K).
As before, we let Nt(s) be the number of sub-processes in state s at time t under π̂.
We additionally define Mt(s) to be the number of sub-processes in state s at time t that
are set active by π̂. These quantities depend on K and m, but for simplicity we do not
include this dependence in the notation. We always assume m = bαKc and we rely on
context to make clear the value of K assumed. We also define Vt(s) to be the set of states
with the same index value as s, including s, and Ut(s) to be the set of states with index
17
value greater than that of s, for each time t. These quantities depend on α but not on K
or m.
We prove Theorem 1 by first demonstrating below in Theorem 2 that for each time
t, the proportion of the sub-processes that are in state s under our index policy π̂, Nt(s)K ,
approaches Pt(s) as K → ∞. In other words, our index policy π̂ recreates the behavior
of π∗∗ in the large K limit.
Theorem 2. For every s ∈ S and 1 ≤ t ≤ T ,
Nt(s)lim = Pt(s), Pπ̂ − a.s., (2.11)
K→∞ K
and
M (s)
lim t = P (s) ∗ π∗∗t (s, 1, t), Pπ̂ − a.s., (2.12)
K→∞ K
Before proving Theorem 2, we first present two intermediate results, whose proofs
are given in Appendix A.8 and A.9.
Lemma 3. At time 1 ≤ t ≤ T , for all s ∈ S , we have
(1) If βt(s) > λ∗t , then π
∗∗(s, 1, t) = 1.
(2) If βt(s) < λ∗t , then π
∗∗(s, 1, t) = 0.
Lemma 4. For any state s ∈ S and time 1 ≤ t ≤ T ,
∑
(1) If αt −∑s′∈U (s)∪V (s) Pt(s
′) ≥ 0, then π∗∗(s, 1, t) = 1.
t t
(2) If αt − ′s′∈U (s) Pt(s ) ≤ 0, then π∗∗(s, 1, t) = 0.t
Now we are ready to prove Theorem 2.
18
Proof. We prove (2.11) and (2.12) simultaneously via induction over the time periods.
When t = 1, all sub-processes starts in state s1, and we have
N1(s) Klim = lim = 1 = P1(s) if s = s1,
K→∞ K K→∞ K
N1(s) 0lim = lim = 0 = P1(s) otherwise.
K→∞ K K→∞ K
By the set-up of the original MDP, M1(s) = bα ∗ Kc, and we have
M1(s) bα ∗ Kclim = lim = α = π∗∗(s, 1, t) = P1(s) ∗ π∗∗(s, 1, t), if s = s1,
K→∞ K K→∞ K
M (s) 0
lim 1 = = 0 = P (s) ∗ π∗∗1 (s, 1, t), otherwise,
K→∞ K K
so we have proved the base case of the induction.
Now assume (2.11) and (2.12) hold up until time t. Fix a state s ∈ S and time
1 ≤ t ≤ T , define Y ′t(s , s) to be the number of sub-processes set active by π̂ in s′ at time
t which transition to state s at time t + 1, and Xt(s′, s) to be the number of sub-processes
set inactive by π̂ in s′ at time t which transition to s at time t + 1. Note that Y ′t(s , s) and
X ′t(s , s) also depend on K. We can subsequently express Nt+1(s) as∑
N ′ ′t+1(s) = Yt(s , s) + Xt(s , s).
s′∈S
Dividing both sides by K, and taking K to a limit, we get
Nt+1(s) ∑ 1 ∑ 1lim = lim Y ′t(s , s) + lim Xt(s′, s). (2.13)
K→∞ K K→∞ ′ K K→∞∈ ′∈ Ks S s S
Note Y (s′, s) is a binomial random variable with M (s′t t ) trials and success probability
P1(s′, s). Similarly, X ′t(s , s) is a binomial random variable with Nt(s′)−Mt(s′) trials and
success probability P0(s′, s). We can rewrite the RHS of (2.13) by applying Lemma 14,
19
which is stated in Appendix∑A.10:N ′
lim t+1
(s) Mt(s )
= lim ∗ P1(s′, s) (2.14)
K→∞ K ∑ K→∞ Ks′∈S N (s′t ) − M ′lim t(s )+ ∗ P0(s′, s)
s∑′ K→∞ K x∈S
= Pt(s′) ∗ π∗∗(s′, 1, t + 1) ∗ P1(s′, s) (2.15)
s′∈S∑
+ Pt(s′)(1 − π∗∗(s′, 1, t + 1)) ∗ P0x(s′, s) a.s.
s′∈S
= Pt+1(s). a.s. (2.16)
The last equality follows as we have exhausted all the ways of getting to s at time t + 1.
Hence we have shown (2.11) holds for time t + 1.
Next we show (2.12) holds for time t + 1. We define sets Pt = {Pt(s) : s ∈ S}, and
Nt = {Nt(s) : s ∈ S}. Recall Vt(s) is the set of states with the same index value as s,
including∑s, and Ut(s) is the set of states with index value greater than that of s, we let
N+(s) = N (s′
∑
t s′∈U (s) t ) and N
=
t (s) =
′
s′∈V (s) Nt(s ). We use notation
Nt
K for the set whicht t
consists of all elements in Nt divided by K. Lastly we use the function fs(Nt, bαtKc) to
represent the number of sub-processes set active at time t in state s:
fs(Nt, bαtKc) =1([bαtKc−N+t (s)]+≥N=t (s)) ∗ Nt(s)
+1([bα Kc−N+(s)]+<N=t (s)) ∗ bs(Nt, bαtKc), (2.17)t t
where bs(Nt, bαtKc) is number of sub-processes to set active as output by Rounding in
Algorithm 2. The first indicator represents situations where tie-breaking is not needed
and all the sub-processes are set active. The second indicator represents situations where
tie-breaking is needed and is determined by the function Rounding, and situations where
no tie-breaking is needed and no sub-process is set active.
To facilitate our proof, we consider a continuous version of the function Rounding
20
and denote it as Rounding-c. Rounding-c first distributes min{total∗fraci, availi}, instead
of min{btotal ∗ fracic, availi} in Rounding, to each state, and uses a fluid way to distribute
the difference between total and the amount distributed initially. Rounding-c is given
with full detail in Algorithm 3. We use b̄s(Nt, bαtKc) to denote the output of Rounding-c.
Algorithm 3: Rounding-c(total, frac, avail)
∑
Inputs: total (a scalar), frac (a vector∑satisfying i fraci = 1), avail (a vector of the
same length as frac satisfying total ≤ i availi) ∑
Output: b (a vector of the same length as the inputs satisfying i bi = total, bi ≤ availi)
Let n = length(frac)
Let bi = min{availi, total ∗ fraci}, for i = 1, ..., n.
Let L = {i : 1 ∑≤ i ≤ n, bi < availi}
while total > ni=1 bi do ∑
Let t = max{t ≥ 0 : b + ti |L| ≤ availi, ∀i ∈ L and
n
i=1(b +
t
i |L| ) ≤ total}
Let b = {b ti + 1(i∈L) |L| : 1 ≤ i ≤ n}
Let L = {i : 1 ≤ i ≤ n, bi < availi}
end while
return b
Moreover, we use
f̄s(Nt, bαtKc) =1([bα + =tKc−N + ∗ Nt (s)] ≥Nt (s)) t(s)
+1([bα Kc−N+t (s)]+<N=(s)) ∗ b̄s(Nt, bαtKc), (2.18)t t
to denote the number of sub-processes set active in state s at time t according to this
continuous tie-breaking rule Rounding-c.
This proof will be accomplished by the following three lemmas, whose proofs are
21
given in Appendices A.11, A.12, A.13
Lemma 5. ∣∣∣∣∣ ( )∣∣∣ N bαf (N b Kc) − K f̄ t tKc) ∣s t, α , ∣t s ∣∣ ≤ 2, ∀1 ≤ t ≤ TK K
.
Lemma 6. ( )
N bα Kc
lim f̄ t ts , = f̄ (P , α ), a.s, ∀1 ≤ t ≤ T.
K→∞ K K s t t
Lemma 7.
f̄s(Pt, αt) = Pt(s)π∗∗(s, 1, t), ∀1 ≤ t ≤ T.
Combining the three lemmas above we have( )
Mt+1(s) fs(Nt+1, bαlim lim t+1Kc) Nt+1 bαt+1Kc= = lim f̄s , = Pt+1(s)π∗∗(s, 1, t + 1),
K→∞ K K→∞ K K→∞ K K

Finally, we prove Theorem 1 by leveraging the results from Theorem 2.
Proof of Theorem 1. π̂ ∈ ΠbαKc,K implies Z(π̂, bαKc,K) ≤ maxπ∈Πb Kc K Z(π, bαKc,K).α ,
Thus,
1 1
lim Z(π̂, bαKc,K) ≤ lim sup Z(π, bαKc,K).
K→∞ K K→∞ K π∈ΠbαKc,K
22
On the other hand, ∑T ∑ 1 1
lim Z(π̂, bαKc,K) = lim Eπ̂  rt(s, 1)Mt(s) + rt(s, 0)(Nt(s) − Mt(s))K→∞ K ∑K→∞∑K t=1 s∈ST 1
= r π̂t(s, 1) lim E [M (s)] +
K→∞ K t
t=1 s∈S
1
∑rt∑(s, 0) lim E
π̂ [N
→∞ K t
(s) − Mt(s)]
[ KT ]
= rt(s, 1)ρ(s, 1, t) + rt(s, 0)ρ(s, 0, t)
∑t=1T ∑s∈S [ ]
= rt(s, 1)ρ(s, 1, t) + rt(s, 0)ρ(s, 0, t)
t=1 s∈S
− π∗∗ ∑ E λt (A − α )∑ t t t
=Q(λ∗) + λ∗tαt
1 ∑
= lim (KQ(λ∗) + bαKc λ∗)
K→∞ K t
1
= lim P(λ∗, bαKc,K)
K→∞ K
≥ 1lim sup Z(π, bαKc,K).
K→∞ K π∈ΠbαKc,K
Here, the third line follows by Theorem 2 and the fact that both Nt(s) and Mt(s) are
bounded and hence uniformly integrable random variables (for uniformly integrable
random variables, convergence almost surely implies convergence in expectation). The
fourth line holds because π∗∗ takes the active action at each time with probability α. The
fifth line follows from Lemma 2, where we have augmented the notation for P to include
the values of m and K assumed. The sixth line follows from Lemma 1.
Finally, sandwiching the two inequalities gives the desired result.  
23
2.5 Numerical Experiments
In this section we present numerical experiments for two problems: the finite-horizon
multi-arm bandit with multiple pulls per period,and subset selection [9, 33]. These
experiments demonstrate numerically that our index policy is indeed asymptotically op-
timal. We also compare the finite-time performance of our policy to other policies from
the literature. Although our previously provided theoretical results do not apply to finite
K, we see that our index policy performs strictly better than all benchmarks considered
in both of the problems.
2.5.1 Multi-armed bandit
In our first experiment, we consider a Bernoulli multi-armed bandit problem with a finite
time horizon T = 6, and multiple pulls per time period. A player is presented with K
arms and may select m = bK/3c of them to pull at every time st. Each arm pulled returns
a reward of 0 or 1. The player’s goal is to maximize her total expected reward. We
assume that each arm returns i.i.d rewards according to a Bernoulli distribution with an
unknown rate of success θx. We take a Bayesian approach and impose a Beta(1,1) prior
on each of the θx. Note that the posterior distributions of θx are still going to be beta-
distributed. The values of the state then correspond to the posterior parameters of the K
arms.
For comparison, we include results from an upper confidence bound (UCB) algo-
rithm with pre-trained confidence width. At every time step, we compute µi + α ∗ δi for
each arm i, where µi and δi are the sample mean and standard deviation of arm i. We
pre-train α by running the UCB algorithm on a different set of data (but simulated with
24
1.26
1.24
1.22
1.20
1.18
1.16 Upper bound
Index Policy
UCB
1.14
101 102 103 104
K
Figure 2.1: Upper bound and simulation results of MAB
the same distribution) with values of α ranging from 0 to 5 with a step size of 0.1 and
then set α to the value that gives the best performance.
Figure 2.5.1 plots the reward per arm (expected total reward divided by K) against K,
for K = 12, 120, 1200, 12000. The red dashed line represents the upper bound computed
using P(λ∗). For each policy, circles show the sample mean of the total reward per arm,
and vertical bars indicate a 95% confidence interval for the expected total reward per
arm. The UCB policy’s sample means are connected by a dashed green line, and the
index policy’s are connected by a dashed black line. Values are calculated using 5000
replications.
The index policy consistently outperforms the UCB policy. As K grows large, the
confidence interval for the index policy’s total performance per arm overlaps with the
upper bound, which numerically attests to the accuracy of Theorem 1 and illustrates the
rate of convergence.
25
reward per arm
2.5.2 Project assignment problem
In our second experiment, we consider the following finite-horizon restless bandit prob-
lem: suppose a team has K ongoing projects and m = bK/5c engineers who can work
on any single project at a time. The manager of the team decides which projects to pri-
oritize on a weekly basis. The state space of each project is the set of positive integers.
Every project starts at the state of 1. The state of a project can either remain unchanged
or can increment by 1 as we move to the next week. When an engineer works on a
project in state s for a week, it moves to the next state with probability p1(s); otherwise,
it transitions to the next state with probability p0(s). Each project can be worked on
by at most 1 engineer. When a transition happens, the team collects a reward which is
a function of the current state. All the states have the same reward 1 except the final
state which has a reward of 50. The manager’s goal is to maximize the total expected
reward over a horizon of T = 5 weeks. Here we set p1(s) = 0.9 − 0.6s/T for s < T , and
p1(s) = 0.05 for s = T . We set p2 = 0.16p1.
For comparison we use Whittle’s index policy [48] and a randomized policy that
selects m projects randomly at every time step. To apply Whittle’s index policy which
is for infinite horizon setting, we convert the problem to a infinite horizon problem by
adding an absorbing state for time T onwards, and make all the state at time T transit to
this absorbing state with probability 1.
Figure 2.5.2 presents the outcome of the experiment at K = 10, 100, 1000, 10000.
Again we plot the number of projects against the per arm reward. We use red dashed
line to represent the upper bound. The green dashed line shows the result of the ran-
domized policy, which perform significantly worse than the other two. The index policy
is represented by the black dashed line and the Whittle’s index policy is represented by
the blue dashed line. The index policy out-performs the Whittle’s policy and converges
26
1.40
1.38
1.36
1.34
1.32
Upper bound
Index Policy
Randomized policy
Whittle’s index policy
1.30
101 102 103 104
K
Figure 2.2: Upper bound and simulation result of project assignment problem
to the upper bound. Whittle’s policy is sub-optimal as it considers average case over
time. Hence it tends to allocate more resources to projects which are in state 5 which
has a very low probability of getting a large reward, while the optimal thing to do is to
allocate resource to arms that are that a state with lower value but has a much higher
probability of obtaining some reward.
2.5.3 Subset selection problem
In the third experiment, we consider a subset selection problem in ranking and selection
whose goal is to identify m best designs out of K designs, each with some underlying
distribution θx. This problem is considered in [9] as well as [33]. We assume m̄ parallel
computing resources are available; at each time step we select m̄ out of K design to
evaluate. After T rounds of evaluation, we select m best designs. In this numerical
study, we set T = 4, mK = 0.3 and
m̄
K = 0.5. We consider the situation when the outcomes
of evaluation are binary, but note that our model can handle any real-valued outcomes.
27
reward per arm
Below is how we formulate this problem as an RMAB:∑T+1
π  ( )maximize E Rt St,At ∈ 

π Π
t=1
(2.19)
subject to Pπ(|At| = m) = 1, for t = T + 1,
Pπ(|At| = m̄) = 1, for 1 ≤ t ≤ T,∑
where Rt(St,At) = 0 when t ≤ T , and R Kt(St,At) = x=1 E[θx|St,x] when t = T + 1. We
start with a uniform prior for each design.
We compare the performance of our policy against the OCBA-m selection procedure
proposed in [9]. Since [9] considers a slightly different setting in which a policy maker
can evaluate a design more than once in a time step, we modify the procedure slightly
to fit our setup: instead of sampling according to the number of times dictated by the
algorithm, we rank the designs by their desired number of samples, and simulate the
first m̄ of them. Moreover, we assign a positive sample and a negative sample to each
of the design at the beginning of the simulation so that it starts with the same amount
of information as our Bayesian setup. We also use the UCB policy as a standard of
comparison. The implementation of the UCB policy is similar to the one in section
2.5.1.
The simulation results show that all the three policies perform similarly when K
is small, with the OCBA-m policy having a slight edge for K = 10. From K = 100
onwards, the index policy consistently outperforms the other two. In addition, the gap
between the upper bound and the index policy vanishes as K becomes large, while the
gaps between the upper bound and the other two policies remain constant.
28
0.234
0.232
0.230
0.228
0.226
0.224
0.222 Upper bound
Index Policy
0.220 Ucb Policy
OCBA-m Policy
0.218
101 102 103 104 105
K
Figure 2.3: Upper bound and simulation result of subset selection
2.6 Conclusion
In this chapter we propose an index-based policy for finite horizon RMAB problems
that is computational tractable, and prove that it is asymptotically optimal in the same
limit as considered by Whittle. We also show that the numerical performance of this
index-based policy beats the state-of-art. For future work, we conjecture that our results,
including the formulation of the policy and the asymptotic optimality, can be extended
to the following situations:
1. Multiple actions associated with a state, instead of an active and a passive action
in the current formulation;
2. A total budget constraint over the entire time horizon, in addition to a budget
constraint at every time step.
3. Infinite state space.
29
reward per arm
CHAPTER 3
PARALLEL BAYESIAN POLICIES FOR FINITE MULTIPLE COMPARISONS
WITH A KNOWN STANDARD
In this chapter, we consider an applied problem in simulation optimization: the multi-
ple comparisons with a known standard (MCS). The problem is similar to an RMAB
problem considered in Chapter 2 except that it allows multiple pulls per arm at a time.
In this problem, one wishes to use simulation to determine, for each among a finite
pool of simulating systems, which ones have an expected output measure that exceeds a
known threshold. In this chapter, we use Bayesian statistics and dynamic programming
to study how one should allocate simulate effort in the MCS problem, so as to best sup-
port this final determination. This chapter is organized as follows: In Section 3.1, we do
a light motivation and literature review; In Section 3.2 we formally state the problem;
In Section 3.3, we provide a computationally tractable upper bound on the value of a
Bayes-optimal procedure, and a method for computing this upper bound; In Section 3.4
we present a heuristic motivated by this upper bound, which is similar to the index-based
policy proposed in Chapter 2 for RMAB, with the exception that it allows assignment of
multiple resources to a system at a time; In Section 3.5 we present numerical results in
which we demonstrate that the index policy performs close to an optimal policy; Lastly
we present our numerical results and conclusion in Sections 3.5 and 3.6.
3.1 Introduction
The MCS problem arises in at least two distinct ways in simulation applications. First, it
arises when determining which options perform better than some standard option whose
performance is so well-estimated that it can be treated as known [36]. Second, it arises
30
when determining which options have a secondary performance measure that satisfies
a constraint [2]. The MCS problem also arises outside of simulation, in crowdsourcing
service centers like Amazon’s Mechanical Turk, when allocating a budget across work-
ers who label items (e.g., images, documents), to best support accurate classification of
these items [12].
We consider a variant of the MCS problem in which parallel computing resources
are available. The growing availability of parallel computing resources presents new op-
portunities to perform simulation analysis at larger scales, but also imposes constraints
on the way simulation effort is allocated. In our model, simulation effort is allocated
batch-sequentially: simulations are performed in batches, and we decide how many ad-
ditional replications to perform for each system at the start of each batch based on the
results of previous batches. We are given a fixed budget, specified as a number of par-
allel computing resources and a number of batches, and our goal is to allocate these
batches of simulation efficiently, so as to best allow correct classification of the systems
once our simulation budget is exhausted.
We formulate the MCS problem in a Bayesian framework, and we measure the per-
formance of a batch-sequential procedure by its average case performance, averaging
across problem instances drawn from the prior and across simulation noise. While the
Bayes-optimal procedure is characterized by the dynamic programming equations [19],
the curse of dimensionality makes solving this dynamic program computationally in-
tractable for problems with many systems.
Rather than solving this dynamic program exactly, we provide a computationally
tractable upper bound on its value. This allows us to evaluate the quality of sub-optimal
heuristic policies relative to this upper bound. This provides guidance to the develop-
ment and improvement of heuristic policies, in the form of information about the opti-
31
mality gap. The analysis technique used in this upper bound is a Lagrangian relaxation
on the total number of simulations performed in any given batch. Using the Lagrange
multipliers obtained from this relaxation, we also develop a heuristic policy, and use nu-
merical experiments to demonstrate that it performs close to the upper bound on optimal
in the problem setting studied.
This chapter builds on the previous work [49], which also considered the Bayesian
MCS problem. That chapter considered the sequential setting, without parallel re-
sources, and provided a computationally efficient method for computing the Bayes-
optimal policy under two assumptions about limitations of sampling: that there is a time
horizon that is random and exponentially distributed; or there is no time horizon, and we
pay a fixed cost for each sample. Our current work differs from that work by considering
parallelism, and by considering a fixed budget. While the infinite-horizon fixed-cost-
per-sample model in [49] is quite natural for cloud computing settings, and the random
exponentially distributed horizon is attractive for its computational tractability, using a
fixed horizon is more natural than either model in [49] when allocating computing re-
sources that are owned rather than rented. While we focus on the parallel setting and
[49] focused on the sequential setting, our work can also provide an upper bound for the
sequential setting with fixed horizon by setting the number of parallel nodes to 1.
For simplicity in this chapter, we consider only Bernoulli samples, with a linear loss
function. However, the techniques developed in this chapter should also be adaptable to
other parametric sampling distributions with conjugate priors, and other loss functions.
Our model assumes synchronous computations, in which we wait for all simulations
in a batch to complete before starting the next batch. This approach is reasonable when
the variability in the time to simulate a system is small enough to allow waiting until
simulations finish before starting the next batch. Such assumptions are more commonly
32
met in controlled high-performance computing environments, and are less common in
cloud computing environments. If computation time is highly variable, it may be more
appropriate to model computation as asynchronous.
Our Lagrangian relaxation of a budget constraint in the MCS problem is related
to Lagrangian relaxations in restless multi-armed bandit problems [48, 23]. Indeed, if
one added an additional constraint that each system can be simulated at most once in
any batch, then our MCS problem can be reformulated as a restless multi-armed bandit
problem where we can pull multiple arms in each round: in this restless multi-armed
bandit problem, each system is an arm, pulling the arm corresponds to simulating that
system; the reward from this pull is the improvement in our ability to classify this sys-
tem; and the number of arms we can pull in a round is the number of parallel resources.
The restless multi-armed bandit framework of [48] is most well-known for allowing
bandit arms to evolve when they are not pulled, but also allows multiple pulls per round.
Our analysis can be seen as examining a generalization of restless bandits in which each
arm can be pulled multiple times per round, and our heuristic policy can be seen as a
generalization of the Whittle index policy to this setting.
Our use of a Lagrangian relaxation to study the Bayesian formulation of the MCS
problem is also similar to [50], which used a Lagrangian relaxation to bound the value
of the Bayes-optimal procedure for the ranking and selection problem.
While we consider the MCS problem in the Bayesian setting, much of the previous
work on the MCS problem has considered non-Bayesian settings. This previous work
includes the one-stage procedures by [37, 16], the two-stage procedures by [15, 7, 13],
and work on indifference-zone ranking and selection (see the survey [32]). The more
recent frequentist work includes [5] which provides a fully sequential procedure under
stochastic constraint, and [26] which further allows correlation across systems.
33
In the version of the MCS problem that we consider, we emphasize that our standard
has known value, and we seek to determine only whether each system is better or worse
than this standard. We do not consider standards with unknown value, produce joint
confidence intervals, nor select the best among those systems performing better than
standard. This is in contrast with much previous work on multiple comparisons [36,
31]. The variant of the MCS problem that we consider has also been called feasibility
determination [44].
3.2 Problem Formulation
We have k systems, each of which can be simulated using a stochastic simulation. When
we simulate system x ∈ {1, . . . , k}, we observe a Bernoulli(θx) random variable, indicat-
ing the system’s performance in that simulation. We will think of an outcome of 1 as
indicating the system succeeded in that simulation, and 0 as indicating failure. Although
we expect that the methods we develop in this paper can be extended to simulations that
generate non-Bernoulli samples, we focus on the Bernoulli setting here for simplicity.
The sampling means θx are initially unknown, and we wish to determine through sim-
ulation, for each system x, whether θx is greater than some known threshold dx. This
threshold may differ across systems.
We adopt a Bayesian formulation, in which we seek to do well on average with
respect to a prior probability distribution over the unknown sampling means θ1, . . . , θk.
We consider Bayesian prior probability distributions under which
θx ∼ Beta(α0,x, β0,x),
with independence across x, for some given values α0,x, β0,x. We assume a Beta prior
for tractability: the Beta prior is conjugate to the Bernoulli likelihood [14], and so this
34
assumption allows our posterior distribution to remain Beta-distributed.
We perform N batches of simulations, performing at most m simulations in parallel
in each batch. At the start of each batch n = 1, . . . ,N, we choose the nu∑mber of samples
zn,x to take from each system x, making sure to satisfy the constraint x zn,x ≤ m. This
choice of zn,x may depend upon the results observed from all previous batches. We then
observe the number of successes from each system x, which are conditionally binomial,
Yn,x|θx, zn,x ∼ Binomial(zn,x, θx).
We assume conditional independence of Yn,x across x and from all previous samples,
given θx and zn,x. This precludes the use of common random numbers, but is satisfied
when using independent sampling. After observing this Yn,x, we update our posterior
distribution on θx to obtain [14]
θx|z1,x,Y1,x, . . . , zn,x,Yn,x ∼ Beta(αn,x, βn,x),
∑
where αn,x = α0,x + n′≤n Yn′,x ca∑n be interpreted as the effective number of successes
from system x, and βn,x = β0,x + n′≤n(zn′,x − Yn′,x) as the effective number of failures.
This posterior is independent across x.
We use dynamic programming to analyze this problem. To support this, for
each x and n, we define the state variable S n,x = (αn,x, βn,x). To streamline the
discussion, we define αn = (αn,1, ..., αn,k), βn = (βn,1, .., βn,k), Sn = (S n,1, ..., S n,k),
{a(nd zn = (zn,1, . . . , zn,k). Let Λn be th)e space in whic∑h Sn takes value}s, Λn =
α0,1 + s1, β0,1 + f1, . . . , α0,k + sk, β0,k + fk : sx, f kx ∈ Z+∀x, x=1 sx + fx ≤ mn .
We stop sampling after batch N and decide, for each system x, whether to label θx as
above or below the threshold dx. If we label it as above, we receive a reward of θx − dx.
Otherwise, we receive dx − θx. The total terminal reward is the sum of these rewards
across the systems. As shown in [49], to maximize the conditional expected value of this
35
reward given what we know after our simulations are complete (which is summarized
in S N,x), we sh[ould choose]to receive θx − d whenever the conditional expected value of
this reward E θx − dx|S N,x is positive, and to receive dx − θx when it is negative. When
making decisions in this way, the conditional ∣expected termina∣ l re∣ward received is{ [ − | ] [ − | ]} ∣∣∣∣ [ − | ] ∣ ∣∣ ∣
∣
α ∣
max E θx dx S N,x , E d S
∣ ∣ N,xθ ∣x x N,x = E θx dx S N,x ∣∣ = ∣∣ − dx∣ .αN,x + βN,x ∣
Summing this reward across systems x, our conditional expected terminal reward is
∑k
r(SN) = ∣∣ ∣∣∣∣∣ α ∣N,x − ∣d ∣x∣ .α
x=1 N,x
+ βN,x ∣
We optionally allow a cost cx ≥ 0 for each sample generated for system x, in the
same units as the objective function. For example, if the reward is monetary, then the
cost might be the payment made to a cloud computing service such as Amazon Ec2 or
Microsoft Azure for the computer time required to run a simulation. If the computers
are owned, rather than rented, then it may be appropriate to set cx = 0. Combining
this optional sampling cost with the reward r(SN) gives the conditional expected overall
reward ∑k ∑N
r(SN) − cxzn,x.
x=1 n=1
Our goal is to find an algorithm or policy for choosing the samples to take, zn,x, so
as to maximize the expected value of this reward. A policy is a rule for choosing how to
allocate the next batch of samples, based on the results of the previous samples as sum-
marized by Sn. Formally, a policy π is a sequence of mappings π = (π0, . . . , πN−1),
where π kn : Λn →7 Z+ maps the state Sn to the action zn = πn(Sn), while satisfy-
ing th{ e constraint on the ∑number of samples in each batch. The set} of all policies is
Π = π = (π0, . . . , πN−1) : kx=1 πn,x(S) ≤ m ∀ n = 0, . . . ,N − 1,S ∈ Λn , where πnx(S) in-
dicate the xth component of πn(S).
36
Each policy π induces a probability distribution over (S0, z0, . . . ,SN−1, zN−1,SN),
which we call Pπ. We let Eπ indicate the expectation taken with respect to this probabil-
ity distribution. We define,
 ∑k ∑N Vπ πn (S) = E r(SN) − cxzn′,x | Sn = S ,
x=1 n′=n+1
for 0 ≤ n ≤ N, which is the conditional expectation of the future reward under policy
π, starting from state S at time n. We also define the value function, used below in our
dynamic programming approach, as
Vn(S) = sup Vπn (S).
π∈Π
The (unconditional) expected reward obtained under π is Vπ0 (S0), and an optimal policy
π∗ is any for which π∗ ∈ arg max ππ∈Π V0 (S0).
The problem sup Vππ∈Π 0 (S0) is a Markov decision process with finite horizon and
finite state space, and its solution is characterized by the dynamic programming equa-
tions. To apply dynamic programming, we first write down the dynamic programming
equation, which is a recursiv∑e relation for Vn:∑  

Vn(Sn) = max − cxzn+1,x + E[V

n+1(Sn+1)|Sn, zn+1] , 0 ≤ n ≤ N − 1.zn+1: x zn+1,x≤m  x
∑∣∣∣ ∣∣∣ (3.1a)k αV (S N,xN N) = ∣ − dx∣. (3.1b)
α
x=1 N,x
+ βN,x
Any policy whose actions achieve the maximum in (3.1a) is optimal [17].
We can use these recursive equations to compute Vn directly, first calculating VN(SN)
for all possible SN ∈ ΛN , and then proceeding in a backward recursion, using previously
computed values of Vn+1(Sn+1) to calculate Vn(Sn) for all Sn ∈ Λn. Then, given these
computed value functions, we can compute an optimal policy.
While this direct dynamic programming approach is theoretically well understand, it
quickly becomes computationally intractable as k grows. This is because the state space
37
at time n, Λ , has O((mn)2kn ) elements, and so storing the value function at all possible
states has a memory requirement that scales exponentially in k. Computation also scales
exponentially in k. This computational infeasibility due to the large dimension of the
problem is generally referred to as the ”curse of dimensionality” [38].
The computationally intractability of computing an optimal policy when k is large
leads us to consider other characterizations that can be computed more easily. In the next
section, we show how to compute an upper bound on the value of the optimal policy that
scales linearly in k, rather than exponentially.
3.3 Upper Bound
In this section, we provide a computationally tractable upper bound on the value of an
optimal policy. This bound can be used to calculate an optimality gap for any desired
heuristic policy π, by comparing the upper bound to an estimate of the heuristic policy’s
value Vπ(S0) obtained from direct simulation. This in turn can be used to judge whether a
particular heuristic policy is good enough to be used in practice, or if more development
(either of better heuristics or tighter upper bounds) would be worthwhile.
∑
The main idea in our upper bound is to relax the constraints kx=1 zn,x ≤ m with a
Lagrange multiplier. As the first step, we introduce values ∑λ = (λ1, ..., λ kN) ∈ R+. The
value λn will be the Lagrange multiplier for the constraint kx=1 zn,x ≤ m. For each λ,
define a modified value function Vλn (Sn) via the following recursion:
{ ∑ [ ] (∑k )}
Vλn (Sn) = max − c z +E Vλx n+1,x n+1(Sn+1)|Sn, zn+1 −λn+1 zn+1,x−m , n ≤ N−1,
zn+1∈{0,1,...,m}k x
∑∣ ∣ x=1∣∣ ∣∣ (3.2a)k αVλ (S ) ∣ N,xN N = − dx∣. (3.2b)αN,x + βx=1 N,x
38
We will see below in Lemma 9 that Vλ0 (S0) provides an upper bound on the value func-
tion V0(S0), and hence on the value of an optimal policy.
While computing Vλ0 (S0) directly using the recursion (3.2) would also seem to re-
quire storing a value for every state in the state space, just as the recursion (3.1), we
will see below in Lemma 8 that it can be computed instead as the sum of other modified
value functions, each of which corresponds to a single system, and which can be com-
puted by considering a much smaller state space whose size does not grow with k. This
will allow efficient computation.
Toward this end, we define the function Vλn,x for each x via the recursive relation{ [ ∣ ]}
Vλ
∣
n,x(S n,x) = max − z λ ∣n+1,x(cx + λn+1) + E Vz ∈{0 m} n+1,x(S n+1,x)∣S n,x, zn+1,x , n ≤ N − 1,n+1,x ,...,
λ ∣∣∣ ∣∣∣ (3.3a)∣ αV (S ) N,xN,x N,x = − dx∣. (3.3b)αN,x + βN,x
Vλn,x(S n,x) is the value function for a dynamic program corresponding to an MCS
problem with a single system x, with a new sampling cost cx + λn+1 for samples in
batch n. The cost of sampling depends on the batch n. The additional cost λn+1
beyond the cost cx in our original model arises from the way in which we relaxed
the constraint on the number of samples in each batch in the multi-system problem.
Vλn,x(S n,x{)( can be computed) directly using this recurs}ive relation, because the space
Λn,x = α0,x + sx, β0,x + fx : sx, fx ∈ Z+, sx + fx ≤ mn in which S n,x takes values has
only O(nm) elements, and does not grow exponentially in the problem parameters.
The following lemma shows that Vλn (Sn) can be computed directly from Vλn,x(S n,x),
allowing its efficient computation.
39
Lemma 8. For any λ ≥ 0, ∑k ∑N
Vλn (S
λ
n) = Vn,x(S n,x) + m λn′ , (3.4)
x=1 n′=n+1
Proof. This proof uses an inductive argument. At time N, (4.14) holds from (3.3b) and
(3.2b). Assume (4.14){hol∑ds for time n +[1∑. Then,k k ∑N ∣∣ ]
Vλn (Sn) = max − cxzn+1,x + E Vλn+1,x(S n+1,x) + m λn′ ∣∣Sn, zn+1
zn+1∈{1,..,m}k ∑x=1k } x=1 n′=n+2
− λn+1( zn+1,x − m) (3.5)
{x=1∑k ( [ ∣∣∣ ]) ∑k }= max − cxz λn+1,x + E Vn+1,x(S n+1,x)∣S n,x, zn+1,x − λn+1 zn+1,x
zn k+1∈{1,...,m} x=1 x=1
∑ (3.6)N
+ m λn′ (3.7)
∑ n′=n+1k { [ ∣ ] }
= max − ∣c λ ∣xzn+1,x + E Vn+1,x(S n,x)∣S n+1,x, zn+1,x − λn+1,xzn+1,xz
x n+1,x
∈{1,...,m}
=1 ∑N
+ m λn′ (3.8)
∑ n′=n+1k ∑N
= Vλn,x(S n,x) + m λn′ . (3.9)
x=1 n′=n+1
Equation (3.5) follows from the inductive hypothesis. Equality (3.7) holds due to the fact
that each system is independent from the current action and states of the other systems.
Equality (3.8) holds because the xth summand in the summation of (3.7) only depends
on zn+1,x; to maximize the sum is to maximize each of the summands by choosing the
right zn+1,x. 
Setting n = 0 in the above lemma, we obtain a readily computed expression for
Vλ0 (S0), ∑k ∑N
Vλ λ0 (S0) = V0,x(S 0,x) + m λn. (3.10)
x=1 n=1
40
The following lemma shows that this is an upper bound on the value function for our
original MCS problem.
Lemma 9. For any λ ≥ 0,
Vλ0 (S0) ≥ V0(S0). (3.11)
Proof. We first prove
Vλn (Sn) ≥ Vn(Sn), (3.12)
for all n such that 0 ≤ n ≤ N using an inductive argument. Then (3.11) holds automat-
ically. At time N, inequality (3.12)∑follows from equation (3.1b) and (3.2b). Assume
(3.12) holds at time n + 1. Noticing kx=1 zn+1,x − m ≤ 0 as stated in the problem formu-
lation, we have ∑k ∑k
Vλn (Sn) = max {− cxzn+1,x + E[Vλn+1(Sn+1)|Sn, zn+1] − λn+1( zn+1,x − m)}
zn+1∈{1,..,m}k ∑x=1 x=1k
≥ max {− cxzn+1,x + E[Vλn+1(Sn+1)|Sn, zn+1]}
zn+1∈{1,..,m}k ∑x=1k
≥ max {− cxzn+1,x + E[Vn+1(Sn+1)|Sn, zn+1]}
zn+1∈{1,..,m}k x=1 ∑k
≥ max∑ {− cxzn+1,x + E[Vn+1(Sn+1)|Sn, zn+1]} = Vn(Sn).
zn 1∈{1,..,m}k ,s.t. k+ x=1 zn+1,x≤m x=1
Now we have that for any λ ≥ 0, Vλ0 (S0) forms an upper bound on the optimal total
expected reward of the original problem. We then obtain the tightest upper bound of this
form by selecting the infimum.
Theorem 3. [∑k ∑N ]
UB(S0) = inf Vλ≥0 0,x(S 0,x) + m λn (3.13)λ
x=1 n=1
gives an upper bound on V0(S0).
Proof. This result follows directly from (3.10) and Lemma 9. 
41
While we have argued that Vλ0 (S0) can be computed efficiently as the sum of val-
ues from small dynamic programs, each corresponding to a single sy∑stem. We now
sh∑ow how the infimum in UB(S k λ0) can be computed. Define B(S0, λ) = x=1 V0,x(S 0,x) +
m Nn=1 λn. To compute the upper bound in equation (3.13), we first show that B(S0, λ)
is convex in λ and subsequently (3.13) is a convex optimization problem. Moreover, it
is possible to compute the subgradient of λ 7→ B(S0, λ), allowing the use of a first-order
convex optimization method.
Lemma 10. B(S0, λ) is convex in λ for any λ ≥ 0.
∑
Proof. Since m Nn=1 λn is convex in λ, it is sufficient to show V
λ
0,x(S 0,x) is convex in λ
for all x. This shall be shown by induction. At time N, Vλ αN,xN,x(S N,x) = | − d | isαN,x+βN x,x
constant thus convex in λ. Assume the result holds true for Vλn 1(S n+1,x), then same is+
true for E[Vλn 1(S n+1,x)|S n,x, zn+1,x] for any fixed S+ n,x and zn+1,x. −zn+1,x(cx + λn+1), which
is linear in λn+1, is also convex in λ. Since the maximum of convex functions is still
convex, Vλn,x(S n,x) = maxzn 1 x{−zn+1,x(cx + λn+1) +E[Vλn 1(S n+1,x)|S n,x, zn+1,x]} is convex in+ , +
λ. Subsequently we have Vλ0,x(S 0,x) is also convex through an inductive argument. 
To allow the use of a first-order convex optimization method for solving (3.13),
which are generally faster than derivative-free methods, we much provide a method
for computing the subgradient of B(S0, λ) with respect to λ. Since B(S0, λ) is the sum
of Vλ0,x(S 0,x) and a term that is linear in λ, it is sufficient to compute the subgradient
Vλ0,x(S 0,x). The next lemma provides an expression for this subgradient.
Before presenting this lemma, we first introduce some notation. Define π∗x(λ) to
be an optimal policy obtained by solving the single-system MCS problem for system
x, that is, the optimal policy for the dynamic program (3.3). Let rn,x(S n−1,x, zn,x; λn) =
−zn,x(cx + λn) be the reward collected in state S n−1,x by taking action zn,x for the single-
42
system∑MCS problem in (3.3). Let r λN+1,x(S N,x) be the terminal reward. Then V0,x(S 0,x) =∗
Eπx(λ)[ Nn=1 rn,x(S n−1,x, zn,x; λn) + rN+1,x(S N,x)|S 0,x].
Lemma 11. Fix any λ, and any corresponding optimal policy π∗x(λ). Then, the vector( ∗ )
g(S 0,x;λ) = −Eπx(λ)[zn,x|S 0,x] : n = 1, . . . ,N (3.14)
is a subgradient of λ 7→ Vλ0,x(S 0,x) at λ.
Proof. Let λ′ ∈ RN+ . Consider the dynamic program (3.3) with this value λ′, which has
′
value Vλ0,x(S 0,x). Since π
∗
x(λ) is a feasible policy for this dynamic program, we have
[∑N′ ∗ ]
Vλ (S 0 ) ≥ Eπx(λ) ′0,x ,x rn,x(S n−1,x, zn,x; λn) + rN+1,x(S N,x)|S 0,x
∑n=1N ∑N
∗ [ ] ∗
= Eπx(λ) rn,x(S ′ π (λ)n−1 x,x, zn,x; λn) + rN+1,x(S N,x)|S 0,x − (λn − λn)E [zn,x|S 0,x]
n=1 n=1
= Vλ ′0,x(S 0,x) + (λ − λ) · g(S 0,x, λ)
Thus g(S 0,x;λ) is a subgradient of Vλ0,x(S 0,x). 
We can compute (3.14) recursively using the Markov property. Recall π∗(λ) is an
optimal policy for the single-system problem (3.3) when given λ, and let z∗n (S ) be+1,x n,x
∗
the action taken in state S n,x at time n as dictated by π∗(λ). Let P(s, n) = Pπ (λ)[S n,x =
s|S 0,x]. We can then write the subgradient (3.1∑4) as
− π∗E x(λ)[z ∗ ′ ′n,x|S 0,x = sx] = − zn(s )P(s , n − 1). (3.15)
s′∈Λn−1,x
P(s, n) can then be computed recursively as
P(s n) 
1(s=S 0 x), if n = 0,,
, = ∑P[S 1,x = s|S 0,x, z
∗
1(S 0,x)], if n = 1,
 ′ ∗ ′ ′s′∈Λ − P[S n,x = s|Sn 1,x n−1,x = s , zn(s )] · P(s , n − 1), if n > 1.
43
Hence we can compute∑(3.14). Finally, since B(S0, λ) is the sum of the single-system
values Vλ N0,x(S 0,x) and m n=1 λn, we have the following subgradient of B(S0, λ):
∑k ∂Vλ0,x(S 0,x)
+ m ∈ ∂B(S0, λ) (3.16)
∂λ
x=1 n
Equation (3.16) allows us to compute the upper bound (3.13) by first-order convex opti-
mization.
3.4 Index Policy
In this section, we describe an index-based policy based on the same decomposition used
to develop the upper bound. The intuition behind this policy is based on an unproven
conjecture, but the policy is well-defined whether or not this conjecture is true. We
demonstrate in numerical experiments in Section 2.5 that this policy performs well.
This index-based policy considers the relaxed problem (3.2) with a Lagrange mul-
tiplier λ = λe, where e = (1, . . . , 1) is the vector of all 1s and λ is a real number. This
index-based policy is based on the intuition that, for any state S n,x, as we increase λ
and thus λ we should see that an optimal policy corresponding to λ should take fewer
samples, because the samples are more expensive. While we conjecture that this is true,
and our numerical experiments support it, we have not confirmed it theoretically.
To calculate the number of samples taken in a given state, our index-based policy
varies λ until we find a value in which an optimal policy for the relaxed problem takes
m samples (or, if no such λ exists, it should take as many samples as possible without
taking more than m). We then sample according to the optimal policy for this λ. When
we get a new state, we repeat this process, finding a new λ vector, and a new sampling
allocation.
44
We define this index-based policy more formally as follows. We first introduce some
notation. Let Qλ be the set of single-arm policies that are optimal for (3.3) at the given
value of λ. Let zπn,x(S n−1,x) be the number of samples taken under a single-arm policy π
at time n in state S n−1,x.
At each time step n = 1, . . . ,N, this policy computes zn based on Sn−1 in the follow-
ing way:
{ }
1. Let zλ π λn,x(S n−1,x) ∈ zn,x(S n−1,x) : π ∈ Q be the number of samples taken under an
optimal single-arm policy with the given set of Lagrange multipliers λ, breaking
ties arbitrarily.
2. Let λ∗
{ ∑ }
= inf λ : x zλn,x(S n−1,x) ≤ m, λ = λe .
3. Set λ∗ = λ∗e.
∗ ∑
4. Let z πn,x ∈ {zn,x(S ) : π ∈ Qλn−1,x }, so as to satisfy kx=1 zn,x ≤ m, breaking ties
arbitrarily between different allocations zn that satisfy this constraint.
This index-based policy leaves free the tie-breaking rule used when choosing
zλn,x(S n−1,x), and also when choosing zn,x in the final step. We conjecture that the first
tie-breaking rule has no impact on the value for λ∗, although we have not confirmed
this theoretically. We conjecture that the best tie-breaking rule used∑to choose zn,x in the
final step is one that minimizes the num∑ber of unused cores, m − kx=1 zn,x, and that it
is always possible to find one with m = kx=1 zn,x, but again we have not confirmed this
theoretically.
Figure 3.1 illustrates the above procedure with two systems and two parallel re-
sources. Figure 3.1(a) and 3.1(b) show how the optimal number of samples varies with
the value of λ for each system, given their current states (2, 2) and (3, 3). Figure 3.1(c)
45
3 3
2 2
1 1
0 0
−1 −1
0 0.1 0.2 0 0.1 0.2
λ λ
(a) (b)
5
4
3
2
1
0
−1
0 0.1 0.2
λ
(c)
Figure 3.1: This figure illustrates how zn is chosen by the index-based policy with
k = 2 systems, m = 2 parallel computing resources, and N = 20
simulation batches. Figure (a) plots zλ2,2, the optimal number of sam-
ples to take in batch 3 from system 1 when it is in state (2,2), against
the value of λ; Figure (b) plots zλ3,3, the optimal number of samples
to take in batch 3 from system 2 when it is in state (3,3). Figure (c)
plots zλ2,2 +z
λ
3,3, the optimal total number of samples to take across both
systems. The dashed line in (c) shows the constraint m = 2, and λ∗
will be the left endpoint of the solid line overlapping this dashed line.
∗
The number of samples taken from each system will be zλ2,2 = 1 and
∗
zλ3,3 = 1 respectively.
shows the optimal total number of samples across both systems. Since there are two
parallel resources, the constraint is m = 2. Hence in this case λ∗ is the left end point of
the interval at height zλ2,2 + z
λ
3,3 = 2.
46
zλ
2,2
zλ +zλ
3,3 2,2
zλ
3,3
3.5 Numerical Results
In this section, we present numerical results illustrating the upper bound and the index-
based policy, as well as the baseline equal allocation policy.
We consider 4 different value of k: k = 2, 4, 8, 16. For each value of k, we set the
time horizon N = 5, the threshold value dx = 0.2 for all x ∈ {1, ..., k}, and the number
of parallel computing resources m = k. We set the initial state to be the same for every
system: S 0,x = (1, 1). We first calculated the upper bounds according to Theorem 3
for each value of k. The squared dots in Figure 3.2 represents the values of the upper
bounds (shown on a scale in which we divide by k). We then simulated the index-
based policy described in Section 4.6 for 10000 iterations respectively for k = 2, 4, 8, 16
respectively. The thinner lines in Figure 3.2 show 95% confidence intervals for the mean
performance of this policy. As a baseline, we also simulate the equal allocation policy in
which the m parallel computing resources are distributed equally to the k systems. The
equal allocation policy is simulated for 50, 000 replications for each value of k, and 95%
confidence intervals for the mean performance are shown as the thicker lines in Figure
3.2.
Confidence intervals for the index policy are wider than those for the equal allocation
policy because fewer samples are taken when estimating the expected value of the index
policy. This is because each simulation of the index policy takes a substantial amount of
time in our current implementation. For the same reason, the index policy’s confidence
intervals still overlap the upper bounds. If we took more samples, we expect that the
upper limit of the confidence interval would eventually fall below the upper bound,
because we do not think that our policy is optimal. Nevertheless, our results show that
the index policy performs substantially better than the equal allocation policy in the
47
0.334  
Upper Bound
Index Policy
0.332 Equal Allocation Policy
0.33
0.328
0.326
0.324
0.322
0.32
0.318  
0 2 4 6 8 10 12 14 16 18
Number of Alternatives (k)
Figure 3.2: This figure shows the upper bound on the performance of the optimal
policy for the MCS problem (dashed line with squares) normalized
by dividing by k, as well as the estimated performance of two sub-
optimal policies: the index policy from Section 4.6 (thinner lines and
dots); and the equal allocation policy (thicker lines and dots). The
setting pictured uses m = k, dx = 0.2, α0x = β0x = 1, cx = 0. We
use 10,000 independent replications to estimate the value of the index
policy, and 50,000 for the equal allocation policy. The plot shows that
the index policy is substantially better than equal allocation, and is
statistically indistinguishable from optimal given the number of repli-
cations performed.
setting studied.
In the figure, the upper bound, divided by k, initially increases, and then levels out.
This can be understood as follows. Because our initial state S 0,x[is identical for e∑ach sys]-
tem, our upper bound 3.13 can be rewritten as UB(S0) = inf λ Nλ≥0 kV0,x(S 0,x) +[ k n=1 λn .
w∑here x] is any arbitrary x. Dividing by k provides 1k UB(S0) = infλ≥0 Vλ0,x(S 0,x) +N
n=1 λn .
Vλ0,x does not depend on k directly, but it does depend on m which is the constraint
on the maximum number of samples that can be taken from system x in any batch, and
48
Expected Total Reward / k
we set m = k in our experiments. Thus, when we increase k, we loosen this constraint.
We believe this is why 1k UB(S0) increases initially with k. Then, as k grows large, this
constraint is no longer binding, as the optimal value of λ causes us to take less than
m = k samples in each batch. We believe this is why 1k UB(S0) levels off as k becomes
large.
3.6 Conclusion
We offered a computationally feasible way to obtain the upper bound on the total ex-
pected reward of the finite-horizon MCS problem through Lagrangian relaxation. We
then proposed an index-based policy using this Lagrangian relaxation. Using the upper
bound as a reference, we showed this index policy performs close to optimal by running
numerical experiments on a specific set of parameters.
49
CHAPTER 4
BAYES-OPTIMAL EFFORT ALLOCATION IN CROWDSOURCING:
BOUNDS AND INDEX POLICIES
In this chapter, we consider another RMAB-like problem: classifications with
crowdsourcing. In this problem, one has a group of classification tasks and wishes to
hire crowd-workers to determine, for each among a finite pool of simulatable systems,
which ones have an expected output measure that exceeds a known threshold. In this
chapter, we use Bayesian statistics and dynamic programming to study how one should
allocate simulate effort in the MCS problem, so as to best support this final determi-
nation. This chapter is organized as follows: In Section 4.1 and 4.2, we introduce the
background of the problem and give a literature review; In Section 4.3 and 4.4, we for-
mally state the problem and formulate the problem as a dynamic program; In Section
4.5, we provide a computationally tractable upper bound on the value of a Bayes-optimal
procedure; In Section 4.6 we present an index-based heuristic policy derived from the
computation of the upper bound, and which is similar to the general index-based policy
proposed in Chapter 2; In Section 4.7, we present numerical results in which we demon-
strate that this index-based policy performs close to an optimal policy and outperforms
the state-of-art; Lastly we make a conclusion in Sections 4.8.
4.1 Introduction
Crowdsourcing can accomplish large-volume tasks such as image classification or doc-
ument relevance assessment by using large pool of amateur workers at much less ex-
pense than is possible by hiring experts or by developing an automatic machine learn-
ing method [29]. Moreover, online platforms such as Amazon Mechanical Turk make
crowdsourcing service widely accessible by providing a marketplace in which requesters
50
may post tasks, which crowd-workers may complete in exchange for money. These fac-
tors are making crowdsourcing increasingly important.
Although crowdsourcing is less expensive than hiring experts, the number of images
or other tasks that a requester can correctly label or process is nonetheless limited by his
or her budget. This fact is compounded by the noise and variability inherent to crowd-
workers’ responses, which typically requires a single item to be processed independently
several times by multiple workers.
In this part of the dissertation, our goal is to find a sequential allocation of workers
to tasks that most accurately supports a correct aggregated label for each task, subject to
a limited budget (which in turn limits the number of workers that a requester can hire)
and a limited time horizon. In this chapter we focus on binary labeling tasks, but our
approach can also be extended to multi-class labeling.
Intuitively, much can be accomplished through a sophisticated allocation of worker
effort: When budgets are large relative to the overall difficulty of the tasks to be ac-
complished, a good scheme should allocating more workers to those tasks that are more
difficult, so that uniform quality can be ensured. When budgets are small, however,
those most difficult tasks should be abandoned so that the bulk of the budget can be
used to ensure that at least those easy tasks are done correctly.
We adopt a Bayesian approach, which is natural in crowdsourcing because: 1) It
allows us to leverage prior information about the tasks to be accomplished, which may
be learned in the crowdsourcing setting from features associated with each task and the
typically large collections of historical data collected in previous crowdsourcing cam-
paigns; 2) It seeks to maximize average-case performance with respect to the prior distri-
bution, which is natural in crowdsourcing where requesters typically tolerate some vari-
51
ability in quality, and are most interested in maximizing aggregate performance across
a large volume of tasks, rather than ensuring robustness to some worst-case distribution
over task characteristics, or studying asymptotic behaviors that do not become relevant
until the number of workers working on each task grows large.
Within this Bayesian framework, we formulate and study sequential effort allocation
as a partially observable Markov decision process, using tools from dynamic program-
ming. While the curse of dimensionality [38] prevents solving this dynamic program
to optimality, we provide a computationally tractable upper bound on the expected per-
formance under any Bayes-optimal effort allocation policy. Upper bounds are useful
because they allow evaluating the optimality gap for any given heuristic on any prob-
lem instance, simply by simulating the heuristic and comparing its performance to the
bound. The technique we use to obtain such upper bound is the Lagrangian Relaxation
on weakly coupled dynamic programs discussed in [1] and [25]. The proofs we present
in Section 4.5 are very similar in spirit to [1], but while Adelman based his proof on the
value functions of the DP formulation in a infinite horizon setting, we offer a proof based
on the initial objective function of the problem in a finite horizon setting. Nonetheless,
our crowdsourcing model is a specific application of the more general formulation in
[1] and [25]. Then, using Lagrange multipliers that appear in this upper bound, we de-
rive an index-based heuristic policy that is similar in spirit to the Gittins index policy
for multi-armed bandits [22] and the Whittle index policy for restless bandits [48]. We
then show that this index policy has performance close to the upper bound in numerical
experiments, and also outperforms other state-of-art policies for resource allocation .
Although the primary novelty and contribution of our work is that it is the first to
characterize the performance of the Bayes-optimal policy for effort allocation in crowd-
sourcing, and to develop Bayesian bandit-style index policies, our work is also novel
52
is modeling the asynchronous nature of crowd-work in a continuous-time setting, in
contrast with previous work on effort allocation in crowdsourcing that assumed instant
completion of tasks [51], [11], [28]. This model is inspired by how crowd-workers are
employed on Amazon Mechanical Turk; allowing an asynchronous process thus gives a
closer proximity to the real situations.
4.2 Related Work
There are two major strands of former works to which our work is related. The first
is the work on effort allocation and crowd labeling. Much of this work adopts a fre-
quentist viewpoint and focuses on error bounds for inference [30, 20, 29, 45, 27]. [30]
proposed an allocation algorithm based on a random graph, and while its performance
asymptotically order-optimal, one needs a very large number of workers to make this
relevant. [45] incorporates a limited budget, but lacks the notion of optimality. None
of the work above considers a finite time horizon. There is also work with more of a
Bayesian flavor([51, 4]). While they focused on the efficiency of allocation, they did not
consider an optimal solution. Among the work that adopt a Bayesian framework, our
work is similar to [10] in that we both form an optimal policy in the form of a stochastic
dynamic program. Although they also provide a well-motivated heuristic policy, our
work pushes further by deriving an upper bound based on this formulation of optimal
policy.
The second strand resides in the literature of Multi-armed bandit (MAB) and
stochastic dynamic programming. The formulation a Bayesian-optimal procedure as
a dynamic program is considered in [34, 35]. Our use of Lagrangian relaxation is an
application of the relaxation method of weakly coupled dynamic program discussed in
53
[1]. The setting in this work differs from the previous works by that only one task is
to be assigned when a worker enters and the completion of task is not instant. The
index-based policy proposed in this work, which uses Lagrangian Multipliers to assign
indices, draws inspiration from [48].
4.3 Problem Statement
We consider a requester of crowdsourcing service with K independent binary labeling
tasks. Due to a budget constraint, the requester allows a maximum of U workers to work
on these tasks, and requires all work to be completed by a time horizon T . We model the
arrival of workers to the crowdsourcing system by a Poisson process with rate r. (Our
model can be generalized to non-homogeneous Poisson processes with little additional
effort.)
As each worker enters the system, the requester selects one of the K tasks for the
worker to label. We let z` ∈ {1, . . . ,K} indicate the task assigned to the `th worker. (We
use [K] to denote {1, . . . ,K} for the rest of the paper.) The worker spends a random
Exponential(µ) amount of time on the task x, independent of all else, and then provides
a binary label y`.
Workers do not always give the correct label because the task may be ambiguous
and thus hard to categorize, or workers may be careless or lack background information
when they conduct the labeling process. We suppose that workers are “homogeneous”
(a term used in [11]), and give noisy but unbiased labels. More specifically, each task x
has an associated unknown value θx ∈ [0, 1], which is the underlying probability that it
will be labeled as positive by a worker. The distribution of the label generated by the `th
54
worker given θ1, ...θK and z` is
y`|θ1:K , z` ∼ Bernoulli(θz ). (4.1)`
We set a known threshold value dx, and consider the label for task x being positive
if θx > dx. We let B = {x : θx > dx} be the set of tasks whose correct label is positive.
Note B is unknown as θx are unknown.
For analytical convenience we use the Beta distribution, which is the conjugate prior
of the Bernoulli distribution, as the prior for each θx independent across all x.
θx ∼ Beta(α0,x, β0,x).
With the assumption of this independent beta prior on each θx, and the conditionally
independent Bernoulli responses as in (4.1), the posterior on θx after some number of
workers have provided responses will remain beta-distributed, with first parameter equal
to the sum of α0,x and the number of positive responses, and the second parameter equal
to the sum of β0,x and the number of negative responses. In practice, one can estimate
appropriate values for the parameters α0,x and β0,x from historical data on tasks pre-
viously labeled by the crowd. We discuss this further in section 4.7 where numerical
experiments are performed.
Note the assumption of a Beta distribution can be relaxed without a great deal of dif-
ficulty, as the posterior distribution will remain in an exponential family parameterized
by the number of positive and negative labels observed for the instance. The assumption
of independence cannot be easily generalized, as it is necessary for the decomposition
in our Lagrangian relaxation, without which the upper bound in Section 4.5 much more
challenging to compute.
Thus, after the worker budget U has been exhausted or the time horizon T has
55
elapsed, the requester will have a posterior distribution on each θx which remains beta-
distributed. Let α′x, β
′
x be the posterior parameter for this time. At this time, we model
the requester as choosing, for each task x, an estimated label based on the responses
of the crowd-workers, and then receiving a reward of 1 for each correctly labeled task,
and 0 for the incorrectly labeled tasks. (Our approach can be easily generalized to other
reward or loss structures that are additive across tasks, and depend only on θx and some
task-specific estimate based on the crowd’s feedback.)
The expected reward under the posterior that the requester will obtain is P(θx >
dx|α′x, β′x), if s/he chooses a positive label, and P(θ ′ ′x < dx|αx, βx) if s/he chooses a neg-
ative label (θx has a density, and so θx = dx with a posterior probability of 0). Thus,
the requester chooses the label giving the larger reward, and achieves a reward whose
expected value under the posterior {is, }
R(α′x, β
′
x) = max P[θ
′
x>dx|αx, β′x],P[θx<dx|α′ , β′x x] ,
Across all tasks, the requester’s expected re∑ward under the posterior isK
R(α′, β′) = R(α′x, β
′
x), (4.2)
x=1
where α′ = (α′x : x ∈ [K]) and similarly for β′.
The goal of the requester is to design a policy to dynamically assign tasks to workers
entering the system so as to maximize the expected reward received, based on the labels
obtained from the crowd-workers.
4.4 Dynamic Programming Formulation
We now formalize the problem statement from Section 4.3 as control of a continuous-
time Markov chain, which can be analyzed through a stochastic dynamic program built
56
on the embedded discrete-time Markov chain. This continuous-time Markov chain
tracks the evolution of worker assignments and posterior distributions on θx that results
from a requester’s dynamic assignment policy.
The state of this continuous-time Markov chain contains:
• length-K vectors α = (α1, . . . , αK) and β = (β1, . . . , βK) that will describe the
posterior distribution on each θx given the labels observed thus far (θx will be
distributed according to Beta(αx, βx) under this posterior).
• a length-K vector w = (w1, ...,wK) that tracks the number of workers currently
working on each task.
• an integer ` that tracks the number of workers that have entered the system and
been assigned to tasks (but not necessarily completed them).
• the time t of the most recent event, either a worker completing a task, or a worker
arriving.
We indicate such a generic state by s = (α,β, t,w, l) and let S = RK × RK × R × NK × N
be the set of possible values this state can take. We let α(s), β(s), t(s), w(s) and `(s) all
indicate the corresponding components of s.
Transitions occur in this Markov chain when workers complete tasks, and when
workers arrive to start work on a task. We use n to count the number transitions (or
“events”), we let S n ∈ S indicate the state just after the nth event, for n ≥ 1. The initial
state is S 0 = (α0, β0, 0,0, 0), where α0 = (α0,x : x ∈ [K]) and β0 = (β0,x : x ∈ [K])
together describe the prior distribution, and 0 is a vector of K zeros.
We let ∆n denote th∑e time duration between event n and n+1, i.e., ∆n = t(S n+1)−t(S n).
Then, ∆n|S n ∼ Exp(µ Kx=1 wx(S n) + r).
57
We define a policy π that controls how the requester assigns incoming workers to
tasks, based on the current state. This policy π will map S n and ∆n onto {0, 1}K , and
π(S n,∆n) will give the number of new workers assigned to each of the K tasks, if the
transition from S n to S n+1 was caused by an arriving worker. Below we will constrain
this to prevent assigning more than one task to a worker, and then later in the Lagrangian
relaxation we will relax this constraint.
Formally, let Π be the set of all measurable functions from S × R to {0, 1}K+ . Then,
let | · | return the sum of individual components of a vector, and define
Π0 = {π ∈ Π : |π(s,∆)| ≤ 1,∀s ∈ S, t ∈ R}, (4.3)
where we have added the additional constraint to Π that at most one task can be assigned
to an incoming worker. Only those π ∈ Π0 will be feasible policies for the problem of
interest, but we will consider the larger set of policies Π to support later theoretical
analysis.
The set of policies Π0 allows not assigning an incoming worker to a task even when
budget or time remains, but we will see below that this will still exhaust one unit of
budget, and so optimal policies (or reasonable heuristics) will always assign incoming
workers to tasks when possible.
Each π ∈ Π defines a discrete time Markov chain (S n : n ∈ {0, 1, . . .}) over the state
space S, whose transition kernel we will indicate by Pπ(s′|s). This transition kernel can
be written as ∫ ∞
Pπ(s′|s) = Pπ(s′|s,∆) exp(−∆q(s)) d∆,
0
where we have defined ∑K
q(s) = µ wx(s) + r.
x=1
58
Thus, to complete the description of this transition kernel, it is sufficient to describe
Pπ(S n+1|S n,∆). For this description we suppose S n = (α,β, t,w, `) and let q = q(S n).
When t + ∆n ≥ T , the system has exceeded its time horizon, all outstanding tasks
on which workers are currently working are canceled, and only the time is updated:
S n+1 = (α,β, t + ∆n,0, `).
When t + ∆n < T , time remains and the next event can be either a worker arrival or
a worker completion. A completion either outputs a positive result or a negative result.
A worker arrives with probability r/q. If ` < U, then the requester allocates this
worker to a task, and the total number of arrivals is incremented: S n+1 = (α,β, t +
∆n,w + π(S n,∆n), ` + 1). If ` ≥ U, then the worker budget has been exceeded, and the
requester cannot allocate the worker, so S n+1 = (α,β, t + ∆n,w,∆n), `).
For each x ∈ [K], a worker completes this task x and reports a positive label with
probability αx µwxq . When this occurs, Sαx+βx n+1 = (α + ex, β, t,w − ex, `).
Similarly, a worker completes task x and reports a negative label with probability
βx µwx . When this occurs, S = (α,β + e , t,w − e , `).
αx+βx q n+1 x x
This completely specifies the transition kernel for the discrete-time Markov chain
that describes the continuous-time dynamics of both worker allocation and the posterior
distribution on each θx.
To model completion, we define SA = {s ∈ S : t(s) ≥ T or (`(s) ≥ U and w(s) = 0)}
to be the set of states in which our time horizon has elapsed, or our worker budget
has been exhausted and all allocated workers have finished their work. We then let
N = inf{n ≥ 0 : S n ∈ SA} be the number of events that occur up to and including the
time when we reach a state in SA. The posterior α(S N), β(S N) is the one with which the
59
requester must make his/her final determination of the task labels, and so the expected
reward under the posterior that s/he receives at time t(S N) is R(α(S N), β(S N)).
Recall that our goal stated in section 4.3 was to find the dynamic allocation policy
π of workers to tasks that maximizes the expected number of correctly classified tasks.
With the definition of this Markov chain in place, this overarching goal may be stated
formally as solving
π [ ]sup E R(α(S N), β(S N)) . (4.4)
π∈Π0
As a stochastic control problem, its solution may be characterized using stochastic
dynamic programming. We define the value function as
[ ]
V(s) = sup Eπ R(α(S N), β(S N))|S 0 = s , (4.5)
π∈Π0
and observe that the value function satisfies the dynamic programming recursion.
First, for s ∈ SA, we have V(s) = R(α(s), β(s)). Then, for s = (α,β, t,w, `) < SA and
q = q(s), we have, If ` < U:
( ) [
V(∫s) = 1 − exp(−q(T − t)) ·T−t
r max V(α,β,t∫+y,w + e , `+ 1)e
−qy
z dy+
∑0 zK ( α T−t
µw xx ∫ V(α+ex, β, t+y,w−e , `)e
−qy
x dy
α + β
x=1 x x 0
β T−tx )]
+ V{ (α, β+}ex, t+y,w−e
−qy
x, `)e dy
αx + βx 0
+exp(−q(T − t)) R(α,β) . (4.6)
60
If ` ≥ U: ( ) [
V(s) =∫1 − exp(−q(T − t)) ·T−t
r V(α,β,t∫+ y,w, `)e
−qydy+
∑ 0K ( α T−t
w xµ x ∫ V(α+ex, β, t+y,w−e
−qy
x, `)e dy
α + β
x=1 x x 0
β T−tx )]
+ V{ (α, β+}ex, t+y,w−e , `)e
−qy
x dy
αx + βx 0
+exp(−q(T − t)) R(α,β) . (4.7)
Moreover, knowing the value function reveals an optimal policy: an optimal policy is
given by choosing the task Z` to assign to the next worker, in response to previous state
S n at time t(S n)+∆n, to achieve the maximum in maxz V(α(S n), β(S n), t(S n)+∆n,w(S n)+
ez, ` + 1).
However, solving this dynamic program is computationally infeasible. For example,
if we discretize the continuous time line to just 1000 intervals, when we have K = 4
tasks, the number of states to consider after l = 20 workers entering the system is
2.26 ∗ 1011, which is too big to compute. Hence we seek to first provide an upper bound
to the optimal value and then use the upper bound as the yardstick to measure how close
a heuristic policy performs to an optimal policy.
4.5 Upper Bound on the Bayes-Optimal Policy
Although solving (4.4) directly using the stochastic dynamic program (4.6),(4.7) is com-
putationally intractable, in this section we show how to obtain a computationally feasible
upper bound on the value (4.4) using a Lagrangian relaxation.
Recall we use n to count events and S n is the state corresponding to the nth event.
Define n` as the number of events that have occurred by the time of the `th arrival (in-
61
clusive), i.e., n` = inf{n : `(S n) = `}. For 1 ≤ ` ≤ U, define a` = π(S n`−1,∆n`−1),
so that a = 1∑if the `th`,x worker is assigned to task x. Therefore Π0 satisfies
Π0 = {π ∈ Π : Pπ( Kx=1 a`,x ≤ 1)= 1 ∀`}. We also define a ne ∑ 
w subset of Π:
  K 
 
Π π1 = π ∈ Π : E a x`,  ≤ 1 ∀` . (4.8)
x=1
Under Π1, we may assign a worker to more than one task along a particular sample path,
as long as the expected number of tasks assigned to each worker is no larger than 1.
Returning to our Markov chain model, we will observe that when a worker is assigned
more than one task, the tasks are completed independently from each other. Observe
that Π1 includes a larger set of policies than Π0, and that Π includes a set that is larger
still, i.e., Π0 ⊆ Π1 ⊆ Π. Our result will use this relation.
To streamline notation, let R = R(α(S N), β(S N)). The optimal reward for the original
crowdsourcing problem (4.4) is then,
R0 = sup Eπ [R] , (4.9)
π∈Π0
and the optimal reward under the larger class of policies Π1 is
R π1 = sup E [R] . (4.10)
π∈Π1
Here we introduce a non-negative vector λ = {λ1, ..., λU} ≥ 0, which we use below
as a Lagrange multiplier within a Lagrangian relaxation.
More specifically, we will relax the constraint that each worker is assigned to at most
one task, but will penalize the number of tasks assigned in a way that ensures that an
upper bound holds regardless of what λ is (as long as it is componentwise-nonnegative).
This will then provide an upper bound on the optimal value of the original problem
R0, which can be made tighter by minimizing over λ. This upper bound can then be
62
computed below via decomposition into K small dynamic programs of fixed dimension
that can be solved efficiently, even as K grows large.
Our upper bound is provided in the following theorem.
Theorem 4. The term
[ ∑U ∑ ] ∑U
inf supEπ R − (λ
≥ `
a`,x) + λ` (4.11)
λ 0 π∈Π `=1 x `=1
forms an upper bound to R0.
Proof.
[ ∑U ∑ ] ∑U
supEπ R − (λ ` a`,x) + λ`π∈Π
= supEπ 
`=1 x `=1
 ∑U  ∑ R − λ`( a`,x − 1)

π∈Π
 ∑
`=1
U  ∑
x
≥ sup Eπ R − λ`( a`,x − 1) 

π∈Π1 `∑=1U  x ∑ 
= sup Eπ [R] − λ` Eπ[ a `,x] − 1
π∈Π1 `=1 x
≥ sup Eπ [R]
π∈Π1
≥ sup Eπ [R] (4.12)
π∈Π0
The∑first inequality is due to Π1 ⊆ Π. The second inequality is because λ ≥ 0 and
Eπ[ x a`,x] ≤ 1 for any π ∈ Π1. The third inequality is due to Π0 ⊆ Π1. Since (4.12)
holds true for any value of λ > 0, we obtain Theorem 4. 
Calculating the supremum term in Theorem 4 directly by dynamic programming is
again computationally infeasible, because the state space of this dynamic program again
is over all of S, which has 3K + 1 dimensions. We avoid this issue by decomposing this
63
supremum term into the sum of the optimal values for K dynamic programs, one for
each task, each of which has a much more manageable 4 dimensions.
To support this decomposition, we write the state S n ∈ S for the whole system (K
tasks) as S n = (S n,1, . . . , S n,K), where S n,x is the state for task x when n events have
occurred, and includes αx, βx, t,wx, and the global counter `. We let S(x) = R × R × R ×
N × N be the set of possible values for this single-task state S n,x.
Following a development identical to that in Section 4.4, but for the single task
x, we may define a space of policies π(x) that map the single-task state S n,x and the
elapsed time since the last event (x)∆n (counting worker arrivals over the whole system,
and completions of task x only) onto a binary decision of whether or not to allocate an
incoming worker to task x, so that π(x)(S n,∆n) ∈ {0, 1}. Following this development, we
construct K independent Markov chains, one for each task, where each one is controlled
by its respective single-task policy π(x). We define N as before, to be the first time that
the time horizon elapses, or our worker budget has been exhausted and all outstanding
workers have completed their work. We then let Rx = R(αx(S N), βx(S N)) be the reward
obtained from this the single task at this time.
The following theorem shows that the bound in Theorem 4 can be re-written in
terms of the sum of solutions of single-task dynamic programming problems, where
each obtains the reward Rx, and is penalized for assigning workers to its task.
Theorem 5. ∑K ∑U
(x)[ ] ∑U
inf sup Eπ Rx − λ`a≥ `,x + λ` (4.13)λ 0
x=1 π
(x)∈Π(x) `=1 `=1
forms an upper bound on R0.
64
Proof. For any λ ≥ 0:
∑U ∑
supEπ[R − (λ` a`,x)]
π∈Π ∑ `=1 ∑xK K ∑U
= supEπ[ Rx − λ`a`,x]
π∈Π [ x∑=1 ( x=K ∑1 `=1U )]
= supEπ Rx − λ`a`,x
π∈Π ∑ x=1K [ ∑`=1U ]
= sup Eπ Rx − λ`a`,x .
π∈Π x=1 `=1
This is bounded above by,
∑K [ ∑U ]
supEπ Rx − λ`a`,x
∑x=1 π∈Π `=1K ∑U
π(x)
[ ]
= sup E Rx − λ`a`,x . (4.14)
π(x)∈Π(x)x=1 [ `=1 ∑ ]
The equality at (4.14) is because sup ππ∈Π E R − Ux `=1 λ`a`,x depends only on
(αt,x, β (x)t,x,wt,x : 0 ≤ t ≤ T ), which is in turn governed by π . By Theorem 4, for
any λ ≥ 0, ∑K [ ∑U(x) ] ∑U
sup Eπ Rx − λ`a`,x + λ`
(x) (x)
x=1 π ∈Π `=1 `=1
forms an upper bound on R0. This hold for any λ ≥ 0, and so we have thus proved
Theorem 5. 
Since the state space is much smaller for a single-task system, we can use dynamic
programming to solve for the supremum term
(x)[ ∑U ]
sup Eπ Rx − λ`a`,x , (4.15)
π(x)∈Π(x) `=1
for any λ value. What remains in the computing of the upper bound is to solve for the
infimum in Theorem 5. We ex∑plore the convexity[prope∑rty of the ]prob∑lem follow by a(x)
binary search. Define B(λ) = Kx=1 sup
π U U
π(x)∈Π(x) E0 Rx − `=1 λ`a`,x + `=1 λ`, which is
65
the upper bound derived in Theorem 5 without the infimum. First we prove that B(λ) is
convex in λ.
Lemma 12. B(λ) is convex in λ.
∑ [ ](x) ∑
Proof. First note U`=1 λ` is convex in λ. To prove sup
π U
π(x)∈Π(x) E Rx − l=1 λlal,x is
convex in λ, pick any λ1, λ2 ≥ 0 and t ∈ [0, 1] . Let∑U
′ (x)
[ ]
π = sup Eπ Rx − (tλ1,l + (1 − t)λ2,l)al,x (4.16)
π(x)∈Π(x) l=1
We have
∑U
π(x)
[ ]
t sup E Rx − λ1,lal,x
π(x)∈Π(x) [l=1 ∑U(x) ]
+(1 − t) sup Eπ Rx − λ2,lal,x
[ π(x)∈∑Π(x) ] l=1U [ ∑U ]
≥tEπ′ R − λ a + (1 − t)Eπ′x 1,l l,x Rx − λ2,lal,x
[ ∑l=1 ] l=1U′
=Eπ Rx − (tλ1,l + (1 − t)λ2,l)al,x
l=
(x)[1 ∑U ]
= sup Eπ Rx − (tλ1,l + (1 − t)λ2,l)al,x .
π([x)∈Π(x) ∑ l]=1(x)
Hence sup π Uπ(x)∈Π(x) E [Rx − l=1 λlal,x ]is convex in λ for any x ∈ [K], subsequently the
sum of sup Eπ
(x) − ∑R Uπ(x)∈Π(x) x l=1 λlal,x across x is also convex in λ. Thus we complete
the proof that B(λ) is convex in λ. 
With the convexity in λ, we approximate λ′ that achieves the infimum by setting
λ′ = λ′ × (1, , , 1), and use a Fibonacci search to find the infimum. Here we constrain all
the units of λ′ to be the same for simpler computation, a tighter bound can be obtained
by allowing each unit of λ′ to vary and use sub-gradient descent to locate the infimum.
66
4.6 Index Policy
We introduce an index-based heuristic policy built on the Lagrangian relaxation we used
in proving the upper bound. In this policy, we compute some λ∗x for each task x based
on its state S ∗n,x, such that λx is the greatest value of λ that the optimal policy will decide
to hire the worker on state S n,x when solving (4.15) with λ = λ1, 1 = (1, . . . , 1). We
then assign the incoming worker to the task with the highest λ∗. The intuition behind
this policy is that in a single-task problem described by (4.15), we view λ` as a cost of
employing the `th worker. As λ` increases, our decision switches from hiring the worker
to not hiring, where the switching point is at λ∗x. Hence tasks with a high λ
∗
x are the
tasks that are worth hiring more workers to work on. Below we present the algorithm
in a more formal way. A useful technique to reduce the amount of computation is to
Algorithm 4: Index Policy
1: while ` < U do
2: For each x ∈ {1, . . . ,K}, compute λ∗x = inf{λ ∈ R+ : aλ`,x = 1}, where aλx,` is the
optimal decision from (4.15) when λ = λ1.
3: Let x∗ = arg maxx λ∗x. Break tie arbitrarily.
4: Assign task x th∗ to the ` worker.
5: end while
put a cap on the total number of workers that can be assigned to a task, for this reduces
the size of the state space of the dynamic program involved in solving for (4.15). This
additional cap does not affect the decision made by the Index Policy. Intuitively it is
unlikely for any reasonable policy to assign all the U number of workers to one task, so
it is unlike for any task to get more than a certain number of workers assigned. One can
check the validity of the cap by running simulations with the capped index policy, and
67
see whether there are tasks that use all the workers that the cap allows.
We show in section 4.7 that this policy’s performance is close to optimal.
4.7 Numerical Experiment
For numerical experiment we concentrate on the case in which T = ∞. In this case we
stop when we reach the maximum number of workers the budget allows. The Bellman’s
recursion to compute 4.15 in the computation of upper bound for this special case is
given in the supplement. In the first set of simulation study, we compare the perfor-
mances of different policies on simulated data against the corresponding upper bounds.
In the second set of simulation study, we use a real dataset for simulation.
4.7.1 Simulation using simulated data
In the first set of simulations we evaluate the performance of the Index Policy using
simulated data, and compared the total reward given in (4.2) generated by the Index
Policy to the upper bound 4.11. We also compare the performance of the Index Policy
to Optimistic Knowledge Gradient(OKG) method [10], which is a state-of-art Bayesian
allocation policy. A round of simulated process includes generating either an arrival of
worker or a completion of task based on the arrival rate r = 0.1 and and completion
rate µ = 0.4 with distributions specified in Section 4.3. If it is a completion of task, we
generate a label based on the posterior parameters. The process stops when we exhaust
all the budget, i.e., the number of workers that are allowed to hire, and we get a reward
as in (4.2). We vary the number of tasks to be K = 10, 100, 1000, and set the budget to
be U = 1.2K. We use a non-informative prior with α = 1 and β = 1. We use a threshold
68
dx = 0.5 for all the tasks. For each value of K = 10, 100, 1000, we simulate the process
5000 times, and obtain a 95% confidence interval for the simulated total reward. In
Figure 4.1 we show a Semi-log plot of the number of tasks K against the average reward
per task with the corresponding confidence intervals.
0.760
0.755
0.750
Upper bound
0.745 Index Policy
Okg Policy
101 102 103
K
Figure 4.1: Semi-log plot of K against average per task reward (R/K) for K =
10, 102, 103.
We would like to make a note that in this numerical study, we truncate the time
horizon of the single-task DP from 1.2K to 6 to reduce computational complexity. This
truncation preserves the original dynamics as it is very unlikely for a task to be assigned
6 workers when the total number of workers is just 1.2 times the number of tasks. We
use simulation to demonstrate that 6 is still a loose cap numerically. We run a simulation
with simulated data with 1000 replications for number of workers K = 10 and 100, and
U = 1.2K and count the number of tasks that uses 0 worker, 1 worker and up to 6
workers. The results are shown in Figure 4.2. The results show that a task uses at most
69
total reward/K
4 workers, hence set a cap at 6 does not affect the performance of the index policy.
4500 60000
4000
50000
3500
3000 40000
2500
30000
2000
1500 20000
1000
10000
500
0 0
0 1 2 3 4 5 6 0 1 2 3 4 5 6
No. of workers assigned to a task No. of workers assigned to a task
Figure 4.2: Histogram of number of workers assigned to a task
The performance of the index policy is consistently better than the OKG policy,
especially for smaller number of tasks. Moreover, the gap between the upper bound and
the simulated value gets smaller as K increases, which demonstrates both that the upper
bound is tight and the Index Policy performs close to an optimal policy as the number
of tasks gets larger.
We emphasize that the improved performance over OKG offered by our Index Policy
is only one aspect of the contribution of this work. The other aspect is the tightness of the
upper bound, especially for problem instances with many tasks. This tight upper bound
for K=1200 shows that the index policy is within 0.03% of optimal, and that continued
algorithmic development will not provide significantly increased performance for large-
scale crowdsourcing problems with characteristics matching this particular simulated
dataset. The ability to bound the improvement from continued algorithmic development
for a particular problem instance, or class of problem instances, is useful for managers
at companies that use crowdsourcing and wish to allocate engineering\R&D effort.
70
No. of tasks
No. of tasks
4.7.2 Simulation using real data
This set of simulations uses a real dataset, PASCAL RTE-1[43], which consists of 800
tasks, each comes with 10 labels obtained from crowdworkers and a gold standard label.
(A gold standard label of a classification task is its true label.) We evaluate the perfor-
mance of the Index Policy against the OKG policy, the Thompson Sampling [8] and a
widely used frequentist approach - the upper confidence bound(UCB) policy [3]. (The
specific version of UCB used here is UCB1-tuned.) The metric used to evaluate the
performance is the accuracy score. More specifically, it is the number of correctly pre-
dicted labels over the total number of tasks. In each round of simulation process, we still
simulate events (either arrival or completion) the same way as we did in Section 4.7.1.
If the event is a completion, we read the most recent label for that task in the dataset. At
the end of each process, predicted labels are compared against the gold standard labels.
dx is still 0.5 for all tasks. For each value of K = 10, 100, 750, we simulate the process
5000 times, and obtain a 95% confidence interval for the simulated total reward.
Before simulating, we use the remaining 50 tasks from the RTE dataset as the ‘his-
torical data’ to estimate the parameters of the Beta prior, which is set to be the same
across all tasks. This comes with an assumption that all tasks are homogeneous, hence
a subset of them are representative of a larger population. We first obtain an estimate
of θx for each of the 45 tasks, then use these empirical values of θx to fit a Beta distri-
bution by Method of moments. In Figure 4.3 we show a Semi-log plot of the number
of tasks K against the accuracy score with the corresponding confidence intervals. All
the Bayesian policies see an smaller optimality gap when K gets larger. Index Policy
performing consistently the best among all the policies. It is proven in [10] that OKG
policy is consistent: it achieve 100% accuracy almost surely when number of work-
ers goes to infinity. We demonstrate numerically that the Index Policy performs better
71
0.80
0.79
0.78
0.77
0.76
Upper bound
0.75 Index Policy
Okg Policy
0.74 UCB
Thompson
0.73
101 102 103
K
Figure 4.3: Semi-log plot of K against accuracy score for K = 10, 100, 750.
than the OKG policy. It is thus reasonable to anticipate that the Index Policy is not
only consistent, but is asymptotically optimal when both the number of workers and the
number of tasks goes to infinity, while keeping the ratio of the number of workers and
the number of tasks constant.
4.8 Conclusion
We formulated the effort-allocation problem in crowdsourcing in a continuous time set-
ting with budget constraint and time constraint. We also provide a computationally
feasible upper bound on value of the Bayes-optimal policy using Lagrangian relaxation.
Using the Lagrange multiplier used in proving the upper bound, we also derived an
index-based policy and showed in numerical experiments that it performs close to opti-
mal.
72
Accuracy Score
APPENDIX A
A.1 Notation of Chapter 2
SK ,AK ,P·(·|·),R(·, ·) State space, action space, transition kernal and reward function of
the original MDP.
S,A, P·(·, ·), r(·, ·) State space, action space, transition kernal and reward function of
the sub-processes of the original MDP.
s, S Generic element and random element of SK .
s, S Generic element and random element of S.
K Number of sub-processes.
T Time horizon
mt Number of sub-processes to be set active at time step t
Π Set of all Markov policies of the original MDP
π(s, a, t) the probability of choosing action a in state s under policy π at
time t.
Π Set of all Markov policies of the sub-MDP
π(s, a, t) the probability of choosing action a in state s under policy π at
time t.
Π∗(λ) Set of Markov deterministic optimal policies for sub-MDP Q(λ),
given λ ∈ RT .
πλ An element in Πλ.
πλ A deterministic optimal policy for the relaxed problem which ob-
tained by the decomposition method in Lemma 2, given λ ∈ RT .
P(λ) Optimal value of the relaxed problem, given λ ∈ RT .
λ∗ An value that attains infλ P(λ)
73
π∗∗ An optimal markov policy for the sub-MDP which satisfies
π∗∗E [A mt] = K , ∀1 ≤ t ≤ T .
π̂ The index based policy proposed by this paper
β̄t Indices of the tied sub-processes.
It The set of states occupied by the tied sub-processes.
Nt(s) The number of sub-processes in state s at time t under index policy
π̂.
Pt(s) The probability of an individual sub-process landing in state s at
∗∗
time t under π∗∗. Pt(s) = Pπ [S t = s]
α The ratio between the number of sub-processes set active, m, and
the total number of sub-processes K.
Mt(s) The number of sub-processes in state s at time t that are set active
under our index policy π̂.
Yt(s′, s) The number of sub-processes set active by π̂ in s′ at time t which
transition to state s at time t + 1.
X ′t(s , s) The number of sub-processes set inactive by π̂ in s′ at time t which
transition to s at time t + 1.
Ut(s) The set of states whose indices are greater than the index of state
s at time t. Ut(s) = {s′′ ∈ S : βt(s′′) > βt(s)}.
Vt(s) The set of states whose indices are equal to that of s.
V (s) = {s′′t ∈ S : βt(s′′) = βt(s)}.
|v| An operation that sums all the elements in vector v.
H1, H2 random variables due to the rounding rules in Algorithm 2
Z(π,m,K) the expected reward of the original MDP obtained by policy π
Table A.1: List of notation
74
A.2 Upper Bound
Proof. Proof of Lemma 1 Let ΠP = {π ∈ Π : Pπ(|At| = mt) = 1, ∀1 ≤ t ≤ T }. Let
ΠE = {π ∈ Π : Eπ[|At|] = m Tt, ∀1 ≤ t ≤ T }. For any λ ∈ R , we have
P(λ)
= maxEπ 
∑T ( ) ∑ Rt St,A  − Eπ t   λt(|At| − m∈ t)

π Π
∑
t=1 t
T
≥ ( )maxEπ R S ,A  − Eπ ∑ λ (|A | − m )t t t t t t 
π∈ΠE
t=1 t
= maxEπ
∈ 
∑T ( )Rt St,At 
π ΠE ∑t=1T 
≥  ( )maxEπ  Rt St,A ∈ t  ,π ΠP
t=1
which is the optimal value of the original MDP. The first inequality is due to ΠE ⊆ Π.
The first equality is due to the fact that any policy π in ΠE satisfies Eπ[At|] = mt. The
last inequality is due to ΠP ⊆ ΠE. 
75
A.3 Decomposition
Proof. Proof of Lemma 2:
π ∑T ( ) ∑T maxE R S ,A − Eπ ( t t t  λt |A | −
)
m 
∈ t t π Π ∑t=1 ( ) t=1T  ∑T
= maxEπ  R S ,A − λ |A | + m
∈  t t t t t   tλtπ Π t=1 t=1
= maxEπ ∑T ∑K − r (S , A ) λ A  ∑T+ m
∈   t t,x t,x t t,x tλt∑π Π t=1∑x=1 t=1K T
= maxEπ  rt(S t, At) − λtAt ∑T+ mtλt
π∈Π
x=1 t=1 t=1

The first equality is due to linearity of expectation. The second equality is obtained
by the definition of rt(·, ·) and | · |. The third equality is obtained by the independence of
the process under policies in Π.
A.4 Show arg infλ∈RT P(λ) is non-empty
∑ ∑
Proof. When λ ≥ 0, P(λ) = x Rx(λ) + m t λt ≥ 0 + 0 = 0. Rx(λ) is bounded below
by 0 since a policy of not playing at all gives a total reward of 0. When λ < 0, the
cost of play∑ing is negative, ∑an optimal policy will always play at all time steps. Hence
P(λ) ≥ m ( t(0 − λt)) + m t λt = 0. For the case in which λ contains both positive
and negative entries, writing λ as a convex combination of λ1 > 0 and λ2 < 0 and we
have that P(λ) is still bounded below by zero, since P(λ) is convex in λ. Hence we
76
can conclude that infλ∈RT P(λ) exists (note here we make no claim about whether this
infimum is attained by any finite λ) and denote this value by h∗.
Recall we have assumed in the setup that all the rewards are bounded and non-
negative, let r̄ be an upper bound for all the reward values. For any λ with λt ≥ T r̄, the
corresponding optimal policies for a single-arm problem will be not play at time t, for T r̄
is at least the maximum reward∑obtainable by the single-a∑rm problem.∑Hence P(λ) ≥ 0+
mTr̄. For any λ ≥ 0, P = mE[ t rt,x(S t,x, 1) − λt|s1,x] + m t λt = mE[ t rt,x(S t,x, 1)|s1,x],
which is independent of λ. Hence the infimum is attained on the set H = {λ : λt ≥
∀t and maxt λt ≤ tr̄}. Since H is compact, there exists a λ∗ ∈ H s.t. P(λ∗) = h∗. Hence
arg infλ∈RT P(λ) is non-empty. 
A.5 Proof of the existence of π∗∗
Let L(π,λ) be the Lagrangian of the sub-problem
∑T
maximize Eπ[ rt(S t, At)]
π
t=1
m
subject to Eπ[At]
t
= , ∀1 ≤ t ≤ T.
K
Then
∑T ∑T m ∑T m
max L(π,λ) = maxEπ[ r (S , A )] − λ (Eπt t t t [At] − t ) = Q( ) tλ + λt .
π π K K
t=1 t=1 t=1
77
Let π∗∗ be a policy that attains Q(λ∗). Hence
∑T
Q(λ∗
m
) t+ λ∗ =L(π∗∗, λ∗t )K
t=1
= inf max L(π,λ)
λ π
= max inf L(π,λ)
π λ ∑T ∑T
= max inf Eπ
m
[ r π tt(S t, At)] − λt(E [At] − ).
π λ K
t=1 t=1
We are allowed to switch inf and max in the third equality because L(π,λ) satisfies
Slater’s condition, which is implied by the existence of a feasible po∑licy of Q(λ). The
fourth equality is just a re-write of what L(π,λ) is. Since Q(λ∗) + T λ∗mtt=1 t K is finite,
maxπ infλ L(π,λ) is finite. Any policy π that attains maxπ infλ L(π,λ) has to satisfy
that Eπ[At] − mtK is zero for all t, otherwise infλ L(π,λ) will be negative infinity. Hence
∗∗
Eπ [At] = mtK .
A.6 Proof of T∗maxs,a,t rt(s, a) upper bounds βt(s)
It is sufficient to show that for any λ ≥ 0 with λ λt > T ∗maxs,a,t rt(s, a), V (s, t) is attained
by choosing a = 0. When t = T , rT (s, 1) − λT < rT (s, 1) − T ∗ maxs,a,t rt(s, a) ≤
0. On the other hand rT (s, 0) ≥ 0 as all rewards are non-negative by the set-
ting of our or∑iginal MDP. Hence it is optimal to choose a = 0. When t < T ,
rt(s, 1) − λ a ′t + s′∈S P (s, s )Vλ(s′, t + 1) < maxs,a,t rt(s, a) − T ∗ maxs,a,t rt(s, a) + (T −
t) ∗ maxs,a,t rt(s, a) ≤ 0 ≤ rt(s, 0). Hence it is also optimal to choose a = 0. Therefore
βt(s) ≤ T ∗maxs,a,t rt(s, a) for all s, t.
78
A.7 A result that justifies using bisection
Lemma 13. If there exists an optimal policy π that takes action a = 1 in state s ∈ S at
time t for a sub-MDP (S,A, r, P·), and satisfies Pπ[S t = s] > 0, then a = 1 is strictly
optimal in state s at time t under a modified sub-MDP (S,A, r′, P·) with r′t (s, 1) > rt(s, 1),
and r′t equals rt otherwise.
Proof. Proof of Lemma 13 We prove by contradiction. Let V(π, r) denote the total
expected reward obtained by policy π with reward function r. Assume that there exists
an optimal policy π′ for sub-MDP (S,A, r′(·, ·), P·(·, ·)) such that π′(s, 0, t) = 1. Since
neither r (s, 1) nor r′t t (s, 1) contributes to the total expected reward, V(π
′, r) = V(π′, r′).
Let π be an optimal policy for (S,A, r(·, ·), P·(·, ·)). Then we have V(π, r) ≥ V(π′, r).
On the other hand, V(π, r′) is greater than V(π, r) by (r′t (s, 1) − r πt(s, 1))P [S t = s] > 0.
Hence we get that V(π, r′) > V(π, r) ≥ V(π′, r) = V(π′, r′) contradicting that π′ is an
optimal policy of sub-MDP (S,A, r′(·, ·), P·(·, ·)).  
A.8 Proof of Lemma 3
Proof. Proof of Lemma 3 To prove (1), when β ∗t(s) > λt , by definition of the index in
(2.6), there exists an  > 0 such that there is a π ∈ Π∗(λ∗[λ∗t + , t]) and π(s, 1, t) =
∗ λ∗[λ∗1. Recall how we construct set Π (λ) in Section 2.3.1, the value function V t +,t]
corresponding to sub-MDP Q(λ∗[λ∗t + , t]) has to satisfy∑
r (s, 1) − λ∗ − ∗ + Vλ [λ∗t +,t](s′, t + 1)P1t t (s, s′)∑ s′∈S
≥ λ∗r (s, 0) + V [λ∗t +,t] ′t (s , t + 1)P0(s, s′).
s′∈S
79
Since λ∗[λ∗t + , t] and λ
∗ share the same elements from the (t + 1)th position onwards,
Vλ
∗[λ∗ ∗t +,t](s, t′) = Vλ (s, t′), for all s ∈ S and t′ ≥ t + 1. Hence
∑ ∑
∗
r (s, 1) − λ∗ + Vλ (s′, t + 1)P1 ∗t t (s, s′) > rt(s, 0) + Vλ (s′, t + 1)P0(s, s′). (A.1)
s′∈S s′∈S
Next we consider two separate cases: 1) State s is visited with positive probability under
∗∗ ∗∗
π∗∗, that is, Pπ (S t = s) > 0; 2) State s is visited with zero probability, i.e., Pπ (S t =
∗∗
s) = 0. If 1) Pπ (S t = s) > 0, since π∗∗ is an optimal policy∑for the unconstrained sub-
MDP Q(λ∗
∗
) in (2.3), and a = 1 attains max{r ∗t(s, a) − aλt + s′∈S Pa(s, s′)Vλ (s′, t + 1)}
∗∗
alone, hence π∗∗(s, 1, t) = 1. If 2) Pπ (S t = s) = 0, we get π∗∗(s, 1, t) = 1 directly from
the construction of π∗∗ in (2.8).
Statement (2) can be proven using a similar argument. We therefore skip the proof
to avoid redundancy.  
A.9 Proof of Lemma 4
Proof. Proof of Lemma 4 To prove (1), suppose, for the sake of contradiction, that
π∗∗(s, 1, t) < 1. By Lemma 3, we have βt(s) ≤ λ∗t . Therefore Ut(s) ∪ Vt(s) forms a
superset to the set of states with indices of at least λ∗t . We also know that π
∗∗ takes active
action with probability α at time t. Hence we can write α as the sum of the probabilities
of taking the active action in all states s′ with π∗∗(s′, 1, t) = 1 and the probabilities of
taking the active action in all states s′ with 0 < π∗∗(s′, 1, t) < 1:
∑ ∑
α = P (s′) ∗ 1 + P (s′) ∗ π∗∗(s′t t , 1, t)
s′∈{s′′:π∑∗∗(s′′,1,t)=1} s′∈{s∑′′:0<π∗∗(s′′,1,t)<1}
< P (s′t ) + P ′t(s ). (A.2)
s′∈{s′′:π∗∗(s′′,1,t)=1} s′∈{s′′:0<π∗∗(s′′,1,t)<1}
80
Taking the contrapositives of both statements in Lemma 3, we get if 0 < π∗∗(s′, 1, t) < 1
then βt(s′) = λ∗t . Hence ∑ ∑
(A.2) = P (s′t ) ≤ Pt(s′) ≤ α
s′∈{s′′:β (s′′)≥1} s′t ∈Ut(s)∪Vt(s)
We get α < α, which is a contradiction, as desired.
To prove (2), we again use contradiction. Assume π∗∗(s, 1, t) > 0; by the contraposi-
tive of the second statement of Lemma 3 we know βt(s) ≥ λ∗t . Then Ut+1(s) is a subset of
{s′ : β (s′t ) > λ∗t }, wh∑ich in turn is a subset of {s′ :∑π∗∗(s′, 1, t) = 1} by Lemma 3. Hence
by the fact that α = ′ ′ ∗∗ ′s′∈{s′′:π∗∗(s′′,1,t)=1} Pt(s ) ∗ 1 + s′∈{s′′:0<π∗∗(s′′,1,t)<1} Pt(s ) ∗ π (s , 1, t),
we must have either 1)
∑ ∑
α > P ′t(s ) ≥ Pt(s′)
s′∈{s′′:π∗∗(s′′,1,t)=1} s′∈Ut(s)
when there exists some s′ ∈ {s′′ : 0 < π∗∗(s′′, 1, t) < 1} such that Pt(s′) > 0, or 2)∑ ∑
α ≥ Pt(s′) > P (s′t )
s′∈{s′′:π∗∗(s′′,1,t)=1} s′∈Ut(s)
∑otherwise, as we must have that π∗∗(s, 1, t) = 1. In either case we get α >′
s′∈U (s)=1 Pt(s ), which again forms a contradiction.  t
A.10 Lemma 14
Lemma 14. Let X(k) be a sequence of non-negative random variables such that
(k)
lim 1 X(k)k→∞ k = γ, a.s.. If Y
(k)|X(k) ∼ Bin(X(k), p), then lim Yk→∞ k = γp, a.s..
81
Proof. Proof of Lemma 14 We consider two cases: 1) X(k) → ∞; 2) X(K) is bounded.
When 1) X(k) → ∞, we have
Y (k)
lim
k→∞ k
Y (k) X(k)
= lim
k→∞ X(k) k
Y (k) X(k)
= lim (k) limk→∞ X k→∞ k
=p ∗ γ
If 2) X(K) is bounded, then γ = 0. Y (k) is also bounded, so limk→∞ = 0.  
A.11 Proof of Lemma 5
Proof. Proof of Lemma 5 Before attempting the proof, we make some observations on
the properties of function Rounding-c. These observations will also be useful for proofs
that come later in Appendix A.12. Based on how Rounding-c works in Algorithm 3,
∑Figure A.11 plots the number of sub-processes allocated to the ith state b∑i against total−
j,i min{total ∗ frac j, avail j}. bi is a piecewise linear function of total − j,i min{total ∗
frac j, avail j} with the changes of gradients occur when the value of some avail j drops to
total∗ frac j. The function line intersect with y-axis at min{total∗ fraci, availi} when there
is no left-over after assigning each state min{total ∗ frac j, avail j}. The line ends when it
reaches min{total, availi} as that is the largest possible assignment to state i.
To give an explicit form, we first sort for j , i and j ∈ {1, ..., n} by c j = avail j −
total ∗ frac j. Let c′ , ..., c′1 ` be the sorted values of c js from the smallest to the largest with
no repetition, and let n j denote the number of states with c′j. Let x denote the value
82
Figure A.1: Plot of Rounding-c
∑
represented on the x-axis total − j,i min{total ∗ frac j, avail j}, we can write bi asmin{total ∗ fraci, availi} +
x
n n 1 for 0 ≤ x ≤ n c′ 1+...+ +
1
` 1

′ ′
min{total ∗ fraci, avail }
c1 x−n1c1
i + n1 n + n for n c
′ ≤ x ≤ n c′
 +...+ ` 2+...+n +1 1 1 2` 2bi =  (A.3)
.
..
min{total ∗ frac avail } c′1 x−n ′`ci, i + n1 + ... + ` ′+...+n` n`+1 for n`c` ≤ x ≤ total
Let bi be the ith component of the output by Rounding(total, frac, avail) and bci be
the ith component of the output by Rounding-c(total, frac, avail). We first show that
|bi − bci| ≤ 2 by comparing the algorithms of Rounding and Rounding-c. If min{total ∗
fraci, availi} is attained by availi, then bi = bci.∑If availi is larger than btotal ∗ fracic
(hence also larger than total ∗ fraci), and total = j min{btotal ∗ frac jc, avail j} (meaning
no residue left to distribute in Rounding), then the difference between bi and bci is no
83
∑
more than 1. When total > j min{btotal ∗ frac j, avail jc} and there is residue left to be
distributed, the amount distributed to the ith state in Rounding may be smaller or larger
than the that in Rounding-c, depending on the position of i in an array of 1 to n. This
together with the possible difference between btotal ∗ frac jc and total ∗ frac j gives a total
difference of at most 2.
Subsequently we have |bs(Nt, bαtKc) − b̄s(Nt, bαtKc)| ≤ 2, and | fs(Nt, bαtKc) −
f̄ Nt bαtKc)s(Nt, bαtKc)| = | fs(Nt, bαtKc) − K ∗ f̄s( K , K |. 
A.12 Proof of Lemma 6
Proof. Proof of Lemma 6 First note that b̄s(Nt, bαtKc) is continuous as illustrated in the
proof of Lemma 5 in Appendix A.11. Hence f̄s(Nt, bαtKc) is continuous. This is because
when bαtKc − N+t (s) , N=t (s), f̄s(Nt, bαtKc) is continuous as Nt(s) and b̄s(Nt, bαtKc) are
continuous. When bα + =tKc − Nt (s) = Nt (s), we have Nt(s) = b̄s(Nt, bαtKc). By the
continuous mapping theorem, and the fact that lim Nt(s)K→∞ K = Pt(s) almost surely and
lim bαtKcK→∞ K = αt, we have
Nt bαlim f̄ ( tKc N bα Kcs , ) = f̄s( lim t t, lim ) = f̄s(P→∞ K K →∞ K →∞ K t, αt), a.s.K K K

84
A.13 Proof of Lemma 7
Proof. Proof of Lemma 7 We first prove that f̄s(Nt, bαtKc) equals to
 ∑min{Pt(s), [α − ′ P (s
′)]+ ∑ ρ(s,1,t) },

s ∈Ut(s) t s′∈ ′Vt (s) ρ(s ,1,t)
gs(Pt) = 
∑
if s′∈V ρ(s′, 1, t) > 0t(s)
 ∑
(A.4)
min{P (s), [α − P (s′)]+ ∑ Pt(s) ∑t s′∈U (s) t ′ P (s′) }, t s ∈Vt (s) t if s′∈V (s) ρ(s′, 1, t) = 0,t
then we show gs(Pt) = Pt(s)π∗∗(s, 1, t). To prove the former, We divide our discussion
into two main cases.
• When ∑ ∑
[α − P ′ + ′t(s )] − Pt(s ) ≥ 0, (A.5)∑ s′∈Ut(s) s′∈Vt(s)
if α − s′∈U (s) Pt(s′) ≤ 0, then gs(Pt) = 0 in both cases. Moreover, (A.5) impliest
Pt(s) =∑0 by Lemma 4, hence fs(Pt, α) = 0.
If α − ′s′∈U (s) Pt(s ) > 0, (A.5) implies that Pt(s) = 1 by Lemma 4.t ∑ ∑ ∑
α − Pt(s′) ≥ P (s′t ) = ρ(s′, 1, t) ≥ Pt(s).
s′∈Ut(s) s′∈Vt(s) s′∈Vt(s)
Hence Pt(s) attains the minimum in both cases of gs(Pt), and fs(Pt, α) = Pt(s) =
gs(Pt).
• When ∑ ∑
[α − P (s′)]+t − P (s′t ) < 0, (A.6)∑ s′∈Ut(s) s′∈Vt(s)
if α − ′∑s′∈U (s) Pt(s ) ≤ 0, then both fs(Pt, α) and gs(Pt) are zero.t
If α − s′∑∈U (s) Pt(s′) > 0, fs(Pt, α) is eithert
min{(α − s′∈U (s) Pt(s′))∑ ρ(s,1,t)t s′∈Vt (s)ρ(s′ , Pt(s)} or,1,t)
85
∑
min{(α − P (s′))∑ Pt(s)s′∈U (s) t ′ ′ , Pt(s)} ∑depending on whether s′∈V (s) ρ(s′, 1, t)t s ∈Vt (s)Pt (s ) t
is greater than or equal to zero, which matches exactly the two cases in gs(Pt).
∑We first consider the case when
ρ(s′s′∈V (s) , 1, t) > 0. This can be further divided into two sub-cases,t
• Case 1:∑When gs(Pt) = Pt(s),
if α − P (s′) ≤ 0, by Lemma 4, π∗∗s′∈U (s) t (s, 1, t) = 0. Since Pt(s) attains thet
minimu∑m in this case, Pt(s) = 0. Hence gs(Pt) = 0 = Pt(s)π
∗∗(s, 1, t).
If α − s′∈U (s) Pt(s′) > 0, this leads to two cases by Lemma 4: π∗∗(s, 1, t) = 1t
or 0 < π∗∗(s, 1, t) < 1. If∑it is the latter, we have the second term in gs(Pt) be-
comes ρ(s, 1, t), since α− ′s′∈U (s) Pt(s ) cancels with the denominator. ρ(s, 1, t) =t
Pt(s)π∗∗(s, 1, t) < Pt(s), which contradicts that Pt(s) attains the minimum. Hence
π∗∗(s, 1, t) = 1. We have gs(P∑t) = Pt(s) = Pt(s) ∗ π
∗∗(s, 1, t).
• Case 2:∑When gs(P ) ρ(s,1,t)t = [α − ′ +s′∈Ut(s) Pt(s )] ∑ ′ ,s ∈Vt (s)ρ(s′ ,1,t)
if α − ′s′∈U (s) Pt(s ) ≤ 0, again by Lemma 4, π∗∗(s, 1, t) = 0, gt s(Pt) = 0 =
Pt(s)π∗∗∑(s, 1, t).
If α − ′ ∗∗s′∈U (s) Pt(s ) > 0, again we have two ∑cases: π (s, 1, t)∑= 1 or 0 <t
π∗∗(s, 1, t) < 1. If it is the latter, we have α − ′s′∈U (s) Pt(s ) = P (s′).t s′∈Vt(s) t
If it is ∑the former, by assumption
∑[α − P ′ Pt(s)
∑ ′
s′∈Ut(s) t(s )]∑ ′ ∑′ attains the minimum, α − ′s ∈V (s)P (s ) s ∈∑Ut(s) Pt(s ) ≤t t
s′∈V (s) Pt(s′). Hence α − ′ ′t s′∈Ut(s) Pt(s ) can only be equal to s′∈V (s) P (s ).t t
Subsequently for both cases π∗∗(s, 1, t) = 1 and 0 < π∗∗(s, 1, t) < 1, we have
g (P ) = ρ(s, 1, t) = P (s)π∗∗s t t (s, 1, t).
∑
Now we look at the case when ′ ′s′∈Vt(s) ρ(s , 1, t) = 0. We have either 1) Pt(s ) = 0
for all s′ ∈ V (s) 2) π∗∗t (s′, 1, t) = 0 for all s′ ∈ Vt(s).
86
∑
When π∗∗(s′, 1, t) = 0 for all s′ ∈ V ′t(s), α − s′∈U (s) Pt(s ) ≤ 0 hence gs(Pt) = 0 =t
Pt(s)π∗∗(s, 1, t). 
A.14 Bellman’s recursion for T = ∞
Below is the Bellmans’s recursion on value functions for a single-task system when
T = ∞:
if l = U and wx = 0:
Vx(αx, βx,wx, l) = R(αx, βx); (A.7a)
if l = U and wx > 0:
Vx(αx, βx,wx, l)
αx
= Vx(αx + 1, βx,wx − 1, l)+
αx + βx
βx Vx(αx, βx + 1,wx − 1, l); (A.7b)
αx + βx
If l < U:
Vx(αx, βx,wx, l)
r
= ma[x {Vx(αx, βx,wx + al,x, l + 1) − λq l+1al,x}x al,x∈{0,1}µwx αx
+ Vx(αq x
+ 1, βx,wx − 1, l)
x αx + βx
β ]x
+ Vx(αx, βx + 1,wx − 1, l) .
αx + βx
87
BIBLIOGRAPHY
[1] D Adelman and A Mersereau. Relaxations of Weakly Coupled Stochastic Dynamic
Programs. Operations Research, 2008.
[2] S Andradóttir and S H Kim. Fully Sequential Procedures for Comparing Con-
strained Systems via Simulation. Naval Research Logistics, 57(5):403–421, 2010.
[3] P Auer, N Cesa-Bianchi, and P Fischer. Finite-time Analysis of the Multiarmed
Bandit Problem. Machine Learning, 2002.
[4] Y Bacharach, T Grapple, T Minka, and J Guiver. How To Grade A Test Without
Knowing the Answers - A Bayesian Graphical Model for Adaptive Crowdsourcing
And Aptitude Testing. ICML, 2013.
[5] D Batur and S H Kim. Finding Feasible Systems in the Presence of Constraints on
Multiple Performance Measures. ACM Transactions on Modeling and Computer
Simulation, 20(3):13:1–13:26, 2010.
[6] D A Berry and B Fristedt. Bandit problems: sequential allocation of experiments
(Monographs on statistics and applied probability). Springer, 1985.
[7] E Bofinger and G J Lewis. Two-Stage Procedures for Multiple Comparisons
with a Control. American Journal of Mathematical and Management Sciences,
12(4):253–275, 1992.
[8] O Chapelle and L Li. An Empirical Evaluation of Thompson Sampling. NIPS,
2002.
[9] C Chen, D He, M Fu, and L Lee. Efficient Simulation Budget Allocation for
Selecting an Optimal Subset. Informs Journal on Computing, 2008.
[10] X Chen, Q Lin, and D Zhou. Statistical Decision Making for Optimal Budget
Allocation in Crowd Labeling. arXiv, 2011.
[11] Xi Chen, Qihang Lin, and D Y Zhou. Optimistic Knowledge Gradient Policy for
Optimal Budget Allocation in Crowdsourcing. ICML, 2013.
[12] Xi Chen, Qihang Lin, and Dengyong Zhou. Optimistic Knowledge Gradient Pol-
icy for Optimal Budget Allocation in Crowdsourcing. In Proceedings of the 30th
International Conference on Machine Learning, pages 64–72, 2013.
88
[13] H Damerdji and M K Nakayama. Two-Stage Procedures for Multiple Comparisons
with a Control in Steady-State Simulations. In Proceedings of the 1996 Winter
Simulation Conference, pages 372–375, Piscataway, New Jersey, 1996. Institute of
Electrical and Electronics Engineers, Inc.
[14] M H DeGroot. Optimal Statistical Decisions. McGraw Hill, New York, 1970.
[15] E J Dudewicz and S R Dalal. Multiple Comparisons with a Control When Vari-
ances Are Unknown and Unequal. American Journal of Mathematics and Man-
agement Sciences, 3(4):275–295, 1983.
[16] C W Dunnett. A Multiple Comparison Procedure for Comparing Several
Treatments with a Control. Journal of the American Statistical Association,
50(272):1096–1121, 1955.
[17] E B Dynkin and A A Yushkevich. Controlled Markov Processes. Springer, New
York, 1979.
[18] E.B Dynkin and Yushkevish A.A. Controlled Markov Processes. Springer, 1979.
[19] P I Frazier. Learning with dynamic programming. In Wiley Encyclopedia of Op-
erations Research and Management Science, pages 1–13, Hoboken, New Jersey,
2011. Wiley.
[20] A Ghosh, S Kale, and P McAfee. Who moderates the moderators? Crowdsourcing
Abuse detection in user-generated content. ACM, 2011.
[21] C Gittins, J. Bandit Processes and Dynamic Allocation Indices. Journal of the
Royal Statistical Society. Series B, 1979.
[22] J C Gittins. Multi-Armed Bandit Allocation Indices. John Wiley and Sons, New
York, 1989.
[23] John Gittins, Kevin Glazebrook, and Richard Weber. Multi-armed Bandit Alloca-
tion Indices. Wiley, Hoboken, New Jersey, 2nd edition, 2011.
[24] Y Gocgun and A Ghatey. Lagrangian Relaxation and Constraint Generation For
Allocation And Advanced Scheduling. Computer and Operations Research, 2011.
[25] J Hawkins. A Lagrangian decomposition approach to weakly coupled dynamic
optimization problems and its applications. Ph.D Thesis, Center of Operations
Research, MIT, 2002.
89
[26] Christopher Healey, Sigrún Andradóttir, and Seong-Hee Kim. Selection proce-
dures for simulations with multiple constraints under independent and correlated
sampling. ACM Transactions on Modeling and Computer Simulation, 24(3):14:1–
14:25, 2014.
[27] CJ Ho, S Jabbari, and J W Vaughan. Adaptive Task Assignment For Crowdsourced
classification. ICML, 2013.
[28] D Karger, S Oh, and D Shah. Budget-Optimal Task Allocation for Reliable Crowd-
sourcing Systems. Operations Research, 2013.
[29] D R Karger, S Oh, and D Shah. Iterative Learning for Reliable Crowdsourcing
System. NIPS, 2011.
[30] D R Karger, S Oh, and D Shah. Efficient crowdsourcing for multi-class labeling.
ACM, 2013.
[31] S H Kim. Comparison with a Standard via Fully Sequential Procedures. ACM
Transactions on Modeling and Computer Simulation, 15(2):155–174, 2005.
[32] S H Kim and B L Nelson. Recent Advances in Ranking and Selection. In Pro-
ceedings of the 2007 Winter Simulation Conference, pages 162–172, Piscataway,
New Jersey, 2007. Institute of Electrical and Electronics Engineers, Inc.
[33] L. W. Koenig and A. M. Law. A procedure for selecting a sub- set of size m con-
taining the l best of k independent normal populations. Comm. Statist.Simulation
Comm. 14 719734, 1985.
[34] W S Lovejoy. A survey of algorithmic methods for partially observed Markov
decision processes. Annals of Operations Research, 28(1):47–65, 1991.
[35] G E Monahan. A survey of partially observable Markov decision processes: The-
ory, models, and algorithms. Management Science, 28(1):1–16, 1982.
[36] B L Nelson and D Goldsman. Comparisons with a Standard in Simulation Exper-
iments. Management Science, 47(3):449–463, 2001.
[37] E Paulson. On the Comparison of Several Experimental Categories with a Control.
The Annals of Mathematical Statistics, 23(2):239–246, 1952.
[38] W B Powell. Approximate Dynamic Programming: Solving the curses of dimen-
sionality. John Wiley and Sons, New York, 2007.
90
[39] Warren B Powell. Approximate Dynamic Programming: Solving the Curses of
Dimensionality. Wiley, 2011.
[40] M.L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Pro-
gramming. Wiley, 2008.
[41] H Robbins. A note on gambling systems and birth statistics. American Mathemat-
ical Monthly, 1952.
[42] A Ruszczynski. Nonlinear Optimization. Princeton University Press, 206.
[43] R Snow, B OConnor, D Jurafsky, and A Ng. Cheap and Fast But is it Good?
Evaluating Non-Expert Annotations for Natural Language Tasks. EMNLP, 2008.
[44] R Szechtman and E Yücesan. A New Perspective on Feasibility Determination.
In Proceedings of the 2008 Winter Simulation Conference, pages 273–280, Piscat-
away, New Jersey, 2008. Institute of Electrical and Electronics Engineers, Inc.
[45] T L Thanh, M Venanzi, A Rogers, and N R Jennings. Efficient budget allocation
with accuracy guarantees for crowdsourcing classification tasks. ACM, 2013.
[46] R Weber and G Weiss. On An Index Policy For Restless Bandits. Journal of
Applied Probability, 1990.
[47] R Weber, R and G Weiss. Addendum to ’On An Index Policy For Restless Bandit’.
Advances in Applied Probability, 1991.
[48] P. Whittle. Restless Bandits: Activity Allocation in a Changing World. Journal of
Applied Probability, 1988.
[49] J Xie and P I Frazier. Sequential Bayes-Optimal Policies for Multiple Comparisons
with a Known Standard. Operations Research, 61(3):1174–1189, 2013.
[50] J. Xie and P.I. Frazier. Upper Bounds on the Bayes-Optimal Procedure for Ranking
& Selection with Independent Normal Priors. Proceedings of the 2013 Winter
Simulation Conference, 2013.
[51] Y Yan, R Rosales, G Fung, and J Dy. Active Learning From crowds. ICML, 2011.
[52] F Ye, H Zhu, and E Zhou. Weakly Coupled Dynamic Program: Information and
Lagrangian Relaxations. arXiv preprint arXiv:1405.3363, 2014.
91