Multi-Armed Bandits in Large-Scale Complex Systems
This dissertation focuses on the multi-armed bandit problem (MAB) where the objective is a sequential arm selection policy that maximizes the total reward over time. In canonical formulations of MAB, the following assumptions are adopted: the size of the action space is much smaller than the length of the time horizon, computation resources such as memory are unlimited in the learning process, and the generative models of arm rewards are time-invariant. This dissertation aims to relax these assumptions, which are unrealistic in emerging applications involving large-scale complex systems, and develop corresponding techniques to address the resulting new issues. The first part of the dissertation aims to address the issue of a massive number of actions. A stochastic bandit problem with side information on arm similarity and dissimilarity is studied. The main results include a unit interval graph (UIG) representation of the action space that succinctly models the side information and a two-step learning structure that fully exploits the topological structure of the UIG to achieve an optimal scaling of the learning cost with the size of the action space. Specifically, in the UIG representation, each node represents an arm and the presence (absence) of an edge between two nodes indicates similarity (dissimilarity) between their mean rewards. Based on whether the UIG is fully revealed by the side information, two settings with complete and partial side information are considered. For each setting, a two-step learning policy consisting of an offline reduction of the action space and online aggregation of reward observations from similar arms is developed. The computation efficiency and the order optimality of the proposed strategies in terms of the size of the action space and the time length are established. Numerical experiments on both synthetic and real-world datasets are conducted to verify the performance of the proposed policies in practice. In the second part of the dissertation, the issue of limited memory during the learning process is studied in the adversarial bandit setting. Specifically, a learning policy can only store the statistics of a subset of arms summarizing their reward history. A general hierarchical learning structure that trades off the regret order with memory complexity is developed based on multi-level partitions of the arm set into groups and the time horizon into epochs. The proposed learning policy requires only a sublinear order of memory space in terms of the number of arms. Its sublinear regret orders with respect to the time horizon are established for both weak regret and shifting regret in expectation and/or with high probability, when appropriate learning strategies are adopted as subroutines at all levels. By properly choosing the number of levels in the adopted hierarchy, the policy adapts to different sizes of the available memory space. A memory-dependent regret bound is established to characterize the tradeoff between memory complexity and the regret performance of the policy. Numerical examples are provided to verify the performance of the policy. The third part of the dissertation focuses on the issue of time-varying rewards within the contextual bandit framework, which finds applications in various online recommendation systems. The main results include two reward models characterizing the fact that the preferences of users toward different items change asynchronously and distinctly, and a learning algorithm that adapts to the dynamic environment. In particular, the two models assume disjoint and hybrid rewards. In the disjoint setting, the mean reward of playing an arm is determined by an arm-specific preference vector, which is piecewise-stationary with asynchronous change times across arms. In the hybrid setting, the mean reward of an arm also depends on a joint coefficient vector shared by all arms representing the time-invariant component of user interests, in addition to the arm-specific one that is time-varying. Two algorithms based on change detection and restarts are developed in the two settings respectively, of which the performance is verified through simulations on both synthetic and real-world data. Theoretical regret analysis of the algorithm with certain modifications is provided under the disjoint reward model, which shows that a near-optimal regret order in the time length is achieved.
Large-Scale Complex Systems; Multi-Armed Bandits; No-regret Learning
Acharya, Jayadev; Chen, Yudong
Electrical and Computer Engineering
Ph. D., Electrical and Computer Engineering
Doctor of Philosophy
dissertation or thesis