Show simple item record

dc.contributor.author Vakili, Sattar
dc.identifier.other bibid: 10361449
dc.description.abstract This dissertation focuses on sequential learning and inference under unknown models. In this class of problems, observations are available one at a time, and optimal actions cannot be determined \emph{a priori}. Solutions to such problems require optimization over time: to learn optimally from the past and then act accordingly with foresighted planning. The first part of the dissertation is on sequential learning within the framework of multi-armed bandit (MAB) theory. Originating in the 1930s, MAB problems have traditionally been studied under the following modeling assumptions: the reward distributions have bounded support, the learning is centralized with a single player, the reward distributions are time-invariant, and the performance measure mainly targets maximizing the expected return of a sequential learning policy. This dissertation aims to relax these modeling assumptions and develop parallel results applicable to a wider range of applications, in particular emerging applications such as online ad placement and web search, as well as social and economic networks. The main results in the first part include a new policy based on a deterministic sequencing of exploration and exploitation for the classic MAB. The structure of the proposed policy allows easy extensions to variations of MAB, including decentralized MAB with multiple players and incomplete reward observations under collisions. The proposed policy achieves the optimal logarithmic \emph{regret} order under general reward distribution models such as sub-Gaussian and heavy-tailed distributions. Time variation in the reward process is also addressed, under both an arbitrary time-variation model and a piecewise-stationary model. A lower bound on the regret order is obtained, and the order optimality of a sequential learning policy is shown.
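The paragraph above describes a policy built on a deterministic sequencing of exploration and exploitation. The following is a minimal illustrative sketch of that idea, not the dissertation's exact policy: exploration slots are interleaved on a logarithmic schedule and all remaining slots exploit the empirically best arm. The tuning constant `w`, the schedule `k * ceil(w * log(t+1))`, and the Bernoulli arms are assumptions made for illustration.

```python
import math
import random

def dsee(arms, horizon, w=1.0):
    """Sketch of a deterministic-sequencing policy: explore whenever the
    number of exploration slots used falls behind a logarithmic schedule,
    otherwise exploit the arm with the highest empirical mean."""
    k = len(arms)
    counts = [0] * k          # pulls per arm
    sums = [0.0] * k          # cumulative reward per arm
    total = 0.0
    next_arm = 0              # round-robin pointer for exploration slots
    n_explore = 0             # exploration slots used so far
    for t in range(1, horizon + 1):
        if n_explore < k * math.ceil(w * math.log(t + 1)):
            i = next_arm                      # deterministic exploration
            next_arm = (next_arm + 1) % k
            n_explore += 1
        else:
            i = max(range(k), key=lambda a: sums[a] / counts[a])  # exploit
        r = arms[i]()
        counts[i] += 1
        sums[i] += r
        total += r
    return total, counts

# Usage: two Bernoulli arms with (assumed) means 0.3 and 0.7.
rng = random.Random(1)
arms = [lambda: float(rng.random() < 0.3),
        lambda: float(rng.random() < 0.7)]
total, counts = dsee(arms, horizon=2000)
```

Because the round-robin exploration slots come first, every arm is sampled before the first exploitation step, so the empirical means are always well defined.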
The issue of risk in multi-armed bandit problems is considered, and parallel results are developed under the measure of mean-variance, a commonly adopted risk measure in economics and mathematical finance. It is shown that the model-specific regret and the model-independent regret in terms of the mean-variance of the reward process are lower bounded by $\Omega(\log T)$ and $\Omega(T^{\frac{2}{3}})$, respectively. It is also shown that variations of the policies developed for the classic risk-neutral MAB achieve these lower bounds. A sequential learning policy is also developed under the measure of value at risk, another common risk measure in mathematical finance. In addition, minimal side information on the reward model is introduced that can lead to bounded regret and, thus, complete learning. Given this side information, a sequential learning policy with bounded regret is proposed. The second part of the dissertation is on sequential inference, which can be categorized into three classes of problems: sequential hypothesis testing, change detection, and active hypothesis testing. These classic problems date back to the 1940s, with pioneering work by Wald, Chernoff, Page, and Shewhart. The focus of this dissertation is on time-varying models, hierarchical observations, and non-parametric composite hypotheses, motivated by recent engineering applications such as short-term instability detection from PMU measurements in power systems and the detection of heavy hitters and denial-of-service attacks in the Internet and in communication and computer networks. The main results in the second part include asymptotically optimal tests for the sequential hypothesis testing and change-point detection problems under a time-varying distribution model suitable for instability detection applications. The asymptotic optimality of the tests is proven by establishing fundamental limits on the Bayesian cost of any test.
The problem of active inference under hierarchical observations and unknown, non-parametric models is also considered. A policy that interactively chooses the observation points is proposed and shown to achieve sample complexity of optimal logarithmic order in both the problem size and the reliability constraint.
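The sequential hypothesis testing problems above trace back to Wald's sequential probability ratio test. As a point of reference, the following is a minimal sketch of the classic Wald test for two simple Bernoulli hypotheses; it is not the dissertation's time-varying-model test, and the parameters, thresholds, and Bernoulli observation model are illustrative assumptions.

```python
import math
import random

def sprt(samples, p0, p1, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test for H0: p = p0 vs
    H1: p = p1 on Bernoulli observations. Stops when the running
    log-likelihood ratio crosses Wald's approximate thresholds."""
    lower = math.log(beta / (1 - alpha))   # accept H0 at or below this
    upper = math.log((1 - beta) / alpha)   # accept H1 at or above this
    llr, n = 0.0, 0
    for x in samples:
        n += 1
        # Per-sample log-likelihood ratio increment for a Bernoulli outcome.
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    return "undecided", n

# Usage: data generated under p = 0.8 should lead to accepting H1
# after only a handful of samples.
rng = random.Random(0)
stream = (rng.random() < 0.8 for _ in range(10000))
decision, n = sprt(stream, p0=0.2, p1=0.8)
```

The appeal of the sequential formulation, as in the dissertation, is that the stopping time adapts to the data: easy instances are decided after very few observations, while the thresholds control both error probabilities.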
dc.subject Multi-Armed Bandit
dc.subject Sequential inference
dc.subject Sequential learning
dc.subject Time-varying models
dc.subject Unknown models
dc.subject Computer science
dc.subject Electrical engineering
dc.subject Operations research
dc.type dissertation or thesis
Doctor of Philosophy (Ph.D.), Electrical and Computer Engineering
dc.contributor.chair Zhao, Qing
dc.contributor.committeeMember Bitar, Eilyan Yamen
dc.contributor.committeeMember Wagner, Aaron B.
