Learning In The Presence Of Unawareness
Markov decision processes (MDPs) are widely used for modeling decision-making problems in robotics, automated control, and economics. Traditional MDPs assume that the decision maker (DM) knows all states and actions. However, this may not be true in many situations of interest. We define a new framework, MDPs with unawareness (MDPUs), which allows for the possibility that a DM may not be aware of all possible actions. We provide a complete characterization of when a DM can learn to play near-optimally in an MDPU, and give an algorithm that learns to play near-optimally when it is possible to do so, as efficiently as possible. In particular, we characterize when a near-optimal solution can be found in polynomial time. We formalize decision-making problems in robotics and automated control using continuous MDPs and actions that take place over continuous time intervals. We then approximate the continuous MDP using finer and finer discretizations. Doing this results in a family of systems, each of which has an extremely large action space, although only a few actions are "interesting". This can be modeled using MDPUs, where the action space is much smaller. As we show, MDPUs can be used as a general framework for learning tasks in robotic problems. We prove results on the difficulty of learning a near-optimal policy in an an MDPU for a continuous task. We apply these ideas to the problem of having a humanoid robot learn on its own how to walk. Finally, we consider the scenario in which the DM has a limited budget for solving the problem on top of unawareness. In order to deal with such problems, we define a model called MDPUs with a prior and a budget (MDPUBs) that considers both unawareness and a limited budget. We also consider the problem of learning to play approximately optimally in a subclass of MDPUBs called budgeted learning problems with unawareness (BLPUs), which are multiarmed bandit problems in which there may be arms that the DM is unaware of. We provide a policy that is 0.25c-optimal for BLPUs, where c is a constant determined by the probability of discovering a new arm and the probability of there being undiscovered arms, that can be computed in polynomial time.
learning; unawareness; robotics
Kleinberg,Robert David; Selman,Bart; Blume,Lawrence Edward
Ph. D., Computer Science
Doctor of Philosophy
dissertation or thesis