Learning In The Presence Of Unawareness

Other Titles



Markov decision processes (MDPs) are widely used for modeling decision-making problems in robotics, automated control, and economics. Traditional MDPs assume that the decision maker (DM) knows all states and actions. However, this may not be true in many situations of interest. We define a new framework, MDPs with unawareness (MDPUs), which allows for the possibility that a DM may not be aware of all possible actions. We provide a complete characterization of when a DM can learn to play near-optimally in an MDPU, and give an algorithm that learns to play near-optimally when it is possible to do so, as efficiently as possible. In particular, we characterize when a near-optimal solution can be found in polynomial time. We formalize decision-making problems in robotics and automated control using continuous MDPs and actions that take place over continuous time intervals. We then approximate the continuous MDP using finer and finer discretizations. Doing this results in a family of systems, each of which has an extremely large action space, although only a few actions are "interesting". This can be modeled using MDPUs, where the action space is much smaller. As we show, MDPUs can be used as a general framework for learning tasks in robotic problems. We prove results on the difficulty of learning a near-optimal policy in an an MDPU for a continuous task. We apply these ideas to the problem of having a humanoid robot learn on its own how to walk. Finally, we consider the scenario in which the DM has a limited budget for solving the problem on top of unawareness. In order to deal with such problems, we define a model called MDPUs with a prior and a budget (MDPUBs) that considers both unawareness and a limited budget. We also consider the problem of learning to play approximately optimally in a subclass of MDPUBs called budgeted learning problems with unawareness (BLPUs), which are multiarmed bandit problems in which there may be arms that the DM is unaware of. We provide a policy that is 0.25c-optimal for BLPUs, where c is a constant determined by the probability of discovering a new arm and the probability of there being undiscovered arms, that can be computed in polynomial time.

Journal / Series

Volume & Issue



Date Issued




learning; unawareness; robotics


Effective Date

Expiration Date




Union Local


Number of Workers

Committee Chair

Halpern,Joseph Yehuda

Committee Co-Chair

Committee Member

Kleinberg,Robert David
Blume,Lawrence Edward

Degree Discipline

Computer Science

Degree Name

Ph. D., Computer Science

Degree Level

Doctor of Philosophy

Related Version

Related DOI

Related To

Related Part

Based on Related Item

Has Other Format(s)

Part of Related Item

Related To

Related Publication(s)

Link(s) to Related Publication(s)


Link(s) to Reference(s)

Previously Published As

Government Document




Other Identifiers


Rights URI


dissertation or thesis

Accessibility Feature

Accessibility Hazard

Accessibility Summary

Link(s) to Catalog Record