Principled Off-Policy Imitation Learning via Boosting
Abstract
Imitation learning is a promising paradigm for learning policies that solve a variety of tasks given expert data. Off-policy imitation learning is especially attractive to practitioners because, in principle, it allows the policy to improve using previously collected data, much like standard value-based off-policy reinforcement learning algorithms such as Deep Q-Learning and actor-critic methods. However, this is generally ill-defined: the policy improvement operator can, strictly speaking, only be applied to data collected by the most recent policy, which makes the algorithm on-policy. To mitigate this while remaining off-policy, we design an actor-critic method that treats the replay buffer as a collection of data from a set of weak learners. Our algorithm weights each weak learner's data more appropriately when sampling for policy optimization, offering a principled way to mitigate the distribution mismatch described above in the off-policy setting. We apply this technique to both state-based and vision-based tasks in the DeepMind Control Suite and find that our method improves learning in terms of sample efficiency.
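The sketch below is only an illustration of the idea stated in the abstract: a replay buffer whose transitions are grouped by the policy (weak learner) that collected them, with per-learner weights applied before sampling a minibatch. The class name `BoostedReplayBuffer` and the geometric `decay` schedule are hypothetical placeholders, not the paper's actual weighting rule.

```python
import numpy as np


class BoostedReplayBuffer:
    """Replay buffer whose transitions are grouped by the weak learner
    (policy snapshot) that collected them. Sampling re-weights each
    learner's data before drawing a minibatch for policy optimization.

    NOTE: this is a minimal sketch; the geometric weighting below is an
    assumed stand-in for the boosting-style weights used in the paper.
    """

    def __init__(self, decay=0.9):
        self.decay = decay   # hypothetical weighting factor per learner
        self.groups = []     # one list of transitions per weak learner

    def start_new_learner(self):
        # Call whenever the actor is updated; subsequent transitions are
        # attributed to the new weak learner.
        self.groups.append([])

    def add(self, transition):
        self.groups[-1].append(transition)

    def sample(self, batch_size, rng=None):
        rng = rng or np.random.default_rng()
        # Weight each weak learner's data; recent learners receive larger
        # weight, scaled by how many transitions each group holds.
        n = len(self.groups)
        weights = np.array(
            [self.decay ** (n - 1 - i) * len(g) for i, g in enumerate(self.groups)],
            dtype=float,
        )
        probs = weights / weights.sum()
        batch = []
        for _ in range(batch_size):
            g = rng.choice(n, p=probs)                  # pick a weak learner
            batch.append(self.groups[g][rng.integers(len(self.groups[g]))])
        return batch
```

A usage pattern under these assumptions: call `start_new_learner()` after each actor update, `add()` for every environment transition, and `sample()` to form the minibatches used for the critic and policy updates.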