TOWARDS SPECIALIZED REINFORCEMENT LEARNING FROM DIVERSE DATA

A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

by Jonathan Daniel Chang
August 2024

© 2024 Jonathan Daniel Chang
ALL RIGHTS RESERVED

Jonathan Daniel Chang, Ph.D.
Cornell University 2024

Reinforcement learning (RL) fundamentally focuses on teaching agents how to make decisions by interacting with an environment. Unlike supervised learning approaches that learn from a fixed dataset, reinforcement learning agents learn by doing, receiving feedback as rewards or penalties based on their chosen actions. However, the generality of RL makes efficient learning incredibly difficult, which in turn complicates applying RL to novel tasks. Even with the explosive success of ChatGPT (OpenAI, 2023), which applied a deep RL algorithm, Proximal Policy Optimization (PPO) (Schulman et al., 2017b), to large language models (LLMs), there has been growing interest in either reducing the complexity of RL algorithms or eliminating the need for online interaction. Even beyond LLMs, despite RL’s superhuman abilities in games such as DOTA (Berner et al., 2019) or Go (Silver et al., 2016b), we have yet to see widespread adoption of RL in real-world applications compared to other, arguably more specialized, learning paradigms such as supervised learning. We observe that a critical challenge for the broad adoption of RL is efficiently utilizing diverse data sources to create a specialized algorithm. That is, for many of the RL successes mentioned above, the algorithms used to train the agents were general-purpose algorithms that could also be used in other applications. In this thesis, we introduce RL algorithms that progress toward specialized algorithms for various settings.
We first discuss efficient inverse reinforcement learning (IRL) from different types of data sources. We consider three settings: learning from observations alone, where the demonstration data does not contain action information; offline learning, where instead of interactive access to the environment we only get a large dataset of interactions; and off-policy learning, where the interactive feedback that we learn from can come from different learning agents. In all three settings, we introduce a principled algorithm that performs efficient learning in a wide range of control tasks.

Next, we discuss learning specialized algorithms in the space of generative models. Foundation models increasingly live up to their name, becoming capable base models for improved downstream performance on various tasks across multiple application domains. In the second part of this thesis, we investigate RL with these models for learning decision-making agents from diverse data sources. We present three different learning settings with generative models: text generation with an interactive black box model, text generation with high-quality human labels, and text instruction-guided image generation. Overall, each setting has a specific property, whether it is deterministic transition dynamics or a short horizon, that allows for the design of more specialized algorithms that efficiently exploit these properties and improve on the general RL baseline.

BIOGRAPHICAL SKETCH

Jonathan Chang was born in Germany. He received his Bachelor’s and Master’s degrees in computer science and applied mathematics from Brown University. He then attended Cornell University for his Ph.D. At Cornell, he was a teaching assistant for Introduction to Artificial Intelligence and Topics in Reinforcement Learning. Jonathan also spent time as a research intern at Meta in New York City and Microsoft Research in Montreal. He received the LinkedIn Ph.D. Award for Fall 2023 and Spring 2024.

To my parents Michael & Elizabeth, my friend Luke, and my other half Elise

ACKNOWLEDGEMENTS

First, to my advisor, Wen Sun. From the first Zoom call, Wen has been able to balance encouragement and structured advice. He has always helped me make progress on our projects and had an almost uncanny intuition for why I was stuck on a problem. Over the countless meetings I’ve had with him, I could count on walking out of his office with a clear goal and an inspired excitement about the problem we were tackling. Through his guidance, he shaped how I approach research, how I ask questions, and how I move forward in the face of profound challenge and uncertainty. Richard Feynman wrote, "The worthwhile problems are the ones you can really solve or help solve, the ones you can really contribute something to." During my second year, Wen gave me similar advice when I felt discouraged about my abilities as a researcher, and that advice gave me the strength to continue research. Thank you for your support, your trust, your patience in allowing me to make mistakes, and your time being an amazing mentor and advisor. I will always remember our time doing research, and I look forward to continuing our collaborations!

To my thesis committee members: Hadas Kress-Gazit, Sanjiban Choudhury, and Yoav Artzi. Thank you for taking the time to speak with me over the years about my research and to provide valuable feedback about how I’ve been doing. Especially when my research focus pivoted away from robotics and control to generative models, everyone on the committee encouraged this change and supported me throughout.

Next, to my previous advisors, Stefanie Tellex and George Konidaris. Thank you for taking a chance on me. Armed only with a bullish excitement for reinforcement learning and deep learning, I joined their lab to pursue research for the first time.
I can still remember Stefanie walking down College Hill with me to give me advice about graduate school applications and George giving me essential tips about writing, presenting, and conducting research in his late-night e-mails. Thank you! I would not be here otherwise.

To Rahul Kidambi and Kianté Brantley. My graduate school experience can be divided into two chapters: first being mentored by Rahul and next being mentored by Kianté. Both Rahul and Kianté taught me how to be a team player, how to be a compassionate collaborator, and how to be an impactful mentor to a more junior researcher. Both led by example and gave me the invaluable gift of experiencing great mentorship. Thank you both for all the support!

To my other mentors Mikael Henaff, Dipendra Misra, Brandon Amos, Eric Yuan, and Marc-Alexandre Cote. Thank you all for your time, patience, internship opportunities, and guidance over the years. I would love to continue collaborating in the future!

To Rebecca "Becky" Stewart, thank you for all your help navigating my time at Cornell and for your patience answering all my questions first thing in the morning! Many of my logistical mishaps were corrected by Becky, and I wouldn’t be here without her support.

To my other collaborators and friends: Dhruv Sreenivas, Masatoshi Uehara, Kaiwen Wang, Wendy Huang, Owen Oertell, Rajkumar Ramamurthy, Ge Gao, Nico Espinosa Dice, Yuda Song, Yiyi Zhang, Wenhao Zhang, Runzhe Wu, Jerry Chee, Aaron Gokaslan, Aaron Tucker, Katie Luo, Rohan Banerjee, thank you all!

To my parents, Michael and Elizabeth Chang, thank you for your love and support. I am grateful for all you have done and am fortunate to be blessed with a supportive network of unwavering encouragement. Thank you for giving me the opportunity to pursue my passions and providing the space for me to grow into the person I am today. I couldn’t have done it without you both.

To my partner, Elise Burdette, I am forever grateful for your love and support through the last five years. Thank you for being there with me to witness my entire journey at Cornell, for listening to countless rants about reinforcement learning, and for filling our days with both epic adventures and happiness. Thank you for always believing in me, and I am nothing but excited for our future together.

Last but not least, to my dog, Choco, and my cat, George. Thank you for reminding me that there is more to life than research and insisting that we enjoy the good weather. I can’t imagine my Ph.D. without you two, even if George likes to delete blocks of code by sitting on my keyboard. Thanks, you two.

CONTENTS

Biographical Sketch
Dedication
Acknowledgements
Contents

1 Introduction
  1.1 Main Contributions
    1.1.1 Imitation Learning from Diverse Data Sources
    1.1.2 Reinforcement Learning and Imitation Learning with Generative Models
  1.2 Background
    1.2.1 Markov Decision Process (MDP)
    1.2.2 Deep Policy Gradient Algorithms
    1.2.3 Inverse Reinforcement Learning (IRL)
  1.3 Organization
  1.4 Bibliographical Remarks

I Imitation Learning

2 Model-Based Imitation Learning From Observation Alone
  2.1 Introduction
  2.2 Related Works
  2.3 Setting
    2.3.1 Function Approximation Setup
  2.4 Algorithm
    2.4.1 Components of MobILE
    2.4.2 Exploration And Imitation Tradeoff
  2.5 Analysis
    2.5.1 Regret Bound
    2.5.2 Exploration in ILFO and the Exponential Gap between IL and ILFO
  2.6 Practical Instantiation of MobILE
  2.7 Experiments
    2.7.1 Benchmarking MobILE on MuJoCo suite
    2.7.2 Importance of the optimistic MDP construction
    2.7.3 Varying Number of Expert Samples
  2.8 Conclusion

3 Model-Based Offline Imitation Learning
  3.1 Introduction
  3.2 Related work
  3.3 Setting
  3.4 Algorithm
    3.4.1 Specialization to offline RL
  3.5 Analysis
    3.5.1 Analysis: Discrete MDPs
    3.5.2 Analysis: KNRs and GPs for Continuous MDPs
  3.6 Practical Implementation
  3.7 Experiments
    3.7.1 Evaluation on MuJoCo Continuous Control Tasks
    3.7.2 Ablation
  3.8 Conclusion

4 Model-Free Off-Policy Imitation Learning
  4.1 Introduction
  4.2 Related works
  4.3 Preliminaries
    4.3.1 Adversarial Imitation Learning (AIL)
    4.3.2 Discriminator Actor Critic (DAC)
    4.3.3 ValueDICE
  4.4 Algorithm
    4.4.1 AILBoost: Adversarial Imitation Learning via Boosting
  4.5 Experiments
    4.5.1 Controller State-based Experiments
    4.5.2 Image-based Experiments
    4.5.3 Sensitivity to gradient-based optimization for weak learners and discriminators
  4.6 Conclusion

II IL and RL for Generative Models

5 Learning to Generate Better Than Your LLM
  5.1 Introduction
  5.2 Related Work
  5.3 Preliminaries
  5.4 Reinforcement Learning from Guided Feedback
  5.5 Theoretical Justification
  5.6 Experiments
    5.6.1 Experimental Results
  5.7 Conclusion and Future Work

6 Provably Efficient RL with Preference-based Feedback via Dataset Reset
  6.1 Introduction
    6.1.1 Related Work
  6.2 Preliminaries
  6.3 Dataset Reset Policy Optimization
  6.4 Theoretical Analysis
    6.4.1 Theoretical Sample Complexity
  6.5 Experiments
    6.5.1 How well can DR-PO optimize the RLHF objective?
    6.5.2 Analysis of Dataset Reset Proportion
    6.5.3 DR-PO Transfer Performance
    6.5.4 DR-PO Scaling Performance on Anthropic HH
  6.6 Conclusion

7 RL for Consistency Models: Faster Reward Guided Text-to-Image Generation
  7.1 Introduction
  7.2 Related Works
  7.3 Preliminaries
    7.3.1 Reinforcement Learning
    7.3.2 Diffusion and Consistency Models
    7.3.3 Reinforcement Learning for Diffusion Models
  7.4 Reinforcement Learning for Consistency Models
  7.5 Experiments
    7.5.1 RLCM vs. DDPO Performance Comparisons
    7.5.2 Train and Test Time Analysis
    7.5.3 Ablation of Inference Horizon for RLCM
    7.5.4 Qualitative Effects on Generalization
  7.6 Conclusion and Future Directions

8 Conclusion
  8.1 Imitation Learning
  8.2 Reinforcement Learning for Generative Models
  8.3 Concluding Remarks

III Appendix

A Missing Proofs and Details in Chapter 2
  A.1 Analysis of Algorithm 2
    A.1.1 Discrete MDPs
    A.1.2 KNRs
    A.1.3 General Function Class G with Bounded Eluder dimension
    A.1.4 Proof of Theorem 7
  A.2 Auxiliary Lemmas
  A.3 Implementation Details
    A.3.1 Environment Setup and Benchmarks
    A.3.2 Practical Implementation of MobILE
    A.3.3 Hyper-parameter Details
  A.4 Additional Experimental Results
    A.4.1 Modified Cartpole-v0 environment with noise added to transition dynamics
    A.4.2 Swimmer Learning Curves
    A.4.3 Additional Results
    A.4.4 Ablation Study on Number of Models used for Strategic Exploration Bonus

B Missing Proofs and Details in Chapter 3
  B.1 Bonus Designs
    B.1.1 Tabular models
    B.1.2 KNRs
    B.1.3 Gaussian processes
  B.2 Proof of Theorem 10
  B.3 Finite sample error bound for each model
    B.3.1 Discrete MDPs
    B.3.2 KNRs
    B.3.3 Gaussian processes
    B.3.4 Missing Proofs
  B.4 Auxiliary Lemmas
  B.5 Implementation Details
    B.5.1 Environment Details
    B.5.2 Dynamics Ensemble Architecture and Model Learning
    B.5.3 Policy Architecture and TRPO Details
    B.5.4 Discriminator Update and Cost Function Details
  B.6 Additional Experiments
    B.6.1 MILO with Expert Trajectories
    B.6.2 Performance of MILO on Ant without Pessimism

C Missing Proofs and Details in Chapter 4
  C.1 Detailed Algorithm Pseudocode
  C.2 Implementation and Experiment Details
    C.2.1 Environment Details
    C.2.2 Dataset Details
    C.2.3 Hyperparameters
  C.3 Additional Results
    C.3.1 Aggregate Performance Comparisons
    C.3.2 Learning Curves
    C.3.3 Learning curves across different optimization schedules

D Missing Proofs and Details in Chapter 5
  D.1 Additional Related Work
  D.2 Additional Algorithms
  D.3 Additional Experimental Details
    D.3.1 KL Reward Constraint
    D.3.2 Task Details
    D.3.3 IMDB - Algorithm Details
    D.3.4 CommonGen - Algorithm Hyperparameters
    D.3.5 TL;DR Summarization - Algorithm Hyperparameters
  D.4 IMDB Qualitative Examples
  D.5 CommonGen Qualitative Examples
  D.6 TL;DR Qualitative Examples

E Missing Proofs and Details in Chapter 6
  E.1 DR-PO with NPG
  E.2 Proof of Theorem 26
    E.2.1 Q function Estimation Error
    E.2.2 NPG Analysis
    E.2.3 Unregularized Suboptimality Gap w.r.t. r∗
  E.3 NPG with regularized Q functions
    E.3.1 Theoretical Analysis
  E.4 Proof of Theorem 90
    E.4.1 Q function Estimation Error
    E.4.2 Regularized NPG Analysis
    E.4.3 Unregularized Suboptimality Gap w.r.t. r∗
  E.5 Auxiliary Lemmas
    E.5.1 Least Squares Guarantee
    E.5.2 Maximum Likelihood Estimation Guarantee
    E.5.3 Performance Difference
    E.5.4 KL Divergence Property
  E.6 Additional Experiment Details
    E.6.1 Experiment Hyperparameters and Task Details
    E.6.2 Dataset Reset Implementation Details
    E.6.3 Details on GPT4 Winrate
    E.6.4 Examples from Test
    E.6.5 Hyperparameters
    E.6.6 Additional Experiments

F Missing Proofs and Details in Chapter 8
  F.1 Consistency Models
  F.2 Experiment Details
    F.2.1 Hyperparameters
    F.2.2 Hyperparameter Sweep Ranges
    F.2.3 Details on Task Prompts
  F.3 Additional Samples from RLCM
    F.3.1 Aesthetic Task
    F.3.2 Prompt Image Alignment

CHAPTER 1
INTRODUCTION

Reinforcement learning (RL) fundamentally focuses on teaching agents how to make decisions by interacting with an environment. Unlike supervised learning approaches that learn from a fixed dataset, reinforcement learning agents learn by doing, receiving feedback as rewards or penalties based on their chosen actions.
This dynamic learning process enables agents to optimize their behavior over time to achieve specific goals, making RL particularly effective for problems where explicit programming of correct actions is challenging. Recent advancements in deep learning have led to the development of deep reinforcement learning, where deep neural networks approximate the optimal action-value functions. This integration has significantly expanded RL’s applicability, allowing it to tackle complex tasks ranging from robotic control (Akkaya et al., 2019) to game playing (Silver et al., 2018; Berner et al., 2019), revolutionizing how machines learn and adapt in uncertain and variable environments.

This generality, however, makes efficient learning incredibly difficult, which in turn complicates applying RL to novel tasks. Even with the explosive success of ChatGPT (OpenAI, 2023), which applied a deep RL algorithm, Proximal Policy Optimization (PPO) (Schulman et al., 2017b), to large language models (LLMs), there has been growing interest in either reducing the complexity of RL algorithms or eliminating the need for online interaction. Even beyond LLMs, despite RL’s superhuman abilities in games such as DOTA (Berner et al., 2019) or Go (Silver et al., 2016b), we have yet to see widespread adoption of RL in real-world applications compared to other, arguably more specialized, learning paradigms such as supervised learning. We observe that a critical challenge for the broad adoption of RL is efficiently utilizing diverse data sources to create a specialized algorithm. That is, for many of the RL successes mentioned above, the algorithms used to train the agents were general-purpose algorithms that could also be used in other applications. In this thesis, we introduce RL algorithms that progress toward specialized algorithms for various settings.

The first contribution of this dissertation comes from investigating efficient inverse reinforcement learning (IRL) from different types of data sources.
We consider three settings. First, a significant source of demonstration data exists in the form of videos. Although it is straightforward to define the demonstrators’ states we would like to imitate (i.e., video frames), it is difficult to reliably infer the actions taken to reproduce the sequence of states shown in a video demonstration. Imitating an expert without knowledge of the actions creates a challenging exploration problem. This setting is called Imitation Learning from Observations (ILFO). Second, real-world barriers exist for online exploration and active data collection, ranging from the safety concerns of deploying a suboptimal model to the prohibitive costs of running an immense model. Moreover, actively collecting high-quality demonstration data, usually through human labeling, is incredibly costly. For a given task, we can therefore envision a setting where, rather than getting interactive access to the environment, a learner can access a large corpus of low-quality but pre-collected data and a much smaller source of high-quality expert data. This setting is called offline imitation learning, and we introduce the first offline IL algorithm. Finally, when it does make sense to allow online exploration for IL, using the collected data efficiently is critical for an algorithm to be practical. Prior works improved the sample complexity of IL algorithms by investigating the intersection of off-policy RL and distribution-matching IL methods. A key challenge, however, was developing an algorithm that correctly utilized these off-policy samples and scaled to more difficult problem settings with high-dimensional states. This setting is called off-policy IL, and we provide a principled off-policy procedure.

The second contribution of this work is learning specialized algorithms in the space of generative models.
Foundation models increasingly live up to their name, becoming capable base models for improved downstream performance on various tasks across multiple application domains. As alluded to before, reinforcement learning from human feedback (RLHF) has emerged as a promising technique to make these foundation models even more capable in complex settings and aligned to human intentions, providing a new toolkit to optimize and guide generative model behavior. This approach, though, requires collecting a vast amount of resource-demanding human preference data, posing significant challenges to the scalability and widespread adoption of effective RLHF. In the second part of this thesis, we investigate algorithmic improvements for learning decision-making agents from diverse data sources, working toward a vision of a scalable method for aligning these agents to user intentions. We present three different learning settings with generative models: text generation with an interactive black box model, text generation with high-quality human labels, and text instruction-guided image generation. Overall, each setting has a specific property, whether it is deterministic transition dynamics or a short horizon, that allows for the design of more specialized algorithms that efficiently exploit these properties.

We live in a world with an overabundance of demonstration data for sequential tasks in videos, texts, and messy spreadsheets. Applying general-purpose RL algorithms to all these settings is unnecessarily hard. Our main focus in this thesis is to improve the efficiency of RL by designing specialized sequential decision-making algorithms from various sources of information.

1.1 Main Contributions

We outline the two main contributions presented in this thesis.

1.1.1 Imitation Learning from Diverse Data Sources

We consider principled approaches to imitation learning from three different data types:

1. Observations: we have no action information from the expert that we wish to imitate;
2. Offline: we only have access to a large corpus of (suboptimal) trajectories from the environment, with no additional interactive data;
3. Off-policy: the online experience we use for learning comes from multiple different policies.

In the imitation learning from observations (ILFO) setting, we show that ILFO is strictly harder than IL and that exploration is necessary for effective imitation. In the offline setting, we present what was the state of the art in offline IL at the time of publication, showing the efficacy of our model-based approach in more complicated simulation domains. Furthermore, in theory, we investigate how additional offline data allows us to mitigate covariate shift without any additional online interaction or expert interaction. Finally, we propose an off-policy IRL algorithm that is both principled and practically scalable. At the time of writing, many off-policy IRL algorithms that scaled to higher-dimensional states such as images required ad hoc modifications that were not principled. On the other hand, another line of off-policy IL work building on DICE (Nachum et al., 2019a) was principled but struggled to scale to high-dimensional tasks. For off-policy IL, we propose AILBoost, which aims to be the best of both worlds: a principled algorithm that scales well in practice.

1.1.2 Reinforcement Learning and Imitation Learning with Generative Models

Our main contributions in this section lie in connecting many of the insights from the previous part to the space of generative models. The key insight underlying these contributions is that many generative processes are sequential in nature and thus can be modeled with the same machinery we have been using in IL and RL. For example, generating a sentence is the sequential prediction of next words, and generating an image with a diffusion model is a sequence of denoising steps.
With that in mind, we broadly investigate two different generation tasks: (1) text generation (Chang et al., 2023, 2024b) and (2) image generation with diffusion (Oertell et al., 2024). For text generation, we lift prior algorithms such as AggreVaTeD (Sun et al., 2017a) and LOLS (Chang et al., 2015b) to LLMs while introducing new RL algorithms such as PPO++ (Chang et al., 2023) and DR-PO (Chang et al., 2024b). The main contribution here is similar to that of the previous part: we leverage different data sources, such as interactive data for PPO++ and offline data for DR-PO, to augment RL finetuning. Finally, we introduce a novel perspective on image generation with consistency models that improves guided text-to-image generation over existing diffusion-model methods while being up to two orders of magnitude faster.

1.2 Background

Here we give a brief introduction to background concepts that are used repeatedly throughout the dissertation. We specifically cover Markov Decision Processes, Trust Region Policy Optimization (Schulman et al., 2015a), Proximal Policy Optimization (Schulman et al., 2017b), and adversarial inverse reinforcement learning. Each chapter redefines the relevant background concepts in more detail, making each chapter self-contained. The purpose of this section is to briefly introduce the models and base algorithms used throughout this dissertation.

1.2.1 Markov Decision Process (MDP)

In this dissertation, we present algorithms that model the problem as a Markov Decision Process. We consider two types of MDPs.

Finite Horizon MDP: As the name suggests, finite horizon MDPs refer to problems that have a well-defined, finite maximum horizon. For example, when modeling language generation as an MDP, we might set a maximum generation length. More formally, a finite horizon MDP [cite] is defined as a tuple $(\mathcal{S}, \mathcal{A}, P, R, \mu, H)$. $\mathcal{S}$ and $\mathcal{A}$ define the sets of states and actions, respectively, for the modeled task.
$P$ defines the transition dynamics, i.e., the probability of transitioning to the next state $s' \in \mathcal{S}$ when taking action $a \in \mathcal{A}$ from state $s \in \mathcal{S}$. $R$ is the reward function for this task. Finally, we define $\mu$ as the initial distribution of states and $H \in \mathbb{N}^+$ as the finite horizon of the MDP. We can then model a sequential task as an MDP such that at timestep $t$ we are at state $s_t \in \mathcal{S}$, take action $a_t \in \mathcal{A}$, transition to the next state $s_{t+1} \in \mathcal{S}$ according to $P(s_{t+1} \mid s_t, a_t)$, and get a reward of $R(s_t, a_t)$. We also define a policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ that maps from a state to a distribution over actions. So for a given initial state $s_0 \sim \mu$, we get an action $a_0 \sim \pi(\cdot \mid s_0)$ and then transition to the next state $s_1 \sim P(\cdot \mid s_0, a_0)$. We repeat this process for a maximum of $H$ timesteps to get a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots, s_H, a_H)$. Now, given a policy $\pi$, we can define the state-action Q function $Q^\pi_t(s, a)$ as

$$Q^\pi_t(s, a) = R(s, a) + \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ V^\pi_{t+1}(s') \right],$$

where the state value function $V^\pi_t(s)$ is defined as

$$V^\pi_t(s) = \mathbb{E}\left[ \sum_{i=t}^{H} R(s_i, a_i) \,\middle|\, a_i \sim \pi(\cdot \mid s_i),\ s_{i+1} \sim P(\cdot \mid s_i, a_i),\ s_t = s \right].$$

The objective function $J(\pi)$ is then defined as

$$J(\pi) = \mathbb{E}_{s_0 \sim \mu}\left[ V^\pi_0(s_0) \right].$$

The goal is to learn a policy $\pi \in \Pi$ that maximizes this objective.

Discounted Infinite Horizon MDP: Discounted infinite horizon MDPs account for problems that do not have a fixed sequence length. They are defined as $(\mathcal{S}, \mathcal{A}, P, R, \mu, \gamma)$, where $\gamma \in (0, 1)$ is the discount factor. Unlike in finite horizon MDPs, the state and state-action value functions become time independent. We define the state-action Q function for a given policy, $Q^\pi(s, a)$, as

$$Q^\pi(s, a) = R(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ V^\pi(s') \right],$$

where the value function $V^\pi(s)$ is defined as

$$V^\pi(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, a_t \sim \pi(\cdot \mid s_t),\ s_{t+1} \sim P(\cdot \mid s_t, a_t),\ s_0 = s \right].$$

We can also define the advantage function $A^\pi(s, a)$ as $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$.
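As a rough illustration of the finite-horizon definitions above, the sketch below evaluates a fixed policy by backward recursion on a small tabular MDP and then forms the finite-horizon analogue of the advantage. The two-state, two-action MDP, the policy, and the helper name `evaluate_policy` are hypothetical choices made only for this example; they are not taken from the dissertation.

```python
# Hypothetical toy MDP (an assumption for illustration): 2 states, 2 actions.
# Action 0 stays in the current state, action 1 switches to the other state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # P[0][a][s']: transitions out of state 0
    [[0.0, 1.0], [1.0, 0.0]],  # P[1][a][s']: transitions out of state 1
]
R = [[0.0, 0.0], [1.0, 1.0]]   # R[s][a]: reward 1 whenever we act from state 1
pi = [[0.0, 1.0], [0.0, 1.0]]  # pi[s][a]: a policy that always switches
mu = [0.5, 0.5]                # initial state distribution
H = 2                          # timesteps run over t = 0, ..., H


def evaluate_policy(P, R, pi, H):
    """Backward recursion for Q_t and V_t, t = H, ..., 0, with V_{H+1} = 0."""
    n_states, n_actions = len(R), len(R[0])
    V_next = [0.0] * n_states
    Q = [None] * (H + 1)
    V = [None] * (H + 1)
    for t in range(H, -1, -1):
        # Q_t(s, a) = R(s, a) + E_{s' ~ P(.|s, a)}[V_{t+1}(s')]
        Q[t] = [
            [R[s][a] + sum(P[s][a][s2] * V_next[s2] for s2 in range(n_states))
             for a in range(n_actions)]
            for s in range(n_states)
        ]
        # V_t(s) = E_{a ~ pi(.|s)}[Q_t(s, a)]
        V[t] = [sum(pi[s][a] * Q[t][s][a] for a in range(n_actions))
                for s in range(n_states)]
        V_next = V[t]
    return Q, V


Q, V = evaluate_policy(P, R, pi, H)
J = sum(mu[s] * V[0][s] for s in range(len(mu)))  # J(pi) = E_{s0 ~ mu}[V_0(s0)]
# Finite-horizon analogue of the advantage at t = 1, state 0, action 0 (stay).
advantage = Q[1][0][0] - V[1][0]
print(V[0], J, advantage)
```

In this toy problem, staying in state 0 at timestep 1 has a negative advantage because the policy's own choice (switching into the rewarding state) collects more reward over the remaining steps.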
Intuitively, rather than measuring how good an action is in an absolute sense, as the Q function does, the advantage captures how much better an action is than what the policy achieves in expectation from the same state.

1.2.2 Deep Policy Gradient Algorithms

The works presented throughout this dissertation use several deep RL algorithms as black-box optimizers for the RL objective. Outside of Chapter 4, we use on-policy algorithms, i.e., RL algorithms that update their policy based on experience collected from the current, up-to-date policy. Specifically, I will present a primer on Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a) and Proximal Policy Optimization (PPO) (Schulman et al., 2017b). For both algorithms, we parameterize our policy as a deep neural network with parameters $\theta$. We denote this policy as $\pi_\theta$ and compute the gradient with respect to $\theta$, i.e., $\nabla_\theta J(\pi_\theta)$.

Policy Gradient: Both TRPO and PPO are policy gradient algorithms, so I first present a quick primer on policy gradients. Intuitively, the key idea underlying policy gradients is to increase the probability of actions leading to higher values and decrease the probability of actions resulting in lower values. More formally, for our objective $J(\pi_\theta)$, we have the policy gradient

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_\theta}(s_t, a_t) \right],$$

where $\tau$ is a trajectory collected by following $\pi_\theta$ and $A^{\pi_\theta}$ is the advantage function for the current policy. This gradient was derived by Williams (1992), who introduced the foundational REINFORCE algorithm, a Monte-Carlo policy gradient algorithm. I encourage readers to refer to Sutton and Barto (1998) and Agarwal et al. (2019) for a detailed treatment of policy gradients. A policy gradient algorithm then uses the gradient above to update the policy parameters $\theta$, usually with stochastic gradient ascent with some learning rate $\alpha$:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\pi_\theta).$$

A key challenge here is how to select the appropriate learning rate.
The first-order gradients from our formulation give us the direction of our policy updates but not the magnitude. Moreover, given the online nature of our RL optimization process, a learning rate that works for the policy at one iteration may be catastrophic at another. This problem motivates our next concept, the natural policy gradient (Kakade, 2001a).

Natural Policy Gradient: The natural policy gradient (NPG) remedies the step size problem by including second-order information that captures how sensitive the policy is to changes in the parameters $\theta$. For this purpose, NPG first computes the KL-divergence between the policy before and after the update,

$$D_{KL}\left(\pi_\theta \,\|\, \pi_{\theta+\Delta\theta}\right) = \sum_s \pi_\theta(s) \log\left( \frac{\pi_\theta(s)}{\pi_{\theta+\Delta\theta}(s)} \right).$$

We can then set the maximum tolerance of change as $\epsilon$ and solve the constrained optimization problem

$$\operatorname*{argmax}_{D_{KL}(\pi_\theta \| \pi_{\theta+\Delta\theta}) \le \epsilon} J(\pi_{\theta+\Delta\theta}).$$

To derive an algorithm computing this conservative update, Kakade (2001a) writes the constrained problem as a Lagrangian and performs a Taylor expansion on the modified objective. This results in the following update rule

$$\theta \leftarrow \theta + \sqrt{\frac{2\epsilon}{\nabla_\theta J(\pi_\theta)^\top F^{-1}(\theta) \nabla_\theta J(\pi_\theta)}}\; F^{-1}(\theta) \nabla_\theta J(\pi_\theta),$$

where $F(\theta) = \mathbb{E}\left[ \nabla_\theta \log \pi_\theta \nabla_\theta \log \pi_\theta^\top \right]$ is the Fisher information matrix. The key observation here is that we replaced the learning rate $\alpha$ with a dynamic term that depends on the local sensitivity of the current policy being updated and the constraint $\epsilon$. Furthermore, we get second-order information from $F^{-1}(\theta)$. Together, these improvements allow us to make the largest policy update within the divergence threshold $\epsilon$.

Trust Region Policy Optimization (TRPO): There are two key challenges from NPG that TRPO aims to address. First, it is incredibly expensive to compute the natural gradient $F^{-1}(\theta) \nabla_\theta J(\pi_\theta)$, which involves inverting the Fisher information matrix (the Hessian of the KL divergence), a matrix of size $|\theta| \times |\theta|$. As we parameterize our policies as larger neural networks, the exact computation of this term becomes infeasible.
TRPO instead approximates the natural policy gradient with the conjugate gradient method, an iterative numerical method. The conjugate gradient method needs far less computation than exactly inverting the Hessian, which improves the scalability of NPG.

Second, although NPG provides us with the optimal step size, our Taylor approximation of the KL divergence between policies may misrepresent the actual distance, and our update could still violate the constraint. To mitigate this, TRPO performs a line search that iteratively decreases the update size until the update no longer violates the constraint. Algorithm 1 shows how TRPO enforces that the KL-constraint is satisfied.

Algorithm 1 Line Search for TRPO
1: $\alpha = 1.0$
2: Compute proposed update $\Delta\theta = \sqrt{\frac{2\epsilon}{\nabla_\theta J(\pi_\theta)^\top F^{-1}(\theta) \nabla_\theta J(\pi_\theta)}}\; F^{-1}(\theta) \nabla_\theta J(\pi_\theta)$
3: for $i = 0, 1, \ldots, L$ do
4:   Try update $\theta' = \theta + \alpha \Delta\theta$
5:   if $D_{KL}(\pi_\theta \| \pi_{\theta'}) \le \epsilon$ then
6:     Accept the update and break
7:   else
8:     Decay $\alpha$
9:   end if
10: end for

Proximal Policy Optimization (PPO): Schulman et al. (2017b) introduced PPO, building on the foundations introduced in TRPO. Despite TRPO's improvements over NPG in scalability, TRPO is still complicated to implement and is much more costly to run than the first-order updates used in vanilla policy gradient algorithms. To balance practicality with the insight from NPG of taking conservative policy updates, PPO approximates the trust region used in TRPO and NPG (i.e., $D_{KL} \le \epsilon$) with a clipped surrogate objective,

$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{H} \min\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)} A^{\pi_{\theta_k}}(s_t, a_t),\ \operatorname{clip}\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)},\, 1 - \epsilon,\, 1 + \epsilon \right) A^{\pi_{\theta_k}}(s_t, a_t) \right) \right],$$

where $\theta_k$ denotes the parameters of the policy before the update. The main insight from this objective is that the clip serves as an implicit trust region. Furthermore, this objective allows us to update the policy with stochastic gradient ascent, making both the implementation and the optimization much simpler.
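To make the clipping behavior concrete, here is a minimal sketch of a sample-average estimate of the clipped surrogate, written in plain Python. The function name `ppo_clip_objective`, the default `eps=0.2`, and the toy numbers are assumptions for illustration; a practical implementation would compute the same quantity on batched tensors inside an automatic differentiation framework.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Sample average of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    logp_new / logp_old are log-probabilities of the sampled actions under the
    current policy and the pre-update policy; advantages are estimates of A.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                # pi_theta / pi_theta_k
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # clip(ratio, 1-eps, 1+eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Toy data (hypothetical): probability ratios of 1.5, 0.5 and 1.0.
logp_old = [math.log(0.2), math.log(0.2), math.log(0.5)]
logp_new = [math.log(0.3), math.log(0.1), math.log(0.5)]
advantages = [1.0, -1.0, 2.0]

value = ppo_clip_objective(logp_new, logp_old, advantages)
print(value)
```

In the first sample the ratio 1.5 is clipped to 1.2, so raising the probability of a good action beyond the trust region earns no extra objective. In the second sample the action has a negative advantage and its ratio has already dropped to 0.5, so the clipped term (0.8 times the advantage) is selected and the objective gives no further incentive to push that probability down.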
Although $\epsilon$ requires tuning, the relative simplicity of PPO and its improved scalability have made it the de facto policy gradient algorithm over TRPO.

1.2.3 Inverse Reinforcement Learning (IRL)

Another main topic of this dissertation is inverse reinforcement learning. Although we have used the term imitation learning (IL) up to this point, the family of methods that we investigate to solve the IL problem is IRL. Here I will briefly introduce the setting for IRL and present a unified min-max objective for IRL similar to our RL objective.

Setting: Recall that we defined our setting for RL to be an MDP, i.e., a tuple $(\mathcal{S}, \mathcal{A}, P, R, \mu, H)$. In IL, we still model our task as an MDP, but we do not assume access to the reward. Instead, we usually assume access to an expert's demonstrations in the form of a dataset of trajectories. Thus, the setting for IRL is an MDP without the reward function (i.e., $(\mathcal{S}, \mathcal{A}, P, \mu, H)$) and a separate dataset of $N$ demonstrations, $\mathcal{D} = \{\tau_i\}_{i=1}^{N}$.

Objective: As the name Inverse RL suggests, the overall objective of IRL is to learn the unknown reward function that the expert policy, $\pi_E$, is maximizing. Note that this procedure can easily be extended to solve an IL problem, which seeks to imitate and learn $\pi_E$ from demonstrations. Specifically, once we have the reward function $R$ that the expert is optimizing, we can then do RL with that reward function to learn the imitating policy. We formalize this and get the following objective

$$\hat{\pi} = \operatorname*{argmax}_{\pi \in \Pi} \min_{f \in \mathcal{F}} \mathbb{E}_{\pi}\left[ f(s, a) \right] - \mathbb{E}_{\pi_E}\left[ f(s, a) \right],$$

where $\mathcal{F}$ is the set of discriminators and $\Pi$ is the set of policies. As detailed by Ke et al. (2020), this objective can be viewed as framing IRL as a divergence minimization problem. That is, for different choices of $\mathcal{F}$, we are effectively trying to minimize a different divergence between our current policy $\pi_\theta$ and our expert policy $\pi_E$.
This min-max objective has close connections to generative adversarial networks (Goodfellow et al., 2014), leading to practical instantiations of this objective in foundational works such as GAIL (Ho and Ermon, 2016a). Broadly, algorithms that solve this objective iterate between two steps: 1. solve the inner minimization to learn $f$; 2. with $f$ fixed, do RL to learn a policy $\pi$ that maximizes the current reward defined by $f$.

1.3 Organization

The remainder of the thesis is organized as follows.

Part I: Imitation Learning consists of three chapters. Chapter 2 focuses on imitation learning from observations alone, the setting where the expert demonstrations do not include any action information. In this chapter, we present a model-based algorithm, MobILE. Chapter 3 considers effective imitation learning in the setting where we do not allow online interactions within the MDP but instead have access to a large dataset of (suboptimal) experience. Here, we introduce a model-based offline algorithm, MILO. Finally, Chapter 4 considers imitation learning from off-policy data in a principled way and proposes AILBoost.

Part II: IL and RL for Generative Models presents five chapters. Chapter 5 unifies a wide range of RL and interactive IL algorithms and specializes them for language generation with Large Language Models. We call this framework Reinforcement Learning from Guided Feedback (RLGF) and introduce a new algorithm, PPO++. Chapter 6 removes the assumption that we have access to another interactive LLM and considers whether we can directly use human demonstrations to improve RL for language models. Here we present a hybrid RL algorithm, DR-PO. Chapter 7 introduces RL for a novel class of generative models called Consistency Models. Here we propose RLCM, which does guided text-to-image generation with consistency models. Finally, Chapter 8 concludes the dissertation with future directions that build upon the ideas presented here.
Part III: Appendix contains multiple missing proofs and implementation details that were kept out of the main text for clarity.

1.4 Bibliographical Remarks

This thesis contains works for which this author was a key contributor. Chapter 2 is based on joint work with Rahul Kidambi and Wen Sun (Kidambi et al., 2021). Chapter 3 is based on joint work with Masatoshi Uehara, Dhruv Sreenivas, Rahul Kidambi, and Wen Sun (Chang et al., 2021). Chapter 4 is based on joint work with Dhruv Sreenivas, Wendy Huang, Kianté Brantley, and Wen Sun (Chang et al., 2024a). Chapter 5 is based on joint work with Kianté Brantley, Rajkumar Ramamurthy, Dipendra Misra, and Wen Sun (Chang et al., 2023). Chapter 6 is based on joint work with Wenhao Zhan, Kianté Brantley, Dipendra Misra, Jason D Lee, and Wen Sun (Chang et al., 2024b). Finally, Chapter 7 is based on joint work with Owen Oertell, Yiyi Zhang, Kianté Brantley, and Wen Sun (Oertell et al., 2024).

Part I
Imitation Learning

CHAPTER 2
MODEL-BASED IMITATION LEARNING FROM OBSERVATION ALONE

This chapter studies Imitation Learning from Observations alone (ILFO), where the learner is presented with expert demonstrations that consist only of states visited by an expert (without access to actions taken by the expert). We present a provably efficient model-based framework, MobILE, to solve the ILFO problem. MobILE involves carefully trading off strategic exploration against imitation; this is achieved by integrating the idea of optimism in the face of uncertainty into the distribution matching imitation learning (IL) framework. We provide a unified analysis for MobILE and demonstrate that MobILE enjoys strong performance guarantees for classes of MDP dynamics that satisfy certain well studied notions of structural complexity. We also show that the ILFO problem is strictly harder than the standard IL problem by presenting an exponential sample complexity separation between IL and ILFO.
We complement these theoretical results with experimental simulations on benchmark OpenAI Gym tasks that indicate the efficacy of MobILE. Code for implementing the MobILE framework is available at https://github.com/rahulkidambi/MobILE-NeurIPS2021.

2.1 Introduction

This chapter considers Imitation Learning from Observation Alone (ILFO). In ILFO, the learner is presented with sequences of states encountered by the expert, without access to the actions taken by the expert, meaning approaches based on a reduction to supervised learning (e.g., Behavior Cloning (BC) (Ross and Bagnell, 2010a), DAgger (Ross et al., 2011b)) are not applicable. ILFO is more general and has potential for applications where the learner and expert have different action spaces, such as sim-to-real (Song et al., 2020a; Desai et al., 2020).

Sun et al. (2019c) reduced the ILFO problem to a sequence of one-step distribution matching problems, which yields a non-stationary policy. This approach, however, is sample inefficient for longer horizon tasks since the algorithm does not effectively reuse previously collected samples when solving the current sub-problem. Another line of work considers model-based methods to infer the expert's actions with either an inverse dynamics (Torabi et al., 2018a) or a forward dynamics (Edwards et al., 2019) model; these recovered actions are then fed into an IL approach like BC to output the final policy. These works rely on stronger assumptions that are only satisfied for Markov Decision Processes (MDPs) with injective transition dynamics (Zhu et al., 2020); we return to this in the related works section.

We introduce MobILE (Model-based Imitation Learning and Exploring), a model-based framework for solving the ILFO problem. In contrast to existing model-based efforts, MobILE learns the forward transition dynamics model, a quantity that is well defined for any MDP.
Importantly, MobILE combines strategic exploration with imitation by interleaving a model learning step with a bonus-based, optimistic distribution matching step, a perspective that, to the best of our knowledge, has not been considered in Imitation Learning. MobILE has the ability to automatically trade off exploration and imitation. It simultaneously explores to collect data to refine the model and imitates the expert wherever the learned model is accurate and certain. At a high level, our theoretical results and experimental studies demonstrate that systematic exploration is beneficial for solving ILFO reliably and efficiently, and that optimism is both a theoretically sound and a practically effective approach for strategic exploration in ILFO (see Figure 2.1 for comparisons with other ILFO algorithms).

Figure 2.1: Expert performance normalized scores of ILFO algorithms averaged across 5 seeds in environments with discrete action spaces (Reacher-v2) and continuous action spaces (Hopper-v2 and Walker2d-v2).

This chapter extends the realm of partial information problems (e.g., Reinforcement Learning and Bandits) where optimism has been shown to be crucial in obtaining strong performance, both in theory (e.g., E3 (Kearns and Singh, 2002a), UCB (Auer et al., 2002)) and in practice (e.g., RND (Burda et al., 2018)). It shows that incorporating optimism into the min-max IL framework (Ziebart et al., 2008; Ho and Ermon, 2016b; Sun et al., 2019c) is beneficial for both the theoretical foundations and the empirical performance of ILFO.

Our Contributions: We present MobILE (Algorithm 2), a provably efficient, model-based framework for ILFO that offers competitive results in benchmark gym tasks. MobILE can be instantiated with various implementation choices owing to its modular design. In this chapter, we detail the following contributions:
1. MobILE combines model-based learning, optimism for exploration, and adversarial imitation learning.
MobILE achieves global optimality with near-optimal regret bounds for classes of MDP dynamics that satisfy well studied notions of complexity. The key idea of MobILE is to use optimism to trade off imitation and exploration.
2. We show an exponential sample complexity gap between ILFO and classic IL, where one has access to the expert's actions. This indicates that ILFO is fundamentally harder than IL. Our lower bound on ILFO also indicates that to achieve near-optimal regret, one needs to perform systematic exploration rather than random or no exploration, both of which incur sub-optimal regret.
3. We instantiate MobILE with a model ensemble of neural networks and a disagreement-based bonus. We present experimental results on benchmark OpenAI Gym tasks, indicating that MobILE matches or outperforms existing approaches. Ablation studies indicate that optimism helps improve performance in practice.

2.2 Related Works

Imitation Learning (IL) is considered through the lens of two types of approaches: (a) behavior cloning (BC) (Pomerleau, 1989), which casts IL as a reduction to supervised or full-information online learning (Ross and Bagnell, 2010a; Ross et al., 2011b), or (b) (adversarial) inverse RL (Ng and Russell, 2000; Abbeel and Ng, 2004; Ziebart et al., 2008; Finn et al., 2016b; Ho and Ermon, 2016b; Ke et al., 2019; Ghasemipour et al., 2020), which involves minimizing various distribution divergences to solve the IL problem, either with the transition dynamics known (e.g., Ziebart et al. (2008)) or unknown (e.g., Ho and Ermon (2016b)). MobILE does not assume knowledge of the transition dynamics, is model-based, and operates without access to the expert's actions.

Imitation Learning from Observation Alone (ILFO): Sun et al. (2019c) present a model-free approach, FAIL, that outputs a non-stationary policy by reducing the ILFO problem to a sequence of min-max problems, one per time-step.
While theoretically sound, this approach cannot share data across different time steps and thus is not data efficient for long horizon problems. Also, FAIL in theory only works for discrete actions. In contrast, MobILE learns a stationary policy using model-based approaches by reusing data across all time steps and extends to continuous action spaces. Another line of work (Torabi et al., 2018a; Edwards et al., 2019; Yang et al., 2019) relies on learning an estimate of the expert action, often through the use of an inverse dynamics model, $P^e(a \mid s, s')$. Unfortunately, an inverse dynamics model is not well defined in many benign problem instances. For instance, Zhu et al. (2020, Remark 1, Section 9.3) present an example showing that inverse dynamics is not well defined except when the MDP dynamics is injective (i.e., no two actions lead to the same next state from the current state; note that even deterministic transition dynamics do not imply injectivity of the MDP dynamics). Furthermore, ILPO (Edwards et al., 2019) applies to MDPs with deterministic transition dynamics and discrete actions. MobILE, on the other hand, learns the forward dynamics model, which is always unique and well defined for both deterministic and stochastic transitions and works with discrete and continuous actions. Another line of work in ILFO revolves around using hand-crafted cost functions that may rely on task-specific knowledge (Peng et al., 2018; Aytar et al., 2018; Schmeckpeper et al., 2020). The performance of policies output by these methods relies on the quality of the engineered cost functions. In contrast, MobILE does not require cost function engineering.

Model-Based RL has seen several advances (Sutton, 1990; Li and Todorov, 2004; Deisenroth and Rasmussen, 2011), including ones based on deep learning (e.g., Lampe and Riedmiller (2014); Gu et al. (2016); Luo et al. (2018); Janner et al. (2019); Lowrey et al. (2019); Wang et al. (2019)).
Given MobILE's modularity, these advances in model-based RL can be translated into improved algorithms for the ILFO problem. MobILE bears parallels to provably efficient model-based RL approaches, including E3 (Kearns and Singh, 2002b; Kakade et al., 2003b), R-MAX (Brafman and Tennenholtz, 2001), UCRL (Jaksch et al., 2010), UCBVI (Azar et al., 2017), Linear MDP (Yang and Wang, 2019), LC3 (Kakade et al., 2020a), and Witness rank (Sun et al., 2019a), which use optimism-based approaches to trade off exploration and exploitation. MobILE uses optimism to trade off exploration and imitation.

2.3 Setting

We consider an episodic finite-horizon MDP, $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, P^\star, H, c, s_0\}$, where $\mathcal{S}, \mathcal{A}$ are the state and action spaces, $P^\star : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the MDP's transition kernel, $H$ is the horizon, $s_0$ is a fixed initial state (note that our work generalizes to a distribution over initial states), and $c$ is the state-dependent cost function $c : \mathcal{S} \to [0, 1]$. Our result can be extended to the setting where $c : \mathcal{S} \times \mathcal{S} \to [0, 1]$, i.e., the ground truth cost $c(s, s')$ depends on state and next state pairs. For analytical simplicity, we focus on $c : \mathcal{S} \to [0, 1]$.¹

We denote by $d^\pi_P \in \Delta(\mathcal{S} \times \mathcal{A})$ the average state-action distribution of policy $\pi$ under the transition kernel $P$, i.e., $d^\pi_P(s, a) := \frac{1}{H} \sum_{t=1}^{H} \Pr(s_t = s, a_t = a \mid s_0, \pi, P)$, where $\Pr(s_t = s, a_t = a \mid s_0, \pi, P)$ is the probability of reaching $(s, a)$ at time step $t$ starting from $s_0$ by following $\pi$ under transition kernel $P$. We abuse notation and write $s \sim d^\pi_P$ to denote that a state $s$ is sampled from the state-wise distribution that marginalizes the action out of $d^\pi_P(s, a)$, i.e., $d^\pi_P(s) := \frac{1}{H} \sum_{t=1}^{H} \Pr(s_t = s \mid s_0, \pi, P)$.

¹ Without any additional assumptions, in ILFO, learning to optimize an action-dependent cost $c(s, a)$ (or $c(s, a, s')$) is not possible. For example, if there are two sequences of actions that generate the same sequence of states, without seeing the expert's preference over actions, we do not know which actions to commit to.
For a given cost function $f : \mathcal{S} \to [0, 1]$, $V^\pi_{P; f}$ denotes the expected total cost of $\pi$ under transition $P$ and cost function $f$. Similar to the IL setting, in ILFO the ground truth cost $c$ is unknown. Instead, we can query the expert, denoted as $\pi^e : \mathcal{S} \to \Delta(\mathcal{A})$. Note that the expert $\pi^e$ could be stochastic and does not have to be the optimal policy. The expert, when queried, provides state-only demonstrations $\tau = \{s_0, s_1, \ldots, s_H\}$, where $s_{t+1} \sim P^\star(\cdot \mid s_t, a_t)$ and $a_t \sim \pi^e(\cdot \mid s_t)$.

The goal is to leverage the expert's state-wise demonstrations to learn a policy $\pi$ that performs as well as $\pi^e$ in terms of optimizing the ground truth cost $c$, with sample complexity polynomial in problem parameters such as the horizon, the number of expert samples and online samples, and the underlying MDP's complexity measures (see Section 2.5 for precise examples). We track the progress of any (randomized) algorithm by measuring the (expected) regret incurred by a policy $\pi$, defined as $\mathbb{E}[V^\pi] - V^{\pi^*}$, as a function of the number of online interactions used by the algorithm to compute $\pi$.

2.3.1 Function Approximation Setup

Since the ground truth cost $c$ is unknown, we use a function class (i.e., discriminators) $\mathcal{F} \subset \mathcal{S} \to [0, 1]$ to define the costs that can then be used by a planning algorithm (e.g., NPG (Kakade, 2001b)) for distribution matching with expert states. If the ground truth $c$ depends on both the state and next state, $(s, s')$, we use discriminators $\mathcal{F} \subset \mathcal{S} \times \mathcal{S} \to [0, 1]$. Furthermore, we use a model class $\mathcal{P} \subset \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ to capture the ground truth transition $P^\star$. For the theoretical results in this chapter, we assume realizability:

Assumption 1. Assume $\mathcal{F}$ and $\mathcal{P}$ capture the ground truth cost and transition, respectively, i.e., $c \in \mathcal{F}$, $P^\star \in \mathcal{P}$.

We will use an integral probability metric (IPM) with $\mathcal{F}$ as our divergence measure.
Note that if $c \in \mathcal{F}$ and $c : \mathcal{S} \to [0, 1]$, then the IPM, defined as $\max_{f \in \mathcal{F}} \mathbb{E}_{s \sim d^\pi} f(s) - \mathbb{E}_{s \sim d^{\pi^e}} f(s)$, directly upper bounds the sub-optimality gap $V^\pi - V^{\pi^e}$ (up to a factor of $H$, since $d^\pi$ averages over time steps), where $V^\pi$ is the expected total cost of $\pi$ under cost function $c$. This justifies why minimizing the IPM between two state distributions suffices (Ho and Ermon, 2016b; Sun et al., 2019c). Similarly, if $c$ depends on $s, s'$, we can simply minimize the IPM between two state-next state distributions, i.e., $\max_{f} \mathbb{E}_{s, s' \sim d^\pi} f(s, s') - \mathbb{E}_{s, s' \sim d^{\pi^e}} f(s, s')$, where discriminators now take $(s, s')$ as input.²

To permit generalization, we require $\mathcal{P}$ to have bounded complexity. For analytical simplicity, we assume $\mathcal{F}$ is discrete (but exponentially large), and we require the sample complexity of any PAC algorithm to scale polynomially with respect to its complexity $\ln(|\mathcal{F}|)$. The $\ln |\mathcal{F}|$ complexity can be replaced by conventional bounded complexity measures such as Rademacher complexity and covering number for continuous $\mathcal{F}$ (e.g., $\mathcal{F}$ being a Reproducing Kernel Hilbert Space).

² We slightly abuse notation here and denote by $d^\pi$ the average state-next state distribution of $\pi$, i.e., $d^\pi(s, s') := d^\pi(s) \int_a \pi(a \mid s)\, P^\star(s' \mid s, a)\, da$.

2.4 Algorithm

We introduce MobILE (Algorithm 2) for the ILFO problem. MobILE uses (a) a function class $\mathcal{F}$ for Integral Probability Metric (IPM) based distribution matching, (b) a transition dynamics model class $\mathcal{P}$ for model learning, (c) a bonus parameterization $\mathcal{B}$ for exploration, and (d) a policy class $\Pi$ for policy optimization. At every iteration, MobILE (in Algorithm 2) performs the following steps:

Algorithm 2 MobILE: The framework of Model-based Imitation Learning and Exploring for ILFO
1: Require: IPM class $\mathcal{F}$, dynamics model class $\mathcal{P}$, policy class $\Pi$, bonus function class $\mathcal{B}$, expert dataset $\mathcal{D}_e \equiv \{s^e_i\}_{i=1}^{N}$.
2: Initialize policy $\pi_0 \in \Pi$, replay buffer $\mathcal{D}_{-1} = \emptyset$.
3: for $t = 0, \ldots, T - 1$ do
4:   Execute $\pi_t$ in true environment $P^\star$ to get samples $\tau_t = \{s_k, a_k\}_{k=0}^{H-1} \cup s_H$. Append to replay buffer $\mathcal{D}_t = \mathcal{D}_{t-1} \cup \tau_t$.
5:   Update model and bonus: $\hat{P}_{t+1} : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ and $b_{t+1} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^+$ using buffer $\mathcal{D}_t$.
6:   Optimistic model-based min-max IL: obtain $\pi_{t+1}$ by solving Equation (2.1) with $\hat{P}_{t+1}, b_{t+1}, \mathcal{D}_e$.
7: end for
8: Return $\pi_T$.

1. Dynamics Model Learning: execute the policy in the environment online to obtain state-action-next state $(s, a, s')$ triples, which are appended to the buffer $\mathcal{D}$. Fit a transition model $\hat{P}$ on $\mathcal{D}$.
2. Bonus Design: design a bonus to incentivize exploration where the learnt dynamics model is uncertain, i.e., the bonus $b(s, a)$ is large at state-action pairs where $\hat{P}(\cdot \mid s, a)$ is uncertain in terms of estimating $P^\star(\cdot \mid s, a)$, while $b(s, a)$ is small where $\hat{P}(\cdot \mid s, a)$ is certain.
3. Imitation-Exploration tradeoff: Given discriminators $\mathcal{F}$, model $\hat{P}$, bonus $b$, and expert dataset $\mathcal{D}_e$, perform distribution matching by solving the model-based IPM objective with bonus:

$$\pi_{t+1} \leftarrow \arg\min_{\pi \in \Pi} \max_{f \in \mathcal{F}} L(\pi, f; \hat{P}, b, \mathcal{D}_e) := \mathbb{E}_{(s, a) \sim d^\pi_{\hat{P}}}\left[ f(s) - b(s, a) \right] - \mathbb{E}_{s \sim \mathcal{D}_e}\left[ f(s) \right], \qquad (2.1)$$

where $\mathbb{E}_{s \sim \mathcal{D}_e} f(s) := \sum_{s \in \mathcal{D}_e} f(s) / |\mathcal{D}_e|$.

Intuitively, the bonus cancels out the discriminator's power in parts of the state space where the dynamics model $\hat{P}$ is not accurate, thus offering freedom for MobILE to explore. We first explain MobILE's components and then discuss MobILE's key property, which is to trade off exploration and imitation.

2.4.1 Components of MobILE

Dynamics model learning: For the model fitting step in line 5, we assume that we get a calibrated model in the sense that $\|\hat{P}_t(\cdot \mid s, a) - P^\star(\cdot \mid s, a)\|_1 \le \sigma_t(s, a)$ for all $s, a$, for some uncertainty measure $\sigma_t(s, a)$, similar to model-based RL works (e.g., Curi et al. (2020)). We discuss ways to estimate $\sigma_t(s, a)$ in the bonus estimation below. There are many examples (discussed in Section 2.5) that permit efficient estimation of these quantities, including tabular MDPs, the Kernelized Nonlinear Regulator, and nonparametric models such as Gaussian Processes.
Consider a general function class $\mathcal{G} \subset \mathcal{S} \times \mathcal{A} \to \mathcal{S}$. One can learn $\hat{g}_t$ by solving a regression problem, i.e.,

$$\hat{g}_t = \operatorname*{argmin}_{g \in \mathcal{G}} \sum_{s, a, s' \in \mathcal{D}_t} \|g(s, a) - s'\|_2^2, \qquad (2.2)$$

and setting $\hat{P}_t(\cdot \mid s, a) = \mathcal{N}\left( \hat{g}_t(s, a), \sigma^2 I \right)$, where $\sigma$ is the standard deviation of the error induced by $\hat{g}_t$. In practice, such parameterizations have been employed in several RL settings with $\mathcal{G}$ being a multi-layer perceptron (MLP) based function class (e.g., Rajeswaran et al. (2020)). In Section 2.5, we also connect this with prior works in the provable model-based RL literature.

Bonus: We use bonuses to incentivize the policy to efficiently explore unknown parts of the state space for improved model learning (and hence better distribution matching). With the uncertainty measure $\sigma_t(s, a)$ obtained from calibrated model fitting, we can simply set the bonus $b_t(s, a) = O(H \sigma_t(s, a))$. How do we obtain $\sigma_t(s, a)$ in practice? For a general class $\mathcal{G}$, given the least squares solution $\hat{g}_t$, we can define a version space $\mathcal{G}_t$ as

$$\mathcal{G}_t = \left\{ g \in \mathcal{G} : \sum_{i=0}^{t-1} \sum_{h=0}^{H-1} \|g(s^i_h, a^i_h) - \hat{g}_t(s^i_h, a^i_h)\|_2^2 \le z_t \right\},$$

with $z_t$ being a hyperparameter. The version space $\mathcal{G}_t$ is an ensemble of functions $g \in \mathcal{G}$ whose training error on $\mathcal{D}_t$ is almost as small as that of the least squares solution $\hat{g}_t$. In other words, the version space $\mathcal{G}_t$ contains functions that agree on the training set $\mathcal{D}_t$. The uncertainty measure at $(s, a)$ is then the maximum disagreement among models in $\mathcal{G}_t$, with $\sigma_t(s, a) \propto \sup_{f_1, f_2 \in \mathcal{G}_t} \|f_1(s, a) - f_2(s, a)\|_2$. Since all $g \in \mathcal{G}_t$ agree on $\mathcal{D}_t$, a large $\sigma_t(s, a)$ indicates that $(s, a)$ is novel. See Example 3 for more theoretical details.

Empirically, disagreement among an ensemble (Osband et al., 2018a; Azizzadenesheli et al., 2018a; Burda et al., 2019; Pathak et al., 2019a; Lowrey et al., 2019) is used for designing bonuses that incentivize exploration. We use a neural network ensemble, where each model is trained on $\mathcal{D}_t$ (via SGD on the squared loss in Equation (2.2)) with a different initialization.
This approximates the version space $\mathcal{G}_t$, and the bonus is set as a function of the maximum disagreement among the ensemble's predictions.

Optimistic model-based min-max IL: For model-based imitation (line 6), MobILE takes the current model $\hat{P}_t$ and the discriminators $\mathcal{F}$ as inputs and performs policy search to minimize the divergence defined by $\hat{P}_t$ and $\mathcal{F}$:

$$d_t(\pi, \pi^e) := \max_{f \in \mathcal{F}} \left[ \mathbb{E}_{s, a \sim d^\pi_{\hat{P}_t}}\left( f(s) - b_t(s, a) \right) - \mathbb{E}_{s \sim d^{\pi^e}} f(s) \right].$$

Note that, for a fixed $\pi$, the $\arg\max_{f \in \mathcal{F}}$ is identical with or without the bonus term, since $\mathbb{E}_{s, a \sim d^\pi_{\hat{P}_t}} b_t(s, a)$ is independent of $f$. In our implementation, we use the Maximum Mean Discrepancy (MMD) with a Radial Basis Function (RBF) kernel to model discriminators $\mathcal{F}$.³ We compute $\operatorname{argmin}_\pi d_t(\pi, \pi^e)$ by iteratively (1) computing the argmax discriminator $f$ given the current $\pi$, and (2) using policy gradient methods (e.g., TRPO) to update $\pi$ inside $\hat{P}_t$ with $f - b_t$ as the cost. Specifically, to find $\pi_t$ (line 6), we iterate between the following two steps:
1. Cost update: $\hat{f} = \operatorname*{argmax}_{f \in \mathcal{F}} \mathbb{E}_{s \sim d^{\hat{\pi}}_{\hat{P}_t}} f(s) - \mathbb{E}_{s \sim \mathcal{D}_e} f(s)$
2. PG step: $\hat{\pi} = \hat{\pi} - \eta \cdot \nabla_\pi V^{\hat{\pi}}_{\hat{P}_t, \hat{f} - b_t}$,
where the PG step uses the learnt dynamics model $\hat{P}_t$ and the optimistic IPM cost $\hat{f}(s) - b_t(s, a)$. Note that for MMD, the cost update step has a closed-form solution.

³ For MMD with kernel $k$, $\mathcal{F} = \{ w^\top \phi(s, a) \mid \|w\|_2 \le 1 \}$ where $\phi$ satisfies $\langle \phi(s, a), \phi(s', a') \rangle = k((s, a), (s', a'))$.

2.4.2 Exploration And Imitation Tradeoff

We note that MobILE performs an automatic trade-off between exploration and imitation. More specifically, the bonus is designed such that it has high values in parts of the state space that have not been visited and low values in parts that have been frequently visited by the sequence of learned policies so far.
Thus, by incorporating the bonus into the discriminator $f\in\mathcal{F}$ (e.g., $\tilde{f}(s,a) = f(s) - b_t(s,a)$), we diminish the power of the discriminator $f$ in novel regions of the state-action space, which relaxes the state-matching constraint there (as the bonus cancels the penalty from the discriminators) so that exploration is encouraged. For well-explored states, we force the learner's states to match the expert's using the full power of the discriminators. Our work uses optimism (via coupling bonus and discriminators) to carefully balance imitation and exploration.

2.5 Analysis

This section presents a general theorem for MobILE that uses the notion of information gain (Srinivas et al., 2009), and then specializes this result to common classes of stochastic MDPs such as discrete (tabular) MDPs, the Kernelized Nonlinear Regulator (Kakade et al., 2020c), and general function classes with bounded Eluder dimension (Russo and Roy, 2013). Recall that Algorithm 2 generates one state-action trajectory $\tau^t := \{s_h^t, a_h^t\}_{h=0}^{H}$ at iteration $t$ and estimates the model $\hat{P}_t$ based on $\mathcal{D}_t = \tau^0,\dots,\tau^{t-1}$. We present our theorem under the assumption that model fitting gives us a model $\hat{P}$ and a confidence interval of the model's prediction.

Assumption 2 (Calibrated Model). For all iterations $t\in\mathbb{N}$, with probability $1-\delta$, we have a model $\hat{P}_t$ and its associated uncertainty measure $\sigma_t : \mathcal{S}\times\mathcal{A} \mapsto \mathbb{R}^+$, such that for all $(s,a)\in\mathcal{S}\times\mathcal{A}$ (footnote 4),

$$\big\|\hat{P}_t(\cdot|s,a) - P^\star(\cdot|s,a)\big\|_1 \le \min\{\sigma_t(s,a), 2\}.$$

Assumption 2 has featured in prior works (e.g., Curi et al. (2020)) to prove regret bounds in model-based RL. Below we demonstrate examples that satisfy the above assumption.

Example 1 (Discrete MDPs). Given $\mathcal{D}_t$, denote $N(s,a)$ as the number of times $(s,a)$ appears in $\mathcal{D}_t$, and $N(s,a,s')$ as the number of times $(s,a,s')$ appears in $\mathcal{D}_t$. We can set $\hat{P}_t(s'|s,a) = N(s,a,s')/N(s,a)$ for all $s,a,s'$, and $\sigma_t(s,a) = \tilde{O}\big(\sqrt{S/N(s,a)}\big)$.

Example 2 (KNRs Kakade et al. (2020c)).
For KNR, we have $P^\star(\cdot|s,a) = \mathcal{N}(W^\star\phi(s,a), \sigma^2 I)$ where the feature mapping $\phi(s,a)\in\mathbb{R}^d$ and $\|\phi(s,a)\|_2 \le 1$ for all $s,a$ (footnote 5). We can learn $\hat{P}_t$ via kernel ridge regression, i.e., $\hat{g}_t(s,a) = \hat{W}_t\phi(s,a)$ where

$$\hat{W}_t = \arg\min_{W} \sum_{(s,a,s')\in\mathcal{D}_t} \|W\phi(s,a) - s'\|_2^2 + \lambda\|W\|_F^2,$$

and $\|\cdot\|_F$ is the Frobenius norm. The uncertainty measure is $\sigma_t(s,a) = \frac{\beta_t}{\sigma}\|\phi(s,a)\|_{\Sigma_t^{-1}}$, with

$$\beta_t = \Big\{2\lambda\|W^\star\|_2^2 + 8\sigma^2\cdot\big[d_s\ln(5) + 2\ln(t^2/\delta) + \ln(4) + \ln\big(\det(\Sigma_t)/\det(\lambda I)\big)\big]\Big\}^{1/2}$$

and $\Sigma_t = \sum_{k=0}^{t-1}\sum_{h=0}^{H-1} \phi(s_h^k,a_h^k)\phi(s_h^k,a_h^k)^\top + \lambda I$ with $\lambda > 0$. See Proposition 34 for more details. Similar to RKHS, Gaussian processes (GPs) offer a calibrated model (Srinivas et al., 2009). Note that GPs offer similar regret bounds as RKHS, so we do not discuss GPs and instead refer readers to Curi et al. (2020).

Footnote 4: The uncertainty measure $\sigma_t(s,a)$ will depend on the input failure probability $\delta$, which we drop here for notational simplicity. When we introduce specific examples, we will be explicit about the dependence on the failure probability $\delta$, which usually is in the order of $\ln(1/\delta)$.

Footnote 5: The covariance matrix can be generalized to any PSD matrix with bounded condition number.

Example 3 (General class $\mathcal{G}$). In this case, assume we have $P^\star(\cdot|s,a) = \mathcal{N}(g^\star(s,a), \sigma^2 I)$ with $g^\star\in\mathcal{G}$. Assume $\mathcal{G}$ is discrete (but could be exponentially large with complexity measure $\ln(|\mathcal{G}|)$), and $\sup_{g\in\mathcal{G},s,a}\|g(s,a)\|_2 \le G \in \mathbb{R}^+$. Suppose the model learning step is done by least squares:

$$\hat{g}_t = \arg\min_{g\in\mathcal{G}} \sum_{k=0}^{t-1}\sum_{h=0}^{H-1} \big\|g(s_h^k,a_h^k) - s_{h+1}^k\big\|_2^2.$$

Compute a version space

$$\mathcal{G}_t = \Big\{ g\in\mathcal{G} : \sum_{k=0}^{t-1}\sum_{h=0}^{H-1} \big\|g(s_h^k,a_h^k) - \hat{g}_t(s_h^k,a_h^k)\big\|_2^2 \le z_t \Big\},$$

where $z_t = 2\sigma^2 G^2\ln(2t^2|\mathcal{G}|/\delta)$, and use this for uncertainty computation. In particular, set the uncertainty $\sigma_t(s,a) = \frac{1}{\sigma}\max_{g_1,g_2\in\mathcal{G}_t}\|g_1(s,a) - g_2(s,a)\|_2$, i.e., the maximum disagreement between any two functions in the version space $\mathcal{G}_t$. Refer to Proposition 36 for more details.
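The version-space construction of Example 3 can be sketched for a small finite class. This is only an illustration under stated assumptions: the candidate class `make_model(d, c)` (dynamics $g(s,a) = d\,s + c\,a$), the toy data with $a = 0$ throughout, and the hand-picked threshold `z` are all hypothetical, and `z` is not the theoretical value $z_t = 2\sigma^2 G^2\ln(2t^2|\mathcal{G}|/\delta)$.

```python
import numpy as np


def make_model(d, c):
    """Candidate dynamics g(s, a) = d * s + c * a, returning an (n, 1) array of next states."""
    return lambda X: (d * X[:, 0] + c * X[:, 1])[:, None]


def least_squares_fit(G, X, Y):
    """Pick the member of the finite class G with the smallest squared training error."""
    losses = [np.sum((g(X) - Y) ** 2) for g in G]
    return G[int(np.argmin(losses))]


def version_space(G, g_hat, X, z):
    """Keep every g whose predictions stay within squared distance z of g_hat on the data."""
    return [g for g in G if np.sum((g(X) - g_hat(X)) ** 2) <= z]


def uncertainty(G_t, x, sigma):
    """(1 / sigma) * max over g1, g2 in G_t of ||g1(s, a) - g2(s, a)||_2."""
    preds = np.stack([g(x[None, :])[0] for g in G_t])      # shape (m, d_s)
    gaps = np.linalg.norm(preds[:, None, :] - preds[None, :, :], axis=-1)
    return gaps.max() / sigma


def demo(seed=0, n=50, sigma=0.1, z=0.5):
    """Assumed toy data: noisy samples of s' = s + a where the action is always 0."""
    rng = np.random.default_rng(seed)
    s = rng.uniform(-1.0, 1.0, size=n)
    X = np.stack([s, np.zeros(n)], axis=1)
    Y = (s + sigma * rng.normal(size=n))[:, None]
    G = [make_model(d, c) for d in (0.5, 1.0, 1.5) for c in (0.0, 1.0, 2.0)]
    g_hat = least_squares_fit(G, X, Y)
    G_t = version_space(G, g_hat, X, z)
    covered = uncertainty(G_t, np.array([0.5, 0.0]), sigma)
    novel = uncertainty(G_t, np.array([0.5, 1.0]), sigma)
    return len(G_t), covered, novel
```

Here the data pins down the state coefficient but says nothing about the action coefficient, so the version space keeps all candidates that share the fitted state coefficient; their predictions coincide at a covered input and spread apart as soon as the action is nonzero.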
The maximum disagreement above motivates our practical implementation, where we use an ensemble of neural networks to approximate the version space and use the maximum disagreement among the models' predictions as the bonus. We refer readers to Section 2.7 for more details.

2.5.1 Regret Bound

We bound regret with the quantity named Information Gain $\mathcal{I}$ (up to some constant scaling factor) (Srinivas et al., 2009):

$$\mathcal{I}_T := \max_{\mathrm{Alg}} \mathbb{E}_{\mathrm{Alg}}\Big[\sum_{t=0}^{T-1}\sum_{h=0}^{H-1} \min\big\{\sigma_t^2(s_h^t,a_h^t), 1\big\}\Big], \qquad (2.3)$$

where Alg is any adaptive algorithm (thus including Algorithm 2) that maps from the history before iteration $t$ to some policy $\pi_t\in\Pi$. After the main theorem, we give concrete examples for $\mathcal{I}_T$ where we show that $\mathcal{I}_T$ has an extremely mild growth rate with respect to $T$ (i.e., logarithmic). Denote $V^\pi$ as the expected total cost of $\pi$ under the true cost function $c$ and the real dynamics $P^\star$.

Theorem 3 (Main result). Assume model learning is calibrated (i.e., Assumption 2 holds for all $t$) and Assumption 1 holds. In Algorithm 2, set the bonus $b_t(s,a) := H\min\{\sigma_t(s,a), 2\}$. There exists a set of parameters such that after running Algorithm 2 for $T$ iterations, we have:

$$\mathbb{E}\Big[\min_{t\in[0,\dots,T-1]} V^{\pi_t} - V^{\pi^e}\Big] \le O\Bigg(\frac{H^{2.5}\sqrt{\mathcal{I}_T}}{\sqrt{T}} + H\sqrt{\frac{\ln(TH|\mathcal{F}|)}{N}}\Bigg).$$

Appendix A.1 contains the proof of Theorem 3. This theorem indicates that as long as $\mathcal{I}_T$ grows sublinearly, i.e., $\mathcal{I}_T = o(T)$, we find a policy that is at least as good as the expert policy when $T$ and $N$ approach infinity. For any discrete MDP, KNR (Kakade et al., 2020c), Gaussian process models (Srinivas et al., 2009), and general $\mathcal{G}$ with bounded Eluder dimension (Russo and Van Roy (2014); Osband and Van Roy (2014)), we can show that the growth rate of $\mathcal{I}_T$ with respect to $T$ is mild.

Corollary 4 (Discrete MDP). For discrete MDPs, $\mathcal{I}_T = \tilde{O}(HS^2A)$ where $S = |\mathcal{S}|$, $A = |\mathcal{A}|$. Thus:

$$\mathbb{E}\Big[\min_{t\in[0,\dots,T-1]} V^{\pi_t} - V^{\pi^e}\Big] = \tilde{O}\Bigg(\frac{H^{3}S\sqrt{A}}{\sqrt{T}} + H\sqrt{\frac{\ln(|\mathcal{F}|)}{N}}\Bigg).$$
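To make the step from Theorem 3 to Corollary 4 explicit, the short calculation below substitutes the discrete-MDP information gain into the first term of the bound. It is a sketch that tracks only the leading factors and absorbs logarithmic terms into $\tilde{O}$, as the corollary does.

```latex
\begin{align*}
\frac{H^{2.5}\sqrt{\mathcal{I}_T}}{\sqrt{T}}
  = \frac{H^{2.5}\sqrt{\tilde{O}(H S^{2} A)}}{\sqrt{T}}
  = \tilde{O}\!\left(\frac{H^{2.5}\cdot H^{0.5}\, S\sqrt{A}}{\sqrt{T}}\right)
  = \tilde{O}\!\left(\frac{H^{3} S\sqrt{A}}{\sqrt{T}}\right).
\end{align*}
```

The same pattern explains the sublinearity remark: whenever $\mathcal{I}_T = o(T)$, the ratio $\sqrt{\mathcal{I}_T}/\sqrt{T}$ vanishes as $T \to \infty$, leaving only the statistical term in $N$.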
Note that Corollary 4 (proof in Appendix A.1.1) holds for any MDP (not just injective MDPs) and any stochastic expert policy. The dependence on $A$ and $T$ is tight (see the lower bound in Section 2.5.2). Now we specialize Theorem 3 to continuous MDPs below.

Corollary 5 (KNRs (Example 2)). For simplicity, consider the finite-dimensional setting $\phi : \mathcal{S}\times\mathcal{A} \mapsto \mathbb{R}^d$. We can show that $\mathcal{I}_T = \tilde{O}\big(Hd + Hdd_s + Hd^2\big)$ (see Proposition 35 for details), where $d$ is the dimension of the feature $\phi(s,a)$ and $d_s$ is the dimension of the state space. Thus, we have (footnote 6)

$$\mathbb{E}\Big[\min_{t\in[0,\dots,T-1]} V^{\pi_t} - V^{\pi^e}\Big] = \tilde{O}\Bigg(\frac{H^{3}\sqrt{dd_s + d^2}}{\sqrt{T}} + H\sqrt{\frac{\ln(|\mathcal{F}|)}{N}}\Bigg).$$

Corollary 6 (General $\mathcal{G}$ with bounded Eluder dimension (Example 3)). For general $\mathcal{G}$, assume that $\mathcal{G}$ has Eluder dimension $d_E(\epsilon)$ (Definition 3 in Osband and Van Roy (2014)). Denote $d_E = d_E(1/TH)$. The information gain is upper bounded as $\mathcal{I}_T = O\big(Hd_E + d_E\ln(T^3H|\mathcal{G}|)\ln(TH)\big)$ (see Proposition 38). Thus,

$$\mathbb{E}\Big[\min_{t\in[0,\dots,T-1]} V^{\pi_t} - V^{\pi^e}\Big] = \tilde{O}\Bigg(\frac{H^{3}\sqrt{d_E\ln(TH|\mathcal{G}|)}}{\sqrt{T}} + H\sqrt{\frac{\ln(|\mathcal{F}|)}{N}}\Bigg).$$

Thus, as long as $\mathcal{G}$ has bounded complexity in terms of the Eluder dimension (Russo and Van Roy (2014); Osband and Van Roy (2014)), MobILE with maximum-disagreement-based optimism leads to near-optimal guarantees.

2.5.2 Exploration in ILFO and the Exponential Gap between IL and ILFO

To show the benefit of strategic exploration over random exploration in ILFO, we present a novel reduction of the ILFO problem to a bandit optimization problem, for which strategic exploration is known to be necessary (Bubeck and Cesa-Bianchi, 2012) for optimal bounds while random exploration is suboptimal; this reduction indicates the benefit of strategic exploration for solving ILFO efficiently. This reduction also demonstrates that there exists an exponential gap in sample complexity between ILFO and classic IL that has access to expert actions. We leave the details of the reduction framework to Appendix A.1.4.
The reduction allows us to derive the following lower bound for any ILFO algorithm.

Footnote 6: We use $\tilde{O}$ to suppress log terms except $\ln(|\mathcal{G}|)$ and $\ln(|\mathcal{F}|)$, which represent the complexity of $\mathcal{F}$ and $\mathcal{G}$.

Theorem 7. There exists an MDP with number of actions $A \ge 2$, such that even with infinitely many expert data, any ILFO algorithm must incur expected cumulative regret $\Omega(\sqrt{AT})$.

Specifically, we rely on the following reduction, where solving ILFO, even with infinite expert data, is at least as hard as solving an MAB problem with the known optimal arm's mean reward, which itself incurs the same worst-case $\sqrt{AT}$ cumulative regret bound as the classic MAB setting. For MAB, it is known that random exploration such as $\epsilon$-greedy will incur suboptimal regret $O(T^{2/3})$. Thus, to achieve the optimal $\sqrt{T}$ rate, one needs to leverage strategic exploration (e.g., optimism).

Methods such as BC for IL have sample complexity that scales as $\mathrm{poly}\ln(A)$; e.g., see (Agarwal et al., 2019, Theorem 14.3, Chapter 14), which shows that for tabular MDPs, BC learns a policy whose performance is $O\big(H^2\sqrt{S\ln(A)/N}\big)$ away from the expert's performance (here $S$ is the number of states in the tabular MDP). Similarly, in the interactive IL setting, DAgger (Ross et al., 2011b) can also achieve $\mathrm{poly}\ln(A)$ dependence in sample complexity. The exponential gap in the sample complexity dependence on $A$ between IL and ILFO formalizes the additional difficulty encountered by learning algorithms in ILFO.

2.6 Practical Instantiation of MobILE

We present a brief practical instantiation of MobILE's components, with details in Appendix Section A.3.
Dynamics model learning: We employ Gaussian dynamics models parameterized by an MLP (Rajeswaran et al., 2020; Kidambi et al., 2020a), i.e., $\hat{P}(s,a) := \mathcal{N}(h_\theta(s,a), \sigma^2 I)$, where $h_\theta(s,a) = s + \sigma_{\Delta_s}\cdot\mathrm{MLP}_\theta(s_c, a_c)$, $\theta$ are the MLP's trainable parameters, and $s_c = (s-\mu_s)/\sigma_s$, $a_c = (a-\mu_a)/\sigma_a$, with $\mu_s, \mu_a$ (and $\sigma_s, \sigma_a$) being the means (and standard deviations) of states and actions in the replay buffer $\mathcal{D}$. Next, for $(s,a,s')\in\mathcal{D}$, $\Delta_s = s' - s$ and $\sigma_{\Delta_s}$ is the standard deviation of the state differences $\Delta_s$ in $\mathcal{D}$. We use SGD with momentum (Sutskever et al., 2013) for training the parameters $\theta$ of the MLP.

Discriminator parameterization: We utilize MMD as our choice of IPM and define the discriminator as $f(s) = w^\top\psi(s)$, where $\psi(s)$ are Random Fourier Features (Rahimi and Recht, 2008a).

Bonus parameterization: We utilize the discrepancy between predictions of a pair of dynamics models $h_{\theta_1}(s,a)$ and $h_{\theta_2}(s,a)$ for designing the bonus. Empirically, we found that using more than two models in the ensemble offered little to no improvement. Denote the disagreement at any $(s,a)$ as $\delta(s,a) = \|h_{\theta_1}(s,a) - h_{\theta_2}(s,a)\|_2$, and let $\delta_{\mathcal{D}} = \max_{(s,a)\sim\mathcal{D}}\delta(s,a)$ be the maximum discrepancy over the replay buffer $\mathcal{D}$. We set the bonus as $b(s,a) = \lambda\cdot\min(\delta(s,a)/\delta_{\mathcal{D}}, 1)$, where $\lambda > 0$ is a tunable parameter.

PG oracle: We use TRPO (Schulman et al., 2015b) to perform incremental policy optimization inside the learned model.

2.7 Experiments

This section seeks to answer the following questions: (1) How does MobILE compare against other benchmark algorithms? (2) How does optimism impact sample efficiency and final performance? (3) How does increasing the number of expert samples impact the quality of the policy outputted by MobILE?

We consider tasks from OpenAI Gym (Brockman et al., 2016a) simulated with MuJoCo (Todorov et al., 2012a): Cartpole-v1, Reacher-v2, Swimmer-v2, Hopper-v2 and Walker2d-v2. We train an expert for each task using TRPO (Schulman et al., 2015b) until we obtain an expert policy of average value 460, −10, 38, 3000, 2000, respectively. We set up Swimmer-v2, Hopper-v2, and Walker2d-v2 similar to prior model-based RL works (Kurutach et al., 2018; Nagabandi et al., 2018; Luo et al., 2018; Rajeswaran et al., 2020; Kidambi et al., 2020a).

[Figure 2.2 panels: learning curves of Return (Value) against Online Samples for CartPole-v1 (10 traj.), Reacher-v2 (10 traj.), Swimmer-v2 (40 traj.), Hopper-v2 (10 traj.), and Walker2d-v2 (10 traj.), plus a bar plot of Normalized Score per environment. Legend: MobILE (Ours), BC, Expert, GAIL, BC-O, GAIFO, ILPO.]

Figure 2.2: Comparing MobILE (red) against BC (orange), BC-O (green), GAIL (purple), GAIFO (periwinkle), ILPO (green olive). The learning curves are obtained by averaging all algorithms over 5 seeds. MobILE outperforms BC-O, GAIL and matches BC's behavior despite MobILE not having access to expert actions. The bar plot (bottom-right) presents the best performing policy outputted by each algorithm averaged across 5 seeds for each algorithm. MobILE clearly outperforms BC-O, GAIFO, ILPO while matching the behavior of IL algorithms like BC/GAIL which use expert actions.

We compare MobILE against the following algorithms: Behavior Cloning (BC), GAIL (Ho and Ermon, 2016b), BC-O (Torabi et al., 2018a), ILPO (Edwards et al., 2019) (for environments with discrete actions), and GAIFO (Torabi et al., 2018b). Furthermore, recall that BC and GAIL utilize both expert states and actions, information that is not available for ILFO. This makes both BC and GAIL idealistic targets for comparing ILFO methods like MobILE against. As reported by Torabi et al.
(Torabi et al., 2018a), BC outperforms BC-O in all benchmark results. Moreover, our results indicate MobILE outperforms GAIL and GAIFO in terms of sample efficiency. With a reasonable amount of parameter tuning, BC serves as a very strong baseline and nearly solves deterministic MuJoCo environments. We use code released by the authors for BC-O and ILPO. For GAIL we use an open-source implementation (Hill et al., 2018), and for GAIFO, we modify the GAIL implementation as described by the authors. We present our results through (a) learning curves obtained by averaging the progress of the algorithm across 5 seeds, and (b) a bar plot showing expert normalized scores averaged across 5 seeds using the best performing policy obtained with each seed. Normalized score refers to the ratio of the policy's score over the expert score (so that the expert has a normalized score of 1). For Reacher-v2, since the expert policy has a negative score, we add a constant before normalization. More details can be found in Appendix A.3.

2.7.1 Benchmarking MobILE on MuJoCo suite

Figure 2.2 compares MobILE with BC, BC-O, GAIL, GAIFO and ILPO. MobILE consistently matches or exceeds BC/GAIL's performance despite BC/GAIL having access to actions taken by the expert and MobILE functioning without expert action information. MobILE also consistently improves upon the behavior of ILFO methods such as BC-O, ILPO, and GAIFO. We see that BC does remarkably well in these benchmarks owing to determinism in the transition dynamics; in the appendix, we consider a variant of the cartpole environment with stochastic dynamics. Our results suggest that BC struggles with stochasticity in the dynamics and fails to solve this task, while MobILE continues to reliably solve it. Also, note that we utilize 10 expert trajectories for all environments except Swimmer-v2, where we use 40; this is because, with 10 trajectories, all algorithms (including MobILE) present results with high variance on this task. We include a learning curve for Swimmer-v2 with 10 expert trajectories in the appendix. The bar plot in Figure 2.2 shows that within the sample budget shown in the learning curves, MobILE (being a model-based algorithm) presents superior performance in terms of matching the expert, thus indicating it is more sample efficient than GAIFO, GAIL (both being model-free methods), ILPO and BC-O.

[Figure 2.3 panels: learning curves of Return (Value) against # Online Samples for CartPole-v1 (10 traj.), Reacher-v2 (10 traj.), Swimmer-v2 (40 traj.), Hopper-v2 (10 traj.), and Walker2d-v2 (10 traj.). Legend: Expert, With optimism, No optimism.]

Figure 2.3: Learning curves obtained by running MobILE with (red) and without (green) optimism. Without optimism, the algorithm learns slowly or does not match the expert, whereas, with optimism, MobILE shows improved behavior by automatically trading off exploration and imitation.

2.7.2 Importance of the optimistic MDP construction

Figure 2.3 presents results obtained by running MobILE with and without optimism. In the absence of optimism, the algorithm either tends to be sample inefficient in achieving expert performance or completely fails to solve the problem. Note that without optimism, the algorithm is not explicitly incentivized to explore; it explores only implicitly, through the noise induced by sampling actions. This, however, is not sufficient to solve the problem efficiently. In contrast, MobILE with optimism presents improved behavior and, in most cases, solves the environments with fewer online interactions.
2.7.3 Varying Number of Expert Samples

Table 2.1: Expert normalized score and standard deviation of the policy outputted by MobILE when varying the number of expert trajectories as E1 and E2 (specific values given in parentheses).

Environment | E1                | E2                 | Expert
Cartpole-v1 | 1.07 ± 0.15 (5)   | 1.14 ± 0 (10)      | 1 ± 0.25
Reacher-v2  | 1.01 ± 0.05 (10)  | 0.997 ± 0.055 (20) | 1 ± 0.11
Swimmer-v2  | 1.54 ± 1.1 (10)   | 1.25 ± 0.15 (40)   | 1 ± 0.05
Hopper-v2   | 1.11 ± 0.064 (10) | 1.16 ± 0.03 (40)   | 1 ± 0.16
Walker2d-v2 | 0.975 ± 0.12 (10) | 0.94 ± 0.038 (50)  | 1 ± 0.25

Table 2.1 shows the impact of increasing the number of samples drawn from the expert policy for solving the ILFO problem. The main takeaway is that increasing the number of expert samples aids MobILE in reliably solving the problem (i.e., with lower variance).

2.8 Conclusion

MobILE is a model-based ILFO approach that is applicable to MDPs with stochastic dynamics and continuous action spaces. MobILE trades off exploration and imitation, and this perspective is shown to be important for solving the ILFO problem efficiently both in theory and in practice. Future work includes exploring other means for learning dynamics models, performing strategic exploration, and extending MobILE to problems with rich observation spaces (e.g., videos).

By not even needing the actions to imitate, ILFO algorithms allow learning algorithms to capitalize on large amounts of video data available online. Moreover, in ILFO, the learner is successful if it learns to imitate the expert. Any expert policy designed by bad actors can naturally lead to new policies that continue to imitate it and become a negative influence on society. With this perspective in mind, any expert policy must be thoroughly vetted in order to ensure ILFO algorithms including MobILE are employed in ways that benefit society.
CHAPTER 3
MODEL-BASED OFFLINE IMITATION LEARNING

This chapter studies offline Imitation Learning (IL), where an agent learns to imitate an expert demonstrator without additional online environment interactions. Instead, the learner is presented with a static offline dataset of state-action-next state transition triples from a potentially less proficient behavior policy. We introduce Model-based IL from Offline data (MILO): an algorithmic framework that utilizes the static dataset to solve the offline IL problem efficiently both in theory and in practice. In theory, even if the behavior policy is highly sub-optimal compared to the expert, we show that as long as the data from the behavior policy provides sufficient coverage on the expert state-action traces (and with no necessity for global coverage over the entire state-action space), MILO can provably combat the covariate shift issue in IL. Complementing our theoretical results, we also demonstrate that a practical implementation of our approach mitigates covariate shift on benchmark MuJoCo continuous control tasks. We demonstrate that with behavior policies whose performances are less than half of that of the expert, MILO still successfully imitates with an extremely low number of expert state-action pairs, while traditional offline IL methods such as behavior cloning (BC) fail completely. Source code is provided at https://github.com/jdchang1/milo.

3.1 Introduction

Covariate shift is a core issue in Imitation Learning (IL). Traditional IL methods like behavior cloning (BC) (Pomerleau, 1989), while simple, suffer from covariate shift, learning a policy that can make arbitrary mistakes in parts of the state space not covered by the expert dataset. This leads to compounding errors in the agent's performance (Ross and Bagnell, 2010b), hurting the generalization capabilities in practice.
Figure 3.1: (Left) Frames at timesteps 200, 400, 600, 800, and 1000 for Humanoid-v2 from policies trained with BC on 100 state-action pairs from the expert (blue), BC on 1M offline samples plus 100 expert samples (yellow), and our algorithm MILO (red). The expert has a performance of 3248 and the behavior policy used to collect the offline dataset has a performance of 1505 ± 473 (≈ 46% of the expert's). (Right) Expert performance normalized scores averaged across 5 seeds.

Prior works have presented several means to combat this phenomenon in IL. One line of thought utilizes an interactive expert, i.e., an expert that can be queried at an arbitrary state encountered during the training procedure. Interactive IL algorithms such as DAgger (Ross et al., 2011c), LOLS (Chang et al., 2015a), and AggreVaTe(D) (Ross and Bagnell, 2014a; Sun et al., 2017b) utilize a reduction to no-regret online learning and demonstrate that under certain conditions, they can successfully learn a policy that imitates the expert. These interactive IL algorithms, however, cannot provably avoid covariate shift if the expert is not recoverable, that is, if $A^{\pi_e}(s,a) = \Theta(H)$, where $\pi_e$ is the expert, $A^{\pi}$ is the usual (dis)advantage function (footnote 1), and $H$ is the planning horizon (Rajaraman et al., 2020; Agarwal et al., 2019, Chapter 14). A second line of work that avoids covariate shift utilizes either a known transition dynamics model (Ziebart et al., 2008) or real-world interactions (Ho and Ermon, 2016a; Brantley et al., 2019; Sun et al., 2019d; Kostrikov et al., 2019c; Reddy et al., 2020; Kidambi et al., 2021). Prior works have shown that with known transition dynamics or real-world interactions, agents can provably avoid covariate shift in both tabular and general MDPs (Agarwal et al., 2019; Rajaraman et al., 2020) even without a recoverable expert.
While these results offer strong theoretical guarantees and empirical performance, online interactions are often costly and prohibitive for real-world applications where active trial-and-error exploration in the environment could be unsafe or impossible. A third perspective towards addressing this issue is to assume that the expert visits the entire state space (Spencer et al., 2021), where the expert effectively informs the learner what actions to take in every state. Unfortunately, such a full-coverage expert distribution might be rare and holds only for special MDPs and expert policies (e.g., an expert that induces ergodicity in the MDP).

Here, we consider a new perspective towards handling the covariate shift issue in IL. In particular, we investigate a pure offline learning setting where the learner has access to neither the expert nor the environment for additional interactions. The learner, instead, has access to a small pre-collected dataset of state-action pairs sampled from the expert and a large batch offline dataset of state-action-next state transition triples sampled from a behavior policy that could be highly sub-optimal (see Figure 3.1, where BC on the offline data results in a low-quality policy). Unlike prior works that require online interactions, our proposed method, MILO, performs high-fidelity imitation in an offline, data-driven manner. Moreover, different from interactive IL, we do not require the expert to be present during learning, significantly relieving the expert's burden.

Footnote 1: Here, we use cost instead of reward; thus we call $A^{\pi}$ the disadvantage function.
Finally, in contrast to the prior work (Spencer et al., 2021) that assumes the expert distribution covers the entire state space (i.e., $\max_\pi\max_{s,a} d^{\pi}(s,a)/d^{\pi_e}(s,a) < \infty$, where $d^{\pi}$ denotes the state-action distribution of policy $\pi$), we require the offline dataset to provide only partial coverage, i.e., it only needs to cover the expert's state-action pairs (i.e., $\max_{s,a} d^{\pi_e}(s,a)/\rho(s,a) < \infty$, where $\rho$ is the offline distribution of some behavior policy) (footnote 2).

In summary, we list our main contributions below:

1. We propose Model-based Imitation Learning from Offline data, MILO: a model-based framework that leverages offline batch data with only partial coverage (see Section 3.5.1 for the definition) to overcome covariate shift in IL.
2. Our analysis is modular and covers common models such as discrete MDPs, linear models, and nonparametric models such as GPs. Notably, our result on non-parametric models (e.g., Gaussian Processes) with relative condition number is new even considering all existing results in offline RL: see Remark 14 for a partial coverage result, and Remark 18 for robustness to compete against any comparator policy covered by the offline distribution.
3. The practical instantiation of our general framework leverages neural network model ensembles, and demonstrates its efficacy on benchmark MuJoCo continuous control problems. Specifically, even under low-quality behavior policies, our approach can successfully imitate using an extremely small number of expert samples while algorithms like BC completely fail (Figure 3.1).

Footnote 2: In our analysis, we refine the density ratio $d^{\pi_e}(s,a)/\rho(s,a)$ via the concept of relative condition number, which allows us to extend it to large MDPs where the ratio is infinite but the relative condition number is finite.

3.2 Related work

Imitation Learning: As summarized above, avoiding covariate shift in IL is an important topic. Another relevant line of research is IL algorithms that use offline or off-policy learning.
ValueDICE (Kostrikov et al., 2019c) presents a principled way to leverage off-policy data for IL. In theory, the techniques from ValueDICE (and more broadly, DICE (Nachum et al., 2019b; Zhang et al., 2020)) require the data provided to the agent to have global coverage. Moreover, in practice, ValueDICE uses online interaction and maintains an increasing replay buffer which may eventually provide global coverage. Instead, we aim to study offline IL without any online interactions and are interested in the setting where offline data does not have global coverage. Another line of work (Jarrett et al., 2020; Chan and van der Schaar, 2021) studies IL in an offline setting by only using the expert dataset. In contrast to these works, our goal is to study the use of an additional offline dataset collected from a behavior policy to mitigate covariate shift, as information-theoretically any algorithm that relies solely on expert data will still suffer from covariate shift in the worst case (Rajaraman et al., 2020).

Offline RL: In offline RL, algorithms such as FQI (Ernst et al., 2005) have finite-sample error guarantees under global coverage (Munos and Szepesvári, 2008; Antos et al., 2008). Recently, many algorithms to tackle this problem have been proposed from both model-free (Wu et al., 2019; Touati et al., 2020; Liu et al., 2020; Fujimoto et al., 2019; Fakoor et al., 2021; Kumar et al., 2020) and model-based perspectives (Yu et al., 2020; Kidambi et al., 2020b; Matsushima et al., 2020), building on ideas of pessimism. The idea of pessimism features in offline RL with an eye to penalizing the learner for visiting unknown regions of the state-action space (Rashidinejad et al., 2021; Jin et al., 2020b; Yin et al., 2021; Buckman et al., 2020). We utilize pessimism within the IL context where, unlike RL, the learner does not have access to an underlying reward signal.
A by-product of our IL analysis is a set of new results for the pure offline RL setting where the ground truth cost is given: we expand prior theoretical results from offline RL by (a) replacing the full coverage assumption with a much weaker partial coverage assumption formalized in terms of relative condition number, and (b) providing a new type of robustness guarantee: we can learn a policy that is comparable to any policy (not necessarily the optimal one) that is covered by the offline distribution. In other words, as long as there is a high-quality policy covered by the offline distribution, we can learn a policy that can compete against it. We also refer readers to Appendix B.3 for a more detailed literature review on offline RL.

3.3 Setting

We consider an episodic finite-horizon Markov Decision Process (MDP), $\mathcal{M} = \{\mathcal{S},\mathcal{A},P,H,c,d_0\}$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P : \mathcal{S}\times\mathcal{A} \to \Delta(\mathcal{S})$ is the MDP's transition, $H$ is the horizon, $d_0$ is an initial distribution, and $c : \mathcal{S}\times\mathcal{A} \to [0,1]$ is the cost function. A policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ maps from states to distributions over actions. We denote $d^{\pi}_{P}\in\Delta(\mathcal{S}\times\mathcal{A})$ as the average state-action distribution of $\pi$ under transition kernel $P$, that is, $d^{\pi}_{P} = \frac{1}{H}\sum_{t=1}^{H} d^{\pi}_{P,t}$, where $d^{\pi}_{P,t}\in\Delta(\mathcal{S}\times\mathcal{A})$ is the distribution of $(s^{(t)},a^{(t)})$ under $\pi$ at $t$. Given a cost function $f : \mathcal{S}\times\mathcal{A} \mapsto [0,1]$, $V^{\pi}_{P,f}$ denotes the expected cumulative cost of $\pi$ under the transition kernel $P$ and cost function $f$. Following a standard IL setting, the ground truth cost function $c$ is unknown. Instead, we have demonstrations by the expert, specified by $\pi_e : \mathcal{S} \to \Delta(\mathcal{A})$ (potentially stochastic and not necessarily optimal). Concretely, we have an expert dataset in the form of i.i.d. tuples $\mathcal{D}_e = \{s_i,a_i\}_{i=1}^{n_e}$ sampled from the distribution $d^{\pi_e}_{P}$.

In our setting, we also have an offline static dataset consisting of i.i.d. tuples $\mathcal{D}_o = \{s_i,a_i,s'_i\}_{i=1}^{n_o}$ s.t. $(s,a)\sim\rho(s,a)$, $s'\sim P(s,a)$, where $\rho\in\Delta(\mathcal{S}\times\mathcal{A})$ is an offline distribution resulting from some behavior policies.
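For a tabular MDP, the average state-action distribution and the expected cumulative cost defined above can be computed by rolling the state distribution forward. The sketch below is a minimal illustration, assuming array-based `P`, `pi`, and `d0` and the hypothetical function names `avg_state_action_dist` and `expected_cumulative_cost`; it uses the identity $V^{\pi}_{P,f} = H\cdot\mathbb{E}_{(s,a)\sim d^{\pi}_{P}}[f(s,a)]$, which follows directly from the definition of $d^{\pi}_{P}$ as a time average.

```python
import numpy as np


def avg_state_action_dist(P, pi, d0, H):
    """Average state-action distribution d^pi_P over steps t = 1..H.

    P:  (S, A, S) array, P[s, a, s2] = probability of moving from s to s2 under a
    pi: (S, A) array, pi[s, a] = probability of taking a in s
    d0: (S,) initial state distribution
    """
    d_state = np.array(d0, dtype=float)
    d_avg = np.zeros(pi.shape)
    for _ in range(H):
        d_sa = d_state[:, None] * pi                  # distribution of (s_t, a_t)
        d_avg += d_sa / H
        d_state = np.einsum("sa,sap->p", d_sa, P)     # distribution of s_{t+1}
    return d_avg


def expected_cumulative_cost(P, pi, d0, H, f):
    """V^pi_{P,f}: sum over t of E[f(s_t, a_t)], i.e. H times the mean of f under d^pi_P."""
    return H * np.sum(avg_state_action_dist(P, pi, d0, H) * f)
```

For instance, in a two-state MDP where the policy always switches state, the learner alternates between the two states, so half of the averaged mass sits on each state and a cost of 1 in the second state accumulates to $H/2$.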
Note that the behavior policy could be much worse than the expert $\pi_e$. Our goal is to leverage only $(\mathcal{D}_e + \mathcal{D}_o)$ to learn a policy $\pi$ that performs as well as $\pi_e$ with regard to optimizing the ground truth cost $c$. More specifically, our goal is to utilize the offline static data $\mathcal{D}_o$ to combat covariate shift and learn a policy that can significantly outperform traditional offline IL methods such as behavior cloning (BC), without any interaction with the real world or the expert.

Function classes: We introduce function approximation. Since we do not know the true cost function $c$ and transition kernel $P$, we introduce a cost function class $\mathcal{F} \subset \mathcal{S}\times\mathcal{A} \to [0,1]$ and a transition model class $\mathcal{P} : \mathcal{S}\times\mathcal{A} \to \Delta(\mathcal{S})$. We also need a policy class $\Pi$. For the analysis, we assume realizability:

Assumption 8. $c\in\mathcal{F}$, $P\in\mathcal{P}$, $\pi_e\in\Pi$.

We use the Integral Probability Metric (IPM) as a distribution distance measure, i.e., given two distributions $\rho_1$ and $\rho_2$, the IPM with $\mathcal{F}$ is defined as $\max_{f\in\mathcal{F}}\big[\mathbb{E}_{(s,a)\sim\rho_1}[f(s,a)] - \mathbb{E}_{(s,a)\sim\rho_2}[f(s,a)]\big]$.

3.4 Algorithm

The core idea of MILO is to imitate the expert by optimizing an IPM distance between the agent and the expert with a penalty term for pessimism over the policy class.

Algorithm 3 Framework for model-based Imitation Learning with offline data (MILO)
1: Require: IPM class $\mathcal{F}$, model class $\mathcal{P}$, policy class $\Pi$, datasets $\mathcal{D}_e = \{s,a\}$, $\mathcal{D}_o := \{s,a,s'\}$
2: Train Dynamics Model and Bonus: $\hat{P} : \mathcal{S}\times\mathcal{A} \to \mathcal{S}$ and $b : \mathcal{S}\times\mathcal{A} \to \mathbb{R}^+$ on offline data $\mathcal{D}_o$
3: Pessimistic model-based min-max IL: with $\hat{P}$, $b$, $\mathcal{D}_e$, obtain $\hat{\pi}_{IL}$ by solving the following:

$$\hat{\pi}_{IL} = \arg\min_{\pi\in\Pi}\max_{f\in\mathcal{F}} \Big[\mathbb{E}_{(s,a)\sim d^{\pi}_{\hat{P}}}\big[f(s,a) + b(s,a)\big] - \mathbb{E}_{(s,a)\sim\mathcal{D}_e}[f(s,a)]\Big] \qquad (3.1)$$

MILO consists of three steps:

1. Model learning: fit a model $\hat{P}$ from the offline data $\mathcal{D}_o$ to learn $P$.
2. Pessimistic penalty design: construct a penalty function $b(s,a)$ such that there is a high penalty on state-action pairs that are not covered by the offline data distribution $\rho$.
3. Offline min-max model-based policy optimization: optimize Eq. (3.1).

Algorithm 3 provides the details of MILO. We explain each component in detail as follows.

Model learning and Penalty: Our framework assumes we can learn a calibrated model $(\hat{P},\sigma)$ from the dataset $\mathcal{D}_o$, in the sense that for any $s,a$, we have $\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\|_1 \le \min\{2,\sigma(s,a)\}$. Such model training is possible in many settings including classic discrete MDPs, linear models (KNR (Kakade et al., 2020b)), and non-parametric models such as GPs. In practice, it is also common to train a model ensemble based on the idea of bootstrapping and then use the model disagreement to approximate $\sigma$. With such a calibrated model, the bonus will simply be $b(s,a) = O(H\sigma(s,a))$. We will formalize this model learning assumption in Section 3.5. We give several examples below.

For any discrete MDP, we use the empirical distribution, i.e., $\hat{P}(s'|s,a) = N(s',s,a)/(N(s,a)+\lambda)$, where $N(s,a)$ is the number of times $(s,a)$ appears in $\mathcal{D}_o$, $N(s',s,a)$ is the number of times $(s,a,s')$ appears in $\mathcal{D}_o$, and $\lambda\in\mathbb{R}^+$. In this case, we can set $\sigma(s,a) = \tilde{O}\big(\sqrt{|\mathcal{S}|/N(s,a)}\big)$. See Example 4 for more details.

For the continuous Kernelized Nonlinear Regulator (KNR (Kakade et al., 2020b)) model, where the ground truth transition $P(s'|s,a)$ is defined as $s' = W^\star\phi(s,a) + \epsilon$, $\epsilon\sim\mathcal{N}(0,\Sigma)$, with $\phi$ being a (nonlinear) feature mapping, we can learn $\hat{P}$ by classic ridge regression on the offline dataset $\mathcal{D}_o$. Here we can set $\sigma(s,a) = \tilde{O}\Big(\beta\sqrt{\phi(s,a)^\top\Sigma_{n_o}^{-1}\phi(s,a)}\Big)$ for some $\beta\in\mathbb{R}^+$, where $\Sigma_{n_o}$ is the data covariance matrix $\Sigma_{n_o} := \sum_{i=1}^{n_o}\phi(s_i,a_i)\phi(s_i,a_i)^\top + \lambda I$. See Example 5 for more details.

For a non-parametric nonlinear model such as a Gaussian Process (GP), under the assumption that $P$ is in the form of $s' = g^\star(s,a) + \epsilon$, $\epsilon\sim\mathcal{N}(0,\Sigma)$ (here $\mathcal{S}\subset\mathbb{R}^{d_S}$), we can simply represent $\hat{P}$ using the GP posterior induced by $\mathcal{D}_o$, i.e., letting the GP posterior be $GP(\hat{g},k_{n_o})$, we have $\hat{P}(s'|s,a)$ represented as $s' = \hat{g}(s,a) + \epsilon$. Then, we can set $\sigma(s,a) = \tilde{O}\big(\beta k_{n_o}((s,a),(s,a))\big)$ with some parameter $\beta\in\mathbb{R}^+$ (see Example 6 for more details).
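The KNR example above can be sketched with a few lines of linear algebra. The code below is an illustration under stated assumptions, not the thesis implementation: the function names `fit_knr`, `uncertainty`, and `penalty` are hypothetical, `beta` is treated as a given constant rather than computed from a confidence bound, and the penalty uses the form $H\cdot\min(\sigma, 2)$ as one concrete choice consistent with $b(s,a) = O(H\sigma(s,a))$ and the calibration bound $\min\{2,\sigma(s,a)\}$.

```python
import numpy as np


def fit_knr(Phi, S_next, lam):
    """Ridge regression for s' ~ W phi(s, a).

    Phi:    (n, d) feature matrix, one row phi(s_i, a_i) per offline sample
    S_next: (n, d_s) matrix of observed next states
    Returns W_hat of shape (d_s, d) and the data covariance Sigma of shape (d, d).
    """
    d = Phi.shape[1]
    Sigma = Phi.T @ Phi + lam * np.eye(d)             # sum_i phi_i phi_i^T + lam * I
    W_hat = np.linalg.solve(Sigma, Phi.T @ S_next).T  # closed-form ridge solution
    return W_hat, Sigma


def uncertainty(phi, Sigma, beta):
    """sigma(s, a) = beta * sqrt(phi^T Sigma^{-1} phi); beta is an assumed constant."""
    return beta * np.sqrt(phi @ np.linalg.solve(Sigma, phi))


def penalty(phi, Sigma, beta, H):
    """Pessimistic penalty b(s, a) = H * min(sigma(s, a), 2)."""
    return H * min(uncertainty(phi, Sigma, beta), 2.0)
```

When every offline feature points along one direction, the uncertainty shrinks along that direction at the rate $1/\sqrt{n_o + \lambda}$ but stays at $1/\sqrt{\lambda}$ along uncovered directions, so the penalty is small on covered state-action pairs and saturates at $2H$ elsewhere.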
GPs are powerful models and have been widely used in robotics problems; see (Ko et al., 2007; Deisenroth and Rasmussen, 2011; Bansal et al., 2017; Umlauft et al., 2018; Fisac et al., 2018) for examples.

In practice, we can also use a model ensemble of neural networks with the maximum disagreement between models as $\sigma$. This has been widely used in practice (e.g., Osband et al. (2018b); Azizzadenesheli et al. (2018b); Pathak et al. (2019b)). We leave the details to Section 3.6, where we instantiate a practical version of MILO, and to the experiment section.

As we can see from the examples above, in general, the penalty $b(s, a) = O(H\sigma(s, a))$ is designed such that it has a high value in regions of the state-action space that are not covered well by the offline data $\mathcal{D}_o$, and a low value in regions that are covered by $\mathcal{D}_o$. Adding such a penalty automatically forces our policy to stay away from the regions where $\hat{P}$ is not accurate. On the other hand, for regions where $\rho$ has good coverage (and thus $\hat{P}$ is accurate), we force $\pi$ to stay close to $\pi_e$.

Pessimistic model-based min-max IL: Note that Eq. (3.1) is purely computational, i.e., we do not need any real-world samples. To solve this min-max objective, we can iteratively (1) perform the best response on the max player, i.e., compute the argmax discriminator $f$ given the current $\pi$, and (2) perform an incremental update on the min player, e.g., use policy gradient (PG) methods (e.g., TRPO) inside the learned model $\hat{P}$ with cost function $f(s, a) + b(s, a)$. We again leave the details to Section 3.6.

3.4.1 Specialization to offline RL

In RL, the cost function $c$ is given. The goal is to obtain $\pi^* = \arg\min_{\pi \in \Pi} V^{\pi}_{P,c}$. The pessimistic policy optimization procedure (Yu et al., 2021; Jin et al., 2020b) is $\hat{\pi}_{RL} = \arg\min_{\pi \in \Pi} \mathbb{E}_{(s,a) \sim d^{\pi}_{\hat{P}}}[c(s,a) + b(s,a)]$. While this is not our main contribution, we will show that a byproduct of our result is a novel non-parametric analysis for offline RL which does not assume $\rho$ has global coverage (see Remarks 11, 14, 18).
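To make the model-learning and penalty steps of Section 3.4 concrete, the following is a minimal sketch for the discrete-MDP case. It assumes the count-based estimate $\hat{P}(s'|s,a) = N(s',s,a)/(N(s,a)+\lambda)$ and the simplified uncertainty $\sigma(s,a) = \sqrt{|\mathcal{S}|/N(s,a)}$, with the constants and log factors hidden by $\tilde{O}$ dropped. The function names, the toy dataset, and the cost values are hypothetical and are not part of MILO's implementation.

```python
import math
from collections import Counter


def fit_tabular_model(transitions, lam=1.0):
    """Return (p_hat, n_sa) from a list of (s, a, s_next) tuples.

    p_hat(s_next, s, a) = N(s_next, s, a) / (N(s, a) + lam).
    """
    n_sa = Counter((s, a) for s, a, _ in transitions)
    n_sas = Counter(transitions)

    def p_hat(s_next, s, a):
        return n_sas[(s, a, s_next)] / (n_sa[(s, a)] + lam)

    return p_hat, n_sa


def bonus(n_sa, s, a, n_states, horizon):
    """b(s, a) = H * min(sigma(s, a), 2) with sigma = sqrt(|S| / N(s, a)).

    Assumption: constants and log factors of the O-tilde bound are dropped.
    Unvisited pairs get the maximal penalty 2 * H.
    """
    count = n_sa[(s, a)]
    sigma = math.sqrt(n_states / count) if count > 0 else math.inf
    return horizon * min(sigma, 2.0)


# Toy offline data: in state 0, action 0 is well covered and action 1 is not.
offline = [(0, 0, 0)] * 50 + [(0, 0, 1)] * 50 + [(0, 1, 1)]
p_hat, n_sa = fit_tabular_model(offline, lam=1.0)

# Hypothetical one-step costs: action 1 looks cheaper before the penalty.
true_cost = {(0, 0): 0.5, (0, 1): 0.1}
pessimistic = {sa: c + bonus(n_sa, *sa, n_states=2, horizon=10)
               for sa, c in true_cost.items()}
best_action = min(pessimistic, key=pessimistic.get)[1]
```

In this toy comparison the penalized cost $c + b$ prefers the well-covered action even though the poorly covered action has lower nominal cost. This is only a one-step illustration of pessimism, not a full planner inside $\hat{P}$.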
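3.5 Analysis

Our algorithm depends on the model $\hat{P}$ estimated from the offline data. We provide a unified analysis assuming that $\hat{P}$ is calibrated in that its confidence interval is provided. Specifically, we assume:

Assumption 9. With probability $1 - \delta$, the estimated model $\hat{P}$ satisfies the following: $\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\|_1 \le \min(\sigma(s,a), 2)$ for all $(s,a) \in \mathcal{S} \times \mathcal{A}$. We set the bonus as $b(s,a) = H \min(\sigma(s,a), 2)$.

We give the following three examples. For details, refer to Appendix B.1.

Example 4 (Discrete MDPs). Set the uncertainty measure
$$\sigma(s,a) = \sqrt{\frac{|\mathcal{S}| \log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)}{2\{N(s,a) + \lambda\}}} + \frac{\lambda}{N(s,a) + \lambda}.$$

Example 5 (KNRs). In KNRs, the ground truth model is $s' = W^{*}\phi(s,a) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \zeta^2 I)$, where $s \in \mathbb{R}^{d_S}$, $a \in \mathbb{R}^{d_A}$, and $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ is some known state-action feature mapping. The estimator is $\hat{g}(\cdot) = \hat{W}\phi(\cdot)$ with
$$\hat{W} = \arg\min_{W \in \mathbb{R}^{d_S \times d}} \frac{1}{n_o} \sum_{(s,a,s') \in \mathcal{D}_o} \left\| W\phi(s,a) - s' \right\|_2^2 + \lambda \|W\|_F^2,$$
where $\|\cdot\|_F$ is the Frobenius norm. We set the uncertainty measure $\sigma(s,a)$ as
$$\sigma(s,a) = \frac{1}{\zeta} \beta_{n_o} \sqrt{\phi^{\top}(s,a) \Sigma_{n_o}^{-1} \phi(s,a)}, \qquad \Sigma_{n_o} = \sum_{i=1}^{n_o} \phi(s_i,a_i)\phi^{\top}(s_i,a_i) + \lambda I,$$
with $\beta_{n_o} = \left\{ 2\lambda \|W^{*}\|_2^2 + 8\zeta^2 \left[ d_S \log(5) + \log(1/\delta) + \bar{I}_{n_o} \right] \right\}^{1/2}$, where $\bar{I}_{n_o} = \log\left( \det(\Sigma_{n_o}) / \det(\lambda I) \right)$.

Example 6 (GPs). In GPs, the ground truth model is defined as $s' = g^{*}(s,a) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \zeta^2 I)$, where $g^{*}$ belongs to an RKHS $\mathcal{H}_k$ with a kernel $k(\cdot, \cdot)$. Denoting $x := (s, a)$, the GP posterior is
$$\hat{g}(\cdot) = S (K_{n_o} + \zeta^2 I)^{-1} \bar{k}_{n_o}(\cdot), \qquad S = [s'_1, \cdots, s'_{n_o}] \in \mathbb{R}^{d_S \times n_o},$$
$$\bar{k}_{n_o}(x) = [k(x_1, x), \cdots, k(x_{n_o}, x)]^{\top}, \qquad \{K_{n_o}\}_{i,j} = k(x_i, x_j) \ (1 \le i \le n_o, 1 \le j \le n_o),$$
$$k_{n_o}(x, x') = k(x, x') - \bar{k}_{n_o}(x)^{\top} (K_{n_o} + \zeta^2 I)^{-1} \bar{k}_{n_o}(x'),$$
with $\sigma(\cdot) = \beta_{n_o} k_{n_o}(\cdot, \cdot)/\zeta$, $\beta_{n_o} = O\left( (d_S \log^3(d_S n_o/\delta) I_{n_o})^{1/2} \right)$, and $I_{n_o} = \log(\det(I + \zeta^{-2} K_{n_o}))$.

General results. We show our general error bound results. For the proof, refer to Appendix B.2. For analytical simplicity, we assume $|\mathcal{F}|$ is finite (but the bound only depends on $\ln(|\mathcal{F}|)$).³

³ When $|\mathcal{F}|$ is infinite, we can show that the resulting error bound scales with respect to its metric entropy.

Theorem 10 (Bound of MILO). Suppose Assumptions 8 and 9 hold.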
Then, with probability $1 - 2\delta$,
$$V^{\hat{\pi}_{IL}}_{P,c} - V^{\pi_e}_{P,c} \le \mathrm{Err}_o + \mathrm{Err}_e, \qquad \mathrm{Err}_o = 8H^2 \, \mathbb{E}_{(s,a) \sim d^{\pi_e}_{P}}[\min(\sigma(s,a), 1)], \qquad \mathrm{Err}_e = 2H \sqrt{\frac{\log(2|\mathcal{F}|/\delta)}{2 n_e}}.$$

We will show through a set of examples that $\mathbb{E}_{(s,a) \sim d^{\pi_e}_{P}}[\min(\sigma(s,a), 1)]$ shrinks to zero as $n_o \to \infty$ under partial coverage, i.e., when $\rho$ covers $d^{\pi_e}_{P}$. Asymptotically, $\mathrm{Err}_e$ will dominate the bound. Note that $\mathrm{Err}_e$ has two components: a factor linear in $H$ and a term that corresponds to the statistical error related to the expert samples and the function class complexity. Compared to BC, which has a rate $O(H^2 \sqrt{\log(|\Pi|)/n_e})$ (Agarwal et al., 2019, Chapter 14) for some policy class $\Pi \subset \mathcal{S} \to \Delta(\mathcal{A})$, we see that the horizon dependence is improved.

Before analyzing $\mathrm{Err}_o$ in each setting, we highlight two important points in our analysis. First, our bound requires only partial coverage, i.e., it depends on the $\pi_e$-concentrability coefficient, which measures the discrepancy between the offline data and the expert data. This is the first work deriving a bound with the $\pi_e$-concentrability coefficient in IL with offline data. Second, our analysis covers non-parametric models. This is a significant contribution, as previous finite-sample error results for pessimistic offline RL have been limited to finite-dimensional linear models or discrete MDPs (Jin et al., 2020b; Rashidinejad et al., 2021).

Remark 11 (Implications on offline RL). As in Theorem 10, we have $V^{\hat{\pi}_{RL}}_{P,c} - V^{\tilde{\pi}}_{P,c} = O(H^2 \, \mathbb{E}_{(s,a) \sim d^{\tilde{\pi}}_{P}}[\sigma(s,a)])$ (Appendix B.2) for any comparator policy $\tilde{\pi}$ (not necessarily the optimal one). Note that similar results have been obtained in (Yu et al., 2020; Kidambi et al., 2020b). Since this term is $\mathrm{Err}_o$ with $\pi_e$ replaced by $\tilde{\pi}$, this offline RL result is a by-product of our analysis.

3.5.1 Analysis: Discrete MDPs

We start with discrete MDPs as a warm-up. Denote $C^{\pi_e} = \max_{(s,a)} d^{\pi_e}_{P}(s,a)/\rho(s,a)$ as the $\pi_e$-concentrability coefficient.

Theorem 12. Suppose $\lambda = \Omega(1)$ and the partial coverage $C^{\pi_e} < \infty$.
With probability $1 - \delta$,
$$\mathrm{Err}_o \le c_1 H^2 \left( \sqrt{\frac{C^{\pi_e} |\mathcal{S}|^2 |\mathcal{A}|}{n_o}} + \frac{C^{\pi_e} |\mathcal{S}||\mathcal{A}|}{n_o} \right) \cdot \log(|\mathcal{S}||\mathcal{A}| c_2/\delta),$$
where $c_1, c_2$ are universal constants.

The error does not depend on $\sup_{\pi \in \Pi} C^{\pi}$ or $\bar{C} = \sup_{\pi \in \Pi} \max_{(s,a)} d^{\pi}_{P}(s,a)/d^{\pi_e}_{P}(s,a)$. We only require the partial coverage $C^{\pi_e} < \infty$, which is much weaker than $\sup_{\pi \in \Pi} C^{\pi} < \infty$ ($\rho$ has global coverage) and $\bar{C} < \infty$ ($d^{\pi_e}_{P}$ has global coverage (Spencer et al., 2021)). When $C^{\pi_e}$ is small and $n_o$ is large enough, $\mathrm{Err}_e = O\left( H \sqrt{|\mathcal{S}||\mathcal{A}|/n_e} \right)$ dominates $\mathrm{Err}_o$ in Theorem 10. Then, the error is linear in the horizon $H$.

3.5.2 Analysis: KNRs and GPs for Continuous MDPs

Now we move to continuous state-action MDPs. In continuous MDPs, assuming the boundedness of the density ratio $C^{\pi_e}$ is still a strong assumption. As we dive into the KNR and the nonparametric GP model, we will replace the density ratio with a more refined concept, the relative condition number.

KNRs. Let $\Sigma_{\rho} = \mathbb{E}_{(s,a) \sim \rho}[\phi(s,a)\phi(s,a)^{\top}]$ and $\Sigma_{\pi_e} = \mathbb{E}_{(s,a) \sim d^{\pi_e}_{P}}[\phi(s,a)\phi(s,a)^{\top}]$. We define the relative condition number as $C^{\pi_e} = \sup_{x \in \mathbb{R}^d} \left( \frac{x^{\top} \Sigma_{\pi_e} x}{x^{\top} \Sigma_{\rho} x} \right)$. Even when the density ratio is infinite, this number could still be finite, as it concerns subspaces of $\phi(s, a)$ rather than the whole $\mathcal{S} \times \mathcal{A}$. To gain further intuition, consider discrete MDPs and the feature mapping $\phi(s,a) \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ which is a one-hot encoding vector that is zero everywhere except for a one at the entry corresponding to the pair $(s, a)$. In this case, the relative condition number reduces to $\max_{(s,a)} d^{\pi_e}_{P}(s,a)/\rho(s,a)$, i.e., the density ratio.

Theorem 13 (Error for KNRs). Suppose $\sup_{s,a} \|\phi(s,a)\| \le 1$, $\lambda = \Omega(1)$, $\zeta^2 = \Omega(1)$, $\|W^{*}\|_2 = \Omega(1)$ and the partial coverage $C^{\pi_e} < \infty$. With probability $1 - \delta$,
$$\mathrm{Err}_o \le c_1 H^2 \left( \mathrm{rank}^2(\Sigma_{\rho}) + \mathrm{rank}(\Sigma_{\rho}) \log\left( \frac{c_2}{\delta} \right) \right) \sqrt{\frac{d_S C^{\pi_e}}{n_o}} \cdot \log^{1/2}(1 + n_o), \tag{3.2}$$
where $c_1$ and $c_2$ are some universal constants.

Theorem 13 suggests that $\mathrm{Err}_o$ is $\tilde{O}\left( H^2 \mathrm{rank}[\Sigma_{\rho}]^2 \sqrt{d_S C^{\pi_e}/n_o} \right)$. In other words, when $C^{\pi_e}$ and $\mathrm{rank}[\Sigma_{\rho}]$ are small and the offline sample size $n_o$ is large enough, $\mathrm{Err}_e$ dominates $\mathrm{Err}_o$ in Theorem 10.
Again, in this case, $\mathrm{Err}_e = O\left( H \sqrt{\ln(|\mathcal{F}|)/n_e} \right)$, and we see that it grows linearly with respect to the horizon $H$.

Our result is distribution dependent and captures the possible low-rankness of the offline data, i.e., $\mathrm{rank}[\Sigma_{\rho}]$ depends on $\rho$ and could be much smaller than the ambient dimension of the feature $\phi(s, a)$. The quantity $C^{\pi_e}$ corresponds to the discrepancy measured between the batch data and the expert data. This is much smaller than the worst-case concentrability coefficient $\tilde{C} = \sup_{\pi \in \Pi} C^{\pi}$.

Remark 14 (Implication on offline RL: Partial coverage). In RL, a similar quantity has been analyzed in (Jin et al., 2020b), which studies the error bound of linear FQI with pessimism. Whereas our result only requires partial coverage, (Jin et al., 2020b, Corollary 4.5) assumes global coverage, i.e., that $\Sigma_{\rho}$ is full-rank, which is stronger than $C^{\pi_e} < \infty$.

GPs. Now we specialize our main theorem to non-parametric GP models. For simplicity, following (Srinivas et al., 2010), we assume $\mathcal{S} \times \mathcal{A}$ is a compact space. We also suppose the following. Recall $x := (s, a)$.

Assumption 15. $k(x, x) \le 1$ for all $x \in \mathcal{S} \times \mathcal{A}$, and $k(\cdot, \cdot)$ is a continuous and positive semidefinite kernel.

Under Assumption 15, we can use Mercer's theorem (Wainwright, 2019), which shows that there exists a set of pairs of eigenvalues and eigenfunctions $\{\mu_i, \psi_i\}_{i=1}^{\infty}$, where $\int \rho(x)\psi_i(x)\psi_i(x)dx = 1$ for all $i$ and $\int \rho(x)\psi_i(x)\psi_j(x)dx = 0$ for $i \ne j$. The eigenfunctions and eigenvalues essentially define an infinite-dimensional feature mapping $\phi(x) := [\sqrt{\mu_1}\psi_1(x), \ldots, \sqrt{\mu_{\infty}}\psi_{\infty}(x)]^{\top}$. Here, $k(x, x) = \phi(x)^{\top}\phi(x)$, and any function $f \in \mathcal{H}_k$ can be represented as $f(\cdot) = \alpha^{\top}\phi(\cdot)$. Note that the eigenvalues and eigenfunctions are defined with respect to the offline data distribution $\rho$; thus our result here is still distribution dependent rather than a worst-case analysis, which often appears in online RL/IL settings (Srinivas et al., 2010; Kakade et al., 2020b; Yang et al., 2020; Chowdhury and Gopalan, 2019). Assume eigenvalues {µ1, . . .
, µ∞} are in non-increasing order. We define the effective dimension as follows.

Definition 16 (Effective dimension). $d^{*} = \min\{ j \in \mathbb{N} : j \ge B(j+1) n_o/\zeta^2 \}$, where $B(j) = \sum_{k=j}^{\infty} \mu_k$.

The effective dimension $d^{*}$ is widely used and has been calculated for many kernels (Zhang, 2005; Bach, 2017; Valko et al., 2013; Janz et al., 2020). For finite-dimensional linear kernels $\{x \mapsto a^{\top}\phi(x); a \in \mathbb{R}^d\}$ (with $k(x, x) = \phi^{\top}(x)\phi(x)$), we have $d^{*} \le \mathrm{rank}[\Sigma_{\rho}]$. Thus, $d^{*}$ can be considered a natural extension of $\mathrm{rank}[\Sigma_{\rho}]$ to infinite-dimensional models.

Theorem 17 (Error for GPs). Let $\Sigma_{\pi_e} = \mathbb{E}_{x \sim d^{\pi_e}_{P}}[\phi(x)\phi(x)^{\top}]$ and $\Sigma_{\rho} = \mathbb{E}_{x \sim \rho}[\phi(x)\phi(x)^{\top}]$. Suppose Assumption 15, $\zeta^2 = \Omega(1)$ and the partial coverage $C^{\pi_e} = \sup_{\|x\|_2 \le 1} (x^{\top}\Sigma_{\pi_e} x / x^{\top}\Sigma_{\rho} x) < \infty$. With probability $1 - \delta$,
$$\mathrm{Err}_o \le c_1 H^2 \left( (d^{*})^2 + d^{*} \log(c_2/\delta) \right) \sqrt{\frac{d_S C^{\pi_e}}{n_o}} \cdot \sqrt{\log^3(c_2 d_S n_o/\delta) \log(1 + n_o)}, \tag{3.3}$$
where $c_1, c_2$ are universal constants.

The theorem suggests that $\mathrm{Err}_o$ is $\tilde{O}\left( H^2 (d^{*})^2 \sqrt{d_S C^{\pi_e}/n_o} \right)$. Thus, when $C^{\pi_e}$ and $d^{*}$ are not too large and $n_o$ is large enough, $\mathrm{Err}_e$ asymptotically dominates $\mathrm{Err}_o$ in Theorem 10 (again, $\mathrm{Err}_e$ is linear in $H$).

While we defer the detailed proof of the above theorem to Appendix B.3.3, we highlight some of the techniques used. The analysis reduces to bounding the information gain $I_{n_o}$ and $\mathbb{E}_{x \sim d^{\pi_e}_{P}}[k_{n_o}(x, x)]$. In both cases, we proceed in two steps: transforming the quantity into a variational representation and then bounding it via the uniform law with localization (Lemma 69).

Remark 18 (Implication to Offline RL: Robustness). As related literature, in model-free offline RL, (Uehara et al., 2021; Duan et al., 2021) obtain finite-sample error bounds using nonparametric models. Though their bounds can be characterized by the effective dimension, they assume full coverage, i.e., $\max_{(s,a)} 1/\rho(s,a) < \infty$.
Specializing our result in Theorem 17 to offline RL, we achieve the optimality gap $V^{\hat{\pi}_{RL}}_{P,c} - V^{\tilde{\pi}}_{P,c} \le \tilde{O}\left( H^2 (d^{*})^2 \sqrt{d_S C^{\tilde{\pi}}/n_o} \right)$ with Gaussian processes, i.e., we are able to compare against any comparator policy, as long as its relative condition number satisfies $C^{\tilde{\pi}} < \infty$. Thus our result indicates a robustness guarantee: among the policies that are covered by the offline distribution in terms of a bounded relative condition number, we can find a policy that competes against the best one.

3.6 Practical Implementation

Algorithm 4 A practical instantiation of MILO
1: Require: expert dataset $\mathcal{D}_e = \{s, a\}$, offline dataset $\mathcal{D}_o := \{s, a, s'\}$
2: Train an ensemble of neural network models $\{\hat{g}_1, \ldots, \hat{g}_n\}$ where each $\hat{g}_i$ starts from a different random initialization;
3: Set bonus $b(s,a) = \max_{i,j} \|\hat{g}_i(s,a) - \hat{g}_j(s,a)\|_2$ and initialize $\pi_{\theta_0}$.
4: for $t = 0 \to T - 1$ do
5:   Set $w_t = \arg\max_{\|w\|_2 \le 1} w^{\top} \left( \mathbb{E}_{(s,a) \sim d^{\pi_{\theta_t}}_{\hat{P}}}[\phi(s,a)] - \mathbb{E}_{(s,a) \sim \mathcal{D}_e}[\phi(s,a)] \right)$, $f_t(s,a) := w_t^{\top}\phi(s,a)$
6:   $\theta_{t+1} = \theta_t - \eta F_{\theta_t}^{-1} \left( \mathbb{E}_{(s,a) \sim d^{\pi_{\theta_t}}_{\hat{P}}}\left[ \nabla \ln \pi_{\theta_t}(a|s) A^{\pi_{\theta_t}}_{\hat{P}, f_t + b}(s,a) \right] + \lambda \mathbb{E}_{(s,a) \sim \mathcal{D}_e}\left[ \nabla \ell(a, s, \pi_{\theta_t}) \right] \right)$
7: end for

In this section we instantiate a practical version of MILO using neural networks for the model class $\mathcal{P}$ and the policy class $\Pi$. We use the Maximum Mean Discrepancy (MMD) with a Radial Basis Function kernel as our discriminator class $\mathcal{F}$. Note that using MMD as our discrepancy measure allows us to compute the exact maximum discriminator $\arg\max_{f \in \mathcal{F}}$ in closed form. We use a KL-based trust-region formulation for the incremental policy update inside the learned model $\hat{P}$. Based on Eq. (3.1), we first formalize the following constrained optimization framework:
$$\min_{\pi \in \Pi} \max_{f \in \mathcal{F}} \left( \mathbb{E}_{(s,a) \sim d^{\pi}_{\hat{P}}}\left[ f(s,a) + b(s,a) \right] - \mathbb{E}_{(s,a) \sim \mathcal{D}_e}[f(s,a)] \right) \quad \text{s.t.} \quad \mathbb{E}_{(s,a) \sim \mathcal{D}_e}[\ell(a, s, \pi)] \le \delta,$$
where $\ell : \mathcal{A} \times \mathcal{S} \times \Pi \to \mathbb{R}$ is a loss function (e.g., negative log-likelihood or any supervised learning loss one would use in BC).
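As a brief aside on the discriminator step (line 5 of Algorithm 4): when features are explicit and finite-dimensional, the maximizer over the unit ball is the normalized difference of mean feature embeddings, by the Cauchy-Schwarz inequality. The sketch below illustrates this under the assumption that $\phi(s,a)$ is available as a plain feature vector (for an RBF kernel one would need an approximation such as random features, which is not shown). The function names and the toy feature values are hypothetical.

```python
import math


def mean_embedding(features):
    """Average a non-empty list of equal-length feature vectors phi(s, a)."""
    n, dim = len(features), len(features[0])
    return [sum(f[k] for f in features) / n for k in range(dim)]


def max_discriminator(model_feats, expert_feats):
    """Closed-form w = argmax_{||w||_2 <= 1} w^T (mu_model - mu_expert).

    Returns (w, value), where value is the norm of the embedding gap.
    """
    mu_model = mean_embedding(model_feats)
    mu_expert = mean_embedding(expert_feats)
    diff = [m - e for m, e in zip(mu_model, mu_expert)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        # Embeddings already match; any w attains the maximum value 0.
        return [0.0] * len(diff), 0.0
    return [d / norm for d in diff], norm


def discriminator_cost(w, phi):
    """f_t(s, a) = w^T phi(s, a), the cost handed to the policy update."""
    return sum(wk * pk for wk, pk in zip(w, phi))


# Toy features for samples drawn inside the learned model and from the expert.
model_feats = [[1.0, 0.0], [1.0, 2.0]]
expert_feats = [[0.0, 1.0], [0.0, 1.0]]
w, gap = max_discriminator(model_feats, expert_feats)
```

The returned `gap` is the discrepancy value attained by the best unit-norm discriminator, so it shrinks to zero as the policy's feature expectations inside the model approach the expert's.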
Essentially, since we have $\mathcal{D}_e$ available, we use it together with any supervised learning loss to constrain the policy hypothesis space $\Pi$. Note that for a deterministic expert $\pi_e$, the expert policy is always a feasible solution. Thus adding this constraint reduces the complexity of the policy class but does not eliminate the expert policy, and our analysis in Section 3.5 still applies. In our practical instantiation, we replace the hard constraint with a Lagrangian penalty, i.e., we use the behavior cloning objective as a regularization term when solving the min-max problem:
$$\min_{\pi \in \Pi} \max_{f \in \mathcal{F}} \left[ \mathbb{E}_{(s,a) \sim d^{\pi}_{\hat{P}}}\left( f(s,a) + b(s,a) \right) - \mathbb{E}_{(s,a) \sim \mathcal{D}_e}[f(s,a)] \right] + \lambda \cdot \mathbb{E}_{(s,a) \sim \mathcal{D}_e}[\ell(a, s, \pi)].$$
Note that there always exists a regularization parameter $\lambda$ that makes this regularized optimization problem equivalent to the constrained one. Iteratively, given policy $\pi_{\theta_t}$ ($\theta_t$ denotes the parameters), we first update the discriminator $f_t$ (line 5 in Algorithm 4); then, with $f_t$ fixed, we incrementally update $\pi$ using NPG as in line 6 in Algorithm 4, where $A^{\pi_{\theta}}_{\hat{P}, f+b}$ is the disadvantage function of $\pi_{\theta}$ under transition $\hat{P}$ and cost $f_t + b$, and $F_{\theta_t} := \mathbb{E}_{(s,a) \sim d^{\pi_{\theta_t}}_{\hat{P}}}[\nabla \ln \pi_{\theta_t}(a|s) \nabla \ln \pi_{\theta_t}(a|s)^{\top}]$ is the Fisher information matrix. We summarize the above procedure in Algorithm 4.⁴

3.7 Experiments

We aim to answer the following questions with our experiments: (1) How does MILO perform relative to other offline IL methods? (2) What is the impact of pessimism on MILO's performance? (3) How does the behavior policy's coverage impact MILO's performance? (4) How does MILO's result vary when we increase the number of samples drawn from the expert policy?

We evaluate MILO on five environments from OpenAI Gym (Brockman et al., 2016b) simulated with MuJoCo (Todorov et al., 2012a): Hopper-v2, Walker2d-v2, HalfCheetah-v2, Ant-v2, and Humanoid-v2.
We compare MILO against the following baselines: (1) ValueDICE (Kostrikov et al., 2019c), a state-of-the-art off-policy IL method modified for the offline IL setting; (2) BC on the expert dataset; and (3) BC on both the offline and expert datasets. Note that we modify ValueDICE to be offline by first populating the replay buffer with the offline dataset and then training the policy with the frozen replay buffer and expert data.

⁴ In line 6, we use TRPO, which means that we set $\eta$ via the line search procedure used in TRPO.

Environment       Expert Performance   Behavior Performance
Hopper-v2         3012                 752 (25%)
Walker2d-v2       3082                 1383 (45%)
HalfCheetah-v2    5986                 3972 (66%)
Ant-v2            3072                 1208 (40%)
Humanoid-v2       3248                 1505 (46%)

Table 3.1: Performance of the expert and behavior policies used to collect the expert and offline datasets, respectively.

For the expert dataset, we first train expert policies and then randomly sample $(s, a)$ pairs from a pool of 100 expert trajectories collected from these expert policies. We sample randomly to create very small expert $(s, a)$-pair datasets on which BC struggles to learn. Note that BC is effective at imitating the expert for MuJoCo tasks even with a single trajectory; prior works (Ho and Ermon, 2016a; Kostrikov et al., 2019a,c) have used similar sub-sampling strategies to create expert datasets that make it harder for BC to learn. While we focus on the setting with an extremely small expert dataset consisting of the expert's state-action pairs, in appendix Figure B.1 we verify that MILO can also successfully match the expert performance using a single expert trajectory. The offline datasets are collected building on prior offline RL works (Wu et al., 2019; Kidambi et al., 2020b); each dataset contains 1 million samples from the environment. We first train behavior policies with mean performances often less than half of the expert performance (Table 3.1, column 2). All results are averaged over five random seeds.
See the appendix for details on hyperparameters, environments, and dataset composition.

Figure 3.2: Learning curves across five seeds for MILO plotted against the best performance of BC after 1000 epochs of training on the expert/offline+expert data and the best performance of ValueDICE after 10 thousand iterations. The bottom right bar graph shows the expert-performance-normalized scores, where we plot the performance at the last iteration for MILO.

3.7.1 Evaluation on MuJoCo Continuous Control Tasks

Figure 3.2 presents results comparing MILO against the baselines. MILO achieves close to expert-level performance on three out of the five environments and outperforms both BC and ValueDICE on all five environments. MILO significantly outperforms BC trained on the expert dataset (note that the expert datasets in Figure 3.2 only contain 100-200 expert $(s, a)$ pairs), suggesting that MILO indeed mitigates covariate shift through the use of a static offline dataset of $(s, a)$-pairs. BC on both the offline and expert datasets does improve the performance, but it still cannot successfully imitate the expert, since BC has no way of differentiating random or sub-optimal trajectories from the expert samples. ValueDICE, on the other hand, does explicitly aim to imitate the expert samples; however, in theory, it would require either the offline data (i.e., the replay buffer) or the expert samples to have full coverage over the state-action space. Since our offline dataset is mainly collected from a sub-optimal behavior policy and our expert samples are from a high-quality expert, neither our offline nor our expert dataset is likely to have full global coverage, which potentially hurts the performance of algorithms like ValueDICE. We emphasize that in the bar plot of Figure 3.2, for MILO, we use the performance of the policy at the last iteration, while for the other baselines we use the performance of the best policy over the entire training process.
This indicates that the learning process of MILO is stable.

Figure 3.3: (Left 2) Learning curves for Hopper and Walker2d with (red) and without (blue) pessimism. MILO generally performs worse without pessimism. (Right 2) Learning curves for Walker2d and Humanoid with more expert samples.

3.7.2 Ablation

Impact of Pessimism. Figure 3.3 (Left 2) presents MILO's performance on two representative environments with and without pessimism (i.e., setting the penalty to zero) added to the imitation objective. Pessimism stabilizes and improves the final performance of MILO. We find that pessimism is necessary in the other environments as well, except Ant, where MILO achieves expert-level performance even without pessimism (see Appendix Figure B.2).

Behavior with more expert samples. We investigate whether MILO is able to achieve expert performance with more expert samples in the two environments (Walker2d and Humanoid) that it did not solve with very small expert datasets in Figure 3.2. Figure 3.3 (Right 2) shows that with one trajectory's worth of expert samples (1000 expert state-action pairs), MILO is able to achieve expert performance on Walker2d and Humanoid.

Impact of Coverage. As our analysis suggests, MILO's performance degrades as the offline data's coverage over the expert's state-action space decreases. We use the behavior policy's value as a surrogate for coverage, i.e., a lower value potentially suggests lower coverage. We generate two additional offline datasets for each environment by lowering the performance of the behavior policy. The three datasets are: (1) the original offline datasets used in Table 3.1 (≈ 25% for Hopper-v2 and ≈ 50% for others); (2) ones that have roughly half the performance of (1) (12% for Hopper-v2 and ≈ 25% for others); and (3) ones collected from a random behavior policy (Random).
Table 3.2 shows that MILO performs reasonably on three environments even with a lower coverage dataset (second column): it matches expert performance on Ant-v2 and achieves approximately 70% of the expert performance on Hopper-v2 and Humanoid-v2. For the random datasets, MILO achieves around 50% of the expert performance on Hopper-v2 and around 20% on Walker2d-v2 and Ant-v2, but fails on HalfCheetah-v2 and Humanoid-v2.

Environment       ≈ 50%          ≈ 25%          Random
Hopper-v2         0.95 ± 0.01    0.66 ± 0.33    0.42 ± 0.36
Walker2d-v2       0.72 ± 0.02    0.27 ± 0.06    0.23 ± 0.12
HalfCheetah-v2    0.96 ± 0.01    0.01 ± 0.02    0.01 ± 0.02
Ant-v2            1.02 ± 0.02    0.99 ± 0.01    0.21 ± 0.52
Humanoid-v2       0.88 ± 0.10    0.72 ± 0.03    0.08 ± 0.01

Table 3.2: Expert-performance-normalized scores on three different offline datasets collected from behavior policies with approximately 50%, 25%, and random performance relative to the expert.

3.8 Conclusion

MILO investigates how to mitigate covariate shift in IL using an offline dataset of environment interactions that has partial coverage of the expert's state-action space. We show the effectiveness of MILO both in theory and in practice. In future work, we hope to extend MILO to image-based control, bringing it closer to real-world settings where an offline IL algorithm may be effective.

CHAPTER 4
MODEL-FREE OFF-POLICY IMITATION LEARNING

Adversarial imitation learning (AIL) has stood out as a dominant framework across various imitation learning (IL) applications, with Discriminator Actor Critic (DAC) (Kostrikov et al., 2019b) demonstrating the effectiveness of off-policy learning algorithms in improving sample efficiency and scalability to higher-dimensional observations. Despite DAC's empirical success, the original AIL objective is on-policy, and DAC's ad-hoc application of off-policy training does not guarantee successful imitation (Kostrikov et al., 2019b, 2020).
Follow-up work such as ValueDICE (Kostrikov et al., 2020) tackles this issue by deriving a fully off-policy AIL objective. Instead, we develop a novel and principled AIL algorithm via the framework of boosting. Like boosting, our new algorithm, AILBoost, maintains an ensemble of properly weighted weak learners (i.e., policies) and trains a discriminator that witnesses the maximum discrepancy between the distributions of the ensemble and the expert policy. We maintain a weighted replay buffer to represent the state-action distribution induced by the ensemble, allowing us to train discriminators using the entire data collected so far. In the weighted replay buffer, the contribution of the data from older policies is properly discounted, with the weight computed based on the boosting framework. Empirically, we evaluate our algorithm on both controller state-based and pixel-based environments from the DeepMind Control Suite. AILBoost outperforms DAC on both types of environments, demonstrating the benefit of properly weighting replay buffer data for off-policy training. On state-based environments, AILBoost outperforms ValueDICE and IQ-Learn (Garg et al., 2021), achieving competitive performance with as little as one expert trajectory.

4.1 Introduction

Adversarial Imitation Learning (AIL) is an incredibly successful approach for imitation learning (Ho and Ermon, 2016a; Fu et al., 2018; Kostrikov et al., 2019b; Ke et al., 2020). These methods cast IL as a distribution matching problem whereby the learning agent minimizes the divergence between the expert demonstrator's distribution and the state-action distribution induced by the agent. First introduced by (Ho and Ermon, 2016a), this divergence minimization can be achieved in an iterative procedure reminiscent of GAN algorithms (Goodfellow et al., 2014), with our learned reward function and policy being the discriminator and generator, respectively.
Originally, a limitation of many AIL methods was that they were on-policy. That is, for on-policy AIL methods like GAIL (Ho and Ermon, 2016a) and AIRL (Fu et al., 2018), the algorithm would draw fresh samples from the current policy in every iteration for the distribution matching process while discarding all old samples, making the sample complexity of these algorithms prohibitively large in many applications. Follow-up works (Kostrikov et al., 2019b; Sasaki et al., 2019) attempt to relax the on-policy requirement by creating off-policy methods that utilize the entire history of observed data during the learning process. This history is often represented by a replay buffer, and methods such as Discriminator Actor Critic (DAC) show large improvements in scalability and sample complexity over their on-policy counterparts. However, these methods modify the distribution matching objective into a divergence minimization between the replay buffer's and the expert's distributions, losing the guarantee of matching the expert's behavior.

Algorithms like ValueDICE (Kostrikov et al., 2020) address this problem by deriving a new formulation of the AIL divergence minimization objective that is entirely off-policy. ValueDICE, however, in principle relies on the environments having deterministic dynamics.¹ In this chapter, we consider a new perspective on making AIL off-policy. We present a new principled off-policy AIL algorithm, AILBoost, via the gradient boosting framework (Mason et al., 1999). AILBoost maintains an ensemble of properly weighted weak learners, or policies, as well as a weighted replay buffer to represent the state-action distribution induced by our ensemble.
Our distribution matching objective is then to minimize the divergence between the weighted replay buffer's distribution (i.e., the state-action distribution induced by the ensemble) and the expert demonstrator's distribution, making the divergence minimization problem an off-policy learning problem. Similar to boosting and gradient boosting, at every iteration, we aim to find a weak learner such that, when it is added to the ensemble, the divergence between the updated ensemble's distribution and the expert's distribution decreases. In other words, our approach can be understood as performing gradient boosting in the state-action occupancy space, where a black-box RL optimizer is used as the weak learning procedure to train weak learners, i.e., policies.

We evaluate AILBoost on the DeepMind Control Suite (Tassa et al., 2018) and compare against behavior cloning and a range of off-policy AIL algorithms (ValueDICE, DAC), as well as a state-of-the-art IL algorithm, IQ-Learn. We show that our algorithm is comparable to or more sample efficient than state-of-the-art IL algorithms in various continuous control tasks, achieving strong imitation performance with as little as one expert demonstration. We also show that our approach scales to vision-based, partially observable domains, where we again outperform DAC.

¹ One cannot derive an unbiased estimate of the objective function proposed in ValueDICE unless it has infinite expert samples and the transition is deterministic (Kostrikov et al., 2020). See Section 4.3.3 for a more detailed discussion.

4.2 Related works

Off-policy and Offline IL. There has also been a wide variety of research conducted on off-policy and offline IL, where the goal is to be either more sample efficient or safer by utilizing a replay buffer or not collecting any environmental transitions during training, respectively.
The most prominent of these methods, and the closest to our work, is Discriminator-Actor-Critic (DAC) (Kostrikov et al., 2019b), which essentially replaces the on-policy RL algorithm in the adversarial IL setup with an off-policy one such as DDPG (Lillicrap et al., 2019) or SAC (Haarnoja et al., 2018a). However, as mentioned previously, DAC does not necessarily guarantee a distribution match between the expert and the learned policy, prompting further work. This further work has primarily focused on weighting on-policy and off-policy data differently in both the policy update and the discriminator update. ValueDICE (Kostrikov et al., 2020) mitigates this problem by deriving an objective from the original distribution matching problem that only requires off-policy samples to compute. More recently, methods such as IQ-Learn (Garg et al., 2021) have been developed to learn soft Q-functions over the environment space, which encode both a reward and a policy for inverse reinforcement learning, and model-based methods such as V-MAIL (Rafailov et al., 2021) have shown that using expressive world models (Hafner et al., 2020) leads to strong imitation results in domains with high-dimensional observations. Other off-policy IL works include SoftDICE (Sun et al., 2021), SparseDICE (Camacho et al., 2021), and AdVIL/AdRIL/DAeQuIL (Swamy et al., 2021). Orthogonally, on the offline side, where environment interaction is prohibited, works on both the model-based side (Chang et al., 2021) and the model-free side (Kim et al., 2022; Yu et al., 2023) have shown that distribution matching is still possible in these settings. These approaches generally operate either by learning a transition model of the environment in which to roll out for policy optimization (Chang et al., 2021), or by optimizing a modified version of the objective introduced in (Kostrikov et al., 2020) using samples from the suboptimal offline dataset, as opposed to on-policy samples, for computation.
Boosting style approach in deep learning & RL. The idea of using boosting for policy learning is not new in the deep learning or reinforcement learning literature. On the deep learning side, AdaGAN (Tolstikhin et al., 2017) applies standard adaptive boosting to GANs (Goodfellow et al., 2014) to address issues such as mode collapse, while concurrent work (Grover and Ermon, 2017) showed the benefits of boosting in general Bayesian mixture models. In RL, conservative policy iteration (CPI) (Kakade and Langford, 2002a) can be understood as performing gradient boosting in the policy space (Scherrer and Geist, 2014). The authors in (Hazan et al., 2019) use a gradient boosting style approach to learn maximum entropy policies. Here, we perform gradient boosting in the space of state-action occupancy measures, which leads to a principled off-policy IL approach.

4.3 Preliminaries

We consider a discounted infinite horizon MDP $\mathcal{M} = \langle \mathcal{S}, P, \mathcal{A}, r, \gamma, \mu_0 \rangle$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function and $r(s, a)$ is the reward for the given state-action pair, $\gamma \in (0, 1)$ is the discount factor, $\mu_0 \in \Delta(\mathcal{S})$ is the initial state distribution, and $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition function. A policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ interacts with this MDP, creating trajectories $\tau$ composed of state-action pairs $\{(s_t, a_t)\}_{t=1}^{T}$. We use $d^{\pi}_t$ to denote the state-action visitation distribution induced by $\pi$ at timestep $t$ and $d^{\pi} = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t d^{\pi}_t$ to denote the average state-action visitation distribution induced by policy $\pi$. We define the value function and Q-function of our policy as $V^{\pi}(s) = \mathbb{E}_{\pi}[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) | s_0 = s]$ and $Q^{\pi}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}[V^{\pi}(s')]$. The goal of RL is to find a policy that maximizes the expected cumulative reward.

In imitation learning, instead of having access to the reward function, we assume access to demonstrations $\mathcal{D}_e = \{(s_i, a_i)\}_{i=1}^{N}$ from an expert policy $\pi_e$ that our policy can take advantage of while training.
Note that $\pi^e$ is not necessarily a Markovian policy. It may be an ensemble of weighted Markovian policies, i.e., $\pi^e = \{\alpha_i, \pi_i\}_{i=1}^{n}$ with $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$, which means that for each episode, $\pi^e$ first randomly samples a policy $\pi_i$ with probability $\alpha_i$ at $t = 0$, and then executes $\pi_i$ for the entire episode (i.e., no switch to other policies during an episode). It is well known that the space of state-action distributions induced by such ensembles is larger than the space of state-action distributions induced by Markovian policies (Hazan et al., 2019).

The goal in IL is then to learn a policy that robustly mimics the expert. The simplest imitation learning algorithm is behavior cloning (BC): $\arg\min_{\pi \in \Pi} \mathbb{E}_{(s,a) \sim D^e}[\ell(\pi(s), a)]$, where $\ell$ is a classification loss and $\Pi$ is our policy class. Though this objective is simple, it is known to suffer from covariate shift at test time (Pomerleau, 1988; Ross et al., 2011a). Instead of minimizing action distribution divergence conditioned on expert states, algorithms such as inverse RL (Ziebart et al., 2008) and adversarial IL (Ho and Ermon, 2016a; Finn et al., 2016a; Ke et al., 2020; Sun et al., 2019d) directly minimize a divergence between state-action distributions, which helps address the covariate shift issue (Agarwal et al., 2019).

4.3.1 Adversarial Imitation Learning (AIL)

The goal of AIL is to directly minimize a divergence between the state-action visitation $d^{\pi}$ of a behavior policy and the state-action visitation $d^{\pi^e}$ of an expert policy. Different choices of divergence result in different AIL algorithms. The most popular AIL algorithm is Generative Adversarial Imitation Learning (GAIL) (Ho and Ermon, 2016a), which minimizes the JS-divergence. GAIL is an on-policy adversarial imitation learning algorithm that connects Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) and maximum entropy IRL (Ziebart et al., 2008).
GAIL trains a binary classifier, called the discriminator $D(s, a)$, to distinguish between samples from the expert distribution and samples from the policy-generated distribution. Using the discriminator to define a reward function, GAIL then executes an on-policy RL algorithm such as Trust Region Policy Optimization (TRPO) (Schulman et al., 2017a) or Proximal Policy Optimization (PPO) (Schulman et al., 2017b) to maximize the reward. This gives the following adversarial objective:

$$\min_{\pi} \max_{D} \; \mathbb{E}_{s,a \sim \pi}\left[\log D(s, a)\right] + \mathbb{E}_{s,a \sim \pi^e}\left[\log(1 - D(s, a))\right] - \lambda H(\pi) \tag{4.1}$$

where $H(\pi)$ is an entropy regularization term. The first term in Equation (4.1) can be viewed as a pseudo-reward that is optimized with respect to the policy $\pi$ using on-policy samples. Note that GAIL typically optimizes both policies and discriminators using on-policy samples, making it quite sample inefficient. Using different divergences, there are various reward functions that can be optimized with this framework (Orsini et al., 2021). In this work, while our proposed approach is in general capable of optimizing many common divergences, we mainly focus on the reverse KL divergence in our experiments. Reverse KL divergence has been studied in prior works including Fu et al. (2018) and Ke et al. (2020). Unlike these prior works, we propose an off-policy method for optimizing reverse KL by leveraging the framework of boosting.

4.3.2 Discriminator Actor Critic (DAC)

One reason GAIL needs many interactions with the environment to learn properly is its dependence on on-policy approaches to optimize discriminators and policies. In particular, GAIL does not reuse any old samples. Discriminator Actor Critic (DAC) (Kostrikov et al., 2019b) extends GAIL to take advantage of off-policy learning when optimizing the discriminators and policies. DAC introduces a replay buffer $R$ to represent the history of transitions observed throughout training in the context of IRL.
This replay buffer allows DAC to perform off-policy training of the policy and the discriminator (similar to Sasaki et al. (2019)). Formally, DAC optimizes its discriminator with the objective

$$\max_{D} \; \mathbb{E}_{s,a \sim R}\left[\log D(s, a)\right] + \mathbb{E}_{s,a \sim \pi^e}\left[\log(1 - D(s, a))\right], \tag{4.2}$$

which minimizes the divergence between the expert distribution and the distribution of the replay buffer $R$. Intuitively, this does not strictly capture the divergence between our policy distribution and the expert distribution; instead, it captures the divergence between the expert and an evenly weighted mixture of the policies learned up to the current policy. To rigorously recover a divergence between our policy distribution and the expert distribution, we need to apply importance weights:

$$\min_{\pi} \max_{D} \; \mathbb{E}_{s,a \sim R}\left[\frac{p_{\pi}(s, a)}{p_{R}(s, a)} \log D(s, a)\right] + \mathbb{E}_{s,a \sim \pi^e}\left[\log(1 - D(s, a))\right] - \lambda H(\pi).$$

While this objective recovers the on-policy objective of GAIL (Equation (4.1)), the authors note that estimating the density ratio is difficult and has high variance in practice. Furthermore, they note that not using importance weights (Equation (4.2)) works well in practice but does not guarantee successful imitation, especially when the distribution induced by the replay buffer $R$ is far from our current policy's state-action distribution. This is a fundamental problem of DAC.

4.3.3 ValueDICE

ValueDICE (Kostrikov et al., 2020) was proposed to address the density estimation issue of off-policy AIL algorithms formalized in DAC (see Section 4.3.2). ValueDICE aims to minimize the reverse KL divergence written in its Donsker-Varadhan (Donsker and Varadhan, 1983) dual form:

$$-\mathrm{KL}(d^{\pi} \,\|\, d^{\pi^e}) = \min_{x : S \times A \mapsto \mathbb{R}} \; \log \mathbb{E}_{(s,a) \sim d^{\pi^e}}\left[e^{x(s,a)}\right] - \mathbb{E}_{(s,a) \sim d^{\pi}}\left[x(s, a)\right] \tag{4.3}$$

Motivated by DualDICE (Nachum et al., 2019a), ValueDICE performs a change of variables using the Bellman operator $B^{\pi}$ (see footnote 2) with respect to the policy $\pi$, namely $x(s, a) = \nu(s, a) - B^{\pi}\nu(s, a)$, resulting in the following objective:

$$\max_{\pi} \min_{\nu : S \times A \to \mathbb{R}} \; \log \mathbb{E}_{s,a \sim \pi^e}\left[\exp\left(\nu(s, a) - B^{\pi}\nu(s, a)\right)\right] - (1 - \gamma)\, \mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi}\left[\nu(s_0, a_0)\right].$$
(4.4)

Now the objective function does not contain the on-policy distribution $d^{\pi}$; it involves only the initial state distribution $\mu_0$ and the expert distribution. Despite only using $d^{\pi^e}$ and $\mu_0$, the authors identified two aspects of the objective that yield biased estimates. First, the first expectation has a logarithm outside of it, which makes mini-batch estimates of this term biased. Second, inside the first expectation we have $\nu(s, a) - B^{\pi}\nu(s, a)$, with $B^{\pi}$ being the Bellman operator. Because $B^{\pi}$ itself contains an expectation over next states, and that expectation sits inside the nonlinear exponential, ValueDICE's objective is only unbiased for environments with deterministic transitions. This is related to the well-known double sampling issue in TD learning. Although many popular RL benchmarks have deterministic transitions (Bellemare et al., 2013; Tassa et al., 2018; Todorov et al., 2012b), this limitation is not present in GAIL.

Footnote 2: A Bellman operator $B^{\pi}$ is defined as follows: given any function $f(s, a)$, we have $B^{\pi} f(s, a) := r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)} f(s', \pi(s'))$ for all $s, a$.

We take a different perspective than ValueDICE to derive an off-policy AIL algorithm. Unlike ValueDICE, our approach is both off-policy and amenable to mini-batch updates even with stochastic environment transition dynamics.

4.4 Algorithm

Our algorithm, Adversarial Imitation Learning via Boosting (AILBoost), is motivated by classic gradient boosting algorithms (Friedman, 2001; Mason et al., 1999) and attempts to mitigate a fundamental issue related to off-policy imitation learning formalized in DAC (see Section 4.3.2). The key idea is to treat learned policies as weak learners, form an ensemble of them (with a proper weighting scheme derived from a gradient boosting perspective), and update the ensemble via gradient boosting.

Weighted policy ensemble. Our algorithm learns a weighted ensemble of policies, denoted as $\pi := \{\alpha_i, \pi_i\}_{i=1}^{n}$ with $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$, and each $\pi_i$ being a Markovian policy.
The mixture works as follows: when executing $\pi$, at the beginning of an episode a Markovian policy $\pi_i$ is sampled with probability $\alpha_i$, and then $\pi_i$ is executed for the entire episode (i.e., no policy switch within an episode). Note that $\pi$ itself is no longer a Markovian policy due to the sampling step at the beginning of the episode; in fact, the state-action distribution induced by such a mixture policy can be richer than that induced by Markovian policies (Hazan et al., 2019). This is consistent with the idea of boosting: by combining weak learners, i.e., Markovian policies, we form a more powerful policy. Given the above definition of $\pi$, we immediately have $d^{\pi} := \sum_i \alpha_i d^{\pi_i}$, i.e., the weighted mixture of the state-action distributions induced by the Markovian policies $\pi_i$. Notation-wise, given a dataset $D$, we denote by $\hat{\mathbb{E}}_{D}[f(x)]$ the empirical average of $f$ over the dataset, i.e., $\hat{\mathbb{E}}_{D}[f(x)] = \sum_{x \in D} f(x) / |D|$.

4.4.1 AILBoost: Adversarial Imitation Learning via Boosting

We would like to minimize the reverse KL divergence between our policy's state-action distribution $d^{\pi}$ and the expert distribution $d^{\pi^e}$, denoted by $\ell(d^{\pi}, d^{\pi^e}) = \mathrm{KL}(d^{\pi} \,\|\, d^{\pi^e}) := \sum_{s,a} d^{\pi}(s, a) \ln\left(d^{\pi}(s, a) / d^{\pi^e}(s, a)\right)$. We focus on reverse KL for three reasons: (1) it has been argued that the mode-seeking property of reverse KL is more suitable for imitation learning (Ke et al., 2020); (2) reverse KL is on-policy in nature, i.e., it focuses on minimizing the divergence between our policy's action distribution and the expert's at the states visited by our policy, which helps address the covariate shift issue; and (3) the baselines we consider in experiments, DAC and ValueDICE, both minimize the reverse KL divergence in practice, as AIRL does (see footnote 3). At a high level, our approach directly optimizes $\ell(d^{\pi}, d^{\pi^e})$ via gradient boosting (Mason et al., 1999) in the state-action occupancy space. Our ensemble $\pi$ induces the following mixture state-action occupancy measure: $d^{\pi} := \sum_{i=1}^{t} \alpha_i d^{\pi_i}$, $\alpha_i \ge 0$.
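The episode-level sampling that defines the mixture policy, together with the empirical average $\hat{\mathbb{E}}_{D}$, can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the names `run_episode`, `step_fn`, and `empirical_mean` are hypothetical, and policies are assumed to be plain callables from state to action.

```python
import random


def sample_policy_index(alphas, rng):
    """Draw index i with probability alphas[i] (alphas are assumed to sum to 1)."""
    return rng.choices(range(len(alphas)), weights=alphas, k=1)[0]


def run_episode(policies, alphas, step_fn, s0, horizon, rng):
    """Sample ONE weak learner at t = 0 and execute it for the whole episode.

    `step_fn(s, a)` is a hypothetical environment transition function.
    Returns the sampled index and the list of visited (state, action) pairs.
    """
    i = sample_policy_index(alphas, rng)
    pi = policies[i]  # no switching to another policy inside the episode
    s, trajectory = s0, []
    for _ in range(horizon):
        a = pi(s)
        trajectory.append((s, a))
        s = step_fn(s, a)
    return i, trajectory


def empirical_mean(f, dataset):
    """Empirical average of f over a dataset: sum_x f(x) / |D|."""
    return sum(f(x) for x in dataset) / len(dataset)
```

Because the index is drawn once per episode, the state-action pairs collected this way are distributed according to $\sum_i \alpha_i d^{\pi_i}$, which is generally different from what a single policy that re-mixes actions at every step would induce.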
To compute a new weak learner $\pi_{t+1}$, we first compute the functional gradient of the loss $\ell$ with respect to $d^{\pi}$, i.e., $\nabla \ell(d, d^{\pi^e})|_{d = d^{\pi}}$. The new weak learner $\pi_{t+1}$ is learned via the following optimization procedure: $\pi_{t+1} = \arg\max_{\tilde{\pi} \in \Pi} \langle d^{\tilde{\pi}}, -\nabla \ell(d, d^{\pi^e})|_{d = d^{\pi}} \rangle$. Namely, we search for a new policy $\pi_{t+1}$ such that its state-action occupancy measure $d^{\pi_{t+1}}$ is aligned with the negative gradient $-\nabla \ell$ as much as possible. Note that the above optimization problem can be understood as an RL procedure where the reward function is defined as $-\nabla \ell(d, d^{\pi^e})|_{d = d^{\pi}} \in \mathbb{R}^{|S||A|}$. Once we compute the weak learner $\pi_{t+1}$, we mix it into the policy ensemble with a fixed learning rate $\alpha \in (0, 1)$, denoted as $d^{\pi'} = (1 - \alpha) d^{\pi} + \alpha d^{\pi_{t+1}}$. Note that this mixing step can be interpreted as gradient boosting in the state-action occupancy space directly: we rewrite the update procedure as $d^{\pi'} = d^{\pi} + \alpha (d^{\pi_{t+1}} - d^{\pi})$, where the ascent direction $d^{\pi_{t+1}} - d^{\pi}$ approximates the (negative) functional gradient $-\nabla \ell$, since $\arg\max_{\tilde{\pi}} \langle d^{\tilde{\pi}} - d^{\pi}, -\nabla \ell \rangle = \pi_{t+1}$ by the definition of $\pi_{t+1}$.

Footnote 3: See the official repository, https://github.com/google-research/google-research/tree/master/dac

Algorithm 5 AILBoost (Adversarial Imitation Learning via Boosting)
Require: number of iterations $T$, expert data $D^e$, weighting parameter $\alpha$
1: Initialize $\pi_1$, weight $\alpha_1 = 1$, replay buffer $B = \emptyset$
2: for $t = 1, \dots, T$ do
3:   Construct the $t$-th dataset $D_t = \{(s_j, a_j)\}_{j=1}^{N}$ where $s_j, a_j \sim d^{\pi_t}$ for all $j$.
4:   Compute the discriminator $\hat{g}$ using the weighted replay buffer:
       $$\hat{g} = \arg\max_{g} \; \hat{\mathbb{E}}_{s,a \in D^{e}}\left[-\exp(g(s, a))\right] + \sum_{i=1}^{t} \alpha_i \hat{\mathbb{E}}_{s,a \in D_i}\left[g(s, a)\right] \tag{4.5}$$
5:   Set $B \leftarrow B \cup D_t$
6:   Compute the weak learner $\pi_{t+1}$ via an off-policy RL approach (e.g., SAC) on reward $-\hat{g}(s, a)$ with replay buffer $B$
7:   Set $\alpha_i \leftarrow \alpha_i (1 - \alpha)$ for $i \le t$, and $\alpha_{t+1} = \alpha$
8: end for
9: Return ensemble $\pi = \{(\alpha_i, \pi_i)\}_{i=1}^{T}$
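The weight update in Line 7 of Algorithm 5 can be checked in isolation. The sketch below (function names are hypothetical) applies the update repeatedly and makes it easy to confirm that the weights always sum to one and decay geometrically with age, which is the ensemble-level counterpart of the occupancy update $d^{\pi'} = (1 - \alpha) d^{\pi} + \alpha d^{\pi_{t+1}}$.

```python
def update_weights(weights, alpha):
    """One application of Line 7: shrink every old weight by (1 - alpha), append alpha."""
    return [w * (1.0 - alpha) for w in weights] + [alpha]


def ensemble_weights(num_updates, alpha):
    """Weights after `num_updates` boosting steps, starting from alpha_1 = 1."""
    weights = [1.0]
    for _ in range(num_updates):
        weights = update_weights(weights, alpha)
    return weights


def closed_form_weights(num_updates, alpha):
    """Closed form of the same recursion, for comparison.

    The first policy keeps (1 - alpha)^t; policy i >= 2 has alpha * (1 - alpha)^(t + 1 - i),
    where t = num_updates and policies are indexed from 1.
    """
    t = num_updates
    first = (1.0 - alpha) ** t
    rest = [alpha * (1.0 - alpha) ** (t + 1 - i) for i in range(2, t + 2)]
    return [first] + rest
```

For example, `ensemble_weights(4, 0.2)` yields five weights whose last entry is `0.2` and whose first entry is `0.8 ** 4`; older policies therefore contribute exponentially less to the mixture.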
It has been shown that such a procedure is guaranteed to minimize the objective function (i.e., reverse KL in this case) as long as the objective is smooth (our loss $\ell$ is smooth as long as $d^{\pi}$ is non-zero everywhere); see, e.g., Hazan et al. (2019) for the claim (see also footnote 4).

Footnote 4: Note that, similar to AdaBoost, each weak learner does not directly optimize the original objective, but the weighted combination of the weak learners optimizes the original objective function, which is the reverse KL in our case.

Algorithmically, we first express the reverse KL divergence in its variational form (Nowozin et al., 2016; Ke et al., 2020):

$$\mathrm{KL}(d^{\pi} \,\|\, d^{\pi^e}) := \max_{g} \left[ \mathbb{E}_{s,a \sim d^{\pi^e}}\left[-\exp(g(s, a))\right] + \mathbb{E}_{s,a \sim d^{\pi}}\, g(s, a) \right]$$

where $g : S \times A \mapsto \mathbb{R}$ is a discriminator. The benefit of using this variational form is that the functional (sub-)gradient of the reverse KL with respect to $d^{\pi}$ is easy to compute: it is $\hat{g} = \arg\max_{g} \left[ \mathbb{E}_{s,a \sim d^{\pi^e}}\left[-\exp(g(s, a))\right] + \mathbb{E}_{s,a \sim d^{\pi}}\, g(s, a) \right]$, i.e., $\hat{g}$ is a functional sub-gradient of the loss $\mathrm{KL}(d^{\pi} \,\|\, d^{\pi^e})$ with respect to $d^{\pi}$. The maximizing discriminator $\hat{g}$ serves as a reward function for learning the next weak learner $\pi_{t+1}$, that is,

$$\pi_{t+1} = \arg\max_{\pi} \mathbb{E}_{s,a \sim d^{\pi}}\left[-\hat{g}(s, a)\right] = \arg\max_{\pi} \langle d^{\pi}, -\hat{g} \rangle. \tag{4.6}$$

To compute $\hat{g}$ in practice, we need unbiased estimates of the expectations via sample averaging, which is straightforward in our case. The expectation $\mathbb{E}_{s,a \sim d^{\pi^e}}$ can be approximated with the expert dataset $D^e$. To approximate $\mathbb{E}_{s,a \sim d^{\pi}}$, where $d^{\pi}$ is a mixture distribution, we maintain a replay buffer $D_i$ for each weak learner $\pi_i$ that contains samples $s, a \sim d^{\pi_i}$, and then weight $D_i$ by the weight $\alpha_i$ associated with $\pi_i$. In summary, we optimize $g$ as shown in Equation (4.5) in Algorithm 5 (the weighted sum over the $D_i$ is the empirical expectation induced by the weighted replay buffer). The optimization problem in Equation (4.5) can be solved via stochastic gradient ascent on $g$ (see footnote 5). With $\hat{g}$, we can optimize for $\pi_{t+1}$ using any off-the-shelf RL algorithm, making the entire algorithm off-policy.
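To make the discriminator objective in Equation (4.5) concrete, the following sketch solves it for a tabular $g$ on a tiny, made-up set of state-action pairs. It is an illustration under stated assumptions, not the training code of the thesis: it uses full-batch gradient ascent instead of stochastic updates on a neural network, the function names are hypothetical, and it assumes every buffered pair also appears in the expert data so that the maximizer is finite.

```python
import math
from collections import Counter


def empirical_dist(dataset, support):
    """Empirical frequency of each element of `support` in `dataset`."""
    counts = Counter(dataset)
    return {x: counts[x] / len(dataset) for x in support}


def fit_discriminator(expert_data, buffers, alphas, lr=0.5, steps=5000):
    """Maximize  E_expert[-exp(g)] + sum_i alpha_i * E_{D_i}[g]  over tabular g.

    Returns the fitted g together with the expert and mixture frequencies.
    """
    support = sorted(set(expert_data))
    p_e = empirical_dist(expert_data, support)
    # Weighted replay buffer: mixture of per-learner empirical distributions.
    p_mix = {x: 0.0 for x in support}
    for alpha_i, d_i in zip(alphas, buffers):
        p_i = empirical_dist(d_i, support)
        for x in support:
            p_mix[x] += alpha_i * p_i[x]
    g = {x: 0.0 for x in support}
    for _ in range(steps):
        for x in support:
            # d/dg(x) of the objective: -p_e(x) * exp(g(x)) + p_mix(x)
            g[x] += lr * (-p_e[x] * math.exp(g[x]) + p_mix[x])
    return g, p_e, p_mix


# Toy data: three (state, action) pairs; counts are illustrative only.
x0, x1, x2 = (0, 0), (0, 1), (1, 0)
EXPERT = [x0] * 5 + [x1] * 3 + [x2] * 2
BUFFERS = [[x0] * 2 + [x1] * 2 + [x2] * 6, [x0] * 4 + [x1] * 4 + [x2] * 2]
ALPHAS = [0.5, 0.5]
```

Setting the gradient to zero shows that the maximizer is $\hat{g}(s, a) = \ln\left(d^{\pi}(s, a) / d^{\pi^e}(s, a)\right)$ on this toy problem, so the reward $-\hat{g}$ handed to the weak learner is large exactly where the ensemble visits a pair less often than the expert does.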
In our experiments, we use SAC as the RL oracle for $\arg\max_{\pi} \mathbb{E}_{s,a \sim d^{\pi}}[-\hat{g}(s, a)]$. Once $\pi_{t+1}$ is computed, we mix $\pi_{t+1}$ into the mixture and adjust the weights of older policies accordingly, i.e., $\alpha_{t+1} = \alpha$ and $\alpha_i \leftarrow \alpha_i (1 - \alpha)$ for all $i \le t$. Note that this weighting scheme ensures that older policies receive less weight in the ensemble.

Footnote 5: Note that, unlike ValueDICE, here we can easily use a finite number of samples to obtain an unbiased estimate of the loss by replacing expectations with their corresponding sample averages.

Remark 19. The use of SAC as the weak learning algorithm and the new way of computing the discriminator from Equation (4.5) make the whole training process completely off-policy. In particular, unlike most adversarial IL approaches, which compute discriminators by comparing on-policy samples from the latest policy against the expert samples, we train the discriminator using all the data collected so far (with proper weighting derived from the boosting framework). The connection to boosting and the proper weighting provide a principled way of leveraging off-policy samples for updating discriminators. As we will show, compared to DAC, which also uses off-policy samples for training policies and discriminators, our principled approach leads to better performance.

Algorithm 5 (AILBoost) summarizes the above procedure. In Line 6, we use SAC as the RL oracle for computing the weak learner. In practice, we do not run SAC from scratch every time in Line 6. Instead, SAC maintains its own replay buffer, which contains all interactions it has had with the environment so far. When computing $\pi_{t+1}$, we first update the rewards in the replay buffer using the latest learned reward function $-\hat{g}$, and we always warm start from $\pi_t$. We include the detailed algorithmic description in Appendix C.1.

Memory cost. Note that at the end, our algorithm returns a weighted ensemble of Markovian policies.
Compared to prior works such as DAC, maintaining the weak learners may incur additional memory cost. However, the benefit of the weighted ensemble is that it induces richer state-action distributions than Markovian policies do. In practice, if memory cost becomes a burden (it did not in our experiments, even with image-based control policies), we may keep only the latest few policies (note that very old policies have exponentially small weights anyway).

4.5 Experiments

In this section we aim to empirically investigate the following questions: (1) How does AILBoost perform relative to other off-policy and state-of-the-art IL methods? (2) Does AILBoost enjoy the sample complexity and scalability benefits of modern off-policy IL methods? (3) How robust is AILBoost across different adversarial training schedules?

Task                 Difficulty
Ball in Cup Catch    Easy
Walker Walk          Easy
Cheetah Run          Medium
Quadruped Walk       Medium
Humanoid Stand       Hard

Table 4.1: Spread of environments evaluated from the DeepMind Control Suite with hardness designations from (Yarats et al., 2022).

We evaluate AILBoost on 5 environments from the DeepMind Control Suite benchmark (Tassa et al., 2018): Walker Walk, Cheetah Run, Ball in Cup Catch, Quadruped Walk, and Humanoid Stand. For each environment, we train an expert RL agent using the environment's reward and collect 10 demonstrations, which we use as the expert dataset throughout our experiments. We compare AILBoost against the following baselines: DAC, an empirically successful off-policy IL algorithm; IQ-Learn, a state-of-the-art IL algorithm; ValueDICE, another off-policy IL method; and BC on the expert data used across all algorithms. We emphasize our comparison to IQ-Learn, as it has been shown to outperform many other imitation learning baselines (e.g., SQIL (Reddy et al., 2019)) across a variety of control tasks (Garg et al., 2021).
The base RL algorithm we used for training the expert, as well as for AILBoost and DAC, was SAC for controller state-based experiments and DrQ-v2 (Yarats et al., 2022) for image-based experiments. For IQ-Learn and ValueDICE, we used the respective codebases and hyperparameters provided by the authors; both methods use SAC as their base RL algorithm. Please refer to Appendix C.2 for experimental details, training hyperparameters, and expert dataset specifications.

4.5.1 Controller State-based Experiments

[Figure 4.1: three groups of panels (10 demos, 5 demos, 1 demo), each reporting IQM, Mean, and Optimality Gap of the expert-normalized score for BC, ValueDICE, IQ-Learn, DAC, and AILBoost.]

Figure 4.1: Aggregate metrics on DMC environments with 95% confidence intervals (CIs) based on 5 environments spanning easy, medium, and hard tasks. Higher inter-quartile mean (IQM) and mean scores (right) and lower optimality gap (left) are better. The CIs were calculated with percentile bootstrap with stratified sampling over three random seeds, and all metrics are reported on the expert-normalized scores. AILBoost outperforms DAC, ValueDICE, IQ-Learn, and BC across all metrics, numbers of expert demonstrations, and tasks.

Figure 4.1 shows our aggregate results across the five DeepMind Control Suite (DMC) tasks that we tested on. We chose these five tasks by difficulty, as shown in Table 4.1. For evaluation, we follow the recommendations of Agarwal et al. (2021b) and report the aggregate inter-quartile mean, mean, and optimality gap of AILBoost and all the baselines on the DMC suite with 95% confidence intervals. We find that AILBoost not only outperforms all baselines but also consistently matches the expert with only 1 expert trajectory.
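For readers unfamiliar with the aggregate metrics mentioned above, the sketch below computes an inter-quartile mean and an optimality gap from a flat list of expert-normalized scores. It is a simplified reading of those metrics and not the evaluation code behind Figure 4.1: it assumes the trimmed-mean convention that discards `int(0.25 * n)` scores from each end, assumes a target score of 1.0 (expert level), and omits the stratified bootstrap used for the confidence intervals. The function names are hypothetical.

```python
def interquartile_mean(scores):
    """Mean of the scores after discarding the lowest and highest 25%.

    Assumption: int(0.25 * n) entries are cut from each end of the sorted list.
    """
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def optimality_gap(scores, target=1.0):
    """Average amount by which scores fall short of `target` (scores above it count as 0)."""
    return sum(max(target - s, 0.0) for s in scores) / len(scores)


# Illustrative expert-normalized scores (made up, one entry per run).
EXAMPLE_SCORES = [0.1, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.2]
```

On `EXAMPLE_SCORES`, the inter-quartile mean averages the four middle values (0.6 through 0.9), so the single poor run at 0.1 and the unusually good run at 1.2 do not move it, whereas the optimality gap is driven entirely by how far the sub-expert runs fall below 1.0.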
When we inspect the 1-trajectory case more closely, Figure 4.2 shows the learning curves on three representative environments (one easy, one medium, and one hard task), where we see AILBoost maintain high sample efficiency and strong imitation while state-of-the-art baselines like IQ-Learn completely fail on Humanoid Stand. Finally, we note that AILBoost greatly outperforms ValueDICE, which aimed to make AIL off-policy from a different perspective. We refer readers to Figure C.2 in the appendix for the learning curves on all five environments with different numbers of expert demonstrations.

Figure 4.2: Learning curves with 1 expert trajectory across 3 random seeds. Note that AILBoost successfully imitates the expert on all environments where other baselines fail and achieves better sample complexity than DAC. Note that as the environment difficulty level increases, our method shows a larger performance gap compared to baselines (e.g., Humanoid Stand).

[Figure 4.3: mean score versus number of samples on Walker Walk and Cheetah Run for Expert, BC, DAC, and AILBoost.]

Figure 4.3: Image based: performance on image-based DMC environments, Walker Walk and Cheetah Run, comparing AILBoost, DAC, and BC on three random seeds.

4.5.2 Image-based Experiments

Figure 4.3 demonstrates the scalability of AILBoost on a subset of environments with 10 expert trajectories. For these experiments, we use DrQ-v2 (Yarats et al., 2022) as the underlying off-policy RL algorithm for both DAC and AILBoost. On Walker Walk and Cheetah Run, we see performance comparable to or better than DAC, demonstrating that our boosting strategy maintains the empirical scaling properties of DAC. Furthermore, our use of different off-policy RL algorithms shows the versatility of AILBoost for IL.
4.5.3 Sensitivity to gradient-based optimization for weak learners and discriminators

[Figure 4.4: mean score versus number of samples on Ball in Cup Catch and Walker Walk for Expert and four schedules: 1000 P / 100 D, 1000 P / 10 D, 1000 P / 1 D, and 100 P / 100 D, where P denotes policy updates and D denotes discriminator updates.]

Figure 4.4: Policy and Discriminator Update Schedules: Learning curves for AILBoost on two representative DMC environments, Walker Walk and Ball in Cup Catch, when optimizing with varying policy and discriminator update schemes across 3 seeds.

Our algorithm relies on solving the optimization problems in Eq. 4.6 and Eq. 4.5 for weak learners and discriminators, where the weak learner is optimized by SAC and the discriminators are optimized by SGD. While it is hard to guarantee in general that we can solve these optimization problems exactly, because our policies and discriminators are both non-convex neural networks, we found that approximately solving Eq. 4.6 and Eq. 4.5 via gradient-based updates is enough to ensure good performance. In this section, we test AILBoost across a variety of optimization schedules. Overall, we find AILBoost to be robust to optimization schedules: approximately optimizing Eq. 4.6 and Eq. 4.5 with a sufficient number of gradient updates ensures successful imitation. However, there is a sample complexity cost when over-optimizing either the discriminator or the policy.

Figure 4.4 shows our investigation of how sensitive AILBoost is to different optimization schedules for both the policy and the discriminator on two representative DMC environments. In particular, we test with 5 expert demonstrations, where we vary the number of discriminator and policy updates.
We test the following update schemes:
• 1000 policy updates per 100 discriminator updates
• 1000 policy updates per 10 discriminator updates
• 1000 policy updates per 1 discriminator update
• 100 policy updates per 100 discriminator updates

These ranges test various optimization schemes around the schedule that we chose for the main results. We find that the more policy updates we do per discriminator update, the less sample efficient the algorithm becomes, despite asymptotically reaching expert performance. We also found that an insufficient number of updates on the discriminator generally hurts performance. This is expected, since insufficient updates on the discriminator may result in a $\hat{g}$ that does not optimize Eq. 4.5 well enough.

4.6 Conclusion

We present a fully off-policy adversarial imitation learning algorithm, AILBoost. Unlike previous attempts at making AIL off-policy, AILBoost uses the gradient boosting framework to provide a principled way of re-using old data for learning discriminators and policies. We show that our algorithm achieves state-of-the-art performance on state-based tasks from the DeepMind Control Suite while being able to scale to high-dimensional pixel observations. We are excited to extend this framework to discrete control, as well as to investigate imitation learning from observations alone under this boosting framework.

Part II
IL and RL for Generative Models

CHAPTER 5
LEARNING TO GENERATE BETTER THAN YOUR LLM

Reinforcement learning (RL) has emerged as a powerful paradigm for fine-tuning Large Language Models (LLMs) for text generation. In particular, recent LLMs such as ChatGPT and GPT-4 can engage in fluent conversations with users after fine-tuning with RL. Capitalizing on key properties of text generation, we investigate RL algorithms beyond general-purpose algorithms like Proximal Policy Optimization (PPO).
In particular, we extend RL algorithms to allow them to interact with a dynamic black-box guide LLM and propose RL with guided feedback (RLGF), a suite of RL algorithms for LLM fine-tuning. We provide two ways for the guide LLM to interact with the LLM being optimized to maximize rewards. The guide LLM can generate text that serves as additional starting states for the RL optimization procedure. The guide LLM can also be used to complete the partial sentences generated by the LLM being optimized, treating the guide LLM as an expert to imitate and eventually surpass. We experiment on the IMDB positive sentiment, CommonGen, and TL;DR summarization tasks. We show that our RL algorithms achieve higher performance than supervised learning (SL) and the RL baseline PPO, demonstrating the benefit of interaction with the guide LLM. On both CommonGen and TL;DR, we not only outperform our SL baselines but also improve upon PPO across a variety of metrics beyond the one we optimized for. Our code can be found at https://github.com/Cornell-RL/tril.

5.1 Introduction

Large Language Models (LLMs) have become very capable in various real-world applications: they can answer open-ended questions on numerous topics (Zhang et al., 2022), write articles from short descriptions (Goyal et al., 2022), generate code (Github, 2023), follow robot commands (Huang et al., 2022), solve puzzles (Bubeck et al., 2023), and have even been showcased as assistive models for education (Khan Academy, 2023) and healthcare (Lee et al., 2023c).

However, using supervised learning (SL) to train LLMs presents a challenging metric mismatch (Wiseman and Rush, 2016) between the training and testing regimes. The metric mismatch arises because the training metric is the log-loss while the testing metrics are task-specific, such as BLEU or user satisfaction rating.
This discrepancy is magnified when fine-tuning LLMs on downstream tasks where the main goal is not just producing fluent text but also being proficient at solving the specific task. Another mismatch is between the training and testing distributions. SL methods train models on given static datasets, while at inference time the LLM needs to make predictions conditioned on text it has generated itself. Such a distribution mismatch between training and testing has been widely observed in the literature on imitation learning and RL (Ross et al., 2011a), robotics (Ross et al., 2013), and NLP (Bengio et al., 2015; Arora et al., 2022).

Reinforcement Learning (RL) addresses these mismatches by directly optimizing the metrics through reward feedback on the states generated by the RL agent itself. The ability to test in the real world and obtain reward feedback to correct and improve the agent's behavior on the fly makes RL a more powerful learning paradigm than SL. Recently, OpenAI fine-tuned LLMs with RL from human feedback (RLHF) to better align LLMs with human intentions, leading to the great success of ChatGPT (OpenAI, 2023). Following this, multiple other models trained with RL, such as Anthropic's Claude 2 (Anthropic, 2023) and Meta's Llama 2 (Touvron et al., 2023), further proved the effectiveness of RL. Recently, the GRUE benchmark (Ramamurthy et al., 2022a) systematically studied RL versus SL when fine-tuning LLMs on downstream tasks with predefined rewards. GRUE's preliminary results demonstrate the benefit of RL when fine-tuning LLMs, leading to the release of popular codebases such as RL4LMs (Ramamurthy et al., 2022a), TRLx (CarperAI, 2023), and AlpacaFarm (Dubois et al., 2023), which enable RL for language models. However, ChatGPT, RL4LMs, TRLx, and AlpacaFarm all use vanilla policy gradient methods, which are known to be sample inefficient and sensitive to local minima due to the combinatorially large search space of natural language generation (Ramamurthy et al., 2022a).

Figure 5.1: (Left) Reinforcement Learning with Guided Feedback (RLGF) flow chart showing how breaking up generations into two parts, rollins and rollouts performed by different LLMs, opens up a rich framework of interaction when compared to (Right) Reinforcement Learning (RL). RLGF uses a guide policy $\pi^g$ to guide the training of the policy $\pi$ for maximizing a reward function. The guide policy $\pi^g$ can be used to complete the partial sentences generated by the policy $\pi$, which allows the RL algorithms to treat $\pi^g$ as an expert to imitate and surpass. RLGF can also use $\pi^g$ to generate partial sentences from which the RL algorithms start optimizing $\pi$. RLGF treats $\pi^g$ as a black-box model, which gives RLGF the flexibility of using different pre-trained LLMs (or even a human expert) as $\pi^g$. We mainly experiment using a supervised fine-tuned model followed by nucleus sampling (SFT+nucleus) as $\pi^g$. Our experiments show that RLGF is capable of learning a policy that is better than both $\pi^g$ and the policy learned by standard RL alone.

Here, we focus on more efficient RL methods for fine-tuning LLMs on downstream tasks with predefined rewards (e.g., a well-defined metric such as BLEU, or a reward learned from human preference feedback). Our approach is motivated by the classic prior work on RL with rich reset distributions (Kakade and Langford, 2002b; Bagnell et al., 2003) and Imitation Learning (IL) (Ross et al., 2011a; Sun et al., 2017a; Chang et al., 2015b), which often leverages an existing guide policy (not necessarily an optimal policy) to reduce the search space for more efficient and optimal learning. Our key observation is that since modern pre-trained LLMs exhibit impressive general language capabilities, they can serve as guide policies to improve the RL procedure. Our framework, which we call RL with guided feedback (RLGF), integrates a guide policy into a policy gradient framework (Figure 5.1).
When the guide policy can provide reasonable but potentially sub-optimal predictions for downstream tasks, our framework can leverage them to learn a near-optimal strategy. We introduce simple and novel algorithms for fine-tuning LLMs using our RLGF framework while capturing various existing IL and RL algorithms. Our proposed algorithms are simple and introduce little computation and memory overhead compared to PPO (especially when using LoRA adapters), making it straightforward to replace PPO with our algorithms in any RLHF pipeline.

We evaluate on three tasks. The first is IMDB, where the goal is to generate a positive and fluent review given an initial context. The second is CommonGen, where the goal is to write a fluent text that uses a given set of words. Finally, we test on the TL;DR summarization task, where the objective is to learn to generate summaries using human preference data. For all tasks, we find evidence of metric mismatch in SL-based fine-tuning approaches and show that RL-based methods, which utilize reward signals, outperform them on the task metric. We then demonstrate that RLGF outperforms PPO on reward, fluency, and automated lexical metrics such as ROUGE. In our experiments, our guide policy is the SFT model equipped with nucleus sampling. Thus, compared to the PPO baseline, which uses the SFT model as a warm start, our algorithms use the same amount of information, making the comparison to PPO fair. Finally, we investigate how various baselines and RLGF algorithms balance the inherent trade-off between reward optimization and the KL constraint in the RLHF objective. We provide both theoretical justification and empirical evidence to show the benefit of using RLGF for fine-tuning LLMs on downstream tasks.

5.2 Related Work

Here we present the most relevant works at the intersection of IL, RL, and natural language generation. Please see Appendix D.1 for a more thorough treatment of the literature.
IL for Structured Prediction: Algorithms such as Scheduled Sampling (SS) (Bengio et al., 2015), methods using SS (Duckworth et al., 2019; Mihaylova and Martins, 2019; Goyal et al., 2017), SEARNN (Leblond et al., 2017), Bridging the Gap (Zhang et al., 2019b), and Mixer (Ranzato et al., 2015) have been inspired by the IL-for-structured-prediction algorithms DAGGER (Ross et al., 2011a), DAD (Venkatraman et al., 2015), and SEARN (Daumé III et al., 2009). Our work is inspired by AggreVaTeD (Sun et al., 2017a), the differentiable version of AggreVaTe (Ross and Bagnell, 2014b), where the algorithm makes use of differentiable policies and multi-step feedback rather than immediate one-step predictions to imitate. Similarly, we present a differentiable version of LOLS (Chang et al., 2015b) as well as an improvement, D2LOLS.

LLM Fine-tuning from Human Preferences: Recent advancements in fine-tuning Large Language Models (LLMs) have shown incredible success through learning from human preferences. Because human preferences are simpler to collect, Reinforcement Learning from Human Feedback (RLHF) (Stiennon et al., 2020) introduced a paradigm that uses RL to improve downstream performance on translation (Kreutzer et al., 2018b), summarization (Stiennon et al., 2020), storytelling (Ziegler et al., 2019), and instruction following (OpenAI, 2023). Another family of work uses supervised-learning-style methods for fine-tuning LLMs (Zhao et al., 2023; Yuan et al., 2023; Rafailov et al., 2023a; Liu et al., 2023c). DPO, SLiC, RRHF, and RSO are methods that optimize for compatibility with a preference dataset under a preference reward model (either explicitly modeling a reward function or implicitly representing a reward function via an LLM itself) such as the Bradley-Terry model (Bradley and Terry, 1952).
Whether one should use RL or SFT to fine-tune LLMs is not the question we aim to address here. Instead, our work mainly focuses on improving PPO for fine-tuning LLMs, and our key contribution is a set of novel RL algorithms that can outperform PPO on various tasks.

LLM Distillation: With an ever-growing arsenal of powerful, black-box LLMs, recent work has aimed to distill specific capabilities into a smaller model. Knowledge distillation (Buciluǎ et al., 2006; Hinton et al., 2015) in autoregressive models investigated matching sequence-level log probabilities (Kim and Rush, 2016), model hidden states (Jiao et al., 2019), or attention scores (Wang et al., 2020b). Recently, more sophisticated methods inspired by the IL literature have been proposed to better imitate the expert LLM’s performance (Lin et al., 2020a; Agarwal et al., 2023; Mukherjee et al., 2023), with ORCA (Mukherjee et al., 2023) reaching parity with ChatGPT (OpenAI, 2023) by distilling the reasoning traces from GPT4 (OpenAI, 2023). Distinct from this line of work, RLGF does not aim to replicate the guide policy. Rather, our objective is to leverage generation traces derived from a guide policy to condense the search space for RL algorithms. More importantly, our goal goes beyond imitating the guide policy and focuses on algorithms that better optimize a reward with guide policy feedback.

5.3 Preliminaries

The sequential nature of text generation with LLMs allows one to model it via RL. In this setting, we are given a set of prompts $\{x_i\}_{i=1}^N$ and a reward function $R$ that measures some user-specified quality of the generated text. The reward $R$ can be a pre-defined evaluation metric or a reward model learned from human preference datasets. The text generation RL problem can then be defined as a token-level finite-horizon MDP $\langle S, A, P, R, H, \mu \rangle$ using a finite vocabulary $\mathcal{V}$.
We are given a labeled dataset $D = \{(x_i, y_i)\}_{i=1}^N$ of $N$ samples, where $x_i$ is a prompt text and $y_i$ is the target text generation. We define $\mu \in \Delta(D)$ as the initial distribution over prompts, and the action space $A$ as the set of tokens in our vocabulary $\mathcal{V}$. The state space $S = \cup_{h=1,\dots,H} \mathcal{V}^h$ is the set of all possible token sequences, and a state $s_h \in S$ is the prompt $x$ together with the previously generated tokens $(a_0, a_1, \dots, a_{h-1})$, i.e., $s_h = (x, a_0, a_1, \dots, a_{h-1})$. The transition function $P : S \times A \to \Delta(S)$ is a deterministic, known transition function that appends the next action $a_h$ to the state $s_h$ to form $s_{h+1}$. The time horizon $H \in \mathbb{Z}_+$ is the maximum generation length. Finally, $R : S \to \mathbb{R}$ is the reward function, such as the task evaluation metric or a metric learned from a preference dataset.

We define our policy $\pi$ as an LLM that maps from state (i.e., prompt + partial generation) to action (next token). Let $d^\pi_h$ represent the state distribution of visiting a state at time $h$. Let $d^\pi = \frac{1}{H} \sum_{h=0}^{H} d^\pi_h$ be the average visitation if we follow $\pi$ for $H$ steps in a trajectory. With an LLM policy $\pi$, we define the value function and Q-function as $V^\pi_h(s) = \mathbb{E}_\pi\left[\sum_{h'=h}^{H} R(s_{h'}) \mid s_h = s\right]$ and $Q^\pi_h(s, a) = R(s) + \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[V^\pi_{h+1}(s')\right]$ respectively. Finally, we define the advantage function for an LLM policy $\pi$ as $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$.

Guide policy $\pi^g$: In our setting, we additionally assume access to a black-box LLM-based guide policy $\pi^g$ that can assist our policy $\pi$. The guide policy can be used to alter the initial state distribution $\mu$ and to compute the advantage function $A^{\pi^g}(s, a)$. In our experiments, we mainly investigate using a supervised fine-tuned (SFT) model followed by some decoding strategy (e.g., nucleus sampling (Holtzman et al., 2019)) as $\pi^g$. Note that RLGF treats $\pi^g$ as a queryable, black-box model that we do not need to update. This allows $\pi^g$ to be any black-box model, such as GPT4 or a human expert.
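To make the token-level MDP above concrete, the following is a minimal sketch in Python. The names `TextMDP`, `generate`, and the toy reward are illustrative assumptions, not part of the thesis, and the sketch scores only the finished sequence, which is a simplification of the per-state reward $R$.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Token = str
State = Tuple[Token, ...]  # prompt tokens followed by the tokens generated so far


@dataclass(frozen=True)
class TextMDP:
    """Toy token-level MDP; all names here are illustrative assumptions."""
    vocab: Tuple[Token, ...]             # action space A (the vocabulary)
    horizon: int                         # maximum generation length H
    reward_fn: Callable[[State], float]  # hypothetical stand-in for R

    def step(self, state: State, action: Token) -> State:
        # Deterministic, known transition: append the chosen token to the state.
        if action not in self.vocab:
            raise ValueError(f"unknown token: {action!r}")
        return state + (action,)


def generate(mdp: TextMDP, policy: Callable[[State], Token], prompt: State):
    """Roll a policy forward for H steps and score the finished sequence."""
    state = tuple(prompt)
    for _ in range(mdp.horizon):
        state = mdp.step(state, policy(state))
    return state, mdp.reward_fn(state)


# Toy usage: the reward is the fraction of generated tokens equal to "good".
mdp = TextMDP(
    vocab=("good", "bad"),
    horizon=3,
    reward_fn=lambda s: s[1:].count("good") / 3,
)
final_state, reward = generate(mdp, lambda s: "good", ("review:",))
```

Because the transition only appends a token, any prefix of a generation is itself a valid state, which is the property the restart-based algorithms rely on.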
Our work aims to show that RLGF is capable of learning policies that are (much) better than $\pi^g$ and that, by leveraging $\pi^g$, it can outperform the standard RL algorithm PPO.

Figure 5.2: RLGF’s main mechanism of incorporating guidance through interactions between two LLMs: rollin and rollout policies. (1) The rollin policy generates a trajectory. (2) The rollout policy restarts to a sampled point in the generation (e.g., $s_2$) and completes the generation. (3) The rollout policy receives a score (i.e., reward) for the generation.

5.4 Reinforcement Learning from Guided Feedback

Unlike other tasks studied in RL (e.g., robotics control problems), text generation problems have two key properties: a deterministic transition function and a policy’s ability to restart to any state. Because our state is the set of previously generated tokens, we can easily alter the words in the generation (add, remove, or swap) and restart our policy $\pi_\theta$ to any point of the generation.

Restarts allow us to execute rollin and rollout policies as seen in Figure 5.2. The rollin policy is used to generate sequences that the rollout policy evaluates. Specifically, we sample a prompt $x$ from our initial distribution $\mu$. We then generate an entire trajectory using our rollin policy starting from the sampled prompt. We combine the state-action pairs from the collected rollin trajectory with the initial prompts, creating a modified initial state distribution for the rollout policy. The rollout policy samples a state along the rollin generation, restarts to this state, and performs rollouts. The rollout policy collects a reward at the end of the generation. The rollin and rollout policies can each be our LLM policy $\pi_\theta$ or the guide policy $\pi^g$. Depending on the choice of rollin and rollout policies, we invoke different algorithms. Note that PPO uses $\pi_\theta$ for both rollin and rollout policies.
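The three steps of the rollin and rollout mechanism can be sketched as follows. This is a toy illustration under stated assumptions: policies are plain functions from a token tuple to the next token, the restart point is drawn uniformly, and `rollin_then_rollout` is a hypothetical name rather than the thesis implementation.

```python
import random
from typing import Callable, Tuple

State = Tuple[str, ...]
Policy = Callable[[State], str]


def rollin_then_rollout(
    prompt: State,
    rollin_policy: Policy,
    rollout_policy: Policy,
    reward_fn: Callable[[State], float],
    horizon: int,
    rng: random.Random,
):
    prompt = tuple(prompt)

    # (1) The rollin policy generates a full trajectory from the prompt.
    rollin = prompt
    for _ in range(horizon):
        rollin = rollin + (rollin_policy(rollin),)

    # (2) Restart: keep the prompt plus the first h rollin tokens
    #     (uniform choice of h is an assumption of this sketch),
    #     then let the rollout policy complete the generation.
    h = rng.randrange(horizon)
    restart_state = rollin[: len(prompt) + h]
    state = restart_state
    while len(state) < len(prompt) + horizon:
        state = state + (rollout_policy(state),)

    # (3) The completed generation is scored once, at the end.
    return restart_state, state, reward_fn(state)


# Toy usage: the guide always emits "g", the learner always emits "t".
restart, final, score = rollin_then_rollout(
    prompt=("p",),
    rollin_policy=lambda s: "g",
    rollout_policy=lambda s: "t",
    reward_fn=lambda s: float(s.count("t")),
    horizon=4,
    rng=random.Random(0),
)
```

Swapping which function is passed as `rollin_policy` and `rollout_policy` is what distinguishes the algorithm variants discussed in this section.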
PPO: Rollin $\pi_\theta$ and Rollout $\pi_\theta$. Under this schematic, when both the rollin and rollout policies are our current LLM policy $\pi_\theta$ that is being fine-tuned, the resulting RL algorithm is PPO. That is, we would be collecting generations from a single LLM. This configuration takes advantage of neither the ability to modify the initial state distribution nor the availability of a guide policy $\pi^g$.

Algorithm 6 PPO++
1: Input: $\pi_\theta$, guide $\pi^g$, iterations $T$, mixing parameter $\beta \in [0, 1]$, dataset $D = \{(x_i, y_i)\}_{i=1}^N$
2: for $t \in [T]$ do
3:     Rollin with $(s, a) \sim \beta d^{\pi^g} + (1 - \beta) d^{\pi^t_\theta}$ starting from $x \sim D$
4:     Rollout with $\pi^t_\theta$ to collect trajectories
5:     Update $V^{\pi^t_\theta}_\phi$ with trajectories and compute advantage estimates $A^{\pi^t_\theta}$
6:     Update $\pi_\theta$ using PPO loss with $A^{\pi^t_\theta}$
7: end for
8: return $\pi_\theta$

PPO++: Rollin $\pi^g$ and Rollout $\pi_\theta$. The new scheme we propose is to rollin with our guide policy $\pi^g$ and rollout with our LLM policy $\pi_\theta$. This strategy is motivated by a popular Approximate Policy Iteration algorithm (Bertsekas, 2011): Conservative Policy Iteration (CPI) (Kakade and Langford, 2002b). CPI proposes to use a diverse initial state distribution to address the exploration issue in PG methods. In particular, it proposes to use an initial state distribution that covers some high-quality policy’s state distribution. The first key idea of PPO++ is to take advantage of a guide policy $\pi^g$ to provide an enlarged initial state distribution, so that the rollout policy $\pi_\theta$ can visit diverse and relevant states it would otherwise not visit. The second key idea of PPO++ is to use a mixture policy with state distribution $\beta d^{\pi^g} + (1 - \beta) d^{\pi_\theta}$ for rollin (see Algorithm 6, Line 3). This ensures that with probability $(1 - \beta)$, PPO++ executes the default PPO update, so that PPO++ maintains the benefits of PPO and never underperforms PPO.
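Line 3 of Algorithm 6 can be illustrated with a short sketch of the mixture rollin. The function name `sample_restart_state` and the uniform choice of restart depth are assumptions made for this example; the thesis does not specify these details in this passage.

```python
import random
from typing import Callable, Tuple

State = Tuple[str, ...]
Policy = Callable[[State], str]


def sample_restart_state(
    prompt: State,
    learner: Policy,
    guide: Policy,
    beta: float,
    horizon: int,
    rng: random.Random,
) -> State:
    """Draw a restart state from the mixture beta * d^{guide} + (1 - beta) * d^{learner}."""
    # With probability beta the guide performs the rollin, otherwise the learner does.
    rollin_policy = guide if rng.random() < beta else learner

    # Because transitions only append tokens, generating h tokens gives the same
    # state as generating a full trajectory and truncating it after h tokens.
    h = rng.randrange(horizon)
    state = tuple(prompt)
    for _ in range(h):
        state = state + (rollin_policy(state),)
    return state


# Toy usage: beta = 1.0 always rolls in with the guide, beta = 0.0 recovers plain PPO rollins.
rng = random.Random(0)
guide_states = [
    sample_restart_state(("p",), lambda s: "t", lambda s: "g", 1.0, 4, rng)
    for _ in range(50)
]
learner_states = [
    sample_restart_state(("p",), lambda s: "t", lambda s: "g", 0.0, 4, rng)
    for _ in range(50)
]
```

Setting `beta` strictly between 0 and 1 interleaves guide-seeded restarts with ordinary on-policy rollins, which is the mechanism behind the claim that PPO++ retains the default PPO update with probability $(1 - \beta)$.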
Algorithm 7 AggreVaTeD
1: Input: $\pi_\theta$, guide $\pi^g$, iterations $T$, mixing parameter $\beta \in [0, 1]$, dataset $D = \{(x_i, y_i)\}_{i=1}^N$
2: for $t \in [T]$ do
3:     Rollin with $(s, a) \sim (1 - \beta) d^{\pi^t_\theta} + \beta d^{\pi^g}$ starting from $x \sim D$
4:     Rollout with $\pi^g$ to collect trajectories
5:     Update $V^{\pi^g}_\phi$ with trajectories and compute advantage estimates $A^{\pi^g}$
6:     Update $\pi_\theta$ using PPO loss with $A^{\pi^g}$
7: end for
8: return $\pi_\theta$

AggreVaTeD: Rollin $\pi_\theta$ and Rollout $\pi^g$. The next scheme performs rollin with our LLM policy $\pi_\theta$ and rollout with our guide policy $\pi^g$, the opposite of PPO++. This scheme is an interactive imitation learning algorithm, AggreVaTeD (Sun et al., 2017a), a differentiable policy gradient version of AggreVaTe (Aggregate Values to Imitate (Ross and Bagnell, 2014b)), as seen in Algorithm 7. AggreVaTeD is an Approximate Policy Iteration (API) algorithm similar to CPI and also uses a mixture policy with state distribution $\beta d^{\pi^g} + (1 - \beta) d^{\pi_\theta}$ for rollin. This algorithm first generates rollins with the mixture policy to collect sequences. Then AggreVaTeD generates rollouts with the guide policy and evaluates the quality of the generated rollouts. It then uses the rollouts to train a value network $V^{\pi^g}_\phi$ that measures the reward-to-go of $\pi^g$, which in turn is used to construct the advantage of $\pi^g$: $A^{\pi^g}$. With this advantage $A^{\pi^g}$, AggreVaTeD updates the policy like PPO (i.e., it updates $\pi_\theta$ so that it increases the probabilities of selecting actions with larger $A^{\pi^g}$). Intuitively, the algorithm aims to learn the policy $\arg\max_a A^{\pi^g}(s, a)$, which ensures that the LLM policy $\pi_\theta$ can be at least as good as or better than the guide policy $\pi^g$.

Algorithm 8 D2LOLS
1: Input: $\pi_\theta$, guide $\pi^g$, iterations $T$, dataset $D = \{(x_i, y_i)\}_{i=1}^N$
2: Run $\pi^1_\theta$ = AggreVaTeD($\pi_\theta$, $\pi^g$, $\alpha T$, $\beta_1$, $D$)
3: Run $\pi^2_\theta$ = PPO++($\pi^1_\theta$, $\pi^g$, $(1 - \alpha) T$, $\beta_2$, $D$)
4: return $\pi^2_\theta$

D2LOLS: combines PPO++ and AggreVaTeD. Given the previous approaches of interaction, we can come up with multiple ways to combine PPO, PPO++, and AggreVaTeD.
In Algorithm 8, we present Direct and Differentiable Locally Optimal Learning to Search (D2LOLS), which is a simple approach to combining the previous methods. D2LOLS is a differentiable policy gradient version of Locally Optimal Learning to Search (LOLS) (Chang et al., 2015b) and addresses limitations of how LOLS combines PPO, PPO++, and AggreVaTeD. The original formulation of LOLS requires computing cost-sensitive classification similar to AggreVaTe; instead, we take inspiration from AggreVaTeD’s differentiable approach to develop a differentiable version of LOLS. Furthermore, LOLS (Algorithm 15) has a mixing probability parameter $\alpha$ that directly merges the advantage functions of PPO and AggreVaTeD, leading to theoretical issues. D2LOLS removes this mixing probability and replaces it with a mixing time variable $\alpha$ that decides how many iterations of AggreVaTeD to perform before switching to PPO++. This simple modification not only makes D2LOLS more practical for optimizing LLMs, but also fixes LOLS’s issue arising from interweaving guidance. Thus D2LOLS should be understood as a more practical and more principled alternative to LOLS.

5.5 Theoretical Justification

In this section, we provide theoretical justification for the various rollin and rollout schemes mentioned in Section 5.4. Each algorithmic scheme takes advantage of a guide policy $\pi^g$, the ability to restart the policy to any state, and access to the reward signal. Our theoretical justifications are derived from the original algorithms that each method builds upon.

Interactive Imitation Learning: AggreVaTeD. In our interactive IL setting, we assume access to the ground truth reward and to a guide policy $\pi^g$ that may not necessarily be an expert policy $\pi^\star$ (i.e., optimal at the task). Our AggreVaTeD (Algorithm 7) implementation is a modification of the original AggreVaTeD (Sun et al., 2017a) to incorporate a PPO policy gradient loss.
The overall idea is to perform policy gradient updates on the loss function $\ell_t(\pi) := \mathbb{E}_{s \sim d^{\pi_t}} \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[A^{\pi^g}(s, a)\right]$, where $\pi_t$ is our latest learned policy. We can define the average regret and the best policy performance in our policy class over $T$ iterations as:

$$\epsilon_{\text{regret}} = \frac{1}{T}\left[-\sum_{t=0}^{T} \ell_t(\pi_t) + \max_{\pi \in \Pi} \sum_{t=0}^{T} \ell_t(\pi)\right], \qquad \epsilon_{\text{class}} = \max_{\pi \in \Pi} \frac{1}{T} \sum_{t=0}^{T} \mathbb{E}_{s \sim d^{\pi_t}}\left[A^{\pi^g}(s, \pi(s))\right].$$

If the gradient update procedure achieves no-regret, i.e., $\epsilon_{\text{regret}} \to 0$ as $T \to \infty$, AggreVaTeD achieves the following guarantee: there exists $t \in [T]$ such that

$$V^{\pi_t} \ge V^{\pi^g} + H \epsilon_{\text{class}}.$$

When the guide policy is included in our policy class, $\pi^g \in \Pi$ (e.g., when our policy $\pi_\theta$ and our guide $\pi^g$ have the same GPT2 model architecture), our $\epsilon_{\text{class}}$ term is guaranteed to be non-negative. Furthermore, this term is positive when $\pi^g$ is not globally optimal with respect to its advantage function (i.e., $\max_a A^{\pi^g}(s, a)$ can be positive). Thus when $\epsilon_{\text{regret}} \to 0$ (i.e., no-regret), AggreVaTeD is guaranteed to learn a policy $\pi_t$ that outperforms the guide policy by a margin. This was originally confirmed empirically in Sun et al. (2017a) and is also confirmed in our experiments. With our SFT model with nucleus sampling as $\pi^g$, AggreVaTeD learns a policy $\pi_t$ outperforming $\pi^g$.

Reinforcement Learning with better restart distribution: PPO++. Although AggreVaTeD is capable of outperforming $\pi^g$, it is an imitation learning algorithm, meaning that, by design, its performance is limited by the performance of $\pi^g$. In contrast, RL has the potential to learn a near-optimal policy, but popular RL approaches suffer from a lack of exploration. We propose to leverage rollins with the guide policy to overcome RL’s exploration issues. PPO++ (Algorithm 6) implements this idea using a PPO loss.
We can interpret the rollin distribution induced by the guide policy as a restart distribution that alters the initial distribution of our policy, i.e., $\mu_{\text{mix}} := (1 - \beta)\mu + \beta d^{\pi^g}$, where, recall, $\mu \in \Delta(D)$ is the original initial state distribution over our data. Policy gradient theory (Kakade and Langford, 2002b; Bagnell et al., 2003; Agarwal et al., 2019, 2021a) ensures that as long as a near-optimal policy is covered by the restart distribution, we can learn to perform as well as the near-optimal policy. More formally, consider the special case where $\beta = 1/2$ and $\pi^\star$ is the globally optimal policy, and assume that at some iteration $t$ the one-step local improvement over $\pi_t$ is small, i.e., $\mathbb{E}_{s, a \sim d^{\pi_t}_{\mu_{\text{mix}}}}\left[\max_a A^{\pi_t}(s, a)\right] \le \epsilon$. Then with some small $\epsilon$ we have:

$$V^{\pi_t} \ge V^{\pi^\star} - O\left(H^2 \max_s \left(\frac{d^{\pi^\star}(s)}{d^{\pi^g}(s)}\right) \epsilon\right).$$

We refer readers to the proof of Theorem 6.2 in Kakade and Langford (2002b). Note that, compared to the result for AggreVaTeD, we are able to compare against the globally optimal policy $\pi^\star$ under the condition that $\pi^g$’s state distribution covers $\pi^\star$’s state distribution (i.e., the guide policy has a good sense of what states $\pi^\star$ will likely visit). In our experiments, we mainly use an SFT model with nucleus sampling as our guide policy $\pi^g$. While we do not expect the SFT policy $\pi^g$ to be as good as the optimal $\pi^\star$, it is reasonable to expect that $d^{\pi^g}$ provides coverage of $d^{\pi^\star}$. Our experiments verify that restarting based on states from $d^{\pi^g}$ improves the performance of PPO.

Combine Reinforcement Learning and Imitation Learning: D2LOLS. D2LOLS is the simplest approach to combining AggreVaTeD and PPO++. This algorithm runs AggreVaTeD for a fixed period of time and then PPO++ for the remaining time. If our policy gradient algorithm is Trust-region policy optimization (TRPO)¹ (Schulman et al., 2015a) or CPI (Kakade and Langford, 2002b), then our algorithm has guaranteed monotonic policy improvement.
This means that upon convergence, we achieve two properties: (1) our learned policy is at least as good as the guide policy $\pi^g$, and (2) our policy is locally optimal, i.e., the local one-step improvement, $\mathbb{E}_{s, a \sim d^{\pi}_{\mu_{\text{mix}}}}\left[\max_a A^{\pi}(s, a)\right]$, has to be small (otherwise TRPO and CPI can keep improving).

There exist several algorithms in the literature that combine RL and IL (Cheng et al., 2018; Sun et al., 2018; Chang et al., 2015b; Rajeswaran et al., 2017a; Nair et al., 2018). The key difference between D2LOLS and LOLS is how PPO++ and AggreVaTeD are combined. LOLS uses a mixing probability $\alpha$ to combine the advantage functions of our policy $\pi_\theta$ and the guide policy $\pi^g$, $\alpha A^{\pi^t_\theta}(s, a) + (1 - \alpha) A^{\pi^g}(s, a)$, whereas D2LOLS uses a mixing time parameter $\alpha$ to decide when to switch from AggreVaTeD to PPO++ for the remainder of training. LOLS can achieve the property of outperforming $\pi^g$ while also being locally optimal, but only under the assumption that the following gap is small:

$$\forall \pi : \left| \mathbb{E}_{s \sim d^{\pi}}\left[\max_a A^{\pi^g}(s, a) + \max_a A^{\pi}(s, a)\right] - \mathbb{E}_{s \sim d^{\pi}} \max_a \left[A^{\pi^g}(s, a) + A^{\pi}(s, a)\right] \right| \le \varepsilon,$$

with some small $\varepsilon$. However, such a gap can exist in practice and does not vanish even with enough training data. Intuitively, this gap is non-trivial when the one-step improvement over $\pi$ conflicts with the one-step improvement over $\pi^g$. The simplest approach, D2LOLS, works the best and achieves the guarantee that LOLS aimed for without the additional assumption that the above gap is small.

¹ In our experiments, instead of using TRPO, we use PPO, a scalable version of TRPO that is more suitable for high-dimensional problems. However, we emphasize that TRPO and PPO use the same principle for policy optimization: make conservative policy updates (Kakade and Langford, 2002b) to ensure monotonic improvement.
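A small numeric example may help show when the LOLS gap above is non-trivial. The sketch below evaluates the gap at a single state with two actions; the function name `lols_gap` and the advantage values are made up for illustration, and the real quantity is an expectation over states rather than a single-state value.

```python
from typing import Sequence


def lols_gap(adv_guide: Sequence[float], adv_policy: Sequence[float]) -> float:
    """Single-state version of |max_a A^g + max_a A^pi - max_a (A^g + A^pi)|."""
    separate = max(adv_guide) + max(adv_policy)
    joint = max(g + p for g, p in zip(adv_guide, adv_policy))
    return abs(separate - joint)


# The two advantage functions prefer different actions: the gap is non-zero.
conflicting = lols_gap(adv_guide=[1.0, 0.0], adv_policy=[0.0, 1.0])

# Both advantage functions prefer the same action: the gap vanishes.
aligned = lols_gap(adv_guide=[1.0, 0.0], adv_policy=[2.0, 0.0])
```

In the conflicting case, no single action realizes both one-step improvements at once, which matches the intuition that the gap is large when improving over $\pi$ conflicts with improving over $\pi^g$.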
                    IMDB Sentiment (Semantic and Fluency Metrics)                     CommonGen (Lexical and Semantic Metrics)
Algorithms          Sentiment Score (↑)   Perplexity (↓)    Output-Perplexity (↓)    Bleu-4 (↑)      CIDEr-D (↑)     SPICE (↑)
Zero-Shot           0.48 ± 0.00           32.55 ± 0.00      5.64 ± 0.00              0.00 ± 0.00     6.02 ± 0.55     15.02 ± 0.40
SFT                 0.55 ± 0.00           35.67 ± 0.00      6.19 ± 0.00              22.31 ± 0.12    14.32 ± 0.15    31.73 ± 0.34
SFT+PPO             0.97 ± 0.01           44.92 ± 1.78      3.17 ± 0.62              27.98 ± 0.32    16.91 ± 0.29    32.61 ± 0.06
SFT+PPO++           0.97 ± 0.01           44.83 ± 2.10      3.34 ± 0.80              28.48 ± 0.24    16.94 ± 0.53    32.75 ± 0.21
SFT+AggreVaTeD      0.95 ± 0.03           52.56 ± 5.38      5.04 ± 2.30              28.14 ± 0.31    16.90 ± 0.09    32.44 ± 0.02
SFT+LOLS            0.93 ± 0.05           53.30 ± 16.70     3.44 ± 4.96              28.15 ± 0.16    16.91 ± 0.22    32.80 ± 0.20
SFT+D2LOLS          0.97 ± 0.00           43.88 ± 2.37      2.92 ± 0.13              28.54 ± 0.12    16.96 ± 0.18    32.83 ± 0.09

Table 5.1: IMDB and CommonGen Results: We compute the mean and standard deviation over 3 seeds for both the IMDB and the CommonGen tasks. For our reward function in each task we use the bold metric(s). The zero-shot model is the performance of the pretrained model used for IMDB and CommonGen, GPT-2 and T5 respectively. SFT+Alg indicates running Alg after supervised fine-tuning. SFT+nucleus is used as our guide policy $\pi^g$ for all experiments.

5.6 Experiments

We perform all of our experiments using a modified PPO objective $J_{\text{ppo}}$ (Ouyang et al., 2022; Wu et al., 2016). This objective combines the original PPO objective with a maximum-likelihood estimation (MLE) objective on the references of the ground-truth dataset $D$:

$$J_{\text{ppo}}(\pi_\theta) = \mathbb{E}_{(s, a) \sim \pi_\theta}\left[R(s) - \lambda\, \mathrm{KL}\left(\pi_\theta(a \mid s) \,\|\, \pi_0(a \mid s)\right)\right] + \eta\, \mathbb{E}_{(s, a) \sim D}\left[\log \pi_\theta(a \mid s)\right],$$

where $\lambda$ is the KL coefficient and $\eta$ is the MLE coefficient. For all of our proposed RLGF algorithms discussed in Section 5.4, we set $\pi^g$ to the supervised fine-tuned model (SFT) with nucleus sampling for decoding (i.e., $\pi^g$ = SFT+nucleus). We treat SFT+nucleus as a black-box model that we can only query for text generation, and we do not perform updates to it.
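As a rough illustration of how the three terms of $J_{\text{ppo}}$ combine, the sketch below forms a Monte Carlo estimate from per-sample quantities. It is an assumption-laden toy: the function name `j_ppo_estimate` is hypothetical, the KL term is approximated by the sampled log-ratio $\log \pi_\theta(a \mid s) - \log \pi_0(a \mid s)$, and PPO's clipping and advantage estimation are omitted entirely.

```python
from typing import Sequence


def j_ppo_estimate(
    rewards: Sequence[float],       # R(s) for samples drawn from pi_theta
    logp_theta: Sequence[float],    # log pi_theta(a|s) on those samples
    logp_ref: Sequence[float],      # log pi_0(a|s) on those samples
    logp_data: Sequence[float],     # log pi_theta(a|s) on reference tokens from D
    kl_coef: float,                 # lambda
    mle_coef: float,                # eta
) -> float:
    # RL term: reward minus the KL penalty, with the KL approximated per sample
    # by the log-ratio (an assumption of this sketch).
    rl_terms = [
        r - kl_coef * (lt - l0)
        for r, lt, l0 in zip(rewards, logp_theta, logp_ref)
    ]
    rl_term = sum(rl_terms) / len(rl_terms)

    # MLE term: average log-likelihood of the dataset references.
    mle_term = sum(logp_data) / len(logp_data)
    return rl_term + mle_coef * mle_term


value = j_ppo_estimate(
    rewards=[1.0, 0.0],
    logp_theta=[-1.0, -2.0],
    logp_ref=[-1.5, -2.0],
    logp_data=[-0.5, -1.5],
    kl_coef=0.1,
    mle_coef=0.2,
)
```

Raising `kl_coef` penalizes drift from the reference model more strongly, while raising `mle_coef` pulls the policy toward the dataset references, which are the two regularizers varied in the sensitivity study later in this chapter.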
By using SFT+nucleus as our guide policy, we run all of our experiments under the exact same conditions as those of RLHF. Note that RLHF already requires keeping SFT to compute the KL constraint, $\mathrm{KL}(\pi_\theta \| \pi_0)$, in $J_{\text{ppo}}$.

Task Details. In our experiments, perplexity measures how likely our learned model, $\pi_\theta$, is to generate the references in the task dataset, whereas output perplexity measures how likely a general LLM (e.g., GPT-J) is to generate the generations from our learned policy, $\pi_\theta$. Both perplexity metrics have been reported as a measure of fluency (Fedus et al., 2018; Ramamurthy et al., 2022a). We perform experiments on three tasks.

IMDB is the first task, and the objective is to generate fluent text continuations with positive sentiment for IMDB (Maas et al., 2011) movie review prompts. We use a sentiment classifier (Sanh et al., 2019) as our reward function; it is trained on review texts and sentiment labels from the dataset and provides sentiment scores indicating how positive a given piece of text is. For training supervised SFT baselines, we consider only the examples with positive labels. We chose GPT2 (Radford et al., 2019) as the base language model (LM) for this task. We evaluate all algorithms on three metrics: sentiment reward score, perplexity, and output-perplexity.

Next, we consider CommonGen (Lin et al., 2020b), a challenging constrained text generation task that tests the ability of generative common sense reasoning. We optimize the SPIDER (Liu et al., 2017) reward function, a weighted combination of the CIDEr-D and SPICE metrics. We chose T5-base (Raffel et al., 2020) as our base LLM and prefixed each concept set input with: "generate a sentence with:". We report three metrics: BLEU (Papineni et al., 2002), CIDEr-D (Vedantam et al., 2015), and SPICE (Anderson et al., 2016). For IMDB and CommonGen, we perform one epoch of supervised fine-tuning for our SFT models.
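For readers unfamiliar with the two perplexity metrics described in the task details above, the following sketch uses the standard definition of perplexity as the exponentiated negative mean token log-probability. This passage does not spell out the exact formula used in the experiments, so treating it as the standard one is an assumption; only the source of the log-probabilities differs between the two metrics.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Perplexity: log-probs the learned policy assigns to the reference tokens.
# Output-perplexity: log-probs a general LLM assigns to the policy's generations.
# Here the toy scorer assigns probability 0.5 to each of four tokens.
ppl = perplexity([math.log(0.5)] * 4)
```

Lower values mean the scoring model finds the text more likely, which is why both metrics are marked (↓) in the results tables.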
The final task we consider is the Reddit TL;DR summarization dataset (Völske et al., 2017), where the objective is to generate summaries. We use the filtered dataset with additional human preference data used in Stiennon et al. (2020). The base LLM that we use for this task is GPT-J (Wang and Komatsuzaki, 2021), and we train all models in our algorithms using LoRA adapters (Hu et al., 2021). We evaluate all algorithms on 5 metrics: reward score, perplexity, output-perplexity, win rate, and Rouge (Lin, 2004). For win rate, we use the open-source Llama2-13B-chat (Touvron et al., 2023) model as our evaluator model. We compare all algorithm generations to the preferred summary references. For our SFT model, we use an open-source GPT-J model². Refer to Appendix D.3.2 for the exact Win Rate prompt, example evaluations, and implementation details.

TL;DR Summarization (Semantic and Fluency Metrics)
Algorithms                RM Score (↑)   Perplexity (↓)   Output-Perplexity (↓)   Win Rate (↑)   Rouge 1 (↑)   Rouge 2 (↑)   RougeL (↑)
Zero-Shot                 1.57           14.07            11.51                   44.12%         0.27          0.07          0.18
SFT                       5.68           14.09            12.81                   44.29%         0.34          0.25          0.25
Best-of-N (N = 8)         5.98           14.09            12.86                   47.60%         0.36          0.13          0.27
SFT+PPO                   6.01           15.05            17.67                   54.25%         0.35          0.13          0.27
SFT+PPO++                 6.11           14.53            16.15                   55.01%         0.36          0.14          0.27
SFT+AggreVaTeD            5.93           14.69            16.41                   48.98%         0.36          0.15          0.29
SFT+PPO (N = 8)           6.20           14.87            16.53                   57.53%         0.36          0.15          0.27
SFT+PPO++ (N = 8)         6.52           13.42            15.23                   60.30%         0.38          0.15          0.28
SFT+AggreVaTeD (N = 8)    6.11           13.53            15.61                   54.12%         0.37          0.16          0.28

Table 5.2: TL;DR Summarization Results: We report the mean over 1 seed. Our RM Score is under our trained preference reward model and the Win Rate is evaluated by Llama2-13B-Chat. We use SFT+nucleus as $\pi^g$. We also report Best-of-8 results with our trained policies.

5.6.1 Experimental Results

RLGF vs. RLHF Performance. Table 5.1 and Table 5.2 compare all of the RLGF algorithms proposed in Section 5.4 against standard RLHF algorithms and baselines.
For all tasks, our $\pi^g$ is SFT+nucleus, which is sub-optimal, performing worse than all RL-based algorithms across most lexical and semantic metrics. Utilizing this $\pi^g$, for IMDB, D2LOLS outperforms PPO on all metrics while PPO++ outperforms PPO on both semantic reward and perplexity, and for CommonGen, D2LOLS outperforms PPO on all metrics, including the ones that are not included in the reward function. Finally, for TL;DR summarization we see that PPO++ performs better than PPO as well as a competitive baseline, Best-of-N (Dubois et al., 2023). Furthermore, when applying Best-of-N inference to our trained policies, we see that PPO++ improves even further beyond PPO. Notably, with or without the Best-of-N procedure, PPO++ outperforms PPO on all metrics.

Supporting our justification from Section 5.5, AggreVaTeD improves beyond our guide policy, providing an alternative warm-starting methodology to warm-starting with SFT. PPO++, on the other hand, is better than or competitive with our RL baseline, demonstrating a simple yet powerful alternative to PPO as the RL procedure. Even in practice, we observe the benefit of restarting from an initial state distribution that better covers an optimal policy’s state distribution. The combination of these two, D2LOLS, achieves the best of both worlds and fully leverages the capabilities of utilizing a guide policy.

² https://huggingface.co/CarperAI/openai_summarize_tldr_sft

Reward Optimization Tradeoff. In Figure 5.3 we evaluate how well RLGF algorithms trade off optimizing the reward against minimizing the perplexity and the KL constraint $\sqrt{\mathrm{KL}}$. For fair comparisons, we kept $\lambda$ and $\eta$ the same across all algorithms. For both plots, the top right corner indicates the policy has both high reward and low perplexity and low divergence from $\pi_0$.
For each algorithm we plot 5 checkpoints ranging from 20 to 100 iterations. PPO++ mostly matches or has higher reward than PPO while maintaining a lower perplexity. Separately, AggreVaTeD trades off reward for perplexity: it has reward scores comparable to PPO while drastically reducing its perplexity. In the KL-constraint plot on the left of Figure 5.3, we see that although PPO has a set of points with high reward, most of these points also have high KL divergences. In contrast, a subset of the PPO++ checkpoints matches or has higher reward than PPO while having a lower KL constraint.

[Figure 5.3 panels: $\sqrt{\mathrm{KL}(\pi \| \pi_0)}$ vs. RM Score (left); perplexity vs. RM Score (right). Legend: PPO++, AggreVaTeD, PPO, SFT.]
Figure 5.3: We investigate the reward optimization, KL-constraint, and fluency trade-off in our TL;DR summarization task. The dashed line represents our SFT policy’s performance across each metric. Both PPO++ and AggreVaTeD learn a policy that has a better trade-off than PPO.

[Figure 5.4 axes: mean CIDEr-D score vs. prompt difficulty (easy, hard). Legend: T5, SFT, PPO, AggreVaTeD, LOLS, PPO++, D2LOLS.]
Figure 5.4: Comparison of CIDEr-D scores grouped by prompt difficulty on CommonGen. The performance gap between easy and hard prompts is evident for SFT, and PPO++, while our proposed algorithms AggreVaTeD, LOLS and D2LOLS exhibit a significantly smaller gap, showcasing their effectiveness on challenging prompts.

RLGF Performance on Difficult Prompts. Our evaluation was carried out on the CommonGen task, where we categorized the prompts based on their difficulty level. For CommonGen, we classify the prompts into easy and hard based on the number of unseen concepts in the prompt. Specifically, we categorized prompts with 3 concepts as easy and prompts with more than 3 concepts as hard. Figure 5.4 presents a comparison of scores for different algorithms grouped by prompt difficulty.
The results reveal a notable performance gap between easy and hard prompts for algorithms such as SFT and PPO, whereas our proposed algorithms PPO++, AggreVaTeD, LOLS, and D2LOLS exhibit a smaller gap, with D2LOLS having the smallest gap. In other words, even on challenging prompts, our interactive algorithms produce better text continuations. See Appendix D.5 for example generations.

MLE and KL coefficient Sensitivity. We test the sensitivity of PPO and RLGF algorithms to two regularization hyperparameters in the $J_{\text{ppo}}$ objective, namely the KL coefficient, $\lambda$, and the MLE coefficient, $\eta$. The left two plots in Figure 5.5 show the reward and perplexity when we keep $\eta$ fixed and vary $\lambda$, while the right two show the performance when we keep $\lambda$ fixed and vary $\eta$. As shown in the left two plots, all RL algorithms are robust to varying KL coefficients. We see that when we vary $\lambda$, our algorithms PPO++ and D2LOLS and the baseline PPO have similar rewards, but our algorithms consistently maintain a lower (or equal) perplexity than PPO. From the right two plots, we observe much more instability in perplexity when relaxing our MLE regularization, with the perplexities of both PPO and the RLGF algorithms blowing up. Note that when increasing $\eta$, our algorithm PPO++ consistently has higher rewards and lower (or equal) perplexity than PPO.

[Figure 5.5 panels, left to right: reward score vs. KL coefficient ($\lambda$); perplexity vs. KL coefficient ($\lambda$); reward score vs. MLE coefficient ($\eta$); perplexity vs. MLE coefficient ($\eta$). Legend: PPO, AggreVaTeD, PPO++, LOLS, D2LOLS.]
Figure 5.5: $J_{\text{ppo}}$ KL coefficient ($\lambda$) and MLE coefficient ($\eta$) ablation. We show the sensitivity of PPO and RLGF algorithms to each regularization term in the objective. Note that all RL algorithms are robust to changes in the KL coefficient (Left), with relatively minor changes in the perplexity, while being more sensitive to changes in the MLE objective (Right), with blowups in the perplexity.
5.7 Conclusion and Future Work

We presented a unifying framework for incorporating a guide policy to enhance reinforcement learning for natural language generation. Through theoretical justification and experimental validation, we demonstrate that our RLGF framework can outperform PPO for fine-tuning LLMs. Our proposed algorithms PPO++ and D2LOLS only require black-box access to the guide policy and are conceptually simple and easy to implement on top of PPO. While our experiments demonstrate that supervised fine-tuned models with standard decoding strategies are good candidates for the guide policy, our framework is general enough to leverage any LLM as the guide policy, including those that are not open-sourced. Finally, RLGF’s contributions to the broader large language model literature are complementary to model enhancements, dataset improvements, and prompting discoveries such as in-context prompting. We leave it to exciting future work to test the full capabilities of bootstrapping the state-of-the-art advancements in each research direction with RLGF to improve reinforcement learning for natural language generation.

CHAPTER 6
PROVABLY EFFICIENT RL WITH PREFERENCE-BASED FEEDBACK VIA DATASET RESET

Reinforcement Learning (RL) from preference-based feedback is a popular paradigm for fine-tuning generative models, which has produced impressive models such as GPT-4 and Claude3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset, followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of reset, we propose a new RL algorithm that can learn from preference-based feedback with provable guarantees.
Motivated by the fact that the offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy that is covered by the offline dataset under general function approximation with finite sample complexity. In experiments, we demonstrate that on both the TL;DR summarization and the Anthropic Helpful and Harmless (HH) datasets, the generations from DR-PO are better than those from Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), under the metric of GPT4 win-rate.

6.1 Introduction

Reinforcement learning aims at maximizing a cumulative reward function. However, specifying a reward function in practice can be challenging (Wirth et al., 2017). Reinforcement Learning with Human Feedback (RLHF) has become an effective approach when a reward function does not exist (Christiano et al., 2017). Operating under a setting where human labelers provide preference-based feedback (e.g., ranking of generations from an RL agent), RLHF learns a reward model and then optimizes the reward model via RL techniques. RLHF has found applications across various domains, including games (MacGlashan et al., 2017; Christiano et al., 2017; Warnell et al., 2018), large language models (LLMs) (Ziegler et al., 2019; Stiennon et al., 2020; Wu et al., 2021; Nakano et al., 2021; Ouyang et al., 2022; Glaese et al., 2022; Bai et al., 2022a; Ramamurthy et al., 2022b; Liu et al., 2023b), and robot learning (Brown et al., 2019; Shin et al., 2023).
RLHF typically consists of the following two steps: (1) fitting a reward model using a pre-collected offline preference-based dataset (often generated from some pre-trained models and labeled by humans), and (2) learning a policy via online RL (e.g., Proximal Policy Optimization (Schulman et al., 2017b)) to optimize the learned reward model. These two steps are often done separately in the sense that once the reward model is learned, step (2) only optimizes the reward model without ever using the offline preference dataset. Is there any benefit to re-using the offline data while optimizing the reward model via online RL? Prior work on hybrid RL (Song et al., 2022; Ball et al., 2023) demonstrated that combining offline data and online data can often significantly boost learning efficiency. Can we achieve a similar boost in learning efficiency for RLHF?

Towards answering this, we propose an algorithm called Dataset Reset Policy Optimization (DR-PO), operating under the assumption of being able to reset, i.e., we can go back to any state and start policy optimization and data collection from that point (as opposed to resetting only to initial states). While being able to reset is certainly an assumption, it is naturally satisfied when using RL to fine-tune generative models such as language models and diffusion models (Lee et al., 2023b). This is because the underlying Markov transitions are simple, known, and deterministic. Our algorithm, DR-PO, is a hybrid RL approach which integrates offline data into an online RL procedure: when collecting online data, DR-PO resets the policy optimizer to the states in the offline dataset for exploration.
Algorithmically, DR-PO is simple: it iteratively collects a batch of online data by resetting the policy to states in the offline data, performs policy rollouts, and optimizes the policy using the online batch via policy optimization techniques such as Natural Policy Gradient (NPG) (Kakade, 2001a) or Actor-critic methods (e.g., PPO (Schulman et al., 2017b)). While DR-PO is as simple to implement as most of the existing policy optimization algorithms, we demonstrate that DR-PO achieves strong theoretical guarantees under natural assumptions. Specifically, DR-PO is capable of learning a policy that is at least as good as any policy which is covered by the offline data in terms of maximizing the ground truth rewards, and DR-PO achieves this result under general function approximation with finite sample complexity. DR-PO is also computationally tractable since it only requires supervised learning style oracles such as a Maximum Likelihood Estimation (MLE) oracle (for fitting reward models) and a Least Squares Regression oracle (for learning value functions). Thus DR-PO advances the status of the theoretical work on RLHF (see more detailed discussion in Section 6.1.1).

To support our new theory, we test our approach on a standard RLHF dataset: TL;DR summarization (Stiennon et al., 2020). We demonstrate that the summaries generated by DR-PO outperform those from PPO and DPO (Rafailov et al., 2023b) in terms of GPT4 win-rate. We also show that when transferring the policies trained on TL;DR to the CNN/DailyMail news articles in a zero-shot manner, policies trained via DR-PO again generate summaries that outperform those from PPO and DPO. Finally, we test how DR-PO scales on Anthropic HH (Bai et al., 2022b) across three different model scales and show that DR-PO scales just as well as PPO while still outperforming baselines.

6.1.1 Related Work

Provably efficient RLHF.
The theoretical investigation of online RLHF started in the bandit setting with the notion of dueling bandits (Yue et al., 2012; Zoghi et al., 2014; Dudík et al., 2015), which aims at identifying the optimal arm with human preference feedback over action pairs. Extending this discussion to tabular MDPs, Novoseller et al. (2020) propose a dueling posterior sampling algorithm that requires computing and sampling from the posterior of the model dynamics and reward function, leading to potential computational inefficiency. Another PAC RLHF algorithm for tabular MDPs is presented by Xu et al. (2020). However, this method involves computing complicated bonus terms to guide exploration. Additionally, Pacchiano et al. (2021) and Chen et al. (2022) have designed online RLHF algorithms with provable guarantees by iteratively updating a confidence set of policies, which, unfortunately, is not practically feasible either. In a more recent study, Zhan et al. (2023b) tackle the problem of reward-free RLHF. Nevertheless, their algorithm introduces a series of non-convex optimization problems which are challenging to solve. Notably, these works either only focus on tabular MDPs (Novoseller et al., 2020; Xu et al., 2020; Pacchiano et al., 2021) or rely on specialized function approximation such as linear parametrization (Pacchiano et al., 2021; Zhan et al., 2023b) and function classes with small Eluder dimension (Chen et al., 2022; Wu and Sun, 2023), which further restricts their application in practice. In contrast, we focus on the setting where preference-labeled data is only available offline, which is more consistent with the settings considered in applications of fine-tuning language models. Also, by using the idea of dataset reset, our algorithm works with function approximation that is much more general than that of the above prior works.

The study of theoretical offline RLHF is more limited. Li et al.
(2023) focus on learning the reward from a human's behavior in dynamic discrete choice models rather than from human preference feedback, and thus the setting is different. Zhu et al. (2023a) study PAC algorithms for linear models and Zhan et al. (2023a) extend the analysis to general function approximation. However, neither of their algorithms is computationally efficient because they rely on constructing a confidence set for the reward function and solving a constrained maximin problem.

Tiapkin et al. (2023) studied the setting where high-quality expert demonstrations exist. They use behavior cloning to train a policy using expert demonstrations and then run an Upper-Confidence-Bound (UCB) style algorithm to optimize a reward function under a KL regularization to the behavior-cloned policy. They show that for tabular and linear MDPs, the expert demonstrations reduce the sample complexity of online RL. We consider preference-based offline datasets, which may not necessarily come from a high-quality expert, and function approximation that is significantly more general than linear and tabular functions. Note that UCB-based algorithms can quickly become computationally intractable beyond tabular and linear settings (e.g., Jiang et al. (2016); Du et al. (2021)). Our algorithm uses the idea of dataset reset for exploration and does not involve any optimism-based exploration strategy, making it computationally tractable even when dealing with general function approximation. We think that the key idea of dataset reset can also be used in the setting from Tiapkin et al. (2023) to make their algorithm extend beyond the tabular and linear MDP settings.

Empirical RLHF algorithms. This work continues the recent literature of RLHF algorithms that perform online RL (Zhu et al., 2023b; Wu et al., 2023; Chang et al., 2023) to finetune large generative models.
There have also been efforts to build on top of DPO (Rafailov et al., 2023b) with algorithms such as IPO (Azar et al., 2011) and KTO (Contextual.ai, 2023). Our work is complementary to many of these efforts in augmenting RL through the incorporation of dataset resets in online generation. Ideas from this work could directly be applied to existing online RLHF algorithms such as P3O (Wu et al., 2023) and APA (Zhu et al., 2023b). Given the recent work (Yuan et al., 2024) on incorporating online generations to improve DPO, an offline RLHF method, the idea of dataset resets could also be relevant in this space of hybrid RLHF methods.

Using reset in RL. The idea of reset is not new in RL (Kakade et al., 2003a; Bagnell, 2004; Nair et al., 2018; Salimans and Chen, 2018; Yin et al., 2022; Uchendu et al., 2023; Silver et al., 2016a; Agarwal et al., 2019; Daumé III and Marcu, 2005; Daumé III et al., 2009). When resetting is available, it helps address exploration and credit assignment problems. In this work, we show that resetting to an offline dataset helps in RLHF. The key challenge in RLHF is that the reward model is learned purely from offline data, which may not have global coverage of the entire state space. Our algorithm incorporates KL regularization to ensure the learned policies do not deviate too much from the offline data so that we do not over-optimize the learned reward model (e.g., reward hacking). While the idea of KL regularization was also used in prior empirical RLHF works (e.g., Stiennon et al. (2020); Bai et al. (2022a)), we show that by combining the two key ideas, KL regularization and dataset reset, our algorithm achieves strong performance in both theory and practice. We also demonstrate the efficacy of our approach in the application of fine-tuning language models.

6.2 Preliminaries

Markov Decision Processes.
In this paper we consider an episodic time-inhomogeneous Markov Decision Process (MDP) $\mathcal{M}$ with state space $\mathcal{S} = \{\mathcal{S}_h\}_{h=1}^{H}$, action space $\mathcal{A}$, and horizon $H$. Here $\mathcal{S}_h$ is the subspace of all states at step $h$. We suppose the states incorporate the information of the current step and thus $\{\mathcal{S}_h\}_{h=1}^{H}$ are mutually disjoint. We assume that every episode begins at the same state $s_1$ and ends at the dummy state $s_{H+1}$, but our analysis can be extended to a random starting state easily. In each episode, at step $h \in [H]$, the agent observes the current state $s_h$ and executes an action $a_h$. Then the environment generates a reward $r^\star(s_h, a_h)$ (which can be unobservable to the agent), and transitions to a new state $s_{h+1}$, which is sampled from the transition probability $P(\cdot \mid s_h, a_h)$. Here we suppose the reward function $r^\star : \mathcal{S} \times \mathcal{A} \mapsto [0, 1]$ is bounded, and for any possible trajectory $\tau = (s_h, a_h)_{h=1}^{H}$, we have $\sum_{h=1}^{H} r^\star(s_h, a_h) \le r_{\max}$. Note that when the reward is sparse, $r_{\max}$ can be much smaller than $H$.

A policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ specifies the action selection probability of the agent conditioned on the current state. Given a policy $\pi$, we define its state-action visitation measure as $d^{\pi}_h(s, a) = P^{\pi}(s_h = s, a_h = a)$ for all $s \in \mathcal{S}_h$, $a \in \mathcal{A}$, $h \in [H]$, where $P^{\pi}(\cdot)$ denotes the distribution of the trajectory when executing policy $\pi$. We will also use $d^{\pi}_h(s) = \sum_{a \in \mathcal{A}} d^{\pi}_h(s, a)$ to denote the state visitation measure and $d^{\pi}(\tau)$ to denote the distribution of the trajectory under policy $\pi$. We can further define the associated value functions and Q functions of policy $\pi$ and reward function $r$ as

$$V^{\pi,r}(s) = \mathbb{E}_{\pi}\Big[\sum_{t=h}^{H} r(s_t, a_t) \,\Big|\, s_h = s\Big], \qquad Q^{\pi,r}(s, a) = \mathbb{E}_{\pi}\Big[\sum_{t=h}^{H} r(s_t, a_t) \,\Big|\, s_h = s, a_h = a\Big]$$

for all $h \in [H]$, $s \in \mathcal{S}_h$, $a \in \mathcal{A}$ (footnote 1). They characterize the expected cumulative reward under policy $\pi$ starting from a state or a state-action pair.
We aim to find an $\epsilon$-optimal policy $\hat{\pi}$ with respect to the true reward $r^\star$ and a target policy $\pi^\star$, by which we denote some high-quality policy ($\pi^\star$ is not necessarily the globally optimal policy), i.e., $V^{\pi^\star, r^\star}(s_1) - V^{\hat{\pi}, r^\star}(s_1) \le \epsilon$. In particular, we only utilize common oracles such as Maximum Likelihood Estimation (MLE) and Least Squares Regression (LSR). We also want our algorithms to be able to leverage general function classes.

Footnote 1: For notational simplicity, we drop the usual subscript $h$ in value functions, as we have assumed state $s$ contains the information of time step $h$.

RL from Human Feedback (RLHF). We consider the setting where the true reward $r^\star$ is unobservable. Instead, we have access to an offline trajectory-pair dataset $\mathcal{D}_R = \{(\tau^0_m, \tau^1_m, o_m)\}_{m=1}^{M}$ labeled with human preference, where the trajectories $\tau^0_m$ and $\tau^1_m$ are sampled i.i.d. from some pre-trained policy $\pi_{\mathrm{SFT}}$ (e.g., in NLP tasks, this can be the instruction fine-tuned policy, which is also called the supervised fine-tuned (SFT) policy). In this work, we do not explicitly consider the learning procedure of $\pi_{\mathrm{SFT}}$, and we assume it is given to us. Here $o_m \in \{0, 1\}$ characterizes the human preference over the trajectory pair $(\tau^0_m, \tau^1_m)$, and we suppose the human preference is modeled by a monotonically increasing link function $\Phi$:

$$P(o = 1 \mid \tau^0, \tau^1) = P(\tau^1 \succ \tau^0) = \Phi\big(r^\star(\tau^1) - r^\star(\tau^0)\big),$$

where we use $r^\star(\tau)$ to denote $\sum_{h=1}^{H} r^\star(s_h, a_h)$ for any trajectory $\tau = (s_h, a_h)_{h=1}^{H}$. A widely-used model is the Bradley-Terry-Luce (BTL) model (Bradley and Terry, 1952), where the link function is chosen to be the sigmoid function $\sigma(x) = 1/\{1 + \exp(-x)\}$. We will use $\kappa = \frac{1}{\inf_{x \in [-r_{\max}, r_{\max}]} \Phi'(x)}$ to measure the non-linearity of the link function $\Phi$, which in turn reflects the hardness of learning the reward model from the human preference.
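To make the BTL quantities above concrete, the following is a minimal sketch, assuming the sigmoid link and scalar trajectory rewards. The function names (`btl_preference_prob`, `btl_nll`, `btl_kappa`) are hypothetical and introduced only for illustration. For the sigmoid, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ is even and decreasing in $|x|$, so the infimum over $[-r_{\max}, r_{\max}]$ is attained at $|x| = r_{\max}$, which gives $\kappa = 2 + e^{r_{\max}} + e^{-r_{\max}}$.

```python
import math


def sigmoid(x):
    """Sigmoid link function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def btl_preference_prob(r_tau0, r_tau1):
    """P(o = 1 | tau0, tau1) = sigma(r(tau1) - r(tau0)) under the BTL model."""
    return sigmoid(r_tau1 - r_tau0)


def btl_nll(r_tau0, r_tau1, o):
    """Negative log-likelihood of one preference label o in {0, 1}."""
    # x is the reward margin in favor of the trajectory that was preferred.
    x = (r_tau1 - r_tau0) if o == 1 else (r_tau0 - r_tau1)
    # -log sigma(x) = log(1 + exp(-x))
    return math.log1p(math.exp(-x))


def btl_kappa(r_max):
    """kappa = 1 / inf_{|x| <= r_max} sigma'(x), with the infimum at |x| = r_max."""
    s = sigmoid(r_max)
    return 1.0 / (s * (1.0 - s))
```

Under these assumptions, $\kappa$ grows exponentially in $r_{\max}$, which is one way to see why a large reward range makes the preference labels less informative about reward differences.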
Given $\mathcal{D}_R$, we can learn a reward model $\hat{r}$ using MLE:

$$\hat{r} = \arg\min_{r \in \mathcal{R}} \sum_{m=1}^{M} -\log P_r(o = o_m \mid \tau^0_m, \tau^1_m), \qquad (6.1)$$

where $P_r$ denotes the preference probability under the candidate reward $r$. With the BTL model, each term of the above NLL becomes

$$\mathbb{1}(o_m = 1) \cdot \log\big(1 + \exp(r(\tau^0_m) - r(\tau^1_m))\big) + \mathbb{1}(o_m = 0) \cdot \log\big(1 + \exp(r(\tau^1_m) - r(\tau^0_m))\big),$$

which is a loss function that has been used in many prior RLHF works (Christiano et al., 2017; Stiennon et al., 2020). We also assume that we have an unlabeled dataset $\mathcal{D}_{\mathrm{TR}} = \{\tau_n\}_{n=1}^{N}$ where each $\tau_n$ is sampled i.i.d. from $\pi_{\mathrm{SFT}}$.

The Ability to Reset. We consider the setting where we can reset the system. More formally, given any state $s_h$ at time step $h$, we can reset the RL agent directly to $s_h$ and roll out a policy $\pi$. While this is certainly an assumption, it is satisfied in many important applications, e.g., fine-tuning generative models such as LLMs (Ouyang et al., 2022) and diffusion models (Lee et al., 2023b) with RL. In text generation, a state $s_h$ typically means a partial sentence. Resetting from this state would then mean that we feed the partial sentence $s_h$ to a transformer-based policy and have it generate new tokens one by one starting from the given partial sentence. We emphasize that in the RL literature, prior works (e.g., PPO and many RL theoretical works (Agarwal et al., 2021a; Azar et al., 2017; Jin et al., 2020a; Zhan et al., 2022)) typically do not assume the ability to reset; they often assume the agent always has to start from some initial state. However, when reset is available, it is often a game changer, both in theory (Yin et al., 2022) and in practice (e.g., AlphaGo (Silver et al., 2016a)).

6.3 Dataset Reset Policy Optimization

We present a meta-algorithm here to provide the details of how we leverage the idea of dataset reset to collect online batch data. We abstract away the policy optimization oracle here for the purpose of emphasizing the novelty in terms of how we interact with the environment for online data collection.
Once the online batch data is collected, we feed it to a policy optimization oracle, e.g., PG, NPG, Actor-critic methods, or a PPO-style update (footnote 2).

Footnote 2: Here we mean the specific actor-critic style policy optimization formulation where clipping is used to ensure a small policy update, and the critic is learned via GAE, on a given online batch of data (Schulman et al., 2017b).

Algorithm 9 Dataset Reset Policy Optimization (DR-PO)
1: Input: Preference dataset $\mathcal{D}_R$, unlabeled dataset $\mathcal{D}_{\mathrm{TR}}$, reward function class $\mathcal{R}$, total number of iterations $T$.
2: Initialize: $\pi^1 = \pi_{\mathrm{SFT}}$.
3: Learn a reward model $\hat{r}$ via MLE based on Eq. (6.1).
4: for $t = 1, \cdots, T$ do
5:     Initialize an empty online batch $\mathcal{D}_{\mathrm{on}}$. /* Online data collection */
6:     for $n = 1, \cdots, N$ do
7:         Randomly sample a trajectory in $\mathcal{D}_{\mathrm{TR}}$ and a state $s_h$ from it where $h \in [H]$.
8:         Reset $\pi^t$ to $s_h$ and roll out $\pi^t$ to generate trajectory $\{s_h, a_h, \ldots, s_H, a_H\}$.
9:         Add trajectory $\{s_{h'}, a_{h'}, \hat{r}(s_{h'}, a_{h'}), \ln(\pi^t(a_{h'}|s_{h'})/\pi_{\mathrm{SFT}}(a_{h'}|s_{h'}))\}_{h'=h}^{H}$ to $\mathcal{D}_{\mathrm{on}}$.
10:    end for
11:    Policy update: $\pi^{t+1} \Leftarrow \mathrm{PO}(\pi^t, \mathcal{D}_{\mathrm{on}})$. {PG, NPG / TRPO, CPI, Actor-Critic, PPO}
12: end for

Algorithm 9 summarizes the key idea of dataset reset in DR-PO. The key difference between DR-PO and a more standard policy optimizer is that in DR-PO, for each episode, the policy collects online trajectories via resetting to a state randomly sampled from some trajectory in the offline dataset $\mathcal{D}_{\mathrm{TR}}$. In other words, we do not roll out the policy $\pi$ from the initial state $s_1$ as is typically done in standard policy optimization algorithms like PG. The online data collection procedure collects a batch of online trajectories $\mathcal{D}_{\mathrm{on}}$. Note that for each online trajectory, we record each state-action pair's reward measured under the learned reward model $\hat{r}$, and also the log ratio of $\pi^t$ and $\pi_{\mathrm{SFT}}$, which serves as an empirical estimate of the policy KL divergence, i.e., $\mathrm{KL}(\pi^t(s_{h'}) \,\|\, \pi_{\mathrm{SFT}}(s_{h'}))$.
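The online data collection loop of Algorithm 9 (lines 5 to 10) can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the implementation used in the experiments: states are token prefixes with deterministic append-only transitions, policies are plain functions returning a dictionary of next-token probabilities, and every name (`collect_online_batch`, `policy`, `sft_policy`, `reward_fn`) is hypothetical.

```python
import math
import random


def collect_online_batch(policy, sft_policy, reward_fn, offline_trajs,
                         horizon, n_rollouts, rng):
    """Collect one online batch by resetting to states from offline data.

    offline_trajs: list of token lists of length `horizon` (stands in for D_TR).
    policy, sft_policy: map a prefix (tuple of tokens) to {token: probability};
        assumed to share the same support so the log ratio is finite.
    reward_fn: maps (prefix, token) to a float (stands in for the learned r-hat).
    """
    batch = []
    for _ in range(n_rollouts):
        # Sample an offline trajectory and a reset step; the state at that
        # step is the prefix made of the first `cut` tokens (0 = initial state).
        traj = rng.choice(offline_trajs)
        cut = rng.randrange(horizon)
        state = tuple(traj[:cut])
        rollout = []
        # Roll out the current policy from the reset state to the horizon.
        while len(state) < horizon:
            probs = policy(state)
            tokens = list(probs)
            action = rng.choices(tokens, weights=[probs[t] for t in tokens])[0]
            # Per-step log ratio, used as an empirical KL estimate.
            log_ratio = math.log(probs[action] / sft_policy(state)[action])
            rollout.append((state, action, reward_fn(state, action), log_ratio))
            state = state + (action,)  # deterministic transition: append token
        batch.append(rollout)
    return batch
```

Because the transition is simply "append the sampled token", resetting costs nothing beyond feeding a longer prefix to the policy, which is the property of text generation that the reset assumption relies on.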
Such a KL divergence term can be optionally used as a reward penalty to ensure the learned policies do not deviate too far from $\pi_{\mathrm{SFT}}$, so that the reward model $\hat{r}$ stays a good approximation of the true reward $r^\star$ under the learned policies' trajectory distributions. We use this KL penalty both in theory and in practice.

Once the online data is collected, we feed it to a policy optimization oracle PO for a policy update. A PO oracle can be a PG, NPG, or PPO style update. To be more specific, for a PPO style update procedure, we use $\mathcal{D}_{\mathrm{on}}$ to fit a critic for advantage estimation $\hat{A}(s, a)$ (footnote 3) (e.g., via generalized advantage estimation used in PPO), and then update the policy on $\mathcal{D}_{\mathrm{on}}$ with the clipping trick:

$$\pi^{t+1} \Leftarrow \arg\max_{\pi} \sum_{s, a \in \mathcal{D}_{\mathrm{on}}} \mathrm{Clip}\left(\frac{\pi(a|s)}{\pi^{t}(a|s)}\right) \hat{A}(s, a).$$

This is the policy update that we use in our experiments. In our theory, we use NPG as the PO oracle. While PPO and NPG are different when it comes to exact implementation, PPO can be understood as a heuristic that approximates NPG for the purpose of being more scalable for large-scale optimization (e.g., the clipping trick introduced by PPO approximately ensures that the new policy does not deviate too much from the old one, a key property that NPG methods advocate for (Kakade, 2001a; Kakade and Langford, 2002a; Bagnell and Schneider, 2003; Schulman et al., 2015a)).

Footnote 3: When using the KL penalty, this advantage function measures the advantage under the KL-regularized reward $\hat{r} - \lambda \mathrm{KL}$, with $\lambda \in \mathbb{R}^{+}$ as the coefficient for the KL penalty.

6.4 Theoretical Analysis

In this section, we analyze DR-PO (Alg. 9) by instantiating the policy optimization oracle PO to be a Natural Policy Gradient (NPG) oracle. For completeness, we describe PO in Algorithm 10, which at a high level consists of policy evaluation via least squares regression, followed by a policy update via a Mirror Descent style procedure. We leave the detailed full description of the algorithm to Appendix E.1. In Alg.
10, we use the online data to fit a Q function estimate of the current policy $\pi^t$. Note that here we do not use the KL penalty $\ln(\pi^t(a_{h'}|s_{h'})/\pi_{\mathrm{SFT}}(a_{h'}|s_{h'}))$ directly when calculating the trajectory total reward. In Appendix E.3, we provide a version of NPG which includes the KL penalty when calculating the trajectory's total reward, together with the corresponding analysis. Once we learn the critic, we perform the policy update by running KL-based Mirror Descent. Note that this step has a closed-form expression for $\pi^{t+1}$:

$$\pi^{t+1}(a|s) \propto \big(\pi_{\mathrm{SFT}}(a|s)\big)^{\frac{\eta\lambda}{\eta\lambda+1}} \cdot \big(\pi^{t}(a|s)\big)^{\frac{1}{\eta\lambda+1}} \cdot \exp\left(\frac{\eta}{\eta\lambda+1} \cdot Q(s, a)\right).$$

Note that the KL penalty to $\pi_{\mathrm{SFT}}$ in the policy update procedure is important to ensure that $\pi^{t+1}$ does not deviate too much from $\pi_{\mathrm{SFT}}$. Also, this type of update ensures that the support of $\pi^t(\cdot|s)$ is always a subset of the support of $\pi_{\mathrm{SFT}}(\cdot|s)$ for all states $s$.

Algorithm 10 NPG update for the PO oracle in Alg. 9
1: Input: Online dataset $\mathcal{D}_{\mathrm{on}}$, the previous policy $\pi^t$, Q function class $\mathcal{F}$, regularization parameter $\lambda$, learning rate $\eta$.
2: Create an empty regression dataset $\mathcal{D}$.
3: for each (partial) trajectory $\tau$ in $\mathcal{D}_{\mathrm{on}}$ do
4:     Take the first state-action pair $(s_h, a_h)$ in $\tau$ and calculate the total reward $y = \sum_{h'=h}^{H} \hat{r}(s_{h'}, a_{h'})$.
5:     Add $((s_h, a_h), y)$ to $\mathcal{D}$.
6: end for
7: Learn critics: $Q = \arg\min_{f \in \mathcal{F}} \frac{1}{|\mathcal{D}|} \sum_{(s,a,y) \in \mathcal{D}} \big[(f(s, a) - y)^2\big]$.
8: Policy update: $\pi^{t+1}(s) = \arg\min_{p \in \Delta(\mathcal{A})} \langle -Q(s, \cdot), p \rangle + \lambda \mathrm{KL}(p \,\|\, \pi_{\mathrm{SFT}}(s)) + \frac{1}{\eta} \mathrm{KL}(p \,\|\, \pi^{t}(s)), \ \forall s$.

Remark 20. Tiapkin et al. (2023) also investigate the theoretical guarantee of KL divergence regularization in RLHF. However, there are two key differences between our work and theirs. First, they consider a behavior cloning setting where there exists an expert demonstration dataset from a near-optimal expert, while our offline dataset and the supervised finetuned policy $\pi_{\mathrm{SFT}}$ can be quite sub-optimal (we only need $\pi_{\mathrm{SFT}}$ to cover the target policy $\pi^\star$ in the subsequent analysis).
Second, they implement a UCBVI-type algorithm for policy alignment, which requires optimistic planning and is limited to tabular MDPs and linear MDPs only. Designing computationally tractable UCB-style algorithms beyond tabular or linear models is more challenging. In contrast, our algorithm combines offline data and online data and modifies NPG by resetting it to the offline dataset. While resetting is an assumption, it naturally holds in applications such as fine-tuning LLMs. We also believe the idea of dataset reset can be used in the setting from Tiapkin et al. (2023) to extend their results beyond tabular or linear models.

Remark 21. Though we mainly focus on the settings where we can reset, when resetting is not possible (e.g., real robotics applications), we can implement the reset by a roll-in and roll-out procedure since we have access to $\pi_{\mathrm{SFT}}$: we roll in $\pi_{\mathrm{SFT}}$ to some $s_h$, and then continue by rolling out the policy that is being optimized. This procedure is closely related to the PPO++ algorithm proposed in Chang et al. (2023), where the authors empirically demonstrated that it outperforms vanilla PPO on some RLHF benchmarks (but without a detailed theoretical investigation). When resetting is available, by directly resetting to the offline data generated by $\pi_{\mathrm{SFT}}$, we further reduce computation.

6.4.1 Theoretical Sample Complexity

Now we introduce the required assumptions in our analysis.

Function classes. We first assume that the reward function class and Q function class are realizable and bounded:

Assumption 22 (reward function classes). Suppose that we have $r^\star \in \mathcal{R}$. In addition, assume that $0 \le r(\tau) \le r_{\max}$ for all $r \in \mathcal{R}$ and trajectories $\tau$.

Assumption 23 (Q function classes). Suppose that we have $Q^{\pi^t, \hat{r}} \in \mathcal{F}$ for all $t \in [T]$. In addition, assume that $0 \le f(s, a) \le r_{\max}$ for all $f \in \mathcal{F}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$.

Realizability is a standard assumption used in the theoretical analysis of supervised learning.
It is possible to extend our analysis to the setting where model misspecification exists, and we leave this extension as future work.

Concentrability. Then we assume that $\pi_{\mathrm{SFT}}$ can cover the comparator policy $\pi^\star$:

Assumption 24 (single-policy concentrability). Suppose that we have:

$$(1)\ \max_{\tau} \frac{d^{\pi^\star}(\tau)}{d^{\pi_{\mathrm{SFT}}}(\tau)} = C_{\mathrm{TR}} < \infty; \qquad (2)\ \max_{h \in [H],\, s \in \mathcal{S}_h,\, a \in \mathcal{A}} \frac{d^{\pi^\star}_h(s, a)}{d^{\pi_{\mathrm{SFT}}}_h(s, a)} = C_{\mathrm{ST}} < \infty.$$

Note that in Assumption 24 we need $\pi_{\mathrm{SFT}}$ to cover $\pi^\star$, both trajectory-wise and state-action-wise. In particular, we always have $C_{\mathrm{ST}} \le C_{\mathrm{TR}}$. Assuming trajectory-wise coverage is necessary in RLHF because the human feedback is also trajectory-wise, as shown by the lower bounds in Zhan et al. (2023a). Intuitively, if the offline data only covers the traces of low-performance policies, then the learned reward model is not guaranteed to recognize trajectories from a high-performance policy at test time (because it has never seen such trajectories in training).

Remark 25. We can indeed relax Assumption 24 by leveraging the information in $\mathcal{R}$ and $\mathcal{F}$, as shown in the discussion in Appendix E.2.

Under the above assumptions, we have the following theorem to characterize the suboptimality of $\hat{\pi}$ returned by Algorithm 16. Recall that $\kappa = \frac{1}{\inf_{x \in [-r_{\max}, r_{\max}]} \Phi'(x)}$ measures the non-linearity of the link function $\Phi$.

Theorem 26. Suppose Assumptions 22, 23, and 24 hold. For any $\delta \in (0, 1]$, let

$$\epsilon_{\mathrm{MLE}} := \Theta\left(\sqrt{\frac{\kappa^2}{M} \log\frac{|\mathcal{R}|}{\delta}}\right), \qquad \epsilon_{\mathrm{eval}} := \Theta\left(\sqrt{\frac{r_{\max}^2}{N} \log\frac{T|\mathcal{F}|}{\delta}}\right),$$

and set

$$T = \frac{H^{\frac{2}{5}} r_{\max}^{\frac{4}{5}}}{\epsilon_{\mathrm{MLE}}^{\frac{4}{5}}} \wedge \frac{r_{\max}}{\epsilon_{\mathrm{eval}}}, \qquad \eta = \sqrt{\frac{1}{T r_{\max}^2}}, \qquad \lambda = \frac{T^{\frac{1}{3}} r_{\max}^{\frac{1}{3}} \epsilon_{\mathrm{MLE}}^{\frac{2}{3}}}{H^{\frac{1}{3}}}.$$

Then with probability at least $1 - \delta$, Algorithm 9 with NPG update (Algorithm 10) returns a policy $\hat{\pi}$ which satisfies

$$V^{\pi^\star, r^\star}(s_1) - V^{\hat{\pi}, r^\star}(s_1) \le H^{\frac{4}{5}} r_{\max}^{\frac{3}{5}} \epsilon_{\mathrm{MLE}}^{\frac{2}{5}} \log C_{\mathrm{ST}} + \sqrt{C_{\mathrm{TR}}}\, \epsilon_{\mathrm{MLE}} + H \sqrt{r_{\max} C_{\mathrm{ST}} \epsilon_{\mathrm{eval}}}.$$

Theorem 26 indicates that the suboptimality of $\hat{\pi}$ scales with $\frac{1}{M}$ and $\frac{1}{N}$ polynomially.
In particular, from Theorem 26, we can easily obtain the sample complexity of DR-PO, as shown in the following corollary:

Corollary 27. Suppose Assumptions 22, 23, and 24 hold and set

$$T = \frac{H^{\frac{2}{5}} r_{\max}^{\frac{4}{5}}}{\epsilon_{\mathrm{MLE}}^{\frac{4}{5}}} \wedge \frac{r_{\max}}{\epsilon_{\mathrm{eval}}}, \qquad \eta = \sqrt{\frac{1}{T r_{\max}^2}}, \qquad \lambda = \frac{T^{\frac{1}{3}} r_{\max}^{\frac{1}{3}} \epsilon_{\mathrm{MLE}}^{\frac{2}{3}}}{H^{\frac{1}{3}}}.$$

Then if we have

$$M = \Omega\left(\left(\frac{H^4 r_{\max}^3 \log^5 C_{\mathrm{ST}}}{\epsilon^5} + \frac{C_{\mathrm{TR}}}{\epsilon^2}\right) \kappa^2 \log\frac{|\mathcal{R}|}{\delta}\right), \qquad N = \Omega\left(\frac{H^4 r_{\max}^4 C_{\mathrm{ST}}^2}{\epsilon^4} \log\frac{T|\mathcal{F}|}{\delta}\right),$$

we have with probability at least $1 - \delta$ that Algorithm 9 with NPG update (Algorithm 10) returns a policy $\hat{\pi}$ which satisfies

$$V^{\pi^\star, r^\star}(s_1) - V^{\hat{\pi}, r^\star}(s_1) \le \epsilon.$$

Theorem 26 and Corollary 27 indicate that DR-PO with NPG update is capable of finding an $\epsilon$-optimal policy with polynomial sample complexity, i.e., $\tilde{O}(1/\epsilon^5)$ labeled trajectory pairs and $\tilde{O}(1/\epsilon^4)$ unlabeled trajectories under single-policy concentrability. Algorithmically, our algorithm does not require pessimism and is model-free, which makes it much easier to implement and more practical than the pessimistic model-based algorithm proposed in Zhan et al. (2023a).

Remark 28 (Tighter bounds). In Appendix E.3, assuming the KL penalty does not blow up throughout training, we further reduce the sample complexity of labeled trajectories to $\tilde{O}(1/\epsilon^2)$ by directly including the KL penalty $\ln(\pi^t(a|s)/\pi_{\mathrm{SFT}}(a|s))$ in the reward.

Remark 29. In Theorem 26 and Corollary 27 we assume $\mathcal{R}$ and $\mathcal{F}$ are finite, but our results can be extended to infinite classes directly by replacing $|\mathcal{R}|$ and $|\mathcal{F}|$ with their covering numbers.

6.5 Experiments

We empirically evaluate DR-PO's ability to learn from dataset resets. First, we test how well DR-PO is able to both efficiently optimize the reward score and minimize the KL-divergence with the reference policy. We also test the generation quality of our resulting policies in terms of Rouge (Lin, 2004) and win rate (Rafailov et al., 2023b) against human references measured by GPT4 (Achiam et al., 2023).
Next, we conduct an ablation study, incrementally relaxing the proportion of dataset resets in our online data collection to study how sensitive DR-PO is to this hyperparameter. Finally, we investigate DR-PO's performance when transferring to another summarization task, CNN/DailyMail (See et al., 2017). We find that collecting online generations with dataset resets results in a policy with a better tradeoff between reward optimization and KL-divergence, leading to improved generations over the baseline algorithms, PPO (Schulman et al., 2017b) and Direct Preference Optimization (DPO) (Rafailov et al., 2023b).

Task. We evaluated DR-PO on the TL;DR summarization dataset used in Stiennon et al. (2020) (footnote 4) and tested scaling performance on the Anthropic Helpful and Harmless (HH) task (Bai et al., 2022b). For TL;DR, a model is trained to generate summaries of online Reddit posts guided by human preference data. The task consists of two datasets: one with human reference summaries and another with preference data. Following the standards set by both Stiennon et al. (2020) and Rafailov et al. (2023b), we train our reward models and DPO baseline on the preference dataset while performing online RL (for PPO and DR-PO) on the human reference dataset. We set the maximum context length to be 512 and the maximum generation length to be 53, ensuring that it is possible to generate all references in the dataset. For Anthropic HH, the model is asked to respond to a dialogue sequence in a helpful, harmless manner. We follow many of the design choices from TRLx (footnote 5) for dataset processing, context length, and generation length.

Footnote 4: Dataset can be obtained from https://github.com/openai/summarize-from-feedback
For more details about the dataset, please see Appendix E.6.

Evaluation. To test the performance of DR-PO against our baselines, we evaluate each method by its tradeoff between reward model score and KL-divergence with the reference policy, testing the effectiveness of the algorithm in optimizing the regularized RLHF objective. Furthermore, we compute the Rouge score and GPT4 win rate to evaluate the generation quality of our resulting policies. Note that for our win rate calculation, we report the win rate on a randomly sampled subset (10%) of the test set, for a total of 600 samples. Please see Appendix E.6.3 for the prompt used to query GPT4 as well as an example response. When evaluating on CNN/DailyMail, we make use of the constructed preference dataset from Stiennon et al. (2020), and for training a supervised finetuned model, we use HuggingFace's dataset version 2.0.0 (footnote 6).

Methods. We instantiate DR-PO by using PPO style policy optimization (Schulman et al., 2017b) as the policy optimizer (PO in Algorithm 9). First, for TL;DR, we maintain the same pretrained LLM and supervised finetuned model for all of our experiments. For supervised finetuning, we trained a Pythia 2.8B (footnote 7) (Biderman et al., 2023) parameter model for 1 epoch over the dataset with human references as labels. Similarly, for the reward model, we trained a Pythia 2.8B parameter model for 1 epoch over the preference labeled dataset.
Then, for DPO, PPO, and DR-PO, we trained our policy and critic with low rank adapters (LoRA) (Hu et al., 2022) on top of our supervised finetuned (SFT) model and our reward model (RM), respectively. Finally, for our scaling experiments on Anthropic HH, we trained Pythia 125M, 1B, and 6.9B parameter models for 1 epoch over the HH dataset for both SFT and RM training. Please see Appendix E.6 for details.

Footnote 5: https://github.com/CarperAI/trlx
Footnote 6: https://huggingface.co/datasets/cnn_dailymail
Footnote 7: HuggingFace Model Card: EleutherAI/pythia-2.8b-deduped

TL;DR Summarization
Algorithms   Win Rate (↑)   RM Score (↑)    KL(π||πref) (↓)   Rouge 1 (↑)     Rouge 2 (↑)     RougeL (↑)
SFT          31.6 ± 0.2%    -0.51 ± 0.04    -                 32.17 ± 1.01    12.27 ± 0.67    24.87 ± 1.22
DPO          52.6 ± 0.4%    -               37.33 ± 2.01      30.03 ± 3.23    7.93 ± 1.02     22.05 ± 0.83
PPO          62.3 ± 2.5%    1.17 ± 0.13     16.32 ± 1.46      33.73 ± 2.34    11.97 ± 0.91    24.97 ± 1.03
DR-PO        70.2 ± 1.7%    1.52 ± 0.09     16.84 ± 0.83      33.68 ± 1.78    11.90 ± 0.06    25.12 ± 0.76

Table 6.1: TL;DR Summarization Results: Our RM Score is under our trained preference reward model and the win rate is evaluated by GPT4. All evaluated policies except for SFT are models with LoRA adapters. We present results across 3 seeds.

[Figure 6.1 plot: TL;DR Summarization. x-axis: KL(π||πref) (←); y-axis: RM Score (→). Curves: DR-PO, PPO. Dashed lines: SFT, Reference.]

Figure 6.1: Reward vs KL-Divergence Frontier: Plotting the regularized optimization tradeoff between DR-PO and our baselines over the entire test set. DR-PO is able to achieve a much better tradeoff by learning higher reward generations with lower KL. The average reference and SFT scores under the RM are shown as dashed lines.

6.5.1 How well can DR-PO optimize the RLHF objective?

Table 6.1 compares DR-PO against PPO, DPO, and supervised finetuning.
The KL-regularized reward optimization broadly used in RLHF, and analyzed in Section 5.5, balances reward exploitation and deviation from a reference policy. When computing the KL-divergence, we use our SFT policy as the reference policy for all our methods. Notably, DR-PO achieves a higher RM score on the test set than all baselines, with a slightly larger KL discrepancy than PPO. We also see that DR-PO achieves the highest GPT4 win rate against human references, showcasing the benefit of learning from resets.

Figure 6.1 plots a more detailed frontier of the reward and KL tradeoff for DR-PO and PPO. We generate this plot by binning the test scores according to KL. We see that for most KL values, DR-PO is able to achieve a higher score than PPO.

6.5.2 Analysis of Dataset Reset Proportion

| Algorithms | Win Rate (↑) | RM Score (↑) | KL(π||π_ref) (↓) |
|---|---|---|---|
| PPO | 60.7% | 1.14 | 15.08 |
| DR-PO (β = 0.25) | 61.7% | 1.28 | 14.77 |
| DR-PO (β = 0.5) | 66.5% | 1.28 | 15.63 |
| DR-PO (β = 0.75) | 64.3% | 1.25 | 14.32 |
| DR-PO (β = 1.0) | 68.5% | 1.47 | 16.65 |

Table 6.2: DR-PO Ablation of Dataset Reset Proportion: Our RM Score is under our trained preference reward model and the Win Rate is evaluated by GPT4. β represents the proportion of online data generated from dataset resets, with 1.0 meaning all generations are from resets and 0.0 being PPO (i.e., always reset to initial prompts).

Next, we investigate how sensitive DR-PO is to the amount of dataset resets done during online generation. We define β as the proportion of generations in a given online batch that start from dataset resets. More specifically, our main results use β = 1.0, which means that every generation during online training of DR-PO starts from a randomly sampled reset from the human references. Note that a β value of 0 recovers the baseline PPO (i.e., all generations start from initial prompts).
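As a rough illustration of the mixing proportion β defined above, the following minimal sketch chooses where each online generation starts. It is not the implementation used in our experiments: the function names `sample_start_state` and `sample_batch` are hypothetical, token sequences are plain Python lists, and the reset point is assumed to be drawn uniformly over prefixes of the human reference.

```python
import random


def sample_start_state(prompt, reference, beta, rng):
    """Return the token prefix from which one online generation starts.

    With probability `beta` the generation resets to the prompt followed by a
    random prefix of the human reference; otherwise it starts from the prompt
    alone, which is the usual PPO start state.
    """
    if reference and rng.random() < beta:
        # Assumption: reset points are uniform over strict prefixes of the reference.
        cut = rng.randrange(len(reference))
        return list(prompt) + list(reference[:cut])
    return list(prompt)


def sample_batch(pairs, beta, seed=0):
    """Build the start states for one online batch of (prompt, reference) pairs."""
    rng = random.Random(seed)
    return [sample_start_state(p, r, beta, rng) for p, r in pairs]
```

With `beta=0.0` every start state is the bare prompt, matching the PPO baseline, while `beta=1.0` makes every generation start from a reset.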
Table 6.2 shows the expected RM score, KL, and win rate of DR-PO as we increase the mixing proportion from 0% (PPO) to 100% (DR-PO) after 2 epochs of training. Notably, even with a small amount of dataset resets, DR-PO is able to learn higher scoring generations with a lower KL than PPO. Moreover, we see that DR-PO with any amount of reference resets leads to higher win rates than PPO. Figure 6.2 plots the RM score/KL-divergence frontier of our learned policies on the test set. Note that DR-PO is robust to the amount of dataset resets in optimizing the regularized RLHF objective. Finally, supporting our analysis from Section 5.5, DR-PO generally performs better the more online data we gather from resets, with a 100% reset proportion performing the best.

[Figure 6.2 plot: x-axis KL(π||π_ref) (lower is better), y-axis RM Score (higher is better); color scale: Proportion of Dataset Resets from 0 to 1.]

Figure 6.2: Ablation of Dataset Reset: Plotting the RM score and KL-Divergence tradeoff as a function of dataset reset proportion. Blue represents no mixing while red represents every online generation starting from a reset.

6.5.3 DR-PO Transfer Performance

Finally, we investigate DR-PO's ability to do zero-shot transfer to another summarization task, ensuring that learning a policy by resetting from human references does not diminish the generalization observed with PPO in Stiennon et al. (2020). Specifically, we investigate whether leveraging human references on TL;DR has the unintended consequence of overfitting to the specific dataset rather than learning more generally to summarize.
For our baselines, we test the zero-shot capabilities of both PPO and DPO, and we also report the performance of a supervised finetuned policy on CNN/DailyMail using the same base model, Pythia 2.8B. Table 6.3 demonstrates DR-PO's zero-shot capabilities, being the only policy to outperform a supervised finetuned model on all metrics. Therefore, we see that learning from resets not only improves RLHF on the training task but also the zero-shot transfer performance to another summarization task.

CNN/DM Summarization
| Algorithms | Win Rate (↑) | Rouge 1 (↑) | Rouge 2 (↑) | RougeL (↑) |
|---|---|---|---|---|
| SFT (CNN/DM) | 10.5% | 25.60 | 12.27 | 19.99 |
| DPO | 6.0% | 20.71 | 9.47 | 15.70 |
| PPO | 8.5% | 23.62 | 12.29 | 18.56 |
| DR-PO | 12.0% | 29.53 | 15.36 | 22.88 |

Table 6.3: Zero-shot transfer to CNN/DM: the Win Rate is evaluated by GPT4.

6.5.4 DR-PO Scaling Performance on Anthropic HH

Figure 6.3 shows DR-PO's performance across different model scales on the Anthropic HH task. Specifically, we tested three model sizes: 125M, 1B, and 6.9B. We trained the Pythia models (Biderman et al., 2023) using TRLx (footnote 8). We see that DR-PO has similar scaling improvements as PPO while still producing generations that are more preferred than those from our baselines.

Footnote 8: https://github.com/CarperAI/trlx

[Figure 6.3 plot: Anthropic HH Scaling; x-axis Model size (10^8 to 10^10), y-axis GPT4 Win Rate (0.15 to 0.35); curves for SFT, DPO, PPO, and DR-PO.]

Figure 6.3: Scaling on Anthropic HH: The GPT4 win rate of DR-PO when tested across 3 model scales: 125M, 1B, and 6.9B. Reported win rates are mean and std across 3 seeds.

6.6 Conclusion

We present DR-PO, a provably efficient algorithm that exploits a generative model's ability to reset from offline data to enhance RL from preference-based feedback. Both in theory and in practice, we demonstrate the effectiveness of incorporating dataset resets into online RL.
While in our experiments we specifically demonstrate dataset resets on a PPO style policy optimizer, the idea of dataset resets is both general and simple to incorporate into the online data collection component of other RLHF algorithms. We leave it to exciting future work to test the full capabilities of dataset resets in other RLHF methods.

CHAPTER 7
RL FOR CONSISTENCY MODELS: FASTER REWARD GUIDED TEXT-TO-IMAGE GENERATION

Reinforcement learning (RL) has improved guided image generation with diffusion models by directly optimizing rewards that capture image quality, aesthetics, and instruction following capabilities. However, the resulting generative policies inherit the same iterative sampling process of diffusion models that causes slow generation. To overcome this limitation, consistency models proposed learning a new class of generative models that directly map noise to data, resulting in a model that can generate an image in as few as one sampling iteration. In this work, to optimize text-to-image generative models for task specific rewards and enable fast training and inference, we propose a framework for fine-tuning consistency models via RL. Our framework, called Reinforcement Learning for Consistency Model (RLCM), frames the iterative inference process of a consistency model as an RL procedure. RLCM improves upon RL fine-tuned diffusion models on text-to-image generation capabilities and trades computation during inference time for sample quality. Experimentally, we show that RLCM can adapt text-to-image consistency models to objectives that are challenging to express with prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Compared to RL finetuned diffusion models, RLCM trains significantly faster, improves the quality of the generation measured under the reward objectives, and speeds up the inference procedure by generating high quality images with as few as two inference steps.
Our code is available at https://rlcm.owenoertell.com.

[Figure 7.1 plot: train time (left) and test time (right) for RLCM and DDPO; axis: Inference Time (Seconds).]

Figure 7.1: Reinforcement Learning for Consistency Models (RLCM). We propose a new framework for finetuning consistency models using RL. On the task of optimizing aesthetic scores of a generated image, compared to a baseline which uses RL to fine-tune diffusion models (DDPO), RLCM trains (left) and generates images (right) significantly faster, with higher image quality measured under the aesthetic score. Images generated with a batch size of 8 and RLCM horizon set to 8.

7.1 Introduction

Diffusion models have gained widespread recognition for their high performance in various tasks, including drug design (Xu et al., 2022) and control (Janner et al., 2022). In the text-to-image generation community, diffusion models have gained significant popularity due to their ability to output realistic images via prompting. Despite their success, diffusion models in text-to-image tasks face two key challenges. First, generating the desired images can be difficult for downstream tasks whose goals are hard to specify via prompting. Second, the slow inference speed of diffusion models poses a barrier, making the iterative process of prompt tuning computationally intensive.

To enhance the generation alignment with specific prompts, diffusion model inference can be framed as a sequential decision-making process, permitting the application of reinforcement learning (RL) methods to image generation (Black et al., 2024; Fan et al., 2023). The objective of RL-based diffusion training is to fine-tune a diffusion model to directly maximize a reward function that corresponds to the desired property.

Diffusion models also suffer from slow inference since they must take many steps to produce competitive results. This leads to slow inference time and even slower training time.
Even further, as a result of the number of steps we must take, the resultant Markov decision process (MDP) possesses a long time horizon, which can be hard for RL algorithms to optimize. To resolve this, we look to consistency models. These models directly map noise to data and typically require only a few steps to produce good looking results. Although these models can be used for single step inference, there exists a few-step iterative inference process for generating high quality samples, which we focus on. Framing consistency model inference, instead of diffusion model inference, as an MDP (as shown in Figure 7.2) admits a much shorter horizon. This enables faster RL training and allows for generating high quality images with just few-step inference.

More formally, we propose Reinforcement Learning for Consistency Models (RLCM), a framework that models the inference procedure of a consistency model as a multi-step Markov Decision Process, allowing one to fine-tune consistency models toward a downstream task using just a reward function. Algorithmically, we instantiate RLCM using a policy gradient algorithm, as this allows for optimizing general reward functions that may not be differentiable. In experiments, we compare to the current more general method, DDPO (Black et al., 2024), which uses policy gradient methods to optimize a diffusion model. In particular, we show that on an array of tasks (compressibility, incompressibility, prompt image alignment, and LAION aesthetic score) proposed by DDPO, RLCM outperforms DDPO in most tasks in training time, inference time, and sample complexity (i.e., total reward of the learned policy versus number of reward model queries used in training) (Section 7.5).

Our contributions in this work are as follows:
• In our experiments, we find that RLCM has faster training and faster inference than existing methods.
• Further, in our experiments, RLCM achieves better performance on most tasks under the tested reward models than existing methods.

7.2 Related Works

Diffusion Models. Diffusion models are a popular family of image generative models which progressively map noise to data (Sohl-Dickstein et al., 2015). Such models generate high quality images (Ramesh et al., 2021; Saharia et al., 2022) and videos (Ho et al., 2022; Singer et al., 2022). Recent work with diffusion models has also shown promising directions in harnessing their power for other types of data such as robot trajectories and 3D shapes (Janner et al., 2022; Zhou et al., 2021). However, the iterative inference procedure of progressively removing noise yields slow generation time.

Consistency Models. Consistency models are another family of generative models which directly map noise to data via the consistency function (Song et al., 2023). Such a function provides faster inference, as one can predict the image from randomly generated noise in a single step. Consistency models also offer a more fine-grained trade-off between inference time and generation quality, as one can run the multistep inference process (Algorithm 17, in Appendix F.1), which is described in detail in Section 7.3.2. Prior works have also focused on training the consistency function in latent space (Luo et al., 2023), which allows for large, high quality text-to-image consistency model generations. Sometimes, such generations are not aligned with the downstream task for which they will be used. The remainder of this work will focus on aligning consistency models to fit downstream preferences, given a reward function.

RL for Diffusion Models. Popularized by Black et al. (2024); Fan et al. (2023), training diffusion models with reinforcement learning requires treating the diffusion model inference sequence as a Markov decision process.
Then, by treating the score function as the policy and updating it with a modified PPO algorithm (Schulman et al., 2017b), one can learn a policy (which in this case is a diffusion model) that optimizes for a given downstream reward. Further work on RL fine-tuning has looked into entropy regularized control to avoid reward hacking and maintain high quality images (Uehara et al., 2024). Another line of work uses deterministic policy gradient methods to directly optimize the reward function when the reward function is differentiable (Prabhudesai et al., 2023). Note that when the reward function is differentiable, we can instantiate a deterministic policy gradient method in RLCM as well. We focus on REINFORCE (Williams, 1992) style policy gradient methods for the purpose of optimizing black-box, non-differentiable reward functions.

7.3 Preliminaries

7.3.1 Reinforcement Learning

We model our sequential decision process as a finite horizon Markov Decision Process (MDP), $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \mu, H)$. In this tuple, we define our state space $\mathcal{S}$, action space $\mathcal{A}$, transition function $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$, reward function $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, initial state distribution $\mu$, and horizon $H$. At each timestep $t$, the agent observes a state $s_t \in \mathcal{S}$, takes an action according to the policy $a_t \sim \pi(a_t \mid s_t)$, and transitions to the next state $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$. After $H$ timesteps, the agent produces a trajectory as a sequence of states and actions $\tau = (s_0, a_0, s_1, a_1, \ldots, s_H, a_H)$. Our objective is to learn a policy $\pi$ that maximizes the expected cumulative reward over trajectories sampled from $\pi$:

\[
J_{RL}(\pi) = \mathbb{E}_{\tau \sim p(\cdot \mid \pi)} \left[ \sum_{t=0}^{H} R(s_t, a_t) \right]
\]

7.3.2 Diffusion and Consistency Models

Generative models are designed to match a model with the data distribution, such that we can synthesize new data points at will by sampling from the distribution.
Diffusion models belong to a novel type of generative model that characterizes the probability distribution using a score function rather than a density function. Specifically, they produce data by gradually modifying the data distribution and subsequently generating samples from noise through a sequence of denoising steps. More formally, we start with a distribution of data $p_{data}(x)$ and noise it according to the stochastic differential equation (SDE) (Song et al., 2020b):

\[
\mathrm{d}x_t = \mu(x_t, t)\,\mathrm{d}t + \sigma(t)\,\mathrm{d}w_t
\]

for a given $t \in [0, T]$, fixed constant $T > 0$, and with the drift coefficient $\mu(\cdot, \cdot)$, diffusion coefficient $\sigma(\cdot)$, and $\{w_t\}_{t \in [0, T]}$ being a Brownian motion. Letting $p_0(x) = p_{data}(x)$ and $p_t(x)$ be the marginal distribution at time $t$ induced by the above SDE, as shown in Song et al. (2020b), there exists an ODE (also called a probability flow) whose induced distribution at time $t$ is also $p_t(x)$. In particular:

\[
\mathrm{d}x_t = \left[ \mu(x_t, t) - \frac{1}{2} \sigma(t)^2 \nabla \log p_t(x_t) \right] \mathrm{d}t
\]

The term $\nabla \log p_t(x_t)$ is also known as the score function (Song and Ermon, 2019; Song et al., 2020b). When training a diffusion model in such a setting, one uses a technique called score matching (Dinh et al., 2016; Vincent, 2011), in which one trains a network to approximate the score function and then samples a trajectory with an ODE solver. Once we learn such a neural network that approximates the score function, we can generate images by integrating the above ODE backward in time from $T$ to $0$, with $x_T \sim p_T$, which is typically a tractable distribution (e.g., Gaussian in most diffusion model formulations).

This technique is clearly bottlenecked by the fact that during generation, one must run an ODE solver backward in time (from $T$ to $0$) for a large number of steps in order to obtain competitive samples (Song et al., 2023). To alleviate this issue, Song et al. (2023) proposed consistency models, which aim to directly map noisy samples to data. The goal becomes instead to learn a consistency function on a given probability flow.
The aim of this function is that any two samples $x_t$ and $x_{t'}$ along the same probability flow ODE trajectory, for any $t, t' \in [\epsilon, T]$, are mapped to the same image by the consistency function: $f_\theta(x_t, t) = f_\theta(x_{t'}, t') = x_\epsilon$, where $x_\epsilon$ is the solution of the ODE at time $\epsilon$. At a high level, this consistency function is trained by taking two adjacent timesteps and minimizing the consistency loss $d(f_\theta(x_t, t), f_\theta(x_{t'}, t'))$ under some image distance metric $d(\cdot, \cdot)$. To avoid the trivial solution of a constant, we also set the initial condition to $f_\theta(x_\epsilon, \epsilon) = x_\epsilon$.

Inference in consistency models. After a model is trained, one can then trade inference time for generation quality with the multi-step inference process given in Appendix F.1, Algorithm 17. At a high level, the multistep consistency sampling algorithm first partitions the probability flow into $H + 1$ points ($T = \tau_0 > \tau_1 > \tau_2 > \ldots > \tau_H = \epsilon$). Given a sample $x_T \sim p_T$, it then applies the consistency function $f_\theta$ at $(x_T, T)$, yielding $\hat{x}_0$. To further improve the quality of $\hat{x}_0$, one can add noise back to $\hat{x}_0$ using the equation $\hat{x}_{\tau_n} \leftarrow \hat{x}_0 + \sqrt{\tau_n^2 - \tau_H^2}\, z$, and then again apply the consistency function to $(\hat{x}_{\tau_n}, \tau_n)$, getting a new $\hat{x}_0$. One can repeat this process for a few more steps until the quality of the generation is satisfactory. For the remainder of this work, we will be referring to sampling with the multi-step procedure. We also provide more details when we introduce RLCM later.

7.3.3 Reinforcement Learning for Diffusion Models

Black et al. (2024) and Fan et al. (2023) formulated the training and fine-tuning of conditional diffusion probabilistic models (Sohl-Dickstein et al., 2015; Ho et al., 2020) as an MDP. Black et al. (2024) defined a class of algorithms, Denoising Diffusion Policy Optimization (DDPO), that optimizes for arbitrary reward functions to improve guided fine-tuning of diffusion models with RL.
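The multistep consistency sampling procedure described in Section 7.3.2 can be sketched as follows. This is a minimal sketch under stated assumptions and not Algorithm 17 itself: images are plain Python lists of floats, `toy_f` is a hypothetical stand-in for a trained consistency function (chosen only so that the boundary condition at the final time holds), and the initial sample is assumed to be Gaussian with a configurable standard deviation `init_std`.

```python
import math
import random

EPS = 0.002  # assumed value for the final time tau_H = epsilon


def toy_f(x, tau):
    """Hypothetical consistency function; satisfies toy_f(x, EPS) == x."""
    return [xi * (EPS / tau) for xi in x]


def multistep_consistency_sample(f, taus, dim, rng, init_std=1.0):
    """Alternate between denoising with f and re-noising to the next time.

    `taus` is the decreasing schedule T = taus[0] > ... > taus[-1] = epsilon.
    """
    eps = taus[-1]
    # Draw the initial noisy sample x_T (assumed Gaussian).
    x = [init_std * rng.gauss(0.0, 1.0) for _ in range(dim)]
    x0_hat = f(x, taus[0])
    for tau in taus[1:]:
        # Noise the current estimate back up to time tau, then denoise again.
        # At the final time tau = eps the noise scale is exactly zero.
        scale = math.sqrt(tau ** 2 - eps ** 2)
        x = [xi + scale * rng.gauss(0.0, 1.0) for xi in x0_hat]
        x0_hat = f(x, tau)
    return x0_hat
```

A longer schedule `taus` spends more applications of the consistency function per image, which is the inference-time versus quality trade-off referred to in the text.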
Diffusion Model Denoising as MDP. Conditional diffusion probabilistic models condition on a context $c$ (in the case of text-to-image generation, a prompt). As introduced for DDPO, we map the iterative denoising procedure to the following MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \mu, H)$. Let $r(s, c)$ be the task reward function. Also, note that the probability flow proceeds from $x_T \to x_0$. Let $T = \tau_0 > \tau_1 > \tau_2 > \ldots > \tau_H = \epsilon$ be a partition of the probability flow into intervals:

\[
\begin{aligned}
s_t &:= (c, \tau_t, x_{\tau_t}) & \pi(a_t \mid s_t) &:= p_\theta\left(x_{\tau_{t+1}} \mid x_{\tau_t}, c\right) & P(s_{t+1} \mid s_t, a_t) &:= (\delta_c, \delta_{\tau_{t+1}}, \delta_{x_{\tau_{t+1}}}) \\
a_t &:= x_{\tau_{t+1}} & \mu &:= \left(p(c), \delta_{\tau_0}, \mathcal{N}(0, I)\right) & R(s_t, a_t) &= \begin{cases} r(s_t, c) & \text{if } t = H \\ 0 & \text{otherwise} \end{cases}
\end{aligned}
\]

where $\delta_y$ is the Dirac delta distribution with non-zero density at $y$. In other words, we are mapping images to be states, and the prediction of the next state in the denoising flow to be actions. Further, we can think of the deterministic dynamics as letting the next state be the action selected by the policy. Finally, we can think of the reward for each state being 0 until the end of the trajectory, when we then evaluate the final image under the task reward function.

This formulation permits the following loss term:

\[
\mathcal{L}_{DDPO} = \mathbb{E}_{\mathcal{D}} \sum_{t=1}^{T} \left[ \min \left\{ r(x_0, c) \frac{p_\theta(x_{t-1} \mid x_t, c)}{p_{\theta_{old}}(x_{t-1} \mid x_t, c)},\; r(x_0, c)\, \mathrm{clip}\left( \frac{p_\theta(x_{t-1} \mid x_t, c)}{p_{\theta_{old}}(x_{t-1} \mid x_t, c)}, 1 - \epsilon, 1 + \epsilon \right) \right\} \right]
\]

where clipping is used to ensure that when we optimize $p_\theta$, the new policy stays close to $p_{\theta_{old}}$, a trick popularized by the well known algorithm Proximal Policy Optimization (PPO) (Schulman et al., 2017b).

In diffusion models, the horizon $H$ is usually set to 50 or greater and the time $T$ is set to 1000. A small step size is chosen for the ODE solver to minimize error, ensuring the generation of high-quality images as demonstrated by Ho et al. (2020). Due to the long horizon and sparse rewards, training diffusion models using reinforcement learning can be challenging.
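To make the clipped term in the loss above concrete, the following minimal sketch evaluates it for a single trajectory from per-step log-probabilities. The function name `clipped_surrogate` and the default clipping range of 0.2 are assumptions made for illustration; note that the clipping parameter here is distinct from the final time $\epsilon$ of the probability flow.

```python
import math


def clipped_surrogate(reward, logp_new, logp_old, clip_eps=0.2):
    """Sum over steps of min(r * ratio, r * clip(ratio, 1 - eps, 1 + eps)).

    `logp_new` and `logp_old` hold the log-probability of each taken
    denoising step under the current and the data-collecting policy.
    The same trajectory-level reward multiplies every step.
    """
    total = 0.0
    for ln, lo in zip(logp_new, logp_old):
        ratio = math.exp(ln - lo)  # importance weight p_theta / p_theta_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(reward * ratio, reward * clipped)
    return total
```

For a positive reward the term stops growing once the ratio exceeds 1 + eps, while for a negative reward the unclipped (more pessimistic) value is kept, which is what discourages large policy changes in either direction.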
7.4 Reinforcement Learning for Consistency Models

To remedy the long inference horizon that occurs during the MDP formulation of diffusion models, we instead frame consistency models as an MDP. We let $H$ also represent the horizon of this MDP. Just as we do for DDPO, we partition the entire probability flow ($[0, T]$) into segments, $T = \tau_0 > \tau_1 > \ldots > \tau_H = \epsilon$. In this section, we denote $t$ as the discrete time step in the MDP, i.e., $t \in \{0, 1, \ldots, H\}$, and $\tau_t$ is the corresponding time in the continuous time interval $[0, T]$. We now present the consistency model MDP formulation.

[Figure 7.2 diagram: multi-step inference of a consistency model as an MDP, moving between noise and data.]

Figure 7.2: Consistency Model As MDP: In this instance, $H = 3$. Here we first start at a randomly sampled noised state $s_0 \sim \left(\mathcal{N}(0, I), \delta_{\tau_0}, p(c)\right)$. We then follow the policy by first plugging the state into the consistency model and then noising the image back to $\tau_1$. This gives us $a_0$ which, based on the transition dynamics, becomes $s_1$. We then transition from $s_1$ by applying $\pi(\cdot)$, which applies the consistency function to obtain $\hat{x}_0$ and then noises up to $\tau_2$. We repeat this process until we reach timestep $H$. To calculate the end of trajectory reward, we apply the consistency function one more time to get a final approximation of $\hat{x}_0$ and apply the given reward function to this image.

Consistency Model Inference as MDP. We reformulate the multi-step inference process in a consistency model (Algorithm 17) as an MDP:

\[
\begin{aligned}
s_t &:= (x_{\tau_t}, \tau_t, c) & \pi(a_t \mid s_t) &:= f_\theta\left(x_{\tau_t}, \tau_t, c\right) + Z & P(s_{t+1} \mid s_t, a_t) &:= (\delta_{x_{\tau_{t+1}}}, \delta_{\tau_{t+1}}, \delta_c) \\
a_t &:= x_{\tau_{t+1}} & \mu &:= \left(\mathcal{N}(0, I), \delta_{\tau_0}, p(c)\right) & R_H(s_H) &= r\left(f_\theta(x_{\tau_H}, \tau_H, c), c\right)
\end{aligned}
\]

where $Z = \sqrt{\tau_t^2 - \tau_H^2}\, z$ is the noise from line 5 of Algorithm 17. Further, $r(\cdot, \cdot)$ is the reward function that we are using to align the model and $R_H$ is the reward at timestep $H$. At other timesteps, we let the reward be 0.
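The MDP above can be rolled out as in the following sketch, which also records the log-probability of each noised action so that a policy gradient method can be applied. It is an illustration under stated assumptions rather than our implementation: `toy_f` and `toy_reward` are hypothetical stand-ins for a trained consistency model and a reward model, images are plain lists of floats, and, following the re-noising equation of Section 7.3.2, the noise variance of the transition into time $\tau_{t+1}$ is taken to be $\tau_{t+1}^2 - \tau_H^2$, so the final transition into $\tau_H = \epsilon$ adds no noise.

```python
import math
import random

EPS = 0.002  # assumed final time tau_H = epsilon


def toy_f(x, tau, context):
    """Hypothetical consistency function (ignores the context)."""
    return [xi * (EPS / tau) for xi in x]


def toy_reward(image, context):
    """Hypothetical reward: prefers images close to zero."""
    return -sum(v * v for v in image)


def gaussian_logp(action, mean, var):
    """Log-density of an isotropic Gaussian N(mean, var * I) at `action`."""
    sq = sum((a - m) ** 2 for a, m in zip(action, mean))
    return -0.5 * sq / var - 0.5 * len(action) * math.log(2.0 * math.pi * var)


def rollout(f, reward_fn, taus, context, dim, rng):
    """Roll out one trajectory; the reward is sparse and given at the end."""
    eps = taus[-1]
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # x at tau_0 ~ N(0, I)
    logps = []
    for tau, tau_next in zip(taus[:-1], taus[1:]):
        mean = f(x, tau, context)           # policy mean: consistency function
        var = tau_next ** 2 - eps ** 2      # assumed noise level of the target time
        if var > 0.0:
            x = [m + math.sqrt(var) * rng.gauss(0.0, 1.0) for m in mean]
            logps.append(gaussian_logp(x, mean, var))
        else:
            x = mean                        # final transition: no noise is added
    final = f(x, taus[-1], context)         # one more application before scoring
    return final, reward_fn(final, context), logps
```

Because the policy is Gaussian around the consistency function output, each recorded log-probability is differentiable in the model parameters in an actual implementation, which is what the importance-weighted update relies on.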
We can visualize this conversion from the multistep inference in Figure 7.2.

Modeling the MDP such that the policy is $\pi(s) := f_\theta(x_{\tau_t}, \tau_t, c) + Z$, instead of defining $\pi(\cdot)$ to be the consistency function itself, has a major benefit: it gives us a stochastic policy instead of a deterministic one. This allows us to use a form of clipped importance sampling like Black et al. (2024) instead of a deterministic algorithm (e.g., DPG (Silver et al., 2014)), which we found to be unstable and which in general is not unbiased. Thus a policy is made up of two parts: the consistency function and noising with Gaussian noise. The consistency function takes the form of the red arrows in Figure 7.2, whereas the noise is the green arrows. In other words, our policy is a Gaussian policy whose mean is modeled by the consistency function $f_\theta$ and whose covariance is $(\tau_t^2 - \epsilon^2) I$ (here $I$ is an identity matrix). Notice that in accordance with the sampling procedure in Algorithm 17, we only noise part of the trajectory. Note that the final step of the trajectory is slightly different. In particular, to calculate the final reward, we just transition by applying the consistency function (red/yellow arrow) and obtain the final reward there.

Algorithm 11 Policy Gradient Version of RLCM
 1: Input: Consistency model policy $\pi_\theta = f_\theta(\cdot, \cdot) + Z$, finetune horizon $H$, prompt set $\mathcal{C}$, batch size $b$, inference pipeline $P$
 2: for $i = 1$ to $M$ do
 3:     Sample $b$ contexts from $\mathcal{C}$, $c \sim \mathcal{C}$.
 4:     $X_0 \leftarrow P(f_\theta, H, c)$ {where $X_0$ is the batch of images, $x_0$}
 5:     Normalize rewards $r(x_0, c)$ per context
 6:     Split $X_0$ into $k$ minibatches.
 7:     for each minibatch do
 8:         for $t = 0$ to $H$ do
 9:             Update $\theta$ using rule:
                $\nabla_\theta \mathbb{E}_{\mathcal{D}} \sum_{t=1}^{T} \left[ \min \left\{ r(x_0, c) \cdot \frac{\pi_{\theta_{i+1}}(a_t \mid s_t)}{\pi_{\theta_i}(a_t \mid s_t)},\; r(x_0, c) \cdot \mathrm{clip}\left( \frac{\pi_{\theta_{i+1}}(a_t \mid s_t)}{\pi_{\theta_i}(a_t \mid s_t)}, 1 - \epsilon, 1 + \epsilon \right) \right\} \right]$
10:         end for
11:     end for
12: end for
13: Output trained consistency model $f_\theta(\cdot, \cdot)$

Policy Gradient RLCM. We can then instantiate RLCM with a policy gradient optimizer, in the spirit of Black et al.
(2024); Fan et al. (2023). Our algorithm is described in Algorithm 11. In practice we normalize the reward per prompt. That is, we maintain a running mean and standard deviation for each prompt and use those as the normalizer instead of calculating them per batch. This is because under certain reward models, the average score by prompt can vary drastically.

7.5 Experiments

In this section, we investigate the performance and speed improvements of training consistency models rather than diffusion models with reinforcement learning. We compare our method to DDPO (Black et al., 2024), a state-of-the-art policy gradient method for finetuning diffusion models. First, we test how well RLCM is able to both efficiently optimize the reward score and maintain the qualitative integrity of the pretrained generative model. We show both learning curves and representative qualitative examples of the generated images on tasks defined by Black et al. (2024). Next, we show the speed and compute needs for both train and test time of each finetuned model, to test whether RLCM is able to maintain a consistency model's benefit of having a faster inference time. We then conduct an ablation study, incrementally decreasing the inference horizon to study RLCM's tradeoff between faster train/test time and reward score maximization. Finally, we qualitatively evaluate RLCM's ability to generalize to text prompts and subjects not seen during training, to showcase that the RL finetuning procedure did not destroy the base pretrained model's capabilities.

For fair comparison, DDPO and RLCM finetune Dreamshaper v7 (footnote 1) and its latent consistency model counterpart (footnote 2) (Luo et al., 2023), respectively. Dreamshaper v7 is a finetune of stable diffusion (Rombach et al., 2022). For DDPO, we used the same
For DDPO, we used the same 1https://huggingface.co/Lykon/dreamshaper-7 2https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7 135 https://huggingface.co/Lykon/dreamshaper-7 https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7 hyperparameters and source code3(Black et al., 2024) provided by the authors. We found that the default parameters performed best when testing various hyperparamters. Please see Appendix F.2.2 for more details on the parameters we tested. Compression The goal of compression is to minimize the filesize of the image. Thus, the reward received is equal to the negative of the filesize when compressed and saved as a JPEG image. The highest rated images for this task are images of solid colors. The prompt space consisted of 398 animal categories. Figure 7.3: Qualitative Generations: Representative generations from the pretrained models, DDPO, and RLCM. Across all tasks, we see that RLCM does not compromise the image quality of the base model while being able to transform naturalistic images to be stylized artwork that maximizes an aesthetic score, removes background content to maximize compression, and generate images of animals in fictional scenarios like riding a bike to maximize prompt-alignment. 3https://github.com/kvablack/ddpo-pytorch 136 https://github.com/kvablack/ddpo-pytorch 0K 20K Reward Queries −150 −75 N e g F il e s iz e ( k b ) Compression 0K 20K Reward Queries 300 600 F il e s iz e ( k b ) Incompression 0K 20K Reward Queries 6 7 8 L A IO N A e s t h e t ic Aesthetic 0K 30K 60K Reward Queries 0.76 0.77 0.78 L L a V A 1 3 B Prompt-Image Alignment RLCM DDPO Figure 7.4: Learning Curves: Training curves for RLCM and DDPO by number of reward queries on compressibility, incompressibility, aesthetic, and prompt image alignment. We plot three random seeds for each algorithm and plot the mean and standard deviation across those seeds. RLCM seems to produce either comparable or better reward optimization performance across these tasks. 
Incompression. Incompression has the opposite goal of compression: to make the filesize as large as possible. The reward function here is just the filesize of the saved image. The highest rated images for this task are random noise. Similar to the compression task, this task's prompt space consisted of 398 animal categories.

Aesthetic. The aesthetic task is based on the LAION Aesthetic predictor (Schuhmann, 2022), which was trained on 176,000 human labels of aesthetic quality of images. This aesthetic predictor is an MLP on top of CLIP embeddings (Radford et al., 2021). The images which produce the highest reward are typically artwork. This task has a smaller set of 45 animals as prompts.

Prompt Image Alignment. We use the same task as Black et al. (2024), in which the goal is to align the prompt and the image more closely without human intervention. This is done by first querying a LLaVA model (Liu et al., 2023a) to describe what is going on in the image, and then computing the BERT score (Zhang et al., 2019a) similarity between that response and the original prompt. This value is then used as the reward for the policy gradient algorithm.

[Figure 7.5 plots: four panels (Compression, Incompression, Aesthetic, Prompt-Image Alignment); x-axis GPU Hours (A6000); y-axes Neg Filesize (kb), Filesize (kb), LAION Aesthetic, and LLaVA 13B; curves for RLCM and DDPO.]

Figure 7.5: Training Time: Plots of performance by runtime measured in GPU hours. We report the runtime on four NVIDIA RTX A6000 GPUs across three random seeds and plot the mean and standard deviation. We observe that in all tasks RLCM noticeably reduces the training time while achieving comparable or better reward score performance.
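Since the average reward differs greatly across these tasks and across prompts, the per-prompt reward normalization described in Section 7.4 can be sketched as below. This is a minimal sketch and not our training code: the class name `PerPromptNormalizer` is hypothetical, and the running statistics are assumed to be maintained with Welford's online update.

```python
import math
from collections import defaultdict


class PerPromptNormalizer:
    """Keeps a running mean and standard deviation of rewards per prompt."""

    def __init__(self, min_std=1e-6):
        self.count = defaultdict(int)
        self.mean = defaultdict(float)
        self.m2 = defaultdict(float)  # running sum of squared deviations
        self.min_std = min_std        # guards against division by zero

    def update(self, prompt, reward):
        # Welford's online update of the mean and squared deviations.
        self.count[prompt] += 1
        delta = reward - self.mean[prompt]
        self.mean[prompt] += delta / self.count[prompt]
        self.m2[prompt] += delta * (reward - self.mean[prompt])

    def normalize(self, prompt, reward):
        n = self.count[prompt]
        std = math.sqrt(self.m2[prompt] / n) if n > 0 else 0.0
        return (reward - self.mean[prompt]) / max(std, self.min_std)
```

With this normalizer, a prompt whose rewards are around 200 and one whose rewards are around 2 contribute advantages on the same scale, rather than the first dominating the update.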
[Figure 7.6 plots: four panels (Compression, Incompression, Aesthetic, Prompt-Image Alignment); x-axis Inference Time (sec); y-axes Neg Filesize (kb), Filesize (kb), LAION Aesthetic, and LLaVA 13B; curves for RLCM and DDPO.]

Figure 7.6: Inference Time: Plots showing the inference performance as a function of time taken to generate. For each task, we evaluated the final checkpoint obtained after training and measured the average score across 100 trajectories at a given time budget on 1 NVIDIA RTX A6000 GPU. We report the mean and std across three seeds for every run. Note that for RLCM, we are able to achieve high scoring trajectories with a smaller inference time budget than DDPO.

7.5.1 RLCM vs. DDPO Performance Comparisons

Following the sample complexity evaluation proposed in Black et al. (2024), we first compare DDPO and RLCM by measuring how fast they can learn based on the number of reward model queries. As shown in Figure 7.4, RLCM has better performance on three out of four of our tested tasks. Note that for the prompt-to-image alignment task, the initial consistency model finetuned by RLCM has lower performance than the initial diffusion model trained by DDPO. RLCM is able to close the performance gap between the consistency and diffusion model through RL finetuning (footnote 4). Figure 7.3 demonstrates that, similar to DDPO, RLCM is able to train its respective generative model to adapt to various styles with just a reward signal, without any additional data curation or supervised finetuning.

7.5.2 Train and Test Time Analysis

To show the training-speed advantage of the proposed RLCM, we compare to DDPO in terms of training time in Figure 7.5. Here we experimentally find that RLCM has a significant advantage over DDPO in terms of the number of GPU hours required in order to achieve similar performance.
On all tested tasks RLCM reaches the same or greater performance than DDPO, notably achieving a 17x speedup in training time on the Aesthetic task. This is most likely due to a combination of factors: the shorter horizon in RLCM leads to faster online data generation (rollouts in the RL training procedure) and faster policy optimization (e.g., fewer backpropagations for training the networks).

Figure 7.6 compares the inference time between RLCM and DDPO. For this experiment, we measured the average reward score obtained by a trajectory given a fixed time budget for inference. Similar to training, RLCM is able to achieve a higher reward score in less time, demonstrating that RLCM retains the computational benefits of consistency models compared to diffusion models. Note that a full rollout with RLCM takes roughly a quarter of the time of a full rollout with DDPO.

Footnote 4: It is possible that this performance difference on the compression and incompression tasks is due to the consistency model's default image being larger. However, in the prompt image alignment and aesthetic tasks, we resized the images before reward calculation.

7.5.3 Ablation of Inference Horizon for RLCM

[Figure 7.7 plots: left panel "Aesthetic Performance" (x-axis: Reward Queries; y-axis: LAION Aesthetic; curves: H=8, H=4, H=2, DDPO); right panel "Aesthetic Inference Speed" (axes: # of Inference Steps, Inference Time).]

Figure 7.7: Inference time vs Generation Quality: We measure the performance of the policy gradient instantiation of RLCM on the aesthetic task at 3 different values for the number of inference steps (left), in addition to measuring the inference speed in seconds with varied horizons (right). We report the mean and standard deviation across three seeds.

We further explore the effect of finetuning a consistency model with different inference horizons. That is, we aim to test RLCM's sensitivity to H. As shown in Figure 7.7 (left), increasing the number of inference steps leads to a greater possible gain in the reward.
However, Figure 7.7 (right) shows that this reward gain comes at the cost of slower inference time. This highlights the inference time vs. generation quality tradeoff that becomes available by using RLCM. Nevertheless, regardless of the number of inference steps chosen, RLCM enjoys faster inference time than diffusion-model-based baselines.

7.5.4 Qualitative Effects on Generalization

We now test our trained models on new text prompts that do not appear in the training set. Specifically, we evaluated our trained models on the aesthetic task. As seen in Figure 7.8, which consists of images generated from prompts that are not in the training dataset, the RL finetuning does not impair the ability of the model to generalize. We see this through testing a series of prompts (“bike”, “fridge”, “waterfall”, and “tractor”) unseen during training.

Figure 7.8: Prompt Generalization: We observe that RLCM is able to generalize to other prompts without substantial decrease in aesthetic quality. The prompts used to test generalization are “bike”, “fridge”, “waterfall”, and “tractor”.

7.6 Conclusion and Future Directions

We present RLCM, a fast and efficient RL framework to directly optimize a variety of rewards to train consistency models. We empirically show that RLCM achieves better performance than a diffusion model RL baseline, DDPO, on most tasks while enjoying the fast training and inference time benefits of consistency models. Finally, we provide qualitative results of the finetuned models and test their downstream generalization capabilities.

There remain a few unexplored directions that we leave to future work. In particular, the specific policy gradient method presented uses a sparse reward. It may be possible to use a dense reward using the property that a consistency model always predicts to $x_0$.
Another future direction is the possibility of creating a loss that further reinforces the consistency property, further improving the inference time capabilities of RLCM policies.

CHAPTER 8
CONCLUSION

The thesis of this dissertation is that specific data sources demand the design of specialized algorithms that empower IL and RL agents to become effective decision-making agents. I defended this thesis on two fronts: 1) observations-only, offline, and off-policy data for IL (Part 1), and 2) RL algorithms for generative models (Part 2). Following the structure of this thesis, in this chapter I will offer potential future directions first for Imitation Learning and then for RL for generative models.

8.1 Imitation Learning

In this dissertation, I focused on designing effective imitation learning algorithms from three different data types: 1) imitation learning from observations alone, 2) imitation learning from offline data, and 3) imitation learning from off-policy data. In each setting there is still exciting work to be done.

For imitation learning from observations, MobILE presented a model-based approach that is different from many of the model-free approaches pursued in recent literature. Given the advent of incredibly powerful world models (e.g., some of the generative models from Part 2), it would be interesting to investigate what strategic exploration looks like with more modern dynamics models such as diffusion models or vision transformers. Given the immense scale of modern dynamics models, it seems prohibitive and impractical to learn one of these models online, let alone an ensemble to capture notions of uncertainty and calibration. In the world of large pretrained models, what bonus design would capture the exploration of the agent on our specific task?

Similar to the imitation learning from observations setting, our model-based, offline IL presents the same interesting questions.
Perhaps a more speculative question would be: given a sufficiently good world model, could we do most tasks offline? Beyond interesting questions with respect to the dynamics models, an exciting future direction would lie in the hybrid setting, where a hypothetical algorithm could use both offline and online data to learn even more effective imitation.

Another exciting future direction for all three settings presented in this dissertation is the multi-task setting. Similar to how we now have large pretrained dynamics models, we also have an incredibly diverse and large dataset of expert data across many tasks. Would it be possible to go beyond designing IL agents for specific tasks to more general imitators that can learn a family of tasks under some notion of task abstraction? Some potential task abstractions could be families of tasks that share the same task dynamics (e.g., the dynamics of a 7-DoF robot arm) or have the same state abstractions (e.g., 1080p video streams).

8.2 Reinforcement Learning for Generative Models

In this dissertation, I mostly focused on investigating RL and IL algorithms across different types of generative models, from Large Language Models to Consistency Models. In this section, I would like to shift our focus to the reward modeling of human intentions in the RLHF pipeline and challenge our existing modeling assumptions for human alignment.

The first assumption is that preferences are aligned with RL's notion of reward. In the context of RL, rewards were designed for systems with clear definitions of optimality or numerical representations of the behavioral goals of an agent. However, it is still an open question whether we can learn from preferences a reward model that captures this optimal solution. Currently, researchers have found cases where over-optimization of the reward model leads to worse alignment with human intentions, suggesting that there still exists a metric mismatch between preferences and LLM alignment.
Is IL a viable methodology here to learn a better reward? Is the solution a combination of IL and preference data?

The second assumption is that a single, learned preference reward captures the human intentions for a given task. That is, we can distill into a single numeric value what we may prefer for a task like summarization. Even for general-purpose instruction-following language models or general-purpose diffusion models, we do RLHF finetuning with a single preference reward model. This relies on a strong assumption that a single model can capture the complicated, potentially contradictory preferences of the user base. Instead, can we efficiently learn multiple reward models, and how would we optimize them downstream with RL/IL?

8.3 Concluding Remarks

In conclusion, I take understanding how to do efficient IL and RL from diverse sources of data to be of fundamental importance for learning decision-making agents in the real world. The analysis and investigations presented in this dissertation build on a long line of research developing general-purpose IL and RL algorithms. There is much work to be done, but I am excited for the road ahead.

Part III
Appendix

APPENDIX A
MISSING PROOFS AND DETAILS IN CHAPTER 2

A.1 Analysis of Algorithm 2

We start by presenting the proof for the unified main result in Theorem 3. We then discuss the bounds for special instances individually.

The following lemma shows that under Assumption 2, with $b_t(s,a) = H \min\{\sigma_t(s,a), 2\}$, we achieve optimism at all iterations.

Lemma 30 (Optimism). Assume Assumption 2 holds, and set $b_t(s,a) = H \min\{\sigma_t(s,a), 2\}$. For every state-wise cost function $f : \mathcal{S} \mapsto [0,1]$, denote the bonus-enhanced cost as $\tilde f_t(s,a) := f(s) - b_t(s,a)$. For every policy $\pi$, we have the following optimism:
\[
V^{\pi}_{\hat P_t, \tilde f_t} \le V^{\pi}_{P, f}, \quad \forall t.
\]

Proof. In the proof, we drop the subscript $t$ for notational simplicity. We consider a fixed function $f$ and policy $\pi$.
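Also let us denote $\hat V^{\pi}$ as the value function of $\pi$ under $(\hat P, \tilde f)$, and $V^{\pi}$ as the value function under $(P, f)$. Let us start from $h = H$, where we have $\hat V^{\pi}_H(s) = V^{\pi}_H(s) = 0$. Assume the inductive hypothesis holds at $h+1$, i.e., for any $s, a$, we have $\hat Q^{\pi}_{h+1}(s,a) \le Q^{\pi}_{h+1}(s,a)$. Now let us move to $h$. We have:
\[
\begin{aligned}
\hat Q^{\pi}_h(s,a) - Q^{\pi}_h(s,a)
&= \tilde f(s,a) + \mathbb{E}_{s' \sim \hat P(\cdot|s,a)} \hat V^{\pi}_{h+1}(s') - f(s) - \mathbb{E}_{s' \sim P(\cdot|s,a)} V^{\pi}_{h+1}(s') \\
&\le -H \min\{\sigma(s,a), 2\} + \mathbb{E}_{s' \sim \hat P(\cdot|s,a)} V^{\pi}_{h+1}(s') - \mathbb{E}_{s' \sim P(\cdot|s,a)} V^{\pi}_{h+1}(s') \\
&\le -H \min\{\sigma(s,a), 2\} + H \left\| \hat P(\cdot|s,a) - P(\cdot|s,a) \right\|_1 \\
&\le -H \min\{\sigma(s,a), 2\} + H \min\{\sigma(s,a), 2\} = 0,
\end{aligned}
\]
where the first inequality uses the inductive hypothesis at time step $h+1$. Finally, note that $V^{\pi}_h(s) = \mathbb{E}_{a \sim \pi(s)} Q^{\pi}_h(s,a)$, which leads to $\hat V^{\pi}_h(s) \le V^{\pi}_h(s)$. This concludes the induction step. □

The next lemma concerns the statistical error from finite sample estimation of $\mathbb{E}_{s \sim d^{\pi^e}} f(s)$.

Lemma 31. Fix $\delta \in (0,1)$. For all $t$, we have that with probability at least $1 - \delta$,
\[
\left| \mathbb{E}_{s \sim d^{\pi^e}} f(s) - \sum_{i=1}^{N} f(s^e_i)/N \right| \le 2 \sqrt{\frac{\ln\left(2 t^2 |\mathcal{F}| / \delta\right)}{N}}, \quad \forall f \in \mathcal{F}.
\]

Proof. For any $t$, we set the failure probability to be $6\delta/(t^2 \pi^2)$ at iteration $t$, where we abuse notation and point out that here $\pi = 3.14159\ldots$. Since $\sum_{t \ge 1} 1/t^2 = \pi^2/6$, the total failure probability over all $t \in \mathbb{N}$ is at most $\delta$. We then apply the classic Hoeffding inequality to bound $\mathbb{E}_{s \sim d^{\pi^e}} f(s) - \sum_{i=1}^{N} f(s^e_i)/N$, using the fact that $f(s) \in [0,1]$ for all $s$. We conclude the proof by taking a union bound over all $f \in \mathcal{F}$. □

Note that here we have assumed that the $s^e_i$ are sampled i.i.d. from $d^{\pi^e}$. This can easily be achieved by randomly sampling a state from each expert trajectory. Note that we can also easily deal with i.i.d. trajectories, i.e., if our expert data contains $N$ i.i.d. trajectories $\{\tau_1, \ldots, \tau_N\}$, we can apply concentration on the trajectory level, and get:
\[
\left| \mathbb{E}_{\tau \sim \pi^e}\left[ \sum_{h=0}^{H-1} f(s_h) \right] - \frac{1}{N} \sum_{i=1}^{N} \sum_{h=0}^{H-1} f(s^i_h) \right| \le O\left( H \sqrt{\frac{\ln(t^2 |\mathcal{F}| / \delta)}{N}} \right),
\]
where $\tau \sim \pi$ denotes a trajectory $\tau$ sampled by executing $\pi$, and $s^i_h$ denotes the state at time step $h$ on the $i$-th expert trajectory. Also note that we have $\mathbb{E}_{s \sim d^{\pi}} f(s) = \frac{1}{H} \mathbb{E}_{\tau \sim \pi}\left[ \sum_{h=0}^{H-1} f(s_h) \right]$ for any $\pi, f$. Together this immediately implies that:
\[
\left| \mathbb{E}_{s \sim d^{\pi^e}} f(s) - \frac{1}{NH} \sum_{i=1}^{N} \sum_{h=0}^{H-1} f(s^i_h) \right| \le O\left( \sqrt{\frac{\ln(t^2 |\mathcal{F}| / \delta)}{N}} \right),
\]
which matches the bound in Lemma 31.

Now we conclude the proof for Theorem 3.

Proof of Theorem 3. Assume that Assumption 2 and the event in Lemma 31 hold. Denote the joint of these two events as $\mathcal{E}$. Note that the probability of the complement $\bar{\mathcal{E}}$ is at most $2\delta$. For notational simplicity, denote $\epsilon_{\mathrm{stats}} = 2 \sqrt{\ln(2 T^2 |\mathcal{F}| / \delta) / N}$.

In each model-based planning phase, recall that we perform model-based optimization on the following objective:
\[
\pi_t = \operatorname*{argmin}_{\pi \in \Pi} \max_{f \in \mathcal{F}} \left[ \mathbb{E}_{s,a \sim d^{\pi}_{\hat P_t}} \left[ f(s) - b_t(s,a) \right] - \sum_{i=1}^{N} f(s^e_i)/N \right].
\]
Note that for any $\pi$, using the inequality in Lemma 31, we have:
\[
\begin{aligned}
&\max_{f \in \mathcal{F}} \left[ \mathbb{E}_{s,a \sim d^{\pi}_{\hat P_t}} \left( f(s) - b_t(s,a) \right) - \sum_{i=1}^{N} f(s^e_i)/N \right] \\
&= \max_{f \in \mathcal{F}} \left[ \mathbb{E}_{s,a \sim d^{\pi}_{\hat P_t}} \left( f(s) - b_t(s,a) \right) - \mathbb{E}_{s \sim d^{\pi^e}} f(s) + \mathbb{E}_{s \sim d^{\pi^e}} f(s) - \sum_{i=1}^{N} f(s^e_i)/N \right] \\
&\le \max_{f \in \mathcal{F}} \left[ \mathbb{E}_{s,a \sim d^{\pi}_{\hat P_t}} \left( f(s) - b_t(s,a) \right) - \mathbb{E}_{s \sim d^{\pi^e}} f(s) \right] + \max_{f \in \mathcal{F}} \left[ \mathbb{E}_{s \sim d^{\pi^e}} f(s) - \sum_{i=1}^{N} f(s^e_i)/N \right] \\
&\le \max_{f \in \mathcal{F}} \left[ \mathbb{E}_{s,a \sim d^{\pi}_{\hat P_t}} \left( f(s) - b_t(s,a) \right) - \mathbb{E}_{s,a \sim d^{\pi^e}_{\hat P_t}} \left( f(s) - b_t(s,a) \right) \right] + \epsilon_{\mathrm{stats}},
\end{aligned}
\]
where in the last inequality we use optimism from Lemma 30, i.e., $\mathbb{E}_{s,a \sim d^{\pi^e}_{\hat P_t}} \left( f(s) - b_t(s,a) \right) \le \mathbb{E}_{s \sim d^{\pi^e}} f(s)$.

Hence, for $\pi_t$, since it is the minimizer and $\pi^e \in \Pi$, we must have:
\[
\begin{aligned}
\max_{f \in \mathcal{F}} \left[ \mathbb{E}_{s,a \sim d^{\pi_t}_{\hat P_t}} \left( f(s) - b_t(s,a) \right) - \sum_{i=1}^{N} f(s^e_i)/N \right]
&\le \max_{f \in \mathcal{F}} \left[ \mathbb{E}_{s,a \sim d^{\pi^e}_{\hat P_t}} \left( f(s) - b_t(s,a) \right) - \sum_{i=1}^{N} f(s^e_i)/N \right] \\
&\le \max_{f \in \mathcal{F}} \left[ \mathbb{E}_{s,a \sim d^{\pi^e}_{\hat P_t}} \left( f(s) - b_t(s,a) \right) - \mathbb{E}_{s,a \sim d^{\pi^e}_{\hat P_t}} \left( f(s) - b_t(s,a) \right) \right] + \epsilon_{\mathrm{stats}} = \epsilon_{\mathrm{stats}}.
\end{aligned}
\]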
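Since $\mathcal{F}$ contains $c$, we must have:
\[
\mathbb{E}_{s,a \sim d^{\pi_t}_{\hat P_t}} \left[ c(s) - b_t(s,a) \right] \le \sum_{i=1}^{N} c(s^e_i)/N + \epsilon_{\mathrm{stats}} \le \mathbb{E}_{s \sim d^{\pi^e}} c(s) + 2\epsilon_{\mathrm{stats}},
\]
which means that $V^{\pi_t}_{\hat P_t; \tilde c_t} \le V^{\pi^e} + 2H\epsilon_{\mathrm{stats}}$.

Now we compute the regret in episode $t$. First recall that $b_t(s,a) = H \min\{\sigma_t(s,a), 2\}$, which means that $\|b_t\|_\infty \le 2H$; since $\|c\|_\infty \le 1$, it follows that $\|c - b_t\|_\infty \le 2H$. Thus, $\left\| V^{\pi}_{\hat P; c - b_t} \right\|_\infty \le 2H^2$. Recalling the simulation lemma (Lemma 40), we have:
\[
\begin{aligned}
V^{\pi_t} - V^{\pi^e}
&\le V^{\pi_t} - V^{\pi_t}_{\hat P_t; \tilde c_t} + 2H\epsilon_{\mathrm{stats}} \\
&\le H \, \mathbb{E}_{s,a \sim d^{\pi_t}} \left[ \left| \tilde c_t(s,a) - c(s) \right| + 2H^2 \left\| \hat P_t(\cdot|s,a) - P^\star(\cdot|s,a) \right\|_1 \right] + 2H\epsilon_{\mathrm{stats}} \\
&= H \, \mathbb{E}_{s,a \sim d^{\pi_t}} \left[ H \min\{\sigma_t(s,a), 2\} + 2H^2 \left\| \hat P_t(\cdot|s,a) - P^\star(\cdot|s,a) \right\|_1 \right] + 2H\epsilon_{\mathrm{stats}} \\
&\le H \, \mathbb{E}_{s,a \sim d^{\pi_t}} \left[ H \min\{\sigma_t(s,a), 2\} + 2H^2 \min\{\sigma_t(s,a), 2\} \right] + 2H\epsilon_{\mathrm{stats}} \\
&\le 3H^3 \, \mathbb{E}_{s,a \sim d^{\pi_t}} \min\{\sigma_t(s,a), 2\} + 2H\epsilon_{\mathrm{stats}} \\
&\le 6H^3 \, \mathbb{E}_{s,a \sim d^{\pi_t}} \min\{\sigma_t(s,a), 1\} + 2H\epsilon_{\mathrm{stats}}.
\end{aligned}
\]
Now sum over $t$, and denote $\mathbb{E}_{\pi_t}$ as the conditional expectation conditioned on the history from iteration $0$ to $t-1$. We get:
\[
\begin{aligned}
\sum_{t=0}^{T-1} \left[ V^{\pi_t} - V^{\pi^e} \right]
&\le 6H^2 \sum_{t=0}^{T-1} \mathbb{E}_{\pi_t} \left[ \sum_{h=0}^{H-1} \min\{\sigma_t(s^t_h, a^t_h), 1\} \right] + 2HT\epsilon_{\mathrm{stats}} \\
&\le 6H^2 \sum_{t=0}^{T-1} \sqrt{H} \sqrt{ \mathbb{E}_{\pi_t} \left[ \sum_{h=0}^{H-1} \min\{\sigma^2_t(s^t_h, a^t_h), 1\} \right] } + 2HT\epsilon_{\mathrm{stats}},
\end{aligned}
\]
where in the last inequality we use $\mathbb{E}[a^\top b] \le \sqrt{\mathbb{E}[\|a\|_2^2] \, \mathbb{E}[\|b\|_2^2]}$.

Recall that the $\pi_t$ are random quantities. Taking expectation on both sides of the above inequality, and considering both the case where $\mathcal{E}$ holds and the case where $\bar{\mathcal{E}}$ holds, we have:
\[
\begin{aligned}
\mathbb{E} \left[ \sum_{t=0}^{T-1} \left( V^{\pi_t} - V^{\pi^e} \right) \right]
&\le 6H^{2.5} \, \mathbb{E} \left[ \sum_{t=0}^{T-1} \sqrt{ \mathbb{E}_{\pi_t} \left[ \sum_{h=0}^{H-1} \min\left\{ \sigma^2_t(s^t_h, a^t_h), 1 \right\} \right] } \right] + 2HT\epsilon_{\mathrm{stats}} + \mathbb{P}(\bar{\mathcal{E}}) TH \\
&\le 6H^{2.5} \sqrt{T} \sqrt{ \mathbb{E} \left[ \sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \min\left\{ \sigma^2_t(s^t_h, a^t_h), 1 \right\} \right] } + 2HT\epsilon_{\mathrm{stats}} + 2\delta TH,
\end{aligned}
\]
where in the last inequality we again use $\mathbb{E}[a^\top b] \le \sqrt{\mathbb{E}[\|a\|_2^2] \, \mathbb{E}[\|b\|_2^2]}$. Dividing by $T$, this implies that:
\[
\mathbb{E} \left[ \min_t V^{\pi_t} - V^{\pi^e} \right] \le \frac{6H^{2.5}}{\sqrt{T}} \sqrt{ \max_{\mathrm{Alg}} \mathbb{E}_{\mathrm{Alg}} \left[ \sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \min\left\{ \sigma^2_t(s^t_h, a^t_h), 1 \right\} \right] } + 2H\epsilon_{\mathrm{stats}} + 2H\delta.
\]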
Setting $\delta = 1/(HT)$, we get:
\[
\mathbb{E} \left[ V^{\pi} - V^{\pi^e} \right] \le \frac{6H^{2.5}}{\sqrt{T}} \sqrt{ \max_{\mathrm{Alg}} \mathbb{E}_{\mathrm{Alg}} \left[ \sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \min\left\{ \sigma^2_t(s^t_h, a^t_h), 1 \right\} \right] } + 2H \sqrt{\frac{\ln(T^3 H |\mathcal{F}|)}{N}} + \frac{2}{T},
\]
where Alg is any adaptive mapping that maps the history from $t = 0$ to the end of iteration $t - 1$ to some policy $\pi_t$. This concludes the proof. □

Below we discuss special cases.

A.1.1 Discrete MDPs

Proposition 32 (Discrete MDP Bonus). Fix $\delta \in (0,1)$. With probability at least $1 - \delta$, for all $t \in \mathbb{N}$, we have:
\[
\left\| \hat P_t(\cdot|s,a) - P^\star(\cdot|s,a) \right\|_1 \le \min\left\{ \sqrt{\frac{S \ln(t^2 S A / \delta)}{N_t(s,a)}}, \ 2 \right\}.
\]

Proof. The proof simply uses the concentration result for $\hat P_t$ under the $\ell_1$ norm. For a fixed $t$ and $(s,a)$ pair, using Lemma 6.2 in Agarwal et al. (2019), we have that with probability at least $1 - \delta$,
\[
\left\| \hat P_t(\cdot|s,a) - P^\star(\cdot|s,a) \right\|_1 \le \sqrt{\frac{S \ln(1/\delta)}{N_t(s,a)}}.
\]
Applying a union bound over all iterations and all $(s,a)$ pairs, we conclude the proof. □

What is left is to bound the information gain $\mathcal{I}$ for the tabular case. For this, we can simply use Proposition 35, which we develop in the next section for KNRs. This is because in a KNR, when we set the feature mapping $\phi(s,a) \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ to be a one-hot vector with zeros everywhere except a one in the entry corresponding to the $(s,a)$ pair, the information gain of the KNR reduces to the information gain of the tabular model.

Proposition 33 (Information Gain in discrete MDPs). We have:
\[
\mathcal{I}_T = O\left( H S^2 A \cdot \ln(T S A / \delta) \ln(1 + TH) \right).
\]

Proof. Using Lemma B.6 in Kakade et al. (2020a), we have:
\[
\sum_{t=0}^{T-1} \min\left\{ \sum_{h=0}^{H-1} \frac{1}{N_t(s^t_h, a^t_h)}, \ 1 \right\} \le 2 S A \ln(1 + TH).
\]
Now using the definition of information gain, we have:
\[
\mathcal{I}_T = \sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \min\left\{ \sigma^2_t(s^t_h, a^t_h), 1 \right\} \le S \ln(T^2 S A / \delta) \, H \sum_{t=0}^{T-1} \min\left\{ \sum_{h=0}^{H-1} \frac{1}{N_t(s^t_h, a^t_h)}, \ 1 \right\} \le 2 H S^2 A \ln(T^2 S A / \delta) \ln(1 + TH).
\]
This concludes the proof. □

A.1.2 KNRs

Recall the KNR setting from Example 2. The following proposition shows that the bonus designed in Example 2 is valid.

Proposition 34 (KNR Bonus). Fix $\delta \in (0,1)$.
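With probability at least $1 - \delta$, for all $t \in \mathbb{N}$, we have:
\[
\left\| \hat P_t(\cdot|s,a) - P^\star(\cdot|s,a) \right\|_1 \le \min\left\{ \frac{\beta_t}{\sigma} \|\phi(s,a)\|_{\Sigma_t^{-1}}, \ 2 \right\}, \quad \forall s, a,
\]
where $\beta_t = \sqrt{ 2\lambda \|W^\star\|_2^2 + 8\sigma^2 \left( d_s \ln(5) + 2\ln(t^2/\delta) + \ln(4) + \ln\left( \det(\Sigma_t) / \det(\lambda I) \right) \right) }$.

Proof. The proof directly follows the confidence ball construction and proof from Kakade et al. (2020a). Specifically, from Lemma B.5 in Kakade et al. (2020a), we have that with probability at least $1 - \delta$, for all $t$:
\[
\left\| \left( \hat W_t - W^\star \right) (\Sigma_t)^{1/2} \right\|_2^2 \le \beta_t^2.
\]
Thus, with Lemma 41, we have:
\[
\left\| \hat P_t(\cdot|s,a) - P^\star(\cdot|s,a) \right\|_1 \le \frac{1}{\sigma} \left\| (\hat W_t - W^\star) \phi(s,a) \right\|_2 \le \left\| (\hat W_t - W^\star)(\Sigma_t)^{1/2} \right\| \, \|\phi(s,a)\|_{\Sigma_t^{-1}} / \sigma \le \frac{\beta_t}{\sigma} \|\phi(s,a)\|_{\Sigma_t^{-1}}.
\]
This concludes the proof. □

The following proposition bounds the information gain quantity.

Proposition 35 (Information Gain on KNRs). For simplicity, let us assume $\phi : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}^d$, i.e., $\phi(s,a)$ is a $d$-dimensional feature vector. In this case, we have:
\[
\mathcal{I}_T = O\left( H \left( d \ln(T^2/\delta) + d d_s + d^2 \ln\left( 1 + \|W^\star\|_2^2 TH / \sigma^2 \right) \right) \ln\left( 1 + \|W^\star\|_2^2 TH / \sigma^2 \right) \right).
\]

Proof. From the previous proposition, we know that $\sigma_t^2(s,a) = (\beta_t^2/\sigma^2) \|\phi(s,a)\|^2_{\Sigma_t^{-1}}$. Setting $\lambda = \sigma^2 / \|W^\star\|_2^2$, we have $\beta_t^2/\sigma^2 \ge 1$, which means that $\min\{\sigma_t^2(s,a), 1\} \le (\beta_t^2/\sigma^2) \min\left\{ \|\phi(s,a)\|^2_{\Sigma_t^{-1}}, 1 \right\}$. Note that $\beta_t$ is non-decreasing with respect to $t$, so $\beta_t \le \beta_T$ for $T \ge t$, where
\[
\beta_T = \sqrt{ 2\sigma^2 + 8\sigma^2 \left( d_s \ln(5) + 2\ln(T^2/\delta) + \ln(4) + d \ln\left( 1 + TH \|W^\star\|_2^2 / \sigma^2 \right) \right) }.
\]
Also we have
\[
\sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \min\left\{ \|\phi(s^t_h, a^t_h)\|^2_{\Sigma_t^{-1}}, 1 \right\} \le H \sum_{t=0}^{T-1} \min\left\{ \sum_{h=0}^{H-1} \|\phi(s^t_h, a^t_h)\|^2_{\Sigma_t^{-1}}, \ 1 \right\},
\]
since $\min\{a_1, b_1\} + \min\{a_2, b_2\} \le \min\{a_1 + a_2, b_1 + b_2\}$. Now invoking Lemma B.6 in Kakade et al. (2020a), we have:
\[
\sum_{t=0}^{T-1} \min\left\{ \sum_{h=0}^{H-1} \|\phi(s^t_h, a^t_h)\|^2_{\Sigma_t^{-1}}, \ 1 \right\} \le 2 \ln\left( \det(\Sigma_T) / \det(\lambda I) \right) = 2d \ln\left( 1 + TH \|W^\star\|_2^2 / \sigma^2 \right). \tag{A.1}
\]
Finally, recalling the definition of $\mathcal{I}_T$, we have:
\[
\begin{aligned}
\mathcal{I}_T &= \sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \min\left\{ \sigma_t^2(s^t_h, a^t_h), 1 \right\}
\le \frac{\beta_T^2}{\sigma^2} \sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \min\left\{ \|\phi(s^t_h, a^t_h)\|^2_{\Sigma_t^{-1}}, 1 \right\} \\
&\le \frac{\beta_T^2}{\sigma^2} \, 2Hd \ln\left( 1 + \|W^\star\|_2^2 TH / \sigma^2 \right) \\
&\le 2Hd \left( 2 + 8 \left( d_s \ln(5) + 2\ln(T^2/\delta) + \ln(4) + d \ln\left( 1 + \|W^\star\|_2^2 TH / \sigma^2 \right) \right) \right) \ln\left( 1 + \|W^\star\|_2^2 TH / \sigma^2 \right) \\
&= H \left( 4d + 32 d d_s + 32 d \ln(T^2/\delta) + 32 d + 2 d^2 \ln\left( 1 + \|W^\star\|_2^2 TH / \sigma^2 \right) \right) \ln\left( 1 + \|W^\star\|_2^2 TH / \sigma^2 \right),
\end{aligned}
\]
which concludes the proof. □

Extension to Infinite Dimensional RKHS

When $\phi : \mathcal{S} \times \mathcal{A} \mapsto \mathcal{H}$, where $\mathcal{H}$ is some infinite dimensional RKHS, we can bound our regret using the following intrinsic dimension:
\[
\tilde d = \max_{\{\{s^t_h, a^t_h\}_{h=0}^{H-1}\}_{t=0}^{T-1}} \ln \det\left( I + \frac{1}{\lambda} \sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \phi(s^t_h, a^t_h) \phi(s^t_h, a^t_h)^\top \right).
\]
In this case, recalling Proposition 34, we have:
\[
\begin{aligned}
\beta_t \le \beta_T &\le \sqrt{ 2\lambda \|W^\star\|_2^2 + 8\sigma^2 \left( d_s \ln(5) + 2\ln(t^2/\delta) + \ln(4) + \ln\left( \det(\Sigma_T) / \det(\lambda I) \right) \right) } \\
&\le \sqrt{ 2\lambda \|W^\star\|_2^2 + 8\sigma^2 \left( d_s \ln(5) + 2\ln(t^2/\delta) + \ln(4) + \tilde d \right) }.
\end{aligned}
\]
Also recalling Eq. (A.1), we have:
\[
\sum_{t=0}^{T-1} \min\left\{ \sum_{h=0}^{H-1} \|\phi(s^t_h, a^t_h)\|^2_{\Sigma_t^{-1}}, \ 1 \right\} \le 2 \ln\left( \det(\Sigma_T) / \det(\lambda I) \right) \le 2\tilde d.
\]
Combining the above two, and following a derivation similar to the one for the finite dimensional setting, we have:
\[
\mathcal{I}_T = \tilde O\left( H \tilde d^2 + H \tilde d d_s \right).
\]

A.1.3 General Function Class G with Bounded Eluder dimension

Proposition 36. Fix $\delta \in (0,1)$. Consider a general function class $\mathcal{G}$, where $\mathcal{G}$ is discrete and $\sup_{g \in \mathcal{G}, s, a} \|g(s,a)\|_2 \le G$. At iteration $t$, denote $\hat g_t \in \operatorname{argmin}_{g \in \mathcal{G}} \sum_{i=0}^{t-1} \sum_{h=0}^{H-1} \|g(s^i_h, a^i_h) - s^i_{h+1}\|_2^2$, and denote a version space $\mathcal{G}_t$ as:
\[
\mathcal{G}_t = \left\{ g \in \mathcal{G} : \sum_{i=0}^{t-1} \sum_{h=0}^{H-1} \left\| g(s^i_h, a^i_h) - \hat g_t(s^i_h, a^i_h) \right\|_2^2 \le c_t \right\}, \quad \text{with } c_t = 2\sigma^2 G^2 \ln(2 t^2 |\mathcal{G}| / \delta).
\]
Then with probability at least $1 - \delta$, we have that for all $t$ and all $s, a$:
\[
\left\| \hat P_t(\cdot|s,a) - P^\star(\cdot|s,a) \right\|_1 \le \min\left\{ \frac{1}{\sigma} \max_{g_1 \in \mathcal{G}_t, g_2 \in \mathcal{G}_t} \|g_1(s,a) - g_2(s,a)\|_2, \ 2 \right\}.
\]

Proof. Consider a fixed function $g \in \mathcal{G}$. Let us denote $z^t_h = \left\| g(s^t_h, a^t_h) - s^t_{h+1} \right\|_2^2 - \left\| g^\star(s^t_h, a^t_h) - s^t_{h+1} \right\|_2^2$.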
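We have:
\[
\begin{aligned}
z^t_h &= \left( g(s^t_h, a^t_h) - g^\star(s^t_h, a^t_h) \right)^\top \left( g(s^t_h, a^t_h) + g^\star(s^t_h, a^t_h) - 2 g^\star(s^t_h, a^t_h) - 2\epsilon^t_h \right) \\
&= \left\| g(s^t_h, a^t_h) - g^\star(s^t_h, a^t_h) \right\|_2^2 - 2 \left( g(s^t_h, a^t_h) - g^\star(s^t_h, a^t_h) \right)^\top \epsilon^t_h.
\end{aligned}
\]
Since $\epsilon^t_h \sim \mathcal{N}(0, \sigma^2 I)$, we must have:
\[
2 \left( g(s^t_h, a^t_h) - g^\star(s^t_h, a^t_h) \right)^\top \epsilon^t_h \sim \mathcal{N}\left( 0, \ 4\sigma^2 \left\| g(s^t_h, a^t_h) - g^\star(s^t_h, a^t_h) \right\|_2^2 \right).
\]
Since $\sup_{g,s,a} \|g(s,a)\|_2 \le G$, the quantity $2 (g(s^t_h, a^t_h) - g^\star(s^t_h, a^t_h))^\top \epsilon^t_h$ is a $2\sigma G$ sub-Gaussian random variable. Invoking Lemma 3 in Russo and Van Roy (2014), we have that with probability at least $1 - \delta$:
\[
\sum_t \sum_h \left\| g(s^t_h, a^t_h) - s^t_{h+1} \right\|_2^2 \ge \sum_t \sum_h \left\| g^\star(s^t_h, a^t_h) - s^t_{h+1} \right\|_2^2 + 2 \sum_t \sum_h \left\| g(s^t_h, a^t_h) - g^\star(s^t_h, a^t_h) \right\|_2^2 - 4\sigma^2 G^2 \ln(1/\delta).
\]
Note that the above can also be derived directly using the Azuma-Bernstein inequality and the property of the square loss. With a union bound over all $g \in \mathcal{G}$, we have that with probability at least $1 - \delta$, for all $g \in \mathcal{G}$:
\[
\sum_t \sum_h \left\| g(s^t_h, a^t_h) - s^t_{h+1} \right\|_2^2 \ge \sum_t \sum_h \left\| g^\star(s^t_h, a^t_h) - s^t_{h+1} \right\|_2^2 + 2 \sum_t \sum_h \left\| g(s^t_h, a^t_h) - g^\star(s^t_h, a^t_h) \right\|_2^2 - 4\sigma^2 G^2 \ln(|\mathcal{G}|/\delta).
\]
Setting $g = \hat g_t$, and using the fact that $\hat g_t$ is the minimizer of $\sum_t \sum_h \|g(s^t_h, a^t_h) - s^t_{h+1}\|_2^2$, we must have:
\[
\sum_t \sum_h \left\| \hat g_t(s^t_h, a^t_h) - g^\star(s^t_h, a^t_h) \right\|_2^2 \le 2\sigma^2 G^2 \ln(2 t^2 |\mathcal{G}| / \delta).
\]
Namely, we prove that our version space $\mathcal{G}_t$ contains $g^\star$ for all $t$. Thus, we have:
\[
\left\| \hat P_t(\cdot|s,a) - P^\star(\cdot|s,a) \right\|_1 \le \frac{1}{\sigma} \left\| \hat g_t(s,a) - g^\star(s,a) \right\|_2 \le \frac{1}{\sigma} \sup_{g_1 \in \mathcal{G}_t, g_2 \in \mathcal{G}_t} \|g_1(s,a) - g_2(s,a)\|_2,
\]
where the last inequality holds since both $g^\star$ and $\hat g_t$ belong to the version space $\mathcal{G}_t$. □

Now we bound the information gain $\mathcal{I}_T$ below. The proof mainly follows the proof in Osband and Van Roy (2014).

Lemma 37 (Lemma 1 in Osband and Van Roy (2014)). Denote $\beta_t = 2\sigma^2 G^2 \ln(t^2 |\mathcal{G}| / \delta)$. Let us denote the uncertainty measure $w_{t;h} = \sup_{f_1, f_2 \in \mathcal{G}_t} \|f_1(s^t_h, a^t_h) - f_2(s^t_h, a^t_h)\|_2$ (note that $w_{t;h}$ is non-negative). We have:
\[
\sum_{i=0}^{t-1} \sum_{h=0}^{H-1} \mathbf{1}\{w^2_{i;h} > \epsilon\} \le \left( \frac{4\beta_t}{\epsilon} + H \right) d_E(\sqrt{\epsilon}).
\]

Proposition 38 (Bounding $\mathcal{I}_T$). Denote $d = d_E(1/TH)$.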
We have
\[
\mathcal{I}_T \le \frac{1}{\sigma^2} + \frac{H d G^2}{\sigma^2} + 8 G^2 \ln(T^2 |\mathcal{G}| / \delta) \, d \ln(TH).
\]

Proof. Note that the uncertainty measures $w_{t;h}$ are non-negative. Let us reorder the sequence and denote the ordered one as $w_1 \ge w_2 \ge w_3 \ge \cdots \ge w_{TH-H}$. For notational simplicity, denote $M = TH - H$. We have:
\[
\sum_{t=0}^{T-1} \sum_{h=0}^{H-1} w^2_{t;h} = \sum_{i=0}^{M-1} w_i^2 \le 1 + \sum_i w_i^2 \, \mathbf{1}\left\{ w_i^2 \ge \frac{1}{M} \right\},
\]
where the last inequality comes from the fact that $\sum_i w_i^2 \, \mathbf{1}\{w_i^2 < 1/M\} \le M \cdot \frac{1}{M} = 1$. Consider any $w_t$ where $w_t^2 \ge 1/M$. In this case, we know that $w_1^2 \ge w_2^2 \ge \cdots \ge w_t^2 \ge 1/M$. This means that:
\[
t \le \sum_i \sum_h \mathbf{1}\{w^2_{i;h} > w_t^2\} \le \left( \frac{4\beta_T}{w_t^2} + H \right) d_E(\sqrt{w_t}) \le \left( \frac{4\beta_T}{w_t^2} + H \right) d_E(1/M),
\]
where the second inequality uses the lemma above, and the last inequality uses the fact that $d_E(\epsilon)$ is non-decreasing as $\epsilon$ gets smaller. Denote $d = d_E(1/M)$. The above inequality indicates that $w_t^2 \le \frac{4\beta_T d}{t - Hd}$. This means that for any $w_t^2 \ge 1/M$, we must have $w_t^2 \le 4\beta_T d / (t - Hd)$. Thus, we have:
\[
\sum_{t=0}^{T-1} \sum_{h=0}^{H-1} w^2_{t;h} \le 1 + H d G^2 + \sum_{\tau = Hd+1}^{M} w_\tau^2 \, \mathbf{1}\{w_\tau^2 \ge 1/M\} \le 1 + H d G^2 + 4\beta_T d \ln(M) = 1 + H d G^2 + 4\beta_T d \ln(TH).
\]
Finally, recalling the definition of $\mathcal{I}_T$, we have:
\[
\sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \min\{\sigma_t^2(s^t_h, a^t_h), 1\} \le \sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \sigma_t^2(s^t_h, a^t_h) \le \frac{1}{\sigma^2} \sum_{t=0}^{T-1} \sum_{h=0}^{H-1} w^2_{t;h} \le \frac{1}{\sigma^2} \left( 1 + H d G^2 + 4\beta_T d \ln(TH) \right).
\]
This concludes the proof. □

A.1.4 Proof of Theorem 7

This section provides the proof of Theorem 7.

First we present the reduction from a bandit optimization problem to ILFO.

Consider a multi-armed bandit (MAB) problem with $A$ actions $\{a_i\}_{i=1}^{A}$. Each action's ground truth reward $r_i$ is sampled from a Gaussian with mean $\mu_i$ and variance 1. Without loss of generality, assume $a_1$ is the optimal arm, i.e., $\mu_1 \ge \mu_i$ for all $i \ne 1$. We convert this MAB instance into an MDP. Specifically, set $H = 2$. Suppose we have a fixed initial state $s_0$ which has $A$ actions. For the one-step transition, we have $P(\cdot|s_0, a_i) = \mathcal{N}(\mu_i, 1)$, i.e., $g^*(s_0, a_i) = \mu_i$.
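The conversion from the bandit to a two-step MDP can be sketched in a few lines. This is an illustrative sketch only: the class name `BanditAsMDP`, the particular arm means, and the sample count are hypothetical choices, and Python's `random.gauss` stands in for the Gaussian transition $P(\cdot|s_0, a_i) = \mathcal{N}(\mu_i, 1)$.

```python
import random
from statistics import fmean


class BanditAsMDP:
    """Two-step MDP (H = 2): one fixed initial state s0 and one action per arm."""

    def __init__(self, arm_means, seed=0):
        self.arm_means = list(arm_means)   # mu_i for each arm a_i
        self.rng = random.Random(seed)

    def step(self, arm: int) -> float:
        """Sample the next state s1 ~ N(mu_arm, 1), so that g*(s0, a_arm) = mu_arm."""
        return self.rng.gauss(self.arm_means[arm], 1.0)


# Index 0 plays the role of a_1, the optimal arm (mu_1 >= mu_i for all i).
mdp = BanditAsMDP(arm_means=[0.5, 0.0, 0.0, 0.0])

# States reached by always pulling the optimal arm, recorded without the actions.
states_from_best_arm = [mdp.step(0) for _ in range(20000)]
estimated_mu_1 = fmean(states_from_best_arm)
print("empirical mean of observed states:", round(estimated_mu_1, 2))
```

Such state-only samples pin down the mean reward of the best arm, but they carry no information about which arm produced them.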
Here we denote the optimal expert policy $\pi^e$ as $\pi^e(s_0) = a_1$, i.e., the expert policy picks the optimal arm in the MAB instance. Hence, when executing $\pi^e$, the state $s_1$ generated from $\pi^e$ is simply the stochastic reward of $a_1$ in the original MAB instance. Assume that we have observed infinitely many such $s_1$ from the expert policy $\pi^e$, i.e., we have infinitely many samples of expert state data ($N \to \infty$). Note, however, that we do not have the actions taken by the expert (since this is the ILFO setting). This expert data is equivalent to revealing the optimal arm's mean reward $\mu_1$ to the MAB learner a priori. Hence solving the ILFO problem on this MDP is no easier than solving the original MAB instance with the additional information that the optimal arm's mean reward is $\mu_1$ (while the best arm's identity is unknown).

Below we show the lower bound for solving the MAB problem where the optimal arm's mean is known.

Theorem 39. Consider best arm identification of a Gaussian MAB with the additional information that the optimal arm's mean reward is $\mu$. For any algorithm, there exists a MAB instance with number of arms $A \ge 2$, such that the expected cumulative regret is still $\Omega(\sqrt{AT})$, i.e., the additional information does not help improve the worst-case regret bound for solving the MAB instance.

Proof of Theorem 39. Below, we construct $A$ MAB instances, where each instance has $A$ arms and each arm has a Gaussian reward distribution with fixed variance $\sigma^2$. Each of the $A$ instances has maximum mean reward equal to $\Delta$, i.e., all these $A$ instances have the same maximum arm mean reward. Consider any algorithm Alg that maps $\Delta$ together with the history of interactions $\mathcal{H}_t = \{a_0, r_0, a_1, r_1, \ldots, a_{t-1}, r_{t-1}\}$ to a distribution over the $A$ actions.
We will show that for any such algorithm Alg that knows $\Delta$, with constant probability, there must exist a MAB instance among the $A$ MAB instances such that Alg suffers at least $\Omega(\sqrt{AT})$ regret, where $T$ is the number of iterations.

Now we construct the $A$ instances as follows. Consider the $i$-th instance ($i = 1, \ldots, A$). For arm $j$ in the $i$-th instance, we define its mean as $\mu^i_j = \mathbf{1}\{i = j\}\Delta$. Namely, for MAB instance $i$, its arms have mean reward zero everywhere except that the $i$-th arm has mean reward $\Delta$. Note that all these MAB instances have the same maximum mean reward, i.e., $\Delta$. Hence, we cannot distinguish them by just revealing $\Delta$ to the learner.

We will construct an additional MAB instance (we name it the 0-th MAB instance) whose arms all have mean reward zero. Note that this MAB instance has maximum mean reward 0, which is different from the previous $A$ MAB instances that we constructed. However, we will only look at the regret of Alg on the previously constructed $A$ MAB instances, i.e., we do not care about the regret of $\mathrm{Alg}(\Delta, \mathcal{H}_t)$ on the 0-th MAB instance.

Let us denote $\mathbb{P}_i$ (for $i = 0, \ldots, A$) as the distribution of the outcomes of algorithm $\mathrm{Alg}(\Delta, \mathcal{H}_t)$ interacting with MAB instance $i$ for $T$ iterations, and $\mathbb{E}_j[N_i(T)]$ as the expected number of times arm $i$ is pulled by $\mathrm{Alg}(\Delta, \mathcal{H}_t)$ in MAB instance $j$. Consider MAB instance $i$ with $i \ge 1$:
\[
\mathbb{E}_i[N_i(T)] - \mathbb{E}_0[N_i(T)] \le T \|\mathbb{P}_i - \mathbb{P}_0\|_1 \le T \sqrt{\mathrm{KL}(\mathbb{P}_0, \mathbb{P}_i)} \le T \sqrt{\Delta^2 \, \mathbb{E}_0[N_i(T)]},
\]
where the last step uses the fact that we are running the same algorithm $\mathrm{Alg}(\Delta, \mathcal{H}_t)$ on both instance 0 and instance $i$ (i.e., the same policy for generating actions), and thus $\mathrm{KL}(\mathbb{P}_0, \mathbb{P}_i) = \sum_{j=1}^{A} \mathbb{E}_0[N_j(T)] \, \mathrm{KL}(q_0(j), q_i(j))$ (Lemma 15.1 in Lattimore and Szepesvári (2020)), where $q_i(j)$ is the reward distribution of arm $j$ in instance $i$. Also recall that for instance 0 and instance $i$, the rewards only differ at arm $i$.

This implies that:
\[
\mathbb{E}_i[N_i(T)] \le \mathbb{E}_0[N_i(T)] + T \sqrt{\Delta^2 \, \mathbb{E}_0[N_i(T)]}.
\]
Summing over $i = 1, \ldots, A$ on both sides, we have:
\[
\sum_{i=1}^{A} \mathbb{E}_i[N_i(T)] \le T + T \sum_{i=1}^{A} \sqrt{\Delta^2 \, \mathbb{E}_0[N_i(T)]} \le T + T \sqrt{A} \sqrt{ \sum_{i=1}^{A} \Delta^2 \, \mathbb{E}_0[N_i(T)] } \le T + T \sqrt{A} \sqrt{\Delta^2 T}.
\]
Now let us calculate the regret of $\mathrm{Alg}(\Delta, \mathcal{H}_t)$ on the $i$-th instance. We have:
\[
R_i = T\Delta - \mathbb{E}_i[N_i(T)] \Delta.
\]
Summing over $i = 1, \ldots, A$, we have:
\[
\sum_{i=1}^{A} R_i = \Delta \left( AT - \sum_{i=1}^{A} \mathbb{E}_i[N_i(T)] \right) \ge \Delta \left( AT - T - T \sqrt{A \Delta^2 T} \right).
\]
Setting $\Delta = c\sqrt{A/T}$ for some $c$ that we will specify later, we get:
\[
\sum_{i=1}^{A} R_i \ge c \sqrt{\frac{A}{T}} \left( AT - T - cAT \right).
\]
Setting $c = 1/4$, we get:
\[
\sum_{i=1}^{A} R_i \ge c \sqrt{\frac{A}{T}} \left( AT - T - cAT \right) \ge \frac{1}{4} \sqrt{AT} \left( A - 1 - A/4 \right) = \frac{1}{4} \sqrt{AT} \left( 3A/4 - 1 \right) \ge \frac{1}{4} \sqrt{AT} \left( A/4 \right),
\]
assuming $A \ge 2$. Thus there must exist $i \in \{1, \ldots, A\}$ such that:
\[
R_i \ge \frac{1}{16} \sqrt{AT}.
\]
Note that the above construction considered any algorithm $\mathrm{Alg}(\Delta, \mathcal{H}_t)$ that maps $\Delta$ and the history to action distributions. This concludes the proof. □

The hardness result in Theorem 39 and the reduction from MAB to ILFO together imply the lower bound for ILFO in Theorem 7: solving ILFO with cumulative regret smaller than $O(\sqrt{AT})$ would contradict the MAB lower bound in Theorem 39.

A.2 Auxiliary Lemmas

Lemma 40 (Simulation Lemma). Consider any two functions $f : \mathcal{S} \times \mathcal{A} \mapsto [0,1]$ and $\hat f : \mathcal{S} \times \mathcal{A} \mapsto [0,1]$, any two transitions $P$ and $\hat P$, and any policy $\pi : \mathcal{S} \mapsto \Delta(\mathcal{A})$. We have:
\[
\begin{aligned}
V^{\pi}_{P; f} - V^{\pi}_{\hat P, \hat f}
&= \sum_{h=0}^{H-1} \mathbb{E}_{s,a \sim d^{\pi}_P} \left[ f(s,a) - \hat f(s,a) + \mathbb{E}_{s' \sim P(\cdot|s,a)} V^{\pi}_{\hat P, \hat f; h}(s') - \mathbb{E}_{s' \sim \hat P(\cdot|s,a)} V^{\pi}_{\hat P, \hat f; h}(s') \right] \\
&\le \sum_{h=0}^{H-1} \mathbb{E}_{s,a \sim d^{\pi}_P} \left[ f(s,a) - \hat f(s,a) + \left\| V^{\pi}_{\hat P, \hat f; h} \right\|_\infty \left\| P(\cdot|s,a) - \hat P(\cdot|s,a) \right\|_1 \right],
\end{aligned}
\]
where $V^{\pi}_{P, f; h}$ denotes the value function at time step $h$ under $\pi, P, f$.

Such a simulation lemma is standard in the model-based RL literature and can be found, for instance, in the proof of Lemma 10 from Sun et al. (2019a).

Lemma 41. Consider two Gaussian distributions $P_1 := \mathcal{N}(\mu_1, \sigma^2 I)$ and $P_2 := \mathcal{N}(\mu_2, \sigma^2 I)$. We have:
\[
\|P_1 - P_2\|_1 \le \frac{1}{\sigma} \|\mu_1 - \mu_2\|_2.
\]

The above lemma can be proved by Pinsker's inequality and the closed form of the KL divergence between $P_1$ and $P_2$.
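Lemma 41 can be sanity-checked numerically in one dimension, where the $\ell_1$ distance between two equal-variance Gaussians has the closed form $2\left(2\Phi(|\mu_1 - \mu_2|/(2\sigma)) - 1\right) = 2\,\mathrm{erf}\left(|\mu_1 - \mu_2|/(2\sqrt{2}\sigma)\right)$. The sketch below is an illustration rather than part of the proof; the function names and the test values are hypothetical.

```python
import math


def l1_distance_gaussians(mu1: float, mu2: float, sigma: float) -> float:
    """Exact ||P1 - P2||_1 for P1 = N(mu1, sigma^2) and P2 = N(mu2, sigma^2) in 1-D."""
    d = abs(mu1 - mu2) / sigma
    # Total variation is 2*Phi(d/2) - 1 = erf(d / (2*sqrt(2))); the L1 norm is twice that.
    return 2.0 * math.erf(d / (2.0 * math.sqrt(2.0)))


def lemma41_bound(mu1: float, mu2: float, sigma: float) -> float:
    """Right-hand side of Lemma 41: |mu1 - mu2| / sigma."""
    return abs(mu1 - mu2) / sigma


cases = [(0.0, 0.1, 1.0), (0.0, 1.0, 1.0), (-2.0, 3.0, 0.5), (1.0, 1.0, 2.0)]
for mu1, mu2, sigma in cases:
    lhs = l1_distance_gaussians(mu1, mu2, sigma)
    rhs = lemma41_bound(mu1, mu2, sigma)
    print(f"L1 = {lhs:.4f}  bound = {rhs:.4f}  holds = {lhs <= rhs}")
```

Because the $\ell_1$ distance never exceeds 2, the bound is only informative when $\|\mu_1 - \mu_2\|_2 < 2\sigma$, which is consistent with the truncation at 2 in the bonuses of Propositions 32, 34, and 36.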
A.3 Implementation Details

A.3.1 Environment Setup and Benchmarks

This section sketches the details of how we set up the environments. We utilize the standard environment horizons of 500, 50, and 200 for Cartpole-v1, Reacher-v2, and Cartpole-v0, respectively. For Swimmer-v2, Hopper-v2, and Walker2d-v2, we work with the environment horizon set to 400 (Kurutach et al., 2018; Nagabandi et al., 2018; Luo et al., 2018; Rajeswaran et al., 2020; Kidambi et al., 2020a). Furthermore, for Hopper-v2 and Walker2d-v2, we add the velocity of the center of mass to the state parameterization (Rajeswaran et al., 2020; Luo et al., 2018; Kidambi et al., 2020a). As noted in the main text, the expert policy is trained using NPG/TRPO (Kakade, 2001b; Schulman et al., 2015b) until it hits a value of (approximately) 460, −10, 38, 3000, 2000, and 170 for Cartpole-v1, Reacher-v2, Swimmer-v2, Hopper-v2, Walker2d-v2, and Cartpole-v0, respectively. For Walker2d-v2, we utilized pairs of states $(s, s')$ to define the feature representation used for parameterizing the discriminator. All the results presented in the experiments section are averaged over five seeds.

In terms of baselines, we compare MobILE to BC, BC-O, ILPO, GAIL, and GAIFO. Note that BC/GAIL has access to expert actions, whereas our algorithm does not. We report the average of the best performance offered by BC/BC-O when run with five seeds, even if this occurs at different epochs for each of the runs; this gives an upper hand to BC/BC-O. Moreover, for BC, we run the supervised learning algorithm for 500 passes. We run BC-O/GAIL with the same number of online samples as MobILE in order to present our results. We used 2 CPUs with 16-32 GB of RAM to perform all our benchmarking runs, which are implemented in PyTorch.
Finally, our codebase uses OpenAI's implementation of TRPO (Dhariwal et al., 2017) for environments with discrete actions, and the MJRL repository (Rajeswaran et al., 2017b) for continuous action environments. Regarding the results in the main paper, our bar graph of normalized results was obtained by dividing every algorithm's performance (mean/standard deviation) by the expert mean. For Reacher-v2, because the rewards themselves are negative, we first added a constant offset to make every algorithm's performance positive and then divided by the mean of the expert policy.

Algorithm 12 MobILE: Model-based Imitation Learning and Exploring for ILFO (used in practical implementation)
1: Require: Expert dataset $\mathcal{D}_e$, access to the dynamics of the true environment, i.e., $P^\star$.
2: Initialize policy $\pi_0$, discriminator $w_0$, replay buffer $\mathcal{D}$ of pre-determined size, dynamics model $\hat{P}_{-1}$, bonus $b_{-1}$.
3: for $t = 0, \cdots, T-1$ do
4:   Online interaction: Execute $\pi_t$ in the true environment $P^\star$ to get samples $\mathcal{S}_t$.
5:   Update replay buffer: $\mathcal{D}$ = Replay-Buffer-Update($\mathcal{D}, \mathcal{S}_t$) (refer to Section A.3.2).
6:   Update dynamics model: Obtain $\hat{P}_t$ by starting at $\hat{P}_{t-1}$ and updating using the replay buffer $\mathcal{D}$ (refer to Section A.3.2).
7:   Bonus update: Update the bonus $b_t : \mathcal{S}\times\mathcal{A} \to \mathbb{R}^+$ using the replay buffer $\mathcal{D}$ (refer to Section A.3.2).
8:   Discriminator update: Update the discriminator as $w_t \leftarrow \arg\max_w L(w; \pi_t, \hat{P}_t, b_t, \mathcal{D}_e)$ (refer to Section A.3.2).
9:   Policy update: Perform an incremental policy update through approximate minimization of $L(\cdot)$, i.e., $\pi_t \leftarrow \arg\min_\pi L(\pi; w_t, \hat{P}_t, b_t, \mathcal{D}_e)$, by running $K_{PG}$ steps of TRPO (refer to Section A.3.2).
10: end for
11: Return $\pi_T$.

A.3.2 Practical Implementation of MobILE

We begin by presenting the implementation details of MobILE (refer to Algorithm 12):

Dynamics Model Training

As detailed in the main paper, we use a class of Gaussian dynamics models parameterized by an MLP (Rajeswaran et al., 2020), i.e.
$\hat{P}(s, a) := \mathcal{N}(h_\theta(s, a), \sigma^2 I)$, where $h_\theta(s, a) = s + \sigma_{\Delta s}\cdot \mathrm{MLP}_\theta(s_c, a_c)$, $\theta$ are the MLP's trainable parameters, $s_c = (s - \mu_s)/\sigma_s$, $a_c = (a - \mu_a)/\sigma_a$, with $\mu_s, \mu_a$ (and $\sigma_s, \sigma_a$) being the means (and standard deviations) of the states and actions in the replay buffer $\mathcal{D}$. Note that we predict normalized state differences instead of the next state directly.

In practice, we fine-tune our estimate of the dynamics model based on the new contents of the replay buffer, as opposed to re-training the models from scratch, which is computationally more expensive. In particular, we start from the estimate $\hat{P}_{t-1}$ of epoch $t-1$ and perform multiple gradient updates using the contents of the replay buffer $\mathcal{D}$. We use constant-stepsize SGD with momentum (Sutskever et al., 2013) to update our dynamics models. Since the distribution of $(s, a, s')$ tuples continually drifts as the algorithm progresses (for instance, because we observe a new state), we use gradient clipping to ensure our model does not diverge due to the aggressive nature of our updates.

Replay Buffer

Since we perform incremental training of our dynamics model, we use a replay buffer of fixed size rather than training our dynamics model on all previously collected online $(s, a, s')$ samples. Note that the replay buffer could contain data from all prior online interactions if we were to re-train our dynamics model from scratch at every epoch.

Design of Bonus Function

We use an ensemble of two transition dynamics models, incrementally learned using the contents of the replay buffer. Specifically, given the models $h_{\theta_1}(\cdot)$ and $h_{\theta_2}(\cdot)$, we compute the discrepancy as $\delta(s, a) = \|h_{\theta_1}(s, a) - h_{\theta_2}(s, a)\|_2$. Moreover, given a replay buffer $\mathcal{D}$, we compute the maximum discrepancy as $\delta_{\mathcal{D}} = \max_{(s,a,s')\sim\mathcal{D}} \delta(s, a)$. We then set the bonus as $b(s, a) = \min(1, \delta(s, a)/\delta_{\mathcal{D}})\cdot\lambda$, which keeps the magnitude of our bonus roughly within $[0, \lambda]$.
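The bonus construction above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the thesis codebase: the two ensemble members are represented only by their next-state predictions (arrays), the function names are hypothetical, and the small floor on the maximum discrepancy is an added numerical guard that the text does not mention.

```python
import numpy as np


def discrepancy(pred_1, pred_2):
    """delta(s, a) = ||h_theta1(s, a) - h_theta2(s, a)||_2, computed row-wise."""
    return np.linalg.norm(pred_1 - pred_2, axis=-1)


def ensemble_bonus(query_pred_1, query_pred_2, buffer_pred_1, buffer_pred_2, lam):
    """b(s, a) = lam * min(1, delta(s, a) / delta_D).

    query_pred_*:  predictions of the two models at the (s, a) pairs to score.
    buffer_pred_*: predictions of the two models at the replay-buffer (s, a) pairs,
                   used only to compute the normalizer delta_D.
    """
    delta = discrepancy(query_pred_1, query_pred_2)
    delta_max = discrepancy(buffer_pred_1, buffer_pred_2).max()
    delta_max = max(delta_max, 1e-12)  # assumption: guard against division by zero
    return lam * np.minimum(1.0, delta / delta_max)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    buf_1, buf_2 = rng.normal(size=(100, 3)), rng.normal(size=(100, 3))
    qry_1, qry_2 = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
    print(ensemble_bonus(qry_1, qry_2, buf_1, buf_2, lam=0.5))
```

Normalizing by the largest discrepancy seen on the buffer makes the bonus scale-free, so $\lambda$ alone controls its magnitude relative to the IPM cost.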
Discriminator Update

Recall that $f_w(s) = w^\top\psi(s)$, where $w$ are the parameters of the discriminator. Given a policy $\pi$, the update for the parameters $w$ takes the following form:

$$\begin{aligned}
&\max_{w:\,\|w\|_2^2\le\zeta} L(w; \pi, \hat{P}, b, \mathcal{D}_e) := \mathbb{E}_{(s,a)\sim d^{\pi}_{\hat{P}}}\left[f_w(s) - b(s,a)\right] - \mathbb{E}_{s\sim\mathcal{D}_e}\left[f_w(s)\right] \\
\equiv\ &\max_{w} L_\zeta(w; \pi, \hat{P}, b, \mathcal{D}_e) = \mathbb{E}_{(s,a)\sim d^{\pi}_{\hat{P}}}\left[f_w(s) - b(s,a)\right] - \mathbb{E}_{s\sim\mathcal{D}_e}\left[f_w(s)\right] - \frac{1}{2}\cdot\left(\|w\|_2^2 - \zeta\right), \\
\implies\ &\partial_w L_\zeta(w; \pi, \hat{P}, b, \mathcal{D}_e) = \mathbb{E}_{s\sim d^{\pi}_{\hat{P}}}\left[\psi(s)\right] - \mathbb{E}_{s\sim\mathcal{D}_e}\left[\psi(s)\right] - w \ni 0,
\end{aligned}$$

where $\partial_w L_\zeta(w; \pi, \hat{P}, b, \mathcal{D}_e)$ denotes the sub-differential of $L_\zeta(\cdot)$ with respect to $w$. This in particular implies the following:

1. Exact Update: $w^* = \mathcal{P}_{B(\zeta)}\left(\mathbb{E}_{s\sim d^{\pi}_{\hat{P}}}[\psi(s)] - \mathbb{E}_{s\sim\mathcal{D}_e}[\psi(s)]\right)$, where $\mathcal{P}_{\cdot}$ is the projection operator and $B(\zeta)$ is the $\zeta$-norm ball.
2. Gradient Ascent Update: $w_{t+1} = \mathcal{P}_{B(\zeta)}\left((1 - \eta_w) w_t + \eta_w\cdot\left(\mathbb{E}_{s\sim d^{\pi}_{\hat{P}}}[\psi(s)] - \mathbb{E}_{s\sim\mathcal{D}_e}[\psi(s)]\right)\right)$, where $\eta_w > 0$ is the step size.

Empirically, we found either update to work reasonably well. In the Swimmer-v2 task, we use the gradient ascent update with $\eta_w = 0.67$, and in the other tasks we use the exact update. Furthermore, we empirically observe the gradient ascent update to be more stable than the exact update. In the case of Walker2d-v2, we found it useful to parameterize the discriminator based on pairs of states $(s, s')$.

Model-Based Policy Update

Once the maximization over the discriminator parameters $w$ is performed, consider the policy optimization problem, i.e.,

$$\min_\pi L(\pi; w, \hat{P}, b, \mathcal{D}_e) := \mathbb{E}_{(s,a)\sim d^{\pi}_{\hat{P}}}\left[f_w(s) - b(s,a)\right] - \mathbb{E}_{s\sim\mathcal{D}_e}\left[f_w(s)\right] \ \equiv\ \min_\pi \mathbb{E}_{(s,a)\sim d^{\pi}_{\hat{P}}}\left[f_w(s) - b(s,a)\right],$$

where the equivalence holds because the expert term does not depend on $\pi$. Hence we perform model-based policy optimization under $\hat{P}$ with cost function $f_w(s) - b(s, a)$. In practice, we perform approximate minimization of $L(\cdot)$ by incrementally updating the policy using $K_{PG}$ steps of policy gradient, where $K_{PG}$ is a tunable hyper-parameter. In our experiments, we find that setting $K_{PG}$ to around 10 is generally a reasonable choice (for precise values, refer to Table A.1).
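The two discriminator updates described earlier in this subsection can be sketched as follows. This is an illustrative sketch rather than the thesis implementation: feature matrices stand in for samples of $\psi(s)$ under the model rollouts and the expert data, the function names are hypothetical, and the projection radius $\sqrt{\zeta}$ is an assumption read off from the constraint $\|w\|_2^2 \le \zeta$.

```python
import numpy as np


def project_l2_ball(w, radius):
    """Euclidean projection of w onto {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)


def feature_gap(policy_features, expert_features):
    """Sample estimate of E_{s ~ d^pi}[psi(s)] - E_{s ~ D_e}[psi(s)]."""
    return policy_features.mean(axis=0) - expert_features.mean(axis=0)


def exact_update(policy_features, expert_features, zeta):
    """w* = Proj_{B(zeta)}(feature gap)."""
    gap = feature_gap(policy_features, expert_features)
    return project_l2_ball(gap, np.sqrt(zeta))


def gradient_ascent_update(w, policy_features, expert_features, zeta, eta_w):
    """w_{t+1} = Proj_{B(zeta)}((1 - eta_w) * w_t + eta_w * feature gap)."""
    gap = feature_gap(policy_features, expert_features)
    return project_l2_ball((1.0 - eta_w) * w + eta_w * gap, np.sqrt(zeta))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    psi_policy = rng.normal(size=(256, 8)) + 0.5  # hypothetical rollout features
    psi_expert = rng.normal(size=(256, 8))        # hypothetical expert features
    w = np.zeros(8)
    for _ in range(20):
        w = gradient_ascent_update(w, psi_policy, psi_expert, zeta=1.0, eta_w=0.33)
    print(np.linalg.norm(w - exact_update(psi_policy, psi_expert, zeta=1.0)))
```

With a fixed policy, repeated gradient ascent steps converge to the exact update; with $\eta_w = 1$ the two coincide after a single step, which is why a smaller step size acts as a smoothing of the discriminator across outer iterations.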
We use TRPO (Schulman et al., 2015b) as our choice of policy gradient method; note that this can be replaced by other alternatives including PPO (Schulman et al., 2017c), SAC (Haarnoja et al., 2018b), etc. Similar to practical implementations of existing policy gradient methods, we implement a reward filter that clips the IPM reward $f(s)$ to the range $[c_{\min}, c_{\max}]$, as this stabilizes the policy gradient updates. Note that the minimization is done with access to $\hat{P}$, which implies we perform model-based planning. Empirically, for purposes of tuning the exploration-imitation parameter $\lambda$, we minimize a surrogate, namely $\mathbb{E}_{(s,a)\sim d^{\pi}_{\hat{P}}}\left[(1 - \lambda)\cdot f_w(s) - b(s, a)\right]$ (recall that $b(s, a)$ has a factor of $\lambda$ associated with it). This ensures that we can precisely control the magnitude of the bonuses against the IPM costs, which, in our experience, is empirically easier to work with.

| Parameter | Cartpole-v1 | Reacher-v2 | Swimmer-v2 | Cartpole-v0 | Hopper-v2 | Walker2d-v2 |
|---|---|---|---|---|---|---|
| Environment Specifications | | | | | | |
| Horizon H | 500 | 50 | 400 | 200 | 400 | 400 |
| Expert Performance (≈) | 460 | −10 | 38 | 181 | 3000 | 2000 |
| # online samples per outer loop | 2·H | 2·H | 2·H | 2·H | 8·H | 3·H |
| Dynamics Model | | | | | | |
| Architecture/Non-linearity | MLP(64, 64)/ReLU | MLP(64, 64)/ReLU | MLP(512, 512)/ReLU | MLP(64, 64)/ReLU | MLP(512, 512)/ReLU | MLP(512, 512)/ReLU |
| Optimizer (LR, Momentum, Batch Size) | SGD(0.005, 0.99, 256) | SGD(0.005, 0.99, 256) | SGD(0.005, 0.99, 256) | SGD(0.005, 0.99, 256) | SGD(0.005, 0.99, 256) | SGD(0.005, 0.99, 256) |
| # train passes per outer loop | 20 | 100 | 100 | 20 | 50 | 200 |
| Grad Clipping | 2.0 | 2.0 | 1.0 | 2.0 | 4.0 | 1.0 |
| Replay Buffer Size | 10·H | 10·H | 10·H | 10·H | 16·H | 15·H |
| Ensemble based bonus | | | | | | |
| # models/bonus range | 2/[0, 1] | 2/[0, 1] | 2/[0, 1] | 2/[0, 1] | 2/[0, 1] | 2/[0, 1] |
| IPM parameters | | | | | | |
| Step size for w update (η_w) | Exact | Exact | 0.33 | Exact | Exact | Exact |
| # RFFs/BW Heuristic | 128/0.1 quantile | 128/0.1 quantile | 128/0.1 quantile | 128/0.1 quantile | 128/0.1 quantile | 128/0.1 quantile |
| Policy parameterization | | | | | | |
| Architecture/Non-linearity | MLP(64, 64)/TanH | MLP(64, 64)/TanH | MLP(64, 64)/TanH | MLP(32, 32)/TanH | MLP(32, 32)/TanH | MLP(32, 32)/TanH |
| Policy Constraints | None | None | None | None | log σ_min = −1.0 | log σ_min = −2.0 |
| Planning Algorithm | | | | | | |
| # model samples per TRPO step | 2·H | 10·H | 4·H | 4·H | 8·H | 20·H |
| # TRPO steps per outer loop (K_PG) | 3 | 10 | 20 | 5 | 10 | 15 |
| TRPO Parameters (CG iters, dampening, kl, gae λ, γ) | (50, 0.001, 0.01, 0.97, 0.995) | (100, 0.001, 0.01, 0.97, 0.995) | (100, 0.001, 0.01, 0.97, 0.995) | (100, 0.001, 0.01, 0.97, 0.995) | (10, 0.0001, 0.025, 0.97, 0.995) | (10, 0.0001, 0.025, 0.97, 0.995) |
| Critic parameterization | | | | | | |
| Architecture/Non-linearity | MLP(128, 128)/ReLU | MLP(128, 128)/ReLU | MLP(128, 128)/ReLU | MLP(32, 32)/ReLU | MLP(128, 128)/ReLU | MLP(128, 128)/ReLU |
| Optimizer (LR, Batch Size, ϵ, Regularization) | Adam(0.001, 64, 1e−5, 0) | Adam(0.001, 64, 1e−5, 0) | Adam(0.001, 64, 1e−5, 0) | Adam(0.001, 64, 1e−5, 0) | Adam(0.001, 64, 1e−8, 1e−3) | Adam(0.001, 64, 1e−8, 1e−3) |
| # train passes per TRPO update | 1 | 1 | 1 | 1 | 2 | 2 |

Table A.1: List of various hyper-parameters employed in MobILE's implementation.
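The surrogate planning cost described before Table A.1 (clip the IPM cost, weight it by $1 - \lambda$, and subtract the bonus) can be sketched as below. The function name and the example values of `c_min`, `c_max`, and `lam` are hypothetical; the sketch assumes the bonus passed in already carries its factor of $\lambda$, as the text states.

```python
import numpy as np


def surrogate_cost(ipm_cost, bonus, lam, c_min, c_max):
    """Per-sample cost (1 - lam) * clip(f_w(s), c_min, c_max) - b(s, a).

    ipm_cost: array of f_w(s) values along model rollouts.
    bonus:    array of b(s, a) values, already scaled to lie in [0, lam].
    """
    clipped = np.clip(ipm_cost, c_min, c_max)  # reward filter for stability
    return (1.0 - lam) * clipped - bonus


if __name__ == "__main__":
    f_w = np.array([-5.0, 0.5, 5.0])
    b = np.array([0.0, 0.1, 0.2])
    print(surrogate_cost(f_w, b, lam=0.2, c_min=-1.0, c_max=1.0))
```

Setting `lam` close to 0 recovers pure imitation under the learned model, while larger values shift weight toward exploring state-action pairs where the ensemble disagrees.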
A.3.3 Hyper-parameter Details

This section presents an overview of the hyper-parameters necessary to implement Algorithm 2 in practice, as described in Algorithm 12. The hyper-parameters are listed in Table A.1. They are broadly categorized by the component of MobILE they correspond to, namely, (a) environment specifications, (b) dynamics model, (c) ensemble based bonus, (d) IPM parameterization, (e) policy parameterization, (f) planning algorithm parameters, (g) critic parameterization. Note that if a hyper-parameter has not been listed (for instance, the value of momentum for the Adam optimizer in the critic), it has been left at the default value defined in PyTorch.

A.4 Additional Experimental Results

A.4.1 Modified Cartpole-v0 environment with noise added to transition dynamics

[Figure A.1 plot: CartPole-v0 (stochastic); x-axis: Online Samples (×1e4); y-axis: Return (Value); curves: BC, Expert, GAIL, BC-O, MobILE (Ours), GAIFO, ILPO.]

Figure A.1: Learning curves for Cartpole-v0 with stochastic dynamics with 20 expert trajectories comparing MobILE with BC, BC-O, GAIL, GAIFO and ILPO.

We consider a stochastic variant of Cartpole-v0, in which we add Gaussian noise, with variance unknown to the learner, to make the transition dynamics of the environment stochastic. Specifically, we train an expert of value ≈ 170 in Cartpole-v0 with stochastic dynamics using TRPO. Then, using 20 trajectories drawn from this expert, we solve the ILFO problem using MobILE as well as other baselines including BC, BC-O, ILPO, GAIL and GAIFO. Figure A.1 presents the result of this comparison. Note that MobILE compares favorably against other baseline methods. In particular, BC tends to suffer in environments like Cartpole-v0 with stochastic dynamics because of the increased generalization error of the supervised learning algorithm used for learning a policy.
Our algorithm is competitive with BC-O, GAIL, GAIFO and ILPO. Note that BC-O tends to outperform BC both in Cartpole-v1 and in Cartpole-v0 (with stochastic dynamics).

A.4.2 Swimmer Learning Curves

We supplement the learning curves for Swimmer-v2 (with 40 expert trajectories) with the learning curves for Swimmer-v2 with 10 expert trajectories in Figure A.2. As can be seen, MobILE outperforms baseline algorithms such as BC, BC-O, ILPO, GAIL and GAIFO in Swimmer-v2 with both 40 and 10 expert trajectories. The caveat is that with 10 expert trajectories, all algorithms tend to show much more variance in their behavior; this variance reduces as we move to the 40 expert trajectory case.

[Figure A.2 plots: two panels, "40 trajectories" and "10 trajectories"; x-axis: Online Samples (×1e5); y-axis: Return (Value); curves: BC, Expert, GAIL, BC-O, MobILE (Ours), GAIFO, ILPO.]

Figure A.2: Learning curves for Swimmer-v2 with 40 (left) and 10 (right) expert trajectories comparing MobILE with BC, BC-O, ILPO, GAIL and GAIFO. MobILE continues to perform well relative to all other benchmarks with both 10 and 40 expert trajectories. The variance of the algorithm as well as the benchmarks is notably higher with fewer expert trajectories.

[Figure A.3 plots: five panels, CartPole-v1 (10 traj.), Reacher-v2 (10 traj.), Swimmer-v2 (40 traj.), Hopper-v2 (10 traj.), Walker2d-v2 (10 traj.); x-axis: Online Samples; y-axis: Return (Value); curves: BC, Expert, GAIL, GAIFO, ILPO, BC-O, MobILE (Ours).]

Figure A.3: Learning curves tracking the running maximum averaged across seeds comparing MobILE against BC, BC-O, ILPO, GAIL and GAIFO. MobILE tends to reach expert performance consistently and in a more sample efficient manner.
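The "running maximum averaged across seeds" reported in Figure A.3 can be computed along the following lines. This is a minimal sketch assuming the per-iteration returns are stored as an array of shape (seeds, iterations); the function names are illustrative and not taken from the thesis codebase.

```python
import numpy as np


def running_max_curve(returns):
    """Best-so-far return at each iteration, averaged over seeds.

    returns: array of shape (n_seeds, n_iterations) with the policy value
             measured at every iteration of every seed.
    """
    best_so_far = np.maximum.accumulate(returns, axis=1)  # per-seed running max
    return best_so_far.mean(axis=0)                       # average across seeds


def best_policy_flat_line(returns):
    """Single number for a flat-line summary: per-seed best value, averaged over seeds."""
    return returns.max(axis=1).mean()


if __name__ == "__main__":
    returns = np.array([[1.0, 3.0, 2.0],
                        [2.0, 1.0, 4.0]])
    print(running_max_curve(returns))      # non-decreasing curve
    print(best_policy_flat_line(returns))  # equals the last point of the curve
```

Because the running maximum is taken per seed before averaging, the resulting curve is non-decreasing by construction, and its final point coincides with the flat-line summary.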
A.4.3 Additional Results

In this section, we give another view of our results for MobILE compared against the baselines (BC/BC-O/ILPO/GAIL/GAIFO) by tracking the running maximum of each policy's value averaged across seeds. Specifically, for every iteration $t$, we plot the best policy performance obtained by the algorithm so far, averaged across seeds (note that this quantity is monotonic, since the best policy obtained so far can never be worse at a later point of time when running the algorithm). For BC/BC-O/ILPO, we present a simplified view by picking the best policy obtained through the course of running the algorithm and averaging it across seeds (so the curves are flat lines). As Figure A.3 shows, MobILE reliably hits expert performance faster than GAIL and GAIFO while often matching or outperforming ILPO/BC/BC-O.

A.4.4 Ablation Study on Number of Models used for Strategic Exploration Bonus

In this experiment, we present an ablation study on using a larger number of models in the ensemble for setting the strategic exploration bonus. Figure A.4 suggests that even two models are effective in practice for setting the bonus.

[Figure A.4 plot: CartPole-v1 (5 traj.); x-axis: # Online Samples (×1e4); y-axis: Return (Value); curves: Expert, 2 models, 4 models, 8 models.]

Figure A.4: Learning curves for Cartpole-v1 with varying number of dynamics models for assigning bonuses for strategic exploration.

APPENDIX B

MISSING PROOFS AND DETAILS IN CHAPTER 3

B.1 Bonus Designs

We show the bonus design in Section 3.5 is valid, i.e., the model is well-calibrated for tabular MDPs, KNRs, and GPs.

B.1.1 Tabular models

Lemma 42. With probability $1 - \delta$,

$$\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\|_1 \le \sqrt{\frac{|\mathcal{S}|\log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)}{2\{N(s,a) + \lambda\}}} + \frac{\lambda}{N(s,a) + \lambda} \quad \forall (s,a)\in\mathcal{S}\times\mathcal{A}.$$

Proof. When $N(s, a) > 0$, we use the concentration inequality of discrete distributions (Jiang, 2020).
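Then, with probability $1 - \delta$,

$$\left\|\frac{N(\cdot|s,a)}{N(s,a)} - P(\cdot|s,a)\right\|_1 \le \sqrt{\frac{|\mathcal{S}|\log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)}{2N(s,a)}} \quad \forall (s,a)\in\{(s,a) : N(s,a) > 0\}.$$

Thus, noting $0 < N(s,a)/(N(s,a) + \lambda) < 1$, with probability $1 - \delta$, we have, for all $(s,a)\in\{(s,a) : N(s,a) > 0\}$,

$$\left\|\frac{N(\cdot|s,a)}{N(s,a) + \lambda} - P(\cdot|s,a)\times\frac{N(s,a)}{N(s,a) + \lambda}\right\|_1 \le \sqrt{\frac{|\mathcal{S}|\log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)}{2\{N(s,a) + \lambda\}}}. \tag{B.1}$$

Moreover, the above inequality is still well-defined and holds in the case $N(s, a) = 0$. Thus, with probability $1 - \delta$, Equation (B.1) holds for all $(s,a)\in\mathcal{S}\times\mathcal{A}$. Recall that the estimator $\hat{P}$ is $N(s'|s,a)/(N(s,a) + \lambda)$. Therefore,

$$\begin{aligned}
\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\|_1 &\le \left\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\times\frac{N(s,a)}{N(s,a) + \lambda}\right\|_1 + \left\|P(\cdot|s,a) - P(\cdot|s,a)\times\frac{N(s,a)}{N(s,a) + \lambda}\right\|_1 \\
&\le \sqrt{\frac{|\mathcal{S}|\log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)}{2\{N(s,a) + \lambda\}}} + \frac{\lambda}{N(s,a) + \lambda}.
\end{aligned}$$

This concludes the proof. □

B.1.2 KNRs

In KNRs, the ground truth model is $s' = W^\star\phi(s,a) + \epsilon$, $\epsilon\sim\mathcal{N}(0, \zeta^2 I)$, where $s\in\mathbb{R}^{d_S}$, $a\in\mathbb{R}^{d_A}$, $\phi : \mathcal{S}\times\mathcal{A}\to\mathbb{R}^d$. We define $\|\phi(s,a)\|^2_{\Sigma^{-1}_{n_o}} := \phi^\top(s,a)\,\Sigma^{-1}_{n_o}\,\phi(s,a)$.

Lemma 43. With probability at least $1 - \delta$, we have:

$$\left\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\right\|_1 \le \min\left\{\frac{\beta_{n_o}}{\zeta}\|\phi(s,a)\|_{\Sigma^{-1}_{n_o}},\ 2\right\} \quad \forall (s,a)\in\mathcal{S}\times\mathcal{A},$$

where

$$\beta_{n_o} = \sqrt{2\lambda\|W^\star\|_2^2 + 8\zeta^2\left(d_S\ln(5) + \ln(1/\delta) + \bar{I}_{n_o}\right)}, \qquad \bar{I}_{n_o} = \ln\left(\det(\Sigma_{n_o})/\det(\lambda I)\right).$$

Proof. The proof directly follows the confidence ball construction and proof from Kakade et al. (2020b). Specifically, from Lemma B.5 in Kakade et al. (2020b), we have that with probability at least $1 - \delta$,

$$\left\|(\hat{W} - W^\star)(\Sigma_{n_o})^{1/2}\right\|_2^2 \le \beta_{n_o}^2.$$

Thus, with Lemma 64, we have:

$$\left\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\right\|_1 \le \frac{1}{\zeta}\left\|(\hat{W} - W^\star)\phi(s,a)\right\|_2 \le \left\|(\hat{W} - W^\star)(\Sigma_{n_o})^{1/2}\right\|_2\|\phi(s,a)\|_{\Sigma^{-1}_{n_o}}/\zeta \le \frac{\beta_{n_o}}{\zeta}\|\phi(s,a)\|_{\Sigma^{-1}_{n_o}}.$$

This concludes the proof. □

B.1.3 Gaussian processes

Let $\mathcal{H}_k$ be the RKHS with the kernel $k(\cdot,\cdot)$. We denote the associated norm and inner product by $\|\cdot\|_k$ and $\langle\cdot,\cdot\rangle_k$. In GPs, the ground truth model is defined as $s' = g^*(s,a) + \epsilon$, $\epsilon\sim\mathcal{N}(0, \zeta^2 I)$, where $g^*$ belongs to an RKHS $\mathcal{H}_k$.

Lemma 44.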
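With probability $1 - \delta$,

$$\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\|_1 \le \min\left(\frac{\beta_{n_o}}{\zeta}\sqrt{k_{n_o}((s,a),(s,a))},\ 2\right) \quad \forall (s,a)\in\mathcal{S}\times\mathcal{A},$$

where

$$\beta_{n_o} = \sqrt{d_S\{2 + 150\log^3(d_S n_o/\delta)\,I_{n_o}\}}, \qquad I_{n_o} = \log(\det(I + \zeta^{-2}K_{n_o})).$$

Proof. Let $\hat{g}_i$ and $g^*_i$ be the $i$-th components of $\hat{g}$ and $g^*$. We have

$$\begin{aligned}
\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\|_1 &\le \frac{1}{\zeta}\|\hat{g}(s,a) - g^*(s,a)\|_2 && \text{(Lemma 64)} \\
&= \frac{1}{\zeta}\left\{\sum_{i=1}^{d_S}\{\hat{g}_i(s,a) - g^*_i(s,a)\}^2\right\}^{1/2} \\
&\le \frac{1}{\zeta}\left\{\sum_{i=1}^{d_S} k_{n_o}((s,a),(s,a))\,\|\hat{g}_i - g^*_i\|^2_{k_{n_o}}\right\}^{1/2}. && \text{(CS inequality and } g(s,a) = \langle g(\cdot), k((s,a),\cdot)\rangle_{k_{n_o}})
\end{aligned}$$

By (Srinivas et al., 2010, Theorem 6), with probability $1 - \delta$, we have

$$\|\hat{g}_i - g^*_i\|_{k_{n_o}} \le \beta_{n_o} \quad \forall i\in[1,\cdots,d_S].$$

This concludes the statement. □

B.2 Proof of Theorem 10

In this section, we prove Theorem 10. We also prove the RL version of Theorem 10, in which the cost $c$ is given and the goal is policy optimization. Before that, we prepare several lemmas.

Lemma 45. With probability $1 - \delta$, we have

$$\forall f\in\mathcal{F}, \quad |\mathbb{E}_{(s,a)\sim d^{\pi_e}}[f(s,a)] - \mathbb{E}_{\mathcal{D}_e}[f(s,a)]| \le \epsilon_{\mathrm{stat}}, \qquad \epsilon_{\mathrm{stat}} = \sqrt{\log(2|\mathcal{F}|/\delta)/(2n_e)}.$$

Proof. This follows from Hoeffding's inequality and a union bound over $\mathcal{F}$. □

Lemma 46 (Pessimistic Policy Evaluation 1). Suppose Assumption 9 holds and $\max_{f\in\mathcal{F}}\|f\|_\infty\le 1$. With probability at least $1 - \delta$,

$$\forall\pi\in\Pi,\ \forall f\in\mathcal{F}, \quad 0 \le V^{\pi}_{\hat{P}, f+b} - V^{\pi}_{P, f}.$$

Proof of Lemma 46. We denote the expected total cost of $\pi$ under $\hat{P}$ and cost function $f$ from step $h$ onward by $V^{\pi}_{\hat{P}, f:h}$. In this proof, we condition on the event

$$\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\|_1 \le \min(\sigma(s,a), 2) \quad \forall (s,a)\in\mathcal{S}\times\mathcal{A}.$$

We argue by induction. We start from $h = H + 1$, where $V^{\pi}_{\hat{P}, f+b:H+1} = V^{\pi}_{P, f:H+1} = 0$. Assume the inductive hypothesis holds at $h + 1$, i.e.,

$$0 \le V^{\pi}_{\hat{P}, f+b:h+1}(s) - V^{\pi}_{P, f:h+1}(s), \quad \forall s\in\mathcal{S},\ \forall\pi\in\Pi,\ \forall f\in\mathcal{F}.$$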
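Then, $\forall\pi\in\Pi$, $\forall f\in\mathcal{F}$,

$$\begin{aligned}
Q^{\pi}_{P, f:h}(s,a) - Q^{\pi}_{\hat{P}, f+b:h}(s,a) &= -b(s,a) + \mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\pi}_{P, f:h+1}(s')] - \mathbb{E}_{s'\sim\hat{P}(\cdot|s,a)}[V^{\pi}_{\hat{P}, f+b:h+1}(s')] \\
&\le -b(s,a) + \mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\pi}_{P, f:h+1}(s')] - \mathbb{E}_{s'\sim\hat{P}(\cdot|s,a)}[V^{\pi}_{P, f:h+1}(s')] && \text{(Inductive hypothesis)} \\
&\le -b(s,a) + H\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\|_1 && (\|\mathcal{F}\|_\infty\le 1) \\
&\le -H\min(\sigma(s,a), 2) + H\min(\sigma(s,a), 2) = 0. && \text{(Bonus construction)}
\end{aligned}$$

Then, noting $Q^{\pi}_{P, f:h}(s,\pi(s)) - Q^{\pi}_{\hat{P}, f+b:h}(s,\pi(s)) = V^{\pi}_{P, f:h}(s) - V^{\pi}_{\hat{P}, f+b:h}(s)$, we have

$$V^{\pi}_{P, f:h}(s) - V^{\pi}_{\hat{P}, f+b:h}(s) \le 0 \quad \forall\pi\in\Pi,\ \forall f\in\mathcal{F}.$$

This concludes the induction step. Then, we have

$$V^{\pi}_{P, f} - V^{\pi}_{\hat{P}, f+b} = V^{\pi}_{P, f:1} - V^{\pi}_{\hat{P}, f+b:1} \le 0 \quad \forall\pi\in\Pi,\ \forall f\in\mathcal{F}. \qquad □$$

Lemma 47 (Pessimistic Policy Evaluation 2). Suppose Assumption 9 holds and $\|\mathcal{F}\|_\infty\le 1$. With probability at least $1 - \delta$,

$$\forall\pi\in\Pi,\ \forall f\in\mathcal{F}, \quad V^{\pi}_{\hat{P}, f+b} - V^{\pi}_{P, f} \le \mathrm{Error}, \qquad \mathrm{Error} := (3H^2 + H)\,\mathbb{E}_{(s,a)\sim d^{\pi}_{P}}[\min(\sigma(s,a), 2)].$$

Proof of Lemma 47. In this proof, we condition on the event

$$\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\|_1 \le \min(\sigma(s,a), 2) \quad \forall (s,a)\in\mathcal{S}\times\mathcal{A}.$$

We invoke the simulation lemma (Lemma 63). Then, we have, $\forall\pi\in\Pi$, $\forall f\in\mathcal{F}$,

$$\begin{aligned}
V^{\pi}_{\hat{P}, f+b} - V^{\pi}_{P, f} &= \sum_{h=1}^{H}\mathbb{E}_{(s,a)\sim d^{\pi}_{P}}\left[b(s,a) + \mathbb{E}_{s'\sim\hat{P}(\cdot|s,a)}[V^{\pi}_{\hat{P}, f+b;h}(s')] - \mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\pi}_{\hat{P}, f+b;h}(s')]\right] \\
&\le \sum_{h=1}^{H}\mathbb{E}_{(s,a)\sim d^{\pi}_{P}}\left[b(s,a) + \|V^{\pi}_{\hat{P}, f+b;h}\|_\infty\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\|_1\right] \\
&\le H\,\mathbb{E}_{(s,a)\sim d^{\pi}_{P}}\left[H\min(\sigma(s,a), 2) + H(2H+1)\min(\sigma(s,a), 2)\right] && (\|V^{\pi}_{\hat{P}, f+b;h}\|_\infty\le H(2H+1)) \\
&= (3H^2 + H)\,\mathbb{E}_{(s,a)\sim d^{\pi}_{P}}[\min(\sigma(s,a), 2)].
\end{aligned}$$

Here, we use $\|V^{\pi}_{\hat{P}, f+b;h}\|_\infty\le H(2H+1)$, which is derived from $0\le f + b\le 2H + 1$. □

Using the above lemmas, we prove our main result.

Proof of Theorem 10. In this proof, we condition on the event

$$\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\|_1 \le \min(\sigma(s,a), 2),$$

which holds with probability $1 - \delta$, and the event in Lemma 45, which holds with probability $1 - \delta$.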
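Then, with probability $1 - 2\delta$, we have

$$\begin{aligned}
V^{\hat{\pi}_{IL}}_{P,c} - V^{\pi_e}_{P,c} &\le V^{\hat{\pi}_{IL}}_{\hat{P},c+b} - V^{\pi_e}_{P,c} && \text{(Lemma 46)} \\
&\le H\max_{f\in\mathcal{F}}\{\mathbb{E}_{(s,a)\sim d^{\hat{\pi}_{IL}}_{\hat{P}}}[f(s,a) + b(s,a)] - \mathbb{E}_{(s,a)\sim d^{\pi_e}_{P}}[f(s,a)]\} && (c\in\mathcal{F}) \\
&\le H\max_{f\in\mathcal{F}}\{\mathbb{E}_{(s,a)\sim d^{\hat{\pi}_{IL}}_{\hat{P}}}[f(s,a) + b(s,a)] - \mathbb{E}_{\mathcal{D}_e}[f(s,a)]\} + H\epsilon_{\mathrm{stat}} && \text{(Lemma 45)} \\
&\le H\max_{f\in\mathcal{F}}\{\mathbb{E}_{(s,a)\sim d^{\pi_e}_{\hat{P}}}[f(s,a) + b(s,a)] - \mathbb{E}_{\mathcal{D}_e}[f(s,a)]\} + H\epsilon_{\mathrm{stat}} && (\pi_e\in\Pi \text{ and the definition of } \hat{\pi}_{IL}) \\
&\le H\max_{f\in\mathcal{F}}\{\mathbb{E}_{(s,a)\sim d^{\pi_e}_{\hat{P}}}[f(s,a) + b(s,a)] - \mathbb{E}_{(s,a)\sim d^{\pi_e}_{P}}[f(s,a)]\} + 2H\epsilon_{\mathrm{stat}} && \text{(Lemma 45)} \\
&\le \max_{f\in\mathcal{F}}\{V^{\pi_e}_{\hat{P}, f+b} - V^{\pi_e}_{P, f}\} + 2H\epsilon_{\mathrm{stat}} \\
&\le (3H^2 + H)\,\mathbb{E}_{(s,a)\sim d^{\pi_e}_{P}}[\min(\sigma(s,a), 2)] + 2H\epsilon_{\mathrm{stat}} && \text{(Lemma 47)} \\
&\le (6H^2 + 2H)\,\mathbb{E}_{(s,a)\sim d^{\pi_e}_{P}}[\min(\sigma(s,a), 1)] + 2H\epsilon_{\mathrm{stat}}.
\end{aligned}$$

This concludes the proof. □

Finally, we prove the finite-sample error bounds for the RL case. Similar results are obtained in (Kidambi et al., 2020b; Yu et al., 2020). We use this theorem in the next section.

Theorem 48 (Bounds for RL). Consider any comparator policy $\tilde{\pi}\in\Pi$. Assume $P\in\mathcal{P}$ and Assumption 9 holds. Then, with probability $1 - 2\delta$, we have

$$V^{\hat{\pi}_{RL}}_{P,c} - V^{\tilde{\pi}}_{P,c} \le (6H^2 + 2H)\,\mathbb{E}_{(s,a)\sim d^{\tilde{\pi}}_{P}}[\min(\sigma(s,a), 1)]. \tag{B.2}$$

Proof of Theorem 48.

$$\begin{aligned}
V^{\hat{\pi}_{RL}}_{P,c} - V^{\tilde{\pi}}_{P,c} &\le V^{\hat{\pi}_{RL}}_{\hat{P},c+b} - V^{\tilde{\pi}}_{P,c} && \text{(Lemma 46)} \\
&\le V^{\tilde{\pi}}_{\hat{P},c+b} - V^{\tilde{\pi}}_{P,c} && (\tilde{\pi}\in\Pi \text{ and the definition of } \hat{\pi}_{RL}) \\
&\le (3H^2 + H)\,\mathbb{E}_{(s,a)\sim d^{\tilde{\pi}}_{P}}[\min(\sigma(s,a), 2)] && \text{(Lemma 47)} \\
&\le (6H^2 + 2H)\,\mathbb{E}_{(s,a)\sim d^{\tilde{\pi}}_{P}}[\min(\sigma(s,a), 1)].
\end{aligned}$$

This concludes the proof. □

B.3 Finite sample error bound for each model

In this section, we analyze the bound for the following models: (1) discrete MDPs, (2) KNRs, (3) GPs. All of the proofs are deferred to Section B.3.4. We will also discuss the implications for the RL case using Theorem 48.

B.3.1 Discrete MDPs

Recall that the $\pi_e$-concentrability coefficient is defined by

$$C^{\pi_e} = \max_{(s,a)}\frac{d^{\pi_e}_{P}(s,a)}{\rho(s,a)}.$$

Then, the error is calculated as follows.

Theorem 49 (Error of MILO for discrete MDPs).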
• With probability $1 - \delta$, when $\lambda = \Omega(1)$,

$$V^{\hat{\pi}_{IL}}_{P,c} - V^{\pi_e}_{P,c} \le \mathrm{Err}_o + \mathrm{Err}_e,$$
$$\mathrm{Err}_o = c_1 H^2\log(|\mathcal{S}||\mathcal{A}|c_2/\delta)\left(\sqrt{\frac{C^{\pi_e}|\mathcal{S}|^2|\mathcal{A}|}{n_o}} + \frac{C^{\pi_e}|\mathcal{S}||\mathcal{A}|}{n_o}\right), \qquad \mathrm{Err}_e = 2H\sqrt{\frac{\log(2|\mathcal{F}|/\delta)}{2n_e}},$$

where $c_1$ and $c_2$ are some universal constants.

• With probability $1 - \delta$, when $\lambda = \Omega(1)$,

$$V^{\hat{\pi}_{RL}}_{P,c} - V^{\pi^*}_{P,c} \le c_1 H^2\log(|\mathcal{S}||\mathcal{A}|c_2/\delta)\left(\sqrt{\frac{C^{\pi^*}|\mathcal{S}|^2|\mathcal{A}|}{n_o}} + \frac{C^{\pi^*}|\mathcal{S}||\mathcal{A}|}{n_o}\right), \qquad C^{\pi^*} = \max_{(s,a)}\frac{d^{\pi^*}_{P}(s,a)}{\rho(s,a)}, \tag{B.3}$$

where $c_1$ and $c_2$ are some universal constants.

The quantity $C^{\pi_e}$ measures the difference of distributions between the expert and the batch data. This is much smaller than the common concentrability coefficients in offline RL:

$$\max_{\pi\in\Pi}\max_{(s,a)\in\mathcal{S}\times\mathcal{A}}\frac{d^{\pi}_{P}(s,a)}{\rho(s,a)}, \qquad \frac{1}{\min_{(s,a)}\rho(s,a)},$$

which measure the worst discrepancy between all policies in $\Pi$ and the batch data (Yin and Wang, 2020). These assumptions imply that $\rho$ has global coverage. We achieve this better bound via pessimism. In the RL case, a bound similar to (B.3) has been obtained in offline policy optimization based on FQI (Rashidinejad et al., 2021). However, their work is limited to the tabular case. Hereafter, we will show that our result extends to more general continuous MDPs.

B.3.2 KNRs

As in Lemma 43, $\sigma(s,a)$ is given by $\beta_{n_o}/\zeta\,\|\phi(s,a)\|_{\Sigma^{-1}_{n_o}}$. Thus, from Theorem 10, the final error bound of $V^{\hat{\pi}_{IL}}_{P,c} - V^{\pi_e}_{P,c}$ is

$$(6H^2 + 2H)\min\left(\mathbb{E}_{(s,a)\sim d^{\pi_e}_{P}}\left[\beta_{n_o}/\zeta\,\|\phi(s,a)\|_{\Sigma^{-1}_{n_o}}\right], 1\right) + 2H\sqrt{\log(2|\mathcal{F}|/\delta)/(2n_e)}.$$

Hereafter, we analyze $\beta_{n_o}$ and $\mathbb{E}_{(s,a)\sim d^{\pi_e}_{P}}[\|\phi(s,a)\|_{\Sigma^{-1}_{n_o}}]$.

Analysis of information gain

First, we analyze $\beta_{n_o}$. We need to upper-bound the information gain $\bar{I}_{n_o}$ in $\beta_{n_o}$. Recall $\Sigma_\rho = \mathbb{E}_{(s,a)\sim\rho}[\phi(s,a)\phi^\top(s,a)]$ and $\phi(s,a)\in\mathbb{R}^d$.

Theorem 50 (Finite sample analysis of information gain in finite-dimensional linear models). Assume $\|\phi(s,a)\|_2\le 1$ $\forall (s,a)\in\mathcal{S}\times\mathcal{A}$. Let $c_1, c_2$ be universal constants.

1. When $\lambda = \Omega(1)$, with probability $1 - \delta$, we have
$$\bar{I}_{n_o} = \log(\det(\Sigma_{n_o})/\det(\lambda I)) \le c_1\,\mathrm{rank}(\Sigma_\rho)\{\mathrm{rank}(\Sigma_\rho) + \log(c_2/\delta)\}\log(1 + n_o).$$

2.
When $\lambda = \Omega(1)$ and $\zeta^2 = \Omega(1)$, with probability $1 - \delta$, we have
$$\beta_{n_o} \le c_1\sqrt{\|W^\star\|_2^2 + d_S\,\mathrm{rank}(\Sigma_\rho)\{\mathrm{rank}(\Sigma_\rho) + \log(c_2/\delta)\}\log(1 + n_o)}.$$

Theorem 50 states $\bar{I}_{n_o} = O(\mathrm{rank}[\Sigma_\rho]^2\log(n_o))$. We highlight the novelty of our analysis compared to the other literature. Seeger et al. (2008) analyze the expectation of the information gain in a fixed or random design setting. Following their discussion, we can prove $\mathbb{E}[\bar{I}_{n_o}]\le\mathrm{rank}(\Sigma_\rho)\log(1 + n_o)$, as in Theorem 72, by Jensen's inequality. Going beyond the expectation, we derive the finite-sample result by leveraging the variational representation and the uniform law with localization in Lemma 69. The finite-sample analysis is much harder than calculating the bound of the expectation.

The worst case of $\bar{I}_{n_o}$, referred to as the maximum information gain, has often been used in online learning (Srinivas et al., 2010; Abbasi-yadkori et al., 2011; Kakade et al., 2020b). From their discussion, we always have $\bar{I}_{n_o} = O(d\log(n))$. Here, we show that the information gain can be upper-bounded more tightly when $\mathrm{rank}[\Sigma_\rho]^2\le d$ in offline RL (a random design setting). Compared to the analysis of maximum information gain, our analysis takes the low-rankness of the design matrix $\Sigma_\rho$ into consideration by fully utilizing the random design setting assumption.

Analysis of $\mathbb{E}_{(s,a)\sim d^{\pi_e}_{P}}[\|\phi(s,a)\|_{\Sigma^{-1}_{n_o}}]$ and the final bound

Next, we analyze $\mathbb{E}_{(s,a)\sim d^{\pi_e}_{P}}[\|\phi(s,a)\|_{\Sigma^{-1}_{n_o}}]$.

Theorem 51. Suppose $\lambda = \Omega(1)$, $\zeta^2 = \Omega(1)$, $\|W^\star\|_2 = \Omega(1)$. Let $c_1, c_2$ be some universal constants.

1. With probability $1 - \delta$,
$$\mathbb{E}_{(s,a)\sim d^{\pi_e}_{P}}[\|\phi(s,a)\|_{\Sigma^{-1}_{n_o}}] \le c_1\sqrt{\frac{C^{\pi_e}\,\mathrm{rank}[\Sigma_\rho]\{\mathrm{rank}[\Sigma_\rho] + \log(c_2/\delta)\}}{n_o}}, \qquad C^{\pi_e} = \sup_{x\in\mathbb{R}^d}\left(\frac{x^\top\Sigma_{\pi_e}x}{x^\top\Sigma_\rho x}\right),$$
where $\Sigma_{\pi_e} = \mathbb{E}_{(s,a)\sim d^{\pi_e}_{P}}[\phi(s,a)\phi(s,a)^\top]$.

2. With probability $1 - \delta$,
$$V^{\hat{\pi}_{IL}}_{P,c} - V^{\pi_e}_{P,c} \le \mathrm{Err}_o + \mathrm{Err}_e, \qquad \bar{R} = \mathrm{rank}[\Sigma_\rho]\{\mathrm{rank}[\Sigma_\rho] + \log(c_2/\delta)\}, \tag{B.4}$$
$$\mathrm{Err}_o = c_1 H^2\min(d^{1/2}, \bar{R})\sqrt{\bar{R}}\sqrt{\frac{d_S C^{\pi_e}\log(1 + n_o)}{n_o}}, \qquad \mathrm{Err}_e = 2H\sqrt{\log(2|\mathcal{F}|/\delta)/(2n_e)}.$$

3. With probability $1 - \delta$, let $C^{\pi^*} = \sup_{x\in\mathbb{R}^d}\left(\frac{x^\top\Sigma_{\pi^*}x}{x^\top\Sigma_\rho x}\right)$, $\Sigma_{\pi^*} = \mathbb{E}_{(s,a)\sim d^{\pi^*}_{P}}[\phi(s,a)\phi^\top(s,a)]$.
Then, we have
$$V^{\hat{\pi}_{RL}}_{P,c} - V^{\pi^*}_{P,c} \le c_1 H^2\{\mathrm{rank}(\Sigma_\rho) + \log(c_2/\delta)\}\,\mathrm{rank}(\Sigma_\rho)\sqrt{\frac{d_S C^{\pi^*}\log(1 + n_o)}{n_o}}.$$

The final bound (B.4) suggests that $\mathrm{Err}_o$ is $\tilde{O}(H^2\,\mathrm{rank}[\Sigma_\rho]^2\sqrt{d_S C^{\pi_e}/n_o})$. We can also get $\tilde{O}(H^2\,\mathrm{rank}[\Sigma_\rho]\,d^{1/2}\sqrt{d_S C^{\pi_e}/n_o})$, which implies $\tilde{O}(H^2 d^{3/2}\sqrt{d_S C^{\pi_e}/n_o})$. In other words, when $C^{\pi_e}$ and $\mathrm{rank}[\Sigma_\rho]$ are not too large and the offline sample size $n_o$ is large enough, $O(H\sqrt{\log(|\mathcal{F}|)/n_e})$ is the dominating term, and the covariate shift problem in BC can be avoided since the horizon dependence is just $H$. Recall that the known BC error bound is $O(H^2\sqrt{\log|\Pi|/n_e})$ (Agarwal et al., 2019, Chapter 14).

We now examine the implications of $\mathrm{Err}_o$ in more detail; this term also corresponds to the error in the RL case. The rate in $n_o$ is $n_o^{-1/2}$, which is the standard rate in parametric regression. In addition, the bound depends on $\mathrm{rank}[\Sigma_\rho]$ and $C^{\pi_e}$. Importantly, since we always have $\mathrm{rank}[\Sigma_\rho]\le d$, our final bound captures the possible low-rankness of the batch data. The quantity $C^{\pi_e}$ corresponds to the $\pi_e$-concentrability coefficient ($*$-concentrability in the RL case). This is much smaller than the worst case concentrability coefficients:

$$\sup_{\pi\in\Pi}C^{\pi}, \qquad \tilde{C} = \sup_{(s,a)}\|\phi(s,a)\|_2^2\,\|\Sigma_\rho^{-1}\|_2.$$

Finally, we note the technical novelty by comparing it to the techniques developed in the offline RL literature. A quantity similar to $\mathbb{E}_{(s,a)\sim d^{\pi_e}_{P}}[\|\phi(s,a)\|_{\Sigma^{-1}_{n_o}}]$ has been analyzed in Jin et al. (2020b)¹, which studies the error bound of FQI with pessimism in linear MDPs. (Jin et al., 2020b, Corollary 4.5) assumes full coverage, i.e., that $\Sigma_\rho$ is full-rank and has lower bounded eigenvalues. Also, the number of offline samples $n_o$ depends on the smallest eigenvalue. Our analysis uses only partial coverage, with the refined concept of relative condition number, and thus does not require the full rank assumption on $\Sigma_\rho$. Moreover, our bound is distribution dependent, i.e., it depends on $\mathrm{rank}[\Sigma_\rho]$ rather than the ambient dimension of the feature vector $\phi$.
Thus the bound is much tighter for benign cases where the offline data from $\rho$ happens to concentrate on a low-dimensional subspace.

Beyond the model-based offline RL literature, one can potentially adapt the model-free offline policy evaluation results (e.g., Duan et al. (2020); Wang et al. (2020a)) with linear function approximation to offline policy optimization (without pessimism). Such model-free results will also incur $\sup_{\pi\in\Pi}C^{\pi}$, $\tilde{C}$ and the ambient dimension $d$, instead of the much more refined quantities $C^{\pi_e}$ and $\mathrm{rank}[\Sigma_\rho]$.

¹ They analyze $\mathbb{E}_{(s,a)\sim d^{\pi^*}_{P}}[\|\phi(s,a)\|_{\Sigma^{-1}_{n_o}}]$, which also appears in our RL result, Theorem 51.

B.3.3 Gaussian processes

In this section, we give details on GPs. Note that prior works on model-free and model-based offline IL do not have results for infinite-dimensional non-parametric models. Thus our techniques developed in this section are new and relevant even to the offline RL literature, a point that we will return to at the end of this section.

From Theorem 10, the final error is

$$(6H^2 + 2H)\min\left(\mathbb{E}_{x\sim d^{\pi_e}_{P}}\left[\beta_{n_o}/\zeta\,\sqrt{k_{n_o}(x,x)}\right], 1\right) + 2H\sqrt{\log(2|\mathcal{F}|/\delta)/(2n_e)},$$

where $x = (s,a)$. Hereafter, we analyze $\beta_{n_o}$ and $\mathbb{E}_{x\sim d^{\pi_e}_{P}}[\sqrt{k_{n_o}(x,x)}]$.

Before going into the details, we repeat several important notations below. In this section, following Srinivas et al. (2010), for simplicity, we suppose the following:

Assumption 52. $k(x,x)\le 1$, $\forall x\in\mathcal{S}\times\mathcal{A}$. $k(\cdot,\cdot)$ is a continuous and positive semidefinite kernel. $\mathcal{S}\times\mathcal{A}$ is a compact space.

Recall that we denote $x := (s,a)$ and that we have orthonormal eigenfunctions and eigenvalues $\{\psi_i, \mu_i\}_{i=1}^{\infty}$ by Mercer's theorem. We denote the feature mapping $\phi(x) := [\sqrt{\mu_1}\psi_1(x), \dots, \sqrt{\mu_\infty}\psi_\infty(x)]^\top$. Assuming the eigenvalues $\{\mu_1, \dots, \mu_\infty\}$ are in non-increasing order, we recall the effective dimension:

$$d^* = \min\{j\in\mathbb{N} : j\ge B(j+1)\,n_o/\zeta^2\}, \qquad B(j) = \sum_{k=j}^{\infty}\mu_k.$$

We also introduce the empirical version of $d^*$, where $\hat{\mu}_i$ are the eigenvalues of the gram matrix $K_{n_o}$.

Definition 53 (Empirical effective dimension).
$$\hat{d} = \min\{j\in\mathbb{N} : j\ge\hat{B}(j+1)/\zeta^2\}, \qquad \hat{B}(j) = \sum_{k=j}^{n_o}\hat{\mu}_k.$$

Hereafter, for simplicity, we treat $\zeta^2 = 1$, that is, $\zeta^2 = \Omega(1)$. Then, since $n_o\ge B(n_o + 1)\,n_o/\zeta^2$, we have $d^*\le n_o$.

The effective dimensions $\hat{d}$ and $d^*$ are widely used in the machine learning literature. The first quantity $d^*$ is often referred to as the degree of freedom (Zhang, 2005; Bach, 2017). In finite-dimensional linear kernels $\{x\mapsto a^\top\phi(x), a\in\mathbb{R}^d\}$ ($k(x,x) = \phi^\top(x)\phi(x)$), $d^*$ is $\mathrm{rank}[\mathbb{E}_{x\sim\rho}[\phi(x)\phi^\top(x)]]$. Thus, $d^*$ is considered to be a natural extension of $\mathrm{rank}[\mathbb{E}_{x\sim\rho}[\phi(x)\phi^\top(x)]]$ to infinite-dimensional models. The worst case of the second quantity,

$$\max_{\{x_1\in\mathcal{S}\times\mathcal{A},\cdots,x_{n_o}\in\mathcal{S}\times\mathcal{A}\}}\hat{d},$$

is often used in the online learning literature (Valko et al., 2013; Janz et al., 2020). Up to logarithmic factors, it is equal to the maximum information gain (Srinivas et al., 2010),

$$\max_{\{x_1\in\mathcal{S}\times\mathcal{A},\cdots,x_{n_o}\in\mathcal{S}\times\mathcal{A}\}}\log\det(I + K_{n_o}),$$

as shown in Calandriello et al. (2019); Valko et al. (2013). Importantly, as we will see soon, since our setting is offline (a random design setting), $\hat{d}$ can be upper-bounded much more tightly than in their analysis.

Analysis of information gain

With the above in mind, we first analyze $\beta_{n_o}$. To do that, we need to bound the information gain $I_{n_o}$. From (Seeger et al., 2008, Lemma 1), we can easily prove

$$\mathbb{E}[I_{n_o}]\le\log(1 + n_o)\,d^*,$$

as in Theorem 73. Going beyond the expectation, we derive the finite-sample error bound.

Theorem 54 (Finite sample analysis of information gain in infinite-dimensional models). Suppose Assumption 52 holds. Let $c_1$ and $c_2$ be universal constants.

1. We have
$$I_{n_o} = \log(\det(I + \zeta^{-2}K_{n_o}))\le 2\hat{d}\{\log(1 + n_o/\zeta^2) + 1\}. \tag{B.5}$$

2. When $\zeta^2 = \Omega(1)$, with probability $1 - \delta$,
$$I_{n_o} = \log(\det(I + \zeta^{-2}K_{n_o}))\le c_1\{d^* + \log(c_2/\delta)\}\,d^*\log(1 + n_o).$$

3. When $\zeta^2 = \Omega(1)$, with probability $1 - \delta$,
$$\beta_{n_o}\le c_1\sqrt{d_S\log^3(c_2 d_S n_o/\delta)\{d^* + \log(c_2/\delta)\}\,d^*\log(1 + n_o)}.$$

Theorem 54 states $I_{n_o} = O((d^*)^2\log(n_o))$.
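To make the empirical effective dimension and the deterministic bound (B.5) concrete, the following sketch computes both on synthetic data. Everything specific here is an assumption for illustration: the RBF kernel (chosen because it satisfies $k(x,x)\le 1$), the sample sizes, and the function names. The definition of $\hat{d}$ used is $\min\{j : j\ge\hat{B}(j+1)/\zeta^2\}$ with $\hat{B}(j)=\sum_{k\ge j}\hat{\mu}_k$ over the gram-matrix eigenvalues sorted in non-increasing order.

```python
import numpy as np


def rbf_gram(X, lengthscale=1.0):
    """Gram matrix of the RBF kernel; k(x, x) = 1 on the diagonal."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))


def empirical_effective_dimension(K, zeta2=1.0):
    """Smallest j >= 1 with j >= (sum of eigenvalues of K beyond the top j) / zeta2."""
    mu = np.sort(np.clip(np.linalg.eigvalsh(K), 0.0, None))[::-1]
    n = len(mu)
    for j in range(1, n + 1):
        tail = mu[j:].sum()  # B_hat(j + 1) with 1-indexed eigenvalues
        if j >= tail / zeta2:
            return j
    return n


def information_gain(K, zeta2=1.0):
    """I_n = log det(I + K / zeta2), computed stably via slogdet."""
    _, logdet = np.linalg.slogdet(np.eye(len(K)) + K / zeta2)
    return logdet


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    K = rbf_gram(X)
    d_hat = empirical_effective_dimension(K)
    gain = information_gain(K)
    bound = 2.0 * d_hat * (np.log(1.0 + len(K)) + 1.0)
    print(d_hat, gain, bound)
```

For smooth kernels the eigenvalues of the gram matrix decay quickly, so $\hat{d}$ is typically far smaller than the number of samples, which is the regime in which the bound is informative.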
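Our bound in the offline (random design) setting can be much tighter than in the online setting, that is, than the known upper bound on the maximum information gain in Srinivas et al. (2010), though we can always use the latter as a bound on $I_{n_o}$ with probability 1. We can see this for linear kernels, as in the previous section. For $d$-dimensional linear kernels, the maximum information gain is $d$. On the other hand, $\{d^*\}^2 = \mathrm{rank}[\Sigma_\rho]^2$ can be much smaller than $d$.

Analysis of learning curves and the final bound

We bound $\mathbb{E}_{x\sim d^{\pi_e}_P}[\sqrt{k_{n_o}(x,x)}]$, where
\[
k_{n_o}(x,x') = k(x,x') - \bar{k}_{n_o}(x)^\top(\mathbf{K}_{n_o} + \zeta^2 I)^{-1}\bar{k}_{n_o}(x'), \qquad \{x_i\}_{i=1}^{n_o} \sim \rho(x),
\]
and $x = (s,a)$.

Recalling the definition of the eigenvalues $\{\mu_i\}$ and eigenfunctions $\{\psi_i\}$ (which are orthonormal), we define the feature mapping $\phi(x) = [\sqrt{\mu_1}\psi_1(x), \ldots, \sqrt{\mu_\infty}\psi_\infty(x)]^\top$. Denote by $\Phi \in \mathbb{R}^{n_o\times\infty}$ the matrix whose $i$-th row is $\phi(x_i)^\top$. Since $k(x,x') = \phi(x)^\top\phi(x')$, we can rewrite the kernel $k_{n_o}(x,x)$ as follows:
\begin{align*}
k_{n_o}(x,x) &= \phi(x)^\top\phi(x) - \phi(x)^\top\Phi^\top\big(\Phi\Phi^\top + \zeta^2 I\big)^{-1}\Phi\,\phi(x) \\
&= \phi(x)^\top\big[I - \Phi^\top\big(\Phi\Phi^\top + \zeta^2 I\big)^{-1}\Phi\big]\phi(x) \\
&= \phi(x)^\top\big(I + \zeta^{-2}\Phi^\top\Phi\big)^{-1}\phi(x) = \phi(x)^\top\Sigma_{n_o}^{-1}\phi(x),
\end{align*}
where $\Sigma_{n_o} := I + \zeta^{-2}\sum_{i=1}^{n_o}\phi(x_i)\phi(x_i)^\top$, and we use the matrix inversion lemma in the third equality. Note that the infinite-dimensional inversion lemma is formalized in the proof.

Now we can use the definition of the relative condition number and Lemma 60 for a distribution change, i.e.,
\begin{align*}
\mathbb{E}_{x\sim d^{\pi_e}_P}\big[\sqrt{k_{n_o}(x,x)}\big] &\le \sqrt{\mathbb{E}_{x\sim d^{\pi_e}_P}[k_{n_o}(x,x)]} = \sqrt{\mathrm{tr}\big(\mathbb{E}_{x\sim d^{\pi_e}_P}[\phi(x)\phi(x)^\top]\,\Sigma_{n_o}^{-1}\big)} \\
&\le \sqrt{C_{\pi_e}\,\mathrm{tr}\big(\mathbb{E}_{x\sim\rho}[\phi(x)\phi(x)^\top]\,\Sigma_{n_o}^{-1}\big)} = \sqrt{C_{\pi_e}\,\mathbb{E}_{x\sim\rho}[k_{n_o}(x,x)]},
\end{align*}
where
\[
C_{\pi_e} = \sup_{\|x\|_2\le 1}\frac{x^\top\Sigma_{\pi_e}x}{x^\top\Sigma_\rho x}, \qquad \Sigma_{\pi_e} = \mathbb{E}_{x\sim d^{\pi_e}_P}[\phi(x)\phi(x)^\top], \qquad \Sigma_\rho = \mathbb{E}_{x\sim\rho}[\phi(x)\phi(x)^\top].
\]
Now we only need to analyze $\mathbb{E}_{x\sim\rho}[k_{n_o}(x,x)]$. Before proceeding to the analysis, we introduce the critical radius (Bartlett et al., 2005).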
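Given some function class $\mathcal{F}$, consider the localized population Rademacher complexity
\[
\mathcal{R}_n(\delta;\mathcal{F}) = \mathbb{E}\Bigg[\sup_{f\in\mathcal{F},\ \mathbb{E}_{x\sim\rho}[f^2(x)]\le\delta}\Bigg|\frac{1}{n_o}\sum_{i=1}^{n_o}\epsilon_i f(x_i)\Bigg|\Bigg],
\]
where $\{x_i\}$ are i.i.d. samples from $\rho(x)$ and $\{\epsilon_i\}$ are i.i.d. Rademacher variables taking values in $\{-1,+1\}$ equiprobably, independent of the sequence $\{x_i\}$. The critical radius is defined as the minimum solution to $\mathcal{R}_n(\xi;\mathcal{F}) \le \xi^2/b$ with respect to $\xi$, where $b$ is a value such that $\|\mathcal{F}\|_\infty \le b$.

Theorem 55. Suppose Assumption 52 holds. Let $c_1$ and $c_2$ be universal constants.

1. Let $\delta_{n_o}$ be the critical radius of the function class $\{f : f\in\mathcal{H}_k,\ \|f\|_k\le 1\}$. With probability $1-\delta$,
\[
\mathbb{E}_{x\sim d^{\pi_e}_P}\big[\sqrt{k_{n_o}(x,x)}\big] \le c_1\,\zeta\,\delta'_{n_o}\sqrt{C_{\pi_e}d^*},
\]
where $\delta'_{n_o} = \delta_{n_o} + \sqrt{\log(c_2/\delta)/n_o}$.
2. Assume $\zeta^2 = \Omega(1)$. With probability $1-\delta$,
\[
\delta_{n_o} \le c_1\sqrt{d^*/n_o}, \qquad \mathbb{E}_{x\sim d^{\pi_e}_P}\big[\sqrt{k_{n_o}(x,x)}\big] \le c_1\sqrt{\frac{C_{\pi_e}\,d^*\{d^* + \log(c_2/\delta)\}}{n_o}}.
\]
3. Assume $\zeta^2 = \Omega(1)$. With probability $1-\delta$,
\begin{align*}
V^{\hat{\pi}_{\mathrm{IL}}}_{P,c} - V^{\pi_e}_{P,c} &\le \mathrm{Err}_o + \mathrm{Err}_e, \tag{B.6} \\
\mathrm{Err}_o &= c_1 H^2\{d^* + \log(c_2/\delta)\}\,d^*\sqrt{\frac{d_{\mathcal{S}}\,C_{\pi_e}\log^3(c_2 d_{\mathcal{S}} n_o/\delta)\log(1+n_o)}{n_o}}, \\
\mathrm{Err}_e &= 2H\sqrt{\log(2|\mathcal{F}|/\delta)/(2n_e)}.
\end{align*}
4. Assume $\zeta^2 = \Omega(1)$. For offline RL, with probability $1-\delta$,
\[
V^{\hat{\pi}_{\mathrm{RL}}}_{P,c} - V^{\pi^*}_{P,c} \le c_1 H^2\{d^* + \log(c_2/\delta)\}\,d^*\sqrt{\frac{d_{\mathcal{S}}\,C_{\pi^*}\log^3(c_2 d_{\mathcal{S}} n_o/\delta)\log(1+n_o)}{n_o}},
\]
where $C_{\pi^*} = \sup_{\|x\|_2\le 1}\frac{x^\top\Sigma_{\pi^*}x}{x^\top\Sigma_\rho x}$.

The final bound in (B.6) suggests that $\mathrm{Err}_o$ is $\tilde{O}(H^2\{d^*\}^2\sqrt{d_{\mathcal{S}}C_{\pi_e}/n_o})$. In other words, when $C_{\pi_e}$ and $d^*$ are not too large and the offline sample size is large enough, $\mathrm{Err}_e$ dominates $\mathrm{Err}_o$, and the covariate shift problem in BC can be avoided since the horizon dependence is just $H$. Our bound is the natural extension of Theorem 51 to possibly infinite-dimensional models.

The first and second statements in Theorem 55 are proved mainly in two steps: writing $k_{n_o}(x,x)$ in its variational representation and using the uniform law with localization. Note that the critical radius can be upper-bounded more tightly than $O(\sqrt{d^*/n_o})$ depending on the kernel. Besides, $C_{\pi_e}$ can be replaced with a tighter quantity:
\[
\max_{i\in\mathbb{N}}\ \mathbb{E}_{(s,a)\sim d^{\pi_e}_P}[\psi_i^2(s,a)].
\]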
Since $\mathbb{E}_{(s,a)\sim\rho}[\psi_i^2(s,a)] = 1$, this quantity also measures the difference between the batch data and the expert. It is at most $C_{\pi_e}$, noting that $\frac{x^\top\Sigma_{\pi_e}x}{x^\top\Sigma_\rho x} = \mathbb{E}_{(s,a)\sim d^{\pi_e}_P}[\psi_i^2(s,a)]$ when $x$ is the vector whose $i$-th element is 1 and whose other elements are 0.

The third statement in Theorem 55 is proved directly by combining the second statement in Theorem 55 with Theorem 54.

Implication to offline RL

The final statement in Theorem 55 is the bound for the RL case. This is the first result showing an error bound for pessimistic offline RL with nonparametric models. As related literature, in model-free offline RL, Uehara et al. (2021); Duan et al. (2021) obtained finite-sample error bounds characterized by the critical radius for minimax-type estimators called Modified BRM (Antos et al., 2008). As we show in Theorem 55, since the critical radius of an RKHS ball is upper-bounded by the effective dimension $d^*$, their bounds are also characterized by the effective dimension. On top of that, several papers derived bounds under the general function approximation setting: FQI (Fan et al., 2020; Duan et al., 2021; Munos and Szepesvári, 2008; Chen and Jiang, 2019), marginal weighting based estimators (Uehara et al., 2020), DICE methods (Zhang et al., 2020; Nachum et al., 2019b), policy based methods (Liao et al., 2020; Liu et al., 2020) and MABO (Xie and Jiang, 2020). Compared to our result, all of their bounds depend on
\[
\sup_{\pi\in\Pi}\ \sup_{(s,a)}\frac{d^\pi_P(s,a)}{\rho(s,a)} \qquad\text{or}\qquad \sup_{(s,a)}\frac{1}{\rho(s,a)}.
\]
The pessimistic bonus allows us to obtain a bound that depends only on $C_{\pi^*}$ and not on the above constants. Besides, our $C_{\pi^*}$ in Theorem 55 is a more refined quantity than the density ratios in the sense that it is defined in terms of the relative condition number. Note that we can easily obtain statements that replace $C_{\pi^*}$ in Theorem 55 with $\frac{d^{\pi^*}_P(s,a)}{\rho(s,a)}$.

Remark 56 (Relation with more general offline RL literature).
Due to the lack of exploration, dealing with insufficient coverage of the offline data is known to be a challenging problem (Zanette, 2020; Wang et al., 2020a). We use penalty terms based on model-based RL. In the above, we explained how the penalty term in MILO (and its RL counterpart) translates into the final sample-error bounds. The idea of penalization has been used in a variety of other ways in offline RL. One way is to impose constraints on the policy class or Q-function class so that the estimated policies are not too far from the behavior policies. For example, we can use the KL divergence, the MMD distance, or the Wasserstein distance to measure the distance from behavior policies (Wu et al., 2019; Fakoor et al., 2021; Matsushima et al., 2020; Touati et al., 2020; Fujimoto et al., 2019) and add $D(\pi, \pi_b)$ as a penalty term, where $\pi_b$ is a behavior policy. Another way is to explicitly estimate a lower bound of the Q-functions (Kumar et al., 2020; Yu et al., 2021, 2020). By doing so, we can avoid overestimating the Q-functions in unknown (non-covered) regions.

Remark 57 (Relation with GP literature). The quantity $\mathbb{E}_{x\sim\rho(x)}[k_{n_o}(x,x)]$ is often referred to as the learning curve in the GP literature (Williams and Vivarelli, 2000; Sollich and Halees, 2002; Rasmussen and Williams, 2005). Their analysis mainly takes a numerical viewpoint, that is, how to approximately calculate $\mathbb{E}_{x\sim\rho(x)}[k_{n_o}(x,x)]$. Though Le Gratiet et al. (2015) analyze the convergence properties, their analysis is limited to the expectation and the result is asymptotic. As far as we know, our result is the first to show a finite-sample error rate.

Remark 58 (Duality between KNRs and GPs). KNRs and GPs have a primal and dual relationship via Mercer's theorem. In fact, as we have seen, since $k(\cdot,\cdot) = \langle\phi(\cdot),\phi(\cdot)\rangle$, we have $k_{n_o}(x,x) = \phi(x)^\top\Sigma_{n_o}^{-1}\phi(x)$. Thus, our results for GPs can be applied to infinite-dimensional KNRs with $\phi : \mathcal{S}\times\mathcal{A}\mapsto\mathcal{H}$, where $\mathcal{H}$ is some RKHS.
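The primal and dual forms in Remark 58 can be compared numerically in a finite-dimensional case. The sketch below is an illustration only: it assumes a random finite feature matrix in place of the Mercer features, and the function names `posterior_variance_kernel` and `posterior_variance_feature` are hypothetical. It evaluates the GP form $k(x,x) - \bar{k}_{n_o}(x)^\top(\mathbf{K}_{n_o} + \zeta^2 I)^{-1}\bar{k}_{n_o}(x)$ and the KNR form $\phi(x)^\top\Sigma_{n_o}^{-1}\phi(x)$ with $\Sigma_{n_o} = I + \zeta^{-2}\Phi^\top\Phi$.

```python
import numpy as np

def posterior_variance_kernel(phi_x, Phi, zeta):
    """GP (dual) form: k(x,x) - k_bar(x)^T (K + zeta^2 I)^{-1} k_bar(x)."""
    K = Phi @ Phi.T                      # Gram matrix of the offline points
    k_bar = Phi @ phi_x                  # kernel evaluations k(x_i, x)
    n = Phi.shape[0]
    return phi_x @ phi_x - k_bar @ np.linalg.solve(K + zeta ** 2 * np.eye(n), k_bar)

def posterior_variance_feature(phi_x, Phi, zeta):
    """KNR (primal) form: phi(x)^T (I + Phi^T Phi / zeta^2)^{-1} phi(x)."""
    d = Phi.shape[1]
    Sigma = np.eye(d) + Phi.T @ Phi / zeta ** 2
    return phi_x @ np.linalg.solve(Sigma, phi_x)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 5))           # 20 offline points, 5 features (assumed sizes)
phi_x = rng.normal(size=5)               # features of a query point x
zeta = 0.7

kernel_form = posterior_variance_kernel(phi_x, Phi, zeta)
feature_form = posterior_variance_feature(phi_x, Phi, zeta)
print(kernel_form, feature_form)         # the two forms agree up to round-off
```

The agreement is exactly the matrix inversion lemma; the GP form solves an $n_o\times n_o$ system while the feature form solves a $d\times d$ one, which is why the feature form is convenient when $d$ is small.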
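Remark 59 (Online RL using RKHS). Several works in the online RL literature use RKHSs, both in a model-based way (Calandriello et al., 2019), like our work, and in a model-free way (Agarwal et al., 2020a; Yang et al., 2020; Du et al., 2021). In both cases, the final sample-error bounds incur the maximum information gain, i.e., a worst-case quantity that is distribution independent. In comparison, our final bounds use the distribution-dependent quantity $d^*$.

B.3.4 Missing Proofs

Below, we provide the missing proofs for tabular MDPs, KNRs, and non-parametric GP models.

Missing proofs for tabular result

We start by proving the tabular MDP result.

Proof of Theorem 49. We use Theorem 10. Then, we have
\[
V^{\hat{\pi}_{\mathrm{IL}}}_{P,c} - V^{\pi_e}_{P,c} \le (6H^2 + 2H)\min\big(1,\ \mathbb{E}_{(s,a)\sim d^{\pi_e}_P}[\sigma(s,a)]\big) + H\epsilon_{\mathrm{stat}}.
\]
Hereafter, we show how to upper-bound $\mathbb{E}_{(s,a)\sim d^{\pi_e}_P}[\sigma(s,a)]$. We use Lemma 65. Then, letting $\xi = c_1\log(|\mathcal{S}||\mathcal{A}|c_2/\delta)$, with probability $1-\delta$, we have
\[
\frac{1}{N(s,a) + \lambda} \le \frac{\xi}{n_o\rho(s,a) + \lambda} \qquad \forall (s,a)\in\mathcal{S}\times\mathcal{A}.
\]
We condition on the above event. Then,
\begin{align*}
\mathbb{E}_{(s,a)\sim d^{\pi_e}_P}[\sigma(s,a)] &\le \mathbb{E}_{(s,a)\sim d^{\pi_e}_P}\Bigg[\sqrt{\frac{|\mathcal{S}|\log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)}{2\{N(s,a)+\lambda\}}} + \frac{\lambda}{N(s,a)+\lambda}\Bigg] \\
&\le \sqrt{\mathbb{E}_{(s,a)\sim d^{\pi_e}_P}\Bigg[\frac{|\mathcal{S}|\log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)}{2\{N(s,a)+\lambda\}}\Bigg]} + \mathbb{E}_{(s,a)\sim d^{\pi_e}_P}\Bigg[\frac{\lambda}{N(s,a)+\lambda}\Bigg].
\end{align*}
From Lemma 65, we have
\begin{align*}
\mathbb{E}_{(s,a)\sim d^{\pi_e}_P}[\sigma(s,a)] &\le \sqrt{\xi\,\mathbb{E}_{(s,a)\sim d^{\pi_e}_P}\Bigg[\frac{|\mathcal{S}|\log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)}{n_o\rho(s,a)+\lambda}\Bigg]} + \mathbb{E}_{(s,a)\sim d^{\pi_e}_P}\Bigg[\frac{\lambda\xi}{n_o\rho(s,a)+\lambda}\Bigg] \\
&\le \sqrt{\xi\,C_{\pi_e}\,\mathbb{E}_{(s,a)\sim\rho}\Bigg[\frac{|\mathcal{S}|\log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)}{n_o\rho(s,a)+\lambda}\Bigg]} + C_{\pi_e}\,\mathbb{E}_{(s,a)\sim\rho}\Bigg[\frac{\lambda\xi}{n_o\rho(s,a)+\lambda}\Bigg] \\
&\le \sqrt{\xi\,C_{\pi_e}\sum_{s,a}\frac{\{|\mathcal{S}|\log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)\}\,\rho(s,a)}{n_o\rho(s,a)+\lambda}} + C_{\pi_e}\sum_{s,a}\frac{\rho(s,a)\,\lambda\xi}{n_o\rho(s,a)+\lambda} \\
&\le \sqrt{\xi\,C_{\pi_e}\{|\mathcal{S}|\log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)\}\,|\mathcal{S}||\mathcal{A}|/n_o} + \lambda\,C_{\pi_e}\,\xi\,|\mathcal{S}||\mathcal{A}|/n_o,
\end{align*}
where again $C_{\pi_e} = \max_{(s,a)}\frac{d^{\pi_e}_P(s,a)}{\rho(s,a)}$. This concludes the proof. □

Missing proofs for KNR results

Next, we provide proofs for the KNR results.

Proof of Theorem 50. In the proof, we use two statements, Equation (B.8) and Equation (B.9), from the proof of Theorem 51.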
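We recommend reading the proof of Theorem 51 first. We denote the eigenvalues of $\sum_{i=1}^{n_o}\phi(s_i,a_i)\phi^\top(s_i,a_i)$ by $\{\hat{\mu}_i\}_{i=1}^{d}$ s.t. $\hat{\mu}_1 \ge \hat{\mu}_2 \ge \cdots$. Since we assume $\|\phi(s,a)\|_2 \le 1$, we have $\hat{\mu}_1 \le n_o$.

First step

We first show
\[
\log(\det(\Sigma_{n_o})/\det(\lambda I)) \le \mathrm{tr}\Bigg[\Sigma_{n_o}^{-1}\sum_{i=1}^{n_o}\phi(s_i,a_i)\phi^\top(s_i,a_i)\Bigg]\{\log(1 + n_o/\lambda) + 1\}.
\]
Note this directly shows $\log(\det(\Sigma_{n_o})/\det(\lambda I)) \le d\log(1 + n_o/\lambda)$, $\phi(s,a)\in\mathbb{R}^d$. The above is proved as follows:
\begin{align*}
\log(\det(\Sigma_{n_o})/\det(\lambda I)) &= \sum_{i=1}^{d}\log\Big(1 + \frac{\hat{\mu}_i}{\lambda}\Big) = \sum_{i=1}^{d}\log\Big(1 + \frac{\hat{\mu}_i}{\lambda}\Big)\frac{\hat{\mu}_i/\lambda + 1}{\hat{\mu}_i/\lambda + 1} \\
&= \sum_{i=1}^{d}\Bigg[\log\Big(1 + \frac{\hat{\mu}_i}{\lambda}\Big)\frac{\hat{\mu}_i/\lambda}{\hat{\mu}_i/\lambda + 1} + \log\Big(1 + \frac{\hat{\mu}_i}{\lambda}\Big)\frac{1}{\hat{\mu}_i/\lambda + 1}\Bigg] \\
&\le \log\Big(1 + \frac{\hat{\mu}_1}{\lambda}\Big)\sum_{i=1}^{d}\frac{\hat{\mu}_i/\lambda}{\hat{\mu}_i/\lambda + 1} + \sum_{i=1}^{d}\frac{\hat{\mu}_i/\lambda}{\hat{\mu}_i/\lambda + 1} && (\log(1+x) < x) \\
&\le \{\log(1 + n_o/\lambda) + 1\}\sum_{i=1}^{d}\frac{\hat{\mu}_i/\lambda}{\hat{\mu}_i/\lambda + 1} && (\hat{\mu}_1 \le n_o) \\
&= \{\log(1 + n_o/\lambda) + 1\}\,\mathrm{tr}\Bigg[\Sigma_{n_o}^{-1}\sum_{i=1}^{n_o}\phi(s_i,a_i)\phi^\top(s_i,a_i)\Bigg].
\end{align*}
In the last line, letting $UVU^\top$ be the eigendecomposition of $\sum_{i=1}^{n_o}\phi(s_i,a_i)\phi^\top(s_i,a_i)$, we use
\[
\mathrm{tr}\Bigg[\Sigma_{n_o}^{-1}\sum_{i=1}^{n_o}\phi(s_i,a_i)\phi^\top(s_i,a_i)\Bigg] = \mathrm{tr}\big[\{V + \lambda I\}^{-1}V\big] = \sum_{i=1}^{d}\frac{\hat{\mu}_i/\lambda}{\hat{\mu}_i/\lambda + 1}.
\]
Then, the first statement is proved.

Second step

Next, we prove the second statement. We have
\[
\mathrm{tr}\Bigg[\Sigma_{n_o}^{-1}\sum_{i=1}^{n_o}\phi(s_i,a_i)\phi^\top(s_i,a_i)\Bigg] = \sum_{i=1}^{n_o}\phi^\top(s_i,a_i)\Sigma_{n_o}^{-1}\phi(s_i,a_i).
\]
Then, from (B.8), with probability $1-\delta$,
\[
\sum_{i=1}^{n_o}\phi^\top(s_i,a_i)\Sigma_{n_o}^{-1}\phi(s_i,a_i) \lesssim c_1\{\mathrm{rank}[\Sigma_\rho] + \log(c_2/\delta)\}\sum_{i=1}^{n_o}\phi^\top(s_i,a_i)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s_i,a_i). \tag{B.7}
\]
Hereafter, we condition on the above event. To upper-bound $\sum_{i=1}^{n_o}\|\phi(s_i,a_i)\|^2_{\{n_o\Sigma_\rho+\lambda I\}^{-1}}$, we use Bernstein's inequality:
\begin{align*}
&\Bigg|\sum_{i=1}^{n_o}\phi^\top(s_i,a_i)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s_i,a_i) - n_o\,\mathbb{E}_{(s,a)\sim\rho}\big[\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a)\big]\Bigg| \\
&\qquad\lesssim \sqrt{n_o\,\mathrm{Var}_{(s,a)\sim\rho}\big[\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a)\big]} + 1/\lambda,
\end{align*}
since $\|\phi(s,a)\|^2_{\{n_o\Sigma_\rho+\lambda I\}^{-1}} \le 1/\lambda$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$. Here, from (B.9),
\[
n_o\,\mathbb{E}_{(s,a)\sim\rho}\big[\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a)\big] \le \mathrm{rank}[\Sigma_\rho].
\]
Besides,
\begin{align*}
\mathrm{Var}_{(s,a)\sim\rho}\big[\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a)\big] &\le \mathbb{E}_{(s,a)\sim\rho}\big[\{\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a)\}^2\big] \\
&\le \frac{1}{\lambda}\,\mathbb{E}_{(s,a)\sim\rho}\big[\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a)\big] \\
&\le \mathrm{rank}[\Sigma_\rho]/(n_o\lambda).
\end{align*}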
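The last inequality follows from (B.9). Thus,
\[
\sum_{i=1}^{n_o}\phi^\top(s_i,a_i)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s_i,a_i) \lesssim \mathrm{rank}[\Sigma_\rho] + \sqrt{\mathrm{rank}[\Sigma_\rho]/\lambda} + 1/\lambda.
\]
By combining (B.7) with the above, we have
\[
\log(\det(\Sigma_{n_o})/\det(\lambda I)) \le c_1\,\mathrm{rank}(\Sigma_\rho)\{\mathrm{rank}(\Sigma_\rho) + \log(c_2/\delta)\}\log(1 + n_o c_3),
\]
using $\lambda = \Omega(1)$. □

Before proving Theorem 51, we first present some lemmas.

Lemma 60 (Distribution change). Consider two distributions $\rho_1\in\Delta(\mathcal{S}\times\mathcal{A})$ and $\rho_2\in\Delta(\mathcal{S}\times\mathcal{A})$, and a feature mapping $\phi : \mathcal{S}\times\mathcal{A}\mapsto\mathcal{H}$ where $\mathcal{H}$ is some Hilbert space (e.g., finite-dimensional Euclidean space). Denote
\[
C := \sup_{x\in\mathcal{H}}\frac{x^\top\mathbb{E}_{s,a\sim\rho_1}[\phi(s,a)\phi(s,a)^\top]\,x}{x^\top\mathbb{E}_{s,a\sim\rho_2}[\phi(s,a)\phi(s,a)^\top]\,x}.
\]
Then for any positive definite matrix (linear operator) $\Lambda$, we have:
\[
\mathbb{E}_{s,a\sim\rho_1}\,\phi(s,a)^\top\Lambda\,\phi(s,a) \le C\,\mathbb{E}_{s,a\sim\rho_2}\,\phi(s,a)^\top\Lambda\,\phi(s,a).
\]

Proof. Denote the eigendecomposition of $\Lambda$ by $\Lambda = U\Sigma U^\top$, with eigenvalue-eigenvector pairs $\{\sigma_i, u_i\}$. We have:
\begin{align*}
\mathbb{E}_{s,a\sim\rho_1}\,\phi(s,a)^\top\Lambda\,\phi(s,a) &= \sum_{i=0}^{\infty}\sigma_i\,u_i^\top\mathbb{E}_{s,a\sim\rho_1}[\phi(s,a)\phi(s,a)^\top]\,u_i \\
&\le \sum_{i=0}^{\infty}\sigma_i\,C\,u_i^\top\mathbb{E}_{s,a\sim\rho_2}[\phi(s,a)\phi(s,a)^\top]\,u_i = C\,\mathbb{E}_{s,a\sim\rho_2}\,\phi(s,a)^\top\Lambda\,\phi(s,a),
\end{align*}
which concludes the proof. □

Proof of Theorem 51. Here, we prove the first statement. We need to upper-bound $\mathbb{E}_{(s,a)\sim d^{\pi_e}_P}\big[\sqrt{\phi^\top(s,a)\Sigma_{n_o}^{-1}\phi(s,a)}\big]$. As the first step, we use Jensen's inequality:
\[
\mathbb{E}_{(s,a)\sim d^{\pi_e}_P}\Big[\sqrt{\phi^\top(s,a)\Sigma_{n_o}^{-1}\phi(s,a)}\Big] \le \sqrt{\mathbb{E}_{(s,a)\sim d^{\pi_e}_P}\big[\phi^\top(s,a)\Sigma_{n_o}^{-1}\phi(s,a)\big]}.
\]
Hereafter, we analyze $\mathbb{E}_{(s,a)\sim d^{\pi_e}_P}[\phi^\top(s,a)\Sigma_{n_o}^{-1}\phi(s,a)]$. We first use the definition of the relative condition number $C_{\pi_e}$ and Lemma 60 to change the distribution from $d^{\pi_e}_P$ to $\rho$, i.e., via Lemma 60, we have:
\[
\mathbb{E}_{s,a\sim d^{\pi_e}_P}\,\phi(s,a)^\top\Sigma_{n_o}^{-1}\phi(s,a) \le C_{\pi_e}\,\mathbb{E}_{s,a\sim\rho}\,\phi(s,a)^\top\Sigma_{n_o}^{-1}\phi(s,a).
\]
Thus, below we just need to bound $\mathbb{E}_{s,a\sim\rho}\,\phi(s,a)^\top\Sigma_{n_o}^{-1}\phi(s,a)$.

Concentration argument

In this step, we consider how to bound $\mathbb{E}_{(s,a)\sim\rho}[\phi^\top(s,a)\Sigma_{n_o}^{-1}\phi(s,a)]$. To do that, we show that, with probability $1-\delta$,
\[
\phi^\top(s,a)\Sigma_{n_o}^{-1}\phi(s,a) \le c_1\{\mathrm{rank}(\Sigma_\rho) + \log(c_2/\delta)\}\,\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a) \qquad \forall (s,a)\in\mathcal{S}\times\mathcal{A}. \tag{B.8}
\]
We use the variational representation:
\begin{align*}
\phi^\top(s,a)\Sigma_{n_o}^{-1}\phi(s,a) &= \sup_{\{a\in\mathbb{R}^d :\ a^\top\Sigma_{n_o}a\le 1\}}\{a^\top\phi(s,a)\}^2 \\
&= \sup_{\{a\in\mathbb{R}^d :\ a^\top\Sigma_{n_o}a\le 1,\ \|a\|_2^2\le(1+\lambda)/\lambda,\ \|a^\top\phi\|_\infty\le 1/\lambda\}}\{a^\top\phi(s,a)\}^2.
\end{align*}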
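Note that in the first line, we use
\[
\sup_{\{a\in\mathbb{R}^d :\ a^\top\Sigma_{n_o}a\le 1\}} a^\top\phi(s,a) = \sup_{\{b\in\mathbb{R}^d :\ b^\top b\le 1\}} b^\top\Sigma_{n_o}^{-1/2}\phi(s,a) = \|\phi(s,a)\|_{\Sigma_{n_o}^{-1}}.
\]
From the first line to the second line, we use the fact that the maximum over $a$ is attained at $\tilde{a} = \Sigma_{n_o}^{-1}\phi(s,a)/\|\phi(s,a)\|_{\Sigma_{n_o}^{-1}}$ and
\[
\|\tilde{a}\|_2^2 = \phi^\top(s,a)\Sigma_{n_o}^{-2}\phi(s,a)/\{\phi^\top(s,a)\Sigma_{n_o}^{-1}\phi(s,a)\} = (n_o + \lambda)/\lambda^2, \qquad |\tilde{a}^\top\phi| \le \|\phi(s,a)\|_{\Sigma_{n_o}^{-1}} \le 1/\lambda \quad \forall (s,a)\in\mathcal{S}\times\mathcal{A},
\]
noting $\|\phi(s,a)\|_2 \le 1$. By defining $\bar{c} = (n_o + \lambda)/\lambda^2$, we have, for all $(s,a)\in\mathcal{S}\times\mathcal{A}$,
\begin{align*}
\phi^\top(s,a)\Sigma_{n_o}^{-1}\phi(s,a) &= \sup_{\{a\in\mathbb{R}^d :\ a^\top\Sigma_{n_o}a\le 1,\ \|a\|_2^2\le\bar{c},\ \|a^\top\phi\|_\infty\le 1/\lambda\}}\{a^\top\phi(s,a)\}^2 \\
&= \sup_{\{a\in\mathbb{R}^d :\ a^\top\lambda I a + \sum_{i=1}^{n_o}\{a^\top\phi_i\}^2\le 1,\ \|a\|_2^2\le\bar{c},\ \|a^\top\phi\|_\infty\le 1/\lambda\}}\{a^\top\phi(s,a)\}^2.
\end{align*}
Next, we use Lemma 66, that is, with probability $1-\delta$,
\[
\frac{1}{n_o}\sum_{i=1}^{n_o} f^2(s_i,a_i) \ge 0.5\,\mathbb{E}_{(s,a)\sim\rho}[f^2(s,a)] - 0.5\{\delta'_{n_o}\}^2 \qquad \forall f\in\mathcal{F},
\]
where
\[
\mathcal{F} = \{(s,a)\mapsto a^\top\phi(s,a) :\ a^\top\Sigma_{n_o}a\le 1,\ \|a\|_2^2\le\bar{c},\ \|a^\top\phi\|_\infty\le 1/\lambda,\ a\in\mathbb{R}^d\}.
\]
Here, $\delta'_{n_o} = \delta_{n_o} + \sqrt{\log(c_2/\delta)/n_o}$, where $\delta_{n_o}$ is the critical radius of the function class $\mathcal{F}$. Noting $\lambda = \Omega(1)$, from Lemma 67,
\[
\delta'_{n_o} = c_1\sqrt{\mathrm{rank}[\Sigma_\rho]/n_o} + \sqrt{\log(c_2/\delta)/n_o}.
\]
By conditioning on the above event, for all $(s,a)\in\mathcal{S}\times\mathcal{A}$, we have
\begin{align*}
\|\phi(s,a)\|^2_{\Sigma_{n_o}^{-1}} &\le \sup_{\{a\in\mathbb{R}^d :\ a^\top\lambda I a + 0.5 n_o\mathbb{E}_{(s,a)\sim\rho}[\{a^\top\phi\}^2]\le 1 + 0.5 n_o\delta'^2_{n_o},\ \|a\|_2^2\le\bar{c},\ \|a^\top\phi\|_\infty\le 1/\lambda\}}\{a^\top\phi(s,a)\}^2 \\
&\le \sup_{\{a\in\mathbb{R}^d :\ a^\top\{n_o\Sigma_\rho+\lambda I\}a\le 2 + n_o\delta'^2_{n_o},\ \|a\|_2^2\le\bar{c},\ \|a^\top\phi\|_\infty\le 1/\lambda\}}\{a^\top\phi(s,a)\}^2 \\
&\le \sup_{\{a\in\mathbb{R}^d :\ a^\top\{n_o\Sigma_\rho+\lambda I\}a\le 2 + n_o\delta'^2_{n_o}\}}\{a^\top\phi(s,a)\}^2 \\
&= (2 + n_o\delta'^2_{n_o})\,\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a) \\
&\le c_1\{\mathrm{rank}[\Sigma_\rho] + \log(c_2/\delta)\}\,\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a).
\end{align*}

Last step

Then, the final bound is
\begin{align*}
\mathbb{E}_{(s,a)\sim d^{\pi_e}_P}\big[\|\phi(s,a)\|_{\Sigma_{n_o}^{-1}}\big] &\le \sqrt{C_{\pi_e}\,\mathbb{E}_{(s,a)\sim\rho}\big[\phi^\top(s,a)\Sigma_{n_o}^{-1}\phi(s,a)\big]} \\
&\le c_1\sqrt{C_{\pi_e}\{\mathrm{rank}[\Sigma_\rho] + \log(c_2/\delta)\}\,\mathbb{E}_{(s,a)\sim\rho}\big[\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a)\big]}.
\end{align*}
Let $UVU^\top$ be the eigenvalue decomposition of $\Sigma_\rho$ s.t. $V_{i,i} = \mu_i$. We have
\begin{align*}
\mathbb{E}_{(s,a)\sim\rho}\big[\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a)\big] &= \mathrm{Tr}\big[\{n_o\Sigma_\rho + \lambda I\}^{-1}\Sigma_\rho\big] = \mathrm{Tr}\big[\{n_o V + \lambda I\}^{-1}V\big] \\
&= \frac{1}{n_o}\sum_{i=1}^{d}\frac{\mu_i}{\mu_i + \lambda/n_o} \le \frac{\mathrm{rank}[\Sigma_\rho]}{n_o}. \tag{B.9}
\end{align*}
By combining all things together, with probability $1-\delta$,
\[
\mathbb{E}_{(s,a)\sim d^{\pi_e}_P}\big[\|\phi(s,a)\|_{\Sigma_{n_o}^{-1}}\big] \le c_1\sqrt{\frac{C_{\pi_e}\,\mathrm{rank}[\Sigma_\rho]\{\mathrm{rank}[\Sigma_\rho] + \log(c_2/\delta)\}}{n_o}}.
\]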
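□

Missing proofs of non-parametric model

Finally, we provide the missing proofs for the non-parametric GP model.

Proof of Theorem 54. In the proof, we use two statements, (B.10) and (B.11), from the proof of Theorem 55. We recommend reading the proof of Theorem 55 first. We denote the eigenvalues of $\mathbf{K}_{n_o}$ by $\{\hat{\mu}_i\}_{i=1}^{n_o}$ s.t. $\hat{\mu}_1\ge\hat{\mu}_2\ge\cdots$. From Assumption 52, we have
\[
n_o \ge \mathrm{tr}(\mathbf{K}_{n_o}) = \sum_{i=1}^{n_o}\hat{\mu}_i.
\]
This implies $\hat{\mu}_1 \le n_o$. Then,
\begin{align*}
\log(\det(I + \zeta^{-2}\mathbf{K}_{n_o})) &= \sum_{i=1}^{n_o}\log\Big(1 + \frac{\hat{\mu}_i}{\zeta^2}\Big) = \sum_{i=1}^{n_o}\log\Big(1 + \frac{\hat{\mu}_i}{\zeta^2}\Big)\frac{\hat{\mu}_i/\zeta^2 + 1}{\hat{\mu}_i/\zeta^2 + 1} \\
&= \sum_{i=1}^{n_o}\Bigg[\log\Big(1 + \frac{\hat{\mu}_i}{\zeta^2}\Big)\frac{\hat{\mu}_i/\zeta^2}{\hat{\mu}_i/\zeta^2 + 1} + \log\Big(1 + \frac{\hat{\mu}_i}{\zeta^2}\Big)\frac{1}{\hat{\mu}_i/\zeta^2 + 1}\Bigg] \\
&\le \log\Big(1 + \frac{\hat{\mu}_1}{\zeta^2}\Big)\sum_{i=1}^{n_o}\frac{\hat{\mu}_i/\zeta^2}{\hat{\mu}_i/\zeta^2 + 1} + \sum_{i=1}^{n_o}\frac{\hat{\mu}_i/\zeta^2}{\hat{\mu}_i/\zeta^2 + 1} && (\log(1+x)\le x) \\
&\le \{\log(1 + n_o/\zeta^2) + 1\}\sum_{i=1}^{n_o}\frac{\hat{\mu}_i/\zeta^2}{\hat{\mu}_i/\zeta^2 + 1} && (\hat{\mu}_1\le n_o) \\
&\le \{\log(1 + n_o/\zeta^2) + 1\}\min_j\{j + \hat{B}(j+1)/\zeta^2\} \\
&\le 2\{\log(1 + n_o/\zeta^2) + 1\}\,\hat{d},
\end{align*}
where the second-to-last inequality uses the fact that $\sum_{i=1}^{n_o}\frac{\hat{\mu}_i/\zeta^2}{\hat{\mu}_i/\zeta^2+1} \le j + \sum_{i=j+1}^{n_o}\hat{\mu}_i/\zeta^2$. Then, the first statement is proved.

Next, we prove the second statement. We use
\[
\sum_{i=1}^{n_o}\frac{\hat{\mu}_i/\zeta^2}{\hat{\mu}_i/\zeta^2 + 1} = \frac{1}{\zeta^2}\sum_{i=1}^{n_o}k_{n_o}(x_i,x_i),
\]
proved in Lemma 70. Then, from (B.10), with probability $1-\delta$,
\[
\frac{1}{\zeta^2}\sum_{i=1}^{n_o}k_{n_o}(x_i,x_i) \lesssim \delta'^2_{n_o}\sum_{i=1}^{n_o}\ \sup_{\{f :\ \zeta^2/n_o\|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)]\le 1,\ f\in\mathcal{H}_k\}} f^2(x_i),
\]
where $\delta'_n = \delta_n + \sqrt{\log(c_2/\delta)/n_o}$ and $\delta_n$ is the critical radius of $\{f\in\mathcal{H}_k : \|f\|_k\le 1\}$. Hereafter, we condition on the above event.

Then, from Bernstein's inequality,
\begin{align*}
&\Bigg|\sum_{i=1}^{n_o}\ \sup_{\{f :\ \zeta^2/n_o\|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)]\le 1,\ f\in\mathcal{H}_k\}} f^2(x_i) - n_o\,\mathbb{E}_{x\sim\rho}\Bigg[\sup_{\{f :\ \zeta^2/n_o\|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)]\le 1,\ f\in\mathcal{H}_k\}} f^2(x)\Bigg]\Bigg| \\
&\qquad\lesssim \sqrt{n_o\,\mathrm{Var}_{x\sim\rho}\Bigg[\sup_{\{f :\ \zeta^2/n_o\|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)]\le 1,\ f\in\mathcal{H}_k\}} f^2(x)\Bigg]} + n_o.
\end{align*}
We use the fact that, for $f$ in $\mathcal{H}_k$ s.t. $\|f\|_k\le 1$,
\[
|f(x)| = |\langle f(\cdot), k(x,\cdot)\rangle_k| \le \|f\|_k\,\|k(x,\cdot)\|_k \le 1,
\]
from Assumption 52. Here, from (B.11), the expectation is upper-bounded by
\[
\mathbb{E}_{x\sim\rho}\Bigg[\sup_{\{f :\ \zeta^2/n_o\|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)]\le 1,\ f\in\mathcal{H}_k\}} f^2(x)\Bigg] \le d^*.
\]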
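Besides, the variance is also upper-bounded by
\begin{align*}
\mathrm{Var}_{x\sim\rho}\Bigg[\sup_{\{f :\ \zeta^2/n_o\|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)]\le 1,\ f\in\mathcal{H}_k\}} f^2(x)\Bigg] &\le \mathbb{E}_{x\sim\rho}\Bigg[\sup_{\{f :\ \zeta^2/n_o\|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)]\le 1,\ f\in\mathcal{H}_k\}} f^4(x)\Bigg] \\
&\le \mathbb{E}_{x\sim\rho}\Bigg[\sup_{\{f :\ \zeta^2/n_o\|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)]\le 1,\ f\in\mathcal{H}_k\}} f^2(x)\Bigg] && (f^2(x)\le 1\ \forall x\in\mathcal{S}\times\mathcal{A} \text{ from Assumption 52}) \\
&= d^*. && (\text{from (B.11)})
\end{align*}
Thus, with probability $1-\delta$,
\[
\sum_{i=1}^{n_o}k_{n_o}(x_i,x_i) \lesssim \{\delta'_{n_o}\}^2\,n_o\,(d^* + \sqrt{d^*} + 1) \lesssim c_1\{d^* + \log(c_2/\delta)\}\,d^*,
\]
noting $\delta'_{n_o} = \sqrt{d^*/n_o} + \sqrt{\log(c_2/\delta)/n_o}$ from Theorem 55.

By combining all things together, with probability $1-\delta$,
\begin{align*}
\log(\det(I + \zeta^{-2}\mathbf{K}_{n_o})) &\le \{\log(1 + n_o/\zeta^2) + 1\}\sum_{i=1}^{n_o}\frac{\hat{\mu}_i/\zeta^2}{\hat{\mu}_i/\zeta^2 + 1} \\
&= \{\log(1 + n_o/\zeta^2) + 1\}\,\frac{1}{\zeta^2}\sum_{i=1}^{n_o}k_{n_o}(x_i,x_i) \\
&\lesssim \{\log(1 + c_3 n_o)\}\{d^* + \log(c_2/\delta)\}\,d^*.
\end{align*}
This concludes the proof. □

Proof of Theorem 55.

First statement

From Jensen's inequality, we have
\[
\mathbb{E}_{x\sim d^{\pi_e}_P}\big[\sqrt{k_{n_o}(x,x)}\big] \le \sqrt{\mathbb{E}_{x\sim d^{\pi_e}_P}[k_{n_o}(x,x)]}.
\]
Thus, we focus on how to bound $\mathbb{E}_{x\sim d^{\pi_e}_P}[k_{n_o}(x,x)]$. Before that, we show the following statement. With probability $1-\delta$, we have, for all $x\in\mathcal{S}\times\mathcal{A}$:
\[
k_{n_o}(x,x) \le c_1\zeta^2\delta'^2_{n_o}\times\sup_{\{f\in\mathcal{H}_k :\ \zeta^2/n_o\|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)]\le 1\}} f^2(x), \tag{B.10}
\]
where $\delta'_{n_o} = \delta_{n_o} + \sqrt{\log(c_2/\delta)/n_o}$ and $\delta_{n_o}$ is the critical radius of $\{f\in\mathcal{H}_k : \|f\|_k\le 1\}$.

As the first step, we use Lemma 68 and Lemma 69:
\begin{align*}
k_{n_o}(x,x) &= \sup_{\{f\in\mathcal{H}_{k_{n_o}} :\ \|f\|^2_{k_{n_o}}\le 1\}} f^2(x) && (\text{from Lemma 68}) \\
&= \sup_{\{f\in\mathcal{H}_k :\ \|f\|_k^2 + \zeta^{-2}\sum_{i=1}^{n_o}f(x_i)^2\le 1\}} f^2(x). && (\text{from Lemma 69})
\end{align*}
Next, we invoke Lemma 66, that is, with probability $1-\delta$,
\[
\frac{1}{n_o}\sum_{i=1}^{n_o}f^2(x_i) \ge 0.5\,\mathbb{E}_{x\sim\rho}[f^2(x)] - 0.5\{\delta'_{n_o}\}^2 \qquad \forall f\in\mathcal{F},
\]
where $\mathcal{F} = \{f : f\in\mathcal{H}_k,\ \|f\|_k^2\le 1\}$. Here, $\delta'_{n_o} = \delta_{n_o} + \sqrt{\log(c_2/\delta)/n_o}$, where $\delta_{n_o}$ is the critical radius of the function class $\mathcal{F}$. Hereafter, we condition on the above event. Note that the uniform boundedness assumption on $\mathcal{F}$ required by Lemma 66 is satisfied, since
\[
|f(x)| = |\langle f(\cdot), k(\cdot,x)\rangle_k| \le \|f\|_k\,\|k(\cdot,x)\|_k \le 1
\]
by Assumption 52. Then, we have
\[
k_{n_o}(x,x) \le \sup_{\{f\in\mathcal{H}_k :\ \|f\|_k^2 + \zeta^{-2}(n_o/2)\,\mathbb{E}_{x\sim\rho}[f^2(x)]\le 1 + n_o\delta'^2_{n_o}/2\}} f^2(x).
\]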
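$k_{n_o}(x,x)$ is further upper-bounded by
\begin{align*}
k_{n_o}(x,x) &\le \sup_{\{f\in\mathcal{H}_k :\ 2\zeta^2/n_o\|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)]\le 2\zeta^2/n_o + \zeta^2\delta'^2_{n_o}\}} f^2(x) && (\text{multiply by } 2\zeta^2/n_o) \\
&\le (2\zeta^2/n_o + \zeta^2\delta'^2_{n_o})\times\sup_{\{f\in\mathcal{H}_k :\ \zeta^2/n_o\|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)]\le 1\}} f^2(x) \\
&\le c_1\zeta^2\delta'^2_{n_o}\times\sup_{\{f\in\mathcal{H}_k :\ \zeta^2/n_o\|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)]\le 1\}} f^2(x).
\end{align*}
This concludes (B.10).

Next, we show
\[
\mathbb{E}_{x\sim d^{\pi_e}_P}\Bigg[\sup_{\{f\in\mathcal{H}_k :\ \zeta^2/n_o\|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)]\le 1\}} f^2(x)\Bigg] \le 2d^*\times\sup_{\|x\|_2\le 1}\frac{x^\top\Sigma_{\pi_e}x}{x^\top\Sigma_\rho x}.
\]
For $f(\cdot) = a^\top\phi(\cdot)$ (recall that $\phi(\cdot)$ is the feature mapping defined by the eigenvalues $\mu_i$ and eigenfunctions $\psi_i$, s.t. $\phi = (\phi_1,\cdots,\phi_\infty)$), we have
\[
\|f\|_k^2 = a^\top a, \qquad \mathbb{E}_{x\sim\rho}[f^2(x)] = a^\top M a,
\]
where $M$ is a diagonal matrix in $\mathbb{R}^{\infty\times\infty}$ s.t. $M_{i,i} = \mu_i$. Thus,
\[
\mathbb{E}_{x\sim d^{\pi_e}_P}\Bigg[\sup_{\{f :\ \zeta^2/n_o\|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)]\le 1,\ f\in\mathcal{H}_k\}} f^2(x)\Bigg] = \mathbb{E}_{x\sim d^{\pi_e}_P}\Bigg[\sup_{\{a\in\mathbb{R}^\infty :\ a^\top(\zeta^2/n_o I + M)a\le 1\}}\{a^\top\phi(x)\}^2\Bigg].
\]
Then, letting $\Sigma_\rho$ and $\Sigma_{\pi_e}$ be $\mathbb{E}_{(s,a)\sim\rho}[\phi(s,a)\phi^\top(s,a)]$ and $\mathbb{E}_{(s,a)\sim d^{\pi_e}_P}[\phi(s,a)\phi^\top(s,a)]$,
\begin{align*}
\mathbb{E}_{x\sim d^{\pi_e}_P}\Bigg[\sup_{\{a\in\mathbb{R}^\infty :\ a^\top(\zeta^2/n_o I + M)a\le 1\}}\{a^\top\phi(x)\}^2\Bigg] &= \mathbb{E}_{x\sim d^{\pi_e}_P}\big[\phi(x)^\top\{\zeta^2/n_o I + M\}^{-1}\phi(x)\big] \\
&= \mathrm{tr}\big[\mathbb{E}_{x\sim d^{\pi_e}_P}[\phi(x)\phi(x)^\top]\{\zeta^2/n_o I + M\}^{-1}\big] \\
&\le \mathrm{tr}\big[\mathbb{E}_{x\sim\rho}[\phi(x)\phi(x)^\top]\{\zeta^2/n_o I + M\}^{-1}\big]\times\sup_{\|x\|_2\le 1}\frac{x^\top\Sigma_{\pi_e}x}{x^\top\Sigma_\rho x} \\
&= \sum_{i=1}^{\infty}\frac{\mu_i}{\zeta^2/n_o + \mu_i}\times\sup_{\|x\|_2\le 1}\frac{x^\top\Sigma_{\pi_e}x}{x^\top\Sigma_\rho x}.
\end{align*}
Then, by defining $C_{\pi_e} = \sup_{\|x\|_2\le 1}\frac{x^\top\Sigma_{\pi_e}x}{x^\top\Sigma_\rho x}$, we have
\[
\mathbb{E}_{x\sim d^{\pi_e}_P}\Bigg[\sup_{\{a\in\mathbb{R}^\infty :\ a^\top(\zeta^2/n_o I + M)a\le 1\}}\{a^\top\phi(x)\}^2\Bigg] \le \min_j\Big\{j + n_o/\zeta^2\sum_{i=j+1}^{\infty}\mu_i\Big\}\times C_{\pi_e} \le 2d^*\times C_{\pi_e}.
\]
By combining all things together ((B.10) and (B.11)), the statement is concluded, that is, with probability $1-\delta$:
\[
\mathbb{E}_{x\sim d^{\pi_e}_P}\big[\sqrt{k_{n_o}(x,x)}\big] \le \sqrt{\zeta^2\delta'^2_{n_o}\times\mathbb{E}_{x\sim d^{\pi_e}_P}\Bigg[\sup_{\{f\in\mathcal{H}_k :\ \zeta^2/n_o\|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)]\le 1\}} f^2(x)\Bigg]} \le \zeta\,\delta'_{n_o}\sqrt{C_{\pi_e}d^*},
\]
where $C_{\pi_e} = \sup_{\|x\|_2\le 1}\frac{x^\top\Sigma_{\pi_e}x}{x^\top\Sigma_\rho x}$.

Remark 61. As above, we can also prove
\[
\mathbb{E}_{x\sim\rho}\Bigg[\sup_{\{f\in\mathcal{H}_k :\ \zeta^2/n_o\|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)]\le 1\}} f^2(x)\Bigg] \le \sum_{i=1}^{\infty}\frac{\mu_i}{\zeta^2/n_o + \mu_i} \le 2d^*. \tag{B.11}
\]
This is used in the proof of Theorem 54.

Remark 62.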
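We can also use
\begin{align*}
\mathbb{E}_{x\sim d^{\pi_e}_P}\Bigg[\sup_{\{a\in\mathbb{R}^\infty :\ a^\top(\zeta^2/n_o I + M)a\le 1\}}\{a^\top\phi(x)\}^2\Bigg] &= \mathbb{E}_{x\sim d^{\pi_e}_P}\big[\phi(x)^\top\{\zeta^2/n_o I + M\}^{-1}\phi(x)\big] \\
&= \mathrm{tr}\big[\mathbb{E}_{x\sim d^{\pi_e}_P}[\phi(x)\phi(x)^\top]\{\zeta^2/n_o I + M\}^{-1}\big] \\
&= \sum_{i=1}^{\infty}\frac{\mathbb{E}_{x\sim d^{\pi_e}_P}[\phi_i(x)^2]}{\zeta^2/n_o + \mu_i} = \sum_{i=1}^{\infty}\frac{\mu_i}{\zeta^2/n_o + \mu_i}\times\frac{\mathbb{E}_{x\sim d^{\pi_e}_P}[\phi_i(x)^2]}{\mu_i} \\
&\le \sum_{j=1}^{\infty}\frac{\mu_j}{\zeta^2/n_o + \mu_j}\times\max_i\big(\mathbb{E}_{x\sim d^{\pi_e}_P}[\psi_i(x)^2]\big).
\end{align*}
Then, $C_{\pi_e}$ is replaced with $\max_i(\mathbb{E}_{x\sim d^{\pi_e}_P}[\psi_i(x)^2])$.

Second statement

We use Lemma 71 to calculate the critical radius of the RKHS ball. The critical inequality is
\[
\sqrt{1/n_o}\,\sqrt{\sum_{j=1}^{\infty}\min(y^2, \mu_j)} \le y^2.
\]
We show that $y = \sqrt{d^*/n_o}$ satisfies the above. This is proved by
\begin{align*}
\sqrt{1/n_o}\,\sqrt{\sum_{j=1}^{\infty}\min(y^2, \mu_j)} &\le \min_{1\le k\le n_o}\Big\{\sqrt{1/n_o}\,\sqrt{k y^2 + B(k+1)}\Big\} \\
&\le \sqrt{1/n_o}\,\sqrt{d^* y^2 + B(d^*+1)} && (d^*\le n_o) \\
&\le \sqrt{1/n_o}\,\sqrt{d^* y^2 + d^*/n_o} && (B(d^*+1)\le d^*/n_o) \\
&\le \sqrt{d^* y^2/n_o} \le y^2.
\end{align*}
□

B.4 Auxiliary Lemmas

Lemma 63 (Simulation Lemma). Consider any two functions $f : \mathcal{S}\times\mathcal{A}\mapsto[0,1]$ and $\hat{f} : \mathcal{S}\times\mathcal{A}\mapsto[0,1]$, any two transitions $P$ and $\hat{P}$, and any policy $\pi : \mathcal{S}\mapsto\Delta(\mathcal{A})$. We have:
\begin{align*}
V^\pi_{P,f} - V^\pi_{\hat{P},\hat{f}} &= \sum_{h=0}^{H}\mathbb{E}_{s,a\sim d^\pi_P}\Big[f(s,a) - \hat{f}(s,a) + \mathbb{E}_{s'\sim P(\cdot|s,a)}\big[V^\pi_{\hat{P},\hat{f};h}(s')\big] - \mathbb{E}_{s'\sim\hat{P}(\cdot|s,a)}\big[V^\pi_{\hat{P},\hat{f};h}(s')\big]\Big] \\
&\le \sum_{h=0}^{H}\mathbb{E}_{s,a\sim d^\pi_P}\Big[f(s,a) - \hat{f}(s,a) + \big\|V^\pi_{\hat{P},\hat{f};h}\big\|_\infty\big\|P(\cdot|s,a) - \hat{P}(\cdot|s,a)\big\|_1\Big],
\end{align*}
where $V^\pi_{P,f;h}$ denotes the value function at time step $h$ under $\pi$, $P$, $f$.

Such a simulation lemma is standard in the model-based RL literature, and the derivation can be found, for instance, in the proof of Lemma 10 of Sun et al. (2019b).

Lemma 64 ($\ell_1$ distance between two Gaussians). Consider two Gaussian distributions $P_1 := \mathcal{N}(\mu_1, \zeta^2 I)$ and $P_2 := \mathcal{N}(\mu_2, \zeta^2 I)$. We have:
\[
\|P_1 - P_2\|_1 \le \frac{1}{\zeta}\|\mu_1 - \mu_2\|_2.
\]
This lemma is proved by Pinsker's inequality and the closed form of the KL divergence between $P_1$ and $P_2$.

Lemma 65 (Concentration on the inverse of state-action visitation). We set $\lambda = \Omega(1)$. Then, with probability $1-\delta$,
\[
\frac{1}{N(s,a) + \lambda} \le \frac{c_1\log(|\mathcal{S}||\mathcal{A}|c_2/\delta)}{n_o\rho(s,a) + \lambda} \qquad \forall (s,a)\in\mathcal{S}\times\mathcal{A}.
\]
The extension of this lemma to linear models is stated in Equation (B.8).

Proof. We set $\xi = c_1\log(|\mathcal{S}||\mathcal{A}|/\delta) + 1$ ($c_1 > 4/3 + 3$).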
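First, we have
\[
\frac{1}{N(s,a) + \lambda} \le \frac{\xi}{N(s,a) + \xi\lambda},
\]
from $\xi\ge 1$. Here, by Bernstein's inequality, with probability $1-\delta$,
\[
N(s,a) \ge n_o\rho(s,a) - 2\sqrt{2 n_o\rho(s,a)(1-\rho(s,a))\log(|\mathcal{S}||\mathcal{A}|/\delta)} - 4\log(|\mathcal{S}||\mathcal{A}|/\delta)/3, \qquad \forall (s,a).
\]
Thus, for all $(s,a)\in\bar{V}$, we have
\begin{align*}
N(s,a) + \xi\lambda &\ge n_o\rho(s,a) - 2\sqrt{2 n_o\rho(s,a)(1-\rho(s,a))\log(|\mathcal{S}||\mathcal{A}|/\delta)} - 4\log(|\mathcal{S}||\mathcal{A}|/\delta)/3 + \xi\lambda \\
&\ge n_o\rho(s,a) - 2\sqrt{2 n_o\rho(s,a)(1-\rho(s,a))\log(|\mathcal{S}||\mathcal{A}|/\delta)} + (c_1 - 4/3)\log(|\mathcal{S}||\mathcal{A}|/\delta) + \lambda \\
&\ge n_o\rho(s,a) - 2\sqrt{2 n_o\rho(s,a)\log(|\mathcal{S}||\mathcal{A}|/\delta)} + (c_1 - 4/3)\log(|\mathcal{S}||\mathcal{A}|/\delta) + \lambda \\
&\ge 0.5\,n_o\rho(s,a) + \Big(\sqrt{0.5\,n_o\rho(s,a)} - \sqrt{4\log(|\mathcal{S}||\mathcal{A}|/\delta)}\Big)^2 + (c_1 - 4/3 - 4)\log(|\mathcal{S}||\mathcal{A}|/\delta) + \lambda \\
&\ge 0.5\,n_o\rho(s,a) + 0.5\lambda.
\end{align*}
This implies that, with probability $1-\delta$,
\[
\frac{1}{N(s,a) + \lambda} \le \frac{2\xi}{n_o\rho(s,a) + \lambda} \qquad \forall (s,a).
\]
Then, noting $c_1\log(|\mathcal{S}||\mathcal{A}|/\delta) + 1 \le c_1\log(|\mathcal{S}||\mathcal{A}|c_2/\delta)$ for some $c_2$, the proof is concluded. □

Lemma 66 (A uniform law with localization: Theorem 14.1 in (Wainwright, 2019)). Assume $\|\mathcal{F}\|_\infty\le b$. Denote the critical radius of a function class $\mathcal{F}$ by $\delta_n$. The critical radius $\delta_n$ is defined as a solution to
\[
\mathcal{R}_{n_o}(y;\mathcal{F}) \le y^2/b
\]
with respect to $y$. Then, with probability $1-\delta$,
\[
\frac{1}{n_o}\sum_{i=1}^{n_o}f(x_i)^2 \ge \frac{1}{2}\mathbb{E}_{x\sim\rho}[f^2(x)] - (\delta'_n)^2/2 \qquad \forall f\in\mathcal{F},
\]
where $\delta'_n = \delta_n + c_1\sqrt{\log(c_2/\delta)/n_o}$.

Lemma 67 (Critical radius of linear models). Assume $\|\phi(s,a)\|_2\le 1$ for any $(s,a)\in\mathcal{S}\times\mathcal{A}$. Then, the critical radius of the function class
\[
\mathcal{F} = \{(s,a)\mapsto a^\top\phi(s,a) :\ \|a\|_2^2\le\alpha,\ a^\top\phi\le\beta,\ a\in\mathbb{R}^d\}
\]
is upper-bounded by
\[
c\sqrt{\beta\,\mathrm{rank}(\Sigma_\rho)/n_o},
\]
where $c$ is a universal constant.

We follow the proof of (Wainwright, 2019, Chapter 14). Their argument depends on the assumption that $\Sigma_\rho$ is full rank. We need to change the proof so that the full-rank assumption is removed and $\mathrm{rank}[\Sigma_\rho]$ appears in the final bound instead of $d$. Note that the final bound does not include $\alpha$.

Proof. Unless otherwise noted, in this proof, $\mathbb{E}[\cdot]$ is taken with respect to $x_i = (s_i,a_i)\sim\rho(s,a)$ and $\epsilon_i\sim 2\,\mathrm{Ber}(0.5) - 1$. Note that $x_i$ and $\epsilon_i$ are independent.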
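Noting $\mathbb{E}_{(s,a)\sim\rho}[(a^\top\phi(s,a))^2] = a^\top\Sigma_\rho a$, the localized Rademacher complexity of $\mathcal{F}$, $\mathcal{R}_{n_o}(\xi;\mathcal{F})$, is
\[
\mathbb{E}\Bigg[\sup_{\{b\in\mathbb{R}^d :\ \|b\|_2^2\le\alpha,\ \|b\|_{\Sigma_\rho}\le\xi,\ b^\top\phi\le\beta\}}\Bigg|\frac{1}{n_o}\sum_{i=1}^{n_o}\epsilon_i\{b^\top\phi(s_i,a_i)\}\Bigg|\Bigg],
\]
where $\{\epsilon_i\}_{i=1}^{n_o}$ is a set of independent Rademacher variables. This is upper-bounded by
\[
\mathbb{E}\Bigg[\sup_{\{b\in\mathbb{R}^d :\ \|b\|_2^2\le\alpha,\ \|b\|_{\Sigma_\rho}\le\xi\}}\Bigg|\frac{1}{n_o}\epsilon^\top\Phi b\Bigg|\Bigg],
\]
where $\Phi$ is an $n_o\times d$ design matrix s.t. the $i$-th row is $\phi^\top(s_i,a_i)$ and $\epsilon = (\epsilon_1,\cdots,\epsilon_{n_o})^\top$. Here, we have $\mathbb{E}[\Phi^\top\Phi] = n_o\Sigma_\rho$. Let $UVU^\top$ be the SVD of $\Sigma_\rho$, where $U$ is a $d\times\mathrm{rank}[\Sigma_\rho]$ matrix and $V$ is a $\mathrm{rank}[\Sigma_\rho]\times\mathrm{rank}[\Sigma_\rho]$ diagonal matrix. Noting $b = UU^\top b + (I - UU^\top)b$, we have
\begin{align*}
&\mathbb{E}\Bigg[\sup_{\{b\in\mathbb{R}^d :\ \|b\|_2^2\le\alpha,\ \|b\|_{\Sigma_\rho}\le\xi\}}\Big|\frac{1}{n_o}\epsilon^\top\Phi\{UU^\top b + (I - UU^\top)b\}\Big|\Bigg] \\
&\quad\le \mathbb{E}\Bigg[\sup_{\{b\in\mathbb{R}^d :\ \|b\|_2^2\le\alpha\}}\Big|\frac{1}{n_o}\epsilon^\top\Phi(I - UU^\top)b\Big|\Bigg] + \mathbb{E}\Bigg[\sup_{\|b\|_{\Sigma_\rho}\le\xi}\Big|\frac{1}{n_o}\epsilon^\top\Phi UU^\top b\Big|\Bigg] \\
&\quad\le \mathbb{E}\Bigg[\sup_{\{b\in\mathbb{R}^d :\ \|b\|_2^2\le\alpha\}}\Big|\frac{1}{n_o}\epsilon^\top\Phi(I - UU^\top)b\Big|\Bigg] + \mathbb{E}\Bigg[\sup_{\|c\|_V\le\xi}\Big|\frac{1}{n_o}\epsilon^\top\Phi U c\Big|\Bigg] && (c = U^\top b) \\
&\quad\le \mathbb{E}\Big[\frac{\alpha}{n_o}\big\|\epsilon^\top\Phi(I - UU^\top)\big\|_2\Big] + \frac{\xi}{n_o}\mathbb{E}\Big[\big\|\epsilon^\top\Phi U\big\|_{V^{-1}}\Big] && (\text{CS inequality}) \\
&\quad\le \frac{\alpha}{n_o}\sqrt{\mathbb{E}\Big[\big\|\epsilon^\top\Phi(I - UU^\top)\big\|_2^2\Big]} + \frac{\xi}{n_o}\sqrt{\mathbb{E}\Big[\big\|\epsilon^\top\Phi U\big\|^2_{V^{-1}}\Big]}. && (\text{Jensen's inequality})
\end{align*}
We analyze the second term and the first term in turn. Regarding the second term, we have
\[
\mathbb{E}_\epsilon\big[\|\epsilon^\top\Phi U\|^2_{V^{-1}}\big] = \mathbb{E}_\epsilon\big[\epsilon^\top\Phi U V^{-1}U^\top\Phi^\top\epsilon\big] = \mathrm{tr}(\Phi U V^{-1}U^\top\Phi^\top),
\]
where $\mathbb{E}_\epsilon[\cdot]$ is the expectation with respect to $\epsilon$ only. Then, by the law of total expectation,
\begin{align*}
\mathbb{E}\big[\|\epsilon^\top\Phi U\|^2_{V^{-1}}\big] &= \mathbb{E}[\mathrm{tr}(\Phi U V^{-1}U^\top\Phi^\top)] = \mathbb{E}[\mathrm{tr}(\Phi^\top\Phi U V^{-1}U^\top)] = \mathrm{tr}(n_o\Sigma_\rho U V^{-1}U^\top) \\
&= n_o\,\mathrm{tr}(UVU^\top U V^{-1}U^\top) = n_o\,\mathrm{tr}(UU^\top) = n_o\,\mathrm{tr}(U^\top U) = n_o\,\mathrm{rank}(\Sigma_\rho).
\end{align*}
Similarly,
\[
\mathbb{E}_\epsilon\Big[\big\|\epsilon^\top\Phi(I - UU^\top)\big\|_2^2\Big] = \mathrm{tr}(\Phi(I - UU^\top)(I - UU^\top)\Phi^\top) = \mathrm{tr}(\Phi^\top\Phi(I - UU^\top)).
\]
Then, by the law of total expectation,
\[
\mathbb{E}\Big[\big\|\epsilon^\top\Phi(I - UU^\top)\big\|_2^2\Big] = \mathbb{E}[\mathrm{tr}(\Phi^\top\Phi(I - UU^\top))] = n_o\,\mathrm{tr}(\Sigma_\rho(I - UU^\top)) = n_o\,\mathrm{tr}(UVU^\top(I - UU^\top)) = 0.
\]
Combining all things together,
\[
\mathcal{R}_n(\xi;\mathcal{F}) \le \xi\sqrt{\mathrm{rank}[\Sigma_\rho]/n_o}.
\]
Then, the critical inequality becomes
\[
y\sqrt{\mathrm{rank}(\Sigma_\rho)/n_o} \le y^2/\beta.
\]
Thus, the critical radius of $\mathcal{F}$ is $\sqrt{\beta\,\mathrm{rank}(\Sigma_\rho)/n_o}$. □

Lemma 68 (Variational representation of kernels). We denote the RKHS associated with a kernel $k(\cdot,\cdot)$ by $\mathcal{H}_k$. Then,
\[
k(x,x) = \sup_{\{f :\ \|f\|_k\le 1,\ f\in\mathcal{H}_k\}} f^2(x).
\]

Proof.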
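We have
\begin{align*}
\sup_{\{f :\ \|f\|_k\le 1,\ f\in\mathcal{H}_k\}} f^2(x) &= \sup_{\{f :\ \|f\|_k\le 1,\ f\in\mathcal{H}_k\}}\langle f, k(x,\cdot)\rangle_k^2 \\
&\le \sup_{\{f :\ \|f\|_k\le 1,\ f\in\mathcal{H}_k\}}\|f\|_k^2\,k(x,x) && (\text{CS inequality}) \\
&= k(x,x).
\end{align*}
Besides, equality is attained at $f(\cdot) = k(x,\cdot)/\sqrt{k(x,x)}$, noting
\[
f^2(x) = k^2(x,x)/k(x,x) = k(x,x), \qquad \|f(\cdot)\|_k = \|k(x,\cdot)\|_k/\sqrt{k(x,x)} = 1.
\]
Thus,
\[
k(x,x) = \sup_{\{f :\ \|f\|_k\le 1,\ f\in\mathcal{H}_k\}} f^2(x).
\]
□

Lemma 69 (Relation between $\mathcal{H}_{k_{n_o}}$ and $\mathcal{H}_k$). We denote the RKHS associated with a kernel $k(\cdot,\cdot)$ by $\mathcal{H}_k$ and the RKHS associated with the kernel $k_{n_o}(\cdot,\cdot)$ by $\mathcal{H}_{k_{n_o}}$. Then, we have $\mathcal{H}_k = \mathcal{H}_{k_{n_o}}$. Besides, for $f\in\mathcal{H}_k$, we have
\[
\|f\|^2_{k_{n_o}} = \|f\|_k^2 + \zeta^{-2}\sum_{i=1}^{n_o}f(x_i)^2.
\]
This is stated in (Srinivas et al., 2010, Appendix B) without proof. For completeness, we provide the proof.

Proof. We use Mercer's theorem (Wainwright, 2019, Theorem 12.20). Then, any element in the RKHS associated with the kernel $k(x,x)$ is represented by
\[
f(x) = \sum_{i=1}^{\infty}f_i\psi_i(x),
\]
where $\{\psi_i\}_{i=1}^{\infty}$ is an orthonormal basis for $L^2(\rho)$: $\mathbb{E}_{x\sim\rho}[\psi_i(x)\psi_j(x)] = I(i=j)$. Here, we have
\[
k(x,x) = \psi^\top(x)\Lambda\psi(x) = \phi^\top(x)\phi(x), \qquad \|f\|_k^2 = \tilde{f}^\top\Lambda^{-1}\tilde{f},
\]
where $\phi_i(x) = \sqrt{\mu_i}\psi_i(x)$ and $\tilde{f} = \{f_i\}_{i=1}^{\infty}\in\mathbb{R}^\infty$. Then, letting $\Phi$ be an $n_o\times\infty$ matrix s.t. the $i$-th row is $\phi^\top(x_i)$,
\begin{align*}
k_{n_o}(x,x) &= \phi^\top(x)\phi(x) - \phi^\top(x)\Phi^\top(\Phi\Phi^\top + \zeta^2 I)^{-1}\Phi\phi(x) \\
&= \phi^\top(x)\{I - \Phi^\top(\Phi\Phi^\top + \zeta^2 I)^{-1}\Phi\}\phi(x) \\
&= \phi^\top(x)\{I + \Phi^\top\Phi/\zeta^2\}^{-1}\phi(x) && (\text{Woodbury matrix identity}) \\
&= \phi^\top(x)\Big(I + \sum_{i=1}^{n_o}\phi(x_i)\phi(x_i)^\top/\zeta^2\Big)^{-1}\phi(x).
\end{align*}
Here, let $UVU^\top$ be the eigenvalue decomposition of $\{\Lambda^{-1} + \sum_{i=1}^{n_o}\psi(x_i)\psi(x_i)^\top/\zeta^2\}^{-1}$. Then,
\begin{align*}
k_{n_o}(x,x) &= \psi^\top(x)\Big(\Lambda^{-1} + \sum_{i=1}^{n_o}\psi(x_i)\psi(x_i)^\top/\zeta^2\Big)^{-1}\psi(x) \\
&= \psi^\top(x)UVU^\top\psi(x) = \psi'^\top(x)V\psi'(x). && (U^\top\psi = \psi')
\end{align*}
Then, any element $f(\cdot)$ in the RKHS associated with the kernel $k_{n_o}(x,x)$ is represented as
\[
f(\cdot) = \tilde{g}^\top\psi'(\cdot), \qquad \tilde{g}\in\mathbb{R}^\infty,
\]
and the associated norm is $\|f\|^2_{k_{n_o}} = \tilde{g}^\top V^{-1}\tilde{g}$, since $\psi'(\cdot)$ is still an orthonormal basis for $L^2(\rho)$, i.e., $\mathbb{E}_{x\sim\rho}[\psi'_i(x)\psi'_j(x)] = I(i=j)$. This immediately implies $\mathcal{H}_k = \mathcal{H}_{k_{n_o}}$.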
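Finally, we check the relation of the norms:
\begin{align*}
\|f\|^2_{k_{n_o}} &= \Big\|\sum_{i=1}^{\infty}f_i\psi_i\Big\|^2_{k_{n_o}} = \big\|\tilde{f}^\top\psi\big\|^2_{k_{n_o}} && (\tilde{f} = \{f_1, f_2, \cdots\}^\top) \\
&= \big\|\{U^\top\tilde{f}\}^\top U^\top\psi\big\|^2_{k_{n_o}} = \big\|\{U^\top\tilde{f}\}^\top\psi'\big\|^2_{k_{n_o}} \\
&= \{U^\top\tilde{f}\}^\top V^{-1}U^\top\tilde{f} = \tilde{f}^\top\Big(\Lambda^{-1} + \sum_{i=1}^{n_o}\psi(x_i)\psi(x_i)^\top/\zeta^2\Big)\tilde{f} \\
&= \|f\|_k^2 + \frac{1}{\zeta^2}\sum_{i=1}^{n_o}\{\tilde{f}^\top\psi(x_i)\}^2 = \|f\|_k^2 + \zeta^{-2}\sum_{i=1}^{n_o}f^2(x_i).
\end{align*}
□

Lemma 70. Let $\{\hat{\mu}_i\}_{i=1}^{n_o}$ be the eigenvalues of $\mathbf{K}_{n_o}$. Then,
\[
\sum_{i=1}^{n_o}\frac{\hat{\mu}_i/\zeta^2}{\hat{\mu}_i/\zeta^2 + 1} = \frac{1}{\zeta^2}\sum_{i=1}^{n_o}k_{n_o}(x_i,x_i).
\]

Proof.
\begin{align*}
\sum_{i=1}^{n_o}k_{n_o}(x_i,x_i) &= \sum_{i=1}^{n_o}\Big[k(x_i,x_i) - \bar{k}_{n_o}^\top(x_i)\{\mathbf{K}_{n_o} + \zeta^2 I\}^{-1}\bar{k}_{n_o}(x_i)\Big] \\
&= \mathrm{tr}(\mathbf{K}_{n_o}) - \mathrm{tr}\Bigg[\sum_{i=1}^{n_o}\bar{k}_{n_o}(x_i)\bar{k}_{n_o}^\top(x_i)\{\mathbf{K}_{n_o} + \zeta^2 I\}^{-1}\Bigg] \\
&= \mathrm{tr}\big(\mathbf{K}_{n_o} - \mathbf{K}_{n_o}^2\{\mathbf{K}_{n_o} + \zeta^2 I\}^{-1}\big) \\
&= \mathrm{tr}\big(\{\mathbf{K}_{n_o}^2 + \zeta^2\mathbf{K}_{n_o} - \mathbf{K}_{n_o}^2\}\{\mathbf{K}_{n_o} + \zeta^2 I\}^{-1}\big) \\
&= \mathrm{tr}\big(\zeta^2\mathbf{K}_{n_o}\{\mathbf{K}_{n_o} + \zeta^2 I\}^{-1}\big) = \sum_{i=1}^{n_o}\frac{\hat{\mu}_i}{\hat{\mu}_i/\zeta^2 + 1}.
\end{align*}
□

Lemma 71 (Calculation of localized Rademacher complexity of RKHS balls: Corollary 14.5 in (Wainwright, 2019)). Let $\mathcal{F} = \{f\in\mathcal{H}_k : \|f\|_k\le 1\}$ be the unit ball of an RKHS with eigenvalues $\{\mu_j\}_{j=1}^{\infty}$. Then, the localized population Rademacher complexity is upper-bounded by
\[
\mathcal{R}_n(\delta;\mathcal{F}) \le \sqrt{\frac{2}{n}}\sqrt{\sum_{j=1}^{\infty}\min(\mu_j, \delta^2)}.
\]

Lemma 72 (Upper bound on the expectation of information gains: finite-dimensional models).
\[
\mathbb{E}[\bar{I}_{n_o}] \le \mathrm{rank}(\Sigma_\rho)\{\log(1 + n_o/\lambda) + 1\}.
\]

Proof.
\begin{align*}
\mathbb{E}[\bar{I}_{n_o}] = \mathbb{E}[\log(\det(\Sigma_{n_o}/\lambda))] &\le \log\det(\mathbb{E}[\Sigma_{n_o}/\lambda]) = \log\det(I + n_o/\lambda\,\Sigma_\rho) && (\text{Jensen's inequality}) \\
&\le \mathrm{rank}(\Sigma_\rho)\{\log(1 + n_o/\lambda) + 1\}.
\end{align*}
The final line is proved as in the proof of Theorem 50. □

Lemma 73 (Upper bound on the expectation of information gains: RKHS).
\[
\mathbb{E}[I_{n_o}] \le 2d^*\{\log(1 + n_o/\zeta^2) + 1\}.
\]

Proof.
\begin{align*}
\mathbb{E}[I_{n_o}] = \mathbb{E}[\log(\det(I + \zeta^{-2}\mathbf{K}_{n_o}))] &\le \sum_{s=1}^{\infty}\log(1 + \zeta^{-2}\mu_s n_o) && (\text{refer to (Seeger et al., 2008, Lemma 1)}) \\
&\le \{\log(1 + n_o/\zeta^2) + 1\}\,2d^*.
\end{align*}
From the second line to the third line, we follow the proof of Theorem 54. □

B.5 Implementation Details

Here we detail all environment details and hyperparameters used for the experiments in the main text.

B.5.1 Environment Details

All environments have a maximum horizon length of 500 timesteps.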
We achieve this by reducing the data collection frequency of the base 1000 horizon environments. We also remove all contact information from the observation and the reward. Finally, to be able to compute the ground truth reward from the state, we add the velocity of the center of mass into the state.

Table B.1: Observation and action space dimensions for each of the environments

Environment    Observation Space Dimension    Action Space Dimension
Hopper         12                             3
Walker2d       18                             6
HalfCheetah    18                             6
Ant            29                             8
Humanoid       47                             17

Table B.2: Ground truth environment reward function used to train the expert and behavior policies as well as to evaluate the performance in the learning curves. At time $t$, $\dot{x}_t$ is the velocity of the center of mass in the x-axis, $a_t$ is the action vector, and $z_t$ is the position of the center of mass in the z-axis.

Environment    Ground Truth Reward Function
Hopper         $\dot{x}_t - 0.1\|a_t\|_2^2 - 3.0\times(z_t - 1.3)^2$
Walker2d       $\dot{x}_t - 0.1\|a_t\|_2^2 - 3.0\times(z_t - 0.57)^2$
HalfCheetah    $\dot{x}_t - 0.1\|a_t\|_2^2$
Ant            $\dot{x}_t - 0.1\|a_t\|_2^2 - 3.0\times(z_t - 1.3)^2$
Humanoid       $1.25\times\dot{x}_t - 0.1\|a_t\|_2^2 + 5\times\mathrm{bool}(1.0\le z_t\le 2.0)$

B.5.2 Dynamics Ensemble Architecture and Model Learning

For all of our experiments we use an ensemble of four dynamics models, with each model parameterized by a feed-forward neural network with two hidden layers containing 1024 units. The learned model does not predict the next state, but instead predicts the normalized difference between the next state and the current state, $s_{t+1} - s_t$. The activation function used at each layer is ReLU. We train all of our ensembles using Adam with learning rate $5\times 10^{-5}$ and otherwise default hyperparameters. We train each dynamics model for 300 epochs on just the offline dataset for all of our experiments. Please see Table B.3 for all values.
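The normalized-difference targets described above can be sketched as follows. This is a minimal illustration and not the thesis implementation: the per-dimension mean and standard deviation normalization, the `eps` stabilizer, and the function names `make_delta_targets` and `predict_next_state` are assumptions, since the text only states that the model predicts the normalized difference $s_{t+1} - s_t$.

```python
import numpy as np

def make_delta_targets(states, next_states, eps=1e-8):
    """Regression targets for a dynamics model: normalized state differences.

    The normalization statistics (per-dimension mean and std of the deltas)
    are an assumed choice for this sketch.
    """
    deltas = next_states - states
    mean = deltas.mean(axis=0)
    std = deltas.std(axis=0) + eps       # eps avoids division by zero
    return (deltas - mean) / std, mean, std

def predict_next_state(states, normalized_delta, mean, std):
    """Invert the normalization to turn a model output back into a next state."""
    return states + normalized_delta * std + mean

rng = np.random.default_rng(0)
states = rng.normal(size=(64, 12))       # e.g. a batch of 12-dimensional Hopper states
next_states = states + 0.1 * rng.normal(size=(64, 12))

targets, mean, std = make_delta_targets(states, next_states)
recovered = predict_next_state(states, targets, mean, std)
```

Feeding the exact targets back through `predict_next_state` recovers `next_states`, which is a quick sanity check that the normalization and its inverse are consistent before training any network on the targets.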
Table B.3: All hyperparameters used for dynamics model learning.

Hyperparameter    Value
Hidden Layers     (1024, 1024)
Activation        ReLU
Optimizer         Adam
Learning Rate     $5 \times 10^{-5}$
Batch Size        256
Epochs            300

B.5.3 Policy Architecture and TRPO Details

We use the open source NPG/TRPO implementation, MJRL (Rajeswaran et al., 2018). The policy network and the value network are feedforward neural networks with two hidden layers containing 32 and 128 hidden units respectively. Both networks use a tanh activation function, with the policy network outputting a Gaussian distribution $\mathcal{N}(\mu(s), \sigma^2)$ where $\sigma$ is a trainable parameter. We use the Generalized Advantage Estimator (GAE) to estimate the advantages. Please see Table B.4 for all values.

Table B.4: TRPO/NPG hyperparameter values used in experiments.

Hyperparameter                        Value
Policy Hidden Layers                  (32, 32)
Critic Hidden Layers                  (128, 128)
Batch Size                            40000
Max KL Divergence                     0.01
Discount $\gamma$                     0.995
CG Iterations                         25
CG Damping                            $1 \times 10^{-5}$
GAE $\lambda$                         0.97
Critic Update Epochs                  2
Critic Optimizer                      Adam
Critic Learning Rate                  $1 \times 10^{-4}$
Critic L2 Regularization              $1 \times 10^{-4}$
Policy Init Log Std.                  -0.25
Policy Min Log Std.                   -2.0
BC Regularization $\lambda_{BC}$      0.1

B.5.4 Discriminator Update and Cost Function Details

We parameterize our discriminator as a linear function $f(s, a) = w^\top \phi(s, a)$, where $\phi(s, a)$ are Random Fourier Features (Rahimi and Recht, 2008b) and $w$ is the vector of parameters for the discriminator. Recall our objective,
\[
\min_{\pi \in \Pi} \max_{f \in \mathcal{F}} \Big[ \mathbb{E}_{(s,a) \sim d^{\pi}_{\hat{P}}} \big( f(s, a) + b(s, a) \big) - \mathbb{E}_{(s,a) \sim \mathcal{D}_e} [f(s, a)] \Big] + \lambda_{BC} \cdot \mathbb{E}_{(s,a) \sim \mathcal{D}_e} [\ell(a, s, \pi)].
\]
Now, given a policy $\pi$, we can compute a closed-form update for the discriminator parameters $w$ as follows:
\[
\begin{aligned}
&\max_{w : \|w\|_2^2 \le \eta} L(w; \pi, \hat{P}, b, \mathcal{D}_e) := \mathbb{E}_{(s,a) \sim d^{\pi}_{\hat{P}}} \big( f(s, a) + b(s, a) \big) - \mathbb{E}_{(s,a) \sim \mathcal{D}_e} [f(s, a)] \\
\equiv\; &\max_{w} L_\eta(w; \pi, \hat{P}, b, \mathcal{D}_e) = \mathbb{E}_{(s,a) \sim d^{\pi}_{\hat{P}}} \big( f(s, a) + b(s, a) \big) - \mathbb{E}_{(s,a) \sim \mathcal{D}_e} [f(s, a)] - \frac{1}{2} \cdot (\|w\|_2^2 - \eta) \\
\Rightarrow\; &\partial_w L_\eta(w; \pi, \hat{P}, b, \mathcal{D}_e) = \mathbb{E}_{(s,a) \sim d^{\pi}_{\hat{P}}} [\phi(s, a)] - \mathbb{E}_{(s,a) \sim \mathcal{D}_e} [\phi(s, a)] - w
\end{aligned}
\]
where $\partial_w L_\eta(w; \pi, \hat{P}, b, \mathcal{D}_e)$ denotes the partial derivative of $L_\eta(\cdot)$ with respect to $w$. Setting the above expression to 0 and solving for $w$ gives the closed-form solution $w = \mathbb{E}_{(s,a) \sim d^{\pi}_{\hat{P}}} [\phi(s, a)] - \mathbb{E}_{(s,a) \sim \mathcal{D}_e} [\phi(s, a)]$. Note that even with the BC regularization constraint added to the objective, the solution still holds.

Now, for a given updated $w_t$, we have our cost function
\[
c(s, a) = w_t^\top \phi(s, a) + b(s, a)
\]
where our penalty, $b(s, a)$, is the maximum discrepancy of our model ensemble predictions. To balance our penalty term with our cost term, we introduce a parameter $\lambda_{penalty}$ to get the cost
\[
c(s, a) = (1 - \lambda_{penalty}) \cdot w_t^\top \phi(s, a) + \lambda_{penalty} \cdot b(s, a).
\]
In our experiments, $\lambda_{penalty}$ was the only parameter we varied across environments.

Table B.5: $\lambda_{penalty}$ values used for each environment.

Environment    $\lambda_{penalty}$
Hopper         $2.5 \times 10^{-4}$
Walker2d       $1.0 \times 10^{-7}$
HalfCheetah    $1.0 \times 10^{-4}$
Ant            $1.0 \times 10^{-4}$
Humanoid       $5.0 \times 10^{-4}$

B.6 Additional Experiments

B.6.1 MILO with Expert Trajectories

Recall that in our main experiments, we create an extremely small expert dataset by randomly sampling $(s, a)$ pairs from a larger expert dataset consisting of state-action pairs from many expert trajectories. We did this to create an expert dataset on which BC almost completely fails. One may wonder what MILO would do if we instead feed it a single complete expert trajectory. We conduct such experiments in this section. Figure B.1 shows the performance of MILO with one expert trajectory using the same hyperparameters as before. All plots are shown averaged across five seeds.
Note that MILO still performs well with one expert trajectory, matching or nearly matching the expert performance across all 5 continuous control tasks.

Figure B.1: Performance of MILO with one expert trajectory. Note that MILO performs just as well with trajectory inputs as with state-action pair sample inputs.

B.6.2 Performance of MILO on Ant without Pessimism

Figure B.2: Performance of MILO with and without pessimism for Ant-v2.

Figure B.2 shows MILO with and without pessimism for a given set of hyperparameters. Note that, unlike the learning curves shown in Figure 7.7, MILO is still able to stably reach expert-level performance.

APPENDIX C
MISSING PROOFS AND DETAILS IN CHAPTER 4

C.1 Detailed Algorithm Pseudocode

Algorithm 13 presents more detailed pseudocode for AILBoost. The main detail here is the two-step process of first learning our discriminator using a weighted replay buffer of weak learner samples and then learning a weak learner for a certain number of RL steps.

Algorithm 13 AILBOOST (Adversarial Imitation Learning via Boosting)
Require: number of iterations $T$, expert data $\mathcal{D}_e$, weighting parameter $\alpha$
1: Initialize $\pi_1$, weight $\alpha_1 = 1$, replay buffer $B = \emptyset$
2: for $t = 1, \ldots, T$ do
3:   Construct the $t$-th dataset $\mathcal{D}_t = \{(s_j, a_j)\}_{j=1}^{N}$ where $s_j, a_j \sim d^{\pi_t}$ $\forall j$.
4:   Set $B \leftarrow B \cup \mathcal{D}_t$
5:   for # of Discriminator Updates do
6:     Sample batch from $B$ with respective sample weights $\alpha_i$