TOWARDS SPECIALIZED REINFORCEMENT
LEARNING FROM DIVERSE DATA

A Dissertation

Presented to the Faculty of the Graduate School

of Cornell University

in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

by

Jonathan Daniel Chang

August 2024


© 2024 Jonathan Daniel Chang

ALL RIGHTS RESERVED


TOWARDS SPECIALIZED REINFORCEMENT LEARNING FROM DIVERSE DATA

Jonathan Daniel Chang, Ph.D.

Cornell University 2024

Reinforcement learning (RL) fundamentally focuses on teaching agents how to make

decisions by interacting with an environment. Unlike supervised learning approaches

that learn from a fixed dataset, reinforcement learning agents learn by doing, receiving

feedback as rewards or penalties based on their chosen actions. However, the generality of

RL makes efficient learning incredibly difficult, which complicates its adoption for novel tasks.

Even with the explosive success of ChatGPT (OpenAI, 2023), which applied a deep

RL algorithm, Proximal Policy Optimization (PPO) (Schulman et al., 2017b), to large

language models (LLMs), there has been more interest in either reducing the complexity

of RL algorithms or eliminating the need for online interaction. Even beyond LLMs,

despite RL’s superhuman abilities in games such as DOTA (Berner et al., 2019) or

Go (Silver et al., 2016b), we have yet to see widespread adoption of RL in real-world

applications compared to other, arguably more specialized, learning paradigms such as

supervised learning. We notice that a critical challenge for the broad adoption of RL is

efficiently utilizing diverse data sources to create a specialized algorithm. That is, for

many of the successes mentioned above in RL, the algorithms used to learn the agents

were general-purpose algorithms that could also be used in other applications.

In this thesis, we attempt to introduce RL algorithms that progress toward specialized

algorithms for various settings. We first discuss doing efficient inverse reinforcement

learning (IRL) from different types of data sources. We consider three settings: learning

from observations alone, where the demonstration data does not contain action infor-

mation; offline learning, where instead of interactive access to the environment we


only get a large dataset of interactions; and off-policy learning where the interactive

feedback that we learn from can be from different learning agents. In all three settings,

we introduce a principled algorithm that performs efficient learning in a wide range of

control tasks. Next, we discuss learning specialized algorithms in the space of generative

models. Foundation models increasingly live up to their namesake, becoming capable

base models for improved downstream performance on various tasks across multiple

application domains. In this thesis’s second part, we investigate RL with these models for

learning decision-making agents from diverse data sources. We present three different

learning settings with generative models: text generation with an interactive black box

model, text generation with high-quality human labels, and text instruction-guided image

generation. Overall, each setting has a specific property, whether it is deterministic

transition dynamics or a short horizon, that allows for the design of more specialized

algorithms that efficiently exploit these properties and improve beyond the general RL

baseline.


BIOGRAPHICAL SKETCH

Jonathan Chang was born in Germany. He received his Bachelor's and Master's in

computer science and applied mathematics from Brown University. He then attended

Cornell University for his Ph.D. At Cornell, he was a teaching assistant for Introduction

to Artificial Intelligence and Topics in Reinforcement Learning. Jonathan also spent

time as a research intern at Meta in New York City and Microsoft Research in Montreal.

He received the LinkedIn Ph.D. Award for Fall 2023 and Spring 2024.



To my parents Michael & Elizabeth, my friend Luke, and my other half Elise



ACKNOWLEDGEMENTS

First, to my advisor, Wen Sun. From the first Zoom call, Wen has been able to balance

encouragement and structured advice. He has always helped me make progress on our

projects and had an almost uncanny intuition for why I was stuck on a problem. Over the

countless meetings I’ve had with him, I could count on walking out of his office with

a clear goal and an inspired excitement about the problem we were tackling. Through

his guidance, he shaped how I approach research, how I ask questions, and how I move

forward in the face of profound challenge and uncertainty. Richard Feynman wrote, "The

worthwhile problems are the ones you can really solve or help solve, the ones you can

really contribute something to." During my second year, Wen gave me similar advice when

I felt discouraged about my abilities as a researcher, and that advice gave me the strength

to continue research. Thank you for your support, your trust, your patience in allowing

me to make mistakes, and your time being an amazing mentor and advisor. I will always

remember our time doing research, and I look forward to continuing our collaborations!

To my thesis committee members: Hadas Kress-Gazit, Sanjiban Choudhury, and

Yoav Artzi. Thank you for taking the time to speak with me over the years about my

research and provide valuable feedback about how I’ve been doing. Especially when my

research focus pivoted away from robotics and control to generative models, everyone on

the committee encouraged this change and supported me nonetheless.

Next, to my previous advisors, Stefanie Tellex and George Konidaris. Thank you for

taking a chance on me. Only armed with a bullish excitement for reinforcement learning

and deep learning, I joined their lab to pursue research for the first time. I can still

remember Stefanie walking down College Hill with me to give me advice about graduate

school applications and George giving me essential tips about writing, presenting, and

conducting research in his late night e-mails. Thank you! I would not be here otherwise.

To Rahul Kidambi and Kianté Brantley. My graduate school experience can be



divided into two chapters: first being mentored by Rahul and next being mentored

by Kianté. Both Rahul and Kianté taught me how to be a team player, how to be a

compassionate collaborator, and how to be an impactful mentor to a more junior researcher. Both

led by example and gave me the invaluable gift of experiencing great mentorship. Thank

you both for all the support!

To my other mentors Mikael Henaff, Dipendra Misra, Brandon Amos, Eric Yuan, and

Marc-Alexandre Cote. Thank you all for your time, patience, internship opportunities,

and guidance over the years. I would love to continue collaborating in the future!

To Rebecca "Becky" Stewart, thank you for all your help navigating my time at

Cornell and for your patience answering all my questions first thing in the morning! Many

of my logistical mishaps were corrected by Becky, and I wouldn’t be here without her

support.

To my other collaborators and friends: Dhruv Sreenivas, Masatoshi Uehara, Kaiwen

Wang, Wendy Huang, Owen Oertell, Rajkumar Ramamurthy, Ge Gao, Nico Espinosa

Dice, Yuda Song, Yiyi Zhang, Wenhao Zhang, Runzhe Wu, Jerry Chee, Aaron Gokaslan,

Aaron Tucker, Katie Luo, Rohan Banerjee, thank you all!

To my parents, Michael and Elizabeth Chang, thank you for your love and support.

I am grateful for all you have done and am fortunate to be blessed with a supportive

network of unwavering encouragement. Thank you for giving me the opportunity to

pursue my passions and providing the space for me to grow into the person I am today. I

couldn’t have done it without you both.

To my partner, Elise Burdette, I am forever grateful for your love and support through

the last five years. Thank you for being there with me to witness my entire journey at

Cornell, for listening to countless rants about reinforcement learning, and for filling our days

with both epic adventures and happiness. Thank you for always believing in me, and I

am nothing but excited for our future together.



Last but not least, to my dog, Choco, and my cat, George. Thank you for reminding

me that there is more to life than research and insisting that we enjoy the good weather. I

can’t imagine my Ph.D. without you two, even if George likes to delete blocks of code by

sitting on my keyboard. Thanks, you two.



CONTENTS

Biographical Sketch
Dedication
Acknowledgements
Contents

1 Introduction
1.1 Main Contributions
1.1.1 Imitation Learning from Diverse Data Sources
1.1.2 Reinforcement Learning and Imitation Learning with Generative Models
1.2 Background
1.2.1 Markov Decision Process (MDP)
1.2.2 Deep Policy Gradient Algorithms
1.2.3 Inverse Reinforcement Learning (IRL)
1.3 Organization
1.4 Bibliographical Remarks

I Imitation Learning

2 Model-Based Imitation Learning From Observation Alone
2.1 Introduction
2.2 Related Works
2.3 Setting
2.3.1 Function Approximation Setup
2.4 Algorithm
2.4.1 Components of MobILE
2.4.2 Exploration And Imitation Tradeoff
2.5 Analysis
2.5.1 Regret Bound
2.5.2 Exploration in ILFO and the Exponential Gap between IL and ILFO
2.6 Practical Instantiation of MobILE
2.7 Experiments
2.7.1 Benchmarking MobILE on MuJoCo suite
2.7.2 Importance of the optimistic MDP construction
2.7.3 Varying Number of Expert Samples
2.8 Conclusion

3 Model-Based Offline Imitation Learning
3.1 Introduction
3.2 Related work
3.3 Setting
3.4 Algorithm
3.4.1 Specialization to offline RL
3.5 Analysis
3.5.1 Analysis: Discrete MDPs
3.5.2 Analysis: KNRs and GPs for Continuous MDPs
3.6 Practical Implementation
3.7 Experiments
3.7.1 Evaluation on MuJoCo Continuous Control Tasks
3.7.2 Ablation
3.8 Conclusion

4 Model-Free Off-Policy Imitation Learning
4.1 Introduction
4.2 Related works
4.3 Preliminaries
4.3.1 Adversarial Imitation Learning (AIL)
4.3.2 Discriminator Actor Critic (DAC)
4.3.3 ValueDICE
4.4 Algorithm
4.4.1 AILBoost: Adversarial Imitation Learning via Boosting
4.5 Experiments
4.5.1 Controller State-based Experiments
4.5.2 Image-based Experiments
4.5.3 Sensitivity to gradient-based optimization for weak learners and discriminators
4.6 Conclusion

II IL and RL for Generative Models

5 Learning to Generate Better Than Your LLM
5.1 Introduction
5.2 Related Work
5.3 Preliminaries
5.4 Reinforcement Learning from Guided Feedback
5.5 Theoretical Justification
5.6 Experiments
5.6.1 Experimental Results
5.7 Conclusion and Future Work

6 Provably Efficient RL with Preference-based Feedback via Dataset Reset
6.1 Introduction
6.1.1 Related Work
6.2 Preliminaries
6.3 Dataset Reset Policy Optimization
6.4 Theoretical Analysis
6.4.1 Theoretical Sample Complexity
6.5 Experiments
6.5.1 How well can DR-PO optimize the RLHF objective?
6.5.2 Analysis of Dataset Reset Proportion
6.5.3 DR-PO Transfer Performance
6.5.4 DR-PO Scaling Performance on Anthropic HH
6.6 Conclusion

7 RL for Consistency Models: Faster Reward Guided Text-to-Image Generation
7.1 Introduction
7.2 Related Works
7.3 Preliminaries
7.3.1 Reinforcement Learning
7.3.2 Diffusion and Consistency Models
7.3.3 Reinforcement Learning for Diffusion Models
7.4 Reinforcement Learning for Consistency Models
7.5 Experiments
7.5.1 RLCM vs. DDPO Performance Comparisons
7.5.2 Train and Test Time Analysis
7.5.3 Ablation of Inference Horizon for RLCM
7.5.4 Qualitative Effects on Generalization
7.6 Conclusion and Future Directions

8 Conclusion
8.1 Imitation Learning
8.2 Reinforcement Learning for Generative Models
8.3 Concluding Remarks

III Appendix

A Missing Proofs and Details in Chapter 2
A.1 Analysis of Algorithm 2
A.1.1 Discrete MDPs
A.1.2 KNRs
A.1.3 General Function Class G with Bounded Eluder dimension
A.1.4 Proof of Theorem 7
A.2 Auxiliary Lemmas
A.3 Implementation Details
A.3.1 Environment Setup and Benchmarks
A.3.2 Practical Implementation of MobILE
A.3.3 Hyper-parameter Details
A.4 Additional Experimental Results
A.4.1 Modified Cartpole-v0 environment with noise added to transition dynamics
A.4.2 Swimmer Learning Curves
A.4.3 Additional Results
A.4.4 Ablation Study on Number of Models used for Strategic Exploration Bonus

B Missing Proofs and Details in Chapter 3
B.1 Bonus Designs
B.1.1 Tabular models
B.1.2 KNRs
B.1.3 Gaussian processes
B.2 Proof of Theorem 10
B.3 Finite sample error bound for each model
B.3.1 Discrete MDPs
B.3.2 KNRs
B.3.3 Gaussian processes
B.3.4 Missing Proofs
B.4 Auxiliary Lemmas
B.5 Implementation Details
B.5.1 Environment Details
B.5.2 Dynamics Ensemble Architecture and Model Learning
B.5.3 Policy Architecture and TRPO Details
B.5.4 Discriminator Update and Cost Function Details
B.6 Additional Experiments
B.6.1 MILO with Expert Trajectories
B.6.2 Performance of MILO on Ant without Pessimism

C Missing Proofs and Details in Chapter 4
C.1 Detailed Algorithm Pseudocode
C.2 Implementation and Experiment Details
C.2.1 Environment Details
C.2.2 Dataset Details
C.2.3 Hyperparameters
C.3 Additional Results
C.3.1 Aggregate Performance Comparisons
C.3.2 Learning Curves
C.3.3 Learning curves across different optimization schedules

D Missing Proofs and Details in Chapter 5
D.1 Additional Related Work
D.2 Additional Algorithms
D.3 Additional Experimental Details
D.3.1 KL Reward Constraint
D.3.2 Task Details
D.3.3 IMDB - Algorithm Details
D.3.4 CommonGen - Algorithm Hyperparameters
D.3.5 TL;DR Summarization - Algorithm Hyperparameters
D.4 IMDB Qualitative Examples
D.5 CommonGen Qualitative Examples
D.6 TL;DR Qualitative Examples

E Missing Proofs and Details in Chapter 6
E.1 DR-PO with NPG
E.2 Proof of Theorem 26
E.2.1 Q function Estimation Error
E.2.2 NPG Analysis
E.2.3 Unregularized Suboptimality Gap w.r.t. r∗
E.3 NPG with regularized Q functions
E.3.1 Theoretical Analysis
E.4 Proof of Theorem 90
E.4.1 Q function Estimation Error
E.4.2 Regularized NPG Analysis
E.4.3 Unregularized Suboptimality Gap w.r.t. r∗
E.5 Auxiliary Lemmas
E.5.1 Least Squares Guarantee
E.5.2 Maximum Likelihood Estimation Guarantee
E.5.3 Performance Difference
E.5.4 KL Divergence Property
E.6 Additional Experiment Details
E.6.1 Experiment Hyperparameters and Task Details
E.6.2 Dataset Reset Implementation Details
E.6.3 Details on GPT4 Winrate
E.6.4 Examples from Test
E.6.5 Hyperparameters
E.6.6 Additional Experiments

F Missing Proofs and Details in Chapter 8
F.1 Consistency Models
F.2 Experiment Details
F.2.1 Hyperparameters
F.2.2 Hyperparameter Sweep Ranges
F.2.3 Details on Task Prompts
F.3 Additional Samples from RLCM
F.3.1 Aesthetic Task
F.3.2 Prompt Image Alignment


CHAPTER 1

INTRODUCTION

Reinforcement learning (RL) fundamentally focuses on teaching agents how to make

decisions by interacting with an environment. Unlike supervised learning approaches

that learn from a fixed dataset, reinforcement learning agents learn by doing, receiving

feedback as rewards or penalties based on their chosen actions. This dynamic learning

process enables agents to optimize their behavior over time to achieve specific goals,

making RL particularly effective for problems where explicit programming of correct

actions is challenging. Recent advancements in deep learning have led to the development

of deep reinforcement learning, where deep neural networks approximate the optimal

action-value functions. This integration has significantly expanded RL’s applicability,

allowing it to tackle complex tasks ranging from robotic control (Akkaya et al., 2019)

to game playing (Silver et al., 2018; Berner et al., 2019), revolutionizing how machines

learn and adapt in uncertain and variable environments.

This generality of the paradigm makes RL incredibly difficult, which complicates its

adoption for novel tasks. Even with the explosive success of ChatGPT (OpenAI, 2023), which

applied a deep RL algorithm, Proximal Policy Optimization (PPO) (Schulman et al.,

2017b), to large language models (LLMs), there has been more interest in either reducing

the complexity of RL algorithms or eliminating the need for online interaction. Even

beyond LLMs, despite RL’s superhuman abilities in games such as DOTA (Berner et al.,

2019) or Go (Silver et al., 2016b), we have yet to see widespread adoption of RL in

real-world applications compared to other, arguably more specialized, learning paradigms

such as supervised learning. We notice that a critical challenge for the broad adoption

of RL is efficiently utilizing diverse data sources to create a specialized algorithm. That

is, for many of the successes mentioned above in RL, the algorithms used to learn the



agents were general-purpose algorithms that could also be used in other applications.

In this thesis, we attempt to introduce RL algorithms that progress toward specialized

algorithms for various settings.

The first contribution of this dissertation comes from investigating efficient inverse

reinforcement learning (IRL) from different types of data sources. We consider three

settings. First, a significant source of demonstration data exists in the form of videos.

Although it is straightforward to define the demonstrators’ states we would like to

imitate (i.e., video frames), it is difficult to reliably infer the actions taken to reproduce

the sequence of states shown in a video demonstration. Imitating an expert without

knowledge of the actions creates a challenging exploration problem. This setting is

called Imitation Learning from Observations (ILFO). Second, from the safety concerns

of deploying a suboptimal model to the prohibitive costs of running an immense model,

real-world barriers exist for online exploration and active data collection. Moreover,

actively collecting high-quality demonstration data, usually through human labeling, is

incredibly costly. Then, for a given task, we can envision the setting where a learner can

access a large corpus of low-quality but pre-collected data and a much smaller source

of high-quality expert data rather than get interactive access to the environment. This

setting is called offline imitation learning, and we introduce the first offline IL algorithm.

Finally, when it does make sense to allow for online exploration for IL, using the collected

data efficiently is critical for an algorithm to be practical. Prior works improved the

sample complexity of IL algorithms by investigating the intersection of off-policy RL

and distribution-matching IL methods. A key challenge, however, was developing an

algorithm that correctly utilized these off-policy samples and scaled to more difficult

problem settings with high-dimensional states. This setting is called off-policy IL, and

we provide a principled off-policy procedure.



The second contribution of this work is learning specialized algorithms in the space of

generative models. Foundation models increasingly live up to their namesake, becoming

capable base models for improved downstream performance on various tasks across

multiple application domains. As alluded to before, reinforcement learning from human

feedback (RLHF) has emerged as a promising technique to make these foundation models

even more capable in complex settings and aligned to human intentions, providing a new

toolkit to optimize and guide generative model behavior. This approach, though, requires

collecting a vast amount of resource-demanding human preference data, posing significant

challenges in scalability and widespread adoption of effective RLHF. In this thesis’s

second part, we investigate algorithmic improvements for learning decision-making

agents from diverse data sources, working toward a vision of a scalable method for

aligning these agents to user intentions. We present three different learning settings with

generative models: text generation with an interactive black box model, text generation

with high-quality human labels, and text instruction-guided image generation. Overall,

each setting has a specific property, whether it is deterministic transition dynamics or a

short horizon, that allows for the design of more specialized algorithms that efficiently

exploit these properties.

We live in a world with an overabundance of demonstration data for sequential tasks

in videos, texts, and messy spreadsheets. Applying general-purpose RL algorithms to

all these settings is unnecessarily hard. Our main focus in this thesis is to improve the

efficiency of RL by designing specialized sequential decision-making algorithms from

various sources of information.

1.1 Main Contributions

We outline the two main contributions presented in this thesis.



1.1.1 Imitation Learning from Diverse Data Sources

We consider principled approaches to imitation learning from three different data types:

1. observations, where we have no action information from the expert that we wish

to imitate; 2. offline data, where we only have access to a large corpus of (suboptimal)

trajectories from the environment with no additional interactive data; and 3. off-policy

data, where the online experience we use for learning comes from multiple different

policies. In the imitation learning from observations (ILFO) setting, we show that ILFO

is strictly harder than IL and that exploration is necessary for effective imitation. In the

offline setting, we presented the state of the art in offline IL at the time of publication,

showing the efficacy of our model-based approach on more complicated simulation

domains. Furthermore, we investigated in theory how additional offline data allows us to

mitigate covariate shift without any additional online or expert interaction. Finally, we

proposed an off-policy IRL algorithm that was both principled and practically scalable.

At the time of writing, many off-policy IRL algorithms that scaled to higher-dimensional

states such as images required ad hoc modifications that were not principled. On the

other hand, another line of off-policy IL work building on top of DICE (Nachum et al.,

2019a) was principled but struggled to scale to high-dimensional tasks. For off-policy IL,

we propose AILBoost, which aims to offer the best of both worlds: a principled algorithm

that scales well in practice.

1.1.2 Reinforcement Learning and Imitation Learning with Generative Models

Our main contributions in this section lie in connecting many of the insights from the

previous part into the space of generative models. The key insight underlying these



contributions is that many generative processes are sequential in nature, and thus could

be modeled with the same machinery we have been using in IL and RL. For example, the

generation of a sentence is the sequential prediction of next words, and the generation of an

image is a sequence of denoising steps. With that in mind, we broadly investigate two

different generation tasks: 1. text generation (Chang et al., 2023, 2024b) and 2. image

generation with diffusion (Oertell et al., 2024). For text generation, we lift many prior

algorithms like AggreVaTeD (Sun et al., 2017a) and LOLS (Chang et al., 2015b) to LLMs

while introducing new RL algorithms such as PPO++ (Chang et al., 2023) and DR-PO

(Chang et al., 2024b). The main contribution here is similar to that of the previous part: we

leverage different data sources, such as interactive data for PPO++ and offline data

for DR-PO, to augment RL finetuning. Finally, we introduce a novel perspective on image

generation with consistency models that improves guided text-to-image generation

over existing diffusion-model methods while being up to two orders of magnitude

faster.

1.2 Background

Here we give a brief introduction to background concepts that are repeatedly used

throughout the dissertation. We specifically cover Markov Decision Processes, Trust

Region Policy Optimization (Schulman et al., 2015a), Proximal Policy Optimization

(Schulman et al., 2017b), and adversarial inverse reinforcement learning. Each chapter

will redefine relevant background concepts in more detail, making each chapter self-

contained. The purpose of this section is to give a brief introduction to the models and

base algorithms used throughout this dissertation.

1.2.1 Markov Decision Process (MDP)

In this dissertation, we present algorithms that model the problem as a Markov Decision

Process. There are two types of MDPs that we consider here.

Finite Horizon MDP: As the name suggests, finite horizon MDPs refer to a problem

that has a well-defined, finite maximum horizon. For example, when modeling language

generation as an MDP, perhaps we set a maximum generation length. More formally, a

finite horizon MDP [cite] is defined as a tuple, (S, A, P, R, µ, H). S and A define the set

of states and actions respectively for the modelled task. P defines the transition dynamics

or the probability of transitioning to the next state, s′ ∈ S, when taking action a ∈ A

from state s ∈ S. R is the reward function for this task. Finally, we define µ as the initial

distribution of states and H ∈ N+ as the finite horizon of the MDP. We can then model

a sequential task as an MDP such that at timestep, t: we are at state st ∈ S, take action

at ∈ A, transition to the next state st+1 ∈ S according to P(st+1|st, at), and get a reward of

R(st, at).

We also define a policy π : S → ∆(A) that maps from a state to a distribution over

actions. So for a given initial state, s0 ∼ µ, we get an action a0 ∼ π(·|s0), and then

transition to the next state s1 ∼ P(·|s0, a0). We repeat this process for a maximum of H

timesteps to get a trajectory τ = (s0, a0, s1, a1, . . . , sH, aH). Now, given a policy π, we can

then define the state-action Q function, $Q^\pi_t(s, a)$, as the following:

\[ Q^\pi_t(s, a) = R(s, a) + \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ V^\pi_{t+1}(s') \right], \]

where the state value function $V^\pi_t(s)$ is defined as

\[ V^\pi_t(s) = \mathbb{E}\left[ \sum_{i=t}^{H} R(s_i, a_i) \,\middle|\, a_i \sim \pi(\cdot \mid s_i),\ s_{i+1} \sim P(\cdot \mid s_i, a_i),\ s_t = s \right]. \]

The objective function J(π) is then defined as

\[ J(\pi) = \mathbb{E}_{s_0 \sim \mu}\left[ V^\pi_0(s_0) \right]. \]

The goal would be to learn a policy π ∈ Π that maximizes this objective.
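As a minimal sketch of how these definitions fit together, the following evaluates a fixed policy in a small tabular finite horizon MDP by backward induction, using the convention that the value after the last timestep H is zero. The function name `evaluate_finite_horizon` and the two-state toy MDP are illustrative assumptions, not something used elsewhere in this dissertation.

```python
def evaluate_finite_horizon(P, R, pi, H):
    """Backward induction for V_t and Q_t of a fixed policy.

    P[s][a][s2] is the transition probability, R[s][a] the reward,
    pi[s][a] the action probability, and timesteps run over t = 0..H.
    """
    n_states, n_actions = len(R), len(R[0])
    V_next = [0.0] * n_states  # V_{H+1} = 0: no reward after the last timestep
    V, Q = {}, {}
    for t in range(H, -1, -1):
        # Q_t(s, a) = R(s, a) + E_{s'}[V_{t+1}(s')]
        Q[t] = [[R[s][a] + sum(P[s][a][s2] * V_next[s2] for s2 in range(n_states))
                 for a in range(n_actions)]
                for s in range(n_states)]
        # V_t(s) = E_{a ~ pi}[Q_t(s, a)]
        V[t] = [sum(pi[s][a] * Q[t][s][a] for a in range(n_actions))
                for s in range(n_states)]
        V_next = V[t]
    return V, Q


# Toy two-state MDP (illustrative values): action 0 stays, action 1 switches state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 0.0], [1.0, 1.0]]   # reward 1 whenever the agent is in state 1
pi = [[0.5, 0.5], [0.5, 0.5]]  # uniformly random policy
mu = [1.0, 0.0]                # always start in state 0

V, Q = evaluate_finite_horizon(P, R, pi, H=1)
J = sum(m * v for m, v in zip(mu, V[0]))  # J(pi) = E_{s0 ~ mu}[V_0(s0)]
```

The dictionaries `V` and `Q` are indexed by timestep, which makes the time dependence of the finite horizon value functions explicit.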

Discounted Infinite Horizon MDP: Discounted infinite horizon MDPs account

for problems that do not have a set defined sequence length. They are defined as

(S,A, P,R, µ, γ), where γ ∈ (0, 1) is the discount factor. Different from finite horizon

MDPs, the definitions for the state and state-action value functions become time in-

dependent. We define the state-action Q function for a given policy, Qπ(s, a) as the

following:

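\[ Q^\pi(s, a) = R(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ V^\pi(s') \right], \]

where the value function Vπ(s) is defined as

\[ V^\pi(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, a_t \sim \pi(\cdot \mid s_t),\ s_{t+1} \sim P(\cdot \mid s_t, a_t),\ s_0 = s \right]. \]

We can also define the advantage function Aπ(s, a) as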

Aπ(s, a) = Qπ(s, a) − Vπ(s).

Intuitively, rather than measuring how good an action is in an absolute sense, as the Q

function does, the advantage captures how much better an action is than the policy’s average behavior at that state.
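To make the discounted definitions concrete, here is a small sketch, assuming a tabular MDP stored as NumPy arrays, that solves the Bellman equation for Vπ as a linear system and then forms Qπ and Aπ. The helper name `evaluate_discounted` and the toy MDP are hypothetical.

```python
import numpy as np


def evaluate_discounted(P, R, pi, gamma):
    """Exact V, Q and advantage for a tabular MDP.

    P has shape (S, A, S), R and pi have shape (S, A).
    """
    n_states = R.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel under pi
    r_pi = (pi * R).sum(axis=1)            # expected one-step reward under pi
    # V = r_pi + gamma * P_pi V  <=>  (I - gamma * P_pi) V = r_pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    Q = R + gamma * (P @ V)
    A = Q - V[:, None]
    return V, Q, A


# Toy two-state MDP (illustrative values): action 0 stays, action 1 switches state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 0.0], [1.0, 1.0]])
pi = np.array([[0.5, 0.5], [0.5, 0.5]])

V, Q, A = evaluate_discounted(P, R, pi, gamma=0.9)
```

Because Vπ(s) is the average of Qπ(s, ·) under π, the advantages at every state average to zero under the policy, which is one way to read the advantage as a relative rather than an absolute measure.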

1.2.2 Deep Policy Gradient Algorithms

The works presented throughout this dissertation assume using multiple Deep RL al-

gorithms as black box optimizers for the RL objective. Outside of Chapter 4, we use

on-policy algorithms or RL algorithms that update their policy based on experience

collected from the current, up-to-date policy. Specifically, I will present a primer on

Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a) and Proximal Policy

Optimization (PPO) (Schulman et al., 2017b). For both algorithms, we parameterize our

policy as a deep neural network with parameters, θ. We denote this policy as πθ and will

compute the gradient with respect to θ, i.e. ∇θJ(πθ).

Policy Gradient: Both TRPO and PPO are policy gradient algorithms. I present a

quick primer on policy gradients. Intuitively, the key idea underlying policy gradients is

to increase the probability of actions leading to higher values and decrease the probabilities

of actions resulting in lower values. More formally, for our objective J(πθ), we have our

policy gradient,

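\[ \nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_\theta}(s_t, a_t) \right], \]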
where τ is a trajectory collected by following πθ and Aπθ is the advantage function for

the current policy. The derivation of this gradient was done by Williams (1992) where

he introduced the foundational REINFORCE algorithm, a Monte-Carlo policy gradient

algorithm. I encourage readers to refer to Sutton and Barto (1998) and Agarwal et al.

(2019) for a detailed treatment of policy gradients.

A policy gradient algorithm would then, with the gradient above, update the policy

parameters, θ, usually with stochastic gradient ascent with some learning rate α,

\[ \theta \leftarrow \theta + \alpha \nabla_\theta J(\pi_\theta). \]
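The following sketch illustrates the policy gradient on the simplest possible case: a one-step problem with a softmax policy over three actions, where the expectation can be computed exactly instead of sampled. The setup (the softmax parameterization, the reward vector, and the function names) is an assumption made for illustration; it checks the score-function form of the gradient against finite differences and then takes one gradient ascent step.

```python
import numpy as np


def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()


def policy_gradient(theta, rewards):
    """Exact E_a[grad log pi(a) * A(a)] for a one-step softmax policy."""
    pi = softmax(theta)
    adv = rewards - pi @ rewards     # A(a) = Q(a) - V, with Q(a) = r(a)
    score = np.eye(len(theta)) - pi  # row a holds grad_theta log pi(a)
    return (pi * adv) @ score


theta = np.array([0.1, -0.3, 0.5])
rewards = np.array([1.0, 0.0, 2.0])
grad = policy_gradient(theta, rewards)

# Central finite differences of J(theta) = sum_a pi(a) r(a) as a sanity check.
h = 1e-6
finite_diff = np.array([
    (softmax(theta + h * e) @ rewards - softmax(theta - h * e) @ rewards) / (2 * h)
    for e in np.eye(3)
])

# One step of gradient ascent with a hand-picked learning rate.
alpha = 0.5
new_theta = theta + alpha * grad
```

With sampled trajectories, the exact expectation over actions would be replaced by an average over the collected data, which is the Monte-Carlo estimate used by REINFORCE.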

A key challenge here is how to select the appropriate learning rate. The first-order

gradients from our formulation give us the direction of our policy updates, but do not

give us the magnitude of our update. Moreover, given the online nature of our RL

optimization process, a learning rate for a policy at one iteration may be catastrophic for

another update. This problem motivates our next concept, the natural policy gradient

(Kakade, 2001a).

Natural Policy Gradient: The natural policy gradient (NPG) remedies the step size

problem by including a second-order derivative to capture the sensitivity of gradients

dependent on the changes in parameters, θ. For this purpose, NPG first computes the

KL-divergence between the policy before and after the update,

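\[ D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\theta+\Delta\theta}) = \sum_{s} \pi_\theta(s) \log\left( \frac{\pi_\theta(s)}{\pi_{\theta+\Delta\theta}(s)} \right). \]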

We can then set the maximum tolerance of change as ϵ and solve a constrained optimiza-

tion problem

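\[ \operatorname*{arg\,max}_{\Delta\theta \,:\, D_{\mathrm{KL}}(\pi_\theta \| \pi_{\theta+\Delta\theta}) \le \epsilon} J(\pi_{\theta+\Delta\theta}). \]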

To derive an algorithm computing this conservative update, Kakade (2001a) approximates

the constraint as a Lagrangian and performs a Taylor expansion on the modified objective.

This results in the following update rule

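\[ \theta \leftarrow \theta + \sqrt{\frac{2\epsilon}{\nabla_\theta J(\pi_\theta)^\top F^{-1}(\theta) \nabla_\theta J(\pi_\theta)}}\; F^{-1}(\theta) \nabla_\theta J(\pi_\theta), \]

where $F(\theta) = \mathbb{E}\left[ \nabla_\theta \log \pi_\theta \, \nabla_\theta \log \pi_\theta^\top \right]$ is the Fisher information matrix. The key observation here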

is that we replaced the learning rate α with a dynamic term that depends on the local

sensitivity of the current policy being updated and the constraint ϵ. Furthermore, we

get second order information from F−1(θ). Both of these improvements allow us to

make the largest policy update within the divergence threshold ϵ.
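As a hedged illustration of the NPG step, the sketch below computes the update direction F⁻¹∇J and scales it so that the second-order approximation of the KL divergence, ½ Δθᵀ F Δθ, equals ϵ exactly; this corresponds to normalizing by ∇Jᵀ F⁻¹ ∇J. The function name and the small hand-picked Fisher matrix are assumptions for illustration only.

```python
import numpy as np


def natural_gradient_step(grad, fisher, eps):
    """Largest step along F^{-1} grad whose quadratic KL estimate equals eps."""
    nat_grad = np.linalg.solve(fisher, grad)  # F^{-1} g without forming F^{-1}
    step_size = np.sqrt(2.0 * eps / (grad @ nat_grad))
    return step_size * nat_grad


fisher = np.array([[2.0, 0.5],
                   [0.5, 1.0]])  # any symmetric positive definite matrix
grad = np.array([1.0, -1.0])
eps = 0.01

delta = natural_gradient_step(grad, fisher, eps)
approx_kl = 0.5 * delta @ fisher @ delta  # second-order estimate of the KL
```

Note that the step length no longer depends on a hand-tuned learning rate: rescaling `grad` by any positive constant leaves `delta` unchanged.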

Trust Region Policy Optimization (TRPO): There are two key challenges from

NPG that TRPO aims to address.

First, it is incredibly expensive to compute the natural gradient F−1(θ)∇θJ(πθ), which

involves computing the inverse of the Fisher information matrix, a matrix of size |θ| × |θ|. As

we parameterize our policies as larger neural networks, the exact computation of this

term becomes infeasible. TRPO instead approximates the natural policy gradient with

the conjugate gradient method, an iterative numerical method. The conjugate gradient

method needs far less computation than exactly inverting this matrix, since it only requires

matrix-vector products with F(θ), and thereby improves the scalability of NPG.
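A minimal sketch of the conjugate gradient method is given below, assuming access only to a function that multiplies a vector by a symmetric positive definite matrix (standing in for the Fisher information matrix). The names `conjugate_gradient` and `matvec` and the 3 × 3 example are illustrative; this is not the TRPO reference implementation.

```python
import numpy as np


def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Approximately solve A x = b using only products v -> A v."""
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b)
    r = b.copy()  # residual b - A x for the initial guess x = 0
    p = r.copy()  # current search direction
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x


F = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])  # stands in for the Fisher matrix
g = np.array([1.0, 2.0, 3.0])    # stands in for the policy gradient

nat_grad = conjugate_gradient(lambda v: F @ v, g)
```

In practice the product F v can be computed without ever materializing F, so memory and compute scale with the number of parameters rather than its square.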

Second, although NPG provided us with the optimal step size, our Taylor approxi-

mation of the KL divergence between policies may misrepresent the actual distance and

our update could still violate the constraint. To mitigate this, TRPO performs a line search

that iteratively decreases the update size until we have an update that does not violate the

constraint. Algorithm 1 shows how TRPO enforces that the KL-constraint is satisfied.

Algorithm 1 Line Search for TRPO
1: α = 1.0
2: Compute proposed update $\Delta\theta = \sqrt{\frac{2\epsilon}{\nabla_\theta J(\pi_\theta)^\top F^{-1}(\theta) \nabla_\theta J(\pi_\theta)}}\; F^{-1}(\theta) \nabla_\theta J(\pi_\theta)$
3: for i = 0, 1, . . . , L do
4:     Try update θ′ = θ + α∆θ
5:     if DKL(πθ||πθ′) ≤ ϵ then
6:         Accept the update and break
7:     else
8:         Decay α
9:     end if
10: end for
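The line search in Algorithm 1 can be sketched in a few lines of Python. The sketch assumes the caller supplies a function `kl_fn(theta, candidate)` that evaluates the KL divergence between the two policies; that function, the decay factor of 0.5, and the Gaussian toy example are all assumptions made for illustration.

```python
def backtracking_line_search(theta, delta, kl_fn, eps, decay=0.5, max_steps=10):
    """Shrink the proposed update until the KL constraint is satisfied."""
    alpha = 1.0
    for _ in range(max_steps):
        candidate = [t + alpha * d for t, d in zip(theta, delta)]
        if kl_fn(theta, candidate) <= eps:
            return candidate, alpha  # accept the update
        alpha *= decay               # decay the step and try again
    return list(theta), 0.0          # no acceptable step found: keep theta


def gaussian_kl(theta, candidate):
    """KL between unit-variance Gaussians whose mean is the single parameter."""
    return 0.5 * (theta[0] - candidate[0]) ** 2


new_theta, accepted_alpha = backtracking_line_search(
    theta=[0.0], delta=[1.0], kl_fn=gaussian_kl, eps=0.01
)
```

Here the full step violates the constraint, so the step is halved until the KL divergence drops below ϵ; if no step within `max_steps` satisfies the constraint, the parameters are left unchanged.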

Proximal Policy Optimization (PPO): Schulman et al. (2017b) introduced PPO

building on the foundations introduced in TRPO. Despite improvements of TRPO over

NPG in scalability, TRPO is still complicated to implement and is much more costly

to run than the first-order updates used in vanilla policy gradient algorithms. To

balance practicality and the insight from NPG of taking conservative policy updates,

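PPO approximates the trust region used in TRPO and NPG (i.e., DKL ≤ ϵ) with a clipped

surrogate objective,

\[ J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{H} \min\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)} A^{\pi_{\theta_k}}(s_t, a_t),\ \mathrm{clip}\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)},\, 1 - \epsilon,\, 1 + \epsilon \right) A^{\pi_{\theta_k}}(s_t, a_t) \right) \right], \]

where πθk denotes the policy before the current update.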

The main insight from this objective is that the clip serves as an implicit trust region.

Furthermore, this objective allows us to use first-order stochastic gradient methods to update

the policy, making both the implementation and the optimization much simpler.

Although ϵ requires tuning, the relative simplicity of PPO over TRPO and its improved

scalability has made PPO the de facto policy gradient algorithm over TRPO.
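To illustrate how the clip acts as an implicit trust region, the sketch below evaluates the clipped surrogate on a handful of probability ratios and advantages. It averages over samples rather than summing over a trajectory, and the function name and numbers are illustrative assumptions.

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Average clipped surrogate over sampled (ratio, advantage) pairs.

    ratios[i] is pi_theta(a|s) / pi_theta_k(a|s) for sample i.
    """
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
        # Taking the min makes the objective a pessimistic bound: once the
        # ratio leaves [1 - eps, 1 + eps] in the direction that improves the
        # objective, the sample stops contributing any further gain.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(ratios)


ratios = [1.5, 0.5, 1.0]
advantages = [1.0, -1.0, 2.0]
value = ppo_clip_objective(ratios, advantages, eps=0.2)
```

In this example the first sample is capped at a ratio of 1.2, and the second sample, whose action has negative advantage, is scored with the clipped ratio 0.8 because that gives the lower, more pessimistic value.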

1.2.3 Inverse Reinforcement Learning (IRL)

Another main topic of this dissertation is inverse reinforcement learning. Although we

have used the term imitation learning (IL) up to this point, the family of methods that we

investigate to solve the IL problem is IRL. Here I will briefly introduce the setting for

IRL and present a unified min-max objective for IRL similar to our RL objective.

Setting: Recall we defined our setting for RL to be an MDP or a tuple

(S,A, P,R, µ,H). In IL, we still model our task as an MDP, but we do not assume

access to the reward. Instead, we usually assume access to an expert’s demonstrations in

the form of a dataset of trajectories. Thus, the setting for IRL would be an MDP without

the reward function (i.e., (S, A, P, µ, H)) and a separate dataset of N demonstrations,

$\mathcal{D} = \{\tau_i\}_{i=0}^{N}$.

Objective: As the name Inverse RL suggests, the overall objective of IRL is to learn

the unknown reward function that the expert policy, πE, is maximizing. Note that this

procedure could easily be extended to solve an IL problem which seeks to imitate and

learn πE from demonstrations. Specifically, once we have the optimal reward function R

that the expert is optimizing for, we can then do RL with that reward function to learn

the imitating policy. We formalize this and get the following objective

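\[ \hat{\pi} = \operatorname*{arg\,max}_{\pi \in \Pi} \min_{f \in \mathcal{F}} \; \mathbb{E}_{\pi}\left[ f(s, a) \right] - \mathbb{E}_{\pi_E}\left[ f(s, a) \right], \]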
where F is the set of discriminators and Π is the set of policies. As detailed by Ke

et al. (2020), this objective can be viewed as framing IRL as a divergence minimization

problem. That is, for different choices of F, we are effectively trying to minimize

a different divergence between our current policy πθ and our expert policy πE. This

min-max objective has close connections to generative adversarial networks (Goodfellow

et al., 2014) leading to practical instantiations of this objective in foundational works

such as GAIL (Ho and Ermon, 2016a). Broadly, algorithms that solve this objective

iteratively do two steps: 1. solve for the inner minimization to learn f , 2. with f fixed,

do RL to learn a policy, π, that maximizes the current reward, which depends on f.
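As one concrete instance of the inner minimization, suppose (purely as an illustrative assumption) that the discriminators are linear in some feature map, f_w(s, a) = w · φ(s, a) with ‖w‖₂ ≤ 1. The inner problem then has a closed form: the minimizing w points from the policy's mean feature vector toward the expert's, and the objective value is the negative distance between the two. The function name and the feature vectors below are hypothetical.

```python
import math


def inner_min_linear(mu_policy, mu_expert):
    """Solve min_{||w|| <= 1} w . (mu_policy - mu_expert) in closed form.

    mu_policy and mu_expert are mean feature vectors E[phi(s, a)] under the
    current policy and the expert, respectively.
    """
    diff = [p - e for p, e in zip(mu_policy, mu_expert)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        return [0.0] * len(diff), 0.0  # distributions already match in features
    w = [-d / norm for d in diff]
    value = sum(wi * di for wi, di in zip(w, diff))
    return w, value


w, value = inner_min_linear(mu_policy=[3.0, 0.0], mu_expert=[0.0, 4.0])
```

Under this choice of F, maximizing the objective over policies reduces to matching the expert's feature expectations, which is one of the divergences that the min-max view recovers.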

1.3 Organization

The remainder of the thesis is organized as follows.

Part I: Imitation Learning consists of three chapters. Chapter 2 focuses on imitation

learning from observations alone or the setting where the expert demonstrations do not

include any action information. In this chapter, we present a model-based algorithm

MobILE. Chapter 3 considers doing effective imitation learning in the setting where we

do not allow for online interactions within the MDP, but instead have access to a large

dataset of (suboptimal) experience. Here, we introduce a model-based offline algorithm

MILO. Finally, Chapter 4 considers doing imitation learning from off-policy data in a

principled way and proposes AILBoost.

Part II: IL and RL for Generative Models presents four chapters. Chapter 5 unifies

a wide range of RL and interactive IL algorithms and specializes them for language

generation with Large Language Models. We call this framework Reinforcement Learn-

ing from Guided Feedback (RLGF) and introduce a new algorithm PPO++. Chapter 6

removes the assumption that we have access to another interactive LLM and considers

whether we could directly use human demonstrations to improve RL for language models.

Here we present a hybrid RL algorithm DR-PO. Chapter 7 introduces using RL for a

novel class of generative models called Consistency Models. Here we propose RLCM

which does guided text-to-image generation with consistency models. Finally, Chapter 8

concludes the dissertation with future directions that build upon the ideas presented here.

Part III: Appendix contains multiple missing proofs and implementation details that

were kept out of the main text for clarity.

1.4 Bibliographical Remarks

This thesis contains works for which this author was a key contributor.

Chapter 2 is based on joint work with Rahul Kidambi and Wen Sun (Kidambi et al.,

2021). Chapter 3 is based on joint work with Masatoshi Uehara, Dhruv Sreenivas, Rahul

Kidambi, and Wen Sun (Chang et al., 2021). Chapter 4 is based on joint work with

Dhruv Sreenivas, Wendy Huang, Kianté Brantley, and Wen Sun (Chang et al., 2024a).

Chapter 5 is based on joint work with Kianté Brantley, Rajkumar Ramamurthy, Dipendra

Misra, and Wen Sun (Chang et al., 2023). Chapter 6 is based on joint work with Wenhao

Zhan, Kianté Brantley, Dipendra Misra, Jason D Lee, and Wen Sun (Chang et al., 2024b).

Finally, Chapter 7 is based on joint work with Owen Oertell, Yiyi Zhang, Kianté Brantley,

and Wen Sun (Oertell et al., 2024).


Part I

Imitation Learning


CHAPTER 2

MODEL-BASED IMITATION LEARNING FROM OBSERVATION ALONE

This chapter studies Imitation Learning from Observations alone (ILFO) where the

learner is presented with expert demonstrations that consist only of states visited by an

expert (without access to actions taken by the expert). We present a provably efficient

model-based framework, MobILE, to solve the ILFO problem. MobILE involves carefully

trading off strategic exploration against imitation - this is achieved by integrating the

idea of optimism in the face of uncertainty into the distribution matching imitation

learning (IL) framework. We provide a unified analysis for MobILE, and demonstrate

that MobILE enjoys strong performance guarantees for classes of MDP dynamics that

satisfy certain well studied notions of structural complexity. We also show that the ILFO

problem is strictly harder than the standard IL problem by presenting an exponential

sample complexity separation between IL and ILFO. We complement these theoretical

results with experimental simulations on benchmark OpenAI Gym tasks that indicate

the efficacy of MobILE. Code for implementing the MobILE framework is available at

https://github.com/rahulkidambi/MobILE-NeurIPS2021.


2.1 Introduction

This chapter considers Imitation Learning from Observation Alone (ILFO). In ILFO, the

learner is presented with sequences of states encountered by the expert, without access to

the actions taken by the expert, meaning approaches based on a reduction to supervised

learning (e.g., Behavior cloning (BC) (Ross and Bagnell, 2010a), DAgger (Ross et al.,

2011b)) are not applicable. ILFO is more general and has potential for applications where

the learner and expert have different action spaces, as well as applications such as sim-to-real (Song

et al., 2020a; Desai et al., 2020).

Sun et al. (2019c) reduced the ILFO problem to a sequence of one-step distribution

matching problems that results in obtaining a non-stationary policy. This approach,

however, is sample inefficient for longer horizon tasks since the algorithm does not

effectively reuse previously collected samples when solving the current sub-problem.

Another line of work considers model-based methods to infer the expert’s actions with

either an inverse dynamics (Torabi et al., 2018a) or a forward dynamics (Edwards et al.,

2019) model; these recovered actions are then fed into an IL approach like BC to output

the final policy. These works rely on stronger assumptions that are only satisfied for

Markov Decision Processes (MDPs) with injective transition dynamics (Zhu et al., 2020);

we return to this in the related works section.

We introduce MobILE—Model-based Imitation Learning and Exploring, a model-

based framework, to solve the ILFO problem. In contrast to existing model-based efforts,

MobILE learns the forward transition dynamics model—a quantity that is well defined

for any MDP. Importantly, MobILE combines strategic exploration with imitation by

interleaving a model learning step with a bonus-based, optimistic distribution matching

step – a perspective, to the best of our knowledge, that has not been considered in


Figure 2.1: Expert performance normalized scores of ILFO algorithms averaged across 5
seeds in environments with discrete action spaces (Reacher-v2) and continuous action
spaces (Hopper-v2 and Walker2d-v2).

Imitation Learning. MobILE has the ability to automatically trade-off exploration and

imitation. It simultaneously explores to collect data to refine the model and imitates

the expert wherever the learned model is accurate and certain. At a high level, our

theoretical results and experimental studies demonstrate that systematic exploration is

beneficial for solving ILFO reliably and efficiently, and optimism is both a theoretically

sound and practically effective approach for strategic exploration in ILFO (see Figure 2.1

for comparisons with other ILFO algorithms). This chapter extends the realm of partial

information problems (e.g. Reinforcement Learning and Bandits) where optimism has

been shown to be crucial in obtaining strong performance, both in theory (e.g., E3 (Kearns

and Singh, 2002a), UCB (Auer et al., 2002)) and practice (e.g., RND (Burda et al., 2018)).

This chapter proves that incorporating optimism into the min-max IL framework (Ziebart

et al., 2008; Ho and Ermon, 2016b; Sun et al., 2019c) is beneficial for both the theoretical

foundations and empirical performance of ILFO.

Our Contributions: We present MobILE (Algorithm 2), a provably efficient, model-

based framework for ILFO that offers competitive results in benchmark gym tasks.

MobILE can be instantiated with various implementation choices owing to its modular

design. In this chapter, we detail the following contributions:

1. MobILE combines model-based learning, optimism for exploration, and adversarial

imitation learning. MobILE achieves global optimality with near-optimal regret bounds

for classes of MDP dynamics that satisfy well studied notions of complexity. The key

idea of MobILE is to use optimism to trade-off imitation and exploration.

2. We show an exponential sample complexity gap between ILFO and classic IL where

one has access to expert’s actions. This indicates that ILFO is fundamentally harder

than IL. Our lower bound on ILFO also indicates that to achieve near optimal regret,

one needs to perform systematic exploration rather than random or no exploration,

both of which will incur sub-optimal regret.

3. We instantiate MobILE with a model ensemble of neural networks and a disagreement-

based bonus. We present experimental results on benchmark OpenAI Gym tasks,

indicating MobILE matches or outperforms existing approaches. Ablation studies

indicate that optimism helps in improving the performance in practice.

2.2 Related Works

Imitation Learning (IL) is considered through the lens of two types of approaches: (a)

behavior cloning (BC) (Pomerleau, 1989) which casts IL as a reduction to supervised or

full-information online learning (Ross and Bagnell, 2010a; Ross et al., 2011b), or, (b)

(adversarial) inverse RL (Ng and Russell, 2000; Abbeel and Ng, 2004; Ziebart et al., 2008;

Finn et al., 2016b; Ho and Ermon, 2016b; Ke et al., 2019; Ghasemipour et al., 2020),

which involves minimizing various distribution divergences to solve the IL problem,

either with the transition dynamics known (e.g., Ziebart et al. (2008)), or unknown (e.g.,

Ho and Ermon (2016b)). MobILE does not assume knowledge of the transition dynamics,

is model-based, and operates without access to the expert’s actions.

Imitation Learning from Observation Alone (ILFO): Sun et al. (2019c) present a

model-free approach FAIL that outputs a non-stationary policy by reducing the ILFO

problem into a sequence of min-max problems, one per time-step. While being theoreti-

cally sound, this approach cannot share data across different time steps and thus is not

data efficient for long horizon problems. Also FAIL in theory only works for discrete

actions. In contrast, MobILE learns a stationary policy using model-based approaches by

reusing data across all time steps and extends to continuous action space. Another line of

work (Torabi et al., 2018a; Edwards et al., 2019; Yang et al., 2019) relies on learning an

estimate of expert action, often through the use of an inverse dynamics model, $P^e(a \mid s, s')$.

Unfortunately, an inverse dynamics model is not well defined in many benign problem

instances. For instance, Zhu et al. (2020, remark 1, section 9.3) present an example

showing that inverse dynamics is not well defined except in the case when the MDP

dynamics is injective (i.e., no two actions could lead to the same next state from the

current state. Note that even deterministic transition dynamics doesn’t imply injectivity

of the MDP dynamics). Furthermore, ILPO (Edwards et al., 2019) applies to MDPs

with deterministic transition dynamics and discrete actions. MobILE, on the other hand,

learns the forward dynamics model which is always unique and well-defined for both

deterministic and stochastic transitions and works with discrete and continuous actions.

Another line of work in ILFO revolves around using hand-crafted cost functions that may

rely on task-specific knowledge (Peng et al., 2018; Aytar et al., 2018; Schmeckpeper et al.,

2020). The performance of policies outputted by these methods relies on the quality

of the engineered cost functions. In contrast, MobILE does not require cost function

engineering.

Model-Based RL has seen several advances (Sutton, 1990; Li and Todorov, 2004;

Deisenroth and Rasmussen, 2011) including ones based on deep learning (e.g., Lampe

and Riedmiller (2014); Gu et al. (2016); Luo et al. (2018); Janner et al. (2019); Lowrey

et al. (2019); Wang et al. (2019)). Given MobILE’s modularity, these advances in model-

based RL can be translated to improved algorithms for the ILFO problem. MobILE bears

parallels to provably efficient model-based RL approaches including E3 (Kearns and

Singh, 2002b; Kakade et al., 2003b), R-MAX (Brafman and Tennenholtz, 2001), UCRL

(Jaksch et al., 2010), UCBVI (Azar et al., 2017), Linear MDP (Yang and Wang, 2019),

LC3 (Kakade et al., 2020a), Witness rank (Sun et al., 2019a) which utilize optimism

based approaches to trade-off exploration and exploitation. MobILE utilizes optimism to

trade-off exploration and imitation.

2.3 Setting

We consider an episodic finite-horizon MDP, M = {S, A, P⋆, H, c, s0}, where S, A are

the state and action space, P⋆ : S × A → S is the MDP’s transition kernel, H is

the horizon, s0 is a fixed initial state (note that our work generalizes when we have a

distribution over initial states), and c is the state-dependent cost function c : S → [0, 1].

Our result can be extended to the setting where c : S × S → [0, 1], i.e., the ground truth

cost c(s, s′) depends on state and next state pairs. For analytical simplicity, we focus on

c : S → [0, 1].¹

We denote $d^\pi_P \in \Delta(S \times A)$ as the average state-action distribution of policy π under

the transition kernel P, i.e., $d^\pi_P(s, a) := \frac{1}{H} \sum_{t=1}^{H} \Pr(s_t = s, a_t = a \mid s_0, \pi, P)$, where $\Pr(s_t = s, a_t = a \mid s_0, \pi, P)$

is the probability of reaching (s, a) at time step t starting from s0 by

following π under transition kernel P. We abuse notation and write $s \sim d^\pi_P$ to denote a

¹Without any additional assumptions, in ILFO, learning to optimize action-dependent cost c(s, a) (or
c(s, a, s′)) is not possible. For example, if there are two sequences of actions that generate the same
sequence of states, without seeing expert’s preference over actions, we do not know which actions to
commit to.

state s is sampled from the state-wise distribution which marginalizes action over $d^\pi_P(s, a)$,

i.e., $d^\pi_P(s) := \frac{1}{H} \sum_{t=1}^{H} \Pr(s_t = s \mid s_0, \pi, P)$. For a given cost function f : S → [0, 1], $V^\pi_{P; f}$

denotes the expected total cost of π under transition P and cost function f. Similar to the

IL setting, in ILFO, the ground truth cost c is unknown. Instead, we can query the expert,

denoted as $\pi^e : S \to \Delta(A)$. Note that the expert $\pi^e$ could be stochastic and does not have

to be the optimal policy. The expert, when queried, provides state-only demonstrations

$\tau = \{s_0, s_1, \ldots, s_H\}$, where $s_{t+1} \sim P^\star(\cdot \mid s_t, a_t)$ and $a_t \sim \pi^e(\cdot \mid s_t)$.

The goal is to leverage expert’s state-wise demonstrations to learn a policy π that

performs as well as πe in terms of optimizing the ground truth cost c, with polynomial

sample complexity on problem parameters such as horizon, number of expert samples and

online samples and the underlying MDP’s complexity measures (see Section 2.5

for precise examples). We track the progress of any (randomized) algorithm by measuring

the (expected) regret incurred by a policy π defined as E[Vπ]−Vπ∗ as a function of number

of online interactions utilized by the algorithm to compute π.

2.3.1 Function Approximation Setup

Since the ground truth cost c is unknown, we utilize the notion of a function class

(i.e., discriminators) F ⊂ S → [0, 1] to define the costs that can then be utilized by a

planning algorithm (e.g., NPG (Kakade, 2001b)) for purposes of distribution matching

with expert states. If the ground truth c depends on both the state and next state, (s, s′),

we use discriminators F ⊂ S × S → [0, 1]. Furthermore, we use a model class

P ⊂ S × A → ∆(S) to capture the ground truth transition P⋆. For the theoretical results

in this chapter, we assume realizability:

Assumption 1. Assume F and P capture the ground truth cost and transition, respectively, i.e.,

c ∈ F, P⋆ ∈ P.

We will use an integral probability metric (IPM) with F as our divergence measure.

Note that if c ∈ F and c : S → [0, 1], then the IPM defined as $\max_{f \in \mathcal{F}} \mathbb{E}_{s \sim d^\pi} f(s) - \mathbb{E}_{s \sim d^{\pi^e}} f(s)$

directly upper bounds the sub-optimality gap $V^\pi - V^{\pi^e}$, where $V^\pi$ is the expected

total cost of π under cost function c. This justifies why minimizing the IPM between

two state distributions suffices (Ho and Ermon, 2016b; Sun et al., 2019c). Similarly,

if c depends on s, s′, we can simply minimize the IPM between two state-next state

distributions, i.e., $\max_{f} \mathbb{E}_{s, s' \sim d^\pi} f(s, s') - \mathbb{E}_{s, s' \sim d^{\pi^e}} f(s, s')$, where discriminators now take

(s, s′) as input.²
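The bound can be spelled out in one line. The sketch below assumes that the expected total cost sums c(s_t) over the same H time steps used to define the average state distribution, so that the total cost equals H times the average cost; the factor H comes from this assumption about normalization and is not stated explicitly in the text above.

```latex
\begin{align*}
V^{\pi} - V^{\pi^e}
  &= H \left( \mathbb{E}_{s \sim d^{\pi}}\left[ c(s) \right]
            - \mathbb{E}_{s \sim d^{\pi^e}}\left[ c(s) \right] \right) \\
  &\le H \max_{f \in \mathcal{F}}
       \left( \mathbb{E}_{s \sim d^{\pi}}\left[ f(s) \right]
            - \mathbb{E}_{s \sim d^{\pi^e}}\left[ f(s) \right] \right),
\end{align*}
```

where the inequality holds because c ∈ F under the realizability assumption, so the maximum over F is at least the value attained at f = c.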

To permit generalization, we require P to have bounded complexity. For analytical

simplicity, we assume F is discrete (but exponentially large), and we require the sample

complexity of any PAC algorithm to scale polynomially with respect to its complexity

ln(|F|). The ln |F| complexity can be replaced by conventional bounded complexity

measures such as Rademacher complexity and covering number for continuous F (e.g.,

F being a Reproducing Kernel Hilbert Space).

2.4 Algorithm

We introduce MobILE (Algorithm 2) for the ILFO problem. MobILE utilizes (a) a function

class F for Integral Probability Metric (IPM) based distribution matching, (b) a tran-

sition dynamics model class P for model learning, (c) a bonus parameterization B for

exploration, (d) a policy class Π for policy optimization. At every iteration, MobILE (in

Algorithm 2) performs the following steps:

²We slightly abuse notation here and denote $d^\pi$ as the average state-next state distribution of π, i.e.,
$d^\pi(s, s') := d^\pi(s) \int_a \pi(a \mid s) P^\star(s' \mid s, a)\, da$.

Algorithm 2 MobILE: The framework of Model-based Imitation Learning and Exploring for ILFO
1: Require: IPM class F, dynamics model class P, policy class Π, bonus function class B, expert dataset $\mathcal{D}_e \equiv \{s^e_i\}_{i=1}^{N}$.
2: Initialize policy π0 ∈ Π, replay buffer $\mathcal{D}_{-1} = \emptyset$.
3: for t = 0, · · · , T − 1 do
4:     Execute πt in true environment P⋆ to get samples $\tau_t = \{s_k, a_k\}_{k=0}^{H-1} \cup s_H$. Append to replay buffer $\mathcal{D}_t = \mathcal{D}_{t-1} \cup \tau_t$.
5:     Update model and bonus: $\hat{P}_{t+1} : S \times A \to S$ and $b_{t+1} : S \times A \to \mathbb{R}^+$ using buffer $\mathcal{D}_t$.
6:     Optimistic model-based min-max IL: obtain πt+1 by solving Equation (2.1) with $\hat{P}_{t+1}$, $b_{t+1}$, $\mathcal{D}_e$.
7: end for
8: Return πT.

1. Dynamics Model Learning: execute policy in the environment online to obtain state-

action-next state (s, a, s′) triples which are appended to the buffer D. Fit a transition

model P̂ on D.

2. Bonus Design: design bonus to incentivize exploration where the learnt dynamics

model is uncertain, i.e. the bonus b(s, a) is large at state s where P̂(·|s, a) is uncertain

in terms of estimating P⋆(·|s, a), while b(s, a) is small where P̂(·|s, a) is certain.

3. Imitation-Exploration tradeoff: Given discriminators F , model P̂, bonus b and

expert dataset De, perform distribution matching by solving the model-based IPM

objective with bonus:

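\[ \pi_{t+1} \leftarrow \arg\min_{\pi \in \Pi} \max_{f \in \mathcal{F}} L(\pi, f; \hat{P}, b, \mathcal{D}_e) := \mathbb{E}_{(s, a) \sim d^\pi_{\hat{P}}}\left[ f(s) - b(s, a) \right] - \mathbb{E}_{s \sim \mathcal{D}_e}\left[ f(s) \right], \tag{2.1} \]

where $\mathbb{E}_{s \sim \mathcal{D}_e} f(s) := \sum_{s \in \mathcal{D}_e} f(s) / |\mathcal{D}_e|$.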

Intuitively, the bonus cancels out discriminator’s power in parts of the state space where

the dynamics model P̂ is not accurate, thus offering freedom for MobILE to explore. We

first explain MobILE’s components and then discuss MobILE’s key property—which is to

trade-off exploration and imitation.
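To see how the bonus enters the objective in Equation (2.1), the sketch below forms a Monte Carlo estimate of L from state-action samples drawn in the learned model and from expert states. The function name, the scalar states, and the hand-written discriminator and bonus are illustrative assumptions and not part of the MobILE implementation.

```python
def mobile_objective(model_rollout, expert_states, f, bonus):
    """Sample estimate of E_{(s,a)~d^pi}[f(s) - b(s,a)] - E_{s~D_e}[f(s)]."""
    policy_term = sum(f(s) - bonus(s, a) for s, a in model_rollout) / len(model_rollout)
    expert_term = sum(f(s) for s in expert_states) / len(expert_states)
    return policy_term - expert_term


rollout = [(0.0, 0), (1.0, 1)]  # (state, action) pairs sampled in the model
expert_states = [1.0, 1.0]      # state-only expert demonstrations


def f(s):
    return s                    # a fixed discriminator, chosen by hand


def no_bonus(s, a):
    return 0.0


def bonus(s, a):
    return 0.5 if a == 1 else 0.0  # pretend the model is uncertain about action 1


plain = mobile_objective(rollout, expert_states, f, no_bonus)
optimistic = mobile_objective(rollout, expert_states, f, bonus)
```

Since the policy minimizes this objective, a larger bonus at uncertain state-action pairs lowers the cost of visiting them, which is how the bonus offsets the discriminator there and leaves room for exploration.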

2.4.1 Components of MobILE

Dynamics model learning: For the model fitting step in line 5, we assume that we get

a calibrated model in the sense that: ∥P̂t(·|s, a) − P⋆(·|s, a)∥1 ≤ σt(s, a),∀s, a for some

uncertainty measure σt(s, a), similar to model-based RL works, e.g. (Curi et al., 2020).

We discuss ways to estimate σt(s, a) in the bonus estimation below. There are many

examples (discussed in Section 2.5) that permit efficient estimation of these quantities

including tabular MDPs, Kernelized Nonlinear Regulators, and nonparametric models such as

Gaussian Processes. Given a general function class G ⊂ S × A → S, one can learn $\hat{g}_t$

via solving a regression problem, i.e.,

\[ \hat{g}_t = \operatorname*{arg\,min}_{g \in \mathcal{G}} \sum_{(s, a, s') \in \mathcal{D}_t} \| g(s, a) - s' \|_2^2, \tag{2.2} \]

and setting $\hat{P}_t(\cdot \mid s, a) = \mathcal{N}\left( \hat{g}_t(s, a), \sigma^2 I \right)$, where σ is the standard deviation of error

induced by $\hat{g}_t$. In practice, such parameterizations have been employed in several settings

in RL with G being a multi-layer perceptron (MLP) based function class (e.g., (Rajeswaran

et al., 2020)). In Section 2.5, we also connect this with prior works in provable model-

based RL literature.
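As a minimal sketch of the regression in Equation (2.2), the code below takes G to be linear functions of the concatenated state and action and solves the least squares problem in closed form. The linear model class, the synthetic noise-free data, and the function name are assumptions chosen to keep the example small; the experiments in this chapter use neural networks instead.

```python
import numpy as np


def fit_linear_dynamics(states, actions, next_states):
    """Least squares fit of s' ~ W^T [s; a], plus the residual standard deviation."""
    X = np.hstack([states, actions])
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    sigma = float(np.std(next_states - X @ W))
    return W, sigma


# Synthetic transitions from a known linear system (illustrative values).
rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1],
                   [0.0, 0.8]])
B_true = np.array([[0.0],
                   [1.0]])
states = rng.normal(size=(50, 2))
actions = rng.normal(size=(50, 1))
next_states = states @ A_true.T + actions @ B_true.T

W, sigma = fit_linear_dynamics(states, actions, next_states)
prediction = np.concatenate([states[0], actions[0]]) @ W  # mean of the Gaussian model
```

The fitted mean `prediction` together with `sigma` defines the Gaussian model N(ĝ(s, a), σ²I) described above; with noisy data, `sigma` would be strictly positive.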

Bonus: We utilize bonuses as a means to incentivize the policy to efficiently explore unknown parts of the state space for improved model learning (and hence better distribution matching). With the uncertainty measure $\sigma_t(s,a)$ obtained from calibrated model fitting, we can simply set the bonus $b_t(s,a) = O(H\sigma_t(s,a))$. How do we obtain $\sigma_t(s,a)$ in practice? For a general class $\mathcal{G}$, given the least square solution $\hat{g}_t$, we can define a version space $\mathcal{G}_t$ as

\[
\mathcal{G}_t = \Big\{ g \in \mathcal{G} : \sum_{i=0}^{t-1} \sum_{h=0}^{H-1} \| g(s^i_h, a^i_h) - \hat{g}_t(s^i_h, a^i_h) \|_2^2 \le z_t \Big\},
\]

with $z_t$ being a hyperparameter. The version space $\mathcal{G}_t$ is an ensemble of functions $g \in \mathcal{G}$ whose training error on $\mathcal{D}_t$ is almost as small as the training error of the least square solution $\hat{g}_t$.



In other words, the version space $\mathcal{G}_t$ contains functions that agree on the training set $\mathcal{D}_t$. The uncertainty measure at $(s,a)$ is then the maximum disagreement among models in $\mathcal{G}_t$, with $\sigma_t(s,a) \propto \sup_{f_1, f_2 \in \mathcal{G}_t} \|f_1(s,a) - f_2(s,a)\|_2$. Since all $g \in \mathcal{G}_t$ agree on $\mathcal{D}_t$, a large $\sigma_t(s,a)$ indicates that $(s,a)$ is novel. See Example 3 for more theoretical details.

Empirically, disagreement among an ensemble (Osband et al., 2018a; Azizzadenesheli

et al., 2018a; Burda et al., 2019; Pathak et al., 2019a; Lowrey et al., 2019) is used for

designing bonuses that incentivize exploration. We utilize a neural network ensemble,

where each model is trained on Dt (via SGD on squared loss Equation (2.2)) with

different initialization. This approximates the version space Gt, and the bonus is set as a

function of maximum disagreement among the ensemble’s predictions.
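The ensemble-based uncertainty described above can be sketched in a few lines of Python. This is an illustrative sketch only: `ensemble_disagreement` is a hypothetical helper, and each model is assumed to be a callable that maps (s, a) to a predicted next state given as a sequence of floats.

```python
import itertools
import math

def ensemble_disagreement(models, s, a):
    """Largest pairwise L2 distance between next-state predictions at (s, a)."""
    preds = [m(s, a) for m in models]  # one predicted next state per model
    pairs = itertools.combinations(preds, 2)
    # With fewer than two models there is nothing to compare, so return 0.
    return max((math.dist(p, q) for p, q in pairs), default=0.0)
```

Models that were trained on the same data tend to agree near that data, so a large value flags a state-action pair far from the training set.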

Optimistic model-based min-max IL: For model-based imitation (line 6), MobILE takes the current model $\hat{P}_t$ and the discriminators $\mathcal{F}$ as inputs and performs policy search to minimize the divergence defined by $\hat{P}_t$ and $\mathcal{F}$:

\[
d_t(\pi, \pi^e) := \max_{f \in \mathcal{F}} \Big[ \mathbb{E}_{s,a \sim d^{\pi}_{\hat{P}_t}} \big( f(s) - b_t(s,a) \big) - \mathbb{E}_{s \sim d^{\pi^e}} f(s) \Big].
\]

Note that, for a fixed $\pi$, the $\arg\max_{f \in \mathcal{F}}$ is identical with or without the bonus term, since $\mathbb{E}_{s,a \sim d^{\pi}_{\hat{P}_t}} b_t(s,a)$ is independent of $f$. In our implementation, we use the Maximum Mean Discrepancy (MMD) with a Radial Basis Function (RBF) kernel to model the discriminators $\mathcal{F}$.³ We compute $\arg\min_{\pi} d_t(\pi, \pi^e)$ by iteratively (1) computing the argmax discriminator $f$ given the current $\pi$, and (2) using policy gradient methods (e.g., TRPO) to update $\pi$ inside $\hat{P}_t$ with $f - b_t$ as the cost. Specifically, to find $\pi_t$ (line 6), we iterate between the following two steps:

1. Cost update: $\hat{f} = \arg\max_{f \in \mathcal{F}} \mathbb{E}_{s \sim d^{\hat{\pi}}_{\hat{P}_t}} f(s) - \mathbb{E}_{s \sim \mathcal{D}_e} f(s)$

2. PG step: $\hat{\pi} = \hat{\pi} - \eta \cdot \nabla_{\pi} V^{\hat{\pi}}_{\hat{P}_t, \hat{f} - b_t}$,

where the PG step uses the learnt dynamics model $\hat{P}_t$ and the optimistic IPM cost $\hat{f}(s) - b_t(s,a)$. Note that for MMD, the cost update step has a closed-form solution.
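The closed form mentioned above can be illustrated as follows. For a linear discriminator class with unit-norm weights over a feature map, the maximizer of the cost update is the normalized difference of mean feature embeddings. The sketch below assumes an explicit finite-dimensional feature map `phi` over states (for example, random Fourier features); the function names are hypothetical.

```python
import math

def mean_embedding(states, phi):
    """Average feature vector of a list of states."""
    feats = [phi(s) for s in states]
    dim = len(feats[0])
    return [sum(f[j] for f in feats) / len(feats) for j in range(dim)]

def mmd_cost_update(learner_states, expert_states, phi):
    """Return (w, value): the unit-norm maximizer of
    mean_learner[w . phi] - mean_expert[w . phi], and the maximum itself."""
    mu_l = mean_embedding(learner_states, phi)
    mu_e = mean_embedding(expert_states, phi)
    diff = [x - y for x, y in zip(mu_l, mu_e)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        # The two embeddings coincide, so every discriminator scores zero.
        return [0.0] * len(diff), 0.0
    return [d / norm for d in diff], norm
```

The resulting cost f(s) = w · phi(s) is high where the learner visits states more often than the expert, which is what the policy gradient step then pushes against.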
³ For MMD with kernel $k$, $\mathcal{F} = \{ w^\top \phi(s,a) \mid \|w\|_2 \le 1 \}$, where $\phi$ satisfies $\langle \phi(s,a), \phi(s',a') \rangle = k((s,a),(s',a'))$.



2.4.2 Exploration And Imitation Tradeoff

We note that MobILE is performing an automatic trade-off between exploration and

imitation. More specifically, the bonus is designed such that it has high values in regions of the state space that have not been visited, and low values in regions that have been frequently visited by the sequence of learned policies so far. Thus, by incorporating the

bonus into the discriminator f ∈ F (e.g., f̃ (s, a) = f (s)−bt(s, a)), we diminish the power

of discriminator f at novel state-action space regions, which relaxes the state-matching

constraint (as the bonus cancels the penalty from the discriminators) at those novel

regions so that exploration is encouraged. For well explored states, we force the learner’s

states to match the expert’s using the full power of the discriminators. Our work uses

optimism (via coupling bonus and discriminators) to carefully balance imitation and

exploration.

2.5 Analysis

This section presents a general theorem for MobILE that uses the notion of information gain (Srinivas et al., 2009), and then specializes this result to common classes of stochastic MDPs such as discrete (tabular) MDPs, the Kernelized Nonlinear Regulator (Kakade et al., 2020c), and general function classes with bounded Eluder dimension (Russo and Roy, 2013).

Recall that Algorithm 2 generates one state-action trajectory $\tau^t := \{s^t_h, a^t_h\}_{h=0}^{H}$ at iteration $t$ and estimates the model $\hat{P}_t$ based on $\mathcal{D}_t = \tau^0, \dots, \tau^{t-1}$. We present our theorem under the assumption that model fitting gives us a model $\hat{P}$ and a confidence interval of the model’s prediction.



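Assumption 2 (Calibrated Model). For every iteration $t \in \mathbb{N}$, with probability $1 - \delta$, we have a model $\hat{P}_t$ and its associated uncertainty measure $\sigma_t : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}^+$, such that for all $(s, a) \in \mathcal{S} \times \mathcal{A}$,⁴

\[
\big\| \hat{P}_t(\cdot|s,a) - P^\star(\cdot|s,a) \big\|_1 \le \min\{ \sigma_t(s,a), 2 \}.
\]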

Assumption 2 has featured in prior works (e.g., Curi et al. (2020)) to prove regret

bounds in model-based RL. Below we demonstrate examples that satisfy the above

assumption.

Example 1 (Discrete MDPs). Given $\mathcal{D}_t$, denote $N(s,a)$ as the number of times $(s,a)$ appears in $\mathcal{D}_t$, and $N(s,a,s')$ as the number of times $(s,a,s')$ appears in $\mathcal{D}_t$. We can set $\hat{P}_t(s'|s,a) = N(s,a,s')/N(s,a)$ for all $s,a,s'$, and $\sigma_t(s,a) = \tilde{O}\big(\sqrt{S/N(s,a)}\big)$.
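A minimal Python sketch of Example 1 follows. The constant 1 and the missing logarithmic factor in the uncertainty are simplifying assumptions (the example only fixes the rate up to such factors), and the handling of unvisited pairs (uniform prediction, uncertainty 2) is our own choice, motivated by the cap in Assumption 2.

```python
from collections import Counter
import math

def fit_tabular_model(transitions, num_states):
    """transitions: list of hashable (s, a, s_next) tuples collected so far."""
    n_sa = Counter((s, a) for s, a, _ in transitions)
    n_sas = Counter(transitions)

    def p_hat(s_next, s, a):
        n = n_sa[(s, a)]
        # Empirical frequency; uniform guess for unvisited pairs (assumption).
        return n_sas[(s, a, s_next)] / n if n > 0 else 1.0 / num_states

    def sigma(s, a):
        n = n_sa[(s, a)]
        # sqrt(S / N(s, a)) capped at 2, the largest possible L1 distance.
        return min(math.sqrt(num_states / n), 2.0) if n > 0 else 2.0

    return p_hat, sigma
```

The uncertainty shrinks as a state-action pair is visited more often, which is what drives the bonus toward zero in well-explored regions.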

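Example 2 (KNRs (Kakade et al., 2020c)). For KNR, we have $P^\star(\cdot|s,a) = \mathcal{N}\big(W^\star \phi(s,a), \sigma^2 I\big)$, where the feature mapping $\phi(s,a) \in \mathbb{R}^d$ and $\|\phi(s,a)\|_2 \le 1$ for all $s, a$.⁵ We can learn $\hat{P}_t$ via Kernel Ridge regression, i.e., $\hat{g}_t(s,a) = \hat{W}_t \phi(s,a)$ where

\[
\hat{W}_t = \arg\min_{W} \sum_{s,a,s' \in \mathcal{D}_t} \|W\phi(s,a) - s'\|_2^2 + \lambda \|W\|_F^2,
\]

where $\|\cdot\|_F$ is the Frobenius norm. The uncertainty measure is $\sigma_t(s,a) = \frac{\beta_t}{\sigma} \|\phi(s,a)\|_{\Sigma_t^{-1}}$, with

\[
\beta_t = \Big\{ 2\lambda \|W^\star\|_2^2 + 8\sigma^2 \cdot \big[ d_s \ln(5) + 2\ln(t^2/\delta) + \ln(4) + \ln\big(\det(\Sigma_t)/\det(\lambda I)\big) \big] \Big\}^{1/2},
\]

and

\[
\Sigma_t = \sum_{k=0}^{t-1} \sum_{h=1}^{H-1} \phi(s^k_h, a^k_h)\phi(s^k_h, a^k_h)^\top + \lambda I, \quad \lambda > 0.
\]

See Proposition 34 for more details.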
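The ridge estimate and the elliptical uncertainty of Example 2 can be sketched with NumPy as below. This is an illustration under assumptions: the features are passed as a precomputed matrix, and the confidence multiplier `beta` is supplied by the caller instead of being computed from the expression in the example.

```python
import numpy as np

def fit_knr(phis, next_states, lam=1.0):
    """Ridge regression for W_hat.

    phis: (n, d) array whose rows are features phi(s, a).
    next_states: (n, d_s) array whose rows are observed next states.
    """
    d = phis.shape[1]
    Sigma = phis.T @ phis + lam * np.eye(d)  # regularized feature covariance
    # Normal equations: W_hat^T = Sigma^{-1} Phi^T S'.
    W_hat = np.linalg.solve(Sigma, phis.T @ next_states).T  # shape (d_s, d)
    return W_hat, Sigma

def knr_uncertainty(phi, Sigma, beta, noise_std):
    """(beta / sigma) * ||phi||_{Sigma^{-1}} for a single feature vector phi."""
    return (beta / noise_std) * np.sqrt(phi @ np.linalg.solve(Sigma, phi))
```

Directions of feature space that have been observed often have a large Sigma and hence a small uncertainty.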

Similar to RKHS, Gaussian processes (GPs) offer a calibrated model (Srinivas et al., 2009). Note that GPs offer similar regret bounds as RKHS; so we do not discuss GPs and instead refer readers to Curi et al. (2020).
⁴ The uncertainty measure $\sigma_t(s,a)$ will depend on the input failure probability $\delta$, which we drop here for notational simplicity. When we introduce specific examples, we will be explicit about the dependence on the failure probability $\delta$, which usually is on the order of $\ln(1/\delta)$.

⁵ The covariance matrix can be generalized to any PSD matrix with bounded condition number.



Example 3 (General class $\mathcal{G}$). In this case, assume we have $P^\star(\cdot|s,a) = \mathcal{N}(g^\star(s,a), \sigma^2 I)$ with $g^\star \in \mathcal{G}$. Assume $\mathcal{G}$ is discrete (but could be exponentially large, with complexity measure $\ln(|\mathcal{G}|)$), and $\sup_{g \in \mathcal{G}, s, a} \|g(s,a)\|_2 \le G \in \mathbb{R}^+$. Suppose the model learning step is done by least squares:

\[
\hat{g}_t = \arg\min_{g \in \mathcal{G}} \sum_{k=0}^{t-1} \sum_{h=0}^{H-1} \big\| g(s^k_h, a^k_h) - s^k_{h+1} \big\|_2^2.
\]

Compute a version space

\[
\mathcal{G}_t = \Big\{ g \in \mathcal{G} : \sum_{k=0}^{t-1} \sum_{h=0}^{H-1} \big\| g(s^k_h, a^k_h) - \hat{g}_t(s^k_h, a^k_h) \big\|_2^2 \le z_t \Big\},
\]

where $z_t = 2\sigma^2 G^2 \ln(2t^2|\mathcal{G}|/\delta)$, and use it for uncertainty computation. In particular, set the uncertainty $\sigma_t(s,a) = \frac{1}{\sigma} \max_{g_1, g_2 \in \mathcal{G}_t} \|g_1(s,a) - g_2(s,a)\|_2$, i.e., the maximum disagreement between any two functions in the version space $\mathcal{G}_t$. Refer to Proposition 36 for more details.

The maximum disagreement above motivates our practical implementation where

we use an ensemble of neural networks to approximate the version space and use the

maximum disagreement among the models’ predictions as the bonus. We refer readers to

Section 2.7 for more details.

2.5.1 Regret Bound

We bound regret with the quantity named Information Gain I (up to some constant

scaling factor) (Srinivas et al., 2009):

\[
I_T := \max_{\mathrm{Alg}} \mathbb{E}_{\mathrm{Alg}} \Big[ \sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \min\big\{ \sigma_t^2(s^t_h, a^t_h), 1 \big\} \Big], \tag{2.3}
\]

where Alg is any adaptive algorithm (thus including Algorithm 2) that maps from history

before iteration t to some policy πt ∈ Π. After the main theorem, we give concrete

examples for IT where we show that IT has extremely mild growth rate with respect



to T (i.e., logarithmic). Denote Vπ as the expected total cost of π under the true cost

function c and the real dynamics P⋆.

Theorem 3 (Main result). Assume model learning is calibrated (i.e., Assumption 2 holds for all $t$) and Assumption 1 holds. In Algorithm 2, set the bonus $b_t(s,a) := H \min\{\sigma_t(s,a), 2\}$. There exists a set of parameters such that after running Algorithm 2 for $T$ iterations, we have:

\[
\mathbb{E}\Big[ \min_{t \in [0, \dots, T-1]} V^{\pi_t} - V^{\pi^e} \Big] \le O\bigg( \frac{H^{2.5}\sqrt{I_T}}{\sqrt{T}} + H\sqrt{\frac{\ln(TH|\mathcal{F}|)}{N}} \bigg).
\]
Appendix A.1 contains the proof of Theorem 3. This theorem indicates that as long as IT grows sublinearly, i.e., IT = o(T), we find a policy that is at least as good as the expert policy when T and N approach infinity. For any discrete MDP, KNR (Kakade et al., 2020c), Gaussian Process models (Srinivas et al., 2009), and general G with bounded Eluder dimension (Russo and Van Roy, 2014; Osband and Van Roy, 2014), we can show that the growth rate of IT with respect to T is mild.

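Corollary 4 (Discrete MDP). For discrete MDPs, $I_T = \tilde{O}(HS^2A)$, where $S = |\mathcal{S}|$ and $A = |\mathcal{A}|$. Thus:

\[
\mathbb{E}\Big[ \min_{t \in [0, \dots, T-1]} V^{\pi_t} - V^{\pi^e} \Big] = \tilde{O}\bigg( \frac{H^3 S\sqrt{A}}{\sqrt{T}} + H\sqrt{\frac{\ln(|\mathcal{F}|)}{N}} \bigg).
\]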
Note that Corollary 4 (proof in Appendix A.1.1) holds for any MDP (not just injective MDPs) and any stochastic expert policy. The dependence on A, T is tight (see the lower bound in Section 2.5.2). Now we specialize Theorem 3 to continuous MDPs below.

Corollary 5 (KNRs (Example 2)). For simplicity, consider the finite dimension setting $\phi : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}^d$. We can show that $I_T = \tilde{O}\big( Hd + Hdd_s + Hd^2 \big)$ (see Proposition 35 for details), where $d$ is the dimension of the feature $\phi(s,a)$ and $d_s$ is the dimension of the state space. Thus, we have⁶

\[
\mathbb{E}\Big[ \min_{t \in [0, \dots, T-1]} V^{\pi_t} - V^{\pi^e} \Big] = \tilde{O}\bigg( \frac{H^3 \sqrt{dd_s + d^2}}{\sqrt{T}} + H\sqrt{\frac{\ln(|\mathcal{F}|)}{N}} \bigg).
\]
Corollary 6 (General $\mathcal{G}$ with bounded Eluder dimension (Example 3)). For general $\mathcal{G}$, assume that $\mathcal{G}$ has Eluder dimension $d_E(\epsilon)$ (Definition 3 in Osband and Van Roy (2014)). Denote $d_E = d_E(1/TH)$. The information gain is upper bounded as $I_T = O\big( Hd_E + d_E \ln(T^3H|\mathcal{G}|) \ln(TH) \big)$ (see Proposition 38). Thus,

\[
\mathbb{E}\Big[ \min_{t \in [0, \dots, T-1]} V^{\pi_t} - V^{\pi^e} \Big] = \tilde{O}\bigg( \frac{H^3 \sqrt{d_E \ln(TH|\mathcal{G}|)}}{\sqrt{T}} + H\sqrt{\frac{\ln(|\mathcal{F}|)}{N}} \bigg).
\]

Thus, as long as $\mathcal{G}$ has bounded complexity in terms of the Eluder dimension (Russo and Van Roy, 2014; Osband and Van Roy, 2014), MobILE with the maximum disagreement-based optimism leads to near-optimal guarantees.

2.5.2 Exploration in ILFO and the Exponential Gap between IL and ILFO

To show the benefit of strategic exploration over random exploration in ILFO, we present a novel reduction of the ILFO problem to a bandit optimization problem, for which strategic exploration is known to be necessary (Bubeck and Cesa-Bianchi, 2012) for optimal bounds while random exploration is suboptimal; this reduction indicates the benefit of strategic exploration for solving ILFO efficiently. This reduction also demonstrates that there exists an exponential gap in terms of sample complexity between ILFO and classic IL that has access to expert actions. We leave the details of the reduction framework to Appendix A.1.4. The reduction allows us to derive the following lower bound for any ILFO algorithm.

⁶ We use $\tilde{O}$ to suppress log terms except $\ln(|\mathcal{G}|)$ and $\ln(|\mathcal{F}|)$, which represent the complexity of $\mathcal{G}$ and $\mathcal{F}$.



Theorem 7. There exists an MDP with number of actions $A \ge 2$, such that even with infinitely many expert data, any ILFO algorithm must incur expected cumulative regret $\Omega(\sqrt{AT})$.

Specifically, we rely on the following reduction where solving ILFO, with even infinite expert data, is at least as hard as solving an MAB problem with the known optimal arm’s mean reward, which itself incurs the same worst case $\sqrt{AT}$ cumulative regret bound as the one in the classic MAB setting. For MAB, it is known that random exploration such as ϵ-greedy will incur suboptimal regret $O(T^{2/3})$. Thus, to achieve the optimal $\sqrt{T}$ rate, one needs to leverage strategic exploration (e.g., optimism).
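The contrast between random and strategic exploration can be tried out on a small Bernoulli bandit. The sketch below is illustrative only: it uses the standard UCB1 index as one example of optimism, a fixed ϵ for ϵ-greedy (a value of T^(-1/3) is one common tuning, assumed in the usage note), and hypothetical function names. It measures pseudo-regret, i.e., the summed gap between the best arm's mean and the chosen arm's mean.

```python
import math
import random

def run_bandit(means, horizon, choose, seed=0):
    """Play a Bernoulli bandit for `horizon` rounds and return the pseudo-regret."""
    rng = random.Random(seed)
    counts = [0] * len(means)
    sums = [0.0] * len(means)
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        arm = choose(t, counts, sums, rng)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret

def epsilon_greedy(eps):
    """Random exploration: uniform arm with probability eps, else greedy."""
    def choose(t, counts, sums, rng):
        if 0 in counts or rng.random() < eps:
            return rng.randrange(len(counts))
        return max(range(len(counts)), key=lambda i: sums[i] / counts[i])
    return choose

def ucb1(t, counts, sums, rng):
    """Strategic exploration: pick the arm with the largest optimistic index."""
    for i, n in enumerate(counts):
        if n == 0:
            return i  # pull every arm once before using the index
    return max(
        range(len(counts)),
        key=lambda i: sums[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i]),
    )
```

For example, `run_bandit([0.2, 0.8], T, ucb1)` and `run_bandit([0.2, 0.8], T, epsilon_greedy(T ** (-1 / 3)))` can be compared for growing T to observe the different regret growth empirically.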

Methods such as BC for IL have sample complexity that scales as poly ln(A), e.g., see (Agarwal et al., 2019, Theorem 14.3, Chapter 14), which shows that for tabular MDPs, BC learns a policy whose performance is $O(H^2\sqrt{S\ln(A)/N})$ away from the expert’s performance (here S is the number of states in the tabular MDP). Similarly, in the interactive IL setting, DAgger (Ross et al., 2011b) can also achieve poly ln(A) dependence in sample complexity. The exponential gap in the sample complexity dependence on A between IL and ILFO formalizes the additional difficulty encountered by learning algorithms in ILFO.

2.6 Practical Instantiation of MobILE

We briefly present a practical instantiation of MobILE’s components, with details in Appendix Section A.3.

Dynamics model learning: We employ Gaussian dynamics models parameterized by an MLP (Rajeswaran et al., 2020; Kidambi et al., 2020a), i.e., $\hat{P}(s,a) := \mathcal{N}(h_\theta(s,a), \sigma^2 I)$, where $h_\theta(s,a) = s + \sigma_{\Delta_s} \cdot \mathrm{MLP}_\theta(s_c, a_c)$ and $\theta$ are the MLP’s trainable parameters. Here $s_c = (s - \mu_s)/\sigma_s$ and $a_c = (a - \mu_a)/\sigma_a$, with $\mu_s, \mu_a$ (and $\sigma_s, \sigma_a$) being the means (and standard deviations) of states and actions in the replay buffer $\mathcal{D}$. Next, for $(s,a,s') \in \mathcal{D}$, $\Delta_s = s' - s$ and $\sigma_{\Delta_s}$ is the standard deviation of the state differences $\Delta_s$ in $\mathcal{D}$. We use SGD with momentum (Sutskever et al., 2013) for training the parameters $\theta$ of the MLP.
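The normalized residual parameterization above can be sketched with NumPy as follows. The helper names, the small `eps` added to standard deviations to avoid division by zero, and the use of an arbitrary callable in place of the trained MLP are all assumptions made for illustration.

```python
import numpy as np

def buffer_stats(states, actions, next_states, eps=1e-8):
    """Normalization statistics of a replay buffer stored as (n, dim) arrays."""
    delta = next_states - states
    return {
        "mu_s": states.mean(axis=0), "sigma_s": states.std(axis=0) + eps,
        "mu_a": actions.mean(axis=0), "sigma_a": actions.std(axis=0) + eps,
        "sigma_delta": delta.std(axis=0) + eps,
    }

def predict_mean(s, a, stats, mlp):
    """Mean next state: s + sigma_delta * MLP(s_c, a_c) for a single (s, a)."""
    s_c = (s - stats["mu_s"]) / stats["sigma_s"]  # centered, scaled state
    a_c = (a - stats["mu_a"]) / stats["sigma_a"]  # centered, scaled action
    return s + stats["sigma_delta"] * mlp(np.concatenate([s_c, a_c]))
```

Predicting a scaled state difference instead of the raw next state keeps the regression targets at a comparable scale across state dimensions.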

Discriminator parameterization: We utilize MMD as our choice of IPM and define the discriminator as f (s) = w⊤ψ(s), where ψ(s) are Random Fourier Features (Rahimi and Recht, 2008a).

Bonus parameterization: We utilize the discrepancy between predictions of a pair of dynamics models $h_{\theta_1}(s,a)$ and $h_{\theta_2}(s,a)$ for designing the bonus. Empirically, we found that using more than two models in the ensemble offered little to no improvement. Denote the disagreement at any $(s,a)$ as $\delta(s,a) = \| h_{\theta_1}(s,a) - h_{\theta_2}(s,a) \|_2$, and let $\delta_{\mathcal{D}} = \max_{(s,a) \sim \mathcal{D}} \delta(s,a)$ be the max discrepancy over the replay buffer $\mathcal{D}$. We set the bonus as $b(s,a) = \lambda \cdot \min(\delta(s,a)/\delta_{\mathcal{D}}, 1)$, where $\lambda > 0$ is a tunable parameter.
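A minimal Python sketch of this bonus is given below. The function name is hypothetical, the two models are assumed to be callables returning next-state predictions as sequences of floats, and capping the normalized disagreement at 1 is our reading of the bonus formula.

```python
import math

def make_bonus(h1, h2, replay_buffer, lam):
    """Build b(s, a) = lam * min(delta(s, a) / delta_D, 1) from two models.

    replay_buffer: list of (s, a) pairs used to compute the normalizer delta_D.
    """
    def disagreement(s, a):
        return math.dist(h1(s, a), h2(s, a))

    # Largest disagreement seen on the replay buffer.
    delta_max = max(disagreement(s, a) for s, a in replay_buffer)

    def bonus(s, a):
        if delta_max == 0.0:
            return 0.0  # the models agree everywhere on the buffer
        return lam * min(disagreement(s, a) / delta_max, 1.0)

    return bonus
```

Normalizing by the buffer maximum keeps the bonus within [0, λ], so λ alone controls how strongly exploration is weighted against the discriminator cost.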

PG oracle: We use TRPO (Schulman et al., 2015b) to perform incremental policy optimization inside the learned model.

2.7 Experiments

This section seeks to answer the following questions: (1) How does MobILE compare against other benchmark algorithms? (2) How does optimism impact sample efficiency and final performance? (3) How does increasing the number of expert samples impact the quality of the policy output by MobILE?

We consider tasks from OpenAI Gym (Brockman et al., 2016a) simulated with MuJoCo (Todorov et al., 2012a): CartPole-v1, Reacher-v2, Swimmer-v2, Hopper-v2 and Walker2d-v2. We train an expert for each task using TRPO (Schulman et al., 2015b) until we obtain an expert policy of average value 460, −10, 38, 3000, 2000 respectively. We set up Swimmer-v2, Hopper-v2, and Walker2d-v2 similarly to prior model-based RL works (Kurutach et al., 2018; Nagabandi et al., 2018; Luo et al., 2018; Rajeswaran et al., 2020; Kidambi et al., 2020a).

[Figure 2.2 panels: learning curves of Return (Value) versus Online Samples for CartPole-v1 (10 traj.), Reacher-v2 (10 traj.), Swimmer-v2 (40 traj.), Hopper-v2 (10 traj.), and Walker2d-v2 (10 traj.), plus a bar plot of Normalized Score per environment. Legend: MobILE (Ours), BC, Expert, GAIL, BC-O, GAIFO, ILPO.]

Figure 2.2: Comparing MobILE (red) against BC (orange), BC-O (green), GAIL (purple), GAIFO (periwinkle), ILPO (green olive). The learning curves are obtained by averaging all algorithms over 5 seeds. MobILE outperforms BC-O, GAIL and matches BC’s behavior despite MobILE not having access to expert actions. The bar plot (bottom-right) presents the best performing policy output by each algorithm, averaged across 5 seeds for each algorithm. MobILE clearly outperforms BC-O, GAIFO, ILPO while matching the behavior of IL algorithms like BC/GAIL which use expert actions.

We compare MobILE against the following algorithms: Behavior Cloning (BC),

GAIL (Ho and Ermon, 2016b), BC-O (Torabi et al., 2018a), ILPO (Edwards et al., 2019)

(for environments with discrete actions), GAIFO (Torabi et al., 2018b). Furthermore,

recall that BC and GAIL utilize both expert states and actions, information that is not

available for ILFO. This makes both BC and GAIL idealistic targets for comparing ILFO

methods like MobILE against. As reported by Torabi et al. (2018a), BC outperforms BC-O in all benchmark results. Moreover, our results indicate MobILE outperforms GAIL and GAIFO in terms of sample efficiency. With a reasonable amount of parameter tuning, BC serves as a very strong baseline and nearly solves deterministic MuJoCo environments. We use code released by the authors for BC-O and ILPO. For

GAIL we use an open source implementation (Hill et al., 2018), and for GAIFO, we

modify the GAIL implementation as described by the authors. We present our results

through (a) learning curves obtained by averaging the progress of the algorithm across 5

seeds, and, (b) bar plot showing expert normalized scores averaged across 5 seeds using

the best performing policy obtained with each seed. Normalized score refers to the ratio of the policy’s score over the expert score (so that the expert has a normalized score of 1). For Reacher-v2, since the expert policy has a negative score, we add a constant before normalization. More details can be found in Appendix A.3.

2.7.1 Benchmarking MobILE on MuJoCo suite

Figure 2.2 compares MobILE with BC, BC-O, GAIL, GAIFO and ILPO. MobILE consistently matches or exceeds BC/GAIL’s performance despite BC/GAIL having access to actions taken by the expert and MobILE functioning without expert action information. MobILE also consistently improves upon the behavior of ILFO methods such as BC-O, ILPO, and GAIFO. We see that BC does remarkably well in these benchmarks owing to determinism in the transition dynamics; in the appendix, we consider a variant of the cartpole environment with stochastic dynamics. Our results suggest that BC struggles with stochasticity in the dynamics and fails to solve this task, while MobILE continues to reliably solve it. Also, note that we utilize 10 expert trajectories for all environments except Swimmer-v2, where we use 40 (see Figure 2.2) because all algorithms (including MobILE) present results with high variance there. We include a learning curve for Swimmer-v2 with 10 expert trajectories in the appendix. The bar plot in Figure 2.2 shows that within the sample budget shown in the learning curves, MobILE (being a model-based algorithm) presents superior performance in terms of matching the expert, thus indicating it is more sample efficient than GAIFO, GAIL (both being model-free methods), ILPO and BC-O.

[Figure 2.3 panels: Return (Value) versus # Online Samples for CartPole-v1 (10 traj.), Reacher-v2 (10 traj.), Swimmer-v2 (40 traj.), Hopper-v2 (10 traj.), and Walker2d-v2 (10 traj.). Legend: Expert, With optimism, No optimism.]

Figure 2.3: Learning curves obtained by running MobILE with (red) and without (green) optimism. Without optimism, the algorithm learns slowly or does not match the expert, whereas, with optimism, MobILE shows improved behavior by automatically trading off exploration and imitation.

2.7.2 Importance of the optimistic MDP construction

Figure 2.3 presents results obtained by running MobILE with and without optimism. In

the absence of optimism, the algorithm either tends to be sample inefficient in achieving

expert performance or completely fails to solve the problem. Note that without optimism,

the algorithm isn’t explicitly incentivized to explore – only implicitly exploring due to

noise induced by sampling actions. This, however, is not sufficient to solve the problem

efficiently. In contrast, MobILE with optimism presents improved behavior and in most

cases, solves the environments with fewer online interactions.

2.7.3 Varying Number of Expert Samples

Table 2.1 shows the impact of increasing the number of samples drawn from the expert policy for solving the ILFO problem. The main takeaway is that increasing the number of expert samples aids MobILE in reliably solving the problem (i.e., with less variance).

Table 2.1: Expert normalized score and standard deviation of the policy output by MobILE when varying the number of expert trajectories as E1 and E2 (specific values in parentheses).

Environment    E1                   E2                    Expert
Cartpole-v1    1.07 ± 0.15 (5)      1.14 ± 0 (10)         1 ± 0.25
Reacher-v2     1.01 ± 0.05 (10)     0.997 ± 0.055 (20)    1 ± 0.11
Swimmer-v2     1.54 ± 1.1 (10)      1.25 ± 0.15 (40)      1 ± 0.05
Hopper-v2      1.11 ± 0.064 (10)    1.16 ± 0.03 (40)      1 ± 0.16
Walker2d-v2    0.975 ± 0.12 (10)    0.94 ± 0.038 (50)     1 ± 0.25

2.8 Conclusion

MobILE is a model-based ILFO approach that is applicable to MDPs with stochastic dynamics and continuous action spaces. MobILE trades off exploration and imitation, and this perspective is shown to be important for solving ILFO efficiently both in theory and in practice. Future work includes exploring other means for learning dynamics models, performing strategic exploration, and extending MobILE to problems with rich observation spaces (e.g., videos).

By not even needing the actions to imitate, ILFO algorithms allow learning algorithms to capitalize on large amounts of video data available online. Moreover, in ILFO, the learner is successful if it learns to imitate the expert. Any expert policy designed by bad actors can naturally lead to new policies that continue to imitate it and are a negative influence on society. With this perspective in mind, any expert policy must be thoroughly vetted in order to ensure ILFO algorithms including MobILE are employed in ways that benefit society.



CHAPTER 3

MODEL-BASED OFFLINE IMITATION LEARNING

This chapter studies offline Imitation Learning (IL) where an agent learns to imitate

an expert demonstrator without additional online environment interactions. Instead, the

learner is presented with a static offline dataset of state-action-next state transition triples

from a potentially less proficient behavior policy. We introduce Model-based IL from

Offline data (MILO): an algorithmic framework that utilizes the static dataset to solve

the offline IL problem efficiently both in theory and in practice. In theory, even if the

behavior policy is highly sub-optimal compared to the expert, we show that as long as

the data from the behavior policy provides sufficient coverage on the expert state-action

traces (and with no necessity for a global coverage over the entire state-action space),

MILO can provably combat the covariate shift issue in IL. Complementing our theory

results, we also demonstrate that a practical implementation of our approach mitigates

covariate shift on benchmark MuJoCo continuous control tasks. We demonstrate that with

behavior policies whose performances are less than half of that of the expert, MILO still

successfully imitates with an extremely low number of expert state-action pairs while traditional offline IL methods such as behavior cloning (BC) fail completely. Source code is provided at https://github.com/jdchang1/milo.



3.1 Introduction

Covariate shift is a core issue in Imitation Learning (IL). Traditional IL methods like

behavior cloning (BC) (Pomerlau, 1989), while simple, suffer from covariate shift,

learning a policy that can make arbitrary mistakes in parts of the state space not covered

by the expert dataset. This leads to compounding errors in the agent’s performance (Ross

and Bagnell, 2010b), hurting the generalization capabilities in practice.

Figure 3.1: (Left) Frames at timesteps 200, 400, 600, 800, and 1000 for Humanoid-v2 from policies trained with BC on 100 state-action pairs from the expert (blue), BC on 1M offline samples plus 100 expert samples (yellow), and our algorithm MILO (red). The expert has a performance of 3248 and the behavior policy used to collect the offline dataset has a performance of 1505 ± 473 (≈ 46% of the expert’s). (Right) Expert performance normalized scores averaged across 5 seeds.

Prior works have presented several means to combat this phenomenon in IL. One line

of thought utilizes an interactive expert, i.e. an expert that can be queried at an arbitrary

state encountered during the training procedure. Interactive IL algorithms such as DAgger

(Ross et al., 2011c), LOLS (Chang et al., 2015a), and AggreVaTe(D) (Ross and Bagnell,

2014a; Sun et al., 2017b) utilize a reduction to no-regret online learning and demonstrate

that under certain conditions, they can successfully learn a policy that imitates the expert.

These interactive IL algorithms, however, cannot provably avoid covariate shift if the



expert is not recoverable. That is, $A^{\pi^e}(s,a) = \Theta(H)$, where $\pi^e$ is the expert, $A^{\pi}$ is the usual (dis)advantage function,¹ and $H$ is the planning horizon (Rajaraman et al., 2020; Agarwal

et al., 2019, Chapter 14). A second line of work that avoids covariate shift utilizes either

a known transition dynamics model (Ziebart et al., 2008) or uses real world interactions

(Ho and Ermon, 2016a; Brantley et al., 2019; Sun et al., 2019d; Kostrikov et al., 2019c;

Reddy et al., 2020; Kidambi et al., 2021). Prior works have shown that with known

transition dynamics or real world interactions, agents can provably avoid covariate shift

in both tabular and general MDPs (Agarwal et al., 2019; Rajaraman et al., 2020) even

without a recoverable expert. While these results offer strong theoretical guarantees and

empirical performance, online interactions are often costly and prohibitive for real world

applications where active trial-and-error exploration in the environment could be unsafe

or impossible. A third perspective towards addressing this issue is to assume that the

expert visits the entire state space (Spencer et al., 2021), where the expert effectively

informs the learner what actions to take in every state. Unfortunately, such a full coverage

expert distribution might be rare and holds only for special MDPs and expert policies

(e.g., an expert that induces ergodicity in the MDP).

Here, we consider a new perspective towards handling the covariate shift issue in IL.

In particular, we investigate a pure offline learning setting where the learner has access to

neither the expert nor the environment for additional interactions. The learner, instead,

has access to a small pre-collected dataset of state-action pairs sampled from the expert

and a large batch offline dataset of state-action-next state transition triples sampled from a

behavior policy that could be highly sub-optimal (see Figure 3.1 where BC on the offline

data results in a low-quality policy). Unlike prior works that require online interactions,

our proposed method, MILO, performs high fidelity imitation in an offline, data-driven

manner. Moreover, different from interactive IL, we do not require the expert to be

¹ Here, we use cost instead of reward; thus we call Aπ the disadvantage function.



present during learning, significantly relieving the expert’s burden. Finally, in contrast to the prior work (Spencer et al., 2021) that assumes the expert distribution covers the entire state space (i.e., $\max_{\pi} \max_{s,a} d^{\pi}(s,a)/d^{\pi^e}(s,a) < \infty$, where $d^{\pi}$ denotes the state-action distribution of policy $\pi$), we require the offline dataset to provide only partial coverage, i.e., it only needs to cover expert state-action pairs (i.e., $\max_{s,a} d^{\pi^e}(s,a)/\rho(s,a) < \infty$, where $\rho$ is the offline distribution of some behavior policy).²
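For a small tabular problem, the partial coverage condition can be checked directly. The sketch below is illustrative: `coverage_ratio` is a hypothetical helper, and both distributions are assumed to be given as dictionaries mapping (state, action) pairs to probabilities.

```python
def coverage_ratio(d_expert, rho):
    """max over (s, a) of d_expert(s, a) / rho(s, a).

    Returns float('inf') if the expert visits a pair that the offline
    distribution never covers.
    """
    worst = 0.0
    for sa, p in d_expert.items():
        if p == 0.0:
            continue  # pairs the expert never visits impose no requirement
        q = rho.get(sa, 0.0)
        if q == 0.0:
            return float("inf")
        worst = max(worst, p / q)
    return worst
```

The ratio stays finite even when the offline data spends most of its mass on pairs the expert never visits, which is the sense in which only partial coverage is needed.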

In summary, we list our main contributions below:

1. We propose Model based Imitation Learning from Offline data, MILO: a model-

based framework that leverages offline batch data with only partial coverage (see

Section 3.5.1 for definition) to overcome covariate shift in IL.

2. Our analysis is modular and covers common models such as discrete MDPs, linear models, and nonparametric models such as GPs. Notably, our result on nonparametric models (e.g., Gaussian Processes) with relative condition number is new even considering all existing results in offline RL: see Remark 14 for a partial coverage result, and Remark 18 for robustness to compete against any comparator policy covered by the offline distribution.

3. The practical instantiation of our general framework leverages neural network model

ensembles, and demonstrates its efficacy on benchmark MuJoCo continuous control

problems. Specifically, even under low-quality behavior policies, our approach

can successfully imitate using an extremely small number of expert samples while

algorithms like BC completely fail (Figure 3.1).
² In our analysis, we refine the density ratio $d^{\pi^e}(s,a)/\rho(s,a)$ via the concept of relative condition number, which allows us to extend it to large MDPs where the ratio is infinite but the relative condition number is finite.



3.2 Related work

Imitation Learning As summarized above, avoiding covariate shift in IL is an important topic. Another relevant line of research is IL algorithms that use offline or off-policy

learning. ValueDICE (Kostrikov et al., 2019c) presents a principled way to leverage

off-policy data for IL. In theory, the techniques from ValueDICE (and more broadly,

DICE (Nachum et al., 2019b; Zhang et al., 2020)) require the data provided to the agent

to have global coverage. Moreover in practice, ValueDICE uses online interaction and

maintains an increasing replay buffer which may eventually provide global coverage.

Instead, we aim to study offline IL without any online interactions and are interested

in the setting where offline data does not have global coverage. Another line of work

(Jarrett et al., 2020; Chan and van der Schaar, 2021) studies IL in an offline setting by

only using the expert dataset. In contrast to these works, our goal is to study the use of

an additional offline dataset collected from a behavior policy to mitigate covariate shift,

as information theoretically any algorithm that relies solely on expert data will still suffer

from covariate shift in the worst case (Rajaraman et al., 2020).

Offline RL In offline RL, algorithms such as FQI (Ernst et al., 2005) have finite-sample

error guarantees under the global coverage (Munos and Szepesvári, 2008; Antos et al.,

2008). Recently, many algorithms to tackle this problem have been proposed from both

model-free (Wu et al., 2019; Touati et al., 2020; Liu et al., 2020; Fujimoto et al., 2019;

Fakoor et al., 2021; Kumar et al., 2020) and model-based perspectives (Yu et al., 2020;

Kidambi et al., 2020b; Matsushima et al., 2020) with some pessimism ideas. The idea

of pessimism features in offline RL with an eye to penalize the learner from visiting

unknown regions of the state-action space (Rashidinejad et al., 2021; Jin et al., 2020b;

Yin et al., 2021; Buckman et al., 2020). We utilize pessimism within the IL context where,



unlike RL, the learner does not have access to an underlying reward signal. A by-product

of our IL analysis is a set of new results for pure offline RL setting where ground truth

cost is given: we expand prior theoretical results from offline RL by (a) replacing the full coverage assumption by a much weaker partial coverage assumption formalized in terms of relative condition number, and (b) providing a new type of robustness guarantee: we

can learn a policy that is comparable to any policy (not necessarily the optimal one) that

is covered by the offline distribution. In other words, as long as there is a high quality

policy being covered by the offline distribution, we can learn a policy that can compete

against it. We also refer readers to Appendix B.3 for a more detailed literature review on

offline RL.

3.3 Setting

We consider an episodic finite-horizon Markov Decision Process (MDP), $M = \{\mathcal{S}, \mathcal{A}, P, H, c, d_0\}$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the MDP's transition, $H$ is the horizon, $d_0$ is an initial distribution, and $c : \mathcal{S} \times \mathcal{A} \to [0, 1]$ is the cost function. A policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ maps from state to distribution over actions. We denote $d^{\pi}_{P} \in \Delta(\mathcal{S} \times \mathcal{A})$ as the average state-action distribution of $\pi$ under transition kernel $P$, that is, $d^{\pi}_{P} = \frac{1}{H} \sum_{t=1}^{H} d^{\pi}_{P,t}$, where $d^{\pi}_{P,t} \in \Delta(\mathcal{S} \times \mathcal{A})$ is the distribution of $(s^{(t)}, a^{(t)})$ under $\pi$ at $t$. Given a cost function $f : \mathcal{S} \times \mathcal{A} \to [0, 1]$, $V^{\pi}_{P,f}$ denotes the expected cumulative cost of $\pi$ under the transition kernel $P$ and cost function $f$. Following a standard IL setting, the ground truth cost function $c$ is unknown. Instead, we have the demonstrations by the expert specified by $\pi_e : \mathcal{S} \to \Delta(\mathcal{A})$ (potentially stochastic and not necessarily optimal). Concretely, we have an expert dataset in the form of i.i.d. tuples $\mathcal{D}_e = \{s_i, a_i\}_{i=1}^{n_e}$ sampled from distribution $d^{\pi_e}_{P}$.
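As an illustration of these definitions, the following is a minimal sketch for a tabular MDP stored as NumPy arrays. The array layout and the function names `avg_state_action_dist` and `expected_cumulative_cost` are assumptions made for this example and are not part of MILO itself. It uses the identity $V^{\pi}_{P,f} = H \cdot \mathbb{E}_{(s,a) \sim d^{\pi}_{P}}[f(s, a)]$, which follows directly from the definition of $d^{\pi}_{P}$.

```python
import numpy as np

def avg_state_action_dist(P, pi, d0, H):
    """Average state-action distribution d^pi_P over H steps.

    P:  (S, A, S) array, P[s, a, s2] = probability of moving to s2.
    pi: (S, A) array, pi[s, a] = probability of action a in state s.
    d0: (S,) initial state distribution.
    """
    state_dist = d0.copy()
    total = np.zeros_like(pi, dtype=float)
    for _ in range(H):
        sa_dist = state_dist[:, None] * pi               # d^pi_{P,t}(s, a)
        total += sa_dist
        state_dist = np.einsum("sa,sat->t", sa_dist, P)  # next-state marginal
    return total / H

def expected_cumulative_cost(P, pi, d0, H, f):
    """V^pi_{P,f} = H * E_{(s,a) ~ d^pi_P}[f(s, a)] for a cost table f of shape (S, A)."""
    return H * np.sum(avg_state_action_dist(P, pi, d0, H) * f)
```

The returned array always sums to one, since each per-step distribution does.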

In our setting, we also have an offline static dataset consisting of i.i.d. tuples $\mathcal{D}_o = \{s_i, a_i, s'_i\}_{i=1}^{n_o}$ s.t. $(s, a) \sim \rho(s, a)$, $s' \sim P(\cdot \mid s, a)$, where $\rho \in \Delta(\mathcal{S} \times \mathcal{A})$ is an offline distribution resulting from some behavior policies. Note that the behavior policy could be much worse than the expert $\pi_e$.

Our goal is to leverage only $(\mathcal{D}_e + \mathcal{D}_o)$ to learn a policy $\pi$ that performs as well as $\pi_e$ with regard to optimizing the ground truth cost $c$. More specifically, our goal is to utilize the offline static data $\mathcal{D}_o$ to combat covariate shift and learn a policy that can significantly outperform traditional offline IL methods such as behavior cloning (BC), without any interaction with the real world or expert.

Function classes We introduce function approximation. Since we do not know the true cost function $c$ and transition kernel $P$, we introduce a cost function class $\mathcal{F} \subset \mathcal{S} \times \mathcal{A} \to [0, 1]$ and a transition model class $\mathcal{P} \subset \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$. We also need a policy class $\Pi$. For the analysis, we assume realizability:

Assumption 8. $c \in \mathcal{F}$, $P \in \mathcal{P}$, $\pi_e \in \Pi$.

We use the Integral Probability Metric (IPM) as a distribution distance measure, i.e., given two distributions $\rho_1$ and $\rho_2$, the IPM with $\mathcal{F}$ is defined as

$$\max_{f \in \mathcal{F}} \left[ \mathbb{E}_{(s,a) \sim \rho_1}[f(s, a)] - \mathbb{E}_{(s,a) \sim \rho_2}[f(s, a)] \right].$$
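For a finite function class, the IPM can be estimated from samples by direct enumeration. The sketch below assumes each distribution is given as a list of sampled $(s, a)$ pairs and the class as a list of Python callables. The function name `ipm` and this representation are illustrative assumptions.

```python
import numpy as np

def ipm(samples_1, samples_2, function_class):
    """Empirical IPM: max over f of (mean of f on samples_1) - (mean of f on samples_2).

    Returns the maximal gap and the index of the maximizing f (the "witness").
    """
    gaps = [
        np.mean([f(s, a) for s, a in samples_1])
        - np.mean([f(s, a) for s, a in samples_2])
        for f in function_class
    ]
    best = int(np.argmax(gaps))
    return gaps[best], best
```

The maximizing function plays the role of the discriminator in the min-max objective of the next section.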

3.4 Algorithm

The core idea of MILO is to imitate the expert by optimizing, over the policy class, an IPM distance between the agent and the expert with a penalty term for pessimism. MILO consists of three steps:

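Algorithm 3 Framework for model-based Imitation Learning with offline data (MILO)
1: Require: IPM class $\mathcal{F}$, model class $\mathcal{P}$, policy class $\Pi$, datasets $\mathcal{D}_e = \{s, a\}$, $\mathcal{D}_o := \{s, a, s'\}$
2: Train Dynamics Model and Bonus: $\hat{P} : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ and $b : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^{+}$ on offline data $\mathcal{D}_o$
3: Pessimistic model-based min-max IL: with $\hat{P}$, $b$, $\mathcal{D}_e$, obtain $\hat{\pi}_{IL}$ by solving the following:

$$\hat{\pi}_{IL} = \operatorname*{argmin}_{\pi \in \Pi} \max_{f \in \mathcal{F}} \left[ \mathbb{E}_{(s,a) \sim d^{\pi}_{\hat{P}}} \left[ f(s, a) + b(s, a) \right] - \mathbb{E}_{(s,a) \sim \mathcal{D}_e}[f(s, a)] \right] \tag{3.1}$$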

1. Model learning: fit a model $\hat{P}$ from the offline data $\mathcal{D}_o$ to learn $P$.

2. Pessimistic penalty design: construct a penalty function $b(s, a)$ such that there is a high penalty on state-action pairs that are not covered by the offline data distribution $\rho$.

3. Offline min-max model-based policy optimization: optimize Eq. (3.1).

Algorithm 3 provides the details of MILO. We explain each component in detail as follows.

Model learning and Penalty: Our framework assumes we can learn a calibrated model $(\hat{P}, \sigma)$ from the dataset $\mathcal{D}_o$, in the sense that for any $s, a$, we have $\| \hat{P}(\cdot \mid s, a) - P(\cdot \mid s, a) \|_1 \le \min\{2, \sigma(s, a)\}$. Such model training is possible in many settings including classic discrete MDPs, linear models (KNR (Kakade et al., 2020b)), and non-parametric models such as GP. In practice, it is also common to train a model ensemble based on the idea of bootstrapping and then use the model disagreement to approximate $\sigma$. With such a calibrated model, the bonus will simply be $b(s, a) = O(H \sigma(s, a))$. We will formalize this model learning assumption in Section 3.5. We give several examples below.

For any discrete MDP, we use the empirical distribution, i.e., $\hat{P}(s' \mid s, a) = N(s', s, a) / (N(s, a) + \lambda)$, where $N(s, a)$ is the number of $(s, a)$ in $\mathcal{D}_o$, $N(s', s, a)$ is the number of $(s, a, s')$ in $\mathcal{D}_o$, and $\lambda \in \mathbb{R}^{+}$. In this case, we can set $\sigma(s, a) = \tilde{O}\left( \sqrt{|\mathcal{S}| / N(s, a)} \right)$.

See Example 4 for more details.
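The count-based model and its uncertainty can be sketched as follows. The code uses the explicit uncertainty measure stated in Example 4 and the bonus $b(s, a) = H \min(\sigma(s, a), 2)$ from Assumption 9. The function names and the list-of-tuples dataset format are assumptions of this sketch. Note that with $\lambda > 0$ each row of $\hat{P}$ sums to slightly less than one.

```python
import numpy as np

def fit_tabular_model(transitions, n_states, n_actions, lam=1.0, delta=0.1):
    """Count-based model P_hat and uncertainty sigma from (s, a, s_next) tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    n_sa = counts.sum(axis=2)                       # N(s, a)
    p_hat = counts / (n_sa[:, :, None] + lam)       # N(s', s, a) / (N(s, a) + lambda)
    log_term = n_states * np.log(2) + np.log(2 * n_states * n_actions / delta)
    sigma = np.sqrt(log_term / (2 * (n_sa + lam))) + lam / (n_sa + lam)
    return p_hat, sigma

def bonus(sigma, horizon):
    """Pessimistic penalty b(s, a) = H * min(sigma(s, a), 2)."""
    return horizon * np.minimum(sigma, 2.0)
```

State-action pairs that never appear in the offline data receive the largest penalty, while frequently visited pairs receive a penalty that decays with their count.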

For the continuous Kernelized Nonlinear Regulator (KNR (Kakade et al., 2020b)) model, where the ground truth transition $P(s' \mid s, a)$ is defined as $s' = W^{\star} \phi(s, a) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \Sigma)$, with $\phi$ being a (nonlinear) feature mapping, we can learn $\hat{P}$ by classic ridge regression on the offline dataset $\mathcal{D}_o$. Here we can set $\sigma(s, a) = \tilde{O}\left( \beta \sqrt{\phi(s, a)^{\top} \Sigma_{n_o}^{-1} \phi(s, a)} \right)$ for some $\beta \in \mathbb{R}^{+}$, where $\Sigma_{n_o}$ is the data covariance matrix $\Sigma_{n_o} := \sum_{i=1}^{n_o} \phi(s_i, a_i) \phi(s_i, a_i)^{\top} + \lambda I$. See Example 5

for more details.
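A minimal sketch of the KNR model fit and its elliptical uncertainty is given below. It assumes the features $\phi(s_i, a_i)$ have already been computed and stacked row-wise, solves the unnormalized ridge problem whose normal equations involve $\Sigma_{n_o}$, and treats $\beta$ as a user-supplied constant. The function names are illustrative.

```python
import numpy as np

def fit_knr(phi, s_next, lam=1.0):
    """Ridge regression for s' = W phi(s, a) + noise.

    phi:    (n, d) feature matrix, one row per offline sample.
    s_next: (n, dS) next states.
    Returns W_hat of shape (dS, d) and Sigma = phi^T phi + lam * I.
    """
    sigma_mat = phi.T @ phi + lam * np.eye(phi.shape[1])
    w_hat = np.linalg.solve(sigma_mat, phi.T @ s_next).T
    return w_hat, sigma_mat

def knr_uncertainty(phi_query, sigma_mat, beta=1.0):
    """beta * sqrt(phi^T Sigma^{-1} phi) for each row of phi_query."""
    solved = np.linalg.solve(sigma_mat, phi_query.T)          # (d, m)
    return beta * np.sqrt(np.sum(phi_query.T * solved, axis=0))
```

Feature directions that the offline data never excites keep an uncertainty of order $\beta / \sqrt{\lambda}$, whereas well-covered directions shrink at rate $1/\sqrt{n_o}$.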

For a non-parametric nonlinear model such as a Gaussian Process (GP), under the assumption that $P$ is in the form of $s' = g^{\star}(s, a) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \Sigma)$ (here $\mathcal{S} \subset \mathbb{R}^{d_{\mathcal{S}}}$), we can simply represent $\hat{P}$ using GP posteriors induced by $\mathcal{D}_o$, i.e., letting the GP posterior be $GP(\hat{g}, k_{n_o})$, we have $\hat{P}(s' \mid s, a)$ being represented as $s' = \hat{g}(s, a) + \epsilon$. Then, we can set $\sigma(s, a) = \tilde{O}\left( \beta k_{n_o}((s, a), (s, a)) \right)$ with some parameter $\beta \in \mathbb{R}^{+}$ (see Example 6 for more details). GPs are powerful models and have been widely used in robotics problems; see (Ko et al., 2007; Deisenroth and Rasmussen, 2011; Bansal et al., 2017; Umlauft et al.,

2018; Fisac et al., 2018) for examples.
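The GP posterior mean and the posterior kernel $k_{n_o}(x, x)$ can be sketched with plain linear algebra, following the formulas of Example 6. The sketch assumes an RBF kernel (so that $k(x, x) = 1$) and a noise level $\zeta$. The kernel choice, the lengthscale, and the function names are assumptions for illustration only.

```python
import numpy as np

def rbf_kernel(x, y, lengthscale=1.0):
    """RBF kernel matrix between rows of x (n, p) and rows of y (m, p)."""
    sq = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2 * x @ y.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * lengthscale**2))

def gp_posterior(x_train, s_next, x_query, zeta=0.1):
    """Posterior mean g_hat(x) and posterior kernel k_n(x, x) at each query point.

    x_train: (n, p) inputs x = (s, a); s_next: (n, dS); x_query: (m, p).
    """
    k_train = rbf_kernel(x_train, x_train) + zeta**2 * np.eye(len(x_train))
    k_cross = rbf_kernel(x_train, x_query)          # column j is k_bar(x_query[j])
    alpha = np.linalg.solve(k_train, k_cross)       # (K + zeta^2 I)^{-1} k_bar
    mean = (s_next.T @ alpha).T                     # (m, dS)
    var = 1.0 - np.sum(k_cross * alpha, axis=0)     # k(x, x) = 1 for the RBF kernel
    return mean, var
```

Query points far from the offline data keep a posterior kernel value close to the prior value of one, which translates into a large penalty there.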

In practice, we can also use a model ensemble of neural networks with the maximum

disagreement between models as σ. This has been widely used in practice (e.g., Osband

et al. (2018b); Azizzadenesheli et al. (2018b); Pathak et al. (2019b)). We leave the

details to Section 3.6 where we instantiate a practical version of MILO, and the experiment

section. As we can see from the examples mentioned above, in general, the penalty

$b(s, a) = O(H \sigma(s, a))$ is designed such that it has a high value in regions of the state-action space that are not covered well by the offline data $\mathcal{D}_o$, and a low value in regions that are covered by $\mathcal{D}_o$. Adding such a penalty automatically forces our policy to stay away from regions where $\hat{P}$ is not accurate. On the other hand, for regions where $\rho$ has good coverage (thus $\hat{P}$ is accurate), we force $\pi$ to stay close to $\pi_e$.
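The ensemble-based penalty can be sketched as the maximum pairwise disagreement between ensemble members, matching the bonus used in Algorithm 4. The sketch assumes the next-state predictions of all models have already been stacked into one array. The function name is hypothetical.

```python
import numpy as np

def ensemble_disagreement(predictions):
    """Maximum pairwise L2 distance between ensemble predictions.

    predictions: (n_models, batch, dS) predicted next states.
    Returns an array of shape (batch,).
    """
    diffs = predictions[:, None] - predictions[None, :]   # (n, n, batch, dS)
    dists = np.linalg.norm(diffs, axis=-1)                # (n, n, batch)
    return dists.max(axis=(0, 1))
```

Where all models agree, the penalty is zero; it grows where members extrapolate differently, which typically happens away from the offline data.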

Pessimistic model-based min-max IL: Note Eq. 3.1 is purely computational, i.e., we

do not need any real world samples. To solve such min-max objective, we can iteratively

(1) perform the best response on the max player, i.e., compute the argmax discriminator

f given the current π, and (2) perform incremental update on the min player, e.g., use

policy gradient (PG) methods (e.g. TRPO) inside the learned model P̂ with cost function

f (s, a) + b(s, a). We again leave the details to Section 3.6.

3.4.1 Specialization to offline RL

In RL, the cost function $c$ is given. The goal is to obtain $\pi^{*} = \operatorname*{argmin}_{\pi \in \Pi} V^{\pi}_{P,c}$. The pessimistic policy optimization procedure (Yu et al., 2021; Jin et al., 2020b) is $\hat{\pi}_{RL} = \operatorname*{argmin}_{\pi \in \Pi} \mathbb{E}_{(s,a) \sim d^{\pi}_{\hat{P}}}[c(s, a) + b(s, a)]$. While this is not our main contribution, we will show that a by-product of our result is a novel non-parametric analysis for offline RL which does not assume $\rho$ has global coverage (see Remarks 11, 14, 18).
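In the tabular case, the pessimistic objective can be minimized exactly by backward induction inside the learned model. The sketch below is a simplification made for illustration: it returns a time-dependent greedy policy rather than searching a stationary policy class $\Pi$, and the function name and array layout are assumptions.

```python
import numpy as np

def pessimistic_planning(p_hat, cost, bonus, horizon):
    """Minimize the expected cumulative (cost + bonus) under the learned model.

    p_hat: (S, A, S) learned transitions; cost, bonus: (S, A) tables.
    Returns a greedy policy of shape (H, S) and the step-0 value of shape (S,).
    """
    n_states = p_hat.shape[0]
    value = np.zeros(n_states)
    policy = np.zeros((horizon, n_states), dtype=int)
    for t in reversed(range(horizon)):
        q = cost + bonus + p_hat @ value   # (S, A) penalized cost-to-go
        policy[t] = q.argmin(axis=1)
        value = q.min(axis=1)
    return policy, value
```

With a large bonus on a poorly covered action, the planner prefers a slightly costlier but well-covered action, which is the intended effect of pessimism.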

3.5 Analysis

Our algorithm depends on the model P̂ estimated from the offline data. We provide a

unified analysis assuming that P̂ is calibrated in that its confidence interval is provided.

Specifically, we assume:

Assumption 9. With probability $1 - \delta$, the estimated model $\hat{P}$ satisfies the following: $\| \hat{P}(\cdot \mid s, a) - P(\cdot \mid s, a) \|_1 \le \min(\sigma(s, a), 2)$ $\forall (s, a) \in \mathcal{S} \times \mathcal{A}$. We set the bonus as $b(s, a) = H \min(\sigma(s, a), 2)$.

We give the following three examples. For details, refer to Appendix B.1.

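Example 4 (Discrete MDPs). Set uncertainty measure

$$\sigma(s, a) = \sqrt{\frac{|\mathcal{S}| \log 2 + \log(2 |\mathcal{S}| |\mathcal{A}| / \delta)}{2 \{ N(s, a) + \lambda \}}} + \frac{\lambda}{N(s, a) + \lambda}.$$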

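Example 5 (KNRs). In KNRs, the ground truth model is $s' = W^{*} \phi(s, a) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \zeta^2 I)$, where $s \in \mathbb{R}^{d_{\mathcal{S}}}$, $a \in \mathbb{R}^{d_{\mathcal{A}}}$, and $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^{d}$ is some known state-action feature mapping. The estimator is

$$\hat{g}(\cdot) = \hat{W} \phi(\cdot), \quad \hat{W} = \operatorname*{argmin}_{W \in \mathbb{R}^{d_{\mathcal{S}} \times d_{\mathcal{A}}}} \frac{1}{n_o} \sum_{(s,a) \in \mathcal{D}_o} \left[ \| W \phi(s, a) - s' \|_2^2 \right] + \lambda \| W \|_F^2,$$

where $\| \cdot \|_F$ is the Frobenius norm. We set the uncertainty measure $\sigma(s, a)$:

$$\sigma(s, a) = (1/\zeta) \beta_{n_o} \sqrt{\phi^{\top}(s, a) \Sigma_{n_o}^{-1} \phi(s, a)}, \quad \Sigma_{n_o} = \sum_{i=1}^{n_o} \phi(s_i, a_i) \phi^{\top}(s_i, a_i) + \lambda I$$

with $\beta_{n_o} = \{ 2 \lambda \| W^{*} \|_2^2 + 8 \zeta^2 [ d_{\mathcal{S}} \log(5) + \log(1/\delta) + \bar{I}_{n_o} ] \}^{1/2}$, where $\bar{I}_{n_o} = \log(\det(\Sigma_{n_o} / \lambda I))$.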

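Example 6 (GPs). In GPs, the ground truth model is defined as $s' = g^{*}(s, a) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \zeta^2 I)$, where $g^{*}$ belongs to an RKHS $\mathcal{H}_k$ with a kernel $k(\cdot, \cdot)$. Denoting $x := (s, a)$, we have the GP posterior

$$\hat{g}(\cdot) = S (K_{n_o} + \zeta^2 I)^{-1} \bar{k}_{n_o}(\cdot), \quad S = [s'_1, \cdots, s'_{n_o}] \in \mathbb{R}^{d_{\mathcal{S}} \times n_o}, \quad \bar{k}_{n_o}(x) = [k(x_1, x), \cdots, k(x_{n_o}, x)]^{\top},$$

$$\{K_{n_o}\}_{i,j} = k(x_i, x_j) \ (1 \le i \le n_o, 1 \le j \le n_o), \quad k_{n_o}(x, x') = k(x, x') - \bar{k}_{n_o}(x)^{\top} (K_{n_o} + \zeta^2 I)^{-1} \bar{k}_{n_o}(x'),$$

with $\sigma(\cdot) = \beta_{n_o} k_{n_o}(\cdot, \cdot) / \zeta$, $\beta_{n_o} = O((d_{\mathcal{S}} \log^3(d_{\mathcal{S}} n_o / \delta) I_{n_o})^{1/2})$, $I_{n_o} = \log(\det(I + \zeta^{-2} K_{n_o}))$.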

General results We show our general error bound results. For the proof, refer to Appendix B.2. For analytical simplicity, we assume $|\mathcal{F}|$ is finite (but the bound only depends on $\ln(|\mathcal{F}|)$).³

³ When $|\mathcal{F}|$ is infinite, we can show that the resulting error bound scales w.r.t. its metric entropy.

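Theorem 10 (Bound of MILO). Suppose Assumptions 8 and 9 hold. Then, with probability $1 - 2\delta$,

$$V^{\hat{\pi}_{IL}}_{P,c} - V^{\pi_e}_{P,c} \le \mathrm{Err}_o + \mathrm{Err}_e,$$

$$\mathrm{Err}_o = 8 H^2 \, \mathbb{E}_{(s,a) \sim d^{\pi_e}_{P}}[\min(\sigma(s, a), 1)], \quad \mathrm{Err}_e = 2 H \sqrt{\frac{\log(2 |\mathcal{F}| / \delta)}{2 n_e}}.$$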

We will show through a set of examples that $\mathbb{E}_{(s,a) \sim d^{\pi_e}_{P}}[\min(\sigma(s, a), 1)]$ shrinks to zero as $n_o \to \infty$ under partial coverage, i.e., when $\rho$ covers $d^{\pi_e}_{P}$. Asymptotically, $\mathrm{Err}_e$ will dominate the bound. Note that $\mathrm{Err}_e$ has two components, a linear factor of $H$ and a term that corresponds to the statistical error related to expert samples and function class complexity. Compared to BC, which has a rate $O(H^2 \sqrt{\log(|\Pi|) / n_e})$ (Agarwal et al., 2019, Chapter 14) for some policy class Π ⊂ X ↦ ∆(A), we see that the horizon dependence

is improved.
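The two error terms of Theorem 10 are simple enough to evaluate numerically, which makes the horizon and sample-size scaling concrete. The helper names below are hypothetical, and the expectation $\mathbb{E}_{(s,a) \sim d^{\pi_e}_{P}}[\min(\sigma(s, a), 1)]$ is assumed to be supplied by the caller (for instance, estimated on expert samples).

```python
import math

def err_expert(horizon, n_expert, class_size, delta):
    """Err_e = 2 H sqrt(log(2 |F| / delta) / (2 n_e))."""
    return 2 * horizon * math.sqrt(math.log(2 * class_size / delta) / (2 * n_expert))

def err_offline(horizon, expected_clipped_sigma):
    """Err_o = 8 H^2 E[min(sigma, 1)], with the expectation taken under d^{pi_e}_P."""
    return 8 * horizon**2 * expected_clipped_sigma
```

Doubling the horizon doubles `err_expert` but quadruples `err_offline`, so the linear-in-$H$ behavior is obtained only once the offline term has become negligible.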

Before analyzing $\mathrm{Err}_o$ in each setting, we highlight two important points in our analysis. First, our bound requires only partial coverage, i.e., it depends on the $\pi_e$-concentrability coefficient, which measures the discrepancy between the offline data and expert data. This is the first work deriving a bound with the $\pi_e$-concentrability coefficient in IL with offline data. Second, our analysis covers non-parametric models. This is a

significant contribution as previous pessimistic offline RL finite-sample error results have

been limited to the finite-dimensional linear models or discrete MDPs (Jin et al., 2020b;

Rashidinejad et al., 2021).

Remark 11 (Implications on offline RL). As in Theorem 10, we have $V^{\hat{\pi}_{RL}}_{P,c} - V^{\tilde{\pi}}_{P,c} = O(H^2 \, \mathbb{E}_{(s,a) \sim d^{\tilde{\pi}}_{P}}[\sigma(s, a)])$ (Appendix B.2) for any comparator policy $\tilde{\pi}$ (not necessarily the optimal one). Note that similar results have been obtained in (Yu et al., 2020; Kidambi et al., 2020b). Since this term is $\mathrm{Err}_o$ with $\pi_e$ replaced by $\tilde{\pi}$, this offline RL result is a by-product of our analysis.

3.5.1 Analysis: Discrete MDPs

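We start from discrete MDPs as a warm-up. Denote $C^{\pi_e} = \max_{(s,a)} d^{\pi_e}_{P}(s, a) / \rho(s, a)$ as the $\pi_e$-concentrability coefficient.

Theorem 12. Suppose $\lambda = \Omega(1)$ and the partial coverage $C^{\pi_e} < \infty$. With probability $1 - \delta$,

$$\mathrm{Err}_o \le c_1 H^2 \left( \sqrt{\frac{C^{\pi_e} |\mathcal{S}|^2 |\mathcal{A}|}{n_o}} + \frac{C^{\pi_e} |\mathcal{S}| |\mathcal{A}|}{n_o} \right) \cdot \log(|\mathcal{S}| |\mathcal{A}| c_2 / \delta),$$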

where c1, c2 are universal constants.

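The error does not depend on $\sup_{\pi \in \Pi} C^{\pi}$ or $\bar{C} = \sup_{\pi \in \Pi} \max_{(s,a)} d^{\pi}_{P}(s, a) / d^{\pi_e}_{P}(s, a)$. We only require the partial coverage $C^{\pi_e} < \infty$, which is much weaker than $\sup_{\pi \in \Pi} C^{\pi} < \infty$ ($\rho$ has global coverage) and $\bar{C} < \infty$ ($d^{\pi_e}_{P}$ has global coverage (Spencer et al., 2021)). When $C^{\pi_e}$ is small and $n_o$ is large enough, $\mathrm{Err}_e = O\left( H \sqrt{|\mathcal{S}| |\mathcal{A}| / n_e} \right)$ dominates $\mathrm{Err}_o$ in Theorem 10. Then, the error is linear in horizon $H$.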

3.5.2 Analysis: KNRs and GPs for Continuous MDPs

Now we move to continuous state-action MDPs. In continuous MDPs, assuming the

boundedness of the density ratio Cπe is still a strong assumption. As we dive into the KNR and the nonparametric GP model, we will replace the density ratio with a more refined concept, the relative condition number.

KNRs Let $\Sigma_{\rho} = \mathbb{E}_{(s,a) \sim \rho}[\phi(s, a) \phi(s, a)^{\top}]$ and $\Sigma_{\pi_e} = \mathbb{E}_{(s,a) \sim d^{\pi_e}_{P}}[\phi(s, a) \phi(s, a)^{\top}]$. We define the relative condition number as $C^{\pi_e} = \sup_{x \in \mathbb{R}^{d}} \left( \frac{x^{\top} \Sigma_{\pi_e} x}{x^{\top} \Sigma_{\rho} x} \right)$. Even when the density ratio is infinite, this number could still be finite as it concerns subspaces on $\phi(s, a)$ rather than the whole $\mathcal{S} \times \mathcal{A}$.

To gain further intuition, we can consider discrete MDPs and the feature mapping $\phi(s, a) \in \mathbb{R}^{|\mathcal{S}| |\mathcal{A}|}$, a one-hot encoding vector that is zero everywhere except for a one at the entry corresponding to the pair $(s, a)$. In this case, the relative condition number reduces to $\max_{(s,a)} d^{\pi_e}_{P}(s, a) / \rho(s, a)$, i.e., the density ratio.
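The relative condition number is the largest generalized eigenvalue of the pair $(\Sigma_{\pi_e}, \Sigma_{\rho})$. The sketch below computes it by whitening with a Cholesky factor, under the additional assumption (made here only for numerical convenience, and not required by the analysis) that $\Sigma_{\rho}$ is positive definite. The function name is hypothetical.

```python
import numpy as np

def relative_condition_number(sigma_expert, sigma_offline):
    """sup_x (x^T Sigma_expert x) / (x^T Sigma_offline x).

    Assumes sigma_offline is symmetric positive definite.
    """
    chol = np.linalg.cholesky(sigma_offline)          # Sigma_offline = L L^T
    half = np.linalg.solve(chol, sigma_expert)        # L^{-1} Sigma_expert
    whitened = np.linalg.solve(chol, half.T)          # L^{-1} Sigma_expert L^{-T}
    return float(np.linalg.eigvalsh(whitened).max())
```

With one-hot features both covariance matrices are diagonal, holding $d^{\pi_e}_{P}$ and $\rho$ respectively, so the function returns exactly the maximum density ratio.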

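Theorem 13 (Error for KNRs). Suppose $\sup_{s,a} \| \phi(s, a) \| \le 1$, $\lambda = \Omega(1)$, $\zeta^2 = \Omega(1)$, $\| W^{*} \|_2 = \Omega(1)$ and the partial coverage $C^{\pi_e} < \infty$. With probability $1 - \delta$,

$$\mathrm{Err}_o \le c_1 H^2 \left( \mathrm{rank}^2(\Sigma_{\rho}) + \mathrm{rank}(\Sigma_{\rho}) \log\left( \frac{c_2}{\delta} \right) \right) \sqrt{\frac{d_{\mathcal{S}} C^{\pi_e}}{n_o}} \cdot \log^{1/2}(1 + n_o), \tag{3.2}$$

where $c_1$ and $c_2$ are some universal constants.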

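Theorem 13 suggests $\mathrm{Err}_o$ is $\tilde{O}(H^2 \, \mathrm{rank}[\Sigma_{\rho}]^2 \sqrt{d_{\mathcal{S}} C^{\pi_e} / n_o})$. In other words, when $C^{\pi_e}$ and $\mathrm{rank}[\Sigma_{\rho}]$ are small and the offline sample size $n_o$ is large enough, $\mathrm{Err}_e$ dominates $\mathrm{Err}_o$ in Theorem 10. Again, in this case, $\mathrm{Err}_e = O\left( H \sqrt{\ln(|\mathcal{F}|) / n_e} \right)$, and we see that it grows linearly w.r.t. horizon $H$.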

Our result is distribution dependent and captures the possible low-rankness of the

offline data, i.e., rank[Σρ] depends on ρ and could be much smaller than the ambient

dimension of feature ϕ(s, a). The quantity Cπe corresponds to the discrepancy measured

between the batch data and expert data. This is much smaller than the worst-case

concentrability coefficient: C̃ = supπ∈ΠCπ.

Remark 14 (Implication on offline RL: Partial coverage). In RL, a similar quantity has

been analyzed in (Jin et al., 2020b), which studies the error bound of linear FQI with

pessimism. Whereas our result only requires partial coverage, (Jin et al., 2020b, Corollary 4.5) assumes global coverage, i.e., Σρ is full-rank, which is stronger than Cπe < ∞.

GPs Now we specialize our main theorem to non-parametric GP models. For simplicity,

following (Srinivas et al., 2010), we assume S ×A is a compact space. We also suppose

the following. Recall x := (s, a).

Assumption 15. k(x, x) ≤ 1,∀x ∈ S ×A. k(·, ·) is a continuous and positive semidefinite

kernel.

Under Assumption 15, we can use Mercer's theorem (Wainwright, 2019), which shows that there exists a set of pairs of eigenvalues and eigenfunctions $\{\mu_i, \psi_i\}_{i=1}^{\infty}$, where $\int \rho(x) \psi_i(x) \psi_i(x) \, dx = 1$ for all $i$ and $\int \rho(x) \psi_i(x) \psi_j(x) \, dx = 0$ for $i \ne j$. Eigenfunctions and eigenvalues essentially define an infinite-dimensional feature mapping $\phi(x) := [\sqrt{\mu_1} \psi_1(x), \ldots, \sqrt{\mu_\infty} \psi_\infty(x)]^{\top}$. Here, $k(x, x) = \phi(x)^{\top} \phi(x)$, and any function $f \in \mathcal{H}_k$ can be represented as $f(\cdot) = \alpha^{\top} \phi(\cdot)$. Note that the eigenvalues and eigenfunctions are defined w.r.t. the offline data distribution $\rho$, thus our result here is still distribution dependent rather than a worst-case analysis, which often appears in online RL/IL settings (Srinivas

et al., 2010; Kakade et al., 2020b; Yang et al., 2020; Chowdhury and Gopalan, 2019).

Assuming the eigenvalues $\{\mu_1, \ldots, \mu_\infty\}$ are in non-increasing order, we define the effective dimension:

Definition 16 (Effective dimension). $d^{*} = \min\{ j \in \mathbb{N} : j \ge B(j + 1) n_o / \zeta^2 \}$, $B(j) = \sum_{k=j}^{\infty} \mu_k$.

The effective dimension $d^{*}$ is widely used and has been calculated for many kernels (Zhang, 2005; Bach, 2017; Valko et al., 2013; Janz et al., 2020). For finite-dimensional linear kernels $\{ x \mapsto a^{\top} \phi(x) ; a \in \mathbb{R}^{d} \}$ ($k(x, x) = \phi^{\top}(x) \phi(x)$), we have $d^{*} \le \mathrm{rank}[\Sigma_{\rho}]$. Thus, $d^{*}$ can be

considered to be a natural extension of $\mathrm{rank}[\Sigma_{\rho}]$ to infinite-dimensional models.
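Definition 16 can be evaluated directly once a (truncated) spectrum is available. The sketch below assumes a finite list of eigenvalues, treating the remaining tail as zero, and takes $\mathbb{N}$ to start at one. The function name is hypothetical.

```python
import numpy as np

def effective_dimension(eigenvalues, n_offline, zeta):
    """Smallest j >= 1 with j >= B(j + 1) * n_o / zeta^2, where B(j) = sum_{k >= j} mu_k."""
    mu = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # non-increasing order
    # tail[j] = sum of mu_k over k >= j + 1 (1-indexed), i.e. B(j + 1); tail[len(mu)] = 0
    tail = np.concatenate([np.cumsum(mu[::-1])[::-1], [0.0]])
    for j in range(1, len(mu) + 1):
        if j >= tail[j] * n_offline / zeta**2:
            return j
    return len(mu)  # only reached for an empty spectrum
```

For a spectrum with only two non-zero eigenvalues the function returns at most 2, in line with $d^{*} \le \mathrm{rank}[\Sigma_{\rho}]$, and for geometrically decaying eigenvalues it grows only logarithmically in $n_o$.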

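Theorem 17 (Error for GPs). Let $\Sigma_{\pi_e} = \mathbb{E}_{x \sim d^{\pi_e}_{P}}[\phi(x) \phi(x)^{\top}]$, $\Sigma_{\rho} = \mathbb{E}_{x \sim \rho}[\phi(x) \phi(x)^{\top}]$. Suppose Assumption 15, $\zeta^2 = \Omega(1)$ and the partial coverage $C^{\pi_e} = \sup_{\| x \|_2 \le 1} (x^{\top} \Sigma_{\pi_e} x / x^{\top} \Sigma_{\rho} x) < \infty$. With probability $1 - \delta$,

$$\mathrm{Err}_o \le c_1 H^2 \left( (d^{*})^2 + d^{*} \log(c_2 / \delta) \right) \sqrt{\frac{d_{\mathcal{S}} C^{\pi_e}}{n_o}} \cdot \sqrt{\log^3(c_2 d_{\mathcal{S}} n_o / \delta) \log(1 + n_o)}, \tag{3.3}$$

where $c_1, c_2$ are universal constants.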

The theorem suggests that $\mathrm{Err}_o$ is $\tilde{O}(H^2 (d^{*})^2 \sqrt{d_{\mathcal{S}} C^{\pi_e} / n_o})$. Thus, when $C^{\pi_e}$ and $d^{*}$ are not too large and $n_o$ is large enough, $\mathrm{Err}_e$ asymptotically dominates $\mathrm{Err}_o$ in Theorem 10 (again, $\mathrm{Err}_e$ is linear in $H$).

While we defer the detailed proof of the above theorem to Appendix B.3.3, we highlight some techniques we used here. The analysis reduces to bounding the information gain $I_{n_o}$ and $\mathbb{E}_{x \sim d^{\pi_e}_{P}}[k_{n_o}(x, x)]$. In both cases, we proceed in two steps: transforming them into a variational representation and then bounding them via the uniform law with localization (Lemma 69).

Remark 18 (Implication to Offline RL: Robustness). As related literature, in model-

free offline RL, (Uehara et al., 2021; Duan et al., 2021) obtain the finite-sample error

bounds using nonparametric models. Though their bounds can be characterized by

the effective dimension, their bounds assume full coverage, i.e., max(s,a) 1/ρ(s, a) < ∞.

Specializing our result in Theorem 17 to offline RL, we achieve the following optimality gap $V^{\hat{\pi}_{RL}}_{P,c} - V^{\tilde{\pi}}_{P,c} \le \tilde{O}\left( H^2 (d^{*})^2 \sqrt{d_{\mathcal{S}} C^{\tilde{\pi}} / n_o} \right)$ with Gaussian processes, i.e., we are able to compare against any comparator policy, as long as its relative condition number $C^{\tilde{\pi}} < \infty$.

Thus our result indicates a robustness guarantee: among the policies that are covered

by the offline distribution in terms of bounded relative condition number, we can find a

policy that can compete against the best one.

3.6 Practical Implementation

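Algorithm 4 A practical instantiation of MILO
1: Require: expert dataset $\mathcal{D}_e = \{s, a\}$, offline dataset $\mathcal{D}_o := \{s, a, s'\}$
2: Train an ensemble of neural network models $\{\hat{g}_1, \ldots, \hat{g}_n\}$ where each model starts with a different random initialization;
3: Set bonus $b(s, a) = \max_{i,j} \| \hat{g}_i(s, a) - \hat{g}_j(s, a) \|_2$ and initialize $\pi_{\theta_0}$.
4: for $t = 0 \to T - 1$ do
5:   Set $w_t = \operatorname*{argmax}_{\| w \|_2 \le 1} w^{\top} \left( \mathbb{E}_{(s,a) \sim d^{\pi}_{\hat{P}}}[\phi(s, a)] - \mathbb{E}_{(s,a) \sim \mathcal{D}_e}[\phi(s, a)] \right)$, $f_t(s, a) := w_t^{\top} \phi(s, a)$
6:   $\theta_{t+1} = \theta_t - \eta F_{\theta_t}^{-1} \left( \mathbb{E}_{(s,a) \sim d^{\pi_{\theta_t}}_{\hat{P}}} \left[ \nabla \ln \pi_{\theta_t}(a \mid s) A^{\pi_{\theta_t}}_{\hat{P}, f_t + b}(s, a) \right] + \lambda \mathbb{E}_{(s,a) \sim \mathcal{D}_e} \left[ \nabla \ell(a, s, \pi_{\theta_t}) \right] \right)$
7: end for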

In this section we instantiate a practical version of MILO using neural networks for

the model class P and policy class Π. We use the Maximum Mean Discrepancy (MMD)

with a Radial Basis Function kernel as our discriminator class F . Note using MMD

as our discrepancy measure allows us to compute the exact maximum discriminator

argmax f∈F in closed form. We use a KL-based trust-region formulation for incremental

policy update inside the learned model P̂. Based on Eq. (3.1), we first formalize the

following constrained optimization framework:

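$$\min_{\pi \in \Pi} \max_{f \in \mathcal{F}} \left( \mathbb{E}_{(s,a) \sim d^{\pi}_{\hat{P}}} \left[ f(s, a) + b(s, a) \right] - \mathbb{E}_{(s,a) \sim \mathcal{D}_e}[f(s, a)] \right) \quad \text{s.t.} \quad \mathbb{E}_{(s,a) \sim \mathcal{D}_e}[\ell(a, s, \pi)] \le \delta$$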

where $\ell : \mathcal{A} \times \mathcal{S} \times \Pi \to \mathbb{R}$ is a loss function (e.g., negative log-likelihood or any supervised learning loss one would use in BC). Essentially, since we have $\mathcal{D}_e$ available, we use it together with any supervised learning loss to constrain the policy hypothesis space $\Pi$.

Note for a deterministic expert πe, the expert policy is always a feasible solution. Thus

adding this constraint reduces the complexity of the policy class but does not eliminate

the expert policy, and our analysis in Section 3.5 still applies.

In our practical instantiation, we replace the hard constraint with a Lagrange multiplier, i.e., we use the behavior cloning objective as a regularization term when solving the min-max problem:

$$\min_{\pi \in \Pi} \max_{f \in \mathcal{F}} \left[ \mathbb{E}_{(s,a) \sim d^{\pi}_{\hat{P}}} \left( f(s, a) + b(s, a) \right) - \mathbb{E}_{(s,a) \sim \mathcal{D}_e}[f(s, a)] \right] + \lambda \cdot \mathbb{E}_{(s,a) \sim \mathcal{D}_e}[\ell(a, s, \pi)].$$

Note there always exists a regularization parameter $\lambda$ that makes this regularized optimization problem equivalent to the constrained one. Iteratively, given policy $\pi_{\theta_t}$ ($\theta_t$ denotes the parameters), we first update the discriminator $f_t$ (line 5 in Algorithm 4); then, with a fixed $f_t$, we incrementally update $\pi$ using NPG as in line 6 in Algorithm 4, where $A^{\pi_{\theta}}_{\hat{P}, f + b}$ is the disadvantage function of $\pi_{\theta}$ under transition $\hat{P}$ and cost $f_t + b$, and $F_{\theta_t} := \mathbb{E}_{(s,a) \sim d^{\pi_{\theta_t}}_{\hat{P}}}[\nabla \ln \pi_{\theta_t}(a \mid s) \nabla \ln \pi_{\theta_t}(a \mid s)^{\top}]$ is the Fisher information matrix. We summarize the above procedure in Algorithm 4.⁴
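The discriminator update in line 5 of Algorithm 4 has a closed form: a linear objective over the unit ball is maximized by the normalized difference of feature means. The sketch below assumes the features $\phi(s, a)$ of policy rollouts in the learned model and of expert samples have already been computed. The function name is hypothetical, and how $\phi$ is obtained for the RBF kernel is left open here.

```python
import numpy as np

def linear_discriminator(phi_policy, phi_expert):
    """Closed-form maximizer of w^T (mean phi_policy - mean phi_expert) over ||w||_2 <= 1.

    phi_policy: (n, d) features of (s, a) pairs sampled from the policy in the model.
    phi_expert: (m, d) features of expert (s, a) pairs.
    Returns w_t and the attained discrepancy ||mean difference||_2.
    """
    gap = phi_policy.mean(axis=0) - phi_expert.mean(axis=0)
    norm = np.linalg.norm(gap)
    w = gap / norm if norm > 0 else np.zeros_like(gap)
    return w, norm
```

The resulting cost $f_t(s, a) = w_t^{\top} \phi(s, a)$ is high on state-action pairs that the current policy visits more often than the expert, which is what the policy update in line 6 then tries to reduce.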

3.7 Experiments

We aim to answer the following questions with our experiments: (1) How does MILO perform relative to other offline IL methods? (2) What is the impact of pessimism on MILO’s

performance? (3) How does the behavior policy’s coverage impact MILO’s performance?

(4) How does MILO’s result vary when we increase the number of samples drawn from

the expert policy?

We evaluate MILO on five environments from OpenAI Gym (Brockman et al.,

2016b) simulated with MuJoCo (Todorov et al., 2012a): Hopper-v2, Walker2d-v2,

HalfCheetah-v2, Ant-v2, and Humanoid-v2. We compare MILO against the following

baselines: (1) ValueDICE (Kostrikov et al., 2019c), a state-of-the-art off-policy IL method

modified for the offline IL setting; (2) BC on the expert dataset; and (3) BC on both the

offline and expert dataset. Note we modify ValueDICE to be offline by first populating the replay buffer with the offline dataset and then training the policy with the frozen replay buffer and expert data.

⁴ In line 6, we use TRPO, which means that we set η via the line search procedure used in TRPO.

Environment      Expert Performance   Behavior Performance
Hopper-v2        3012                 752 (25%)
Walker2d-v2      3082                 1383 (45%)
HalfCheetah-v2   5986                 3972 (66%)
Ant-v2           3072                 1208 (40%)
Humanoid-v2      3248                 1505 (46%)

Table 3.1: Performance for expert and behavior policy used to collect expert and offline datasets respectively.

For the expert dataset, we first train expert policies and then randomly sample (s, a)-

pairs from a pool of 100 expert trajectories collected from these expert policies. We

randomly sample to create very small expert (s, a)-pair datasets where BC struggles to

learn. Note that BC is effective at imitating the expert for MuJoCo tasks even with a

single trajectory; prior works (Ho and Ermon, 2016a; Kostrikov et al., 2019a,c) have

used similar sub-sampling strategies to create expert datasets to make it harder for BC to

learn. While we focus on the setting with an extremely small expert dataset consisting

of the expert’s state-action pairs, in Appendix Figure B.1, we verify that MILO can also successfully match the expert performance using a single expert trajectory.

The offline datasets are collected building on prior Offline RL works (Wu et al., 2019;

Kidambi et al., 2020b); each dataset contains 1 million samples from the environment.

We first train behavior policies with mean performances often less than half of the expert

performance (Table 3.1, column 2). All results are averaged over five random seeds. See

appendix for details on hyperparameters, environments, and dataset composition.

Figure 3.2: Learning curves across five seeds for MILO plotted against the best perfor-
mance of BC after 1000 epochs of training on the expert/offline+expert data and the best
performance of ValueDICE after 10 thousand iterations. The bottom right bar graph
shows the expert performance normalized scores where we plot the performance at the
last iteration for MILO.

3.7.1 Evaluation on MuJoCo Continuous Control Tasks

Figure 3.2 presents results comparing MILO against benchmarks. MILO is able to achieve

close to expert level performance on three out of the five environments and outperforms

both BC and ValueDICE on all five environments. We significantly outperform BC’s

performance when trained on the expert dataset (note that the expert datasets in Figure 3.2

only contain 100-200 expert (s, a) pairs), suggesting MILO indeed mitigates covariate

shift through the use of a static offline dataset of (s, a)-pairs. BC on both the offline and

expert dataset does improve the performance, but this still cannot successfully imitate

the expert since BC has no way of differentiating random/sub-optimal trajectories from

the expert samples. ValueDICE, on the other hand, does explicitly aim to imitate the

expert samples; however, in theory, it would require either the offline data (i.e. the replay

buffer) or the expert samples to have full coverage over the state-action space. Since our

offline dataset is mainly collected from a sub-optimal behavior policy and our expert

samples are from a high quality expert, neither our offline nor our expert dataset is likely

to have full coverage globally; thus potentially hurting the performance of algorithms

like ValueDICE. We emphasize that in the bar plot of Figure 3.2, for MILO, we use

the performance of the policy at the last iteration while for other baselines we use the

performance of the best policy over the entire training process. This indicates that the

learning process of MILO is stable.

Figure 3.3: (Left 2) Learning curves for Hopper and Walker2d with (red) and without
(blue) pessimism. MILO generally performs worse without pessimism. (Right 2) Learning
curve for Walker2d and Humanoid with more expert samples.

3.7.2 Ablation

Impact of Pessimism Figure 3.3 (Left 2) presents MILO’s performance on two represen-

tative environments with and without pessimism (i.e., setting penalty to be zero) added

to the imitation objective. Pessimism stabilizes and improves the final performance for

MILO. We find that pessimism is necessary in other environments as well except Ant

where we find that MILO achieves expert level performance even without pessimism (see

Appendix Figure B.2).

Behavior with more expert samples We investigate whether MILO is able to achieve

expert performance with more expert samples in the two environments (walker and

humanoid) that it did not solve with very small expert datasets in Figure 3.2. Figure 3.3

(Right 2) shows that with one trajectory worth of expert samples (1000 expert state-action

pairs), MILO is able to achieve expert performance on walker and humanoid.

Impact of Coverage As our analysis suggests, MILO’s performance degrades as the

offline data’s coverage over the expert’s state-action space decreases. We use the behavior

policy’s value as a surrogate for coverage, i.e. a lower value potentially suggests lower

coverage. We generate two additional offline datasets for each environment by lowering

the performance of the behavior policy. The three datasets are: (1) the original offline

datasets used in Table 3.1 (≈ 25% for Hopper-v2 and ≈ 50% for others); (2) ones that

have roughly half the performance of (1) (12% for Hopper-v2 and ≈ 25% for others);

and (3) ones collected from a random behavior policy (Random). Table 3.2 shows that

MILO performs reasonably on three environments even with a lower coverage dataset (second column): it matches expert performance on Ant-v2, and achieves approximately 70% of the expert performance on Hopper-v2 and Humanoid-v2. For the random datasets,

MILO achieves around 50% of the expert performance on Hopper-v2, around 20% on

Walker2d-v2 and Ant-v2, but fails on HalfCheetah-v2 and Humanoid-v2.

Environment      ≈ 50%          ≈ 25%          Random
Hopper-v2        0.95 ± 0.01    0.66 ± 0.33    0.42 ± 0.36
Walker2d-v2      0.72 ± 0.02    0.27 ± 0.06    0.23 ± 0.12
HalfCheetah-v2   0.96 ± 0.01    0.01 ± 0.02    0.01 ± 0.02
Ant-v2           1.02 ± 0.02    0.99 ± 0.01    0.21 ± 0.52
Humanoid-v2      0.88 ± 0.10    0.72 ± 0.03    0.08 ± 0.01

Table 3.2: Expert performance normalized scores on three different offline datasets collected from behavior policies with approximately 50%, 25%, and random performance relative to the expert.

3.8 Conclusion

MILO investigates how to mitigate covariate shift in IL using an offline dataset of environ-

ment interactions that has partial coverage of the expert’s state-action space. We show

the effectiveness of MILO both in theory and in practice. In future work, we hope to extend MILO to image-based control and to real-world settings where an offline IL algorithm may be effective.

CHAPTER 4

MODEL-FREE OFF-POLICY IMITATION LEARNING

Adversarial imitation learning (AIL) has stood out as a dominant framework across

various imitation learning (IL) applications, with Discriminator Actor Critic (DAC)

(Kostrikov et al., 2019b) demonstrating the effectiveness of off-policy learning algo-

rithms in improving sample efficiency and scalability to higher-dimensional observations.

Despite DAC’s empirical success, the original AIL objective is on-policy and DAC’s ad-hoc

application of off-policy training does not guarantee successful imitation (Kostrikov

et al., 2019b, 2020). Follow-up work such as ValueDICE (Kostrikov et al., 2020) tackles

this issue by deriving a fully off-policy AIL objective. Instead, we develop a novel

and principled AIL algorithm via the framework of boosting. Like boosting, our new

algorithm, AILBoost, maintains an ensemble of properly weighted weak learners (i.e.,

policies) and trains a discriminator that witnesses the maximum discrepancy between the

distributions of the ensemble and the expert policy. We maintain a weighted replay buffer

to represent the state-action distribution induced by the ensemble, allowing us to train

discriminators using the entire data collected so far. In the weighted replay buffer, the

contribution of the data from older policies is properly discounted with weights computed based on the boosting framework. Empirically, we evaluate our algorithm on both

controller state-based and pixel-based environments from the DeepMind Control Suite.

AILBoost outperforms DAC on both types of environments, demonstrating the benefit of

properly weighting replay buffer data for off-policy training. On state-based environments, AILBoost outperforms ValueDICE and IQ-Learn (Garg et al., 2021), achieving

competitive performance with as little as one expert trajectory.

4.1 Introduction

Adversarial Imitation Learning (AIL) is an incredibly successful approach for imitation

learning (Ho and Ermon, 2016a; Fu et al., 2018; Kostrikov et al., 2019b; Ke et al., 2020).

These methods cast IL as a distribution matching problem whereby the learning agent

minimizes the divergence between the expert demonstrator’s distribution and the state-

action distribution induced by the agent. First introduced by (Ho and Ermon, 2016a), this

divergence minimization can be achieved in an iterative procedure reminiscent of GAN

algorithms (Goodfellow et al., 2014) with our learned reward function and policy being

the discriminator and generator respectively.

Originally, a limitation of many AIL methods was that they were on-policy. That is,

for on-policy AIL methods like GAIL (Ho and Ermon, 2016a) and AIRL (Fu et al., 2018),

the algorithm would draw fresh samples from the current policy in every iteration for

the distribution matching process while discarding all old samples, rendering the sample

complexity of these algorithms to be prohibitively large in many applications. Follow-

up works (Kostrikov et al., 2019b; Sasaki et al., 2019) attempt to relax the on-policy

requirement by creating off-policy methods that utilize the entire history of observed

data during the learning process. This history is often represented by a replay buffer and

methods such as Discriminator Actor Critic (DAC) show large improvements in scalability

and sample complexity over their on-policy counterparts. However, these methods

modify the distribution matching objective as a divergence minimization between the

replay buffer’s and the expert’s distribution, losing the guarantee of matching the expert’s

behavior.

Algorithms like ValueDICE (Kostrikov et al., 2020) address this problem by deriving

a new formulation of the AIL divergence minimization objective to be entirely off-policy. ValueDICE, however, in principle relies on the environments to have deterministic dynamics.¹ In this chapter, we consider a new perspective towards making AIL off-policy. We present a new principled off-policy AIL algorithm, AILBoost, via the gradient

boosting framework (Mason et al., 1999). AILBoost maintains an ensemble of properly

weighted weak learners or policies as well as a weighted replay buffer to represent the

state-action distribution induced by our ensemble. Our distribution matching objective

is then to minimize the divergence between the weighted replay buffer’s distribution

(i.e., the state-action distribution induced by the ensemble) and the expert demonstrator’s

distribution, making the divergence minimization problem an off-policy learning problem.

Similar to boosting and gradient boosting, at every iteration, we aim to find a weak learner,

such that when added to the ensemble, the divergence between the updated ensemble’s

distribution and the expert’s distribution decreases. In other words, our approach can be

understood as performing gradient boosting in the state-action occupancy space, where

a black-box RL optimizer is used as a weak learning procedure to train weak learners, i.e.,

policies.

We evaluate AILBoost on the DeepMind Control Suite (Tassa et al., 2018) and

compare against a range of off-policy AIL algorithms (Behavior cloning, ValueDICE,

DAC) as well as a state-of-the-art IL algorithm, IQ-Learn. We show that our algorithm

is comparable to or more sample efficient than state-of-the-art IL algorithms in various

continuous control tasks, achieving strong imitation performance with as little as one

expert demonstration. We also show that our approach scales to vision-based, partially

observable domains, where we again outperform DAC.

1 One cannot derive an unbiased estimate of the objective function proposed in ValueDICE unless one has infinite expert samples and the transition is deterministic (Kostrikov et al., 2020). See Section 4.3.3 for a more detailed discussion.



4.2 Related works

Off-policy and Offline IL. There has been a wide variety of research conducted

on off-policy and offline IL, where the goal is to be either more sample efficient or

safer by utilizing a replay buffer or not collecting any environmental transitions during

training, respectively. The most prominent of said methods, and the closest to our work,

is Discriminator-Actor-Critic (DAC) (Kostrikov et al., 2019b), which essentially replaces

the on-policy RL algorithm in the adversarial IL setup with an off-policy one such as

DDPG (Lillicrap et al., 2019) or SAC (Haarnoja et al., 2018a). However, as mentioned

previously, DAC does not necessarily guarantee a distribution match between the expert and the learned policy, which has prompted further work. This follow-up work has primarily focused on weighting on-policy and off-policy data differently in both the policy

update and the discriminator update. ValueDICE (Kostrikov et al., 2020) mitigates this

problem by deriving an objective from the original distribution matching problem that

only requires off-policy samples to compute. More recently, methods such as IQ-Learn

(Garg et al., 2021) have been developed to learn soft Q functions over the environment

space, which encodes both a reward and a policy for inverse reinforcement learning,

and model-based methods such as V-MAIL (Rafailov et al., 2021) have shown that using

expressive world models (Hafner et al., 2020) leads to strong imitation results in domains

with high-dimensional observations. Other off-policy IL works include SoftDICE (Sun

et al., 2021), SparseDICE (Camacho et al., 2021), and AdVIL/AdRIL/DAeQuIL (Swamy

et al., 2021).

Orthogonally, on the offline side, where environment interaction is prohibited, works

both on the model-based side (Chang et al., 2021) and the model-free side (Kim et al.,

2022; Yu et al., 2023) have shown that distribution matching is still possible in these



settings. These approaches generally operate either by learning a transition model of the

environment, in which to roll out the policy for policy optimization (Chang et al., 2021), or

optimizing a modified version of the objective introduced in (Kostrikov et al., 2020) by

using samples from the suboptimal offline dataset as opposed to on-policy samples for

computation.

Boosting-style approaches in deep learning & RL. The idea of using boosting for policy learning is not new in the deep learning or reinforcement learning literature. On the deep learning side, AdaGAN (Tolstikhin et al., 2017) applies standard adaptive boosting

to GANs (Goodfellow et al., 2014) to address and fix issues such as mode collapse,

while concurrent work (Grover and Ermon, 2017) showed benefits of boosting in general

Bayesian mixture models. In RL, conservative policy iteration (CPI) (Kakade and Langford, 2002a) can be understood as performing gradient boosting in the policy space (Scherrer and Geist, 2014). Hazan et al. (2019) use a gradient boosting

style approach to learn maximum entropy policies. Here, we perform gradient boosting

in the space of state-action occupancy measures, which leads to a principled off-policy

IL approach.

4.3 Preliminaries

We consider a discounted infinite-horizon MDP $\mathcal{M} = \langle \mathcal{S}, P, \mathcal{A}, r, \gamma, \mu_0 \rangle$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function and $r(s, a)$ is the reward for the given state-action pair, $\gamma \in (0, 1)$ is the discount factor, $\mu_0 \in \Delta(\mathcal{S})$ is the initial state distribution, and $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition function. A policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ interacts with this MDP, creating trajectories $\tau$ composed of state-action pairs $\{(s_t, a_t)\}_{t=1}^{T}$. We denote by $d^{\pi}_t$ the state-action visitation distribution induced by $\pi$ at timestep $t$, and by $d^{\pi} = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t d^{\pi}_t$ the average state-action visitation distribution induced by policy $\pi$. We define the value function and Q-function of our policy as $V^{\pi}(s) = \mathbb{E}_{\pi}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s\big]$ and $Q^{\pi}(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}[V^{\pi}(s')]$. The goal of RL is to find a policy that maximizes the expected cumulative reward.
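To make the definition of the average state-action visitation distribution concrete, the following is a minimal sketch for a tabular MDP. The two-state MDP, the policy, and the helper name `discounted_occupancy` are hypothetical and chosen only for illustration; the sketch uses the fact that the discounted state occupancy satisfies a linear fixed-point equation.

```python
import numpy as np


def discounted_occupancy(P, pi, mu0, gamma):
    """Return d^pi(s, a) = (1 - gamma) * sum_t gamma^t d_t^pi(s, a) for a tabular MDP.

    P[s, a, s2] are transition probabilities, pi[s, a] is the policy,
    and mu0[s] is the initial state distribution.
    """
    n_states = P.shape[0]
    # State-to-state kernel under pi: P_pi[s, s2] = sum_a pi[s, a] * P[s, a, s2].
    P_pi = np.einsum("sa,sat->st", pi, P)
    # The discounted state occupancy rho solves
    # rho = (1 - gamma) * mu0 + gamma * P_pi^T rho.
    rho = np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, (1.0 - gamma) * mu0)
    # d^pi(s, a) = rho(s) * pi(a | s).
    return rho[:, None] * pi


# Hypothetical two-state, two-action MDP, used only for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
pi = np.array([[0.7, 0.3], [0.4, 0.6]])
mu0 = np.array([1.0, 0.0])
d_pi = discounted_occupancy(P, pi, mu0, gamma=0.9)
print(d_pi, d_pi.sum())
```

Because of the $(1 - \gamma)$ normalization, the returned array is a probability distribution over state-action pairs.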

In imitation learning, instead of having access to the reward function, we assume

access to demonstrations $\mathcal{D}^e = \{(s_i, a_i)\}_{i=1}^{N}$ from an expert policy $\pi^e$ that our policy can take advantage of while training. Note that $\pi^e$ might not necessarily be a Markovian policy. It is possible that $\pi^e$ is an ensemble of weighted Markovian policies, i.e., $\pi^e = \{\alpha_i, \pi_i\}_{i=1}^{n}$ with $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$, which means that for each episode, $\pi^e$ will first randomly sample

a policy πi with probability αi at t = 0, and then execute πi for the entire episode (i.e., no

switch to other policies during the execution for an episode). It is well known that the

space of state-action distributions induced by such ensembles is larger than the space of state-action distributions induced by Markovian policies (Hazan et al., 2019). The goal in IL is then to learn a policy that robustly mimics the expert. The simplest imitation learning algorithm for this goal is behavior cloning (BC): $\arg\min_{\pi \in \Pi} \mathbb{E}_{(s,a) \sim \mathcal{D}^e}[\ell(\pi(s), a)]$

where ℓ is a classification loss and Π is our policy class. Though this objective is simple,

it is known to suffer from covariate shift at test time (Pomerleau, 1988; Ross et al.,

2011a). Instead of minimizing action distribution divergence conditioned on expert states,

algorithms such as inverse RL (Ziebart et al., 2008) and adversarial IL (Ho and Ermon,

2016a; Finn et al., 2016a; Ke et al., 2020; Sun et al., 2019d) directly minimize some

divergence between state-action distributions, which helps address the covariate

shift issue (Agarwal et al., 2019).



4.3.1 Adversarial Imitation Learning (AIL)

The goal of AIL is to directly minimize some divergence between the behavior policy's state-action visitation distribution $d^{\pi}$ and the expert policy's state-action visitation distribution $d^{\pi^e}$. The choice of divergence results in different AIL algorithms.

The most popular AIL algorithm is Generative Adversarial Imitation Learning (GAIL)

(Ho and Ermon, 2016a), which minimizes the JS-divergence. This algorithm is an on-policy

adversarial imitation learning algorithm that connects Generative Adversarial Networks

(GANs) (Goodfellow et al., 2014) and maximum entropy IRL (Ziebart et al., 2008).

GAIL trains a binary classifier called the discriminator D(s, a) to distinguish between

samples from the expert distribution and the policy generated distribution. Using the

discriminator to define a reward function, GAIL then executes an on-policy RL algorithm

such as Trust Region Policy Optimization (TRPO) (Schulman et al., 2017a) or Proximal

Policy Optimization (PPO) (Schulman et al., 2017b) to maximize the reward. That gives

us the following adversarial objective:

\[ \min_{\pi} \max_{D} \; \mathbb{E}_{s,a \sim \pi}\big[\log D(s, a)\big] + \mathbb{E}_{s,a \sim \pi^e}\big[\log(1 - D(s, a))\big] - \lambda H(\pi) \tag{4.1} \]

where H(π) is an entropy regularization term. The first term in Equation (4.1) can be

viewed as a pseudo reward that can be optimized with respect to the policy π using on-policy

samples. Note that GAIL typically optimizes both policies and discriminators using on-

policy samples, making it quite sample inefficient. Using different divergences, there are

various reward functions that can be optimized with this framework (Orsini et al., 2021).
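As a small illustration of the inner maximization in Equation (4.1), the sketch below works in a tabular setting with the entropy term omitted. The two occupancy vectors are hypothetical. For a fixed policy, the pointwise maximizer is $D^*(s,a) = d^{\pi}(s,a) / (d^{\pi}(s,a) + d^{\pi^e}(s,a))$, and the resulting value equals $2\,\mathrm{JS}(d^{\pi}, d^{\pi^e}) - \log 4$, which is how the JS-divergence arises.

```python
import numpy as np


def gail_inner_objective(d_pi, d_e, D):
    """E_{d_pi}[log D] + E_{d_e}[log(1 - D)], the discriminator part of Eq. (4.1)."""
    return float(np.sum(d_pi * np.log(D)) + np.sum(d_e * np.log(1.0 - D)))


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    return float(0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m)))


# Hypothetical occupancies over four state-action pairs (illustration only).
d_pi = np.array([0.4, 0.3, 0.2, 0.1])
d_e = np.array([0.1, 0.2, 0.3, 0.4])

D_star = d_pi / (d_pi + d_e)           # pointwise maximizer of the inner problem
best = gail_inner_objective(d_pi, d_e, D_star)
# The policy minimizes E_pi[log D], i.e., it maximizes the pseudo reward -log D.
pseudo_reward = -np.log(D_star)
print(best, 2.0 * js_divergence(d_pi, d_e) - np.log(4.0))
```

Under this sign convention the pseudo reward is largest on state-action pairs that the discriminator attributes to the expert.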

In this work, while our proposed approach in general is capable of optimizing many

common divergences, we mainly focus on reverse KL divergence in our experiments.

Reverse KL divergence has been studied in prior works including Fu et al. (2018); Ke et al.

(2020). Unlike these prior works, however, we propose an off-policy method for optimizing



reverse KL by leveraging the framework of boosting.

4.3.2 Discriminator Actor Critic (DAC)

One reason GAIL needs many interactions with the environment to learn properly is its dependence on on-policy approaches to optimize discriminators and

policies. In particular, GAIL does not reuse any old samples. Discriminator Actor Critic

(DAC) (Kostrikov et al., 2019b) extends GAIL algorithms to take advantage of off-policy

learning to optimize the discriminators and policies.

DAC introduces a replay buffer R to represent the history of transitions observed

throughout training in the context of IRL. This replay buffer allows DAC to perform

off-policy training of the policy and the discriminator (similar to (Sasaki et al., 2019)).

Formally, DAC optimizes its discriminator with the objective:

\[ \max_{D} \; \mathbb{E}_{s,a \sim \mathcal{R}}\big[\log D(s, a)\big] + \mathbb{E}_{s,a \sim \pi^e}\big[\log(1 - D(s, a))\big], \tag{4.2} \]

where this objective minimizes the divergence between the expert distribution and the

replay buffer R distribution. Intuitively, this divergence does not strictly capture the

divergence of our policy distribution and the expert distribution, but a mixture of evenly

weighted policies learned up until the current policy. To rigorously recover a divergence

between our policy distribution and the expert distribution we need to apply importance

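weights: $\min_{\pi} \max_{D} \; \mathbb{E}_{s,a \sim \mathcal{R}}\Big[\frac{p_{\pi}(s,a)}{p_{\mathcal{R}}(s,a)} \log D(s, a)\Big] + \mathbb{E}_{s,a \sim \pi^e}\big[\log(1 - D(s, a))\big] - \lambda H(\pi)$. While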

this objective recovers the on-policy objective of GAIL (Equation (4.1)), the authors note

that estimating the density ratio is difficult and has high variance in practice. Furthermore,

they note that not using importance weights (Equation (4.2)) works well in practice,

but does not guarantee successful imitation, especially when the distribution induced by

the replay buffer, R, is far from our current policy’s state-action distribution. This is a



fundamental problem of DAC.
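The gap between the replay-buffer objective and the on-policy objective can be seen in a small tabular sketch. All numbers below are hypothetical: three past policies with known occupancies, an evenly weighted buffer, and an arbitrary discriminator. With exact density ratios the importance-weighted term recovers the on-policy term, while the unweighted buffer term does not; in practice these ratios must be estimated, which is the difficulty noted above.

```python
import numpy as np

# Hypothetical occupancies of three past policies over four state-action pairs.
past_occupancies = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [0.10, 0.20, 0.30, 0.40],   # the current policy
])
d_pi = past_occupancies[-1]
p_R = past_occupancies.mean(axis=0)      # evenly weighted replay buffer distribution

D = np.array([0.8, 0.6, 0.5, 0.3])       # an arbitrary discriminator output
log_D = np.log(D)

on_policy_term = float(np.sum(d_pi * log_D))                   # first term of Eq. (4.1)
buffer_term = float(np.sum(p_R * log_D))                       # first term of Eq. (4.2)
reweighted_term = float(np.sum(p_R * (d_pi / p_R) * log_D))    # importance-weighted term
print(on_policy_term, buffer_term, reweighted_term)
```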

4.3.3 ValueDICE

ValueDICE (Kostrikov et al., 2020) was proposed to address the density estimation issue

of off-policy AIL algorithms formalized in DAC (see Section 4.3.2). ValueDICE aims

to minimize the reverse KL divergence written in its Donsker-Varadhan (Donsker and

Varadhan, 1983) dual form:

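\[ -\mathrm{KL}(d^{\pi} \,\|\, d^{\pi^e}) = \min_{x : \mathcal{S} \times \mathcal{A} \to \mathbb{R}} \; \log \mathbb{E}_{(s,a) \sim d^{\pi^e}}\big[e^{x(s,a)}\big] - \mathbb{E}_{(s,a) \sim d^{\pi}}\big[x(s, a)\big] \tag{4.3} \]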

Motivated by DualDICE (Nachum et al., 2019a), ValueDICE performs a change of variable using the Bellman operator $\mathcal{B}^{\pi}$ (see footnote 2) with respect to the policy $\pi$, $x(s, a) = \nu(s, a) - \mathcal{B}^{\pi}\nu(s, a)$, resulting in the following objective:

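\[ \max_{\pi} \min_{\nu : \mathcal{S} \times \mathcal{A} \to \mathbb{R}} \; \log \mathbb{E}_{s,a \sim \pi^e}\big[\exp\big(\nu(s, a) - \mathcal{B}^{\pi}\nu(s, a)\big)\big] - (1 - \gamma)\, \mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi}\big[\nu(s_0, a_0)\big]. \tag{4.4} \]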

Now the objective function does not contain the on-policy distribution $d^{\pi}$ (in fact, it involves only the initial state distribution $\mu_0$ and the expert distribution). Despite relying only on $d^{\pi^e}$ and $\mu_0$, the authors have identified two aspects of the objective that will yield biased

estimates. First, the first expectation has a logarithm outside of it which would make mini-

batch estimates of this expectation biased. Moreover, inside the first expectation term,

we have ν(s, a) − Bπν(s, a) with Bπ being the Bellman operator. This limits ValueDICE’s

objective to only be unbiased for environments with deterministic transitions. This is

related to the famous double sampling issue in TD learning. Although many popular RL

benchmarks have deterministic transitions (Bellemare et al., 2013; Tassa et al., 2018;

Todorov et al., 2012b), this was a limitation not present in GAIL.
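The first source of bias, a logarithm outside an expectation, can be checked exactly on a toy example. The four values below are hypothetical stand-ins for $\exp(\nu - \mathcal{B}^{\pi}\nu)$ evaluated on expert samples; the sketch averages the mini-batch estimate $\log(\text{batch mean})$ over every batch of size two and compares it with the logarithm of the full mean.

```python
import itertools

import numpy as np

# Hypothetical values of exp(nu - B^pi nu) on four expert samples (illustration only).
values = np.array([0.5, 1.0, 2.0, 4.0])

true_term = float(np.log(values.mean()))   # log of the full expectation

# Average the mini-batch estimate log(mean(batch)) over every batch of size 2.
batch_estimates = [
    np.log(np.mean(batch)) for batch in itertools.combinations(values, 2)
]
expected_estimate = float(np.mean(batch_estimates))

# By Jensen's inequality the mini-batch estimate is biased downward.
bias = expected_estimate - true_term
print(true_term, expected_estimate, bias)
```

Every sample appears in the same number of batches, so the batch means are unbiased for the full mean; the bias comes only from applying the concave logarithm to each batch mean.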

2 A Bellman operator $\mathcal{B}^{\pi}$ is defined as follows: given any function $f(s, a)$, we have $\mathcal{B}^{\pi} f(s, a) := r(s, a) + \mathbb{E}_{s' \sim P(s,a)} f(s', \pi(s'))$, $\forall s, a$.



We take a different perspective than ValueDICE to derive an off-policy AIL algorithm.

Unlike ValueDICE, our approach is both off-policy and amenable to mini-batch

updates even with stochastic environment transition dynamics.

4.4 Algorithm

Our algorithm, Adversarial Imitation Learning via Boosting (AILBoost) – motivated by

classic gradient boosting algorithms (Friedman, 2001; Mason et al., 1999) – attempts

to mitigate a fundamental issue related to off-policy imitation learning formalized in

DAC (see Section 4.3.2). The key idea is to treat learned policies as weak learners, form

an ensemble of them (with a proper weighting scheme derived from a gradient boosting

perspective), and update the ensemble via gradient boosting.

Weighted policy ensemble. Our algorithm will learn a weighted ensemble of policies, denoted as $\pi := \{\alpha_i, \pi_i\}_{i=1}^{n}$ with $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$ and $\pi_i$ being some Markovian policy.

The way the mixture works is that when executing π, at the beginning of an episode, a

Markovian policy πi is sampled with probability αi, and then πi is executed for the entire

episode (i.e., no policy switch in an episode). Note that π itself is not a Markovian policy

anymore due to the sampling process at the beginning of the episode, and in fact, such

a mixture policy’s induced state-action distribution can be richer than that from Markovian

policies (Hazan et al., 2019). This is consistent with the idea of boosting: by combining

weak learners, i.e., Markovian policies, we form a more powerful policy. Given the above

definition of π, we immediately have $d^{\pi} := \sum_i \alpha_i d^{\pi_i}$, i.e., the weighted mixture of the

state-action distributions induced by Markovian policies πi.

Notation-wise, given a dataset $\mathcal{D}$, we denote by $\hat{\mathbb{E}}_{\mathcal{D}}[f(x)]$ the empirical average of $f$ across the dataset, i.e., $\hat{\mathbb{E}}_{\mathcal{D}}[f(x)] = \sum_{x \in \mathcal{D}} f(x) / |\mathcal{D}|$.
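The ensemble semantics and the notation above can be sketched in a few lines. The helper names, weights, and occupancy vectors are hypothetical: `sample_episode_policy` draws the single Markovian policy that acts for a whole episode, `mixture_occupancy` forms the weighted mixture of per-policy occupancies, and `empirical_mean` is the empirical average over a dataset.

```python
import numpy as np


def mixture_occupancy(alphas, occupancies):
    """d^pi = sum_i alpha_i * d^{pi_i} for an ensemble {(alpha_i, pi_i)}."""
    alphas = np.asarray(alphas, dtype=float)
    return np.tensordot(alphas, np.asarray(occupancies, dtype=float), axes=1)


def sample_episode_policy(alphas, rng):
    """Pick the index of the Markovian policy that acts for one whole episode."""
    return int(rng.choice(len(alphas), p=alphas))


def empirical_mean(f, dataset):
    """Empirical average of f over a dataset."""
    return sum(f(x) for x in dataset) / len(dataset)


# Hypothetical ensemble of three policies over four state-action pairs.
alphas = [0.5, 0.3, 0.2]
occupancies = [[0.70, 0.10, 0.10, 0.10],
               [0.25, 0.25, 0.25, 0.25],
               [0.10, 0.20, 0.30, 0.40]]
d_mix = mixture_occupancy(alphas, occupancies)

rng = np.random.default_rng(0)
episode_policy = sample_episode_policy(alphas, rng)
print(d_mix, episode_policy)
```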



4.4.1 AILBoost: Adversarial Imitation Learning via Boosting

We would like to minimize the reverse KL divergence between our policy's state-action distribution $d^{\pi}$ and the expert distribution $d^{\pi^e}$, denoted by $\ell(d^{\pi}, d^{\pi^e}) = \mathrm{KL}(d^{\pi} \,\|\, d^{\pi^e}) := \sum_{s,a} d^{\pi}(s, a) \ln\big(d^{\pi}(s, a) / d^{\pi^e}(s, a)\big)$. We focus on reverse KL because (1)

it has been argued that the mode seeking property of reverse KL is more suitable for

imitation learning (Ke et al., 2020), (2) reverse KL is on-policy in nature, i.e., it focuses

on minimizing the divergence of our policy’s action distribution and the expert’s at the

states from our policy, which helps address the covariate shift issue, and (3) the baselines we consider in experiments, DAC and ValueDICE, both minimize the reverse KL divergence in practice, as AIRL does (see footnote 3). At a high level, our approach directly optimizes $\ell(d^{\pi}, d^{\pi^e})$ via

gradient boosting (Mason et al., 1999) in the state-action occupancy space. Our ensemble

π induces the following mixture state-action occupancy measure:

\[ d^{\pi} := \sum_{i=1}^{t} \alpha_i d^{\pi_i}, \qquad \alpha_i \ge 0. \]

To compute a new weak learner $\pi_{t+1}$, we will first compute the functional gradient of the loss $\ell$ with respect to $d^{\pi}$, i.e., $\nabla \ell(d, d^{\pi^e})|_{d = d^{\pi}}$. The new weak learner $\pi_{t+1}$ is learned via the following optimization procedure: $\pi_{t+1} = \arg\max_{\tilde{\pi} \in \Pi} \langle d^{\tilde{\pi}}, -\nabla \ell(d, d^{\pi^e})|_{d = d^{\pi}} \rangle$. Namely, we aim to search for a new policy $\pi_{t+1}$ such that its state-action occupancy measure $d^{\pi_{t+1}}$ is aligned with the negative gradient $-\nabla \ell$ as much as possible. Note that the above optimization problem can be understood as an RL procedure where the reward function is defined as $-\nabla \ell(d, d^{\pi^e})|_{d = d^{\pi}} \in \mathbb{R}^{SA}$. Once we compute the weak learner $\pi_{t+1}$, we mix it into the policy ensemble with a fixed learning rate $\alpha \in (0, 1)$, denoted as $d^{\pi'} = (1 - \alpha) d^{\pi} + \alpha d^{\pi_{t+1}}$. Note that the above mixing step can be interpreted as gradient boosting in the state-action occupancy space directly: we re-write the update procedure as $d^{\pi'} = d^{\pi} + \alpha (d^{\pi_{t+1}} - d^{\pi})$, where the ascent direction $d^{\pi_{t+1}} - d^{\pi}$ is approximating the

3 See the official repository: https://github.com/google-research/google-research/tree/master/dac


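Algorithm 5 AILBoost (Adversarial Imitation Learning via Boosting)
Require: number of iterations $T$, expert data $\mathcal{D}^e$, weighting parameter $\alpha$
1: Initialize $\pi_1$ with weight $\alpha_1 = 1$, replay buffer $\mathcal{B} = \emptyset$
2: for $t = 1, \dots, T$ do
3:     Construct the $t$-th dataset $\mathcal{D}_t = \{(s_j, a_j)\}_{j=1}^{N}$ where $s_j, a_j \sim d^{\pi_t}$ for all $j$.
4:     Compute discriminator $\hat{g}$ using the weighted replay buffer:
           $\hat{g} = \arg\max_{g} \; \hat{\mathbb{E}}_{s,a \in \mathcal{D}^e}\big[-\exp(g(s, a))\big] + \sum_{i=1}^{t} \alpha_i \hat{\mathbb{E}}_{s,a \in \mathcal{D}_i}\big[g(s, a)\big]$    (4.5)
5:     Set $\mathcal{B} \leftarrow \mathcal{B} \cup \mathcal{D}_t$
6:     Compute weak learner $\pi_{t+1}$ via an off-policy RL approach (e.g., SAC) on reward $-\hat{g}(s, a)$ with replay buffer $\mathcal{B}$
7:     Set $\alpha_i \leftarrow \alpha_i (1 - \alpha)$ for $i \le t$, and $\alpha_{t+1} = \alpha$
8: end for
9: Return ensemble $\pi = \{(\alpha_i, \pi_i)\}_{i=1}^{T}$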

(negative) functional gradient $-\nabla \ell$, since $\arg\max_{\tilde{\pi}} \langle d^{\tilde{\pi}} - d^{\pi}, -\nabla \ell \rangle = \pi_{t+1}$ by the definition of $\pi_{t+1}$. It has been shown that such a procedure is guaranteed to minimize the objective function (i.e., reverse KL in this case) as long as the objective is smooth (our loss $\ell$ will be smooth as long as $d^{\pi}$ is non-zero everywhere) (e.g., see (Hazan et al., 2019) for the claim).4
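The mixing step can be illustrated with a toy sketch of boosting in occupancy space. Everything here is a simplifying assumption rather than the actual algorithm: the "policy class" is a finite set of three hypothetical occupancy vectors, the weak-learner oracle is an exact argmax over that set (standing in for the RL procedure), and the functional gradient of the reverse KL is computed in closed form as $\ln(d / d^{\pi^e}) + 1$.

```python
import numpy as np


def reverse_kl(d, d_e):
    """KL(d || d_e) for discrete distributions with full support."""
    return float(np.sum(d * np.log(d / d_e)))


def boost_step(d, d_e, candidates, alpha):
    """One boosting step: pick the candidate best aligned with -grad, then mix it in."""
    grad = np.log(d / d_e) + 1.0           # functional gradient of KL(d || d_e) w.r.t. d
    scores = candidates @ (-grad)          # <d_candidate, -grad> for each candidate
    best = int(np.argmax(scores))          # stands in for the RL weak-learner oracle
    return (1.0 - alpha) * d + alpha * candidates[best], best


# Hypothetical occupancies of three candidate policies and a uniform expert.
candidates = np.array([[0.8, 0.1, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.1, 0.1, 0.8]])
d_e = np.full(3, 1.0 / 3.0)

d = candidates[0].copy()
kl_history = [reverse_kl(d, d_e)]
for _ in range(10):
    d, chosen = boost_step(d, d_e, candidates, alpha=0.3)
    kl_history.append(reverse_kl(d, d_e))
print(kl_history[0], kl_history[-1])
```

No single candidate matches the uniform expert, but the mixture does so much more closely, which mirrors the point that ensembles induce richer occupancy measures than individual Markovian policies.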

Algorithmically, we first express the reverse KL divergence in its variational form

(Nowozin et al., 2016; Ke et al., 2020):

\[ \mathrm{KL}(d^{\pi} \,\|\, d^{\pi^e}) := \max_{g} \Big[ \mathbb{E}_{s,a \sim d^{\pi^e}}\big[-\exp(g(s, a))\big] + \mathbb{E}_{s,a \sim d^{\pi}}\big[g(s, a)\big] \Big] \]

where $g : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a discriminator. The benefit of using this variational form is that computing the functional (sub)-gradient of the reverse KL with respect to $d^{\pi}$ is easy, which is $\hat{g} = \arg\max_{g} \big[ \mathbb{E}_{s,a \sim d^{\pi^e}}[-\exp(g(s, a))] + \mathbb{E}_{s,a \sim d^{\pi}}[g(s, a)] \big]$, i.e., we have $\hat{g}$ being a functional sub-gradient of the loss $\mathrm{KL}(d^{\pi} \,\|\, d^{\pi^e})$ with respect to $d^{\pi}$. The maximum discriminator $\hat{g}$ will serve as a reward function for learning the next weak learner $\pi_{t+1}$,
4 Note that, similar to AdaBoost, each weak learner is not directly optimizing the original objective, but the weighted combination of the weak learners optimizes the original objective function, which is the reverse KL in our case.



that is

\[ \pi_{t+1} = \arg\max_{\pi} \; \mathbb{E}_{s,a \sim d^{\pi}}\big[-\hat{g}(s, a)\big] = \arg\max_{\pi} \; \langle d^{\pi}, -\hat{g}(s, a) \rangle. \tag{4.6} \]
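As a sanity check on the variational form, the tabular sketch below evaluates the objective at its stationary point. The two distributions are hypothetical. Setting the derivative with respect to $g(s,a)$ to zero gives $\hat{g} = \ln(d^{\pi} / d^{\pi^e})$; under this parameterization a direct calculation shows that the maximum value equals $\mathrm{KL}(d^{\pi} \,\|\, d^{\pi^e}) - 1$, and the additive constant does not affect either the discriminator or the weak learner.

```python
import numpy as np


def variational_objective(g, d_pi, d_e):
    """E_{d_e}[-exp(g)] + E_{d_pi}[g], the objective maximized over g."""
    return float(np.sum(d_e * -np.exp(g)) + np.sum(d_pi * g))


# Hypothetical occupancies over four state-action pairs (illustration only).
d_pi = np.array([0.4, 0.3, 0.2, 0.1])
d_e = np.array([0.1, 0.2, 0.3, 0.4])

g_hat = np.log(d_pi / d_e)                       # stationary point: d_pi = d_e * exp(g)
value_at_optimum = variational_objective(g_hat, d_pi, d_e)
kl = float(np.sum(d_pi * np.log(d_pi / d_e)))

reward = -g_hat                                  # reward used for the next weak learner
print(value_at_optimum, kl - 1.0, reward)
```

The reward $-\hat{g}$ is positive exactly where the expert places more mass than the current ensemble, so the next weak learner is pushed toward under-visited expert regions.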

To compute ĝ in practice, we need unbiased estimates of the expectations via sample

averaging which can be done easily in our case. The expectation Es,a∼dπe can be easily

approximated by the expert dataset De. To approximate Es,a∼dπ where dπ is a mixture

distribution, we maintain a replay buffer Di for each weak learner πi which contains

samples s, a ∼ dπi, and then weight Di via the weight αi associated with πi. In summary,

we optimize g as shown in Equation (4.5) in Algorithm 5 (the highlighted red part denotes

the empirical expectation induced by weighted replay buffer). The optimization problem

in Equation (4.5) can be solved via stochastic gradient ascent on g.5 With ĝ, we can

optimize for πt+1 using any off-the-shelf off-policy RL algorithm, making the entire algorithm off-policy.

In our experiments, we use SAC as the RL oracle for argmaxπ Es,a∼dπ[−ĝ(s, a)]. Once

πt+1 is computed, we mix πt+1 into the mixture, and adjust the weights of older policies

accordingly, i.e., αt+1 = α, and αi ← αi(1 − α),∀i ≤ t. Note that this weighting scheme

ensures that older policies receive less weight in the ensemble.
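The weighting scheme and the weighted empirical objective of Equation (4.5) can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the discriminator is any Python callable `g(s, a)`, each per-policy buffer is a list of state-action pairs, and the helper names and toy data are hypothetical.

```python
import numpy as np


def update_weights(alphas, alpha):
    """Line 7 of Algorithm 5: shrink old weights by (1 - alpha), give the new learner alpha."""
    return [a * (1.0 - alpha) for a in alphas] + [alpha]


def discriminator_objective(g, expert_data, buffers, alphas):
    """Empirical objective of Eq. (4.5) for a discriminator given as a callable g(s, a)."""
    expert_term = np.mean([-np.exp(g(s, a)) for s, a in expert_data])
    buffer_term = sum(
        w * np.mean([g(s, a) for s, a in buf]) for w, buf in zip(alphas, buffers)
    )
    return float(expert_term + buffer_term)


# Two boosting rounds with a hypothetical weighting parameter alpha = 0.5.
alphas = [1.0]
for _ in range(2):
    alphas = update_weights(alphas, alpha=0.5)
# alphas is now [0.25, 0.25, 0.5]: older policies carry geometrically smaller weight.

# Hypothetical toy data: scalar states and actions, one buffer per weak learner.
expert_data = [(0.0, 1.0), (1.0, 0.0)]
buffers = [[(0.0, 0.0)], [(1.0, 1.0)], [(0.5, 0.5)]]
value = discriminator_objective(lambda s, a: 0.0, expert_data, buffers, alphas)
print(alphas, value)
```

In practice the objective would be maximized over a parametric discriminator with stochastic gradient ascent; the sketch only evaluates it for a fixed callable.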

Remark 19. The use of SAC as the weak learning algorithm and the new way of com-

puting discriminator from Equation (4.5) make the whole training process completely

off-policy. Particularly, unlike most adversarial IL approaches, which compute discrimi-

nators by comparing on-policy samples from the latest policy and the expert samples, we

train the discriminator using all the data collected so far (with proper weighting derived

based on the boosting framework). The connection to boosting and the proper weighting

provides a principled way of leveraging off-policy samples for updating discriminators.

As we will show, compared to DAC which also uses off-policy samples for training policies

and discriminators, our principled approach leads to better performance.
5Note that unlike ValueDICE, here we can easily use a finite number of samples to obtain an unbiased

estimate of the loss by replacing expectations by their corresponding sample averages.



Algorithm 5, AILBoost, summarizes the above procedure. In Line 6, we use SAC

as the RL oracle for computing the weak learner. In practice, we do not run SAC from

scratch every time in Line 6. Instead, SAC maintains its own replay buffer which contains

all interactions it has with the environment so far. When computing πt+1, we first update

the reward in the replay buffer using the latest learned reward function −ĝ, and we always

warm start from πt. We include the detailed algorithmic description in Appendix C.1.
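The reward relabeling step described above can be sketched as follows. The buffer layout (a list of dictionaries) and the stand-in discriminator are assumptions made for illustration; the actual procedure is described in Appendix C.1.

```python
def relabel_rewards(replay_buffer, g_hat):
    """Overwrite each stored reward with the latest learned reward -g_hat(s, a)."""
    for transition in replay_buffer:
        transition["reward"] = -g_hat(transition["state"], transition["action"])
    return replay_buffer


# Hypothetical buffer of transitions stored as dictionaries.
replay_buffer = [
    {"state": 0.0, "action": 1.0, "reward": 0.0},
    {"state": 2.0, "action": 1.0, "reward": 0.0},
]


def g_hat(state, action):
    """Stand-in for the trained discriminator (illustration only)."""
    return 0.5 * state + action


relabel_rewards(replay_buffer, g_hat)
print([t["reward"] for t in replay_buffer])
```

Relabeling in place means that old transitions remain usable by the off-policy learner after every discriminator update, without re-collecting data.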

Memory cost. Note that at the end, our algorithm returns a weighted ensemble of

Markovian policies. Compared to prior works such as DAC, maintaining the weak learners may incur additional memory cost. However, the benefit of the weighted

ensemble is that it induces richer state-action distributions than that of Markovian policies.

In practice, if memory cost really becomes a burden (not in our experiments even with

image-based control policies), we may just keep the latest few policies (note that very old policies have exponentially small weights anyway).

4.5 Experiments

In this section we aim to empirically investigate the following questions: (1) How does

AILBoost perform relative to other off-policy and state-of-the-art IL methods? (2) Does

AILBoost enjoy the sample complexity and scalability benefits of modern off-policy

IL methods? (3) How robust is AILBoost across various different adversarial training

schedules?

Task                 Difficulty
Ball in Cup Catch    Easy
Walker Walk          Easy
Cheetah Run          Medium
Quadruped Walk       Medium
Humanoid Stand       Hard

Table 4.1: Spread of environments evaluated from the DeepMind Control Suite with hardness designations from (Yarats et al., 2022).

We evaluate AILBoost on 5 environments from the DeepMind Control Suite benchmark (Tassa et al., 2018): Walker Walk, Cheetah Run, Ball in Cup Catch, Quadruped Walk, and Humanoid Stand. For each task, we train an expert RL agent using the environment’s reward and collect 10 demonstrations which we use as the expert dataset throughout our experiments. We compare AILBoost against the following baselines: DAC, an empirically successful off-policy IL algorithm; IQ-Learn, a state-of-the-art IL algorithm; ValueDICE, another off-policy IL method; and BC on the expert data used across all algorithms. We emphasize our comparison to IQ-Learn, as it has been shown to outperform many other imitation learning baselines (e.g., SQIL (Reddy et al., 2019)) across a variety of control tasks (Garg et al., 2021).

The base RL algorithm we used for training the expert, as well as for AILBoost and

DAC, was SAC for controller state-based experiments and DrQ-v2 (Yarats et al., 2022)

for image-based experiments. For IQ-Learn and ValueDICE, we used their respective

codebases and hyperparameters provided by the authors and both methods use SAC as

their base RL algorithm. Please refer to Appendix C.2 for experimental details, training

hyperparameters, and expert dataset specifications.

4.5.1 Controller State-based Experiments

Figure 4.1 shows our aggregate results across the five DeepMind Control Suite (DMC)

tasks that we tested on. We chose these five tasks by difficulty as shown in Table 4.1.

For evaluation, we follow the recommendations of (Agarwal et al., 2021b) and report

the aggregate inter-quartile mean, mean, and optimality gap of AILBoost and all the

[Figure 4.1: expert normalized scores for BC, ValueDICE, IQ-Learn, DAC, and AILBoost with 10 demos, 5 demos, and 1 demo; panels: IQM, Mean, Optimality Gap.]

Figure 4.1: Aggregate metrics on DMC environments with 95% confidence intervals (CIs) based on 5 environments spanning easy, medium, and hard tasks. Higher inter-quartile mean (IQM) and mean scores (left and middle) and lower optimality gap (right) are better. The CIs were calculated with percentile bootstrap with stratified sampling over three random seeds and all metrics are reported on the expert normalized scores. AILBoost outperforms DAC, ValueDICE, IQ-Learn, and BC across all metrics, numbers of expert demonstrations, and tasks.

baselines on the DMC suite with 95% confidence intervals. We find that AILBoost not

only outperforms all baselines but also consistently matches the expert with only 1 expert

trajectory.
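For reference, the three aggregate metrics can be computed from a matrix of expert-normalized scores as in the sketch below. It assumes the usual definitions (IQM as the mean of the middle 50% of pooled runs, optimality gap as the average shortfall below a threshold of 1.0), uses hypothetical scores, and omits the stratified bootstrap used for the confidence intervals.

```python
import numpy as np


def aggregate_metrics(scores, threshold=1.0):
    """IQM, mean, and optimality gap of expert-normalized scores (runs x tasks)."""
    flat = np.sort(np.asarray(scores, dtype=float).ravel())
    n = flat.size
    lo, hi = n // 4, n - n // 4            # drop the bottom and top 25% of pooled runs
    iqm = float(flat[lo:hi].mean())
    mean = float(flat.mean())
    # Average amount by which runs fall short of the threshold (scores above it count as 0).
    optimality_gap = float(np.mean(threshold - np.minimum(flat, threshold)))
    return iqm, mean, optimality_gap


# Hypothetical normalized scores: 4 runs on 2 tasks (illustration only).
scores = [[0.2, 0.9],
          [1.0, 1.1],
          [0.8, 0.95],
          [0.5, 1.0]]
iqm, mean, gap = aggregate_metrics(scores)
print(iqm, mean, gap)
```

IQM discards the best and worst quarter of runs, which makes it less sensitive to outlier seeds than the mean, while the optimality gap ignores any score above expert level.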

Looking more closely at the 1-trajectory case, Figure 4.2 shows the learning curves

on three representative (1 easy, 1 medium, 1 hard task) environments where we see

AILBoost maintain high sample efficiency and strong imitation while state-of-the-art

baselines like IQ-Learn completely fail on Humanoid Stand. Finally, we note that

AILBoost greatly outperforms ValueDICE which aimed to make AIL off-policy from a

different perspective. We refer readers to Figure C.2 in the appendix for the learning

curves on all five environments with different numbers of expert demonstrations.



Figure 4.2: Learning curves with 1 expert trajectory across 3 random seeds. Note that AILBoost successfully imitates the expert on all environments where other baselines fail and achieves better sample complexity than DAC. As the environment difficulty level increases, our method shows a larger performance gap compared to the baselines (e.g., Humanoid Stand).

[Figure 4.3: Mean Score versus Samples (0 to 250,000) on Walker Walk and Cheetah Run; legend: Expert, BC, DAC, AILBoost.]

Figure 4.3: Image based: performance on image-based DMC environments, Walker
Walk and Cheetah Run, comparing AILBoost, DAC, and BC on three random seeds.

4.5.2 Image-based Experiments

Figure 4.3 demonstrates the scalability of AILBoost on a subset of environments with

10 expert trajectories. For these experiments, we use DrQ-v2 (Yarats et al., 2022) as

the underlying off-policy RL algorithm for both DAC and AILBoost. On Walker Walk

and Cheetah Run, we see comparable or better performance than DAC, demonstrating that our boosting strategy successfully maintains the empirical scaling properties of



DAC. Furthermore, our use of different off-policy RL algorithms shows the versatility of

AILBoost for IL.

4.5.3 Sensitivity to gradient-based optimization for weak learners and discriminators

[Figure 4.4: Mean Score versus Samples (0 to 1.5e6) on Ball in Cup Catch and Walker Walk; legend: Expert; 1000 P, 100 D; 1000 P, 10 D; 1000 P, 1 D; 100 P, 100 D.]

Figure 4.4: Policy and Discriminator Update Schedules: Learning curves for AILBoost
on two representative DMC environments, Walker Walk and Ball in Cup Catch, when
optimizing with varying policy and discriminator update schemes across 3 seeds.

Our algorithm relies on solving the optimization problems in Eq. 4.6 and Eq. 4.5 for weak learners and discriminators, where weak learners are optimized by SAC and discriminators are optimized by SGD. While it is hard to guarantee in general that we can solve these optimization problems exactly, because our policies and discriminators are both non-convex neural networks, we found that approximately solving Eq. 4.6 and Eq. 4.5 via gradient-based updates is enough to ensure good performance. In this section, we test AILBoost across a variety of optimization schedules. Overall, we find AILBoost to be robust to optimization schedules: approximately optimizing Eq. 4.6 and Eq. 4.5 with a sufficient number of gradient updates ensures successful imitation; however, there is a sample complexity cost when over-optimizing either the discriminator or the policy.

Figure 4.4 shows our investigation of how sensitive AILBoost is to different opti-



mization schedules for both the policy and discriminator on two representative DMC

environments. In particular, we test with 5 expert demonstrations, where we vary the

number of discriminator and policy updates. We test the following update schemes:

• 1000 policy updates per 100 discriminator updates

• 1000 policy updates per 10 discriminator updates

• 1000 policy updates per 1 discriminator update

• 100 policy updates per 100 discriminator updates

These ranges test various optimization schemes around the schedule that we chose for the main results. We find that the more policy updates we do per discriminator update, the less sample efficient the algorithm becomes, despite asymptotically reaching expert performance. We also found that an insufficient number of updates on the discriminator generally hurts performance. This is expected, since insufficient updates on the discriminator may result in a ĝ that does not optimize Eq. 4.5 well enough.

4.6 Conclusion

We present a fully off-policy adversarial imitation learning algorithm, AILBoost. Differ-

ent from previous attempts at making AIL off-policy, via the gradient boosting framework,

AILBoost provides a principled way of re-using old data for learning discriminators and

policies. We show that our algorithm achieves state-of-the-art performance on state-based

tasks on the DeepMind Control Suite while being able to scale to high-dimensional,

pixel observations. We are excited to extend this framework to discrete control as well as

investigate imitation learning from observations alone under this boosting framework.



Part II

IL and RL for Generative Models



CHAPTER 5

LEARNING TO GENERATE BETTER THAN YOUR LLM

Reinforcement learning (RL) has emerged as a powerful paradigm for fine-tuning

Large Language Models (LLMs) for text generation. In particular, recent LLMs such

as ChatGPT and GPT-4 can engage in fluent conversations with users after finetuning

with RL. Capitalizing on key properties of text generation, we investigate RL algorithms

beyond general purpose algorithms like Proximal Policy Optimization (PPO). In particu-

lar, we extend RL algorithms to allow them to interact with a dynamic black-box guide

LLM and propose RL with guided feedback (RLGF), a suite of RL algorithms for LLM

fine-tuning. We provide two ways for the guide LLM to interact with the LLM to be

optimized for maximizing rewards. The guide LLM can generate text which serves as

additional starting states for the RL optimization procedure. The guide LLM can also

be used to complete the partial sentences generated by the LLM that is being optimized,

treating the guide LLM as an expert to imitate and surpass eventually. We experiment

on the IMDB positive sentiment, CommonGen, and TL;DR summarization tasks. We

show that our RL algorithms achieve higher performance than supervised learning (SL)

and the RL baseline PPO, demonstrating the benefit of interaction with the guide LLM.

On both CommonGen and TL;DR, we not only outperform our SL baselines but also

improve upon PPO across a variety of metrics beyond the one we optimized for. Our

code can be found at https://github.com/Cornell-RL/tril.



5.1 Introduction

Large Language Models (LLMs) have become very capable in various real-world applications,

ranging from answering open-ended questions on numerous topics (Zhang et al., 2022),

writing articles from short descriptions (Goyal et al., 2022), generating code (Github, 2023),

following robot commands (Huang et al., 2022), and solving puzzles (Bubeck et al., 2023),

to serving as assistive models for education (Khan Academy, 2023) and healthcare

(Lee et al., 2023c).

However, using supervised learning (SL) to train LLMs presents a challenging metric

mismatch (Wiseman and Rush, 2016) between the training and testing regimes. The met-

ric mismatch arises from the training metric being the log-loss while the testing metrics

are task-specific such as BLEU or user satisfaction rating. This discrepancy is magnified

when fine-tuning LLMs on downstream tasks where the main goal is not just producing

fluent text but also being proficient at solving the specific task. Another mismatch is between

the training and testing distributions. SL methods train models on given static datasets,

while at inference time, the LLM needs to make predictions conditioned on the text it has

generated itself. Such a distribution mismatch between training and testing has been

widely observed in the literature on Imitation Learning and RL (Ross et al., 2011a),

robotics (Ross et al., 2013), and NLP (Bengio et al., 2015; Arora et al., 2022).

Reinforcement Learning (RL) addresses these mismatches by directly optimizing the

metrics through reward feedback on the states generated by the RL agent itself. The ability

to test in the real world and obtain reward feedback to correct and improve the agents’ behaviors

on the fly makes RL a more powerful learning paradigm than SL. Recently, OpenAI

fine-tuned LLMs with RL from human feedback (RLHF) to better align LLMs to human

intentions, leading to the great success of ChatGPT (OpenAI, 2023). Following this, multiple

other models trained with RL, such as Anthropic’s Claude 2 (Anthropic, 2023) and

Meta’s Llama 2 (Touvron et al., 2023), further proved the effectiveness of RL. Recently, the

GRUE benchmark (Ramamurthy et al., 2022a) systematically studied RL versus SL when

fine-tuning LLMs on downstream tasks with predefined rewards. GRUE’s preliminary

results demonstrate the benefit of RL when fine-tuning LLMs, leading to the release of

popular codebases such as RL4LMs (Ramamurthy et al., 2022a), TRLx (CarperAI, 2023),

and AlpacaFarm (Dubois et al., 2023) that enable RL for language models. However,

ChatGPT, RL4LMs, TRLx, and AlpacaFarm all use vanilla policy gradient methods

known to be sample inefficient and sensitive to local minima due to the combinatorially

large search space of natural language generation (Ramamurthy et al., 2022a).

Figure 5.1: (Left) Reinforcement Learning with Guided Feedback (RLGF) flow chart
showing how breaking up generations into two parts, rollins and rollouts done by different
LLMs, opens up a rich framework of interaction when compared to (Right) Reinforcement
Learning (RL). RLGF uses a guide policy πg to guide the policy training on π for
maximizing a reward function. The guide policy πg can be used to complete the partial
sentences generated by the policy π, which allows the RL algorithms to treat πg as an
expert to imitate and surpass. RLGF can also use πg to generate partial sentences from
which the RL algorithms start optimizing π. RLGF treats πg as a black-box model, which
gives RLGF the flexibility of using different pre-trained LLMs (or even a human expert)
as πg. We mainly experiment using a supervised fine-tuned model followed by nucleus
sampling (SFT+nucleus) as πg. Our experiments show that RLGF is capable of learning
a policy that is better than πg and the policy learned by standard RL alone.

Here, we focus on more efficient RL methods for fine-tuning LLMs on downstream

tasks with predefined rewards (e.g., a well-defined metric such as BLEU, or a reward learned

from human preference feedback). Our approach is motivated by the classic prior work

on RL with rich reset distributions (Kakade and Langford, 2002b; Bagnell et al., 2003)

and Imitation Learning (IL) (Ross et al., 2011a; Sun et al., 2017a; Chang et al., 2015b),

which often leverages an existing guide policy (not necessarily an optimal policy) to

reduce the search space for more efficient and optimal learning. Our key observation is

that since modern pre-trained LLMs exhibit impressive general language capabilities,

they can serve as guide policies to improve the RL procedure. Our framework, which we

call RL with guided feedback (RLGF), integrates a guide policy into a policy gradient

framework (Figure 5.1). When the guide policy can provide reasonable but potentially

sub-optimal predictions for downstream tasks, our framework can leverage it to learn

a near-optimal strategy. We introduce simple and novel algorithms for fine-tuning LLMs

using our RLGF framework while capturing various existing IL and RL algorithms. Our

proposed algorithms are simple and introduce little overhead on computation and memory

compared to PPO (especially when using LoRA adapters), making it straightforward to

replace PPO with our algorithms in any RLHF pipeline.

We evaluate on three tasks. The first is IMDB where the goal is to generate a positive

and fluent review given an initial context. The second is CommonGen where the goal

is to write a fluent text that uses a given set of words. Finally, we test on the TL;DR

summarization task where the objective is to learn to generate summaries using human

preference data. For all tasks, we find evidence of metric mismatch from SL-based

fine-tuning approaches and show that RL-based methods which utilize reward signals

outperform on the task metric. We then demonstrate RLGF outperforming PPO on

reward, fluency, as well as automated lexical metrics such as Rouge. In our experiments,

our guide policy is the SFT model equipped with nucleus sampling. Thus, compared to

the PPO baseline, which uses the SFT model as a warm start, our algorithms use the same

amount of information, making this a fair comparison to PPO. Finally, we investigate how



various baselines and RLGF algorithms balance the inherent trade-off between reward

optimization and the KL constraint in the RLHF objective. We provide both theoretical

justification and empirical evidence to show the benefit of using RLGF for fine-tuning

LLMs on downstream tasks.



5.2 Related Work

Here we present the most relevant works at the intersection of IL, RL, and natural

language generation. Please see Appendix D.1 for a more thorough treatment of the

literature.

IL for Structured Prediction: Algorithms such as Scheduled Sampling (SS) (Bengio

et al., 2015), methods using SS (Duckworth et al., 2019; Mihaylova and Martins,

2019; Goyal et al., 2017), SEARNN (Leblond et al., 2017), Bridging the Gap (Zhang

et al., 2019b), and Mixer (Ranzato et al., 2015) have been inspired by the IL for structured

prediction algorithms DAGGER (Ross et al., 2011a), DAD (Venkatraman et al., 2015), and

SEARN (Daumé III et al., 2009). Our work is inspired by AggreVaTeD (Sun et al.,

2017a) (Differentiable AggreVaTe (Ross and Bagnell, 2014b)), where the algorithm makes

use of differentiable policies and multi-step feedback rather than immediate one-step

predictions to imitate. Similarly, we present a differentiable version of LOLS (Chang

et al., 2015b) as well as an improvement, D2LOLS.

LLM Fine-tuning from Human Preferences: Recent advancements in fine-tuning

of Large Language Models (LLMs) have shown incredible success in tasks through

learning from human preferences. Because human preferences are simpler to collect,

Reinforcement Learning from Human Feedback (RLHF) (Stiennon et al., 2020) introduced

a paradigm to utilize RL to improve downstream performance on translation (Kreutzer

et al., 2018b), summarization (Stiennon et al., 2020), storytelling (Ziegler et al., 2019),

and instruction following (OpenAI, 2023). Another family of work uses supervised-learning-style

methods for fine-tuning LLMs (Zhao et al., 2023; Yuan et al., 2023; Rafailov

et al., 2023a; Liu et al., 2023c). DPO, SLiC, RRHF, and RSO are methods that optimize

for compatibility with a preference dataset under a preference reward model (either

explicitly modeling a reward function or implicitly representing a reward function via

an LLM itself) such as the Bradley-Terry model (Bradley and Terry, 1952). Whether

one should use RL or SFT to fine-tune LLMs is not the question we aim to address

here. Instead, our work mainly focuses on improving PPO for fine-tuning LLMs, and our

key contribution is novel RL algorithms that can outperform PPO on various tasks.

LLM Distillation: With an ever growing arsenal of powerful, black-box LLMs,

recent work has aimed to distill specific capabilities into a smaller model. Work on knowledge

distillation (Buciluǎ et al., 2006; Hinton et al., 2015) in autoregressive models has investigated

matching sequence-level log probabilities (Kim and Rush, 2016), model hidden states

(Jiao et al., 2019), or attention scores (Wang et al., 2020b). Recently, more sophisticated

methods, inspired by the IL literature, have been proposed to better imitate the expert

LLM’s performance (Lin et al., 2020a; Agarwal et al., 2023; Mukherjee et al., 2023), with

ORCA (Mukherjee et al., 2023) reaching parity performance with ChatGPT (OpenAI,

2023) by distilling the reasoning traces from GPT-4 (OpenAI, 2023). Distinct from this

line of work, RLGF does not aim to replicate the guidance policy. Rather, our objective is

to leverage generation traces derived from a guide policy to condense the search space for

RL algorithms. More importantly, our goal goes beyond imitation of the guidance policy

and focuses on algorithms that better optimize a reward with guidance policy feedback.

5.3 Preliminaries

The sequential nature of text generation with LLMs allows one to model

it via RL. In this setting, we are given a set of prompts $\{x_i\}_{i=1}^{N}$ and a reward function

$R$ that measures some user-specified quality of the generated text. The reward $R$ can

be a pre-defined evaluation metric or a reward model learned from human preference

datasets. The text generation RL problem can then be defined as a token-level finite-horizon

MDP $\langle \mathcal{S}, \mathcal{A}, P, R, H, \mu \rangle$ using a finite vocabulary $\mathcal{V}$. We are given a labeled

dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ of $N$ samples, where $x_i$ is a prompt text and $y_i$ is the target text

generation. We define $\mu \in \Delta(\mathcal{D})$ as the initial distribution over prompts, and the action

space $\mathcal{A}$ as the set of tokens in our vocabulary $\mathcal{V}$. The state space $\mathcal{S} = \cup_{h=1,\dots,H} \mathcal{V}^{h}$ is

the set of all possible token sequences, and a state $s_h \in \mathcal{S}$ is the prompt $x$ and previously

generated tokens $(a_0, a_1, \dots, a_{h-1})$, i.e., $s_h = (x, a_0, a_1, \dots, a_{h-1})$. The transition function

$P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is a deterministic known transition function that appends the next

action $a_h$ to the state $s_h$ to produce $s_{h+1}$. The time horizon $H \in \mathbb{Z}^{+}$ is the maximum generation length.

Finally, $R : \mathcal{S} \to \mathbb{R}$ is the reward function, such as the task evaluation metric or a metric

learned from a preference dataset. We define our policy π as an LLM that maps from

state (i.e., prompt + partial generation) to action (next token).
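The token-level MDP above can be illustrated with a minimal Python sketch. The names `State`, `transition`, and `is_done` are hypothetical and chosen only for illustration; they are not taken from the released code. The sketch only shows that a state is the prompt plus the tokens generated so far, and that the transition deterministically appends the chosen token.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class State:
    """A state s_h = (x, a_0, ..., a_{h-1}): the prompt plus tokens generated so far."""
    prompt: Tuple[str, ...]
    actions: Tuple[str, ...] = ()

    @property
    def h(self) -> int:
        """Time step h, i.e., the number of tokens generated so far."""
        return len(self.actions)


def transition(state: State, action: str) -> State:
    """Deterministic transition: append a_h to s_h to obtain s_{h+1}."""
    return State(state.prompt, state.actions + (action,))


def is_done(state: State, horizon: int) -> bool:
    """Generation stops once the maximum generation length H is reached."""
    return state.h >= horizon


# Toy usage with a hypothetical two-token prompt.
s0 = State(prompt=("Two", "roads"))
s1 = transition(s0, "diverged")
s2 = transition(s1, "in")
print(s2.actions, is_done(s2, horizon=2))
```

Because states are immutable prefixes, any earlier state can be revisited simply by keeping a reference to it, which is the restart property that Section 5.4 relies on.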

Let $d^{\pi}_{h}$ represent the state distribution of visiting a state at time $h$. Let $d^{\pi} = \frac{1}{H} \sum_{h=0}^{H} d^{\pi}_{h}$

be the average visitation if we follow π for H steps in a trajectory. With an LLM policy

π, we define the value function and Q-function as $V^{\pi}_{h}(s) = \mathbb{E}_{\pi}\left[\sum_{h'=h}^{H} R(s_{h'}) \mid s_h = s\right]$ and

$Q^{\pi}_{h}(s, a) = R(s) + \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[V^{\pi}_{h+1}(s')\right]$, respectively. Finally, we define the advantage

function for an LLM policy π as $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$.

Guide policy πg In our setting, we additionally assume access to a black-box LLM-based

guide policy πg that can assist our policy π. The guide policy can be used to alter

the initial state distribution µ and to compute the advantage function $A^{\pi^g}(s, a)$. In our

experiments, we mainly investigate using a supervised fine-tuned (SFT) model followed

by some decoding strategy (e.g., nucleus sampling (Holtzman et al., 2019)) as πg. Note,

RLGF treats πg as a queryable, black-box model that we do not need to update. This allows

for πg to be any black-box model such as GPT-4 or a human expert. Our work aims to

show that RLGF is capable of learning policies that are (much) better than πg, and that by

leveraging πg, it can outperform the standard RL algorithm PPO.

Figure 5.2: RLGF’s main mechanism of incorporating guidance through interactions
between two LLMs: rollin and rollout policies. (1) The rollin policy generates a trajectory.
(2) The rollout policy restarts to a sampled point in the generation (i.e., $s_2$) and completes
the generation. (3) The rollout policy receives a score (i.e., reward) for the generation.

5.4 Reinforcement Learning from Guided Feedback

Unlike other tasks studied in RL (e.g., robotics control problems), text generation prob-

lems have two key properties: a deterministic transition function and a policy’s ability to

restart to any state. Because our state is the set of previously generated tokens, we can

easily alter the words in the generation (add, remove or swap), and restart our policy πθ

to any point of the generation.

Restarts allow us to execute rollin and rollout policies as seen in Figure 5.2. The rollin

policy is used to generate sequences that the rollout policy evaluates. Specifically, we

sample a prompt x from our initial distribution µ. We then generate an entire trajectory

using our rollin policy starting from the sampled prompt. We combine the state-action

pairs from the collected rollin trajectory with the initial prompts – creating a modified

initial state for the rollout policy. The rollout policy samples a state along the rollin

generation, restarts to this state and performs rollouts. The rollout policy collects a

reward at the end of the generation. The rollin and rollout policies can each be either our

LLM policy πθ or the guide policy πg. Depending on the choice of rollin and rollout policies, we invoke

different algorithms. Note that PPO uses πθ for both rollin and rollout policies.
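The rollin, restart, and rollout steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in our experiments: policies are modeled as plain functions from a token sequence to the next token, the restart point is sampled uniformly (an assumption; the text only says a state is sampled along the rollin), and `rollin_rollout` and `generate` are hypothetical names.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A policy maps (prompt + partial generation) to the next token.
Policy = Callable[[Sequence[str]], str]


def generate(policy: Policy, prefix: Sequence[str], num_tokens: int) -> List[str]:
    """Extend `prefix` by `num_tokens` tokens sampled from `policy`."""
    seq = list(prefix)
    for _ in range(num_tokens):
        seq.append(policy(seq))
    return seq


def rollin_rollout(
    prompt: Sequence[str],
    rollin_policy: Policy,
    rollout_policy: Policy,
    reward_fn: Callable[[Sequence[str]], float],
    horizon: int,
    rng: random.Random,
) -> Tuple[List[str], List[str], float]:
    # (1) Rollin: generate a full trajectory from the prompt.
    rollin = generate(rollin_policy, prompt, horizon)
    # (2) Restart: keep the first h rollin tokens (h sampled uniformly, an
    #     assumption) and let the rollout policy complete the generation.
    h = rng.randrange(horizon)
    restart_state = rollin[: len(prompt) + h]
    completion = generate(rollout_policy, restart_state, horizon - h)
    # (3) Reward: score the completed generation once, at the end.
    return restart_state, completion, reward_fn(completion)
```

Passing the same policy for both arguments corresponds to the standard PPO data collection, while swapping in a guide for either argument gives the other schemes discussed in this section.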

PPO: Rollin πθ and Rollout πθ Under this schematic, notice how when both the rollin

and rollout policies are our current LLM policy πθ that is being fine-tuned, the resulting

RL algorithm is PPO. That is, we would be collecting generations from a single LLM. This

configuration does not take advantage of the ability to modify the initial state distribution

nor the availability of a guide policy πg.

Algorithm 6 PPO++

1: Input: $\pi_\theta$, guide $\pi^g$, iterations $T$, mixing parameter $\beta \in [0, 1]$, dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$
2: for $t \in [T]$ do
3:     Rollin with $(s, a) \sim \beta d^{\pi^g} + (1 - \beta) d^{\pi_\theta^t}$ starting from $x \sim \mathcal{D}$
4:     Rollout with $\pi_\theta^t$ to collect trajectories
5:     Update $V_\phi^{\pi_\theta^t}$ with trajectories and compute advantage estimates $A^{\pi_\theta^t}$
6:     Update $\pi_\theta$ using PPO loss with $A^{\pi_\theta^t}$
7: end for
8: return $\pi_\theta$

PPO++: Rollin πg and Rollout πθ The new scheme we propose is rollin with our guide

policy πg and rollout with our LLM policy πθ. This strategy is motivated by a popular

Approximate Policy Iteration algorithm (Bertsekas, 2011): Conservative Policy Iteration

(CPI) (Kakade and Langford, 2002b). CPI proposes to use a diverse initial state distri-

bution to address the exploration issue in PG methods. Particularly, it proposes to use an

initial state distribution that covers some high-quality policy’s state distribution. The first

key idea of PPO++ is to take advantage of a guide policy πg to provide an enlarged initial

state distribution – so that the rollout policy, πθ, can visit diverse and relevant states it

would otherwise not visit. The second key idea of PPO++ is using a mixture policy with

state distribution $\beta d^{\pi^g} + (1 - \beta) d^{\pi_\theta}$ for rollin (see Algorithm 6, Line 3). This ensures that

with probability (1 − β), PPO++ is executing the default PPO update, making sure PPO++

maintains the benefits of PPO and never underperforms PPO.
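One simple way to realize the mixture rollin distribution is to pick, per trajectory, which policy performs the rollin. The sketch below assumes this per-trajectory sampling; `sample_rollin_policy` is a hypothetical helper name and the policies are placeholders.

```python
import random
from typing import Callable, Sequence

Policy = Callable[[Sequence[str]], str]


def sample_rollin_policy(
    pi_theta: Policy, pi_guide: Policy, beta: float, rng: random.Random
) -> Policy:
    """Return the guide with probability beta, otherwise the current policy.

    Sampling the rollin policy once per trajectory induces the state
    distribution beta * d^{pi_g} + (1 - beta) * d^{pi_theta}.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    # rng.random() is in [0, 1), so beta = 0 never selects the guide
    # and beta = 1 always selects it.
    return pi_guide if rng.random() < beta else pi_theta
```

With β = 0 every rollin comes from πθ, which recovers the standard PPO data collection; with β = 1 every rollin comes from the guide.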

Algorithm 7 AggreVaTeD

1: Input: $\pi_\theta$, guide $\pi^g$, iterations $T$, mixing parameter $\beta \in [0, 1]$, dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$
2: for $t \in [T]$ do
3:     Rollin with $(s, a) \sim (1 - \beta) d^{\pi_\theta^t} + \beta d^{\pi^g}$ starting from $x \sim \mathcal{D}$
4:     Rollout with $\pi^g$ to collect trajectories
5:     Update $V_\phi^{\pi^g}$ with trajectories and compute advantage estimates $A^{\pi^g}$
6:     Update $\pi_\theta$ using PPO loss with $A^{\pi^g}$
7: end for
8: return $\pi_\theta$

AggreVaTeD: Rollin πθ and Rollout πg The next scheme performs rollin with our LLM

policy πθ and rollout with our guide policy πg – the opposite of PPO++. This scheme

is an interactive imitation learning algorithm, AggreVaTeD (Sun et al., 2017a), a differ-

entiable policy gradient version of AggreVaTe (Aggregate Values to Imitate (Ross and

Bagnell, 2014b)) as seen in Algorithm 7. AggreVaTeD is an API algorithm similar to

CPI and also uses a mixture policy with state distribution $\beta d^{\pi^g} + (1 - \beta) d^{\pi_\theta}$ for rollin.

This algorithm first generates rollins with the mixture policy to collect sequences. Then

AggreVaTeD generates rollouts with the guide policy and evaluates the quality of the

generated rollouts. It then uses the rollouts to train a value network $V_\phi^{\pi^g}$ that measures

the reward-to-go of πg, which in turn is used to construct the advantage of πg: $A^{\pi^g}$. With

this advantage $A^{\pi^g}$, AggreVaTeD updates the policy like PPO (i.e., update πθ so that it

increases the probabilities of selecting actions with larger $A^{\pi^g}$). Intuitively, the algorithm

aims to learn the policy $\arg\max_a A^{\pi^g}(s, a)$, which ensures that the LLM policy πθ can

be at least as good as or better than the guide policy πg.
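To make the intuition behind $\arg\max_a A^{\pi^g}(s, a)$ concrete, the following toy sketch computes the guide's advantages at a single state from tabular values. The Q-values and probabilities in the usage example are made-up illustrative numbers, and the function names are hypothetical; in practice the advantage is estimated from guide rollouts and a learned value network rather than read from a table.

```python
from typing import Dict


def guide_value(q_guide: Dict[str, float], guide_probs: Dict[str, float]) -> float:
    """V^{pi_g}(s) = E_{a ~ pi_g(.|s)}[Q^{pi_g}(s, a)]."""
    return sum(guide_probs[a] * q_guide[a] for a in q_guide)


def guide_advantages(
    q_guide: Dict[str, float], guide_probs: Dict[str, float]
) -> Dict[str, float]:
    """A^{pi_g}(s, a) = Q^{pi_g}(s, a) - V^{pi_g}(s) for every action a."""
    v = guide_value(q_guide, guide_probs)
    return {a: q - v for a, q in q_guide.items()}


def greedy_action(advantages: Dict[str, float]) -> str:
    """The one-step improvement over the guide: argmax_a A^{pi_g}(s, a)."""
    return max(advantages, key=advantages.get)


# Toy state: the guide is indifferent between two next tokens, but one of
# them leads to a higher reward-to-go under the guide's own rollouts.
q = {"wood": 0.9, "street": 0.4}
probs = {"wood": 0.5, "street": 0.5}
adv = guide_advantages(q, probs)
print(greedy_action(adv))
```

Whenever the guide puts probability on an action with negative advantage, as in this toy state, acting greedily with respect to $A^{\pi^g}$ yields a strictly better one-step choice than sampling from the guide.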



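Algorithm 8 D2LOLS

1: Input: $\pi_\theta$, guide $\pi^g$, iterations $T$, dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$
2: Run $\pi_\theta^1$ = AggreVaTeD($\pi_\theta$, $\pi^g$, $\alpha T$, $\beta_1$, $\mathcal{D}$)
3: Run $\pi_\theta^2$ = PPO++($\pi_\theta^1$, $\pi^g$, $(1 - \alpha) T$, $\beta_2$, $\mathcal{D}$)
4: return $\pi_\theta^2$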

D2LOLS: combines PPO++ and AggreVaTeD Given the previous approaches of inter-

action, we can come up with multiple ways to combine PPO, PPO++, and AggreVaTeD.

In Algorithm 8, we present Direct and Differentiable Locally Optimal Learning to

Search (D2LOLS), which is a simple approach to combine the previous methods. D2LOLS

is a differentiable policy gradient version of Locally Optimal Learning to Search

(LOLS) (Chang et al., 2015b) and addresses limitations of how LOLS combines PPO, PPO++,

and AggreVaTeD. The original formulation of LOLS requires computing cost-sensitive

classification similar to AggreVaTe; instead we take inspiration from AggreVaTeD’s

differentiable approach to develop a differentiable version of LOLS. Furthermore, LOLS

(Algorithm 15) has a mixing probability parameter α which directly merges the advantage

function between PPO and AggreVaTeD, leading to theoretical issues. D2LOLS removes

this mixing probability and replaces it with a mixing time variable α that decides how

many iterations to perform AggreVaTeD before switching to PPO++. This simple modification

not only makes D2LOLS more practical for optimizing LLMs, but also fixes LOLS’s

issue arising from interweaving guidance. Thus D2LOLS should be understood as a more

practical and more principled alternative to LOLS.
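The two-stage schedule of Algorithm 8 can be sketched as a small driver loop. This is an illustrative sketch only: `aggrevated_step` and `ppo_pp_step` are hypothetical callables standing in for one iteration of AggreVaTeD and PPO++ (each with its own mixing parameter), and rounding αT down to an integer is an assumption.

```python
from typing import Callable, TypeVar

P = TypeVar("P")  # policy type
G = TypeVar("G")  # guide type


def d2lols(
    policy: P,
    guide: G,
    total_iters: int,
    alpha: float,
    aggrevated_step: Callable[[P, G], P],
    ppo_pp_step: Callable[[P, G], P],
) -> P:
    """Run AggreVaTeD for the first alpha * T iterations, then PPO++ for the rest."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    switch = int(alpha * total_iters)  # assumption: round alpha * T down
    for t in range(total_iters):
        step = aggrevated_step if t < switch else ppo_pp_step
        policy = step(policy, guide)
    return policy
```

Setting α = 0 reduces the schedule to PPO++ alone and α = 1 to AggreVaTeD alone, so α acts as a mixing time rather than a mixing probability.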

5.5 Theoretical Justification

In this section, we provide theoretical justification for various rollin and rollout schemes

mentioned in Section 5.4. Each algorithmic scheme takes advantage of a guide policy



πg, the ability to restart the policy to any state, and access to the reward signal. Our

theoretical justifications are derived from the original algorithms on which each method

is built.

Interactive Imitation Learning: AggreVaTeD In our interactive IL setting, we assume

access to the ground truth reward and to a guide policy πg that may not necessarily

be an expert policy π⋆ (i.e. optimal at the task). Our AggreVaTeD (Algorithm 7)

implementation is a modification of the original AggreVaTeD (Sun et al., 2017a) to

incorporate a PPO policy gradient loss. The overall idea is to perform policy gradient

updates on the loss function $\ell_t(\pi) := \mathbb{E}_{s \sim d^{\pi_t}} \mathbb{E}_{a \sim \pi(\cdot|s)}\left[A^{\pi^g}(s, a)\right]$, where $\pi_t$ is our latest learned

policy. We can define the average-regret and best policy performance in our policy class

over T iterations as:

$$\epsilon_{\text{regret}} = \frac{1}{T}\left( -\sum_{t=0}^{T} \ell_t(\pi_t) + \max_{\pi \in \Pi} \sum_{t=0}^{T} \ell_t(\pi) \right), \qquad \epsilon_{\text{class}} = \max_{\pi \in \Pi} \frac{1}{T} \sum_{t=0}^{T} \mathbb{E}_{s \sim d^{\pi_t}}\left[ A^{\pi^g}(s, \pi(s)) \right].$$

If the gradient update procedure achieves no-regret, i.e., $\epsilon_{\text{regret}} \to 0$ as $T \to \infty$,

AggreVaTeD achieves the following guarantee: there exists $t \in [T]$ such that

$$V^{\pi_t} \ge V^{\pi^g} + H \epsilon_{\text{class}}.$$

When the guide policy is included in our policy class, $\pi^g \in \Pi$, e.g., when our policy πθ and

our guide πg have the same GPT2 model architecture, then our $\epsilon_{\text{class}}$ term is guaranteed

to be non-negative. Furthermore, this term is positive when πg is not globally optimal

with respect to its advantage function (i.e., $\max_a A^{\pi^g}(s, a)$ can be positive). Thus when

$\epsilon_{\text{regret}} \to 0$ (i.e., no-regret), AggreVaTeD is guaranteed to learn a policy πt that outperforms

the guide policy by a margin. This was originally confirmed empirically in Sun et al.

(2017a) and is also confirmed in our experiments. With our SFT model with nucleus

sampling as πg, AggreVaTeD learns a policy πt outperforming πg.



Reinforcement Learning with better restart distribution: PPO++ Although

AggreVaTeD is capable of outperforming πg, it is an imitation learning algorithm, mean-

ing by design, its performance is limited by the performance of πg. In contrast, RL has

the potential to learn the near optimal policy, but popular RL approaches suffer from a

lack of exploration. We propose to leverage rollins with the guide policy to overcome

RL’s exploration issues. PPO++ (Algorithm 6) implements this idea using a PPO loss. We

can interpret the rollin policy distribution with the guide policy as a restart distribution

that alters the initial distribution of our policy, i.e., $\mu_{\text{mix}} := (1 - \beta)\mu + \beta d^{\pi^g}$, where recall

µ ∈ ∆(D) is the original initial state distribution over our data.

Policy gradient theory (Kakade and Langford, 2002b; Bagnell et al., 2003; Agarwal

et al., 2019, 2021a) ensures that as long as a near optimal policy is covered by the

restart distribution, we can learn to perform as well as the near optimal policy. More

formally, consider the special case where β = 1/2, and π⋆ is the globally optimal policy;

and assume that at some iteration t the one-step local improvement over πt is small, i.e.,

$\mathbb{E}_{s,a \sim d^{\pi_t}_{\mu_{\text{mix}}}}\left[\max_a A^{\pi_t}(s, a)\right] \le \epsilon$ for some small ϵ. Then we have:

$$V^{\pi_t} \ge V^{\pi^\star} - O\left( H^2 \max_s \left( \frac{d^{\pi^\star}(s)}{d^{\pi^g}(s)} \right) \epsilon \right).$$

We refer readers to the proof of Theorem 6.2 in Kakade and Langford (2002b). Note that

compared to the result from AggreVaTeD, we are able to compare against the globally

optimal policy π⋆ under the condition that πg’s state distribution covers π⋆’s state distribution

(i.e., the guide policy has a good sense of what states π⋆ will likely visit). In our

experiments, we mainly use an SFT model with nucleus sampling as our guide policy πg.

While we do not expect the SFT policy πg to be as good as the optimal π⋆, it is reasonable to

expect that $d^{\pi^g}$ provides coverage of $d^{\pi^\star}$. Our experiments verify that restarting based on

states from $d^{\pi^g}$ improves the performance of PPO.



Combine Reinforcement Learning and Imitation Learning: D2LOLS D2LOLS is the

simplest approach to combine AggreVaTeD and PPO++. This algorithm runs AggreVaTeD

for a fixed period of time and then PPO++ for the remaining time. If our policy gradient

algorithm is Trust-region policy optimization (TRPO)¹ (Schulman et al., 2015a) or CPI

(Kakade and Langford, 2002b), then our algorithm has a guaranteed monotonic policy

improvement. This means that upon convergence, we achieve two properties: (1) our

learned policy is at least as good as the guide policy πg, and (2) our policy is

locally optimal, i.e., the local one-step improvement, $\mathbb{E}_{s,a \sim d^{\pi}_{\mu_{\text{mix}}}}\left[\max_a A^{\pi}(s, a)\right]$, has to be

small (otherwise TRPO and CPI can keep improving).

There exist several algorithms in the literature that combine RL and IL (Cheng et al.,

2018; Sun et al., 2018; Chang et al., 2015b; Rajeswaran et al., 2017a; Nair et al., 2018).

The key difference between D2LOLS and LOLS is how PPO++ and AggreVaTeD are combined.

LOLS uses a mixing probability α to combine the advantage functions of our πθ and the guide

policy πg, $\alpha A^{\pi_\theta^t}(s, a) + (1 - \alpha) A^{\pi^g}(s, a)$; whereas D2LOLS uses a mixing time parameter α to

decide when to switch from doing AggreVaTeD to PPO++ for the remainder of training.

LOLS can achieve the property of outperforming πg and also being locally

optimal, but only under the assumption that the following gap is small:

$$\forall \pi : \left| \mathbb{E}_{s \sim d^{\pi}}\left[ \max_a A^{\pi^g}(s, a) + \max_a A^{\pi}(s, a) \right] - \mathbb{E}_{s \sim d^{\pi}} \max_a \left[ A^{\pi^g}(s, a) + A^{\pi}(s, a) \right] \right| \le \varepsilon,$$

with some small ε. However, such a gap can exist in practice and does not vanish

even with enough training data. Intuitively, this gap is non-trivial when the one-step

improvement over π contradicts the one-step improvement over πg. The simplest

approach, D2LOLS, works the best and achieves the guarantee that LOLS aimed for without

the additional assumption of the above gap being small.

¹ In our experiments, instead of using TRPO, we use PPO, a scalable version of TRPO that is more
suitable for high-dimensional problems. However, we emphasize that TRPO and PPO use the same
principle for policy optimization: make conservative policy updates (Kakade and Langford, 2002b) to
ensure monotonic improvement.



                IMDB Sentiment (Semantic and Fluency Metrics)               CommonGen (Lexical and Semantic Metrics)
Algorithms      Sentiment Score (↑)  Perplexity (↓)  Output-Perplexity (↓)  Bleu-4 (↑)    CIDEr-D (↑)   SPICE (↑)

Zero-Shot       0.48 ± 0.00          32.55 ± 0.00    5.64 ± 0.00            0.00 ± 0.00   6.02 ± 0.55   15.02 ± 0.40
SFT             0.55 ± 0.00          35.67 ± 0.00    6.19 ± 0.00            22.31 ± 0.12  14.32 ± 0.15  31.73 ± 0.34

SFT+PPO         0.97 ± 0.01          44.92 ± 1.78    3.17 ± 0.62            27.98 ± 0.32  16.91 ± 0.29  32.61 ± 0.06
SFT+PPO++       0.97 ± 0.01          44.83 ± 2.10    3.34 ± 0.80            28.48 ± 0.24  16.94 ± 0.53  32.75 ± 0.21
SFT+AggreVaTeD  0.95 ± 0.03          52.56 ± 5.38    5.04 ± 2.30            28.14 ± 0.31  16.90 ± 0.09  32.44 ± 0.02
SFT+LOLS        0.93 ± 0.05          53.30 ± 16.70   3.44 ± 4.96            28.15 ± 0.16  16.91 ± 0.22  32.80 ± 0.20
SFT+D2LOLS      0.97 ± 0.00          43.88 ± 2.37    2.92 ± 0.13            28.54 ± 0.12  16.96 ± 0.18  32.83 ± 0.09

Table 5.1: IMDB and CommonGen Results: We compute the mean and standard
deviation over 3 seeds for both the IMDB and the CommonGen tasks. For the reward
function of each task, we use the bold metric(s). The zero-shot model is the performance
of the pretrained model used for IMDB and CommonGen, GPT-2 and T5 respectively.
SFT+Alg indicates running Alg after supervised fine-tuning. SFT+nucleus is used as our
guide policy πg for all experiments.

5.6 Experiments

We perform all of our experiments using a modified PPO objective Jppo (Ouyang et al.,

2022; Wu et al., 2016). This objective combines the original PPO objective with a

maximum-likelihood estimation (MLE) objective on the references of the ground-truth dataset $\mathcal{D}$:

$$J_{\text{ppo}}(\pi_\theta) = \mathbb{E}_{(s,a) \sim \pi_\theta}\left[ R(s) - \lambda \, \mathrm{KL}\left(\pi_\theta(a|s) \,\|\, \pi_0(a|s)\right) \right] + \eta \, \mathbb{E}_{(s,a) \sim \mathcal{D}}\left[ \log \pi_\theta(a|s) \right],$$

where λ is the KL coefficient and η is the MLE coefficient. For all of our proposed

RLGF algorithms discussed in Section 5.4, we consider setting πg to the supervised fine-tuned

model (SFT) with nucleus sampling for decoding (i.e., πg = SFT+nucleus). We

treat SFT+nucleus as a black-box model that we can only query for text generation and

do not perform updates to it. By using SFT+nucleus as our guide policy, we run all of

our experiments under the exact same conditions as those of RLHF. Note, RLHF already

requires keeping SFT to compute the KL constraint, KL(πθ||π0), in Jppo.
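As a rough illustration of how the three terms of Jppo combine, the sketch below forms a Monte Carlo estimate from per-state quantities. It is a sketch under stated assumptions rather than our training code: next-token distributions are given as small dictionaries, the KL term is computed exactly per state, and `j_ppo_estimate` and its argument names are hypothetical.

```python
import math
from typing import Dict, Sequence


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) for two next-token distributions over the same vocabulary.

    Assumes q[a] > 0 wherever p[a] > 0.
    """
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0.0)


def j_ppo_estimate(
    rewards: Sequence[float],                  # R(s) at states visited by pi_theta
    policy_dists: Sequence[Dict[str, float]],  # pi_theta(.|s) at those states
    ref_dists: Sequence[Dict[str, float]],     # pi_0(.|s) at those states
    data_logprobs: Sequence[float],            # log pi_theta(a|s) on dataset references
    lam: float,                                # KL coefficient (lambda)
    eta: float,                                # MLE coefficient (eta)
) -> float:
    """Sample average of [R - lam * KL] plus eta times the mean reference log-likelihood."""
    rl_terms = [
        r - lam * kl_divergence(p, q)
        for r, p, q in zip(rewards, policy_dists, ref_dists)
    ]
    rl_term = sum(rl_terms) / len(rl_terms)
    mle_term = sum(data_logprobs) / len(data_logprobs)
    return rl_term + eta * mle_term
```

Setting η = 0 leaves only the KL-regularized reward term, and a larger λ keeps πθ closer to π0 at the cost of reward.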



Task Details In our experiments, perplexity measures how likely our learned model,

πθ, is to generate the references in the task dataset, whereas output perplexity computes

how likely a general LLM (e.g., GPT-J) is to produce the generations from our learned

policy, πθ. Both perplexity metrics have been reported as a measure of fluency (Fedus

et al., 2018; Ramamurthy et al., 2022a).
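Both perplexity metrics can be computed from per-token log-probabilities using the standard definition, the exponential of the negative mean token log-likelihood. The sketch below assumes this per-token normalization (the exact normalization used in our evaluation code is not restated here), and `perplexity` is a hypothetical helper name. For perplexity, the log-probabilities would come from πθ scoring the dataset references; for output perplexity, from a general LLM scoring πθ's generations.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean log-probability (natural log) of the scored tokens."""
    if len(token_logprobs) == 0:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that assigns probability 1/4 to every token of a three-token text.
print(perplexity([math.log(0.25)] * 3))
```

Lower values mean the scoring model finds the text more likely, which is why both columns are marked with (↓) in the results tables.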

We perform experiments on three tasks. IMDB is the first task, and the objective is

to generate fluent, positive-sentiment text continuations for IMDB (Maas et al.,

2011) movie review prompts. We use a sentiment classifier (Sanh et al., 2019), trained

on review texts and sentiment labels from the dataset, as our reward function; it

provides sentiment scores indicating how positive a given piece of text is. For

training SFT baselines, we consider only the examples with positive labels.

We chose GPT2 (Radford et al., 2019) as the base language model (LM) for this task.

We evaluate all algorithms on three metrics: sentiment reward score, perplexity, and

output-perplexity.

Next, we consider CommonGen (Lin et al., 2020b), a challenging constrained text

generation task that tests the ability of generative commonsense reasoning. We optimize

the SPIDER (Liu et al., 2017) reward function, a weighted combination of the CIDEr-D

and SPICE metrics. We chose T5-base (Raffel et al., 2020) as our base LLM and prefixed

each concept set input with: "generate a sentence with:". We report three metrics: BLEU

(Papineni et al., 2002), CIDEr-D (Vedantam et al., 2015), and SPICE (Anderson et al.,

2016). For IMDB and CommonGen, we perform one epoch of supervised fine-tuning for

our SFT models.

The final task we consider is the Reddit TL;DR summarization dataset (Völske et al.,

2017), where the objective is to generate summaries. We use the filtered dataset with

additional human preference data used in Stiennon et al. (2020). The base LLM that

we use for this task is GPT-J (Wang and Komatsuzaki, 2021) and we train all models

in our algorithms using LoRA adapters (Hu et al., 2021). We evaluate all algorithms

on 5 metrics: reward score, perplexity, output-perplexity, win rate, and Rouge (Lin,

2004). For win rate, we use the open source Llama2-13B-chat (Touvron et al., 2023)

model as our evaluator model. We compare all algorithm generations to the preferred

summary references. For our SFT model, we use an open-source GPT-J model². Refer to

Appendix D.3.2 for the exact Win Rate prompt, example evaluations, and implementation

details.

                         TL;DR Summarization (Semantic and Fluency Metrics)
Algorithms               RM Score (↑)  Perplexity (↓)  Output-Perplexity (↓)  Win Rate (↑)  Rouge 1 (↑)  Rouge 2 (↑)  RougeL (↑)

Zero-Shot                1.57          14.07           11.51                  44.12%        0.27         0.07         0.18
SFT                      5.68          14.09           12.81                  44.29%        0.34         0.25         0.25
Best-of-N (N = 8)        5.98          14.09           12.86                  47.60%        0.36         0.13         0.27

SFT+PPO                  6.01          15.05           17.67                  54.25%        0.35         0.13         0.27
SFT+PPO++                6.11          14.53           16.15                  55.01%        0.36         0.14         0.27
SFT+AggreVaTeD           5.93          14.69           16.41                  48.98%        0.36         0.15         0.29

SFT+PPO (N = 8)          6.20          14.87           16.53                  57.53%        0.36         0.15         0.27
SFT+PPO++ (N = 8)        6.52          13.42           15.23                  60.30%        0.38         0.15         0.28
SFT+AggreVaTeD (N = 8)   6.11          13.53           15.61                  54.12%        0.37         0.16         0.28

Table 5.2: TL;DR Summarization Results: We report the mean over 1 seed. Our RM
Score is under our trained preference reward model and the Win Rate is evaluated by
Llama2-13B-Chat. We use SFT+nucleus as πg. We also report Best-of-8 results with our
trained policies.

5.6.1 Experimental Results

RLGF vs. RLHF Performance Table 5.1 and Table 5.2 compare all of the RLGF

algorithms proposed in Section 5.4 against standard RLHF algorithms and baselines. For

all tasks, our πg is SFT+nucleus which is sub-optimal, performing worse than all RL

based algorithms across most lexical and semantic metrics. Utilizing this πg, for IMDB,

2https://huggingface.co/CarperAI/openai_summarize_tldr_sft



D2LOLS outperforms PPO on all metrics while PPO++ outperforms PPO on both semantic

reward and perplexity, and for CommonGen, D2LOLS outperforms PPO in all metrics

including the ones that are not included in the reward function. Finally, for TL;DR

summarization we see that PPO++ performs better than PPO as well as a competitive

baseline, Best-of-N (Dubois et al., 2023). Furthermore, when applying Best-of-N

inference on our trained policies, we see that PPO++ improves even more beyond PPO.

Notably, with or without the best-of-N procedure, PPO++ matches or outperforms PPO on all metrics.

Supporting our justification from Section 5.5, AggreVaTeD improves beyond our

guide policy, providing an alternative warm-starting methodology to warm-starting

with SFT. PPO++, on the other hand, is better than or competitive with our RL baseline,

demonstrating a simple yet powerful alternative to PPO as the RL procedure. Even in

practice, we observe the benefit of restarting from an initial state distribution that better

covers an optimal policy’s state distribution. The combination of these two, D2LOLS,

achieves the best of both worlds and fully leverages the capabilities of utilizing a guide

policy.

Reward Optimization Tradeoff In Figure 5.3 we evaluate how well RLGF algorithms

trade off optimizing the reward against minimizing the perplexity and the KL constraint, measured as √KL(π||π0).

For fair comparisons, we kept λ and η the same across all algorithms. For both plots,

the top right corner indicates the policy has both high reward and low perplexity and low

divergence from π0. For each algorithm we plot 5 checkpoints ranging from 20 to 100

iterations. PPO++ mostly matches or has higher reward than PPO while maintaining a lower

perplexity. Separately, AggreVaTeD trades off reward for perplexity: its reward scores are

comparable to PPO's while its perplexity is drastically lower. For the KL-constraint plot

on the left of Figure 5.3, we see that although PPO has a set of points with high reward,

most of these points also have high KL divergences. In contrast, a subset of PPO++ checkpoints matches



[Figure 5.3 plots. Left panel: √KL(π||π0) versus RM Score. Right panel: perplexity versus RM Score. Legend: PPO++, AggreVaTeD, PPO, SFT.]

Figure 5.3: We investigate the reward optimization, KL-constraint, and fluency trade-off in
our TL;DR summarization task. The dashed line represents our SFT policy’s performance
across each metric. Both PPO++ and AggreVaTeD learn a policy that has a better trade-off
than PPO.

[Figure 5.4 plot: mean CIDEr-D score versus prompt difficulty (easy, hard). Legend entries: T5 SFT PPO AggreVaTeD LOLS PPO++ D2LOLS.]

Figure 5.4: Comparison of CIDEr-D scores grouped by prompt difficulty on CommonGen.
The performance gap between easy and hard prompts is evident for SFT and PPO++, while
our proposed algorithms AggreVaTeD, LOLS and D2LOLS exhibit a significantly smaller
gap, showcasing their effectiveness on challenging prompts.

or has higher reward than PPO while having a lower KL constraint.



RLGF Performance on Difficult Prompts We carried out this evaluation on the

CommonGen task, classifying the prompts as easy or hard based on the number of unseen

concepts in the prompt. Specifically, we categorized prompts with 3 concepts as easy and

prompts with more than 3 concepts as hard. Figure 5.4 presents a comparison of scores for different

algorithms grouped by prompt difficulty. The results reveal a notable performance

gap between easy and hard prompts for algorithms such as SFT and PPO, whereas our

proposed algorithms PPO++, AggreVaTeD, LOLS and D2LOLS exhibit a smaller gap, with

D2LOLS having the smallest gap. In other words, even on challenging prompts, our interactive

algorithms produce better text continuations. See Appendix D.5 for example generations.
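The grouping used in this comparison can be sketched in a few lines of Python. This is a minimal illustration and not the evaluation code behind Figure 5.4: the function names, the example concept sets and the per-prompt scores are hypothetical placeholders, and the score stands in for a per-prompt metric such as CIDEr-D.

```python
from statistics import mean


def difficulty(concepts):
    """Label a concept set: more than 3 concepts is 'hard', otherwise 'easy'."""
    return "hard" if len(concepts) > 3 else "easy"


def difficulty_gap(examples):
    """Return (mean easy score, mean hard score, easy minus hard gap).

    `examples` is a list of (concepts, score) pairs.
    """
    groups = {"easy": [], "hard": []}
    for concepts, score in examples:
        groups[difficulty(concepts)].append(score)
    easy, hard = mean(groups["easy"]), mean(groups["hard"])
    return easy, hard, easy - hard


# Hypothetical per-prompt scores, for illustration only.
examples = [
    (["dog", "ball", "park"], 12.0),
    (["cat", "sofa", "nap"], 10.0),
    (["chef", "knife", "onion", "pan"], 8.0),
    (["kid", "kite", "wind", "beach", "run"], 6.0),
]
easy, hard, gap = difficulty_gap(examples)
```

A smaller gap for an algorithm means its score degrades less when moving from easy to hard prompts, which is the quantity compared across algorithms in Figure 5.4.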

MLE and KL coefficient Sensitivity We test the sensitivity of PPO and RLGF al-

gorithms to two regularization hyperparameters in the Jppo objective, namely the KL

coefficient, λ, and the MLE coefficient, η. The left 2 plots in Figure 5.5 show the reward

and perplexity when we keep η fixed and vary λ while the right 2 show the performance

when we keep λ fixed and vary η. As shown in the left two figures, all RL algorithms are

robust to varying KL coefficients. We see that when we vary λ, while our algorithms

PPO++, D2LOLS and the baseline PPO have similar rewards, our algorithms consistently

maintain a lower (or equal) perplexity than PPO. From the right two figures, we observe

much more instability on perplexity when relaxing our MLE regularization with both

PPO and RLGF algorithms’ perplexities blowing up. Note that when increasing η, our

algorithm PPO++ consistently has higher rewards and lower (or equal) perplexity than

PPO.



[Figure 5.5 plots, left to right: reward score versus KL coefficient (λ); perplexity versus KL coefficient (λ); reward score versus MLE coefficient (η); perplexity versus MLE coefficient (η). Legend: PPO, AggreVaTeD, PPO++, LOLS, D2LOLS.]

Figure 5.5: Jppo KL coefficient (λ) and MLE coefficient (η) ablation. We show the
sensitivity of PPO and RLGF algorithms to each regularization term in the objective.
Note that all RL algorithms are robust to changes in KL coefficient with relatively minor
changes in the Perplexity while being more sensitive to changes in MLE objective (Right)
with blowups in the perplexity.

5.7 Conclusion and Future Work

We presented a unifying framework of incorporating a guide policy to enhance rein-

forcement learning for natural language generation. Through theoretical justification and

experimental validation, we demonstrate that our RLGF framework can outperform PPO

for fine-tuning LLMs. Our proposed algorithms PPO++ and D2LOLS only require black-box

access to the guide policy and are conceptually simple and easy to implement based on

PPO. While in our experiments we demonstrate that supervised fine-tuned models with

standard decoding strategies are good candidates for the guide policy, our framework is

general enough to leverage any LLM as the guide policy, including those that are

not open-sourced. Finally, RLGF’s contributions to the broader large language model

literature are complementary to model enhancements, dataset improvements, and prompting

discoveries such as in-context prompting. We leave it to exciting future work to test

the full capabilities of bootstrapping the state-of-the-art advancements in each research

direction with RLGF to improve reinforcement learning for natural language generation.



CHAPTER 6

PROVABLY EFFICIENT RL WITH PREFERENCE-BASED FEEDBACK VIA

DATASET RESET

Reinforcement Learning (RL) from Preference-based feedback is a popular paradigm

for fine-tuning generative models, which has produced impressive models such as GPT-4

and Claude 3 Opus. This framework often consists of two steps: learning a reward

model from an offline preference dataset followed by running online RL to optimize

the learned reward model. In this work, leveraging the idea of reset, we propose a new

RL algorithm that can learn from preference-based feedback with provable guarantees.

Motivated by the fact that the offline preference dataset provides informative states (i.e., data

that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization

(DR-PO), integrates the existing offline preference dataset into the online policy training

procedure via dataset reset: it directly resets the policy optimizer to the states in the

offline dataset, instead of always starting from the initial state distribution. In theory, we

show that DR-PO learns to perform at least as well as any policy that is covered by the

offline dataset under general function approximation with finite sample complexity. In

experiments, we demonstrate that on both the TL;DR summarization and the Anthropic

Helpful and Harmless (HH) datasets, the generation from DR-PO is better than that from

Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), under

the metric of GPT4 win-rate.

6.1 Introduction

Reinforcement learning aims at maximizing a cumulative reward function. However,

specifying a reward function in practice can be challenging (Wirth et al., 2017). Rein-

forcement Learning with Human Feedback (RLHF) has become an effective approach



when a reward function does not exist (Christiano et al., 2017). Operating under a setting

where human labelers provide preference-based feedback (e.g., ranking of generations

from an RL agent), RLHF learns a reward model and then optimizes the reward model via

RL techniques. RLHF has found applications across various domains, including games

(MacGlashan et al., 2017; Christiano et al., 2017; Warnell et al., 2018), large language

models (LLMs) (Ziegler et al., 2019; Stiennon et al., 2020; Wu et al., 2021; Nakano et al.,

2021; Ouyang et al., 2022; Glaese et al., 2022; Bai et al., 2022a; Ramamurthy et al.,

2022b; Liu et al., 2023b), and robot learning (Brown et al., 2019; Shin et al., 2023).

RLHF typically consists of the following two steps: (1) fitting a reward model using

a pre-collected offline preference-based dataset (often generated from some pre-trained

models and labeled by humans), and (2) learning a policy via online RL (e.g., Proximal

Policy Optimization (Schulman et al., 2017b)) to optimize the learned reward model.

These two steps are often done separately in the sense that once the reward model

is learned, step (2) only optimizes the reward model without ever using the offline

preference dataset. Is there any benefit of re-using the offline data during the procedure

of optimizing the reward model via online RL? Prior work on hybrid RL (Song et al.,

2022; Ball et al., 2023) demonstrated that combining offline data and online data can

often significantly boost learning efficiency. Can we achieve a similar boost in learning

efficiency for RLHF?

Towards answering this, we propose an algorithm called Dataset Reset Policy Opti-

mization (DR-PO), operating under the assumption of being able to reset, i.e., we can

go back to any state and start policy optimization and data collection from that point (as

opposed to resetting to initial states). While being able to reset is certainly an assumption,

it is naturally satisfied when using RL to fine-tune generative models like language

models and diffusion models (Lee et al., 2023b). This is because the underlying Markov



transitions are simple, known, and deterministic. Our algorithm, DR-PO, is a hybrid

RL approach which integrates offline data into an online RL procedure: when collecting

online data, DR-PO resets the policy optimizer to the states in the offline dataset for

exploration. Algorithmically, DR-PO is simple: it iteratively collects a batch of online

data by resetting the policy to states in the offline data, performs policy rollouts, and

optimizes the policy using the online batch via policy optimization techniques such

as Natural Policy Gradient (NPG) (Kakade, 2001a) or Actor-critic methods (e.g., PPO

(Schulman et al., 2017b)).

While DR-PO is as simple to implement as most of the existing policy optimization

algorithms, we demonstrate that DR-PO achieves strong theoretical guarantees under

natural assumptions. Specifically, DR-PO is capable of learning a policy that is at least as

good as any policy which is covered by the offline data in terms of maximizing the ground

truth rewards, and DR-PO achieves this result under general function approximation

with finite sample complexity. DR-PO is also computationally tractable since it only

requires supervised learning style oracles such as a Maximum Likelihood Estimation

(MLE) oracle (for fitting reward models) and a Least Squares Regression oracle (for

learning value functions). Thus DR-PO advances the status of the theoretical work on

RLHF (see more detailed discussion in Section 6.1.1). To support our new theory, we

test our approach on a standard RLHF dataset: TL;DR summarization (Stiennon et al.,

2020). We demonstrate that the summaries generated by DR-PO outperform those from

PPO and DPO (Rafailov et al., 2023b) in terms of GPT4 win-rate. We also show that

when transferring the policies trained on TL;DR to the CNN/DailyMail news articles in a

zero-shot manner, policies trained via DR-PO again generate summaries that outperform

those from PPO and DPO. Finally, we test how DR-PO scales on Anthropic HH (Bai

et al., 2022b) across three different model scales and show that DR-PO scales just as well

as PPO while still outperforming baselines.



6.1.1 Related Work

Provably efficient RLHF. The theoretical investigation of online RLHF started in the

bandit setting with the notion of dueling bandits (Yue et al., 2012; Zoghi et al., 2014;

Dudík et al., 2015), which aims at identifying the optimal arm with human preference

feedback over action pairs. Extending this discussion to tabular MDPs, Novoseller et al.

(2020) proposes a dueling posterior sampling algorithm that requires computing and

sampling from the posterior of the model dynamics and reward function, leading to

potential computational inefficiency. Another PAC RLHF algorithm for tabular MDPs is

presented by Xu et al. (2020). However, this method involves computing complicated

bonus terms to guide exploration. Additionally, Pacchiano et al. (2021); Chen et al.

(2022) have designed online RLHF algorithms with provable guarantees by updating a

confidence set of the policies iteratively, which, unfortunately, are not practically feasible

either. In a more recent study, Zhan et al. (2023b) tackles the problem of reward-free

RLHF. Nevertheless, their algorithm introduces a series of non-convex optimization

problems which are challenging to solve. Notably, these works either only focus on

tabular MDPs (Novoseller et al., 2020; Xu et al., 2020; Pacchiano et al., 2021) or rely on

specialized function approximation such as linear parametrization (Pacchiano et al., 2021;

Zhan et al., 2023b) and function classes with small Eluder dimension (Chen et al., 2022;

Wu and Sun, 2023), which further restricts their application in practice. In contrast, we

focus on the setting where preference-labeled data is only available offline, which is more

consistent with the settings considered in applications of fine-tuning language models.

Also by using the idea of dataset reset, our algorithm works with function approximation

that is much more general than the above prior works.

The study on theoretical offline RLHF is more limited. Li et al. (2023) focuses

on learning the reward from a human’s behavior in dynamic discrete choice models



rather than from human preference feedback, and thus, the setting is different. Zhu

et al. (2023a) studies PAC algorithms for linear models and Zhan et al. (2023a) extends

the analysis to general function approximation. However, both of their algorithms are

not computationally efficient because they rely on constructing a confidence set for the

reward function and solving a constrained maximin problem.

Tiapkin et al. (2023) studied the setting where high-quality expert demonstrations

exist. They use behavior cloning to train a policy using expert demonstrations and then

run an Upper-confidence-bound style algorithm to optimize a reward function under a

KL regularization to the behavior-cloned policy. They show that for tabular and linear

MDP, the expert demonstrations reduce the sample complexity of online RL. We consider

preference-based offline datasets, which may not necessarily come from a high-quality

expert, and function approximation that is significantly more general than linear and

tabular functions. Note that UCB based algorithms can quickly become computationally

intractable beyond tabular and linear settings (e.g., Jiang et al. (2016); Du et al. (2021)).

Our algorithm uses the idea of dataset reset for exploration and does not involve any

optimism-based exploration strategy, making it computationally tractable even when

dealing with general function approximation. We think that the key idea of dataset reset

can also be used in the setting from Tiapkin et al. (2023) to make their algorithm extend

beyond the tabular and linear MDP settings.

Empirical RLHF algorithms. This work continues the recent literature of RLHF

algorithms that perform online RL (Zhu et al., 2023b; Wu et al., 2023; Chang et al.,

2023) to finetune large generative models. There have also been efforts to build on top of

DPO (Rafailov et al., 2023b) with algorithms such as IPO (Azar et al., 2011) and KTO

(Contextual.ai, 2023). In this paper, our work is complementary to many of these efforts

in augmenting RL through the incorporation of dataset resets in online generation. Ideas



from this work could directly be applied to existing online RLHF algorithms such as P3O

(Wu et al., 2023) and APA (Zhu et al., 2023b). Given the recent work (Yuan et al., 2024)

in incorporating online generations to improve DPO, an offline RLHF method, the idea

of dataset resets could also be relevant in this space of hybrid RLHF methods.

Using reset in RL. The idea of reset is not new in RL (Kakade et al., 2003a; Bagnell,

2004; Nair et al., 2018; Salimans and Chen, 2018; Yin et al., 2022; Uchendu et al., 2023;

Silver et al., 2016a; Agarwal et al., 2019; Daumé III and Marcu, 2005; Daumé III et al.,

2009). When resetting is available, it helps address exploration and credit assignment

problems. In this work, we show that resetting to an offline dataset helps in RLHF. The

key challenge in RLHF is that the reward model is learned purely from offline data which

may not have global coverage of the entire state space. Our algorithm incorporates KL

regularization to ensure the learned policies do not deviate too much from the offline data

so that we do not over-optimize the learned reward model (e.g., reward hacking). While

the idea of KL-regularization was also used in prior empirical RLHF works (e.g., Stiennon

et al. (2020); Bai et al. (2022a)), we show that by combining the two key ideas, KL

regularization and dataset reset, our algorithm achieves strong performance in both theory

and practice. We also demonstrate the efficacy of our approach in the application of

fine-tuning language models.

6.2 Preliminaries

Markov Decision Processes. In this paper we consider an episodic time-inhomogeneous

Markov Decision Process (MDP) $\mathcal{M}$ with state space $\mathcal{S} = \{\mathcal{S}_h\}_{h=1}^{H}$,

action space $\mathcal{A}$ and horizon $H$. Here $\mathcal{S}_h$ is the subspace of all states at step $h$. We suppose

the states incorporate the information of the current step and thus $\{\mathcal{S}_h\}_{h=1}^{H}$ are mutually



disjoint. We assume that every episode begins at the same state $s_1$ and ends at the dummy

state $s_{H+1}$, but our analysis can be extended to a random starting state easily. In each

episode, at step $h \in [H]$, the agent observes the current state $s_h$ and executes an action $a_h$.

Then the environment generates a reward $r^\star(s_h, a_h)$ (which can be unobservable to the

agent), and transitions to a new state $s_{h+1}$, which is sampled from the transition probability

$P(\cdot \mid s_h, a_h)$. Here we suppose the reward function $r^\star : \mathcal{S} \times \mathcal{A} \mapsto [0, 1]$ is bounded, and for

any possible trajectory $\tau = (s_h, a_h)_{h=1}^{H}$, we have $\sum_{h=1}^{H} r^\star(s_h, a_h) \le r_{\max}$. Note that when

the reward is sparse, $r_{\max}$ can be much smaller than $H$.

A policy $\pi : \mathcal{S} \to \Delta_{\mathcal{A}}$ specifies the action selection probability of the agent conditioned

on the current state. Given a policy $\pi$, we define its state-action visitation measure

as $d^{\pi}_h(s, a) = \mathbb{P}^{\pi}(s_h = s, a_h = a)$ for all $s \in \mathcal{S}_h$, $a \in \mathcal{A}$, $h \in [H]$, where $\mathbb{P}^{\pi}(\cdot)$ denotes

the distribution of the trajectory when executing policy $\pi$. We will also use

$d^{\pi}_h(s) = \sum_{a \in \mathcal{A}} d^{\pi}_h(s, a)$ to denote the state visitation measure and $d^{\pi}(\tau)$ to denote the distribution

of the trajectory under policy $\pi$. We can further define the associated value functions

and Q functions of policy $\pi$ and reward function $r$ as

$$V^{\pi,r}(s) = \mathbb{E}_{\pi}\left[\sum_{t=h}^{H} r(s_t, a_t) \,\middle|\, s_h = s\right], \qquad Q^{\pi,r}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=h}^{H} r(s_t, a_t) \,\middle|\, s_h = s, a_h = a\right]$$

for all $h \in [H]$, $s \in \mathcal{S}_h$, $a \in \mathcal{A}$ (footnote 1). They

characterize the expected cumulative reward under policy $\pi$ starting from a state or a

state-action pair.
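To make the value and Q function definitions concrete, the following sketch computes them by backward induction on a tiny tabular MDP. It is an illustration only: the nested-list layout, the function name and the toy MDP are assumptions made for this example, steps are indexed from 0 rather than 1, and the same state indices are reused at every step instead of using disjoint per-step state spaces.

```python
def evaluate_policy(P, r, pi):
    """Backward induction for the value and Q functions of a fixed policy.

    P[h][s][a][s2] : probability of moving from s to s2 under action a at step h
    r[h][s][a]     : reward at step h
    pi[h][s][a]    : probability that the policy picks action a in state s at step h
    Returns V[h][s] and Q[h][s][a], the expected reward summed from step h onward.
    """
    H, S, A = len(r), len(r[0]), len(r[0][0])
    V = [[0.0] * S for _ in range(H + 1)]  # V[H] = 0 plays the role of the dummy end state
    Q = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(S):
            for a in range(A):
                future = sum(P[h][s][a][s2] * V[h + 1][s2] for s2 in range(S))
                Q[h][s][a] = r[h][s][a] + future
            V[h][s] = sum(pi[h][s][a] * Q[h][s][a] for a in range(A))
    return V[:H], Q


# Toy MDP (hypothetical): action a moves to state a, and only action 1 is rewarded.
H, S, A = 2, 2, 2
P = [[[[1.0 if s2 == a else 0.0 for s2 in range(S)] for a in range(A)]
      for _ in range(S)] for _ in range(H)]
r = [[[float(a == 1) for a in range(A)] for _ in range(S)] for _ in range(H)]
pi = [[[0.5] * A for _ in range(S)] for _ in range(H)]  # uniform policy
V, Q = evaluate_policy(P, r, pi)
```

Under the uniform policy each step contributes an expected reward of 0.5, so the value at the first step is 1.0 and the Q value of the rewarded action there is 1.5.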

We aim to find an $\epsilon$-optimal policy $\hat{\pi}$ with respect to the true reward $r^\star$ and a target

policy $\pi^\star$ which we denote as some high-quality policy ($\pi^\star$ is not necessarily the globally

optimal policy), i.e., $V^{\pi^\star, r^\star}(s_1) - V^{\hat{\pi}, r^\star}(s_1) \le \epsilon$. Particularly, we would only utilize

common oracles such as Maximum Likelihood Estimator (MLE) and Least Squares

Regression (LSR). We also want our algorithms to be able to leverage general function

classes.
1 For notational simplicity, we drop the usual subscript h in value functions, as we have assumed state s

contains the information of time step h.



RL from Human Feedback (RLHF). We consider the setting where the true reward

$r^\star$ is unobservable. Instead, we have access to an offline trajectory-pair dataset

$\mathcal{D}_R = \{(\tau^0_m, \tau^1_m, o_m)\}_{m=1}^{M}$ labeled with human preference, where the trajectories $\tau^0_m$ and $\tau^1_m$ are i.i.d.

sampled from some pre-trained policy $\pi_{\mathrm{SFT}}$ (e.g., in NLP tasks, this can be the instruction

fine-tuned policy, which is also called supervised fine-tuned (SFT) policy). In this work,

we do not explicitly consider the learning procedure of $\pi_{\mathrm{SFT}}$, and we assume it is given to

us. Here $o_m \in \{0, 1\}$ characterizes the human preference over the trajectory pairs $(\tau^0_m, \tau^1_m)$

and we suppose the human preference is modeled by a monotonically increasing link

function $\Phi$:

$$\mathbb{P}(o = 1 \mid \tau^0, \tau^1) = \mathbb{P}(\tau^1 \succ \tau^0) = \Phi(r^\star(\tau^1) - r^\star(\tau^0)),$$

where we use $r^\star(\tau)$ to denote $\sum_{h=1}^{H} r^\star(s_h, a_h)$ for any trajectory $\tau = (s_h, a_h)_{h=1}^{H}$. A widely

used model is the Bradley-Terry-Luce (BTL) model (Bradley and Terry, 1952) where

the link function is chosen to be the sigmoid function $\sigma(x) = 1/\{1 + \exp(-x)\}$. We will

use $\kappa = \frac{1}{\inf_{x \in [-r_{\max}, r_{\max}]} \Phi'(x)}$ to measure the non-linearity of the link function $\Phi$, which in turn

reflects the hardness of learning the reward model from the human preference. Given

$\mathcal{D}_R$, we can learn a reward model $\hat{r}$ using MLE:

$$\hat{r} = \arg\min_{r \in \mathcal{R}} \sum_{m=1}^{M} -\log \mathbb{P}(o = o_m \mid \tau^0_m, \tau^1_m). \qquad (6.1)$$

With the BTL model, the above NLL becomes

$$\mathbb{1}(o_m = 1) \cdot \log\left(1 + \exp(r(\tau^0_m) - r(\tau^1_m))\right) + \mathbb{1}(o_m = 0) \cdot \log\left(1 + \exp(r(\tau^1_m) - r(\tau^0_m))\right),$$

which is a loss function that has been used in many prior RLHF works (Christiano

et al., 2017; Stiennon et al., 2020). We also assume that we have an unlabeled dataset

$\mathcal{D}_{TR} = \{\tau_n\}_{n=1}^{N}$ where $\tau_n$ is i.i.d. sampled from $\pi_{\mathrm{SFT}}$.
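As a small illustration of the BTL loss and of the quantity κ, the sketch below evaluates the per-pair negative log-likelihood for given trajectory-level rewards and computes κ for the sigmoid link. It is a minimal example under stated assumptions: the function names and the numeric inputs are hypothetical, rewards are passed in as plain floats rather than produced by a learned model, and no care is taken about overflow for very large reward gaps.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def btl_nll(r0, r1, o):
    """Negative log-likelihood of one labeled pair under the BTL model.

    r0, r1 : trajectory-level rewards r(tau^0) and r(tau^1) under a candidate r
    o      : label in {0, 1}; o = 1 means tau^1 is preferred over tau^0
    """
    margin = r1 - r0 if o == 1 else r0 - r1  # reward margin of the preferred trajectory
    return math.log1p(math.exp(-margin))     # equals -log(sigmoid(margin))


def dataset_nll(pairs):
    """Sum of per-pair losses over (r0, r1, o) triples, mirroring the sum in Eq. (6.1)."""
    return sum(btl_nll(r0, r1, o) for r0, r1, o in pairs)


def kappa_sigmoid(r_max):
    """kappa for the sigmoid link on [-r_max, r_max].

    sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) is smallest at |x| = r_max.
    """
    s = sigmoid(r_max)
    return 1.0 / (s * (1.0 - s))


# Hypothetical rewards: the label agrees with the reward gap in the first pair only.
loss_agree = btl_nll(0.0, 2.0, 1)
loss_disagree = btl_nll(0.0, 2.0, 0)
total = dataset_nll([(0.0, 2.0, 1), (0.0, 2.0, 0)])
```

The loss is small when the candidate reward ranks the preferred trajectory higher, and κ grows with r_max because the sigmoid flattens away from zero, which matches the reading of κ as the hardness of learning the reward from preferences.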

The Ability to Reset. We consider the setting where we can reset the system. More

formally, given any state sh at time step h, we can reset the RL agent directly to sh and



rollout a policy π. While this is certainly an assumption, it is satisfied in many important

applications, e.g., fine-tuning generative models such as LLMs (Ouyang et al., 2022)

and Diffusion models (Lee et al., 2023b) with RL. In text generation, a state sh typically

means a partial sentence. Resetting from this state would then mean that we feed the

partial sentence sh to a transformer based policy and have it generate new tokens one

by one starting from the given partial sentence. We emphasize that in the RL literature,

prior works (e.g., PPO and many RL theoretical works (Agarwal et al., 2021a; Azar

et al., 2017; Jin et al., 2020a; Zhan et al., 2022)) typically do not assume the ability to

reset – they often assume the agent has to always start from some initial states. However,

when reset is available, it is often a game changer, in both theory (Yin et al., 2022) and

practice (e.g., AlphaGo (Silver et al., 2016a)).

6.3 Dataset Reset Policy Optimization

We present a meta-algorithm here to provide the details of how we leverage the idea of

dataset reset to collect online batch data. We abstract away the policy optimization oracle

here for the purpose of emphasizing the novelty in terms of how we interact with the

environment for online data collection. Once the online batch data is collected, we feed

it to a policy optimization oracle, e.g., PG, NPG, Actor-critic methods, or a PPO-style

update 2.

Algorithm 9 summarizes the key idea of dataset reset in DR-PO. The key difference

between DR-PO and a more standard policy optimizer is that in DR-PO, for each episode,

the policy collects online trajectories via resetting to a state randomly sampled from
2 Here we mean the specific actor-critic style policy optimization formulation where clipping is used to

ensure small policy updates, and the critic is learned via GAE, on a given online batch of data (Schulman et al.,
2017b).



Algorithm 9 Dataset Reset Policy Optimization (DR-PO)
1: Input: Preference dataset $\mathcal{D}_R$, unlabeled dataset $\mathcal{D}_{TR}$, reward function class $\mathcal{R}$, total number of iterations $T$.
2: Initialize: $\pi^1 = \pi_{\mathrm{SFT}}$.
3: Learn a reward model $\hat{r}$ via MLE based on Eq. (6.1).
4: for $t = 1, \cdots, T$ do
5:     Initialize an empty online batch $\mathcal{D}_{on}$.
       /* Online data collection */
6:     for $n = 1, \cdots, N$ do
7:         Randomly sample a trajectory in $\mathcal{D}_{TR}$ and a state $s_h$ from it where $h \in [H]$.
8:         Reset $\pi^t$ to $s_h$ and roll out $\pi^t$ to generate trajectory $\{s_h, a_h, \ldots, s_H, a_H\}$.
9:         Add trajectory $\{s_{h'}, a_{h'}, \hat{r}(s_{h'}, a_{h'}), \ln(\pi^t(a_{h'}|s_{h'})/\pi_{\mathrm{SFT}}(a_{h'}|s_{h'}))\}_{h'=h}^{H}$ to $\mathcal{D}_{on}$.
10:    end for
11:    Policy update: $\pi^{t+1} \Leftarrow \mathrm{PO}(\pi^t, \mathcal{D}_{on})$. {PG, NPG / TRPO, CPI, Actor-Critic, PPO}
12: end for

some trajectory in the offline dataset $\mathcal{D}_{TR}$. In other words, we do not roll out the policy

$\pi$ from the initial state $s_1$ as typically done in standard policy optimization algorithms like

PG. The online data collection procedure collects a batch of online trajectories $\mathcal{D}_{on}$. Note

for each online trajectory, we record each state-action pair’s reward measured under the

learned reward model $\hat{r}$, and also the log ratio of $\pi^t$ and $\pi_{\mathrm{SFT}}$ which serves as an empirical

estimate of the policy KL divergence, i.e., $\mathrm{KL}(\pi^t(s_{h'}) \| \pi_{\mathrm{SFT}}(s_{h'}))$. Such a KL divergence

term can be optionally used as a reward penalty to ensure the learned policies do not

deviate too far from πSFT so that the reward model r̂ stays as a good approximation of the

true reward r⋆ under learned policies’ trajectory distributions. We use this KL penalty

both in theory and in practice.
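The online data collection step of Algorithm 9 can be sketched for a toy text-generation setting as follows. This is an illustrative sketch and not the implementation used in our experiments: states are tuples of tokens, the policies are hypothetical functions returning a dictionary of next-token probabilities, `reward_fn` stands in for the learned reward model, and steps are indexed from 0.

```python
import math
import random


def collect_online_batch(policy, sft_policy, reward_fn, offline_trajs, n, horizon, rng):
    """Collect n partial trajectories by resetting to prefixes of offline data.

    policy, sft_policy : map a state (tuple of tokens) to {token: probability}
    reward_fn          : stand-in for the learned reward model, reward_fn(state, action)
    offline_trajs      : token sequences of length `horizon`, standing in for the
                         unlabeled dataset sampled from the SFT policy
    """
    batch = []
    for _ in range(n):
        traj = rng.choice(offline_trajs)
        h = rng.randrange(horizon)        # step to reset to (0-indexed)
        state = tuple(traj[:h])           # dataset reset: keep the first h offline tokens
        steps = []
        for _ in range(h, horizon):       # roll out the current policy to the horizon
            probs = policy(state)
            tokens = list(probs)
            action = rng.choices(tokens, weights=[probs[t] for t in tokens])[0]
            log_ratio = math.log(probs[action] / sft_policy(state)[action])
            steps.append((state, action, reward_fn(state, action), log_ratio))
            state = state + (action,)
        batch.append(steps)
    return batch


# Hypothetical two-token policies and reward, for illustration only.
def sft_policy(state):
    return {"a": 0.5, "b": 0.5}


def policy(state):
    return {"a": 0.8, "b": 0.2}


def reward_fn(state, action):
    return 1.0 if action == "a" else 0.0


rng = random.Random(0)
offline = [("a", "b", "a", "b"), ("b", "b", "a", "a")]
batch = collect_online_batch(policy, sft_policy, reward_fn, offline, n=5, horizon=4, rng=rng)
```

Each stored tuple holds the state, the action, the reward under the stand-in reward model and the log ratio between the current policy and the SFT policy, which is the per-step quantity used for the KL penalty.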

Once the online data is collected, we feed it to a policy optimization oracle PO for a

policy update. A PO oracle can be a PG, NPG, or PPO style update. To be more specific,

for a PPO style update procedure, we use $\mathcal{D}_{on}$ to fit a critic for advantage estimation

$\hat{A}(s, a)$ (see footnote 3) (e.g., via generalized advantage estimation used in PPO), and then update the

policy on $\mathcal{D}_{on}$ with the clipping trick: $\pi^{t+1} \Leftarrow \arg\max_{\pi} \sum_{s,a \in \mathcal{D}_{on}} \mathrm{Clip}\left(\frac{\pi(a|s)}{\pi^t(a|s)}\right) \hat{A}(s, a)$. This

3 When using the KL penalty, this advantage function measures the advantage under the KL-regularized reward
$\hat{r} - \lambda \mathrm{KL}$, with $\lambda \in \mathbb{R}^+$ as the coefficient for the KL penalty.



is the policy update that we use in our experiments. In our theory, we use NPG as the

PO oracle. While PPO and NPG are different when it comes to exact implementation,

PPO can be understood as a heuristic that approximates NPG for the purpose of being

more scalable for large-scale optimization (e.g., the clipping trick induced by PPO is

approximately trying to ensure that the new policy does not deviate too much from the

old one – a key property that NPG methods advocated for (Kakade, 2001a; Kakade and

Langford, 2002a; Bagnell and Schneider, 2003; Schulman et al., 2015a)).
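The Clip operator in the PPO style update is written in shorthand. As an illustration, the snippet below implements the standard clipped surrogate of Schulman et al. (2017b) for a single state-action pair, min(ρ Â, clip(ρ, 1 − ε, 1 + ε) Â), where ρ is the probability ratio between the new and old policy. The function name and the clipping range ε = 0.2 are assumptions made for this sketch and are not values reported in this chapter.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample.

    ratio     : pi(a|s) / pi_t(a|s), the new-to-old policy probability ratio
    advantage : advantage estimate for (s, a)
    eps       : clipping range (assumed value, for illustration)
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum removes the incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


# With a positive advantage, increasing the ratio beyond 1 + eps yields no extra gain.
gain_inside = clipped_surrogate(1.1, 1.0)
gain_outside = clipped_surrogate(1.5, 1.0)
```

Because the objective stops improving once the ratio leaves the clip range in the favorable direction, the update keeps the new policy close to the old one, which is the property shared with NPG noted above.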

6.4 Theoretical Analysis

In this section, we analyze DR-PO (Alg. 9) by instantiating the policy optimization

oracle PO to be a Natural Policy Gradient (NPG) oracle. For completeness, we describe

PO in Algorithm 10, which at a high level consists of policy evaluation via least squares

regression, followed by a policy update via a Mirror Descent style procedure. We leave the

detailed full description of the algorithm in Appendix E.1.

In Alg. 10, we use the online data to fit a Q function estimate of the current policy

$\pi^t$. Note that here we do not use the KL penalty $\ln(\pi^t(a_{h'}|s_{h'})/\pi_{\mathrm{SFT}}(a_{h'}|s_{h'}))$ directly

when calculating the trajectory total reward. In Appendix E.3, we provide a version of

NPG which includes the KL penalty when calculating the trajectory’s total reward and

corresponding analysis. Once we learn the critic, we perform policy update via running

KL-based Mirror Descent. Note that this step has a closed-form expression for $\pi^{t+1}$:

$$\pi^{t+1}(a|s) \propto \left(\pi_{\mathrm{SFT}}(a|s)\right)^{\frac{\eta\lambda}{\eta\lambda+1}} \cdot \left(\pi^t(a|s)\right)^{\frac{1}{\eta\lambda+1}} \cdot \exp\left(\frac{\eta}{\eta\lambda+1} \cdot Q(s, a)\right).$$

Note that the KL penalty to $\pi_{\mathrm{SFT}}$ in the policy update procedure is important to ensure

that $\pi^{t+1}$ does not deviate too much from $\pi_{\mathrm{SFT}}$. Also this type of update ensures that the

support of $\pi^t(\cdot|s)$ is always a subset of the support of $\pi_{\mathrm{SFT}}(\cdot|s)$ for all states $s$.
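The closed-form update can be checked numerically at a single state with a small action set. The sketch below is a minimal illustration: the dictionary-based representation, the function name and the numeric values are assumptions made for the example, and it assumes λ > 0 so that actions outside the support of the SFT policy receive zero probability.

```python
import math


def npg_update(pi_t, pi_sft, q, eta, lam):
    """Closed-form KL-regularized mirror descent step at a single state.

    pi_t, pi_sft : dicts mapping action -> probability
    q            : dict mapping action -> critic value Q(s, a)
    eta, lam     : learning rate and KL regularization coefficient (lam > 0 assumed)
    """
    denom = eta * lam + 1.0
    weights = {}
    for a in pi_t:
        if pi_sft[a] == 0.0 or pi_t[a] == 0.0:
            weights[a] = 0.0  # actions outside the support keep zero probability
            continue
        log_w = (eta * lam * math.log(pi_sft[a]) + math.log(pi_t[a]) + eta * q[a]) / denom
        weights[a] = math.exp(log_w)
    z = sum(weights.values())  # normalization constant hidden by the proportionality sign
    return {a: w / z for a, w in weights.items()}


# Hypothetical example: uniform policies over two actions, critic favoring action "a".
uniform = {"a": 0.5, "b": 0.5}
new_pi = npg_update(uniform, uniform, {"a": 1.0, "b": 0.0}, eta=1.0, lam=1.0)

# An action with zero SFT probability stays at zero even with a very large critic value.
masked = npg_update(
    {"a": 0.5, "b": 0.5, "c": 0.0},
    {"a": 0.5, "b": 0.5, "c": 0.0},
    {"a": 0.0, "b": 0.0, "c": 100.0},
    eta=1.0,
    lam=1.0,
)
```

In the first example the weights are proportional to exp(Q/2), so the updated probability of action "a" equals the sigmoid of 0.5; the second example illustrates the support property stated above.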



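Algorithm 10 NPG update for the PO oracle in Alg. 9
1: Input: Online dataset $\mathcal{D}_{on}$, the previous policy $\pi^t$, Q function class $\mathcal{F}$, regularization parameter $\lambda$, learning rate $\eta$
2: Create an empty regression dataset $\mathcal{D}$.
3: for each (partial) trajectory $\tau$ in $\mathcal{D}_{on}$ do
4:     Take the first state-action pair $(s_h, a_h)$ in $\tau$ and calculate the total reward $y = \sum_{h'=h}^{H} \hat{r}(s_{h'}, a_{h'})$
5:     Add $((s_h, a_h), y)$ to $\mathcal{D}$
6: end for
7: Learn critics:

$$Q = \arg\min_{f \in \mathcal{F}} \frac{1}{|\mathcal{D}|} \sum_{(s,a,y) \in \mathcal{D}} \left[ (f(s, a) - y)^2 \right].$$

8: Policy update:

$$\pi^{t+1}(s) = \arg\min_{p \in \Delta(\mathcal{A})} \langle -Q(s, \cdot), p \rangle + \lambda \mathrm{KL}(p \| \pi_{\mathrm{SFT}}(s)) + \frac{1}{\eta} \mathrm{KL}(p \| \pi^t(s)), \quad \forall s.$$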

Remark 20. Tiapkin et al. (2023) also investigates the theoretical guarantee of KL

divergence regularization in RLHF. However, there are two key differences between

our work and theirs. First, they consider a behavior cloning setting where there exists

an expert demonstration dataset from a near optimal expert, while our offline dataset

and the supervised finetuned policy πSFT can be quite sub-optimal (we only need πSFT

to cover the target policy π⋆ in the subsequent analysis). Second, they implement a

UCBVI-type algorithm for policy alignment, which requires optimistic planning and is

limited to tabular MDPs and linear MDPs only. Designing computationally tractable

UCB-style algorithms beyond tabular or linear models is more challenging. In contrast,

our algorithm combines offline data and online data and modifies NPG by resetting it to

the offline dataset. While resetting is an assumption, it naturally holds in applications

such as fine-tuning LLMs. We also believe the idea of dataset reset can be used in the

setting from Tiapkin et al. (2023) to extend their results beyond tabular or linear models.

Remark 21. Though we mainly focus on the settings where we can reset, when resetting

is not possible (e.g., real robotics applications), we can implement the reset by a roll-in



and roll-out procedure since we have access to πSFT: we roll-in πSFT to some sh, and

then continue by rolling out our policy that is being optimized. This procedure is closely

related to the PPO++ algorithm proposed in Chang et al. (2023), where the authors

empirically demonstrated that it outperforms vanilla PPO on some RLHF benchmarks

(but without a detailed theoretical investigation). When resetting is available, by directly

resetting to the offline data generated by πSFT, we further reduce computation.

6.4.1 Theoretical Sample Complexity

Now we introduce the required assumptions in our analysis.

Function classes. We first assume that the reward function class and Q function class

are realizable and bounded:

Assumption 22 (reward function classes). Suppose that we have r⋆ ∈ R. In addition,

assume that 0 ≤ r(τ) ≤ rmax for all r ∈ R and trajectory τ.

Assumption 23 (Q function classes). Suppose that we have $Q^{\pi^t,\hat{r}} \in \mathcal{F}$ for all t ∈ [T]. In

addition, assume that 0 ≤ f (s, a) ≤ rmax for all f ∈ F , s ∈ S, a ∈ A.

Realizability is a standard assumption used in the theoretical analysis of supervised

learning. It is possible to extend our analysis to the setting where model-misspecification

exists, and we leave this extension as a future work.

Concentrability. Then we assume that πSFT can cover the comparator policy π⋆:

Assumption 24 (single-policy concentrability). Suppose that we have:

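$$\text{(1)}\quad \max_{\tau} \frac{d^{\pi^\star}(\tau)}{d^{\pi_{\mathrm{SFT}}}(\tau)} = C_{\mathrm{TR}} < \infty;$$

$$\text{(2)}\quad \max_{h\in[H],\, s\in\mathcal{S}_h,\, a\in\mathcal{A}} \frac{d^{\pi^\star}_h(s,a)}{d^{\pi_{\mathrm{SFT}}}_h(s,a)} = C_{\mathrm{ST}} < \infty.$$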

Note that in Assumption 24 we need πSFT to cover π⋆, both trajectory-wise and

state-action-wise. In particular, we always have CST ≤ CTR. Assuming trajectory-wise

covering is necessary in RLHF because the human feedback is also trajectory-wise, as

shown by the lower bounds in Zhan et al. (2023a). Intuitively, if the offline data only

covers low performance policies’ traces, then the learned reward model cannot guarantee

to recognize trajectories from a high performance policy during test time (because it has

never seen such things in training).
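To make the two coefficients in Assumption 24 concrete, the following sketch computes them on a toy problem. The problem itself (two steps, a single state per step, two actions) and the numbers in `pi_star` and `pi_sft` are assumptions chosen only for illustration; with one state per step, the state-action occupancy reduces to the per-step action probability.

```python
from itertools import product

# Toy 2-step problem with one state per step and two actions (an assumption
# for illustration). pi[h][a] is the probability of action a at step h.
pi_star = [{0: 0.9, 1: 0.1}, {0: 0.8, 1: 0.2}]
pi_sft = [{0: 0.5, 1: 0.5}, {0: 0.4, 1: 0.6}]

def traj_prob(pi, actions):
    """Probability of an action sequence under a policy (transitions are trivial here)."""
    p = 1.0
    for h, a in enumerate(actions):
        p *= pi[h][a]
    return p

# Trajectory-wise coefficient: max over trajectories of d^{pi*}(tau) / d^{pi_SFT}(tau).
C_TR = max(traj_prob(pi_star, acts) / traj_prob(pi_sft, acts)
           for acts in product([0, 1], repeat=2))

# State-action-wise coefficient: with a single state per step, d_h(s, a) = pi_h(a).
C_ST = max(pi_star[h][a] / pi_sft[h][a] for h in range(2) for a in [0, 1])
```

Here the trajectory-wise ratio compounds the per-step ratios (1.8 × 2.0 = 3.6), while the state-action-wise ratio is only the largest single-step ratio (2.0), which matches the ordering CST ≤ CTR noted above.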

Remark 25. We can indeed relax Assumption 24 by leveraging the information in R and

F , as shown in the discussion in Appendix E.2.

Under the above assumptions, we have the following theorem to characterize the

suboptimality of π̂ returned by Algorithm 16. Recall that $\kappa = 1 / \inf_{x \in [-r_{\max}, r_{\max}]} \Phi'(x)$ measures

the non-linearity of the link function Φ.

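Theorem 26. Suppose Assumptions 22, 23, and 24 hold. For any δ ∈ (0, 1], let

$$\epsilon_{\mathrm{MLE}} := \Theta\left(\sqrt{\frac{\kappa^2}{M}\log\frac{|\mathcal{R}|}{\delta}}\right), \qquad \epsilon_{\mathrm{eval}} := \Theta\left(\sqrt{\frac{r_{\max}^2}{N}\log\frac{T|\mathcal{F}|}{\delta}}\right),$$

and set

$$T = \frac{H^{2/5}\, r_{\max}^{4/5}}{\epsilon_{\mathrm{MLE}}^{4/5}} \wedge \frac{r_{\max}}{\epsilon_{\mathrm{eval}}}, \qquad \eta = \sqrt{\frac{1}{T r_{\max}^2}}, \qquad \lambda = \frac{T^{1/3}\, r_{\max}^{1/3}\, \epsilon_{\mathrm{MLE}}^{2/3}}{H^{1/3}},$$

then with probability at least 1 − δ, Algorithm 9 with NPG update (Algorithm 10)

returns a policy π̂ which satisfies

$$V^{\pi^\star, r^\star}(s_1) - V^{\hat{\pi}, r^\star}(s_1) \le H^{4/5}\, r_{\max}^{3/5}\, \epsilon_{\mathrm{MLE}}^{2/5} \log C_{\mathrm{ST}} + \sqrt{C_{\mathrm{TR}}}\,\epsilon_{\mathrm{MLE}} + H\sqrt{r_{\max} C_{\mathrm{ST}}\, \epsilon_{\mathrm{eval}}}.$$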

Theorem 26 indicates that the suboptimality of π̂ scales polynomially with 1/M and 1/N.

In particular, from Theorem 26, we can easily obtain the sample complexity of DR-PO,

as shown in the following corollary:

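Corollary 27. Suppose Assumptions 22, 23, and 24 hold and set

$$T = \frac{H^{2/5}\, r_{\max}^{4/5}}{\epsilon_{\mathrm{MLE}}^{4/5}} \wedge \frac{r_{\max}}{\epsilon_{\mathrm{eval}}}, \qquad \eta = \sqrt{\frac{1}{T r_{\max}^2}}, \qquad \lambda = \frac{T^{1/3}\, r_{\max}^{1/3}\, \epsilon_{\mathrm{MLE}}^{2/3}}{H^{1/3}},$$

then if we have

$$M = \Omega\left(\left(\frac{H^4 r_{\max}^3 \log^5 C_{\mathrm{ST}}}{\epsilon^5} + \frac{C_{\mathrm{TR}}}{\epsilon^2}\right)\kappa^2 \log\frac{|\mathcal{R}|}{\delta}\right),$$

$$N = \Omega\left(\frac{H^4 r_{\max}^4 C_{\mathrm{ST}}^2}{\epsilon^4}\log\frac{T|\mathcal{F}|}{\delta}\right),$$

we have with probability at least 1 − δ that Algorithm 9 with NPG update (Algorithm 10)

returns a policy π̂ which satisfies

$$V^{\pi^\star, r^\star}(s_1) - V^{\hat{\pi}, r^\star}(s_1) \le \epsilon.$$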

Theorem 26 and Corollary 27 indicate that DR-PO with NPG update is capable of

finding an ϵ-optimal policy with polynomial sample complexity, i.e., Õ(1/ϵ^5) labeled

trajectory pairs and Õ(1/ϵ^4) unlabeled trajectories under single-policy concentrability.

Algorithmically, our algorithm does not require pessimism and is model-free, which is

much easier and more practical than the pessimistic model-based algorithm proposed in

Zhan et al. (2023a).
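As a worked illustration of Corollary 27, the sketch below evaluates the expressions inside Ω(·) with the hidden constant set to 1, so the outputs indicate scaling only and are not usable sample counts. The function names are hypothetical, the iteration count T enters only through the logarithmic factor that is passed in as an argument, and the sigmoid link used for κ is an assumption made for the example.

```python
import math

def kappa_sigmoid(r_max):
    """kappa = 1 / inf_{x in [-r_max, r_max]} Phi'(x), assuming a sigmoid link Phi.

    Phi'(x) = Phi(x) * (1 - Phi(x)) is smallest at |x| = r_max.
    """
    p = 1.0 / (1.0 + math.exp(-r_max))
    return 1.0 / (p * (1.0 - p))

def sample_sizes(eps, H, r_max, C_ST, C_TR, log_R_over_delta, log_TF_over_delta, kappa):
    """Evaluate the expressions inside Omega(.) with the hidden constant set to 1."""
    M = (H**4 * r_max**3 * math.log(C_ST)**5 / eps**5 + C_TR / eps**2) \
        * kappa**2 * log_R_over_delta
    N = H**4 * r_max**4 * C_ST**2 / eps**4 * log_TF_over_delta
    return M, N

# Halving the target accuracy multiplies N by 2^4 and M by between 2^2 and 2^5.
M1, N1 = sample_sizes(0.1, 2, 1.0, 2.0, 4.0, 1.0, 1.0, kappa_sigmoid(1.0))
M2, N2 = sample_sizes(0.05, 2, 1.0, 2.0, 4.0, 1.0, 1.0, kappa_sigmoid(1.0))
```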

Remark 28 (Tighter bounds). In Appendix E.3, assuming the KL penalty does not blow

up throughout training, we further reduce the sample complexity of labeled trajectories

to Õ(1/ϵ^2) by directly including the KL penalty ln(πt(a|s)/πSFT(a|s)) into the reward.

Remark 29. In Theorem 26 and Corollary 27 we assume R and F are finite, but our

results can be extended to infinite classes directly by replacing |R| (|F|) with their covering

numbers.

6.5 Experiments

We empirically evaluate DR-PO’s ability to learn from dataset resets. First, we test how

well DR-PO is able to both efficiently optimize the reward score as well as minimize

the KL-divergence with the reference policy. We also test the generation quality of our

resulting policies in terms of Rouge (Lin, 2004) and win rate (Rafailov et al., 2023b)

against human references measured by GPT4 (Achiam et al., 2023). Next, we conduct an

ablation study, incrementally relaxing the proportion of dataset resets in our online

data collection to study how sensitive DR-PO is to this hyperparameter. Finally, we

investigate DR-PO’s performance when transferring to another summarization task such

as CNN/DailyMail (See et al., 2017). We find that collecting online generations with

dataset resets results in a policy with a better tradeoff between reward optimization

and KL-divergence, leading to improved generations over baseline RL algorithms, PPO

(Schulman et al., 2017b) and Direct Preference Optimization (DPO) (Rafailov et al.,

2023b).

Task We evaluated DR-PO on the TL;DR summarization dataset used in Stiennon et al.

(2020)⁴ and tested scaling performance on the Anthropic Helpful and Harmless (HH) task (Bai

et al., 2022b). For TL;DR, a model is trained to generate summaries of online Reddit

posts guided by human preference data. The task consists of two datasets: one with

human reference summaries and another with preference data. Following the standards

set by both Stiennon et al. (2020) and Rafailov et al. (2023b), we train our reward models

and DPO baseline on the preference dataset while performing online RL (for PPO and

DR-PO) on the human reference dataset. We set the maximum context length to be 512

and the maximum generation length to be 53, ensuring that it is possible to generate all

references in the dataset. For Anthropic HH, the model is asked to respond to a dialogue

sequence in a helpful, harmless manner. We follow many of the design choices from TRLx⁵

for dataset processing, context length, and generation length. For more details about the

dataset, please see Appendix E.6.

⁴ Dataset can be obtained from https://github.com/openai/summarize-from-feedback

Evaluation To test the performance of DR-PO against our baselines we evaluate each

method by its tradeoff between reward model score and KL-divergence with the reference

policy, testing the effectiveness of the algorithm in optimizing the regularized RLHF

objective. Furthermore, we compute the Rouge score and GPT4 win rate to evaluate

the generation quality of our resulting policies. Note for our win rate calculation, we

report the win rate of a randomly sampled subset (10%) of the test set for a total of

600 samples. Please see Appendix E.6.3 for the prompt used to query GPT4 as well

as an example response. When evaluating on CNN/DailyMail, we make use of the

constructed preference dataset from Stiennon et al. (2020) and for training a supervised

finetuned model, we use HuggingFace’s dataset version 2.0.0⁶.

Methods We instantiate DR-PO by using PPO style policy optimization (Schulman

et al., 2017b) as the policy optimizer (PO in Algorithm 9). First for TL;DR, we maintain

the same pretrained LLM and supervised finetuned model for all of our experiments.

For supervised finetuning, we trained a Pythia 2.8B⁷ (Biderman et al., 2023) parameter

model for 1 epoch over the dataset with human references as labels. Similarly for the

reward model, we trained a Pythia 2.8B parameter model for 1 epoch over the preference

labeled dataset. Then, for DPO, PPO, and DR-PO, we trained our policy and critic with

low rank adapters (LoRA) (Hu et al., 2022) on top of our supervised finetuned (SFT)

⁵ https://github.com/CarperAI/trlx
⁶ https://huggingface.co/datasets/cnn_dailymail
⁷ HuggingFace Model Card: EleutherAI/pythia-2.8b-deduped



TL;DR Summarization

Algorithms   Win Rate (↑)    RM Score (↑)    KL(π||πref) (↓)   Rouge 1 (↑)     Rouge 2 (↑)     RougeL (↑)
SFT          31.6 ± 0.2%     -0.51 ± 0.04    -                 32.17 ± 1.01    12.27 ± 0.67    24.87 ± 1.22
DPO          52.6 ± 0.4%     -               37.33 ± 2.01      30.03 ± 3.23    7.93 ± 1.02     22.05 ± 0.83
PPO          62.3 ± 2.5%     1.17 ± 0.13     16.32 ± 1.46      33.73 ± 2.34    11.97 ± 0.91    24.97 ± 1.03
DR-PO        70.2 ± 1.7%     1.52 ± 0.09     16.84 ± 0.83      33.68 ± 1.78    11.90 ± 0.06    25.12 ± 0.76

Table 6.1: TL;DR Summarization Results: Our RM Score is under our trained preference
reward model and the win rate is evaluated by GPT4. All evaluated policies except
for SFT are models with LoRA adapters. We present results across 3 seeds.

[Figure 6.1 plot: TL;DR Summarization. x-axis: KL(π||πref) (←); y-axis: RM Score (→); series: DR-PO, PPO, SFT, Reference.]

Figure 6.1: Reward vs KL-Divergence Frontier: Plotting the regularized optimization
tradeoff between DR-PO and our baselines over the entire test set. DR-PO is able to
achieve a much better tradeoff by learning higher reward generations with lower KL. The
average reference and SFT scores under the RM are shown as dashed lines.

model and our reward model (RM) respectively. Finally for our scaling experiments for

Anthropic HH, we trained Pythia 125M, 1B, and 6.9B parameter models for 1 epoch over

the HH dataset for both SFT and RM training. Please see Appendix E.6 for details.

6.5.1 How well can DR-PO optimize the RLHF objective?

Table 6.1 compares DR-PO against PPO, DPO, and supervised finetuning. The KL-

regularized reward optimization broadly used in RLHF as well as analyzed in Section 5.5

balances reward exploitation and deviation from a reference policy. When computing

the KL-divergence, we use our SFT policy as our reference policy for all our methods.

Notably, DR-PO scores a higher RM value over the test set than all baselines, with a

slightly larger KL discrepancy than PPO. We also see that with GPT4 win rate, DR-PO

achieves the highest preference over human references showcasing the benefit of learning

from resets. Figure 6.1 plots a more detailed frontier of the reward and KL tradeoff for

DR-PO and PPO. We generate this plot by binning the test scores according to KL. We

see that for most KL values, DR-PO is able to achieve a higher score than PPO.
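The two quantities behind Figure 6.1 can be sketched as follows. The text above does not specify the KL estimator or the bin width, so both are assumptions here: the KL for one generation is estimated from the per-token log-probabilities of the sampled tokens under the policy and under the reference policy, and test scores are grouped into fixed-width KL bins. The function names are hypothetical.

```python
def sequence_kl_estimate(logp_policy, logp_ref):
    """Single-sample estimate of KL(pi || pi_ref) for one generation sampled from pi.

    Inputs are per-token log-probabilities of the generated tokens under each model.
    """
    return sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))

def bin_scores_by_kl(kls, scores, bin_width):
    """Group RM scores into KL bins and return {bin_left_edge: mean score}."""
    bins = {}
    for kl, score in zip(kls, scores):
        left = (kl // bin_width) * bin_width
        bins.setdefault(left, []).append(score)
    return {left: sum(v) / len(v) for left, v in sorted(bins.items())}

# Toy usage with made-up numbers (not results from this thesis).
kl_one = sequence_kl_estimate([-1.0, -2.0], [-1.5, -2.5])
frontier = bin_scores_by_kl([1.0, 4.0, 12.0], [0.5, 1.5, 2.0], bin_width=10.0)
```

Averaging `sequence_kl_estimate` over many generations gives a Monte Carlo estimate of the sequence-level KL, and plotting the binned means for each method gives a frontier of the kind shown in the figure.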

6.5.2 Analysis of Dataset Reset Proportion

Algorithms          Win Rate (↑)   RM Score (↑)   KL(π||πref) (↓)
PPO                 60.7%          1.14           15.08
DR-PO (β = 0.25)    61.7%          1.28           14.77
DR-PO (β = 0.5)     66.5%          1.28           15.63
DR-PO (β = 0.75)    64.3%          1.25           14.32
DR-PO (β = 1.0)     68.5%          1.47           16.65

Table 6.2: DR-PO Ablation of Dataset Reset Proportion: Our RM Score is under our
trained preference reward model and the Win Rate is evaluated by GPT4. β represents
the proportion of online data generated from dataset resets with 1.0 being all generations
are from resets and 0.0 being PPO (i.e., always reset to initial prompts).

Next, we investigate how sensitive DR-PO is to the amount of dataset resets done

during online generation. We define β as the proportion of generations in a given online

batch of generations with dataset resets. More specifically, our main results are with

β = 1.0 which translates to all generations during online training of DR-PO starting from

a randomly sampled reset from the human references. Note that a β value of 0 recovers

the baseline PPO (i.e., all generations start from initial prompts). Table 6.2 shows the

expected RM score, KL, and win rate of DR-PO as we increase the mixing proportion

[Figure 6.2 plot: x-axis: KL(π||πref) (←); y-axis: RM Score (→); color bar: Proportion of Dataset Resets, from 0 to 1.]

Figure 6.2: Ablation of Dataset Reset: Plotting the RM score and KL-Divergence
tradeoff as a function of dataset reset proportion. Blue represents no mixing while red
represents every online generation starting from a reset.

from 0% (PPO) to 100% (DR-PO) after 2 epochs of training. Notably, even with a small

amount of dataset resets DR-PO is able to learn higher scoring generations with a lower

KL than PPO. Moreover, we see that DR-PO with any amount of reference resets leads

to higher win rates than PPO. Figure 6.2 plots the RM score/KL-divergence frontier of

our learned policies on the test set. Note that DR-PO is robust to the amount of dataset

resets in optimizing the regularized RLHF objective. Finally, supporting our analysis

from Section 5.5, DR-PO generally performs better the more online data we gather from

resets with a 100% reset proportion performing the best.
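The mixing proportion β studied in this ablation can be sketched as a per-generation coin flip over where the online generation starts. This is a minimal illustration, not the training code: the function name is hypothetical, contexts are represented as token lists, and drawing the reset point uniformly over prefixes of the human reference is an assumption.

```python
import random

def sample_start_context(prompt, reference, beta, rng):
    """Pick the context from which one online generation starts.

    With probability beta, reset to the prompt plus a random prefix of the
    human reference; otherwise start from the prompt alone (as in PPO).
    """
    if rng.random() < beta:
        cut = rng.randrange(len(reference) + 1)   # hypothetical: uniform prefix length
        return list(prompt) + list(reference[:cut])
    return list(prompt)

rng = random.Random(0)
prompt, reference = ["p1", "p2"], ["r1", "r2", "r3"]
starts = [sample_start_context(prompt, reference, 0.5, rng) for _ in range(8)]
```

Setting `beta=0.0` always returns the bare prompt, which recovers the PPO baseline, and `beta=1.0` always returns a reset, which corresponds to the main DR-PO results.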

6.5.3 DR-PO Transfer Performance

Finally, we investigate DR-PO’s ability to do zero-shot transfer to another summarization

task, ensuring that learning a policy by resetting from human references does not diminish

the generalization observed with PPO in Stiennon et al. (2020). Specifically, we investigate

whether leveraging human references on TL;DR has the unintended consequence of

overfitting to the specific dataset rather than learning more generally to summarize. For

our baselines, we test the zero-shot capabilities of both PPO and DPO as well as report

the performance of a supervised finetuned policy on CNN/DailyMail using the same base

CNN/DM Summarization

Algorithms      Win Rate (↑)   Rouge 1 (↑)   Rouge 2 (↑)   RougeL (↑)
SFT (CNN/DM)    10.5%          25.60         12.27         19.99
DPO             6.0%           20.71         9.47          15.70
PPO             8.5%           23.62         12.29         18.56
DR-PO           12.0%          29.53         15.36         22.88

Table 6.3: Zero-shot transfer to CNN/DM: the Win Rate is evaluated by GPT4.

model, Pythia 2.8B. Table 6.3 demonstrates DR-PO’s zero-shot capabilities, being the

only policy to outperform a supervised finetuned model on all metrics. Therefore, we

see that learning from resets not only improves RLHF on the training task but also the

zero-shot transfer performance to another summarization task.

6.5.4 DR-PO Scaling Performance on Anthropic HH

Figure 6.3 shows DR-PO’s performance across different model scales on the Anthropic HH

task. Specifically, we tested three model sizes: 125M, 1B, and 6.9B. We

trained the Pythia models (Biderman et al., 2023) using TRLx⁸. We see that DR-PO

has similar scaling improvements to PPO while still producing generations that are more

preferred than those from our baselines.

6.6 Conclusion

We present DR-PO, a provably efficient algorithm that exploits a generative model’s

ability to reset from offline data to enhance RL from preference-based feedback. Both in

⁸ https://github.com/CarperAI/trlx



[Figure 6.3 plot: Anthropic HH Scaling. x-axis: Model size (ticks at 10^8, 10^9, 10^10); y-axis: GPT4 Win Rate; series: SFT, DPO, PPO, DR-PO.]

Figure 6.3: Scaling on Anthropic HH: The GPT4 win rate of DR-PO when tested across
3 model scales: 125M, 1B, and 6.9B. Reported win rates are mean and std across 3 seeds.

theory and in practice, we demonstrate the effectiveness of incorporating dataset resets

into online RL. While in our experiments we specifically demonstrate dataset resets

on a PPO style policy optimizer, the idea of dataset reset is both general and simple to

implement into any online data collection component of other RLHF algorithms. We

leave it to exciting future work to test the full capabilities of dataset resets in other RLHF

methods.

CHAPTER 7

RL FOR CONSISTENCY MODELS: FASTER REWARD GUIDED

TEXT-TO-IMAGE GENERATION

Reinforcement learning (RL) has improved guided image generation with diffusion

models by directly optimizing rewards that capture image quality, aesthetics, and

instruction following capabilities. However, the resulting generative policies inherit the same

iterative sampling process of diffusion models that causes slow generation. To overcome

this limitation, consistency models proposed learning a new class of generative models

that directly map noise to data, resulting in a model that can generate an image in as

few as one sampling iteration. In this work, to optimize text-to-image generative models

for task specific rewards and enable fast training and inference, we propose a

framework for fine-tuning consistency models via RL. Our framework, called Reinforcement

Learning for Consistency Model (RLCM), frames the iterative inference process of a

consistency model as an RL procedure. RLCM improves upon RL fine-tuned diffusion

models on text-to-image generation capabilities and trades computation during inference

time for sample quality. Experimentally, we show that RLCM can adapt text-to-image

consistency models to objectives that are challenging to express with prompting, such as

image compressibility, and those derived from human feedback, such as aesthetic quality.

Compared to RL finetuned diffusion models, RLCM trains significantly faster, improves

the quality of the generation measured under the reward objectives, and speeds up the

inference procedure by generating high quality images with as few as two inference steps.

Our code is available at https://rlcm.owenoertell.com.



[Figure 7.1 graphic. Rows: RLCM, DDPO. Panels: Train Time, Test Time. Axis: Inference Time (Seconds), ticks 1, 2, 3, 4, 20.]

Figure 7.1: Reinforcement Learning for Consistency Models (RLCM). We propose a
new framework for finetuning consistency models using RL. On the task of optimizing
aesthetic scores of a generated image, compared to a baseline which uses RL to fine-tune
diffusion models (DDPO), RLCM trains (left) and generates images (right) significantly
faster, with higher image quality measured under the aesthetic score. Images generated
with a batch size of 8 and RLCM horizon set to 8.

7.1 Introduction

Diffusion models have gained widespread recognition for their high performance in

various tasks, including drug design (Xu et al., 2022) and control (Janner et al., 2022).

In the text-to-image generation community, diffusion models have gained significant

popularity due to their ability to output realistic images via prompting. Despite their

success, diffusion models in text-to-image tasks face two key challenges. First, generating

the desired images can be difficult for downstream tasks whose goals are hard to specify

via prompting. Second, the slow inference speed of diffusion models poses a barrier,

making the iterative process of prompt tuning computationally intensive.

To enhance the generation alignment with specific prompts, diffusion model inference

can be framed as sequential decision-making processes, permitting the application of

reinforcement learning (RL) methods to image generation (Black et al., 2024; Fan et al.,

2023). The objective of RL-based diffusion training is to fine-tune a diffusion model to

directly maximize a reward function that corresponds to the desired property.

Diffusion models also suffer from slow inference since they must take many steps

to produce competitive results. This leads to slow inference time and even slower

training time. Even further, as a result of the number of steps we must take, the resultant

Markov decision process (MDP) possesses a long time horizon which can be hard for

RL algorithms to optimize. To resolve this, we look to consistency models. These models

directly map noise to data and typically require only a few steps to produce good looking

results. Although these models can be used for single step inference, to generate high

quality samples, there exists a few-step iterative inference process which we focus on.

Framing consistency model inference, instead of diffusion model inference, as an MDP

(as shown in Figure 7.2) admits a much shorter horizon. This enables faster RL training

and allows for generating high quality images with just few step inference.

More formally, we propose Reinforcement Learning for Consistency

Models (RLCM), a framework that models the inference procedure of a consistency

model as a multi-step Markov Decision Process, allowing one to fine-tune consistency

models toward a downstream task using just a reward function. Algorithmically, we

instantiate RLCM using a policy gradient algorithm as this allows for optimizing general

reward functions that may not be differentiable. In experiments, we compare to the

current more general method, DDPO (Black et al., 2024) which uses policy gradient

methods to optimize a diffusion model. In particular, we show that on an array of tasks

(compressibility, incompressibility, prompt image alignment, and LAION aesthetic score)

proposed by DDPO, RLCM outperforms DDPO in most tasks in training time, inference

time, and sample complexity (i.e., total reward of the learned policy versus number of

reward model queries used in training) (Section 7.5).

Our contributions in this work are as follows:

• In our experiments, we find that RLCM has faster training and faster inference

than existing methods.

• Further, in our experiments, RLCM enjoys better performance on most tasks

under the tested reward models than existing methods.

7.2 Related Works

Diffusion Models Diffusion models are a popular family of image generative models

which progressively map noise to data (Sohl-Dickstein et al., 2015). Such models

generate high quality images (Ramesh et al., 2021; Saharia et al., 2022) and videos (Ho

et al., 2022; Singer et al., 2022). Recent work with diffusion models has also shown

promising directions in harnessing their power for other types of data such as robot

trajectories and 3D shapes (Janner et al., 2022; Zhou et al., 2021). However, the iterative

inference procedure of progressively removing noise yields slow generation time.

Consistency Models Consistency models are another family of generative models

which directly map noise to data via the consistency function (Song et al., 2023). Such a

function provides faster inference generation as one can predict the image from randomly

generated noise in a single step. Consistency models also offer a more fine-tuned trade-off

between inference time and generation quality as one can run the multistep inference

process (Algorithm 17, in Appendix F.1) which is described in detail in Section 7.3.2.

Prior works have also focused on training the consistency function in latent space

(Luo et al., 2023) which allows for large, high quality text-to-image consistency model

generations. Sometimes, such generations are not aligned with the downstream task for which

they will be used. The remainder of this work will focus on aligning consistency models

to fit downstream preferences, given a reward function.

RL for Diffusion Models Popularized by Black et al. (2024); Fan et al. (2023), training

diffusion models with reinforcement learning requires treating the diffusion model

inference sequence as a Markov decision process. Then, by treating the score function

as the policy and updating it with a modified PPO algorithm (Schulman et al., 2017b),

one can learn a policy (which in this case is a diffusion model) that optimizes for a given

downstream reward. Further work on RL fine-tuning has looked into entropy regularized

control to avoid reward hacking and maintain high quality images (Uehara et al., 2024).

Another line of work uses deterministic policy gradient methods to directly optimize the

reward function when the reward function is differentiable (Prabhudesai et al., 2023).

Note that when the reward function is differentiable, we can instantiate a deterministic policy

gradient method in RLCM as well. We focus on REINFORCE (Williams, 1992) style

policy gradient methods for the purpose of optimizing black-box, non-differentiable

reward functions.

7.3 Preliminaries

7.3.1 Reinforcement Learning

We model our sequential decision process as a finite horizon Markov Decision Process

(MDP), M = (S, A, P, R, µ, H). In this tuple, we define our state space S, action space

A, transition function P : S × A → ∆(S), reward function R : S × A → R, initial

state distribution µ, and horizon H. At each timestep t, the agent observes a state st ∈ S,

takes an action according to the policy at ∼ π(at|st) and transitions to the next state

st+1 ∼ P(st+1|st, at). After H timesteps, the agent produces a trajectory as a sequence of

states and actions τ = (s0, a0, s1, a1, . . . , sH, aH). Our objective is to learn a policy π that

maximizes the expected cumulative reward over trajectories sampled from π.

$$J_{\mathrm{RL}}(\pi) = \mathbb{E}_{\tau \sim p(\cdot \mid \pi)}\left[\sum_{t=0}^{H} R(s_t, a_t)\right]$$



7.3.2 Diffusion and Consistency Models

Generative models are designed to match a model with the data distribution, such that

we can synthesize new data points at will by sampling from the distribution. Diffusion

models belong to a novel type of generative model that characterizes the probability

distribution using a score function rather than a density function. Specifically, it produces

data by gradually modifying the data distribution and subsequently generating samples

from noise through a sequential denoising step. More formally, we start with a distribution

of data pdata(x) and noise it according to the stochastic differential equation (SDE) (Song

et al., 2020b):

$$dx_t = \mu(x_t, t)\,dt + \sigma(t)\,dw_t$$

for a given t ∈ [0, T], fixed constant T > 0, and with the drift coefficient µ(·, ·), diffusion

coefficient σ(·), and $\{w_t\}_{t \in [0,T]}$ being a Brownian motion. Letting p0(x) = pdata(x) and

pt(x) be the marginal distribution at time t induced by the above SDE, as shown in

Song et al. (2020b), there exists an ODE (also called a probability flow) whose induced

distribution at time t is also pt(x). In particular:

$$dx_t = \left[\mu(x_t, t) - \frac{1}{2}\sigma(t)^2 \nabla \log p_t(x_t)\right] dt$$

The term ∇ log pt(xt) is also known as the score function (Song and Ermon, 2019;

Song et al., 2020b). When training a diffusion model in such a setting, one uses a

technique called score matching (Dinh et al., 2016; Vincent, 2011) in which one trains a

network to approximate the score function and then samples a trajectory with an ODE

solver. Once we learn such a neural network that approximates the score function, we

can generate images by integrating the above ODE backward in time from T to 0, with

xT ∼ pT which is typically a tractable distribution (e.g., Gaussian in most diffusion

model formulations).
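To illustrate sampling by integrating the probability flow ODE backward in time, the following sketch uses a one-dimensional toy problem in which the score is known in closed form, so no network is trained. The choices µ = 0 and σ(t) = √(2t), the Gaussian data distribution N(0, SIGMA_DATA²), and the plain Euler solver are all assumptions made for the example; under them, pt = N(0, SIGMA_DATA² + t²).

```python
import math

SIGMA_DATA = 0.5   # toy data distribution N(0, SIGMA_DATA^2) (assumption)

def score(x, t):
    """Exact score of p_t = N(0, SIGMA_DATA^2 + t^2)."""
    return -x / (SIGMA_DATA**2 + t**2)

def probability_flow_euler(x_T, T, n_steps):
    """Integrate dx/dt = mu - 0.5 * sigma(t)^2 * score backward from T to 0.

    Uses mu = 0 and sigma(t) = sqrt(2 t), so the drift is -t * score(x, t).
    """
    x, dt = x_T, T / n_steps
    for k in range(n_steps):
        t = T - k * dt
        x = x - dt * (-t * score(x, t))   # Euler step from t to t - dt
    return x

def probability_flow_exact(x_T, T):
    """Closed-form solution at time 0 for this Gaussian toy problem."""
    return x_T * SIGMA_DATA / math.sqrt(SIGMA_DATA**2 + T**2)

approx = probability_flow_euler(1.0, 5.0, 20000)
exact = probability_flow_exact(1.0, 5.0)
```

Even in this one-dimensional case the solver takes thousands of small steps to track the closed-form solution closely, which is the cost that motivates the consistency models discussed next.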

This technique is clearly bottlenecked by the fact that during generation, one must

run an ODE solver backward in time (from T to 0) for a large number of steps in order

to obtain competitive samples (Song et al., 2023). To alleviate this issue, Song et al.

(2023) proposed consistency models which aim to directly map noisy samples to data.

The goal becomes instead to learn a consistency function on a given probability flow.

The aim of this function is that any two samples xt and xt′ along the same trajectory of the

probability flow ODE, with t, t′ ∈ [ϵ, T], are mapped to the same image by the consistency function:

fθ(xt, t) = fθ(xt′, t′) = xϵ, where xϵ is the solution of the ODE at time ϵ. At a high level,

this consistency function is trained by taking two adjacent timesteps and minimizing the

consistency loss d(fθ(xt, t), fθ(xt′, t′)), under some image distance metric d(·, ·). To avoid

the trivial solution of a constant, we also set the initial condition to fθ(xϵ, ϵ) = xϵ.

Inference in consistency models After a model is trained, one can then trade inference

time for generation quality with the multi-step inference process given in Appendix F.1,

Algorithm 17. At a high level, the multistep consistency sampling algorithm first partitions

the probability flow into H + 1 points (T = τ0 > τ1 > τ2 > . . . > τH = ϵ). Given

a sample xT ∼ pT, it then applies the consistency function fθ at (xT, T), yielding x̂0.

To further improve the quality of x̂0, one can add noise back to x̂0 using the equation

$$\hat{x}_{\tau_n} \leftarrow \hat{x}_0 + \sqrt{\tau_n^2 - \tau_H^2}\, z,$$

where z ∼ N(0, I), and then again apply the consistency function to (x̂τn, τn), getting

x̂0. One can repeat this process for a few more steps until the quality of the generation

is satisfied. For the remainder of this work, we will be referring to sampling with the

multi-step procedure. We also provide more details when we introduce RLCM later.
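The multistep procedure described above can be sketched on a one-dimensional toy problem where the consistency function is available in closed form. This is an illustration under stated assumptions, not Algorithm 17 itself: the data distribution is N(0, SIGMA_DATA²), the noise schedule is σ(t) = √(2t), the initial sample is drawn from N(0, T²) as an approximation of pT, and the names `consistency_fn` and `multistep_sample` are hypothetical.

```python
import math
import random

SIGMA_DATA, EPS = 0.5, 0.002   # toy data std and smallest time epsilon (assumptions)

def consistency_fn(x, t):
    """Exact consistency function for data N(0, SIGMA_DATA^2) under sigma(t) = sqrt(2 t).

    Maps a point at time t on the probability flow ODE to its point at time EPS.
    """
    return x * math.sqrt((SIGMA_DATA**2 + EPS**2) / (SIGMA_DATA**2 + t**2))

def multistep_sample(f, taus, rng):
    """Multistep consistency sampling with T = taus[0] > ... > taus[-1] = EPS."""
    x = f(rng.gauss(0.0, taus[0]), taus[0])               # one-step sample from x_T
    for tau in taus[1:-1]:
        z = rng.gauss(0.0, 1.0)
        x_tau = x + math.sqrt(tau**2 - taus[-1]**2) * z   # noise back up to level tau
        x = f(x_tau, tau)                                 # denoise again
    return x

sample = multistep_sample(consistency_fn, [80.0, 20.0, 5.0, EPS], random.Random(0))
```

The toy `consistency_fn` satisfies both properties stated earlier: it is the identity at t = ϵ, and it returns the same value for any two points on the same ODE trajectory.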

7.3.3 Reinforcement Learning for Diffusion Models

Black et al. (2024) and Fan et al. (2023) formulated the training and fine-tuning of

conditional diffusion probabilistic models (Sohl-Dickstein et al., 2015; Ho et al., 2020)

as an MDP. Black et al. (2024) defined a class of algorithms, Denoising Diffusion Policy

Optimization (DDPO), that optimizes for arbitrary reward functions to improve guided

fine-tuning of diffusion models with RL.

Diffusion Model Denoising as MDP Conditional diffusion probabilistic models

condition on a context c (in the case of text-to-image generation, a prompt). As

introduced for DDPO, we map the iterative denoising procedure to the following MDP

M = (S,A, P,R, µ,H). Let r(s, c) be the task reward function. Also, note that the

probability flow proceeds from xT → x0. Let T = τ0 > τ1 > τ2 . . . > τH = ϵ be a

partition of the probability flow into intervals:

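$$
\begin{aligned}
s_t &:= (c, \tau_t, x_{\tau_t}), & \pi(a_t \mid s_t) &:= p_\theta(x_{\tau_{t+1}} \mid x_{\tau_t}, c), & P(s_{t+1} \mid s_t, a_t) &:= (\delta_c, \delta_{\tau_{t+1}}, \delta_{x_{\tau_{t+1}}}), \\
a_t &:= x_{\tau_{t+1}}, & \mu &:= (p(c), \delta_{\tau_0}, \mathcal{N}(0, I)), & R(s_t, a_t) &= \begin{cases} r(s_t, c) & \text{if } t = H \\ 0 & \text{otherwise} \end{cases}
\end{aligned}
$$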

where δy is the Dirac delta distribution with non-zero density at y. In other words, we

are mapping images to be states, and the prediction of the next state in the denoising flow

to be actions. Further, we can think of the deterministic dynamics as letting the next state

be the action selected by the policy. Finally, we can think of the reward for each state

being 0 until the end of the trajectory when we then evaluate the final image under the

task reward function.

This formulation permits the following loss term:

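$$\mathcal{L}_{\mathrm{DDPO}} = \mathbb{E}_{\mathcal{D}} \sum_{t=1}^{T}\left[\min\left\{ r(x_0, c)\,\frac{p_\theta(x_{t-1} \mid x_t, c)}{p_{\theta_{\mathrm{old}}}(x_{t-1} \mid x_t, c)},\; r(x_0, c)\,\mathrm{clip}\left(\frac{p_\theta(x_{t-1} \mid x_t, c)}{p_{\theta_{\mathrm{old}}}(x_{t-1} \mid x_t, c)},\, 1 - \epsilon,\, 1 + \epsilon\right)\right\}\right]$$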

where clipping is used to ensure that when we optimize pθ, the new policy stays close

to pθold , a trick popularized by the well known algorithm Proximal Policy Optimization

(PPO) (Schulman et al., 2017b).
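The clipped term inside the sum can be evaluated as in the sketch below. It only computes the expression for one trajectory from per-step log-probabilities and does not perform any gradient update; the function names and the default clipping range of 0.2 are assumptions made for the example.

```python
import math

def clipped_surrogate(reward, logp_new, logp_old, clip_eps):
    """Per-step clipped term: min{ r * ratio, r * clip(ratio, 1 - eps, 1 + eps) }."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(reward * ratio, reward * clipped)

def ddpo_objective(reward, logps_new, logps_old, clip_eps=0.2):
    """Sum the clipped terms over the denoising steps of one trajectory."""
    return sum(clipped_surrogate(reward, ln, lo, clip_eps)
               for ln, lo in zip(logps_new, logps_old))

# With a positive reward, a ratio of 2 is clipped to 1 + eps; with a negative
# reward, the min keeps the unclipped (more pessimistic) term.
pos = clipped_surrogate(1.0, math.log(2.0), 0.0, 0.2)
neg = clipped_surrogate(-1.0, math.log(2.0), 0.0, 0.2)
```

Taking the minimum of the clipped and unclipped terms means the objective never rewards moving the ratio further outside [1 − ϵ, 1 + ϵ], which is how the new policy is kept close to the old one.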

In diffusion models, horizon H is usually set as 50 or greater and time T is set as

1000. A small step size is chosen for the ODE solver to minimize error, ensuring the

generation of high-quality images as demonstrated by Ho et al. (2020). Due to the long

horizon and sparse rewards, training diffusion models using reinforcement learning can

be challenging.

7.4 Reinforcement Learning for Consistency Models

To remedy the long inference horizon that occurs during the MDP formulation of diffusion

models, we instead frame consistency models as an MDP. We let H also represent the

horizon of this MDP. Just as we do for DDPO, we partition the entire probability flow

([0,T ]) into segments, T = τ0 > τ1 > . . . > τH = ϵ. In this section, we denote t as the

discrete time step in the MDP, i.e., t ∈ {0, 1, . . . ,H}, and τt is the corresponding time

in the continuous time interval [0,T ]. We now present the consistency model MDP

formulation.

[Figure 7.2 diagram: Multi-Step Inference as MDP. Labels: Data, Noise, Consistency Models.]
Figure 7.2: Consistency Model As MDP: In this instance, H = 3. Here we first start
at a randomly sampled noised state s0 ∼ (N(0, I), δτ0, p(c)). We then follow the policy
by first plugging the state into the consistency model and then noising the image
back to τ1. This gives us a0 which, based on the transition dynamics, becomes s1. We
then transition from s1 by applying π(·), which applies the consistency function to x̂0 and
then noises up to τ2. We repeat this process until we reach timestep H. To calculate the
end of trajectory reward, we apply the consistency function one more time to get a final
approximation of x̂0 and apply the given reward function to this image.

Consistency Model Inference as MDP We reformulate the multi-step inference process

in a consistency model (Algorithm 17) as an MDP:

$$
\begin{aligned}
s_t &:= (x_{\tau_t}, \tau_t, c), & \pi(a_t \mid s_t) &:= f_\theta(x_{\tau_t}, \tau_t, c) + Z, & P(s_{t+1} \mid s_t, a_t) &:= (\delta_{x_{\tau_{t+1}}}, \delta_{\tau_{t+1}}, \delta_c), \\
a_t &:= x_{\tau_{t+1}}, & \mu &:= (\mathcal{N}(0, I), \delta_{\tau_0}, p(c)), & R_H(s_H) &= r(f_\theta(x_{\tau_H}, \tau_H, c), c)
\end{aligned}
$$

where $Z = \sqrt{\tau_t^2 - \tau_H^2}\, z$ is the noise from line 5 of Algorithm 17,

r(·, ·) is the reward function that we are using to align the model, and RH is the reward at

timestep H. At other timesteps, we let the reward be 0. We can visualize this conversion

from the multistep inference in Figure 7.2.
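The MDP above can be sketched as a short rollout loop. This is only an illustrative sketch under stated assumptions: images are replaced by scalars, `f_theta` and `reward_fn` are hypothetical stand-ins for the consistency function and the reward model, and the noise scale at each step uses the next time $\tau_{t+1}$ (so that the action lands at noise level $\tau_{t+1}$, as the caption of Figure 7.2 describes).

```python
import math
import random


def rollout_consistency_mdp(f_theta, reward_fn, taus, c, rng):
    """Roll out one trajectory of the consistency-model MDP for context c.

    taus is the decreasing schedule tau_0 > tau_1 > ... > tau_H = eps.
    Returns the list of (state, action) pairs and the terminal reward.
    """
    H = len(taus) - 1
    eps = taus[-1]
    x = rng.gauss(0.0, 1.0)  # initial sample from N(0, I); scalar toy "image"
    trajectory = []
    for t in range(H):
        state = (x, taus[t], c)
        x0_hat = f_theta(x, taus[t], c)  # consistency function predicts x_0
        # Assumption: re-noise to the next time tau_{t+1}.
        scale = math.sqrt(max(taus[t + 1] ** 2 - eps ** 2, 0.0))
        action = x0_hat + scale * rng.gauss(0.0, 1.0)
        trajectory.append((state, action))
        x = action  # deterministic transition: the action becomes the next x
    # Sparse reward: apply the consistency function once more at s_H.
    final_reward = reward_fn(f_theta(x, taus[H], c), c)
    return trajectory, final_reward
```

With H = 3 segments the loop produces three (state, action) pairs, and the only nonzero reward is the one computed from the final prediction.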

Modeling the MDP such that the policy is $\pi(s) := f_\theta(x_{\tau_t}, \tau_t, c) + Z$, instead of defining $\pi(\cdot)$ to be the consistency function itself, has a major benefit: it gives us a stochastic policy instead of a deterministic one. This allows us to use a form of clipped



Algorithm 11 Policy Gradient Version of RLCM
1: Input: Consistency model policy $\pi_\theta = f_\theta(\cdot, \cdot) + Z$, finetune horizon H, prompt set P, batch size b, inference pipeline P
2: for i = 1 to M do
3:     Sample b contexts from C, c ∼ C.
4:     $X_0 \leftarrow P(f_\theta, H, c)$ {where $X_0$ is the batch of images $x_0$}
5:     Normalize rewards $r(x_0, c)$ per context
6:     Split $X_0$ into k minibatches.
7:     for each minibatch do
8:         for t = 0 to H do
9:             Update θ using rule:
\[
\nabla_\theta \mathbb{E}_{\mathcal{D}} \sum_{t=1}^{T} \left[ \min\left\{ r(x_0, c) \cdot \frac{\pi_{\theta_{i+1}}(a_t \mid s_t)}{\pi_{\theta_i}(a_t \mid s_t)},\; r(x_0, c) \cdot \mathrm{clip}\left( \frac{\pi_{\theta_{i+1}}(a_t \mid s_t)}{\pi_{\theta_i}(a_t \mid s_t)},\, 1 - \epsilon,\, 1 + \epsilon \right) \right\} \right]
\]
10:         end for
11:     end for
12: end for
13: Output trained consistency model $f_\theta(\cdot, \cdot)$

importance sampling like Black et al. (2024) instead of a deterministic algorithm (e.g., DPG (Silver et al., 2014)), which we found to be unstable and which is, in general, biased. A policy is thus made up of two parts: the consistency function and the addition of Gaussian noise. The consistency function corresponds to the red arrows in Figure 7.2, whereas the noise corresponds to the green arrows. In other words, our policy is a Gaussian policy whose mean is modeled by the consistency function $f_\theta$ and whose covariance is $(\tau_t^2 - \epsilon^2) I$ (here I is the identity matrix). Notice that, in accordance with the sampling procedure in Algorithm 17, we only noise part of the trajectory. The final step of the trajectory is slightly different: to calculate the final reward, we transition by applying only the consistency function (red/yellow arrow) and obtain the final reward there.
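Because the policy is an isotropic Gaussian, its log-density is available in closed form, which is what makes a clipped importance-sampling objective computable. The following is a minimal sketch, not the thesis implementation: the function names are hypothetical, actions are plain Python lists, and `std` is assumed to be the square root of the covariance scale described above.

```python
import math


def gaussian_log_prob(action, mean, std):
    """Log-density of an isotropic Gaussian N(mean, std^2 I) at `action`."""
    d = len(action)
    sq_dist = sum((a - m) ** 2 for a, m in zip(action, mean))
    return (-0.5 * sq_dist / std ** 2
            - d * math.log(std)
            - 0.5 * d * math.log(2.0 * math.pi))


def clipped_surrogate(logp_new, logp_old, reward, clip_eps):
    """PPO-style clipped objective for one (state, action) pair.

    `reward` plays the role of the normalized r(x_0, c) in Algorithm 11.
    """
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(reward * ratio, reward * clipped_ratio)
```

Here `mean` would be the output of the consistency function for the current state; a deterministic policy has no such density, so the ratio in `clipped_surrogate` would be undefined.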

Policy Gradient RLCM We can then instantiate RLCM with a policy gradient optimizer, in the spirit of Black et al. (2024); Fan et al. (2023). Our algorithm is described in Algorithm 11. In practice, we normalize the reward per prompt. That is, we maintain a running mean and standard deviation for each prompt and use these as the normalizer instead of computing them per batch. This is because, under certain reward models, the average score can vary drastically from prompt to prompt.
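The per-prompt normalization can be sketched with a running-statistics tracker. This is an assumed illustration rather than the code used in the experiments: the class name, the use of Welford's online update, and the `min_std` floor are all choices made here for the example.

```python
import math
from collections import defaultdict


class PerPromptNormalizer:
    """Tracks a running mean and standard deviation of rewards per prompt."""

    def __init__(self, min_std=1e-6):
        self.count = defaultdict(int)
        self.mean = defaultdict(float)
        self.m2 = defaultdict(float)  # running sum of squared deviations
        self.min_std = min_std        # floor to avoid division by zero

    def update(self, prompt, reward):
        # Welford's online update for mean and variance.
        self.count[prompt] += 1
        delta = reward - self.mean[prompt]
        self.mean[prompt] += delta / self.count[prompt]
        self.m2[prompt] += delta * (reward - self.mean[prompt])

    def normalize(self, prompt, reward):
        n = self.count[prompt]
        std = math.sqrt(self.m2[prompt] / n) if n > 0 else 0.0
        return (reward - self.mean[prompt]) / max(std, self.min_std)
```

Two prompts whose raw scores differ by orders of magnitude then yield normalized rewards on the same scale, which a single per-batch normalizer would not guarantee.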

7.5 Experiments

In this section, we hope to investigate the performance and speed improvements of

training consistency models rather than diffusion models with reinforcement learning.

We compare our method to DDPO (Black et al., 2024), a state-of-the-art policy gradient

method for finetuning diffusion models. First, we test how well RLCM is able to

both efficiently optimize the reward score and maintain the qualitative integrity of the

pretrained generative model. We show both learning curves and representative qualitative

examples of the generated images on tasks defined by Black et al. (2024). Next we

show the speed and compute needs for both train and test time of each finetuned model

to test whether RLCM is able to maintain a consistency model’s benefit of having a

faster inference time. We then conduct an ablation study, incrementally decreasing the

inference horizon to study RLCM’s tradeoff for faster train/test time and reward score

maximization. Finally, we qualitatively evaluate RLCM’s ability to generalize to text prompts and subjects not seen during training to showcase that the RL finetuning procedure did not destroy the base pretrained model’s capabilities.

For a fair comparison, DDPO and RLCM finetune Dreamshaper v7¹ and its latent consistency model counterpart² (Luo et al., 2023), respectively. Dreamshaper v7 is a finetune of Stable Diffusion (Rombach et al., 2022). For DDPO, we used the same hyperparameters and source code³ (Black et al., 2024) provided by the authors. We found that the default parameters performed best when testing various hyperparameters. Please see Appendix F.2.2 for more details on the parameters we tested.

1 https://huggingface.co/Lykon/dreamshaper-7
2 https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7

Compression The goal of compression is to minimize the filesize of the image. Thus,

the reward received is equal to the negative of the filesize when compressed and saved as

a JPEG image. The highest rated images for this task are images of solid colors. The

prompt space consisted of 398 animal categories.

Figure 7.3: Qualitative Generations: Representative generations from the pretrained models, DDPO, and RLCM. Across all tasks, we see that RLCM does not compromise the image quality of the base model while being able to transform naturalistic images into stylized artwork that maximizes an aesthetic score, remove background content to maximize compression, and generate images of animals in fictional scenarios, such as riding a bike, to maximize prompt alignment.

3https://github.com/kvablack/ddpo-pytorch



[Figure 7.4 plot area. Panels: Compression (y-axis: Neg Filesize (kb)), Incompression (y-axis: Filesize (kb)), Aesthetic (y-axis: LAION Aesthetic), Prompt-Image Alignment (y-axis: LLaVA 13B). x-axis in every panel: Reward Queries. Legend: RLCM, DDPO.]

Figure 7.4: Learning Curves: Training curves for RLCM and DDPO by number
of reward queries on compressibility, incompressibility, aesthetic, and prompt image
alignment. We plot three random seeds for each algorithm and plot the mean and standard
deviation across those seeds. RLCM seems to produce either comparable or better reward
optimization performance across these tasks.

Incompression Incompression has the opposite goal of compression: to make the filesize as large as possible. The reward function here is just the filesize of the saved image. The highest rated images for this task are random noise. Similar to the compression task, this task’s prompt space consisted of 398 animal categories.
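The two filesize rewards are mirror images of each other, which a few lines of code make explicit. The sketch below is only an illustration: it uses `zlib` from the standard library as a stand-in encoder so that it runs without an image library, whereas the tasks in the text save the image as a JPEG. The function names and the `encode` parameter are hypothetical.

```python
import random
import zlib


def filesize_kb(image_bytes, encode=zlib.compress):
    """Size in kilobytes of the encoded image.

    Assumption: zlib stands in for JPEG encoding; pass a JPEG encoder as
    `encode` to match the task description.
    """
    return len(encode(image_bytes)) / 1000.0


def compression_reward(image_bytes, encode=zlib.compress):
    # Smaller files are better, so the reward is the negative filesize.
    return -filesize_kb(image_bytes, encode)


def incompression_reward(image_bytes, encode=zlib.compress):
    # Larger files are better, so the reward is the filesize itself.
    return filesize_kb(image_bytes, encode)


# Toy inputs: a solid-color "image" versus random noise.
rng = random.Random(0)
solid = bytes(4096)
noise = bytes(rng.getrandbits(8) for _ in range(4096))
```

Under this stand-in encoder the solid-color input scores higher on compression and the noise input scores higher on incompression, in line with the highest-rated images the text reports for each task.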

Aesthetic The aesthetic task is based on the LAION Aesthetic predictor (Schuhmann, 2022), which was trained on 176,000 human labels of the aesthetic quality of images. This aesthetic predictor is an MLP on top of CLIP embeddings (Radford et al., 2021). The images that produce the highest reward are typically artwork. This task has a smaller set of 45 animals as prompts.

Prompt Image Alignment We use the same task as Black et al. (2024), in which the goal is to align the prompt and the image more closely without human intervention. This is done by first querying a LLaVA model (Liu et al., 2023a) to describe what is going on in the image, and then computing the BERTScore (Zhang et al., 2019a) between that response and the original prompt to determine how similar they are. This value is then used as the reward for the policy gradient algorithm.

[Figure 7.5 plot area. Panels: Compression (y-axis: Neg Filesize (kb)), Incompression (y-axis: Filesize (kb)), Aesthetic (y-axis: LAION Aesthetic), Prompt-Image Alignment (y-axis: LLaVA 13B). x-axis in every panel: GPU Hours (A6000). Legend: RLCM, DDPO.]

Figure 7.5: Training Time: Plots of performance by runtime measured by GPU hours.
We report the runtime on four NVIDIA RTX A6000 across three random seeds and plot
the mean and standard deviation. We observe that in all tasks RLCM noticeably reduces
the training time while achieving comparable or better reward score performance.

[Figure 7.6 plot area. Panels: Compression (y-axis: Neg Filesize (kb)), Incompression (y-axis: Filesize (kb)), Aesthetic (y-axis: LAION Aesthetic), Prompt-Image Alignment (y-axis: LLaVA 13B). x-axis in every panel: Inference Time (sec). Legend: RLCM, DDPO.]

Figure 7.6: Inference Time: Plots showing the inference performance as a function of
time taken to generate. For each task, we evaluated the final checkpoint obtained after
training and measured the average score across 100 trajectories at a given time budget on
1 NVIDIA RTX A6000 GPU. We report the mean and std across three seeds for every
run. Note that for RLCM, we are able to achieve high scoring trajectories with a smaller
inference time budget than DDPO.

7.5.1 RLCM vs. DDPO Performance Comparisons

Following the sample complexity evaluation proposed in Black et al. (2024), we first

compare DDPO and RLCM by measuring how fast they can learn based on the number

of reward model queries. As shown in Figure 7.4, RLCM has better performance on

three out of four of our tested tasks. Note that for the prompt-to-image alignment task, the initial consistency model finetuned by RLCM has lower performance than the initial diffusion model trained by DDPO. RLCM is able to close the performance gap between the consistency and diffusion models through RL finetuning⁴. Figure 7.3 demonstrates that, similar to DDPO, RLCM is able to train its respective generative model to adapt to various styles with only a reward signal, without any additional data curation or supervised finetuning.

7.5.2 Train and Test Time Analysis

To show the training-speed advantage of RLCM, we compare it to DDPO in terms of training time in Figure 7.5. We find experimentally that RLCM has a significant advantage over DDPO in the number of GPU hours required to achieve similar performance. On all tested tasks, RLCM reaches the same or greater performance than DDPO, notably achieving a 17x speedup in training time on the Aesthetic task. This is most likely due to a combination of factors: the shorter horizon in RLCM leads to faster online data generation (rollouts in the RL training procedure) and faster policy optimization (e.g., fewer backpropagation passes for training the networks).

Figure 7.6 compares the inference time of RLCM and DDPO. For this experiment, we measured the average reward score obtained by a trajectory given a fixed time budget for inference. As with training, RLCM achieves a higher reward score in less time, demonstrating that RLCM retains the computational benefits of consistency models over diffusion models. Note that a full rollout with RLCM takes roughly a quarter of the time of a full rollout with DDPO.

4 It is possible that this performance difference on the compression and incompression tasks is due to the consistency model’s default image size being larger. However, in the prompt image alignment and aesthetic tasks, we resized the images before reward calculation.



7.5.3 Ablation of Inference Horizon for RLCM

[Figure 7.7 plot area. Left panel: Aesthetic Performance (x-axis: Reward Queries; y-axis: LAION Aesthetic; curves: H=8, H=4, H=2, DDPO). Right panel: Aesthetic Inference Speed (Inference Time (←) against # of Inference Steps, from 2 to 8).]

Figure 7.7: Inference time vs Generation Quality: We measure the performance of the policy gradient instantiation of RLCM on the aesthetic task at 3 different values for the number of inference steps (left) in addition to measuring the inference speed in seconds with varied horizons (right). We report the mean and std across three seeds.

We further explore the effect of finetuning a consistency model with different inference horizons. That is, we aimed to test RLCM’s sensitivity to H. As shown in Figure 7.7 (left), increasing the number of inference steps leads to a greater possible gain in the reward. However, Figure 7.7 (right) shows that this reward gain comes at the cost of slower inference time. This highlights the inference time vs generation quality tradeoff that becomes available by using RLCM. Nevertheless, regardless of the number of inference steps chosen, RLCM enjoys faster inference time than diffusion model based baselines.

7.5.4 Qualitative Effects on Generalization

We now test our trained models on new text prompts that do not appear in the training set. Specifically, we evaluated the models trained on the aesthetic task. As seen in Figure 7.8, which consists of images generated from prompts that are not in the training dataset, RL finetuning does not appear to harm the model’s ability to generalize. We see this by testing a series of prompts (“bike”, “fridge”, “waterfall”, and “tractor”) unseen during training.

7.6 Conclusion and Future Directions

We present RLCM, a fast and efficient RL framework to directly optimize a variety of

rewards to train consistency models. We empirically show that RLCM achieves better

performance than a diffusion model RL baseline, DDPO, on most tasks while enjoying

the fast train and inference time benefits of consistency models. Finally, we provide

qualitative results of the finetuned models and test their downstream generalization

capabilities.

There remain a few unexplored directions, which we leave to future work. In particular, the specific policy gradient method presented uses a sparse reward. It may be possible to use a dense reward by exploiting the property that a consistency model always predicts $x_0$. Another future direction is the possibility of creating a loss that further reinforces the consistency property, further improving the inference time capabilities of RLCM policies.

Figure 7.8: Prompt Generalization: We observe that RLCM is able to generalize to other prompts without substantial decrease in aesthetic quality. The prompts used to test generalization are “bike”, “fridge”, “waterfall”, and “tractor”.



CHAPTER 8

CONCLUSION

The thesis of this dissertation is that specific data sources demand the design of

specialized algorithms that empower IL and RL agents to become effective decision

making agents. I defended this thesis on two fronts: 1) observations-only, offline, and

off-policy data for IL (Part 1), and 2) RL algorithms for generative models (Part 2).

Following the structure of this thesis, in this chapter I will offer potential future directions

first for Imitation Learning and then for RL for generative models.

8.1 Imitation Learning

In this dissertation, I focused on designing effective imitation learning algorithms from

three different data types: 1) imitation learning from observations alone, 2) imitation

learning from offline data, and 3) imitation learning from off-policy data. In each setting

there is still exciting work to be done.

For imitation learning from observations, MobILE presented a model-based approach that differs from many of the model-free approaches pursued in recent literature. Given the advent of incredibly powerful world models (e.g., some of the generative models from Part 2), it would be interesting to investigate what strategic exploration looks like with more modern dynamics models such as diffusion models or vision transformers. Given the immense scale of modern dynamics models, it seems prohibitive and impractical to learn one of these models online, let alone an ensemble to capture notions of uncertainty and calibration. In the world of large pretrained models, what bonus design would capture the agent’s exploration on our specific task?

Similar to the imitation learning from observations setting, our model-based, offline IL presents the same interesting questions. Perhaps a more speculative question would be: given a sufficiently good world model, could we do most tasks offline? Beyond interesting questions with respect to the dynamics models, an exciting future direction would lie in the hybrid setting, where a hypothetical algorithm could use both offline and online data to learn even more effective imitation.

Another exciting future direction for all three settings presented in this dissertation is

in the multi-task setting. Similar to how we now have large pretrained dynamics models,

we also have an incredibly diverse and large dataset of expert data across many tasks.

Would it be possible to go beyond designing IL agents for specific tasks to more general

imitators that can learn a family of tasks under some notion of task abstraction? Some potential task abstractions could be families of tasks that share the same task dynamics (e.g., the dynamics of a 7-DoF robot arm) or have the same state abstractions (e.g., 1080p video streams).

8.2 Reinforcement Learning for Generative Models

In this dissertation, I mostly focused on investigating RL and IL algorithms across

different types of generative models, from Large Language Models to Consistency

Models. In this section, I would like to shift our focus to the reward modeling of human

intentions in the RLHF pipeline and challenge our existing modeling assumptions for

human alignment.

The first assumption is that preferences are aligned with RL’s notion of reward. In the

context of RL, rewards were designed for systems with clear definitions of optimality or

numerical representations of the behavioral goals of an agent. However, it is still an open question whether we can learn from preferences a reward model that captures this optimal solution. Currently, researchers have found cases where over-optimization of the reward model leads to worse alignment with human intentions, suggesting that there still exists a metric mismatch between preferences and LLM alignment. Is IL a viable methodology here to learn a better reward? Is the solution a combination of IL and preference data?

The second assumption is that a single, learned preference reward captures the human

intentions for a given task. That is, we can distill to a single numeric value what we

may prefer for a task like summarization. Even for general purpose instruction following

language models or general purpose diffusion models, we do RLHF finetuning with

a single preference reward model. This relies on a strong assumption that a single

model can capture the complicated, potentially contradictory preferences of the user base.

Instead can we efficiently learn multiple reward models and how would we optimize

them downstream with RL/IL?

8.3 Concluding Remarks

In conclusion, I take understanding how to do efficient IL and RL from diverse sources

of data to be of fundamental importance for learning decision making agents in the real

world. The analysis and investigations presented in this dissertation build on a long line

of research developing general purpose IL and RL algorithms. There is much work to be

done, but I am excited for the road ahead.



Part III

Appendix



APPENDIX A

MISSING PROOFS AND DETAILS IN CHAPTER 2

A.1 Analysis of Algorithm 2

We start by presenting the proof for the unified main result in Theorem 3. We then discuss

the bounds for special instances individually.

The following lemma shows that under Assumption 2, with $b_t(s, a) = H \min\{\sigma_t(s, a), 2\}$, we achieve optimism at all iterations.

Lemma 30 (Optimism). Assume Assumption 2 holds, and set $b_t(s, a) = H \min\{\sigma_t(s, a), 2\}$. For any state-wise cost function $f : \mathcal{S} \mapsto [0, 1]$, denote the bonus-enhanced cost as $\tilde{f}_t(s, a) := f(s) - b_t(s, a)$. For any policy π, we have the following optimism:

\[
V^{\pi}_{\hat{P}_t, \tilde{f}_t} \le V^{\pi}_{P, f}, \quad \forall t.
\]

Proof. In the proof, we drop the subscript t for notational simplicity. We consider a fixed function f and policy π. Let $\hat{V}^{\pi}$ denote the value function of π under $(\hat{P}, \tilde{f})$, and $V^{\pi}$ the value function under $(P, f)$.

Let us start from h = H, where we have $\hat{V}^{\pi}_H(s) = V^{\pi}_H(s) = 0$. Assume the inductive hypothesis holds at h + 1, i.e., for any s, a, we have $\hat{Q}^{\pi}_{h+1}(s, a) \le Q^{\pi}_{h+1}(s, a)$. Now let us move to h. We have:

\[
\begin{aligned}
\hat{Q}^{\pi}_h(s, a) - Q^{\pi}_h(s, a) &= \tilde{f}(s, a) + \mathbb{E}_{s' \sim \hat{P}(\cdot|s,a)} \hat{V}^{\pi}_{h+1}(s') - f(s) - \mathbb{E}_{s' \sim P(\cdot|s,a)} V^{\pi}_{h+1}(s') \\
&\le -H \min\{\sigma(s, a), 2\} + \mathbb{E}_{s' \sim \hat{P}(\cdot|s,a)} V^{\pi}_{h+1}(s') - \mathbb{E}_{s' \sim P(\cdot|s,a)} V^{\pi}_{h+1}(s') \\
&\le -H \min\{\sigma(s, a), 2\} + H \left\| \hat{P}(\cdot|s, a) - P(\cdot|s, a) \right\|_1 \\
&\le -H \min\{\sigma(s, a), 2\} + H \min\{\sigma(s, a), 2\} = 0,
\end{aligned}
\]

where the first inequality uses the inductive hypothesis at time step h + 1. Finally, note that $V^{\pi}_h(s) = \mathbb{E}_{a \sim \pi(s)} Q^{\pi}_h(s, a)$, which leads to $\hat{V}^{\pi}_h(s) \le V^{\pi}_h(s)$. This concludes the induction step. □
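The step from the difference of expectations to the $\ell_1$ norm in the proof above is an application of Hölder's inequality. As a sketch, assuming the value function satisfies $0 \le V^{\pi}_{h+1}(s') \le H$ (costs lie in $[0, 1]$ over at most H steps) and writing the expectation as a sum over a discrete state space for readability:

```latex
\[
\left| \mathbb{E}_{s' \sim \hat{P}(\cdot|s,a)} V^{\pi}_{h+1}(s') - \mathbb{E}_{s' \sim P(\cdot|s,a)} V^{\pi}_{h+1}(s') \right|
= \left| \sum_{s'} \left( \hat{P}(s'|s,a) - P(s'|s,a) \right) V^{\pi}_{h+1}(s') \right|
\le \left\| V^{\pi}_{h+1} \right\|_{\infty} \left\| \hat{P}(\cdot|s,a) - P(\cdot|s,a) \right\|_1
\le H \left\| \hat{P}(\cdot|s,a) - P(\cdot|s,a) \right\|_1 .
\]
```

The last line of the proof then replaces the $\ell_1$ error by $\min\{\sigma(s, a), 2\}$, which is how the calibration condition of Assumption 2 enters, so the bonus exactly cancels the model error.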

The next lemma concerns the statistical error from finite sample estimation of $\mathbb{E}_{s \sim d^{\pi^e}} f(s)$.

Lemma 31. Fix δ ∈ (0, 1). For all t, we have that with probability at least 1 − δ,

\[
\left| \mathbb{E}_{s \sim d^{\pi^e}} f(s) - \sum_{i=1}^{N} f(s^e_i)/N \right| \le 2 \sqrt{\frac{\ln\left(2 t^2 |\mathcal{F}| / \delta\right)}{N}}, \quad \forall f \in \mathcal{F}.
\]

Proof. For any t, we set the failure probability to be $6\delta/(t^2 \pi^2)$ at iteration t, where we abuse notation and point out that π = 3.14159.... Thus the total failure probability over all t ∈ N is at most δ. We then apply the classic Hoeffding inequality to bound $\mathbb{E}_{s \sim d^{\pi^e}} f(s) - \sum_{i=1}^{N} f(s^e_i)/N$, using the fact that f(s) ∈ [0, 1] for all s. We conclude the proof by taking a union bound over all f ∈ F. □
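The choice of per-iteration failure probability works because $\sum_{t \ge 1} 1/t^2 = \pi^2/6$. The short numerical sketch below checks this and compares a Hoeffding-plus-union-bound width against the bound stated in Lemma 31; the function names and the specific form of the Hoeffding width (two-sided, for variables in [0, 1]) are assumptions made for this illustration.

```python
import math


def total_failure_probability(delta, num_iterations):
    """Sum of the per-iteration failure probabilities 6*delta/(t^2 * pi^2)."""
    return sum(6.0 * delta / (t ** 2 * math.pi ** 2)
               for t in range(1, num_iterations + 1))


def hoeffding_width(t, num_functions, num_samples, delta):
    """Two-sided Hoeffding width for [0, 1]-valued f, union over the class."""
    delta_t = 6.0 * delta / (t ** 2 * math.pi ** 2)
    return math.sqrt(math.log(2.0 * num_functions / delta_t) / (2.0 * num_samples))


def lemma_bound(t, num_functions, num_samples, delta):
    """Right-hand side of Lemma 31."""
    return 2.0 * math.sqrt(math.log(2.0 * t ** 2 * num_functions / delta) / num_samples)
```

The partial sums approach δ from below, and for the parameter ranges tried here the Hoeffding width sits below the lemma's (looser) constant.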

Note that here we have assumed that $s^e_i \sim d^{\pi^e}$ is sampled i.i.d. from $d^{\pi^e}$. This can easily be achieved by randomly sampling a state from each expert trajectory. Note that we can easily deal with i.i.d. trajectories, i.e., if our expert data contains N i.i.d. trajectories $\{\tau_1, \dots, \tau_N\}$, we can apply concentration on the trajectory level, and get:

\[
\left| \mathbb{E}_{\tau \sim \pi^e} \left[ \sum_{h=0}^{H-1} f(s_h) \right] - \frac{1}{N} \sum_{i=1}^{N} \sum_{h=0}^{H-1} f(s^i_h) \right| \le O\left( H \sqrt{\frac{\ln(t^2 |\mathcal{F}| / \delta)}{N}} \right),
\]

where τ ∼ π denotes that a trajectory τ is sampled based on π, and $s^i_h$ denotes the state at time step h on the i-th expert trajectory. Also note that we have $\mathbb{E}_{s \sim d^{\pi}} f(s) = \frac{1}{H} \mathbb{E}_{\tau \sim \pi} \left[ \sum_{h=0}^{H-1} f(s_h) \right]$ for any π, f. Together this immediately implies that:

\[
\left| \mathbb{E}_{s \sim d^{\pi^e}} f(s) - \frac{1}{NH} \sum_{i=1}^{N} \sum_{h=0}^{H-1} f(s^i_h) \right| \le O\left( \sqrt{\frac{\ln(t^2 |\mathcal{F}| / \delta)}{N}} \right),
\]

which matches the bound in Lemma 31.

Now we conclude the proof for Theorem 3.

Proof of Theorem 3. Assume that Assumption 2 and the event in Lemma 31 hold. Denote the intersection of these two events as E, and its complement as $\bar{E}$. Note that the probability of $\bar{E}$ is at most 2δ. For notational simplicity, denote $\epsilon_{\mathrm{stats}} = 2\sqrt{\ln(2T^2 |\mathcal{F}| / \delta) / N}$.

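In each model-based planning phase, recall that we perform model-based optimization on the following objective:

\[
\pi_t = \operatorname*{argmin}_{\pi \in \Pi} \max_{f \in \mathcal{F}} \left[ \mathbb{E}_{s,a \sim d^{\pi}_{\hat{P}_t}} \left[ f(s) - b_t(s, a) \right] - \sum_{i=1}^{N} f(s^e_i)/N \right].
\]

Note that for any π, using the inequality in Lemma 31, we have:

\[
\begin{aligned}
&\max_{f \in \mathcal{F}} \left[ \mathbb{E}_{s,a \sim d^{\pi}_{\hat{P}_t}} \left( f(s) - b_t(s, a) \right) - \sum_{i=1}^{N} f(s^e_i)/N \right] \\
&\quad = \max_{f \in \mathcal{F}} \left[ \mathbb{E}_{s,a \sim d^{\pi}_{\hat{P}_t}} \left( f(s) - b_t(s, a) \right) - \mathbb{E}_{s \sim d^{\pi^e}} f(s) + \mathbb{E}_{s \sim d^{\pi^e}} f(s) - \sum_{i=1}^{N} f(s^e_i)/N \right] \\
&\quad \le \max_{f \in \mathcal{F}} \left[ \mathbb{E}_{s,a \sim d^{\pi}_{\hat{P}_t}} \left( f(s) - b_t(s, a) \right) - \mathbb{E}_{s \sim d^{\pi^e}} f(s) \right] + \max_{f \in \mathcal{F}} \left[ \mathbb{E}_{s \sim d^{\pi^e}} f(s) - \sum_{i=1}^{N} f(s^e_i)/N \right] \\
&\quad \le \max_{f \in \mathcal{F}} \left[ \mathbb{E}_{s,a \sim d^{\pi}_{\hat{P}_t}} \left( f(s) - b_t(s, a) \right) - \mathbb{E}_{s,a \sim d^{\pi^e}_{\hat{P}_t}} \left( f(s) - b_t(s, a) \right) \right] + \epsilon_{\mathrm{stats}},
\end{aligned}
\]

where in the last inequality we use optimism from Lemma 30, i.e., $\mathbb{E}_{s,a \sim d^{\pi^e}_{\hat{P}_t}} \left( f(s) - b_t(s, a) \right) \le \mathbb{E}_{s \sim d^{\pi^e}} f(s)$.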

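Hence, for $\pi_t$, since it is the minimizer and $\pi^e \in \Pi$, we must have:

\[
\begin{aligned}
&\max_{f \in \mathcal{F}} \left[ \mathbb{E}_{s,a \sim d^{\pi_t}_{\hat{P}_t}} \left( f(s) - b_t(s, a) \right) - \sum_{i=1}^{N} f(s^e_i)/N \right] \\
&\quad \le \max_{f \in \mathcal{F}} \left[ \mathbb{E}_{s,a \sim d^{\pi^e}_{\hat{P}_t}} \left( f(s) - b_t(s, a) \right) - \sum_{i=1}^{N} f(s^e_i)/N \right] \\
&\quad \le \max_{f \in \mathcal{F}} \left[ \mathbb{E}_{s,a \sim d^{\pi^e}_{\hat{P}_t}} \left( f(s) - b_t(s, a) \right) - \mathbb{E}_{s,a \sim d^{\pi^e}_{\hat{P}_t}} \left( f(s) - b_t(s, a) \right) \right] + \epsilon_{\mathrm{stats}} = \epsilon_{\mathrm{stats}}.
\end{aligned}
\]

Since F contains c, we must have:

\[
\mathbb{E}_{s,a \sim d^{\pi_t}_{\hat{P}_t}} \left[ c(s) - b_t(s, a) \right] \le \sum_{i=1}^{N} c(s^e_i)/N + \epsilon_{\mathrm{stats}} \le \mathbb{E}_{s \sim d^{\pi^e}} c(s) + 2\epsilon_{\mathrm{stats}},
\]

which means that $V^{\pi_t}_{\hat{P}_t; \tilde{c}_t} \le V^{\pi^e} + 2H\epsilon_{\mathrm{stats}}$.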

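Now we compute the regret in episode t. First recall that $b_t(s, a) = H \min\{\sigma_t(s, a), 2\}$, which means that $\|b_t\|_\infty \le 2H$. Since $\|c\|_\infty \le 1$, this gives $\|c - b_t\|_\infty \le 2H$. Thus, $\left\| V^{\pi}_{\hat{P}; c - b_t} \right\|_\infty \le 2H^2$. Recalling the simulation lemma (Lemma 40), we have:

\[
\begin{aligned}
V^{\pi_t} - V^{\pi^e} &\le V^{\pi_t} - V^{\pi_t}_{\hat{P}_t; \tilde{c}_t} + 2H\epsilon_{\mathrm{stats}} \\
&\le H \mathbb{E}_{s,a \sim d^{\pi_t}} \left[ \left| \tilde{c}_t(s, a) - c(s) \right| + 2H^2 \left\| \hat{P}_t(\cdot|s, a) - P^\star(\cdot|s, a) \right\|_1 \right] + 2H\epsilon_{\mathrm{stats}} \\
&= H \mathbb{E}_{s,a \sim d^{\pi_t}} \left[ H \min\{\sigma_t(s, a), 2\} + 2H^2 \left\| \hat{P}_t(\cdot|s, a) - P^\star(\cdot|s, a) \right\|_1 \right] + 2H\epsilon_{\mathrm{stats}} \\
&\le H \mathbb{E}_{s,a \sim d^{\pi_t}} \left[ H \min\{\sigma_t(s, a), 2\} + 2H^2 \min\{\sigma_t(s, a), 2\} \right] + 2H\epsilon_{\mathrm{stats}} \\
&\le 3H^3 \mathbb{E}_{s,a \sim d^{\pi_t}} \min\{\sigma_t(s, a), 2\} + 2H\epsilon_{\mathrm{stats}} \\
&\le 6H^3 \mathbb{E}_{s,a \sim d^{\pi_t}} \min\{\sigma_t(s, a), 1\} + 2H\epsilon_{\mathrm{stats}}.
\end{aligned}
\]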

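Now sum over t, and denote by $\mathbb{E}_{\pi_t}$ the conditional expectation conditioned on the history from iteration 0 to t − 1. We get:

\[
\begin{aligned}
\sum_{t=0}^{T-1} \left[ V^{\pi_t} - V^{\pi^e} \right] &\le 6H^2 \sum_{t=0}^{T-1} \mathbb{E}_{\pi_t} \left[ \sum_{h=0}^{H-1} \min\{\sigma_t(s^t_h, a^t_h), 1\} \right] + 2HT\epsilon_{\mathrm{stats}} \\
&\le 6H^2 \sum_{t=0}^{T-1} \sqrt{H} \sqrt{ \mathbb{E}_{\pi_t} \left[ \sum_{h=0}^{H-1} \min\{\sigma^2_t(s^t_h, a^t_h), 1\} \right] } + 2HT\epsilon_{\mathrm{stats}},
\end{aligned}
\]

where in the last inequality we use $\mathbb{E}[a^\top b] \le \sqrt{\mathbb{E}[\|a\|_2^2]\, \mathbb{E}[\|b\|_2^2]}$.

Recall that the $\pi_t$ are random quantities. Taking expectations on both sides of the above inequality, and considering both the case where E holds and the case where $\bar{E}$ holds, we have:

\[
\begin{aligned}
\mathbb{E} \left[ \sum_{t=0}^{T-1} \left( V^{\pi_t} - V^{\pi^e} \right) \right] &\le 6H^{2.5}\, \mathbb{E} \left[ \sum_{t=0}^{T-1} \sqrt{ \mathbb{E}_{\pi_t} \left[ \sum_{h=0}^{H-1} \min\left\{ \sigma^2_t(s^t_h, a^t_h), 1 \right\} \right] } \right] + 2HT\epsilon_{\mathrm{stats}} + \mathbb{P}(\bar{E})\, TH \\
&\le 6H^{2.5} \sqrt{T} \sqrt{ \mathbb{E} \left[ \sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \min\left\{ \sigma^2_t(s^t_h, a^t_h), 1 \right\} \right] } + 2HT\epsilon_{\mathrm{stats}} + 2\delta TH,
\end{aligned}
\]

where in the last inequality, we use $\mathbb{E}[a^\top b] \le \sqrt{\mathbb{E}[\|a\|_2^2]\, \mathbb{E}[\|b\|_2^2]}$. This implies that:

\[
\mathbb{E} \left[ \min_t V^{\pi_t} - V^{\pi^e} \right] \le \frac{6H^{2.5}}{\sqrt{T}} \sqrt{ \max_{\mathrm{Alg}} \mathbb{E}_{\mathrm{Alg}} \left[ \sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \min\left\{ \sigma^2_t(s^t_h, a^t_h), 1 \right\} \right] } + 2H\epsilon_{\mathrm{stats}} + 2H\delta.
\]

Setting δ = 1/(HT), we get:

\[
\mathbb{E} \left[ V^{\pi} - V^{\pi^e} \right] \le \frac{6H^{2.5}}{\sqrt{T}} \sqrt{ \max_{\mathrm{Alg}} \mathbb{E}_{\mathrm{Alg}} \left[ \sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \min\left\{ \sigma^2_t(s^t_h, a^t_h), 1 \right\} \right] } + 2H \sqrt{\frac{\ln(T^3 H |\mathcal{F}|)}{N}} + \frac{2}{T},
\]

where Alg is any adaptive mapping that maps the history from t = 0 to the end of iteration t − 1 to some policy $\pi_t$. This concludes the proof. □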

Below we discuss special cases.



A.1.1 Discrete MDPs

Proposition 32 (Discrete MDP Bonus). Fix δ ∈ (0, 1). With probability at least 1 − δ, for all t ∈ N, we have:

\[
\left\| \hat{P}_t(\cdot|s, a) - P^\star(\cdot|s, a) \right\|_1 \le \min\left\{ \sqrt{\frac{S \ln(t^2 S A / \delta)}{N_t(s, a)}},\; 2 \right\}.
\]

Proof. The proof simply uses the concentration result for $\hat{P}_t$ under the $\ell_1$ norm. For a fixed t and (s, a) pair, using Lemma 6.2 in Agarwal et al. (2019), we have that with probability at least 1 − δ,

\[
\left\| \hat{P}_t(\cdot|s, a) - P^\star(\cdot|s, a) \right\|_1 \le \sqrt{\frac{S \ln(1/\delta)}{N_t(s, a)}}.
\]

Applying a union bound over all iterations and all (s, a) pairs, we conclude the proof. □
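To make the tabular bound concrete, the sketch below evaluates the right-hand side of Proposition 32 from a visitation count and scales it by H as in the bonus $b_t(s, a) = H \min\{\sigma_t(s, a), 2\}$. The function names are hypothetical, and treating an unvisited pair as having the maximal width 2 is an assumption made here to avoid dividing by zero.

```python
import math


def tabular_uncertainty(count, t, num_states, num_actions, delta):
    """min{ sqrt(S ln(t^2 S A / delta) / N_t(s, a)), 2 } for one (s, a) pair."""
    if count == 0:
        return 2.0  # assumption: unvisited pairs get the trivial l1 bound
    log_term = math.log(t ** 2 * num_states * num_actions / delta)
    return min(math.sqrt(num_states * log_term / count), 2.0)


def tabular_bonus(count, t, num_states, num_actions, delta, horizon):
    """Reward bonus b_t(s, a) = H * min{sigma_t(s, a), 2}."""
    return horizon * tabular_uncertainty(count, t, num_states, num_actions, delta)
```

The width stays at the trivial value 2 until a state-action pair has been visited often enough, and then decays at the usual $1/\sqrt{N_t(s, a)}$ rate.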

What is left is to bound the information gain I for the tabular case. For this, we can simply use Proposition 35, which we develop in the next section for KNR. This is because in KNR, when we set the feature mapping $\phi(s, a) \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ to be a one-hot vector with zero everywhere except one in the entry corresponding to the (s, a) pair, the information gain in KNR reduces to the information gain in the tabular model.

Proposition 33 (Information Gain in discrete MDPs). We have:

\[
I_T = O\left( H S^2 A \cdot \ln(T S A / \delta) \ln(1 + TH) \right).
\]

Proof. Using Lemma B.6 in Kakade et al. (2020a), we have:

\[
\sum_{t=0}^{T-1} \min\left\{ \sum_{h=0}^{H-1} \frac{1}{N_t(s^t_h, a^t_h)},\; 1 \right\} \le 2 S A \ln(1 + TH).
\]

Now using the definition of information gain, we have:

\[
\begin{aligned}
I_T = \sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \min\left\{ \sigma^2_t(s^t_h, a^t_h), 1 \right\} &\le S \ln(T^2 S A / \delta)\, H \sum_{t=0}^{T-1} \min\left\{ \sum_{h=0}^{H-1} \frac{1}{N_t(s^t_h, a^t_h)},\; 1 \right\} \\
&\le 2 H S^2 A \ln(T^2 S A / \delta) \ln(1 + TH).
\end{aligned}
\]

This concludes the proof. □

A.1.2 KNRs

Recall the KNR setting from Example 2. The following proposition shows that the bonus

designed in Example 2 is valid.

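Proposition 34 (KNR Bonus). Fix δ ∈ (0, 1). With probability at least 1 − δ, for all t ∈ N, we have:

\[
\left\| \hat{P}_t(\cdot|s, a) - P^\star(\cdot|s, a) \right\|_1 \le \min\left\{ \frac{\beta_t}{\sigma} \|\phi(s, a)\|_{\Sigma_t^{-1}},\; 2 \right\}, \quad \forall s, a,
\]

where $\beta_t = \sqrt{2\lambda \|W^\star\|_2^2 + 8\sigma^2 \left( d_s \ln(5) + 2\ln(t^2/\delta) + \ln(4) + \ln\left( \det(\Sigma_t)/\det(\lambda I) \right) \right)}$.

Proof. The proof directly follows the confidence ball construction and proof from Kakade et al. (2020a). Specifically, from Lemma B.5 in Kakade et al. (2020a), we have that with probability at least 1 − δ, for all t:

\[
\left\| \left( \hat{W}_t - W^\star \right) (\Sigma_t)^{1/2} \right\|_2^2 \le \beta_t^2.
\]

Thus, with Lemma 41, we have:

\[
\begin{aligned}
\left\| \hat{P}_t(\cdot|s, a) - P^\star(\cdot|s, a) \right\|_1 &\le \frac{1}{\sigma} \left\| (\hat{W}_t - W^\star)\phi(s, a) \right\|_2 \\
&\le \left\| (\hat{W}_t - W^\star)(\Sigma_t)^{1/2} \right\| \|\phi(s, a)\|_{\Sigma_t^{-1}} / \sigma \\
&\le \frac{\beta_t}{\sigma} \|\phi(s, a)\|_{\Sigma_t^{-1}}.
\end{aligned}
\]

This concludes the proof. □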



The following proposition bounds the information gain quantity.

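Proposition 35 (Information Gain on KNRs). For simplicity, let us assume $\phi : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}^d$, i.e., ϕ(s, a) is a d-dimensional feature vector. In this case, we will have:

\[
I_T = O\left( H \left( d \ln(T^2/\delta) + d d_s + d^2 \ln\left( 1 + \|W^\star\|_2^2 TH/\sigma^2 \right) \right) \ln\left( 1 + \|W^\star\|_2^2 TH/\sigma^2 \right) \right).
\]

Proof. From the previous proposition, we know that $\sigma^2_t(s, a) = (\beta_t^2/\sigma^2) \|\phi(s, a)\|^2_{\Sigma_t^{-1}}$. Setting $\lambda = \sigma^2/\|W^\star\|_2^2$, we will have $\beta_t^2/\sigma^2 \ge 1$, which means that $\min\{\sigma^2_t(s, a), 1\} \le (\beta_t^2/\sigma^2) \min\left\{ \|\phi(s, a)\|^2_{\Sigma_t^{-1}}, 1 \right\}$.

Note that $\beta_t$ is non-decreasing with respect to t, so $\beta_t \le \beta_T$ for T ≥ t, where

\[
\beta_T = \sqrt{2\sigma^2 + 8\sigma^2 \left( d_s \ln(5) + 2\ln(T^2/\delta) + \ln(4) + d \ln\left( 1 + TH\|W^\star\|_2^2/\sigma^2 \right) \right)}.
\]

Also we have

\[
\sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \min\left\{ \|\phi(s^t_h, a^t_h)\|^2_{\Sigma_t^{-1}}, 1 \right\} \le H \sum_{t=0}^{T-1} \min\left\{ \sum_{h=0}^{H-1} \|\phi(s^t_h, a^t_h)\|^2_{\Sigma_t^{-1}},\; 1 \right\},
\]

since $\min\{a_1, b_1\} + \min\{a_2, b_2\} \le \min\{a_1 + a_2, b_1 + b_2\}$. Now, calling Lemma B.6 in Kakade et al. (2020a), we have:

\[
\sum_{t=0}^{T-1} \min\left\{ \sum_{h=0}^{H-1} \|\phi(s^t_h, a^t_h)\|^2_{\Sigma_t^{-1}},\; 1 \right\} \le 2 \ln\left( \det(\Sigma_T)/\det(\lambda I) \right) = 2d \ln\left( 1 + TH\|W^\star\|_2^2/\sigma^2 \right). \tag{A.1}
\]

Finally, recalling the definition of $I_T$, we have:

\[
\begin{aligned}
I_T &= \sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \min\left\{ \sigma^2_t(s^t_h, a^t_h), 1 \right\} \le \frac{\beta_T^2}{\sigma^2} \sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \min\left\{ \|\phi(s^t_h, a^t_h)\|^2_{\Sigma_t^{-1}}, 1 \right\} \\
&\le \frac{\beta_T^2}{\sigma^2}\, 2Hd \ln\left( 1 + \|W^\star\|_2^2 TH/\sigma^2 \right) \\
&\le 2Hd \left( 2 + 8\left( d_s \ln(5) + 2\ln(T^2/\delta) + \ln(4) + d \ln\left( 1 + \|W^\star\|_2^2 TH/\sigma^2 \right) \right) \right) \ln\left( 1 + \|W^\star\|_2^2 TH/\sigma^2 \right) \\
&= H \left( 4d + 32 d d_s + 32 d \ln(T^2/\delta) + 32 d + 2d^2 \ln\left( 1 + \|W^\star\|_2^2 TH/\sigma^2 \right) \right) \ln\left( 1 + \|W^\star\|_2^2 TH/\sigma^2 \right),
\end{aligned}
\]

which concludes the proof. □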



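Extension to Infinite Dimensional RKHS When $\phi : \mathcal{S} \times \mathcal{A} \mapsto \mathcal{H}$, where $\mathcal{H}$ is some infinite dimensional RKHS, we can bound our regret using the following intrinsic dimension:

\[
\tilde{d} = \max_{\{\{s^t_h, a^t_h\}_{h=0}^{H-1}\}_{t=0}^{T-1}} \ln\det\left( I + \frac{1}{\lambda} \sum_{t=0}^{T-1} \sum_{h=0}^{H-1} \phi(s^t_h, a^t_h)\phi(s^t_h, a^t_h)^\top \right).
\]

In this case, recalling Proposition 34, we have:

\[
\begin{aligned}
\beta_t \le \beta_T &\le \sqrt{2\lambda \|W^\star\|_2^2 + 8\sigma^2 \left( d_s \ln(5) + 2\ln(t^2/\delta) + \ln(4) + \ln\left( \det(\Sigma_T)/\det(\lambda I) \right) \right)} \\
&\le \sqrt{2\lambda \|W^\star\|_2^2 + 8\sigma^2 \left( d_s \ln(5) + 2\ln(t^2/\delta) + \ln(4) + \tilde{d} \right)}.
\end{aligned}
\]

Also recalling Eq. (A.1), we have:

\[
\sum_{t=0}^{T-1} \min\left\{ \sum_{h=0}^{H-1} \|\phi(s^t_h, a^t_h)\|^2_{\Sigma_t^{-1}},\; 1 \right\} \le 2 \ln\left( \det(\Sigma_T)/\det(\lambda I) \right) \le 2\tilde{d}.
\]

Combining the above two and following a derivation similar to the one for the finite dimensional setting, we have:

\[
I_T = \tilde{O}\left( H\tilde{d}^2 + H\tilde{d} d_s \right).
\]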

A.1.3 General Function Class G with Bounded Eluder dimension

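Proposition 36. Fix δ ∈ (0, 1). Consider a general function class G where G is discrete, and $\sup_{g \in \mathcal{G}, s, a} \|g(s, a)\|_2 \le G$. At iteration t, denote $\hat{g}_t \in \operatorname*{argmin}_{g \in \mathcal{G}} \sum_{i=0}^{t-1} \sum_{h=0}^{H-1} \|g(s^i_h, a^i_h) - s^i_{h+1}\|_2^2$, and denote a version space $\mathcal{G}_t$ as:

\[
\mathcal{G}_t = \left\{ g \in \mathcal{G} : \sum_{i=0}^{t-1} \sum_{h=0}^{H-1} \left\| g(s^i_h, a^i_h) - \hat{g}_t(s^i_h, a^i_h) \right\|_2^2 \le c_t \right\}, \quad \text{with } c_t = 2\sigma^2 G^2 \ln(2t^2 |\mathcal{G}| / \delta).
\]

Then with probability at least 1 − δ, we have that for all t, and all s, a:

\[
\left\| \hat{P}_t(\cdot|s, a) - P^\star(\cdot|s, a) \right\|_1 \le \min\left\{ \frac{1}{\sigma} \max_{g_1 \in \mathcal{G}_t, g_2 \in \mathcal{G}_t} \|g_1(s, a) - g_2(s, a)\|_2,\; 2 \right\}.
\]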



Proof. Consider a fixed function g ∈ G. Let us denote zt
h =

∥∥∥g(st
h, a

t
h) − st

h+1

∥∥∥2

2
−∥∥∥g⋆(st

h, a
t
h) − st

h+1

∥∥∥2

2
. We have:

zt
h =

(
g(st

h, a
t
h) − g⋆(st

h, a
t
h)
)⊤ (

g(st
h, a

t
h) + g⋆(st

h, a
t
h) − 2g⋆(st

h, a
t
h) − 2ϵ t

h
)

=
∥∥∥g(st

h, a
t
h) − g⋆(st

h, a
t
h)
∥∥∥2

2
− 2(g(st

h, a
t
h) − g⋆(st

h, a
t
h))⊤ϵ t

h.

Since ϵ t
h ∼ N(0, σ2I), we must have:

2(g(st
h, a

t
h) − g⋆(st

h, a
t
h))⊤ϵ t

h ∼ N(0, 4σ2
∥∥∥g(st

h, a
t
h) − g⋆(st

h, a
t
h)
∥∥∥2

2
)

Since supg,s,a ∥g(s, a)∥2 ≤ G, then 2(g(st
h, a

t
h) − g⋆(st

h, a
t
h))⊤ϵ t

h is a 2σG sub-Gaussian

random variable.

Invoking Lemma 3 in (Russo and Van Roy, 2014), we have that with probability at least 1 − δ:

\[
\sum_t\sum_h \left\| g(s_h^t,a_h^t) - s_{h+1}^t \right\|_2^2 \ge \sum_t\sum_h \left\| g^\star(s_h^t,a_h^t) - s_{h+1}^t \right\|_2^2 + 2\sum_t\sum_h \left\| g(s_h^t,a_h^t) - g^\star(s_h^t,a_h^t) \right\|_2^2 - 4\sigma^2 G^2 \ln(1/\delta).
\]

Note that the above can also be derived directly using Azuma-Bernstein's inequality and the property of the square loss. With a union bound over all $g \in \mathcal{G}$, we have that with probability at least 1 − δ, for all $g \in \mathcal{G}$:

\[
\sum_t\sum_h \left\| g(s_h^t,a_h^t) - s_{h+1}^t \right\|_2^2 \ge \sum_t\sum_h \left\| g^\star(s_h^t,a_h^t) - s_{h+1}^t \right\|_2^2 + 2\sum_t\sum_h \left\| g(s_h^t,a_h^t) - g^\star(s_h^t,a_h^t) \right\|_2^2 - 4\sigma^2 G^2 \ln(|\mathcal{G}|/\delta).
\]

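Setting $g = \hat g_t$ and using the fact that $\hat g_t$ is the minimizer of $\sum_t\sum_h \|g(s_h^t,a_h^t) - s_{h+1}^t\|_2^2$, we must have:

\[
\sum_t\sum_h \left\| \hat g_t(s_h^t,a_h^t) - g^\star(s_h^t,a_h^t) \right\|_2^2 \le 2\sigma^2 G^2 \ln(2t^2|\mathcal{G}|/\delta).
\]

Namely, we have proved that our version space $\mathcal{G}_t$ contains $g^\star$ for all t. Thus, we have:

\[
\left\| \hat P_t(\cdot|s,a) - P^\star(\cdot|s,a) \right\|_1 \le \frac{1}{\sigma}\|\hat g_t(s,a) - g^\star(s,a)\|_2 \le \frac{1}{\sigma} \sup_{g_1\in\mathcal{G}_t,\, g_2\in\mathcal{G}_t} \|g_1(s,a) - g_2(s,a)\|_2,
\]

where the last inequality holds since both $g^\star$ and $\hat g_t$ belong to the version space $\mathcal{G}_t$.

□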

Now we bound the information gain IT below. The proof mainly follows from the

proof in (Osband and Van Roy, 2014).

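Lemma 37 (Lemma 1 in Osband and Van Roy (2014)). Denote $\beta_t = 2\sigma^2 G^2 \ln(t^2|\mathcal{G}|/\delta)$. Let us denote the uncertainty measure $w_{t;h} = \sup_{f_1, f_2\in\mathcal{G}_t} \|f_1(s_h^t,a_h^t) - f_2(s_h^t,a_h^t)\|_2$ (note that $w_{t;h}$ is non-negative). We have:

\[
\sum_{i=0}^{t-1}\sum_{h=0}^{H-1} \mathbf{1}\{w_{i;h}^2 > \epsilon\} \le \left( \frac{4\beta_t}{\epsilon} + H \right) d_E(\sqrt{\epsilon}).
\]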

Proposition 38 (Bounding $\mathcal{I}_T$). Denote $d = d_E(1/TH)$. We have

\[
\mathcal{I}_T = O\left( 1/\sigma^2 + HdG^2/\sigma^2 + 8G^2 \ln(T^2|\mathcal{G}|/\delta)\, d \ln(TH) \right).
\]

Proof. Note that the uncertainty measures $w_{t;h}$ are non-negative. Let us reorder the sequence and denote the ordered one as $w_1 \ge w_2 \ge w_3 \ge \cdots \ge w_{TH-H}$. For notational simplicity, denote $M = TH - H$. We have:

\[
\sum_{t=0}^{T-1}\sum_{h=0}^{H-1} w_{t;h}^2 = \sum_{i=0}^{M-1} w_i^2 \le 1 + \sum_i w_i^2\, \mathbf{1}\{w_i^2 \ge 1/M\},
\]

where the last inequality comes from the fact that $\sum_i w_i^2\, \mathbf{1}\{w_i^2 < 1/M\} \le M \cdot \frac{1}{M} = 1$. Consider any $w_t$ where $w_t^2 \ge 1/M$. In this case, we know that $w_1^2 \ge w_2^2 \ge \cdots \ge w_t^2 \ge 1/M$. This means that:

\[
t \le \sum_i\sum_h \mathbf{1}\{w_{i;h}^2 > w_t^2\} \le \left( \frac{4\beta_T}{w_t^2} + H \right) d_E(\sqrt{w_t}) \le \left( \frac{4\beta_T}{w_t^2} + H \right) d_E(1/M),
\]

where the second inequality uses the lemma above, and the last inequality uses the fact that $d_E(\epsilon)$ is non-decreasing as ϵ gets smaller. Denote $d = d_E(1/M)$. The above inequality indicates that $w_t^2 \le \frac{4\beta_T d}{t - Hd}$. This means that for any $w_t^2 \ge 1/M$, we must have $w_t^2 \le 4\beta_T d/(t - Hd)$. Thus, we have:

\[
\sum_{t=0}^{T-1}\sum_{h=0}^{H-1} w_{t;h}^2 \le 1 + HdG^2 + \sum_{\tau=Hd+1}^{M} w_\tau^2\, \mathbf{1}\{w_\tau^2 \ge 1/M\} \le 1 + HdG^2 + 4\beta_T d \ln(M) \le 1 + HdG^2 + 4\beta_T d \ln(TH).
\]

Finally, recalling the definition of $\mathcal{I}_T$, we have:

\[
\sum_{t=0}^{T-1}\sum_{h=0}^{H-1} \min\{\sigma_t^2(s_h^t,a_h^t), 1\} \le \sum_{t=0}^{T-1}\sum_{h=0}^{H-1} \sigma_t^2(s_h^t,a_h^t) \le \frac{1}{\sigma^2} \sum_{t=0}^{T-1}\sum_{h=0}^{H-1} w_{t;h}^2 \le \frac{1}{\sigma^2}\left( 1 + HdG^2 + 4\beta_T d \ln(TH) \right).
\]

This concludes the proof. □

A.1.4 Proof of Theorem 7

This section provides the proof of Theorem 7. First we present the reduction from a

bandit optimization problem to ILFO.

Consider a multi-armed bandit (MAB) problem with A many actions $\{a_i\}_{i=1}^{A}$. Each action's ground truth reward $r_i$ is sampled from a Gaussian with mean $\mu_i$ and variance 1. Without loss of generality, assume $a_1$ is the optimal arm, i.e., $\mu_1 \ge \mu_i$ for all $i \ne 1$. We

convert this MAB instance into an MDP. Specifically, set H = 2. Suppose we have a

fixed initial state s0 which has A many actions. For the one step transition, we have

P(·|s0, ai) = N(µi, 1), i.e., g∗(s0, ai) = µi. Here we denote the optimal expert policy πe as

πe(s0) = a1, i.e., expert policy picks the optimal arm in the MAB instance. Hence, when

executing πe, we note that the state s1 generated from πe is simply the stochastic reward

of a1 in the original MAB instance. Assume that we have observed infinitely many such

s1 from the expert policy πe, i.e., we have infinitely many samples of expert state data,

i.e., N → ∞. Note, however, we do not have the actions taken by the expert (since this

is the ILFO setting). This expert data is equivalent to revealing the optimal arm’s mean

reward µ1 to the MAB learner a priori. Hence solving the ILFO problem on this MDP is

no easier than solving the original MAB instance with additional information which is

that optimal arm’s mean reward is µ1 (but the best arm’s identity is unknown).
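
The reduction above can be illustrated with a minimal sketch. It assumes a hypothetical helper `make_mab_mdp` (not part of the original construction) that wraps a Gaussian MAB as the horizon-2 MDP described in the text: the next state after pulling arm i from s0 is a draw from N(µi, 1), and the expert data consists of states only, with no actions.

```python
import random


def make_mab_mdp(means, seed=0):
    """Hypothetical helper: wrap a Gaussian MAB as a horizon-2 MDP.

    The single initial state s0 has one action per arm, and the next
    state s1 is a sample from N(mu_i, 1), playing the role of the reward.
    """
    rng = random.Random(seed)
    s0 = 0.0  # fixed initial state (its value is irrelevant)

    def step(arm):
        # One-step transition P(.|s0, a_i) = N(mu_i, 1).
        return rng.gauss(means[arm], 1.0)

    def expert_states(n):
        # The expert always pulls the best arm; only states are recorded
        # (no actions), as in the ILFO setting.
        best = max(range(len(means)), key=lambda i: means[i])
        return [rng.gauss(means[best], 1.0) for _ in range(n)]

    return s0, step, expert_states


s0, step, expert_states = make_mab_mdp([0.5, 0.1, -0.2])
demo = expert_states(20000)
# With many expert states, their average reveals the optimal mean mu_1,
# but not which arm attains it.
mu1_hat = sum(demo) / len(demo)
```

The average of the expert states approaches µ1 as the number of samples grows, which is the sense in which the expert data reveals the optimal mean reward a priori.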

Below we show the lower bound for solving the MAB problem where the optimal

arm’s mean is known.

Theorem 39. Consider best arm identification of Gaussian MAB with the additional information that the optimal arm's mean reward is µ. For any algorithm, there exists a MAB instance with number of arms A ≥ 2, such that the expected cumulative regret is still $\Omega(\sqrt{AT})$, i.e., the additional information does not help improve the worst-case regret bound for solving the MAB instance.

Proof of Theorem 39. Below, we will construct A many MAB instances where each instance has A many arms and each arm has a Gaussian reward distribution with the fixed variance $\sigma^2$. Each of the A instances has the maximum mean reward equal to ∆, i.e., all these A instances have the same maximum arm mean reward. Consider any algorithm Alg that maps ∆ together with the history of the interactions $\mathcal{H}_t = \{a_0, r_0, a_1, r_1, \ldots, a_{t-1}, r_{t-1}\}$ to a distribution over A actions. We will show that for any such algorithm Alg that knows ∆, with constant probability, there must exist a MAB instance among the A many MAB instances such that Alg suffers at least $\Omega(\sqrt{AT})$ regret, where T is the number of iterations.



Now we construct the A instances as follows. Consider the i-th instance (i = 1, . . . , A). For arm j in the i-th instance, we define its mean as $\mu_j^i = \mathbf{1}\{i = j\}\Delta$. Namely, for MAB instance i, its arms have mean reward zero everywhere except that the i-th arm has reward mean ∆. Note that all these MAB instances have the same maximum mean reward, i.e., ∆. Hence, we cannot distinguish them by just revealing ∆ to the learner.

We will construct an additional MAB instance (we name it as 0-th MAB instance)

whose arms have reward mean zero. Note that this MAB instance has maximum mean

reward 0 which is different from the previous A MAB instances that we constructed.

However, we will only look at the regret of Alg on the previously constructed A MAB

instances. I.e., we do not care about the regret of Alg(∆,Ht) on the 0-th MAB instance.

Let us denote $\mathbb{P}_i$ (for i = 0, . . . , A) as the distribution of the outcomes of algorithm Alg(∆, $\mathcal{H}_t$) interacting with MAB instance i for T iterations, and $\mathbb{E}_j[N_i(T)]$ as the expected number of times arm i is pulled by Alg(∆, $\mathcal{H}_t$) in MAB instance j. Consider MAB instance i with i ≥ 1:

\[
\mathbb{E}_i[N_i(T)] - \mathbb{E}_0[N_i(T)] \le T\|\mathbb{P}_i - \mathbb{P}_0\|_1 \le T\sqrt{\mathrm{KL}(\mathbb{P}_0, \mathbb{P}_i)} \le T\sqrt{\Delta^2\, \mathbb{E}_0[N_i(T)]},
\]

where the last step uses the fact that we are running the same algorithm Alg(∆, $\mathcal{H}_t$) on both instance 0 and instance i (i.e., same policy for generating actions), and thus, $\mathrm{KL}(\mathbb{P}_0, \mathbb{P}_i) = \sum_{j=1}^{A} \mathbb{E}_0[N_j(T)]\, \mathrm{KL}(q_0(j), q_i(j))$ (Lemma 15.1 in Lattimore and Szepesvári (2020)), where $q_i(j)$ is the reward distribution of arm j at instance i. Also recall that for instance 0 and instance i, their rewards only differ at arm i.

This implies that:

\[
\mathbb{E}_i[N_i(T)] \le \mathbb{E}_0[N_i(T)] + T\sqrt{\Delta^2\, \mathbb{E}_0[N_i(T)]}.
\]



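Summing over i = 1, . . . , A on both sides, we have:

\[
\sum_{i=1}^{A} \mathbb{E}_i[N_i(T)] \le T + T\sum_{i=1}^{A} \sqrt{\Delta^2\, \mathbb{E}_0[N_i(T)]} \le T + T\sqrt{A}\sqrt{\sum_{i=1}^{A} \Delta^2\, \mathbb{E}_0[N_i(T)]} \le T + T\sqrt{A}\sqrt{\Delta^2 T}.
\]

Now let us calculate the regret of Alg(∆, $\mathcal{H}_t$) on the i-th instance. We have:

\[
R_i = T\Delta - \mathbb{E}_i[N_i(T)]\Delta.
\]

Summing over i = 1, . . . , A, we have:

\[
\sum_{i=1}^{A} R_i = \Delta\left( AT - \sum_{i=1}^{A} \mathbb{E}_i[N_i(T)] \right) \ge \Delta\left( AT - T - T\sqrt{A\Delta^2 T} \right).
\]

Setting $\Delta = c\sqrt{A/T}$ for some c that we will specify later, we get:

\[
\sum_{i=1}^{A} R_i \ge c\sqrt{\frac{A}{T}}\left( AT - T - cAT \right).
\]

Setting c = 1/4, we get:

\[
\sum_{i=1}^{A} R_i \ge c\sqrt{\frac{A}{T}}\left( AT - T - cAT \right) \ge \frac{1}{4}\sqrt{AT}\left( A - 1 - A/4 \right) = \frac{1}{4}\sqrt{AT}\left( 3A/4 - 1 \right) \ge \frac{1}{4}\sqrt{AT}\,(A/4),
\]

assuming A ≥ 2.

Thus there must exist i ∈ {1, . . . , A}, such that:

\[
R_i \ge \frac{1}{16}\sqrt{AT}.
\]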

Note that the above construction considered any algorithm Alg(∆,Ht) that maps ∆ and

history to action distributions. Thus it concludes the proof. □
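
As a sanity check on the algebra in the last steps of the proof, the following sketch evaluates the lower bound on the summed regret with ∆ = c√(A/T) and c = 1/4, and compares it with the claimed quantity (A/16)√(AT). The function names are hypothetical and only serve this illustration.

```python
import math


def summed_regret_lower_bound(A, T, c=0.25):
    """Evaluate Delta * (A*T - T - T*sqrt(A * Delta^2 * T)) with Delta = c*sqrt(A/T)."""
    delta = c * math.sqrt(A / T)
    return delta * (A * T - T - T * math.sqrt(A * delta ** 2 * T))


def claimed_bound(A, T):
    """The final bound on the summed regret, (1/4) * sqrt(A*T) * (A/4)."""
    return A * math.sqrt(A * T) / 16.0


# The chain of inequalities needs A >= 2; at A = 2 it holds with equality.
checks = [
    summed_regret_lower_bound(A, T) >= claimed_bound(A, T) * (1 - 1e-9)
    for A in range(2, 51)
    for T in (10, 1000, 10 ** 6)
]
```

Dividing the summed bound by the A instances gives the per-instance statement that some instance has regret at least √(AT)/16.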



The hardness result in Theorem 39 and the reduction from MAB to ILFO together imply the lower bound for ILFO in Theorem 7; namely, solving ILFO with cumulative regret smaller than $O(\sqrt{AT})$ would contradict the MAB lower bound in Theorem 39.

A.2 Auxiliary Lemmas

Lemma 40 (Simulation Lemma). Consider any two functions $f : \mathcal{S}\times\mathcal{A} \to [0, 1]$ and $\hat f : \mathcal{S}\times\mathcal{A} \to [0, 1]$, any two transitions $P$ and $\hat P$, and any policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$. We have:

\[
V^\pi_{P;f} - V^\pi_{\hat P,\hat f} = \sum_{h=0}^{H-1} \mathbb{E}_{s,a\sim d^\pi_P}\left[ f(s,a) - \hat f(s,a) + \mathbb{E}_{s'\sim P(\cdot|s,a)} V^\pi_{\hat P,\hat f;h}(s') - \mathbb{E}_{s'\sim \hat P(\cdot|s,a)} V^\pi_{\hat P,\hat f;h}(s') \right]
\]
\[
\le \sum_{h=0}^{H-1} \mathbb{E}_{s,a\sim d^\pi_P}\left[ f(s,a) - \hat f(s,a) + \|V^\pi_{\hat P,\hat f;h}\|_\infty \|P(\cdot|s,a) - \hat P(\cdot|s,a)\|_1 \right],
\]

where $V^\pi_{P,f;h}$ denotes the value function at time step h, under π, P, f.

This simulation lemma is standard in the model-based RL literature and can be found, for instance, in the proof of Lemma 10 from Sun et al. (2019a).

Lemma 41. Consider two Gaussian distributions $P_1 := \mathcal{N}(\mu_1, \sigma^2 I)$ and $P_2 := \mathcal{N}(\mu_2, \sigma^2 I)$. We have:

\[
\|P_1 - P_2\|_1 \le \frac{1}{\sigma}\|\mu_1 - \mu_2\|_2.
\]

The above lemma can be proved by Pinsker’s inequality and the closed-form of the

KL divergence between P1 and P2.
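
For intuition, Lemma 41 can be checked numerically in one dimension, where the L1 distance between two equal-variance Gaussians has a closed form in terms of the error function. The helper name below is hypothetical, and the one-dimensional restriction is an assumption made only for this illustration.

```python
import math


def l1_distance_gaussians_1d(mu1, mu2, sigma):
    """Exact L1 distance between N(mu1, sigma^2) and N(mu2, sigma^2) in 1D.

    The total variation distance is erf(d / (2*sqrt(2)*sigma)) with
    d = |mu1 - mu2|, and the L1 distance is twice the total variation.
    """
    d = abs(mu1 - mu2)
    return 2.0 * math.erf(d / (2.0 * math.sqrt(2.0) * sigma))


# Compare against the bound |mu1 - mu2| / sigma from the lemma.
pairs = [(0.0, 0.1, 1.0), (0.0, 1.0, 1.0), (-2.0, 3.0, 0.5), (1.0, 1.0, 2.0)]
gaps = [abs(m1 - m2) / s - l1_distance_gaussians_1d(m1, m2, s) for m1, m2, s in pairs]
```

Each entry of `gaps` is the slack of the lemma's bound for the corresponding pair; the exact distance never exceeds 2, so the bound is only informative when the means are closer than 2σ.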



A.3 Implementation Details

A.3.1 Environment Setup and Benchmarks

This section sketches the details of how we set up the environments. We utilize the standard environment horizons of 500, 50, and 200 for Cartpole-v1, Reacher-v2, and Cartpole-v0, respectively. For Swimmer-v2, Hopper-v2 and Walker2d-v2, we work with the environment horizon set to 400 (Kurutach et al., 2018; Nagabandi et al., 2018; Luo et al., 2018; Rajeswaran et al., 2020; Kidambi et al., 2020a). Furthermore, for Hopper-v2 and Walker2d-v2, we add the velocity of the center of mass to the state parameterization (Rajeswaran et al., 2020; Luo et al., 2018; Kidambi et al., 2020a). As noted in the main text, the expert policy is trained using NPG/TRPO (Kakade, 2001b; Schulman et al., 2015b) until it hits a value of (approximately) 460, −10, 38, 3000, 2000, 170 for Cartpole-v1, Reacher-v2, Swimmer-v2, Hopper-v2, Walker2d-v2, Cartpole-v0 respectively. Furthermore, for Walker2d-v2 we utilized pairs of states (s, s′) for defining the feature representation used for parameterizing the discriminator. All the results presented in the experiments section are averaged over five seeds. In terms of baselines, we compare MobILE to BC, BC-O, ILPO, GAIL and GAIFO. Note that BC/GAIL have access to expert actions whereas our algorithm does not. We report the average of the best performance offered by BC/BC-O when run with five seeds, even if this occurs at different epochs for each of the runs; this gives an upper hand to BC/BC-O. Moreover, note that for BC, we run the supervised learning algorithm for 500 passes. We run BC-O/GAIL with the same number of online samples as MobILE in order to present our results. We used 2 CPUs with 16-32 GB of RAM usage to perform all our benchmarking runs, implemented in PyTorch. Finally, our codebase utilizes OpenAI's implementation of TRPO (Dhariwal et al., 2017) for environments with discrete actions, and the MJRL repository (Rajeswaran et al., 2017b) for working with continuous action environments. With regard to results in the main paper, our bar graph presenting normalized results was obtained by dividing every algorithm's performance (mean/standard deviation) by the expert mean; for Reacher-v2, because the rewards themselves are negative, we first added a constant offset to make every algorithm's performance positive, and then divided by the mean of the expert policy.

Algorithm 12 MobILE: Model-based Imitation Learning and Exploring for ILFO (used in practical implementation)

1: Require: Expert dataset De, access to the dynamics of the true environment, i.e., P⋆.
2: Initialize policy π0, discriminator w0, replay buffer D of pre-determined size, dynamics model P̂−1, bonus b−1.
3: for t = 0, · · · , T − 1 do
4:   Online Interaction: Execute πt in the true environment P⋆ to get samples St.
5:   Update replay buffer: D = Replay-Buffer-Update(D, St) (refer to Section A.3.2).
6:   Update dynamics model: Obtain P̂t by starting at P̂t−1 and updating using the replay buffer D (refer to Section A.3.2).
7:   Bonus Update: Update bonus bt : S × A → R+ using the replay buffer D (refer to Section A.3.2).
8:   Discriminator Update: Update discriminator as wt ← arg maxw L(w; πt, P̂t, bt, De) (refer to Section A.3.2).
9:   Policy Update: Perform incremental policy update through approximate minimization of L(·), i.e., πt ← arg minπ L(π; wt, P̂t, bt, De), by running KPG steps of TRPO (refer to Section A.3.2).
10: end for
11: Return πT.
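
The normalization used for the bar graph of normalized results in Section A.3.1 can be sketched as follows. The function name, the choice to leave the standard deviation unshifted, and the example offset are assumptions made for illustration; the text does not state the offset that was actually used for Reacher-v2.

```python
def normalize_scores(mean, std, expert_mean, offset=0.0):
    """Normalize an algorithm's (mean, std) return by the expert mean.

    `offset` is a hypothetical constant added to returns when they are
    negative (as for Reacher-v2) so that all shifted means are positive.
    A constant shift does not change the standard deviation.
    """
    shifted_expert = expert_mean + offset
    return (mean + offset) / shifted_expert, std / shifted_expert


# Positive-return task: an algorithm at half the expert's value.
cartpole = normalize_scores(mean=230.0, std=23.0, expert_mean=460.0)
# Negative-return task with an assumed offset of 50.
reacher = normalize_scores(mean=-20.0, std=4.0, expert_mean=-10.0, offset=50.0)
```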

A.3.2 Practical Implementation of MobILE

We begin by presenting the implementation details of MobILE (refer to Algorithm 12):



Dynamics Model Training

As detailed in the main paper, we utilize a class of Gaussian dynamics models parameterized by an MLP (Rajeswaran et al., 2020), i.e., $\hat P(s, a) := \mathcal{N}(h_\theta(s, a), \sigma^2 I)$, where $h_\theta(s, a) = s + \sigma_{\Delta s} \cdot \mathrm{MLP}_\theta(s_c, a_c)$, θ are the MLP's trainable parameters, $s_c = (s - \mu_s)/\sigma_s$, $a_c = (a - \mu_a)/\sigma_a$, with $\mu_s, \mu_a$ (and $\sigma_s, \sigma_a$) being the means (and standard deviations) of the states and actions in the replay buffer D. Note that we predict normalized state differences instead of the next state directly.
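
A minimal sketch of the mean prediction $h_\theta(s, a)$ is given below. It assumes that `mlp` is any callable mapping the concatenated normalized inputs to a normalized state difference, that `sigma_delta` holds the per-dimension standard deviation of state differences in the replay buffer, and that a small `eps` guards against division by zero; all three are assumptions of this illustration rather than details stated in the text.

```python
def normalize(x, mean, std, eps=1e-8):
    """Standardize a vector elementwise; eps is an assumed numerical guard."""
    return [(xi - m) / (sd + eps) for xi, m, sd in zip(x, mean, std)]


def predict_next_state_mean(s, a, mlp, stats):
    """Compute h(s, a) = s + sigma_delta * MLP(s_c, a_c).

    `mlp` is a hypothetical callable taking the concatenated normalized
    state and action and returning a normalized state difference.
    """
    s_c = normalize(s, stats["mu_s"], stats["sigma_s"])
    a_c = normalize(a, stats["mu_a"], stats["sigma_a"])
    delta = mlp(s_c + a_c)  # list concatenation of normalized inputs
    return [si + sd * di for si, sd, di in zip(s, stats["sigma_delta"], delta)]


stats = {
    "mu_s": [0.0, 0.0], "sigma_s": [1.0, 1.0],
    "mu_a": [0.0], "sigma_a": [1.0],
    "sigma_delta": [0.5, 2.0],
}
# A stand-in "network" that always predicts the same normalized difference.
next_mean = predict_next_state_mean([0.0, 1.0], [0.3], lambda x: [1.0, -1.0], stats)
```

Predicting a normalized difference keeps the regression targets on a comparable scale across state dimensions, which is the motivation given in the text for not predicting the next state directly.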

In practice, we fine-tune our estimate of the dynamics model based on the new contents of the replay buffer as opposed to re-training the model from scratch, which is computationally more expensive. In particular, we start from the estimate P̂t−1 from epoch t − 1 and perform multiple gradient updates using the contents of the replay buffer D. We utilize constant step-size SGD with momentum (Sutskever et al., 2013) for updating our dynamics models. Since the distribution of (s, a, s′) tuples continually drifts as the algorithm progresses (for instance, because we observe a new state), we utilize gradient clipping to ensure our model does not diverge due to the aggressive nature of our updates.

Replay Buffer

Since we perform incremental training of our dynamics model, we utilize a replay buffer

of a fixed size rather than training our dynamics model on all previously collected online

(s, a, s′) samples. Note that the replay buffer could contain data from all prior online

interactions should we re-train our dynamics model from scratch at every epoch.
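
A fixed-size replay buffer of this kind can be sketched as below. The first-in-first-out eviction rule is an assumption of this illustration; the text only states that the buffer has a fixed size.

```python
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer of (s, a, s_next) transitions.

    Oldest transitions are evicted first once capacity is reached
    (an assumed eviction rule, not confirmed by the text).
    """

    def __init__(self, capacity):
        self._data = deque(maxlen=capacity)

    def add_batch(self, transitions):
        # deque with maxlen silently drops items from the left when full.
        self._data.extend(transitions)

    def contents(self):
        return list(self._data)

    def __len__(self):
        return len(self._data)


buffer = ReplayBuffer(capacity=3)
buffer.add_batch([(0, "a", 1), (1, "a", 2)])
buffer.add_batch([(2, "b", 3), (3, "b", 4), (4, "a", 5)])
```

In Table A.1 the buffer capacity is expressed as a multiple of the horizon H, so `capacity` would be set to, e.g., 10·H in such a sketch.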



Design of Bonus Function

We utilize an ensemble of two transition dynamics models incrementally learned using the contents of the replay buffer. Specifically, given the models $h_{\theta_1}(\cdot)$ and $h_{\theta_2}(\cdot)$, we compute the discrepancy as $\delta(s, a) = \|h_{\theta_1}(s, a) - h_{\theta_2}(s, a)\|_2$. Moreover, given a replay buffer D, we compute the maximum discrepancy as $\delta_D = \max_{(s,a,s')\in D} \delta(s, a)$. We then set the bonus as $b(s, a) = \min(1, \delta(s, a)/\delta_D) \cdot \lambda$, thus ensuring that the magnitude of our bonus remains within [0, λ].
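
The bonus computation can be sketched as follows, with the two models represented by arbitrary callables `h1` and `h2` returning predicted next-state vectors. The handling of the degenerate case where the two models agree on the entire buffer is an assumption of this illustration.

```python
import math


def discrepancy(h1, h2, s, a):
    """Euclidean distance between the two models' predictions at (s, a)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(h1(s, a), h2(s, a))))


def make_bonus(h1, h2, buffer, lam):
    """Build b(s, a) = min(1, delta(s, a) / delta_D) * lam from a replay buffer."""
    delta_max = max(discrepancy(h1, h2, s, a) for (s, a, _s_next) in buffer)

    def bonus(s, a):
        d = discrepancy(h1, h2, s, a)
        if delta_max == 0.0:
            # Assumed convention: any disagreement saturates the bonus.
            return lam if d > 0.0 else 0.0
        return min(1.0, d / delta_max) * lam

    return bonus


# Two toy one-dimensional models that disagree by |a|.
h1 = lambda s, a: [s[0] + a[0]]
h2 = lambda s, a: [s[0] + 2.0 * a[0]]
toy_buffer = [([0.0], [1.0], [1.0]), ([0.0], [2.0], [2.0])]
bonus = make_bonus(h1, h2, toy_buffer, lam=0.1)
```

State-action pairs whose discrepancy exceeds anything seen in the buffer receive the full bonus λ, which is what encourages visiting them.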

Discriminator Update

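Recall that $f_w(s) = w^\top \psi(s)$, where w are the parameters of the discriminator. Given a policy π, the update for the parameters w takes the following form:

\[
\max_{w:\|w\|_2^2 \le \zeta} L(w; \pi, \hat P, b, D_e) := \mathbb{E}_{(s,a)\sim d^\pi_{\hat P}}\left[ f_w(s) - b(s,a) \right] - \mathbb{E}_{s\sim D_e}\left[ f_w(s) \right]
\]
\[
\equiv \max_w L_\zeta(w; \pi, \hat P, b, D_e) = \mathbb{E}_{(s,a)\sim d^\pi_{\hat P}}\left[ f_w(s) - b(s,a) \right] - \mathbb{E}_{s\sim D_e}\left[ f_w(s) \right] - \frac{1}{2}\cdot\left( \|w\|_2^2 - \zeta \right),
\]
\[
\implies \partial_w L_\zeta(w; \pi, \hat P, b, D_e) = \mathbb{E}_{s\sim d^\pi_{\hat P}}\left[ \psi(s) \right] - \mathbb{E}_{s\sim D_e}\left[ \psi(s) \right] - w \ni 0,
\]

where $\partial_w L_\zeta(w; \pi, \hat P, b, D_e)$ denotes the sub-differential of $L_\zeta(\cdot)$ with respect to w. This in particular implies the following:

1. Exact Update: $w^* = P_{B(\zeta)}\left( \mathbb{E}_{s\sim d^\pi_{\hat P}}[\psi(s)] - \mathbb{E}_{s\sim D_e}[\psi(s)] \right)$, where $P_{\cdot}$ is the projection operator and $B(\zeta)$ is the ζ-norm ball.

2. Gradient Ascent Update: $w_{t+1} = P_{B(\zeta)}\left( (1 - \eta_w) w_t + \eta_w \cdot \left( \mathbb{E}_{s\sim d^\pi_{\hat P}}[\psi(s)] - \mathbb{E}_{s\sim D_e}[\psi(s)] \right) \right)$, where $\eta_w > 0$ is the step-size.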
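
The two updates listed above can be sketched as follows, with expectations replaced by sample averages over feature vectors ψ(s). The function names are hypothetical, and the ball is parameterized directly by a Euclidean `radius`; how ζ maps to that radius (for instance √ζ under the constraint ∥w∥² ≤ ζ) is left as an assumption.

```python
import math


def project_to_ball(w, radius):
    """Euclidean projection of w onto the ball of the given radius."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= radius:
        return list(w)
    return [x * radius / norm for x in w]


def mean_features(feature_rows):
    """Average a list of feature vectors psi(s) coordinate-wise."""
    n = len(feature_rows)
    return [sum(col) / n for col in zip(*feature_rows)]


def feature_gap(model_feats, expert_feats):
    """Sample estimate of E_model[psi(s)] - E_expert[psi(s)]."""
    m, e = mean_features(model_feats), mean_features(expert_feats)
    return [mi - ei for mi, ei in zip(m, e)]


def exact_update(model_feats, expert_feats, radius):
    return project_to_ball(feature_gap(model_feats, expert_feats), radius)


def gradient_ascent_update(w, model_feats, expert_feats, radius, eta):
    gap = feature_gap(model_feats, expert_feats)
    step = [(1.0 - eta) * wi + eta * gi for wi, gi in zip(w, gap)]
    return project_to_ball(step, radius)


model_feats = [[3.0, 0.0], [5.0, 0.0]]
expert_feats = [[0.0, 0.0], [0.0, 0.0]]
w_exact = exact_update(model_feats, expert_feats, radius=1.0)
w_ga = gradient_ascent_update([0.0, 0.0], model_feats, expert_feats, radius=10.0, eta=0.5)
```

With step-size 1 the gradient ascent update coincides with the exact update, so the step-size can be read as interpolating between keeping the previous discriminator and jumping to the exact maximizer.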

We found empirically either of the updates to work reasonably well. In the

Swimmer-v2 task, we use the gradient ascent update with ηw = 0.67, and, in the other

tasks, we utilize the exact update. Furthermore, we empirically observe the gradient

ascent update to yield more stability compared to the exact updates. In the case of

Walker2d-v2, we found it useful to parameterize the discriminator based on pairs of

states (s, s′).

Model-Based Policy Update

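Once the maximization of the discriminator parameters w is performed, consider the policy optimization problem, i.e.,

\[
\min_\pi L(\pi; w, \hat P, b, D_e) := \mathbb{E}_{(s,a)\sim d^\pi_{\hat P}}\left[ f_w(s) - b(s,a) \right] - \mathbb{E}_{s\sim D_e}\left[ f_w(s) \right]
\equiv \min_\pi L(\pi; w, \hat P, b, D_e) = \mathbb{E}_{(s,a)\sim d^\pi_{\hat P}}\left[ f_w(s) - b(s,a) \right],
\]

where the equivalence holds because the expert term does not depend on π.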
Hence we perform model-based policy optimization under P̂ and cost function $f_w(s) - b(s, a)$. In practice, we perform approximate minimization of L(·) by incrementally updating the policy using $K_{PG}$ steps of policy gradient, where $K_{PG}$ is a tunable hyper-parameter. In our experiments, we find that setting $K_{PG}$ to around 10 is generally a reasonable choice (for precise values, refer to Table A.1). This paper utilizes TRPO (Schulman et al., 2015b) as our choice of policy gradient method; note that this can be replaced by other alternatives including PPO (Schulman et al., 2017c), SAC (Haarnoja et al., 2018b) etc. Similar to practical implementations of existing policy gradient methods, we implement a reward filter by clipping the IPM reward f(s), truncating it between $c_{\min}$ and $c_{\max}$, as this leads to stability of the policy gradient updates. Note that the minimization is done with access to P̂, which implies we perform model-based planning. Empirically, for purposes of tuning the exploration-imitation parameter λ, we minimize a surrogate, namely $\mathbb{E}_{(s,a)\sim d^\pi_{\hat P}}\left[ (1 - \lambda)\cdot f_w(s) - b(s, a) \right]$ (recall that b(s, a) has a factor of λ associated with it). This ensures that we can precisely control the magnitude of the bonuses against the IPM costs, which, in our experience, is empirically easier to work with.
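
The per-step cost fed to the planner can be sketched as below. The order of operations (clipping the IPM cost before applying the (1 − λ) weight) and the function names are assumptions of this illustration; the text specifies the clipping range and the surrogate separately but not how they are composed.

```python
def clipped_ipm_cost(f_value, c_min, c_max):
    """Truncate the IPM cost f_w(s) to the interval [c_min, c_max]."""
    return max(c_min, min(c_max, f_value))


def surrogate_cost(f_value, bonus, lam, c_min, c_max):
    """Surrogate cost (1 - lam) * clip(f_w(s)) - b(s, a).

    `bonus` is assumed to already carry its factor of lam, as in the
    bonus construction b(s, a) = min(1, delta / delta_D) * lam.
    """
    return (1.0 - lam) * clipped_ipm_cost(f_value, c_min, c_max) - bonus


# An out-of-range IPM cost is clipped to c_max before weighting.
cost = surrogate_cost(f_value=5.0, bonus=0.05, lam=0.1, c_min=-1.0, c_max=1.0)
```

With λ = 0 the bonus vanishes and the cost reduces to the clipped IPM cost, while larger λ shifts weight from imitation toward exploration.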

| Parameter | Cartpole-v1 | Reacher-v2 | Swimmer-v2 | Cartpole-v0 | Hopper-v2 | Walker2d-v2 |
|---|---|---|---|---|---|---|
| Environment Specifications | | | | | | |
| Horizon H | 500 | 50 | 400 | 200 | 400 | 400 |
| Expert Performance (≈) | 460 | −10 | 38 | 181 | 3000 | 2000 |
| # online samples per outer loop | 2·H | 2·H | 2·H | 2·H | 8·H | 3·H |
| Dynamics Model | | | | | | |
| Architecture/Non-linearity | MLP(64, 64)/ReLU | MLP(64, 64)/ReLU | MLP(512, 512)/ReLU | MLP(64, 64)/ReLU | MLP(512, 512)/ReLU | MLP(512, 512)/ReLU |
| Optimizer (LR, Momentum, Batch Size) | SGD(0.005, 0.99, 256) | SGD(0.005, 0.99, 256) | SGD(0.005, 0.99, 256) | SGD(0.005, 0.99, 256) | SGD(0.005, 0.99, 256) | SGD(0.005, 0.99, 256) |
| # train passes per outer loop | 20 | 100 | 100 | 20 | 50 | 200 |
| Grad Clipping | 2.0 | 2.0 | 1.0 | 2.0 | 4.0 | 1.0 |
| Replay Buffer Size | 10·H | 10·H | 10·H | 10·H | 16·H | 15·H |
| Ensemble based bonus | | | | | | |
| # models/bonus range | 2/[0, 1] | 2/[0, 1] | 2/[0, 1] | 2/[0, 1] | 2/[0, 1] | 2/[0, 1] |
| IPM parameters | | | | | | |
| Step size for w update (ηw) | Exact | Exact | 0.33 | Exact | Exact | Exact |
| # RFFs/BW Heuristic | 128/0.1 quantile | 128/0.1 quantile | 128/0.1 quantile | 128/0.1 quantile | 128/0.1 quantile | 128/0.1 quantile |
| Policy parameterization | | | | | | |
| Architecture/Non-linearity | MLP(64, 64)/TanH | MLP(64, 64)/TanH | MLP(64, 64)/TanH | MLP(32, 32)/TanH | MLP(32, 32)/TanH | MLP(32, 32)/TanH |
| Policy Constraints | None | None | None | None | log σmin = −1.0 | log σmin = −2.0 |
| Planning Algorithm | | | | | | |
| # model samples per TRPO step | 2·H | 10·H | 4·H | 4·H | 8·H | 20·H |
| # TRPO steps per outer loop (KPG) | 3 | 10 | 20 | 5 | 10 | 15 |
| TRPO Parameters (CG iters, dampening, kl, gaeλ, γ) | (50, 0.001, 0.01, 0.97, 0.995) | (100, 0.001, 0.01, 0.97, 0.995) | (100, 0.001, 0.01, 0.97, 0.995) | (100, 0.001, 0.01, 0.97, 0.995) | (10, 0.0001, 0.025, 0.97, 0.995) | (10, 0.0001, 0.025, 0.97, 0.995) |
| Critic parameterization | | | | | | |
| Architecture/Non-linearity | MLP(128, 128)/ReLU | MLP(128, 128)/ReLU | MLP(128, 128)/ReLU | MLP(32, 32)/ReLU | MLP(128, 128)/ReLU | MLP(128, 128)/ReLU |
| Optimizer (LR, Batch Size, ϵ, Regularization) | Adam(0.001, 64, 1e−5, 0) | Adam(0.001, 64, 1e−5, 0) | Adam(0.001, 64, 1e−5, 0) | Adam(0.001, 64, 1e−5, 0) | Adam(0.001, 64, 1e−8, 1e−3) | Adam(0.001, 64, 1e−8, 1e−3) |
| # train passes per TRPO update | 1 | 1 | 1 | 1 | 2 | 2 |

Table A.1: List of various hyper-parameters employed in MobILE's implementation.


A.3.3 Hyper-parameter Details

This section presents an overview of the hyper-parameters necessary to implement Algorithm 2 in practice, as described in Algorithm 12. The hyper-parameters are listed in Table A.1. They are broadly categorized by the component of MobILE they correspond to, namely, (a) environment specifications, (b) dynamics model, (c) ensemble based bonus, (d) IPM parameterization, (e) policy parameterization, (f) planning algorithm parameters, (g) critic parameterization. Note that if a hyper-parameter has not been listed, for instance, the value of momentum for the ADAM optimizer in the critic, it has been left at the default value defined in PyTorch.

A.4 Additional Experimental Results

A.4.1 Modified Cartpole-v0 environment with noise added to transition dynamics

[Plot: CartPole-v0 (stochastic). x-axis: Online Samples (1e4); y-axis: Return (Value). Curves: BC, Expert, GAIL, BC-O, MobILE (Ours), GAIFO, ILPO.]

Figure A.1: Learning curves for Cartpole-v0 with stochastic dynamics with 20 expert
trajectories comparing MobILE with BC, BC-O, GAIL, GAIFO and ILPO.



We consider a stochastic variant of Cartpole-v0, wherein we add Gaussian noise of variance unknown to the learner in order to make the transition dynamics of the environment stochastic. Specifically, we train an expert of value ≈ 170 in Cartpole-v0 with stochastic dynamics using TRPO. Using 20 trajectories drawn from this expert, we solve the ILFO problem using MobILE as well as other baselines including BC, BC-O, ILPO, GAIL and GAIFO. Figure A.1 presents the result of this comparison. Note that MobILE compares favorably against other baseline methods; in particular, BC tends to suffer in environments like Cartpole-v0 with stochastic dynamics because of the increased generalization error of the supervised learning algorithm used for learning a policy. Our algorithm is competitive with BC-O, GAIL, GAIFO and ILPO. Note that BC-O tends to outperform BC both in Cartpole-v1 and in Cartpole-v0 (with stochastic dynamics).

A.4.2 Swimmer Learning Curves

We supplement the learning curves for Swimmer-v2 (with 40 expert trajectories) with the learning curves for Swimmer-v2 with 10 expert trajectories in Figure A.2. As can be seen, MobILE outperforms baseline algorithms such as BC, BC-O, ILPO, GAIL and GAIFO in Swimmer-v2 with both 40 and 10 expert trajectories. The caveat is that for 10 expert trajectories, all algorithms tend to show a lot more variance in their behavior, and this variance reduces as we move to the 40 expert trajectory case.



[Plot: two panels, "40 trajectories" (left) and "10 trajectories" (right). x-axis: Online Samples (1e5); y-axis: Return (Value). Curves: BC, Expert, GAIL, BC-O, MobILE (Ours), GAIFO, ILPO.]

Figure A.2: Learning curves for Swimmer-v2 with 40 (left) and 10 (right) expert trajectories comparing MobILE with BC, BC-O, ILPO, GAIL and GAIFO. MobILE continues to perform well relative to all other benchmarks with both 10 and 40 expert trajectories. The variance of the algorithm as well as the benchmarks is notably higher with fewer expert trajectories.

[Plot: five panels, CartPole-v1 (10 traj.), Reacher-v2 (10 traj.), Swimmer-v2 (40 traj.), Hopper-v2 (10 traj.), Walker2d-v2 (10 traj.). x-axis: Online Samples; y-axis: Return (Value). Curves: BC, Expert, GAIL, GAIFO, ILPO, BC-O, MobILE (Ours).]

Figure A.3: Learning curves tracking the running maximum averaged across seeds
comparing MobILE against BC, BC-O, ILPO, GAIL and GAIFO. MobILE tends to reach
expert performance consistently and in a more sample efficient manner.

A.4.3 Additional Results

In this section, we give another view of our results for MobILE compared against the baselines (BC/BC-O/ILPO/GAIL/GAIFO) by tracking the running maximum of each policy's value averaged across seeds. Specifically, for every iteration t, we plot the best policy performance obtained by the algorithm so far averaged across seeds (note that this quantity is monotonic, since the best policy obtained so far can never be worse at a later point of time when running the algorithm). For BC/BC-O/ILPO, we present a simplified view by picking the best policy obtained through the course of running the algorithm and averaging it across seeds (so the curves are flat lines). As Figure A.3 shows, MobILE reliably hits expert performance faster than GAIL and GAIFO while often matching/outperforming ILPO/BC/BC-O.
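
The running-maximum curve described above can be computed as in the following sketch. It assumes that every seed has been evaluated at the same iterations, so that the per-seed lists have equal length; the function name is hypothetical.

```python
from itertools import accumulate


def running_max_mean(returns_per_seed):
    """Average, across seeds, of the best return obtained so far.

    `returns_per_seed` is a list with one list of per-iteration returns
    for each seed; all seeds are assumed to have the same length.
    """
    # Per-seed running maximum: element t is max(returns[0..t]).
    running = [list(accumulate(r, max)) for r in returns_per_seed]
    # Average the running maxima across seeds at each iteration.
    return [sum(vals) / len(vals) for vals in zip(*running)]


curve = running_max_mean([[1.0, 3.0, 2.0], [2.0, 1.0, 5.0]])
```

Because each per-seed running maximum is non-decreasing, so is their average, which is the monotonicity noted in the text.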

A.4.4 Ablation Study on Number of Models used for Strategic Exploration Bonus

In this experiment, we present an ablation study on using more models in the ensemble for setting the strategic exploration bonus. Figure A.4 suggests that even utilizing two models for purposes of setting the bonus is effective from a practical perspective.

[Plot: CartPole-v1 (5 traj.). x-axis: # Online Samples (1e4); y-axis: Return (Value). Curves: Expert, 2 models, 4 models, 8 models.]

Figure A.4: Learning curves for Cartpole-v1 with varying number of dynamics models
for assigning bonuses for strategic exploration.



APPENDIX B

MISSING PROOFS AND DETAILS IN CHAPTER 3

B.1 Bonus Designs

We show that the bonus design in Section 3.5 is valid, i.e., the model is well-calibrated for tabular MDPs, KNRs, and GPs.

B.1.1 Tabular models

Lemma 42. With probability 1 − δ,

\[
\|\hat P(\cdot|s,a) - P(\cdot|s,a)\|_1 \le \sqrt{\frac{|\mathcal{S}|\log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)}{2\{N(s,a) + \lambda\}}} + \frac{\lambda}{N(s,a) + \lambda} \quad \forall (s,a) \in \mathcal{S}\times\mathcal{A}.
\]

Proof. When N(s, a) > 0, we use the concentration inequality of discrete distributions (Jiang, 2020). Then, with probability 1 − δ,

\[
\left\| \frac{N(\cdot|s,a)}{N(s,a)} - P(\cdot|s,a) \right\|_1 \le \sqrt{\frac{|\mathcal{S}|\log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)}{2N(s,a)}} \quad \forall (s,a) \in \{(s,a) : N(s,a) > 0\}.
\]

Thus, noting 0 < N(s, a)/(N(s, a) + λ) < 1, with probability 1 − δ, we have for all (s, a) ∈ {(s, a) : N(s, a) > 0},

\[
\left\| \frac{N(\cdot|s,a)}{N(s,a) + \lambda} - P(\cdot|s,a) \times \frac{N(s,a)}{N(s,a) + \lambda} \right\|_1 \le \sqrt{\frac{|\mathcal{S}|\log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)}{2\{N(s,a) + \lambda\}}}. \quad \text{(B.1)}
\]

Besides, the above inequality is still well-defined and holds in the case N(s, a) = 0. Thus, with probability 1 − δ, Equation (B.1) holds for all (s, a) ∈ S × A.

Recall that the estimator P̂ is N(s′|s, a)/(N(s, a) + λ). Therefore,

\[
\|\hat P(\cdot|s,a) - P(\cdot|s,a)\|_1 \le \left\| \hat P(\cdot|s,a) - P(\cdot|s,a) \times \frac{N(s,a)}{N(s,a) + \lambda} \right\|_1 + \left\| P(\cdot|s,a) - P(\cdot|s,a) \times \frac{N(s,a)}{N(s,a) + \lambda} \right\|_1
\]
\[
\le \sqrt{\frac{|\mathcal{S}|\log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)}{2\{N(s,a) + \lambda\}}} + \frac{\lambda}{N(s,a) + \lambda}.
\]

This concludes the proof. □
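
The tabular estimator and the resulting error bound from Lemma 42 can be sketched as follows. The function names are hypothetical, and λ > 0 is assumed so that the expressions are defined when a state-action pair has never been visited.

```python
import math


def tabular_estimate(next_state_counts, lam):
    """Smoothed estimate P_hat(s'|s, a) = N(s'|s, a) / (N(s, a) + lam)."""
    n = sum(next_state_counts)
    return [c / (n + lam) for c in next_state_counts]


def l1_error_bound(n, lam, num_states, num_actions, delta):
    """Right-hand side of Lemma 42 for a pair (s, a) visited n times."""
    numerator = num_states * math.log(2) + math.log(2 * num_states * num_actions / delta)
    return math.sqrt(numerator / (2 * (n + lam))) + lam / (n + lam)


p_hat = tabular_estimate([3, 1], lam=1.0)
bounds = [l1_error_bound(n, 1.0, num_states=2, num_actions=2, delta=0.1) for n in (0, 10, 1000)]
```

Note that the smoothed estimate sums to N(s, a)/(N(s, a) + λ) rather than 1, which is where the additive λ/(N(s, a) + λ) term in the bound comes from.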

B.1.2 KNRs

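In KNRs, the ground truth model is $s' = W^* \phi(s, a) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \zeta^2 I)$, where $s \in \mathbb{R}^{d_S}$, $a \in \mathbb{R}^{d_A}$, $\phi : \mathcal{S}\times\mathcal{A} \to \mathbb{R}^d$. We define

\[
\|\phi(s,a)\|_{\Sigma_{n_o}^{-1}} := \phi^\top(s,a)\, \Sigma_{n_o}^{-1}\, \phi(s,a).
\]

Lemma 43. With probability at least 1 − δ, we have:

\[
\left\| \hat P(\cdot|s,a) - P(\cdot|s,a) \right\|_1 \le \min\left\{ \frac{\beta_{n_o}}{\zeta} \|\phi(s,a)\|_{\Sigma_{n_o}^{-1}},\ 2 \right\} \quad \forall (s,a) \in \mathcal{S}\times\mathcal{A},
\]

where

\[
\beta_{n_o} = \sqrt{2\lambda\|W^\star\|_2^2 + 8\zeta^2\left( d_S \ln(5) + \ln(1/\delta) + \bar I_{n_o} \right)}, \quad \bar I_{n_o} = \ln\left( \det(\Sigma_{n_o})/\det(\lambda I) \right).
\]

Proof. The proof directly follows the confidence ball construction and proof from Kakade et al. (2020b). Specifically, from Lemma B.5 in Kakade et al. (2020b), we have that with probability at least 1 − δ,

\[
\left\| (\hat W - W^\star)(\Sigma_{n_o})^{1/2} \right\|_2^2 \le \beta_{n_o}^2.
\]

Thus, with Lemma 64, we have:

\[
\left\| \hat P(\cdot|s,a) - P(\cdot|s,a) \right\|_1 \le \frac{1}{\zeta}\left\| (\hat W - W^\star)\phi(s,a) \right\|_2 \le \left\| (\hat W - W^\star)(\Sigma_{n_o})^{1/2} \right\|_2 \|\phi(s,a)\|_{\Sigma_{n_o}^{-1}}/\zeta \le \frac{\beta_{n_o}}{\zeta}\|\phi(s,a)\|_{\Sigma_{n_o}^{-1}}.
\]

This concludes the proof. □
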


B.1.3 Gaussian processes

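Let $\mathcal{H}_k$ be the RKHS with the kernel k(·, ·). We denote the associated norm and inner product by $\|\cdot\|_k$ and $\langle\cdot,\cdot\rangle_k$. In GPs, the ground truth model is defined as $s' = g^*(s, a) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \zeta^2 I)$, where $g^*$ belongs to an RKHS $\mathcal{H}_k$.

Lemma 44. With probability 1 − δ,

\[
\|\hat P(\cdot|s,a) - P(\cdot|s,a)\|_1 \le \min\left( \frac{\beta_{n_o}}{\zeta}\sqrt{k_{n_o}((s,a),(s,a))},\ 2 \right) \quad \forall (s,a) \in \mathcal{S}\times\mathcal{A},
\]

and

\[
\beta_{n_o} = \sqrt{d_S\{2 + 150\log^3(d_S n_o/\delta)\, I_{n_o}\}}, \quad I_{n_o} = \log(\det(I + \zeta^{-2} K_{n_o})).
\]

Proof. Let $\hat g_i$ and $g_i^*$ be the i-th components of $\hat g$ and $g^*$. We have

\[
\|\hat P(\cdot|s,a) - P(\cdot|s,a)\|_1 \le \frac{1}{\zeta}\|\hat g(s,a) - g^*(s,a)\|_2 \quad \text{(Lemma 64)}
\]
\[
= \frac{1}{\zeta}\left( \sum_{i=1}^{d_S} \{\hat g_i(s,a) - g_i^*(s,a)\}^2 \right)^{1/2}
\le \frac{1}{\zeta}\left( \sum_{i=1}^{d_S} k_{n_o}((s,a),(s,a))\, \|\hat g_i - g_i^*\|^2_{k_{n_o}} \right)^{1/2}.
\]
(CS inequality and $g(s,a) = \langle g(\cdot), k((s,a), \cdot)\rangle_{k_{n_o}}$)

By (Srinivas et al., 2010, Theorem 6), with probability 1 − δ, we have

\[
\|\hat g_i - g_i^*\|_{k_{n_o}} \le \beta_{n_o} \quad \forall i \in [1, \cdots, d_S].
\]

This concludes the statement.

□
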


B.2 Proof of Theorem 10

In this section, we prove Theorem 10. We also prove the RL version of Theorem 10 when

the cost c is given and the goal is policy optimization. Before that, we prepare several

lemmas.

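Lemma 45. With probability 1 − δ, we have for all $f \in \mathcal{F}$,

\[
\left| \mathbb{E}_{(s,a)\sim d^{\pi_e}}[f(s,a)] - \mathbb{E}_{D_e}[f(s,a)] \right| \le \epsilon_{stat}, \quad \epsilon_{stat} = \sqrt{\log(2|\mathcal{F}|/\delta)/2n_e}.
\]

Proof. This follows from Hoeffding's inequality and a union bound over $\mathcal{F}$. □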

Lemma 46 (Pessimistic Policy Evaluation 1). Suppose Assumption 9 holds and $\max_{f\in\mathcal{F}}\|f\|_\infty \le 1$. With probability at least 1 − δ, for all $\pi \in \Pi$ and all $f \in \mathcal{F}$,

\[
0 \le V^\pi_{\hat P, f+b} - V^\pi_{P,f}.
\]

Proof of Lemma 46. We denote the expected total cost of π under P̂ and cost function f by $V^\pi_{\hat P,f:h}(s, a)$. In this proof, we condition on the event

\[
\|\hat P(\cdot|s,a) - P(\cdot|s,a)\|_1 \le \min(\sigma(s,a), 2) \quad \forall (s,a) \in \mathcal{S}\times\mathcal{A}.
\]

We argue by induction. We start from h = H + 1, where $V^\pi_{\hat P,f+b:H+1} = V^\pi_{P,f:H+1} = 0$. Assume the inductive hypothesis holds at h + 1, i.e.,

\[
0 \le V^\pi_{\hat P,f+b:h+1}(s) - V^\pi_{P,f:h+1}(s), \quad \forall s \in \mathcal{S},\ \forall \pi \in \Pi,\ \forall f \in \mathcal{F}.
\]

Then, for all $\pi \in \Pi$ and all $f \in \mathcal{F}$,

\[
Q^\pi_{P,f:h}(s,a) - Q^\pi_{\hat P,f+b:h}(s,a)
= -b(s,a) + \mathbb{E}_{s'\sim\hat P(\cdot|s,a)}[V^\pi_{P,f:h+1}(s')] - \mathbb{E}_{s'\sim P(\cdot|s,a)}[V^\pi_{\hat P,f+b:h+1}(s')]
\]
\[
\le -b(s,a) + \mathbb{E}_{s'\sim\hat P(\cdot|s,a)}[V^\pi_{P,f:h+1}(s')] - \mathbb{E}_{s'\sim P(\cdot|s,a)}[V^\pi_{P,f:h+1}(s')] \quad \text{(Inductive hypothesis)}
\]
\[
\le -b(s,a) + H\|\hat P(\cdot|s,a) - P(\cdot|s,a)\|_1 \quad (\|\mathcal{F}\|_\infty \le 1)
\]
\[
\le -H\min(\sigma(s,a), 2) + H\min(\sigma(s,a), 2) = 0. \quad \text{(Bonus construction)}
\]

Then, noting $Q^\pi_{P,f:h}(s,\pi(s)) - Q^\pi_{\hat P,f+b:h}(s,\pi(s)) = V^\pi_{P,f:h}(s) - V^\pi_{\hat P,f+b:h}(s)$, we have

\[
V^\pi_{P,f:h}(s) - V^\pi_{\hat P,f+b:h}(s) \le 0 \quad \forall \pi \in \Pi,\ \forall f \in \mathcal{F}.
\]

This concludes the induction step.

Then, we have

\[
V^\pi_{P,f} - V^\pi_{\hat P,f+b} = V^\pi_{P,f:1} - V^\pi_{\hat P,f+b:1} \le 0 \quad \forall \pi \in \Pi,\ \forall f \in \mathcal{F}.
\]

□

Lemma 47 (Pessimistic Policy Evaluation 2 ). Suppose Assumption 9 holds and ∥F ∥∞ ≤

1. With probability at least 1 − δ, ∀π ∈ Π, ∀ f ∈ F ,

Vπ

P̂, f+b
− Vπ

P, f ≤ Error, Error := (3H2 + H)E(s,a)∼dπP[min(σ(s, a), 2)].

Proof of Lemma 47. In this proof, we condition on the event

∥P̂(·|s, a) − P(·|s, a)∥1 ≤ min(σ(s, a), 2) ∀(s, a) ∈ S ×A.



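We invoke the simulation lemma (Lemma 63). Then, for all $\pi \in \Pi$ and $f \in \mathcal{F}$, we have

\[
\begin{aligned}
V^{\pi}_{\hat P, f+b} - V^{\pi}_{P, f}
&= \sum_{h=1}^{H} E_{(s,a)\sim d^{\pi}_P}\left[ b(s,a) + E_{s'\sim \hat P(\cdot|s,a)}[V^{\pi}_{\hat P, f+b:h}(s')] - E_{s'\sim P(\cdot|s,a)}[V^{\pi}_{\hat P, f+b:h}(s')] \right] \\
&\le \sum_{h=1}^{H} E_{(s,a)\sim d^{\pi}_P}\left[ b(s,a) + \|V^{\pi}_{\hat P, f+b:h}\|_\infty \|\hat P(\cdot|s,a) - P(\cdot|s,a)\|_1 \right] \\
&\le H\,E_{(s,a)\sim d^{\pi}_P}\left[ H\min(\sigma(s,a), 2) + H(2H+1)\min(\sigma(s,a), 2) \right] && (\|V^{\pi}_{\hat P, f+b:h}\|_\infty \le H(2H+1)) \\
&= (3H^2 + H)\,E_{(s,a)\sim d^{\pi}_P}[\min(\sigma(s,a), 2)].
\end{aligned}
\]

Here, we use $\|V^{\pi}_{\hat P, f+b:h}\|_\infty \le H(2H+1)$, which is derived from $0 \le f + b \le 2H+1$. □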

By using the above lemmas, we prove our main result.

Proof of Theorem 10. In this proof, we condition on the event

∥P̂(·|s, a) − P(·|s, a)∥1 ≤ min(σ(s, a), 2),

which holds with probability 1 − δ, and the event in Lemma 45, which holds with

probability 1 − δ.



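Then, with probability $1-2\delta$, we have

\[
\begin{aligned}
V^{\hat\pi_{IL}}_{P,c} - V^{\pi_e}_{P,c}
&\le V^{\hat\pi_{IL}}_{\hat P, c+b} - V^{\pi_e}_{P,c} && \text{(Lemma 46)} \\
&\le H\max_{f\in\mathcal{F}}\left\{ E_{(s,a)\sim d^{\hat\pi_{IL}}_{\hat P}}[f(s,a) + b(s,a)] - E_{(s,a)\sim d^{\pi_e}_P}[f(s,a)] \right\} && (c \in \mathcal{F}) \\
&\le H\max_{f\in\mathcal{F}}\left\{ E_{(s,a)\sim d^{\hat\pi_{IL}}_{\hat P}}[f(s,a) + b(s,a)] - E_{D_e}[f(s,a)] \right\} + H\epsilon_{stat} && \text{(Lemma 45)} \\
&\le H\max_{f\in\mathcal{F}}\left\{ E_{(s,a)\sim d^{\pi_e}_{\hat P}}[f(s,a) + b(s,a)] - E_{D_e}[f(s,a)] \right\} + H\epsilon_{stat} && (\pi_e \in \Pi \text{ and the definition of } \hat\pi_{IL}) \\
&\le H\max_{f\in\mathcal{F}}\left\{ E_{(s,a)\sim d^{\pi_e}_{\hat P}}[f(s,a) + b(s,a)] - E_{(s,a)\sim d^{\pi_e}_P}[f(s,a)] \right\} + 2H\epsilon_{stat} && \text{(Lemma 45)} \\
&\le \max_{f\in\mathcal{F}}\left\{ V^{\pi_e}_{\hat P, f+b} - V^{\pi_e}_{P, f} \right\} + 2H\epsilon_{stat} \\
&\le (3H^2 + H)\,E_{(s,a)\sim d^{\pi_e}_P}[\min(\sigma(s,a), 2)] + 2H\epsilon_{stat} && \text{(Lemma 47)} \\
&\le (6H^2 + 2H)\,E_{(s,a)\sim d^{\pi_e}_P}[\min(\sigma(s,a), 1)] + 2H\epsilon_{stat}.
\end{aligned}
\]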

This concludes the proof. □

Finally, we prove the finite-sample error bounds for the RL case. Similar results are

obtained in (Kidambi et al., 2020b; Yu et al., 2020). We use this theorem in the next

section.

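Theorem 48 (Bounds for RL). Consider any comparator policy $\tilde\pi \in \Pi$. Assume that $P \in \mathcal{P}$ and that Assumption 9 holds. Then, with probability $1-2\delta$, we have

\[
V^{\hat\pi_{RL}}_{P,c} - V^{\tilde\pi}_{P,c} \le (6H^2 + 2H)\,E_{(s,a)\sim d^{\tilde\pi}_P}[\min(\sigma(s,a), 1)]. \tag{B.2}
\]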



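Proof of Theorem 48.

\[
\begin{aligned}
V^{\hat\pi_{RL}}_{P,c} - V^{\tilde\pi}_{P,c}
&\le V^{\hat\pi_{RL}}_{\hat P, c+b} - V^{\tilde\pi}_{P,c} && \text{(Lemma 46)} \\
&\le V^{\tilde\pi}_{\hat P, c+b} - V^{\tilde\pi}_{P,c} && (\tilde\pi \in \Pi \text{ and the definition of } \hat\pi_{RL}) \\
&\le (3H^2 + H)\,E_{(s,a)\sim d^{\tilde\pi}_P}[\min(\sigma(s,a), 2)] && \text{(Lemma 47)} \\
&\le (6H^2 + 2H)\,E_{(s,a)\sim d^{\tilde\pi}_P}[\min(\sigma(s,a), 1)].
\end{aligned}
\]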

This concludes the proof. □

B.3 Finite sample error bound for each model

In this section, we analyze the bound for the following models: (1) discrete MDPs, (2) KNRs, and (3) GPs. All of the proofs are deferred to Section B.3.4. We also discuss the implications for the RL case using Theorem 48.

B.3.1 Discrete MDPs

Recall that the $\pi_e$-concentrability coefficient is defined by

\[
C^{\pi_e} = \max_{(s,a)} \frac{d^{\pi_e}_P(s,a)}{\rho(s,a)}.
\]

Then, the error is calculated as follows.

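Theorem 49 (Error of MILO for discrete MDPs).

• With probability $1-\delta$, when $\lambda = \Omega(1)$,

\[
V^{\hat\pi_{IL}}_{P,c} - V^{\pi_e}_{P,c} \le \mathrm{Err}_o + \mathrm{Err}_e,
\]

\[
\mathrm{Err}_o = c_1 H^2 \log(|\mathcal{S}||\mathcal{A}|c_2/\delta)\left( \sqrt{\frac{C^{\pi_e}|\mathcal{S}|^2|\mathcal{A}|}{n_o}} + \frac{C^{\pi_e}|\mathcal{S}||\mathcal{A}|}{n_o} \right), \qquad \mathrm{Err}_e = 2H\sqrt{\frac{\log(2|\mathcal{F}|/\delta)}{2n_e}},
\]

where $c_1$ and $c_2$ are some universal constants.

• With probability $1-\delta$, when $\lambda = \Omega(1)$,

\[
V^{\hat\pi_{RL}}_{P,c} - V^{\pi^*}_{P,c} \le c_1 H^2 \log(|\mathcal{S}||\mathcal{A}|c_2/\delta)\left( \sqrt{\frac{C^{\pi^*}|\mathcal{S}|^2|\mathcal{A}|}{n_o}} + \frac{C^{\pi^*}|\mathcal{S}||\mathcal{A}|}{n_o} \right), \qquad C^{\pi^*} = \max_{(s,a)} \frac{d^{\pi^*}_P(s,a)}{\rho(s,a)}, \tag{B.3}
\]

where $c_1$ and $c_2$ are some universal constants.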

The quantity $C^{\pi_e}$ measures the difference between the distribution of the expert and that of the batch data. It can be much smaller than the common concentrability coefficients in offline RL:

\[
\max_{\pi\in\Pi}\max_{(s,a)\in\mathcal{S}\times\mathcal{A}} \frac{d^{\pi}_P(s,a)}{\rho(s,a)}, \qquad \frac{1}{\min_{(s,a)}\rho(s,a)},
\]

which measure the worst-case discrepancy between all policies in $\Pi$ and the batch data (Yin and Wang, 2020). Assuming these quantities are bounded implies that $\rho$ has global coverage. We achieve the better bound via pessimism. In the RL case, a bound similar to (B.3) has been obtained for offline policy optimization based on FQI (Rashidinejad et al., 2021). However, their work is limited to the tabular case. Hereafter, we show that our result extends to more general continuous MDPs.
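To make the contrast between partial and global coverage concrete, here is a small Python sketch on a two-state, two-action example. The dictionaries `rho` and `d_expert` and both function names are hypothetical and chosen only for illustration.

```python
def concentrability(d_pi, rho):
    """max over (s, a) visited by the policy of d_pi(s, a) / rho(s, a)."""
    worst = 0.0
    for pair, mass in d_pi.items():
        if mass == 0.0:
            continue
        covered = rho.get(pair, 0.0)
        if covered == 0.0:
            return float("inf")  # the policy visits a pair the batch data never covers
        worst = max(worst, mass / covered)
    return worst


def global_coverage_constant(rho, all_pairs):
    """1 / min over all (s, a) of rho(s, a); infinite if any pair is uncovered."""
    smallest = min(rho.get(pair, 0.0) for pair in all_pairs)
    return float("inf") if smallest == 0.0 else 1.0 / smallest


all_pairs = [(s, a) for s in range(2) for a in range(2)]
rho = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.0}  # batch never sees (1, 1)
d_expert = {(0, 0): 0.5, (1, 0): 0.5}                          # expert never visits (1, 1)

c_expert = concentrability(d_expert, rho)
c_global = global_coverage_constant(rho, all_pairs)
print(c_expert, c_global)
```

In this example the expert-relative coefficient is finite even though the global-coverage quantity is infinite, which is the situation where a bound in terms of $C^{\pi_e}$ remains informative.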

B.3.2 KNRs

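As in Proposition 43, $\sigma(s,a)$ is given by $(\beta_{n_o}/\zeta)\|\phi(s,a)\|_{\Sigma_{n_o}^{-1}}$. Thus, from Theorem 10, the final error bound of $V^{\hat\pi_{IL}}_{P,c} - V^{\pi_e}_{P,c}$ is

\[
(6H^2 + 2H)\min\left( E_{(s,a)\sim d^{\pi_e}_P}\left[ (\beta_{n_o}/\zeta)\|\phi(s,a)\|_{\Sigma_{n_o}^{-1}} \right], 1 \right) + 2H\sqrt{\log(2|\mathcal{F}|/\delta)/(2n_e)}.
\]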



Hereafter, we analyze $\beta_{n_o}$ and $E_{(s,a)\sim d^{\pi_e}_P}[\|\phi(s,a)\|_{\Sigma_{n_o}^{-1}}]$.

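Analysis of information gain. First, we analyze $\beta_{n_o}$. We need to upper-bound the information gain $\bar I_{n_o}$ in $\beta_{n_o}$. Recall $\Sigma_\rho = E_{(s,a)\sim\rho}[\phi(s,a)\phi^\top(s,a)]$ and $\phi(s,a) \in \mathbb{R}^d$.

Theorem 50 (Finite sample analysis of information gain in finite-dimensional linear models). Assume $\|\phi(s,a)\|_2 \le 1$ for all $(s,a) \in \mathcal{S}\times\mathcal{A}$. Let $c_1, c_2$ be universal constants.

1. When $\lambda = \Omega(1)$, with probability $1-\delta$, we have

\[
\bar I_{n_o} = \log(\det(\Sigma_{n_o})/\det(\lambda I)) \le c_1\,\mathrm{rank}(\Sigma_\rho)\{\mathrm{rank}(\Sigma_\rho) + \log(c_2/\delta)\}\log(1 + n_o).
\]

2. When $\lambda = \Omega(1)$ and $\zeta^2 = \Omega(1)$, with probability $1-\delta$, we have

\[
\beta_{n_o} \le c_1\sqrt{\|W^*\|^2 + d_S\,\mathrm{rank}(\Sigma_\rho)\{\mathrm{rank}(\Sigma_\rho) + \log(c_2/\delta)\}\log(1 + n_o)}.
\]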

Theorem 50 states $\bar I_{n_o} = O(\mathrm{rank}[\Sigma_\rho]^2\log(n_o))$. We highlight the novelty of our analysis compared with the existing literature. Seeger et al. (2008) analyze the expectation of the information gain in a fixed or random design setting. Following their discussion, we can prove

\[
E[\bar I_{n_o}] \le \mathrm{rank}(\Sigma_\rho)\log(1 + n_o)
\]

by Jensen's inequality, as in Theorem 72. Going beyond the expectation, we derive the finite-sample result by leveraging the variational representation and the uniform law with localization in Lemma 69. The finite-sample analysis is much harder than bounding the expectation.

The worst case of $\bar I_{n_o}$, referred to as the maximum information gain, has often been used in online learning (Srinivas et al., 2010; Abbasi-yadkori et al., 2011; Kakade et al., 2020b). From their discussion, we always have $\bar I_{n_o} = O(d\log(n_o))$. Here, we show that the information gain can be upper-bounded more tightly when $\mathrm{rank}[\Sigma_\rho]^2 \le d$ in offline RL (a random design setting). Compared with the analysis of the maximum information gain, our analysis takes the low-rankness of the design matrix $\Sigma_\rho$ into consideration by fully utilizing the random design assumption.

Analysis of $E_{(s,a)\sim d^{\pi_e}_P}[\|\phi(s,a)\|_{\Sigma_{n_o}^{-1}}]$ and the final bound

Next, we analyze $E_{(s,a)\sim d^{\pi_e}_P}[\|\phi(s,a)\|_{\Sigma_{n_o}^{-1}}]$.

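Theorem 51. Suppose $\lambda = \Omega(1)$, $\zeta^2 = \Omega(1)$, and $\|W^*\|^2 = \Omega(1)$. Let $c_1, c_2$ be some universal constants.

1. With probability $1-\delta$,

\[
E_{(s,a)\sim d^{\pi_e}_P}[\|\phi(s,a)\|_{\Sigma_{n_o}^{-1}}] \le c_1\sqrt{\frac{C^{\pi_e}\,\mathrm{rank}[\Sigma_\rho]\{\mathrm{rank}[\Sigma_\rho] + \log(c_2/\delta)\}}{n_o}}, \qquad C^{\pi_e} = \sup_{x\in\mathbb{R}^d}\left( \frac{x^\top\Sigma_{\pi_e}x}{x^\top\Sigma_\rho x} \right),
\]

where $\Sigma_{\pi_e} = E_{(s,a)\sim d^{\pi_e}_P}[\phi(s,a)\phi(s,a)^\top]$.

2. With probability $1-\delta$,

\[
V^{\hat\pi_{IL}}_{P,c} - V^{\pi_e}_{P,c} \le \mathrm{Err}_o + \mathrm{Err}_e, \qquad \bar R = \mathrm{rank}[\Sigma_\rho]\{\mathrm{rank}[\Sigma_\rho] + \log(c_2/\delta)\}, \tag{B.4}
\]

\[
\mathrm{Err}_o = c_1 H^2 \min(d^{1/2}, \bar R)\sqrt{\bar R}\sqrt{\frac{d_S C^{\pi_e}\log(1 + n_o)}{n_o}}, \qquad \mathrm{Err}_e = 2H\sqrt{\log(2|\mathcal{F}|/\delta)/(2n_e)}.
\]

3. Let $C^{\pi^*} = \sup_{x\in\mathbb{R}^d}\left( \frac{x^\top\Sigma_{\pi^*}x}{x^\top\Sigma_\rho x} \right)$ and $\Sigma_{\pi^*} = E_{(s,a)\sim d^{\pi^*}_P}[\phi(s,a)\phi^\top(s,a)]$. Then, with probability $1-\delta$, we have

\[
V^{\hat\pi_{RL}}_{P,c} - V^{\pi^*}_{P,c} \le c_1 H^2\{\mathrm{rank}(\Sigma_\rho) + \log(c_2/\delta)\}\,\mathrm{rank}(\Sigma_\rho)\sqrt{\frac{d_S C^{\pi^*}\log(1 + n_o)}{n_o}}.
\]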

The final bound (B.4) suggests that $\mathrm{Err}_o$ is $\tilde O(H^2\,\mathrm{rank}[\Sigma_\rho]^2\sqrt{d_S C^{\pi_e}/n_o})$. We can also get $\tilde O(H^2\,\mathrm{rank}[\Sigma_\rho]\,d^{1/2}\sqrt{d_S C^{\pi_e}/n_o})$, which implies $\tilde O(H^2 d^{3/2}\sqrt{d_S C^{\pi_e}/n_o})$. In other words, when $C^{\pi_e}$ and $\mathrm{rank}[\Sigma_\rho]$ are not too large and the offline sample size $n_o$ is large enough, $O(H\sqrt{\log(|\mathcal{F}|)/n_e})$ is the dominating term, and the covariate shift problem in BC can be avoided since the horizon dependence is just $H$. Recall that the known BC error bound is $O(H^2\sqrt{\log|\Pi|/n_e})$ (Agarwal et al., 2019, Chapter 14).

We now look at the implications of $\mathrm{Err}_o$ in more detail; this term also corresponds to the error in the RL case. The rate with respect to $n_o$ is $n_o^{-1/2}$, which is the standard rate in parametric regression. Besides, the bound depends on $\mathrm{rank}[\Sigma_\rho]$ and $C^{\pi_e}$. Importantly, since we always have $\mathrm{rank}[\Sigma_\rho] \le d$, our final bound captures the possible low-rankness of the batch data. The quantity $C^{\pi_e}$ corresponds to the $\pi_e$-concentrability coefficient ($\pi^*$-concentrability in the RL case). It can be much smaller than the worst-case concentrability coefficients:

\[
\sup_{\pi\in\Pi} C^{\pi}, \qquad \tilde C = \sup_{(s,a)}\|\phi(s,a)\|_2^2\,\|\Sigma_\rho^{-1}\|_2.
\]

Finally, we note the technical novelty by comparing our analysis to the techniques developed in the offline RL literature. A quantity similar to $E_{(s,a)\sim d^{\pi_e}_P}[\|\phi(s,a)\|_{\Sigma_{n_o}^{-1}}]$ has been analyzed in Jin et al. (2020b)¹, who study the error bound of FQI with pessimism in linear MDPs. Jin et al. (2020b, Corollary 4.5) assume full coverage, i.e., that $\Sigma_\rho$ is full-rank and has lower-bounded eigenvalues. In addition, the number of offline samples $n_o$ they require depends on the smallest eigenvalue. Our analysis uses only partial coverage, through the refined concept of the relative condition number, and thus does not require the full-rank assumption on $\Sigma_\rho$. Moreover, our bound is distribution dependent, i.e., it depends on $\mathrm{rank}[\Sigma_\rho]$ rather than on the ambient dimension of the feature vector $\phi$. Thus the bound is much tighter in benign cases where the offline data from $\rho$ happens to concentrate on a low-dimensional subspace. Beyond the model-based offline RL literature, one can potentially adapt model-free offline policy evaluation results with linear function approximation (e.g., Duan et al. (2020); Wang et al. (2020a)) to offline policy optimization (without pessimism). Such model-free results will also incur $\sup_{\pi\in\Pi}C^{\pi}$, $\tilde C$, and the ambient dimension $d$, instead of the much more refined quantities $C^{\pi_e}$ and $\mathrm{rank}[\Sigma_\rho]$.

¹ They analyze $E_{(s,a)\sim d^{\pi^*}_P}[\|\phi(s,a)\|_{\Sigma_{n_o}^{-1}}]$, which also appears in our RL result in Theorem 51.

B.3.3 Gaussian processes

In this section, we give details on GPs. Note that prior works on model-free and model-

based offline IL do not have results for infinite-dimensional non-parametric models. Thus

our techniques developed in this section are new and relevant even to the offline RL

literature—a point that we will return to at the end of this section.

From Theorem 10, the final error is

\[
(6H^2 + 2H)\min\left( E_{x\sim d^{\pi_e}_P}\left[ (\beta_{n_o}/\zeta)\sqrt{k_{n_o}(x,x)} \right], 1 \right) + 2H\sqrt{\log(2|\mathcal{F}|/\delta)/(2n_e)},
\]

where $x = (s,a)$. Hereafter, we analyze $\beta_{n_o}$ and $E_{x\sim d^{\pi_e}_P}[\sqrt{k_{n_o}(x,x)}]$. Before going into the details, we repeat several important pieces of notation below.

In this section, following Srinivas et al. (2010), for simplicity, we suppose the following:

Assumption 52. $k(x,x) \le 1$ for all $x \in \mathcal{S}\times\mathcal{A}$; $k(\cdot,\cdot)$ is a continuous and positive semidefinite kernel; and $\mathcal{S}\times\mathcal{A}$ is a compact space.

Recall that we denote $x := (s,a)$ and that, by Mercer's theorem, we have orthonormal eigenfunctions and eigenvalues $\{\psi_i, \mu_i\}_{i=1}^{\infty}$. We denote the feature mapping $\phi(x) := [\sqrt{\mu_1}\psi_1(x), \ldots, \sqrt{\mu_\infty}\psi_\infty(x)]^\top$.

Assuming the eigenvalues $\{\mu_1, \ldots, \mu_\infty\}$ are in non-increasing order, we recall the effective dimension:

\[
d^* = \min\{ j \in \mathbb{N} : j \ge B(j+1)\,n_o/\zeta^2 \}, \qquad B(j) = \sum_{k=j}^{\infty}\mu_k.
\]



We also introduce the empirical version of $d^*$, where $\hat\mu_i$ are the eigenvalues of the Gram matrix $K_{n_o}$.

Definition 53 (Empirical effective dimension). $\hat d = \min\{ j \in \mathbb{N} : j \ge \hat B(j+1)/\zeta^2 \}$, where $\hat B(j) = \sum_{k=j}^{n_o}\hat\mu_k$.

Hereafter, for simplicity, we treat $\zeta^2 = 1$, that is, $\zeta^2 = \Omega(1)$. Then, since $n_o \ge B(n_o+1)\,n_o/\zeta^2$, we have $d^* \le n_o$.
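The definition of the effective dimension can be evaluated directly from a spectrum. The sketch below is an illustration under the assumption that the eigenvalue sequence has been truncated to a finite list; the function name `effective_dimension` and the example spectrum are hypothetical.

```python
def effective_dimension(eigenvalues, n_samples, zeta_sq=1.0):
    """Smallest j >= 1 with j >= B(j + 1) * n_samples / zeta_sq, where B(j) = sum_{k >= j} mu_k."""
    mu = sorted(eigenvalues, reverse=True)  # non-increasing order, as in the definition
    if not mu:
        raise ValueError("need at least one eigenvalue")
    tail = sum(mu)
    for j in range(1, len(mu) + 1):
        tail = max(tail - mu[j - 1], 0.0)  # tail now equals B(j + 1)
        if j >= tail * n_samples / zeta_sq:
            return j
    return len(mu)  # fallback in case floating-point residue keeps the tail positive


# A rank-3 spectrum embedded in ambient dimension 5 (values chosen to be exact in binary).
spectrum = [0.5, 0.25, 0.125, 0.0, 0.0]
d_star_large_n = effective_dimension(spectrum, n_samples=1000)
d_star_small_n = effective_dimension(spectrum, n_samples=4)
print(d_star_large_n, d_star_small_n)
```

For this finite-rank spectrum and a large sample size the value coincides with the rank, in line with the remark on linear kernels. Calling the same function with the Gram-matrix eigenvalues and `n_samples=1` evaluates the condition in Definition 53.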

The effective dimensions $\hat d$ and $d^*$ are widely used in the machine learning literature. The first quantity $d^*$ is often referred to as the degrees of freedom (Zhang, 2005; Bach, 2017). For finite-dimensional linear kernels $\{x \mapsto a^\top\phi(x), a \in \mathbb{R}^d\}$ ($k(x,x) = \phi^\top(x)\phi(x)$), $d^*$ is $\mathrm{rank}[E_{x\sim\rho}[\phi(x)\phi^\top(x)]]$. Thus, $d^*$ can be viewed as a natural extension of $\mathrm{rank}[E_{x\sim\rho}[\phi(x)\phi^\top(x)]]$ to infinite-dimensional models. The worst case of the second quantity,

\[
\max_{\{x_1\in\mathcal{S}\times\mathcal{A},\ldots,x_{n_o}\in\mathcal{S}\times\mathcal{A}\}} \hat d,
\]

is often used in the online learning literature (Valko et al., 2013; Janz et al., 2020). Up to logarithmic factors, it is equal to the maximum information gain (Srinivas et al., 2010),

\[
\max_{\{x_1\in\mathcal{S}\times\mathcal{A},\ldots,x_{n_o}\in\mathcal{S}\times\mathcal{A}\}} \log\det(I + K_{n_o}),
\]

as shown in Calandriello et al. (2019); Valko et al. (2013). Importantly, as we will see soon, since our setting is offline (a random design setting), $\hat d$ can be upper-bounded much more tightly than in their analysis.

Analysis of information gain. With the above in mind, we first analyze $\beta_{n_o}$. To do that, we need to bound the information gain $I_{n_o}$. From (Seeger et al., 2008, Lemma 1), we can easily prove

\[
E[I_{n_o}] \le \log(1 + n_o)\,d^*,
\]

as in Theorem 73. Going beyond the expectation, we derive the finite-sample error bound.

Theorem 54 (Finite sample analysis of information gain in infinite-dimensional models). Suppose Assumption 52 holds. Let $c_1$ and $c_2$ be universal constants.

1. We have

\[
I_{n_o} = \log(\det(I + \zeta^{-2}K_{n_o})) \le 2\hat d\,\{\log(1 + n_o/\zeta^2) + 1\}. \tag{B.5}
\]

2. When $\zeta^2 = \Omega(1)$, with probability $1-\delta$,

\[
I_{n_o} = \log(\det(I + \zeta^{-2}K_{n_o})) \le c_1\{d^* + \log(c_2/\delta)\}\,d^*\log(1 + n_o).
\]

3. When $\zeta^2 = \Omega(1)$, with probability $1-\delta$,

\[
\beta_{n_o} \le c_1\sqrt{d_S\log^3(c_2 d_S n_o/\delta)\{d^* + \log(c_2/\delta)\}\,d^*\log(1 + n_o)}.
\]

Theorem 54 states $I_{n_o} = O((d^*)^2\log(n_o))$. Our bound in the offline (random design) setting can be much tighter than in the online setting, that is, than the known upper bound on the maximum information gain in Srinivas et al. (2010), although that bound can always be used as a bound on $I_{n_o}$ with probability 1. This can be seen for linear kernels, as in the previous section. For $d$-dimensional linear kernels, the maximum information gain scales with $d$. On the other hand, $\{d^*\}^2 = \mathrm{rank}[\Sigma_\rho]^2$ can be much smaller than $d$.

Analysis of learning curves and the final bound. We bound $E_{x\sim d^{\pi_e}_P}[\sqrt{k_{n_o}(x,x)}]$, where

\[
k_{n_o}(x,x') = k(x,x') - \bar k_{n_o}(x)^\top(K_{n_o} + \zeta^2 I)^{-1}\bar k_{n_o}(x'), \qquad \{x_i\}_{i=1}^{n_o} \sim \rho(x),
\]

and $x = (s,a)$.



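Recalling the eigenvalues $\{\mu_i\}$ and eigenfunctions $\{\psi_i\}$ (which are orthonormal), we define the feature mapping $\phi(x) = [\sqrt{\mu_1}\psi_1(x), \ldots, \sqrt{\mu_\infty}\psi_\infty(x)]^\top$. Denote by $\Phi \in \mathbb{R}^{n_o\times\infty}$ the matrix whose $i$-th row is $\phi(x_i)$. Since $k(x,x') = \phi(x)^\top\phi(x')$, we can rewrite the kernel $k_{n_o}(x,x)$ as follows:

\[
\begin{aligned}
k_{n_o}(x,x) &= \phi(x)^\top\phi(x) - \phi(x)^\top\Phi^\top\left( \Phi\Phi^\top + \zeta^2 I \right)^{-1}\Phi\phi(x) \\
&= \phi(x)^\top\left[ I - \Phi^\top\left( \Phi\Phi^\top + \zeta^2 I \right)^{-1}\Phi \right]\phi(x) \\
&= \phi(x)^\top\left( I + \zeta^{-2}\Phi^\top\Phi \right)^{-1}\phi(x) \\
&= \phi(x)^\top\Sigma_{n_o}^{-1}\phi(x),
\end{aligned}
\]

where $\Sigma_{n_o} := I + \zeta^{-2}\sum_{i=1}^{n_o}\phi(x_i)\phi(x_i)^\top$, and we use the matrix inversion lemma in the third equality. Note that the infinite-dimensional inversion lemma is formalized in the proof.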
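The identity between the kernel form and the feature form of $k_{n_o}(x,x)$ can be checked numerically. The NumPy sketch below assumes a finite-dimensional truncation of the feature map (here $d = 3$) with randomly drawn features; the function name and all numeric values are illustrative assumptions.

```python
import numpy as np


def posterior_variance_two_ways(Phi, phi_x, zeta_sq):
    """Compute k_{n_o}(x, x) via the Gram matrix and via Sigma_{n_o} = I + Phi^T Phi / zeta^2."""
    n, d = Phi.shape
    K = Phi @ Phi.T        # Gram matrix K_{n_o}, entries k(x_i, x_j)
    k_bar = Phi @ phi_x    # vector with entries k(x_i, x)
    kernel_form = phi_x @ phi_x - k_bar @ np.linalg.solve(K + zeta_sq * np.eye(n), k_bar)
    Sigma = np.eye(d) + Phi.T @ Phi / zeta_sq
    feature_form = phi_x @ np.linalg.solve(Sigma, phi_x)
    return float(kernel_form), float(feature_form)


rng = np.random.default_rng(0)
Phi = rng.normal(size=(6, 3))   # six hypothetical offline points, three features each
phi_x = rng.normal(size=3)      # feature vector of the query point x
kernel_form, feature_form = posterior_variance_two_ways(Phi, phi_x, zeta_sq=0.5)
print(kernel_form, feature_form)
```

Both numbers should agree up to floating-point error, and both lie between 0 and $\phi(x)^\top\phi(x)$, consistent with $k_{n_o}(x,x)$ being a posterior variance.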

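Now we can use the definition of the relative condition number and Lemma 60 for a distribution change, i.e.,

\[
\begin{aligned}
E_{x\sim d^{\pi_e}_P}\left[ \sqrt{k_{n_o}(x,x)} \right]
&\le \sqrt{E_{x\sim d^{\pi_e}_P}[k_{n_o}(x,x)]}
= \sqrt{\mathrm{tr}\left( E_{x\sim d^{\pi_e}_P}[\phi(x)\phi(x)^\top]\,\Sigma_{n_o}^{-1} \right)} \\
&\le \sqrt{C^{\pi_e}\,\mathrm{tr}\left( E_{x\sim\rho}[\phi(x)\phi(x)^\top]\,\Sigma_{n_o}^{-1} \right)}
= \sqrt{C^{\pi_e}E_{x\sim\rho}[k_{n_o}(x,x)]},
\end{aligned}
\]

where

\[
C^{\pi_e} = \sup_{\|x\|_2\le 1}\frac{x^\top\Sigma_{\pi_e}x}{x^\top\Sigma_\rho x}, \qquad \Sigma_{\pi_e} = E_{x\sim d^{\pi_e}_P}[\phi(x)\phi(x)^\top], \qquad \Sigma_\rho = E_{x\sim\rho}[\phi(x)\phi(x)^\top].
\]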

Now we only need to focus on analyzing Ex∼ρ[kno(x, x)].

Before proceeding to the analysis, we introduce the critical radius (Bartlett et al., 2005). Given some function class $\mathcal{F}$, consider the localized population Rademacher complexity:

\[
R_n(\delta;\mathcal{F}) = E\left[ \sup_{f\in\mathcal{F},\,E_{x\sim\rho}[f^2(x)]\le\delta} \left| \frac{1}{n_o}\sum_{i=1}^{n_o}\epsilon_i f(x_i) \right| \right],
\]

where $\{x_i\}$ are i.i.d. samples following $\rho(x)$ and $\{\epsilon_i\}$ are i.i.d. Rademacher variables taking values in $\{-1,+1\}$ equiprobably, independent of the sequence $\{x_i\}$. The critical radius is defined as the minimum solution to

\[
R_n(\xi;\mathcal{F}) \le \xi^2/b
\]

with respect to $\xi$, where $b$ is a value such that $\|\mathcal{F}\|_\infty \le b$.

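Theorem 55. Suppose Assumption 52 holds. Let $c_1$ and $c_2$ be universal constants.

1. Let $\delta_{n_o}$ be the critical radius of the function class $\{f : f \in \mathcal{H}_k, \|f\|_k \le 1\}$. With probability $1-\delta$,

\[
E_{x\sim d^{\pi_e}_P}\left[ \sqrt{k_{n_o}(x,x)} \right] \le c_1\zeta\,\delta'_{n_o}\sqrt{C^{\pi_e}d^*},
\]

where $\delta'_{n_o} = \delta_{n_o} + \sqrt{\log(c_2/\delta)/n_o}$.

2. Assume $\zeta^2 = \Omega(1)$. With probability $1-\delta$,

\[
\delta_{n_o} \le c_1\sqrt{d^*/n_o}, \qquad E_{x\sim d^{\pi_e}_P}\left[ \sqrt{k_{n_o}(x,x)} \right] \le c_1\sqrt{\frac{C^{\pi_e}d^*\{d^* + \log(c_2/\delta)\}}{n_o}}.
\]

3. Assume $\zeta^2 = \Omega(1)$. With probability $1-\delta$,

\[
V^{\hat\pi_{IL}}_{P,c} - V^{\pi_e}_{P,c} \le \mathrm{Err}_o + \mathrm{Err}_e, \tag{B.6}
\]

\[
\mathrm{Err}_o = c_1 H^2\{d^* + \log(c_2/\delta)\}\,d^*\sqrt{\frac{d_S C^{\pi_e}\log^3(c_2 d_S n_o/\delta)\log(1 + n_o)}{n_o}}, \qquad \mathrm{Err}_e = 2H\sqrt{\log(2|\mathcal{F}|/\delta)/(2n_e)}.
\]

4. Assume $\zeta^2 = \Omega(1)$. For offline RL, with probability $1-\delta$,

\[
V^{\hat\pi_{RL}}_{P,c} - V^{\pi^*}_{P,c} \le c_1 H^2\{d^* + \log(c_2/\delta)\}\,d^*\sqrt{\frac{d_S C^{\pi^*}\log^3(c_2 d_S n_o/\delta)\log(1 + n_o)}{n_o}},
\]

where $C^{\pi^*} = \sup_{\|x\|_2\le 1}\frac{x^\top\Sigma_{\pi^*}x}{x^\top\Sigma_\rho x}$.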



The final bound in (B.6) suggests that $\mathrm{Err}_o$ is $\tilde O(H^2\{d^*\}^2\sqrt{d_S C^{\pi_e}/n_o})$. In other words, when $C^{\pi_e}$ and $d^*$ are not too large and the offline sample size is large enough, $\mathrm{Err}_e$ dominates $\mathrm{Err}_o$, and the covariate shift problem in BC can be avoided since the horizon dependence is just $H$. Our bound is the natural extension of Theorem 51 to possibly infinite-dimensional models.

The first and second statements in Theorem 55 are mainly proved in two steps: formulating $k_{n_o}(x,x)$ in its variational representation and utilizing the uniform law with localization. Note that, depending on the kernel, the critical radius can be upper-bounded more tightly than $O(\sqrt{d^*/n_o})$. Besides, $C^{\pi_e}$ can be replaced with a tighter quantity:

\[
\max_{i\in\mathbb{N}} E_{(s,a)\sim d^{\pi_e}_P}[\psi_i^2(s,a)].
\]

Since $E_{(s,a)\sim\rho}[\psi_i^2(s,a)] = 1$, this quantity also measures the difference between the batch data and the expert. It is no larger than $C^{\pi_e}$, noting that $\frac{x^\top\Sigma_{\pi_e}x}{x^\top\Sigma_\rho x} = E_{(s,a)\sim d^{\pi_e}_P}[\psi_i^2(s,a)]$ when $x$ is the vector whose $i$-th element is 1 and whose other elements are 0. The third statement in Theorem 55 is directly proved by combining the second statement in Theorem 55 and Theorem 54.

Implication to offline RL. The final statement in Theorem 55 is the bound for the RL case. This is the first result showing an error bound for pessimistic offline RL with nonparametric models. As related literature, in model-free offline RL, Uehara et al. (2021); Duan et al. (2021) obtained finite-sample error bounds characterized by the critical radius for minimax-type estimators called Modified BRM (Antos et al., 2008). As we show in Theorem 55, since the critical radius of an RKHS ball is upper-bounded by the effective dimension $d^*$, their bounds are also characterized by the effective dimension. On top of that, several papers derived bounds under the general function approximation setting: FQI (Fan et al., 2020; Duan et al., 2021; Munos and Szepesvári, 2008; Chen and Jiang, 2019), marginal weighting based estimators (Uehara et al., 2020), DICE methods (Zhang et al., 2020; Nachum et al., 2019b), policy based methods (Liao et al., 2020; Liu et al., 2020), and MABO (Xie and Jiang, 2020). In contrast to our result, all of their bounds depend on

\[
\sup_{\pi\in\Pi}\sup_{(s,a)}\frac{d^{\pi}_P(s,a)}{\rho(s,a)} \quad \text{or} \quad \sup_{(s,a)}\frac{1}{\rho(s,a)}.
\]

The pessimistic bonus allows us to obtain a bound depending only on $C^{\pi^*}$ and not on the above constants. Besides, our $C^{\pi^*}$ in Theorem 55 is a more refined quantity than the density ratios, in the sense that it is defined in terms of the relative condition number. Note that we can easily obtain statements that replace $C^{\pi^*}$ in Theorem 55 with $\frac{d^{\pi^*}_P(s,a)}{\rho(s,a)}$.

Remark 56 (Relation with more general offline RL literature). Because exploration is not possible in the offline setting, handling insufficient coverage of the offline data is known to be a challenging problem (Zanette, 2020; Wang et al., 2020a). We use penalty terms based on model-based RL. In the above, we explained how the penalty term in MILO (and its RL counterpart) translates into the final sample error bounds. The idea of penalization has been utilized in a variety of other ways in offline RL. One way is to impose constraints on the policy class or Q-function class so that the estimated policies do not deviate too far from the behavior policy. For example, one can use KL divergences, the MMD distance, or the Wasserstein distance to measure the distance from the behavior policy (Wu et al., 2019; Fakoor et al., 2021; Matsushima et al., 2020; Touati et al., 2020; Fujimoto et al., 2019) and add $D(\pi, \pi_b)$ as a penalty term, where $\pi_b$ is the behavior policy. Another way is to explicitly estimate a lower bound of the Q-functions (Kumar et al., 2020; Yu et al., 2021, 2020). By doing so, we can avoid overestimating the Q-functions in unknown (non-covered) regions.

Remark 57 (Relation with GP literature). The quantity $E_{x\sim\rho(x)}[k_{n_o}(x,x)]$ is often referred to as the learning curve in the GP literature (Williams and Vivarelli, 2000; Sollich and Halees, 2002; Rasmussen and Williams, 2005). Those analyses mainly take a numerical viewpoint, that is, how to approximately calculate $E_{x\sim\rho(x)}[k_{n_o}(x,x)]$. Though Le Gratiet et al. (2015) analyze the convergence property, their analysis is limited to the expectation and the result is asymptotic. As far as we know, our result is the first to show a finite-sample error rate.

Remark 58 (Duality between KNRs and GPs). KNRs and GPs have a primal and dual relationship via Mercer's theorem. In fact, as we have seen, since $k(\cdot,\cdot) = \langle\phi(\cdot),\phi(\cdot)\rangle$, we have $k_{n_o}(x,x) = \phi(x)^\top\Sigma_{n_o}^{-1}\phi(x)$. Thus, our result for GPs can be applied to infinite-dimensional KNRs with $\phi : \mathcal{S}\times\mathcal{A} \mapsto \mathcal{H}$, where $\mathcal{H}$ is some RKHS.

Remark 59 (Online RL using RKHS). There are several works on online RL using RKHS, both model-based (Calandriello et al., 2019), like our work, and model-free (Agarwal et al., 2020a; Yang et al., 2020; Du et al., 2021). In both cases, the final error bounds incur the maximum information gain, i.e., a worst-case quantity that is distribution independent. In contrast, our final bounds use the distribution-dependent quantity $d^*$.

B.3.4 Missing Proofs

Below, we provide missing proofs for tabular MDPs, KNRs, and non-parametric GP

models.

Missing proofs for tabular result

We start by providing proof of the tabular MDP result.



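Proof of Theorem 49. We use Theorem 10. Then, we have

\[
V^{\hat\pi_{IL}}_{P,c} - V^{\pi_e}_{P,c} \le (6H^2 + 2H)\min\left( 1, E_{(s,a)\sim d^{\pi_e}_P}[\sigma(s,a)] \right) + 2H\epsilon_{stat}.
\]

Hereafter, we show how to upper-bound $E_{(s,a)\sim d^{\pi_e}_P}[\sigma(s,a)]$. We use Lemma 65. Then, by letting $\xi = c_1\log(|\mathcal{S}||\mathcal{A}|c_2/\delta)$, with probability $1-\delta$, we have

\[
\frac{1}{N(s,a) + \lambda} \le \frac{\xi}{n_o\rho(s,a) + \lambda} \quad \forall (s,a) \in \mathcal{S}\times\mathcal{A}.
\]

We condition on the above event. Then,

\[
\begin{aligned}
E_{(s,a)\sim d^{\pi_e}_P}[\sigma(s,a)]
&\le E_{(s,a)\sim d^{\pi_e}_P}\left[ \sqrt{\frac{|\mathcal{S}|\log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)}{2\{N(s,a) + \lambda\}}} + \frac{\lambda}{N(s,a) + \lambda} \right] \\
&\le \sqrt{E_{(s,a)\sim d^{\pi_e}_P}\left[ \frac{|\mathcal{S}|\log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)}{2\{N(s,a) + \lambda\}} \right]} + E_{(s,a)\sim d^{\pi_e}_P}\left[ \frac{\lambda}{N(s,a) + \lambda} \right].
\end{aligned}
\]

From Lemma 65, we have

\[
\begin{aligned}
E_{(s,a)\sim d^{\pi_e}_P}[\sigma(s,a)]
&\le \sqrt{\xi\,E_{(s,a)\sim d^{\pi_e}_P}\left[ \frac{|\mathcal{S}|\log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)}{n_o\rho(s,a) + \lambda} \right]} + E_{(s,a)\sim d^{\pi_e}_P}\left[ \frac{\lambda\xi}{n_o\rho(s,a) + \lambda} \right] \\
&\le \sqrt{\xi C^{\pi_e}E_{(s,a)\sim\rho}\left[ \frac{|\mathcal{S}|\log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)}{n_o\rho(s,a) + \lambda} \right]} + C^{\pi_e}E_{(s,a)\sim\rho}\left[ \frac{\lambda\xi}{n_o\rho(s,a) + \lambda} \right] \\
&\le \sqrt{\xi C^{\pi_e}\sum_{s,a}\left[ \frac{\{|\mathcal{S}|\log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)\}\rho(s,a)}{n_o\rho(s,a) + \lambda} \right]} + C^{\pi_e}\sum_{s,a}\left[ \frac{\rho(s,a)\lambda\xi}{n_o\rho(s,a) + \lambda} \right] \\
&\le \sqrt{\xi C^{\pi_e}\{|\mathcal{S}|\log 2 + \log(2|\mathcal{S}||\mathcal{A}|/\delta)\}|\mathcal{S}||\mathcal{A}|/n_o} + \lambda C^{\pi_e}\xi|\mathcal{S}||\mathcal{A}|/n_o,
\end{aligned}
\]

where again

\[
C^{\pi_e} = \max_{(s,a)}\frac{d^{\pi_e}_P(s,a)}{\rho(s,a)}.
\]

This concludes the proof. □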

Missing proofs for KNR results

Next we move to provide proofs for the KNR results.



Proof of Theorem 50. In the proof, we use two statements, Equation (B.8) and Equation (B.9), from the proof of Theorem 51. We recommend that readers read the proof of Theorem 51 first.

We denote the eigenvalues of $\sum_{i=1}^{n_o}\phi(s_i,a_i)\phi^\top(s_i,a_i)$ by $\{\hat\mu_i\}_{i=1}^{d}$ such that $\hat\mu_1 \ge \hat\mu_2 \ge \cdots$. Since we assume $\|\phi(s,a)\|_2 \le 1$, we have $\hat\mu_1 \le n_o$.

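First step. We first show

\[
\log(\det(\Sigma_{n_o})/\det(\lambda I)) \le \mathrm{tr}\left( \Sigma_{n_o}^{-1}\sum_{i=1}^{n_o}\phi(s_i,a_i)\phi^\top(s_i,a_i) \right)\{\log(1 + n_o/\lambda) + 1\}.
\]

Note that this directly shows $\log(\det(\Sigma_{n_o})/\det(\lambda I)) \le d\log(1 + n_o/\lambda)$ for $\phi(s,a) \in \mathbb{R}^d$. The above is proved as follows:

\[
\begin{aligned}
\log(\det(\Sigma_{n_o})/\det(\lambda I))
&= \sum_{i=1}^{d}\log\left( 1 + \frac{\hat\mu_i}{\lambda} \right)
= \sum_{i=1}^{d}\log\left( 1 + \frac{\hat\mu_i}{\lambda} \right)\frac{\hat\mu_i/\lambda + 1}{\hat\mu_i/\lambda + 1} \\
&= \sum_{i=1}^{d}\left\{ \log\left( 1 + \frac{\hat\mu_i}{\lambda} \right)\frac{\hat\mu_i/\lambda}{\hat\mu_i/\lambda + 1} + \log\left( 1 + \frac{\hat\mu_i}{\lambda} \right)\frac{1}{\hat\mu_i/\lambda + 1} \right\} \\
&\le \log\left( 1 + \frac{\hat\mu_1}{\lambda} \right)\sum_{i=1}^{d}\frac{\hat\mu_i/\lambda}{\hat\mu_i/\lambda + 1} + \sum_{i=1}^{d}\frac{\hat\mu_i/\lambda}{\hat\mu_i/\lambda + 1} && (\log(1 + x) \le x) \\
&\le \{\log(1 + n_o/\lambda) + 1\}\sum_{i=1}^{d}\frac{\hat\mu_i/\lambda}{\hat\mu_i/\lambda + 1} && (\hat\mu_1 \le n_o) \\
&= \{\log(1 + n_o/\lambda) + 1\}\,\mathrm{tr}\left( \Sigma_{n_o}^{-1}\sum_{i=1}^{n_o}\phi(s_i,a_i)\phi^\top(s_i,a_i) \right).
\end{aligned}
\]

In the last line, letting $UVU^\top$ be the eigendecomposition of $\sum_{i=1}^{n_o}\phi(s_i,a_i)\phi^\top(s_i,a_i)$, we use

\[
\mathrm{tr}\left( \Sigma_{n_o}^{-1}\sum_{i=1}^{n_o}\phi(s_i,a_i)\phi^\top(s_i,a_i) \right) = \mathrm{tr}\left[ \{V + \lambda I\}^{-1}V \right] = \sum_{i=1}^{d}\frac{\hat\mu_i/\lambda}{\hat\mu_i/\lambda + 1}.
\]

Then, the first statement is proved.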
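The chain of inequalities in the first step only involves the eigenvalues $\hat\mu_i$, so it can be checked with a few lines of Python. The function name `logdet_chain` and the example eigenvalues are illustrative assumptions; the inputs are required to satisfy $0 \le \hat\mu_i \le n_o$, as in the proof.

```python
import math


def logdet_chain(mu_hat, lam, n_o):
    """Return (log det ratio, trace term, upper bound) for the first step of the proof."""
    if lam <= 0 or any(m < 0 or m > n_o for m in mu_hat):
        raise ValueError("need lam > 0 and 0 <= mu_hat_i <= n_o")
    ratios = [m / lam for m in mu_hat]
    log_det_ratio = sum(math.log1p(r) for r in ratios)   # sum_i log(1 + mu_i / lam)
    trace_term = sum(r / (r + 1.0) for r in ratios)      # tr(Sigma^{-1} sum_i phi_i phi_i^T)
    upper = (math.log1p(n_o / lam) + 1.0) * trace_term   # {log(1 + n_o / lam) + 1} * trace
    return log_det_ratio, trace_term, upper


# Hypothetical spectrum: d = 5, n_o = 50, eigenvalues summing to at most n_o.
log_det_ratio, trace_term, upper = logdet_chain([40.0, 7.5, 2.0, 0.1, 0.0], lam=1.0, n_o=50)
print(log_det_ratio, trace_term, upper)
```

The trace term is at most $d$ but can be much smaller when only a few eigenvalues are large, which is what the second step of the proof exploits.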



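Second step. Next, we prove the second statement. We have

\[
\mathrm{tr}\left( \Sigma_{n_o}^{-1}\sum_{i=1}^{n_o}\phi(s_i,a_i)\phi^\top(s_i,a_i) \right) = \sum_{i=1}^{n_o}\phi^\top(s_i,a_i)\Sigma_{n_o}^{-1}\phi(s_i,a_i).
\]

Then, from (B.8), with probability $1-\delta$,

\[
\sum_{i=1}^{n_o}\phi^\top(s_i,a_i)\Sigma_{n_o}^{-1}\phi(s_i,a_i) \lesssim c_1\{\mathrm{rank}[\Sigma_\rho] + \log(c_2/\delta)\}\sum_{i=1}^{n_o}\phi^\top(s_i,a_i)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s_i,a_i). \tag{B.7}
\]

Hereafter, we condition on the above event. To upper-bound $\sum_{i=1}^{n_o}\|\phi(s_i,a_i)\|^2_{\{n_o\Sigma_\rho + \lambda I\}^{-1}}$, we use Bernstein's inequality:

\[
\left| \sum_{i=1}^{n_o}\phi^\top(s_i,a_i)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s_i,a_i) - n_o E_{(s,a)\sim\rho}[\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a)] \right|
\lesssim \sqrt{n_o\,\mathrm{Var}_{(s,a)\sim\rho}[\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a)]} + 1/\lambda,
\]

since $\|\phi(s,a)\|^2_{\{n_o\Sigma_\rho + \lambda I\}^{-1}} \le 1/\lambda$ for all $(s,a) \in \mathcal{S}\times\mathcal{A}$. Here, from (B.9),

\[
n_o E_{(s,a)\sim\rho}[\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a)] \le \mathrm{rank}[\Sigma_\rho].
\]

Besides,

\[
\begin{aligned}
\mathrm{Var}_{(s,a)\sim\rho}[\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a)]
&\le E_{(s,a)\sim\rho}[\{\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a)\}^2] \\
&\le (1/\lambda)\,E_{(s,a)\sim\rho}[\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a)] \\
&\le \mathrm{rank}[\Sigma_\rho]/(n_o\lambda). && \text{(from (B.9))}
\end{aligned}
\]

Thus,

\[
\sum_{i=1}^{n_o}\phi^\top(s_i,a_i)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s_i,a_i) \lesssim \mathrm{rank}[\Sigma_\rho] + \sqrt{\mathrm{rank}[\Sigma_\rho]/\lambda} + 1/\lambda.
\]

By combining (B.7) with the above, we have

\[
\log(\det(\Sigma_{n_o})/\det(\lambda I)) \le c_1\,\mathrm{rank}(\Sigma_\rho)\{\mathrm{rank}(\Sigma_\rho) + \log(c_2/\delta)\}\log(1 + n_o c_3),
\]

using $\lambda = \Omega(1)$.

□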



Before proving Theorem 51, we first present some lemmas.

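Lemma 60 (Distribution change). Consider two distributions $\rho_1 \in \Delta(\mathcal{S}\times\mathcal{A})$ and $\rho_2 \in \Delta(\mathcal{S}\times\mathcal{A})$, and a feature mapping $\phi : \mathcal{S}\times\mathcal{A} \mapsto \mathcal{H}$, where $\mathcal{H}$ is some Hilbert space (e.g., a finite-dimensional Euclidean space). Denote

\[
C := \sup_{x\in\mathcal{H}}\frac{x^\top E_{s,a\sim\rho_1}[\phi(s,a)\phi(s,a)^\top]x}{x^\top E_{s,a\sim\rho_2}[\phi(s,a)\phi(s,a)^\top]x}.
\]

Then, for any positive definite matrix (linear operator) $\Lambda$, we have:

\[
E_{s,a\sim\rho_1}[\phi(s,a)^\top\Lambda\phi(s,a)] \le C\,E_{s,a\sim\rho_2}[\phi(s,a)^\top\Lambda\phi(s,a)].
\]

Proof. Denote the eigendecomposition of $\Lambda$ by $\Lambda = U\Sigma U^\top$, with $\{\sigma_i, u_i\}$ the eigenvalue-eigenvector pairs. We have:

\[
\begin{aligned}
E_{s,a\sim\rho_1}[\phi(s,a)^\top\Lambda\phi(s,a)]
&= \sum_{i=0}^{\infty}\sigma_i u_i^\top E_{s,a\sim\rho_1}[\phi(s,a)\phi(s,a)^\top]u_i \\
&\le \sum_{i=0}^{\infty}\sigma_i C\,u_i^\top E_{s,a\sim\rho_2}[\phi(s,a)\phi(s,a)^\top]u_i \\
&= C\,E_{s,a\sim\rho_2}[\phi(s,a)^\top\Lambda\phi(s,a)],
\end{aligned}
\]

which concludes the proof. □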
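Lemma 60 can be sanity-checked in finite dimensions. In the NumPy sketch below, the two distributions are discrete weights over a small set of hypothetical feature vectors, and $C$ is computed as the largest eigenvalue of the whitened matrix $L^{-1}\Sigma_1 L^{-\top}$ with $\Sigma_2 = LL^\top$; all names and numeric choices are assumptions made for illustration.

```python
import numpy as np


def relative_condition_number(Sigma1, Sigma2):
    """sup_x (x^T Sigma1 x) / (x^T Sigma2 x) for a positive definite Sigma2."""
    L = np.linalg.cholesky(Sigma2)            # Sigma2 = L L^T
    half = np.linalg.solve(L, Sigma1)         # L^{-1} Sigma1
    whitened = np.linalg.solve(L, half.T)     # L^{-1} Sigma1 L^{-T} (Sigma1 is symmetric)
    return float(np.linalg.eigvalsh((whitened + whitened.T) / 2.0).max())


def distribution_change_sides(Phi, w1, w2, Lam):
    """Return (E_{rho1}[phi^T Lam phi], C * E_{rho2}[phi^T Lam phi]) for discrete rho1, rho2."""
    quad = np.sum((Phi @ Lam) * Phi, axis=1)  # phi_i^T Lam phi_i for each row of Phi
    Sigma1 = Phi.T @ (w1[:, None] * Phi)
    Sigma2 = Phi.T @ (w2[:, None] * Phi)
    C = relative_condition_number(Sigma1, Sigma2)
    return float(w1 @ quad), C * float(w2 @ quad)


rng = np.random.default_rng(1)
Phi = rng.normal(size=(8, 3))             # eight hypothetical (s, a) pairs, d = 3
w1 = rng.dirichlet(np.ones(8))            # rho_1
w2 = rng.dirichlet(np.ones(8))            # rho_2, strictly positive weights
A = rng.normal(size=(3, 3))
Lam = A @ A.T + 0.1 * np.eye(3)           # a positive definite Lambda
lhs, rhs = distribution_change_sides(Phi, w1, w2, Lam)
print(lhs, rhs)
```

The left-hand side should never exceed the right-hand side, since $C\Sigma_2 - \Sigma_1$ is positive semidefinite by the definition of $C$.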

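Proof of Theorem 51. Here, we prove the first statement. We need to upper-bound

\[
E_{(s,a)\sim d^{\pi_e}_P}\left[ \sqrt{\phi^\top(s,a)\Sigma_{n_o}^{-1}\phi(s,a)} \right].
\]

As the first step, we use Jensen's inequality:

\[
E_{(s,a)\sim d^{\pi_e}_P}\left[ \sqrt{\phi^\top(s,a)\Sigma_{n_o}^{-1}\phi(s,a)} \right] \le \sqrt{E_{(s,a)\sim d^{\pi_e}_P}\left[ \phi^\top(s,a)\Sigma_{n_o}^{-1}\phi(s,a) \right]}.
\]

Hereafter, we analyze $E_{(s,a)\sim d^{\pi_e}_P}\left[ \phi^\top(s,a)\Sigma_{n_o}^{-1}\phi(s,a) \right]$.

We first use the definition of the relative condition number $C^{\pi_e}$ and Lemma 60 to change the distribution from $d^{\pi_e}_P$ to $\rho$, i.e., via Lemma 60, we have:

\[
E_{s,a\sim d^{\pi_e}_P}[\phi(s,a)^\top\Sigma_{n_o}^{-1}\phi(s,a)] \le C^{\pi_e}E_{s,a\sim\rho}[\phi(s,a)^\top\Sigma_{n_o}^{-1}\phi(s,a)].
\]

Thus, below we just need to bound $E_{s,a\sim\rho}[\phi(s,a)^\top\Sigma_{n_o}^{-1}\phi(s,a)]$.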



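Concentration argument. In this step, we consider how to bound $E_{(s,a)\sim\rho}[\phi^\top(s,a)\Sigma_{n_o}^{-1}\phi(s,a)]$. To do that, we show that, with probability $1-\delta$,

\[
\phi^\top(s,a)\Sigma_{n_o}^{-1}\phi(s,a) \le c_1\{\mathrm{rank}(\Sigma_\rho) + \log(c_2/\delta)\}\,\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a) \quad \forall (s,a) \in \mathcal{S}\times\mathcal{A}. \tag{B.8}
\]

We use the variational representation:

\[
\begin{aligned}
\phi^\top(s,a)\Sigma_{n_o}^{-1}\phi(s,a)
&= \sup_{\{a\in\mathbb{R}^d : a^\top\Sigma_{n_o}a\le 1\}}\{a^\top\phi(s,a)\}^2 \\
&= \sup_{\{a\in\mathbb{R}^d : a^\top\Sigma_{n_o}a\le 1,\ \|a\|_2^2\le(1+\lambda)/\lambda,\ \|a^\top\phi\|_\infty\le 1/\lambda\}}\{a^\top\phi(s,a)\}^2.
\end{aligned}
\]

Note that in the first line, we use

\[
\sup_{\{a\in\mathbb{R}^d : a^\top\Sigma_{n_o}a\le 1\}} a^\top\phi(s,a) = \sup_{\{b\in\mathbb{R}^d : b^\top b\le 1\}} b^\top\Sigma_{n_o}^{-1/2}\phi(s,a) = \|\phi(s,a)\|_{\Sigma_{n_o}^{-1}}.
\]

From the first line to the second line, we use the fact that the maximum over $a$ is attained at $\tilde a = \Sigma_{n_o}^{-1}\phi(s,a)/\|\phi(s,a)\|_{\Sigma_{n_o}^{-1}}$ and

\[
\|\tilde a\|_2^2 = \phi^\top(s,a)\Sigma_{n_o}^{-2}\phi(s,a)/\{\phi^\top(s,a)\Sigma_{n_o}^{-1}\phi(s,a)\} \le (n_o + \lambda)/\lambda^2, \qquad |\tilde a^\top\phi| \le \|\phi(s,a)\|_{\Sigma_{n_o}^{-1}} \le 1/\lambda \quad \forall (s,a) \in \mathcal{S}\times\mathcal{A},
\]

noting $\|\phi(s,a)\|_2 \le 1$. By defining $\bar c = (n_o + \lambda)/\lambda^2$, we have, for all $(s,a) \in \mathcal{S}\times\mathcal{A}$,

\[
\begin{aligned}
\phi^\top(s,a)\Sigma_{n_o}^{-1}\phi(s,a)
&= \sup_{\{a\in\mathbb{R}^d : a^\top\Sigma_{n_o}a\le 1,\ \|a\|_2^2\le\bar c,\ \|a^\top\phi\|_\infty\le 1/\lambda\}}\{a^\top\phi(s,a)\}^2 \\
&= \sup_{\{a\in\mathbb{R}^d : a^\top\lambda I a + \sum_{i=1}^{n_o}\{a^\top\phi_i\}^2\le 1,\ \|a\|_2^2\le\bar c,\ \|a^\top\phi\|_\infty\le 1/\lambda\}}\{a^\top\phi(s,a)\}^2.
\end{aligned}
\]

Next, we use Lemma 66, that is, with probability $1-\delta$,

\[
\frac{1}{n_o}\sum_{i=1}^{n_o}f^2(s_i,a_i) \ge 0.5\,E_{(s,a)\sim\rho}[f^2(s,a)] - 0.5\{\delta'_{n_o}\}^2 \quad \forall f \in \mathcal{F},
\]

where

\[
\mathcal{F} = \{(s,a) \mapsto a^\top\phi(s,a) : a^\top\Sigma_{n_o}a \le 1,\ \|a\|_2^2 \le \bar c,\ \|a^\top\phi\|_\infty \le 1/\lambda,\ a \in \mathbb{R}^d\}.
\]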



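Here, $\delta'_{n_o} = \delta_{n_o} + \sqrt{\log(c_2/\delta)/n_o}$, where $\delta_{n_o}$ is the critical radius of the function class $\mathcal{F}$. Noting $\lambda = \Omega(1)$, from Lemma 67, $\delta'_{n_o} \le c_1\sqrt{\mathrm{rank}[\Sigma_\rho]/n_o} + \sqrt{\log(c_2/\delta)/n_o}$. By conditioning on the above event, for all $(s,a) \in \mathcal{S}\times\mathcal{A}$, we have

\[
\begin{aligned}
\|\phi(s,a)\|^2_{\Sigma_{n_o}^{-1}}
&\le \sup_{\{a\in\mathbb{R}^d : a^\top\lambda I a + 0.5 n_o E_{(s,a)\sim\rho}[\{a^\top\phi\}^2]\le 1 + 0.5 n_o\delta'^2_{n_o},\ \|a\|_2^2\le\bar c,\ \|a^\top\phi\|_\infty\le 1/\lambda\}}\{a^\top\phi(s,a)\}^2 \\
&\le \sup_{\{a\in\mathbb{R}^d : a^\top\{n_o\Sigma_\rho + \lambda I\}a\le 2 + n_o\delta'^2_{n_o},\ \|a\|_2^2\le\bar c,\ \|a^\top\phi\|_\infty\le 1/\lambda\}}\{a^\top\phi(s,a)\}^2 \\
&\le \sup_{\{a\in\mathbb{R}^d : a^\top\{n_o\Sigma_\rho + \lambda I\}a\le 2 + n_o\delta'^2_{n_o}\}}\{a^\top\phi(s,a)\}^2 \\
&= (2 + n_o\delta'^2_{n_o})\,\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a) \\
&\le c_1\{\mathrm{rank}[\Sigma_\rho] + \log(c_2/\delta)\}\,\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a).
\end{aligned}
\]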

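Last step. Then, the final bound is

\[
\begin{aligned}
E_{(s,a)\sim d^{\pi_e}_P}[\|\phi(s,a)\|_{\Sigma_{n_o}^{-1}}]
&\le \sqrt{C^{\pi_e}E_{(s,a)\sim\rho}[\phi^\top(s,a)\Sigma_{n_o}^{-1}\phi(s,a)]} \\
&\le c_1\sqrt{C^{\pi_e}\{\mathrm{rank}[\Sigma_\rho] + \log(c_2/\delta)\}\,E_{(s,a)\sim\rho}[\phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a)]}.
\end{aligned}
\]

Let $UVU^\top$ be the eigenvalue decomposition of $\Sigma_\rho$ such that $V_{i,i} = \mu_i$. We have

\[
E_{(s,a)\sim\rho}\left[ \phi^\top(s,a)\{n_o\Sigma_\rho + \lambda I\}^{-1}\phi(s,a) \right] = \mathrm{Tr}[\{n_o\Sigma_\rho + \lambda I\}^{-1}\Sigma_\rho] = \mathrm{Tr}[\{n_o V + \lambda I\}^{-1}V]
= \frac{1}{n_o}\sum_{i=1}^{d}\frac{\mu_i}{\mu_i + \lambda/n_o} \le \frac{\mathrm{rank}[\Sigma_\rho]}{n_o}. \tag{B.9}
\]

By combining all things together, with probability $1-\delta$,

\[
E_{(s,a)\sim d^{\pi_e}_P}[\|\phi(s,a)\|_{\Sigma_{n_o}^{-1}}] \le c_1\sqrt{\frac{C^{\pi_e}\,\mathrm{rank}[\Sigma_\rho]\{\mathrm{rank}[\Sigma_\rho] + \log(c_2/\delta)\}}{n_o}}.
\]

□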
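The trace identity behind (B.9) depends only on the eigenvalues of $\Sigma_\rho$, so the rank-dependent rate can be seen with a short Python sketch. The function name `trace_term` and the rank-2 spectrum below are hypothetical and serve only as an illustration.

```python
def trace_term(eigenvalues, n_o, lam):
    """Tr[(n_o * Sigma_rho + lam * I)^{-1} Sigma_rho], computed from the eigenvalues of Sigma_rho."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    return sum(mu / (n_o * mu + lam) for mu in eigenvalues)


# A rank-2 covariance in ambient dimension 4.
spectrum = [0.6, 0.3, 0.0, 0.0]
rank = sum(1 for mu in spectrum if mu > 0)
values = {n_o: trace_term(spectrum, n_o, lam=1.0) for n_o in (10, 100, 1000)}
for n_o, value in values.items():
    print(n_o, value, rank / n_o)
```

Each value stays below $\mathrm{rank}[\Sigma_\rho]/n_o$ and approaches it as $n_o$ grows; the ambient dimension does not enter.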

Missing proofs of non-parametric model

Finally, we provide missing proofs for the non-parametric GP model.



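Proof of Theorem 54. In the proof, we use two statements, (B.10) and (B.11), from the proof of Theorem 55. We recommend that readers read the proof of Theorem 55 first.

We denote the eigenvalues of $K_{n_o}$ by $\{\hat\mu_i\}_{i=1}^{n_o}$ such that $\hat\mu_1 \ge \hat\mu_2 \ge \cdots$. From Assumption 52, we have

\[
n_o \ge \mathrm{tr}(K_{n_o}) = \sum_{i=1}^{n_o}\hat\mu_i.
\]

This implies $\hat\mu_1 \le n_o$. Then,

\[
\begin{aligned}
\log(\det(I + \zeta^{-2}K_{n_o}))
&= \sum_{i=1}^{n_o}\log\left( 1 + \frac{\hat\mu_i}{\zeta^2} \right)
= \sum_{i=1}^{n_o}\log\left( 1 + \frac{\hat\mu_i}{\zeta^2} \right)\frac{\hat\mu_i/\zeta^2 + 1}{\hat\mu_i/\zeta^2 + 1} \\
&= \sum_{i=1}^{n_o}\left\{ \log\left( 1 + \frac{\hat\mu_i}{\zeta^2} \right)\frac{\hat\mu_i/\zeta^2}{\hat\mu_i/\zeta^2 + 1} + \log\left( 1 + \frac{\hat\mu_i}{\zeta^2} \right)\frac{1}{\hat\mu_i/\zeta^2 + 1} \right\} \\
&\le \log\left( 1 + \frac{\hat\mu_1}{\zeta^2} \right)\sum_{i=1}^{n_o}\frac{\hat\mu_i/\zeta^2}{\hat\mu_i/\zeta^2 + 1} + \sum_{i=1}^{n_o}\frac{\hat\mu_i/\zeta^2}{\hat\mu_i/\zeta^2 + 1} && (\log(1 + x) \le x) \\
&\le \{\log(1 + n_o/\zeta^2) + 1\}\sum_{i=1}^{n_o}\frac{\hat\mu_i/\zeta^2}{\hat\mu_i/\zeta^2 + 1} && (\hat\mu_1 \le n_o) \\
&\le \{\log(1 + n_o/\zeta^2) + 1\}\min_j\{ j + \hat B(j+1)/\zeta^2 \} \le 2\{\log(1 + n_o/\zeta^2) + 1\}\hat d,
\end{aligned}
\]

where the second-to-last inequality uses the fact that $\sum_{i=1}^{n_o}\frac{\hat\mu_i/\zeta^2}{\hat\mu_i/\zeta^2 + 1} \le j + \sum_{i=j+1}^{n_o}\hat\mu_i/\zeta^2$. Then, the first statement is proved.

Next, we prove the second statement. We use

\[
\sum_{i=1}^{n_o}\frac{\hat\mu_i/\zeta^2}{\hat\mu_i/\zeta^2 + 1} = \frac{1}{\zeta^2}\sum_{i=1}^{n_o}k_{n_o}(x_i,x_i),
\]

which is proved in Lemma 70. Then, from (B.10), with probability $1-\delta$,

\[
\frac{1}{\zeta^2}\sum_{i=1}^{n_o}k_{n_o}(x_i,x_i) \lesssim \delta'^2_{n_o}\sum_{i=1}^{n_o}\sup_{\{f : \zeta^2/n_o\|f\|_k^2 + E_{x\sim\rho}[f^2(x)]\le 1,\ f\in\mathcal{H}_k\}} f^2(x_i),
\]

where $\delta'_{n_o} = \delta_{n_o} + \sqrt{\log(c_2/\delta)/n_o}$ and $\delta_{n_o}$ is the critical radius of $\{f \in \mathcal{H}_k : \|f\|_k \le 1\}$. Hereafter, we condition on the above event.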



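Then, from Bernstein's inequality,

\[
\left| \sum_{i=1}^{n_o}\sup_{\{f : \zeta^2/n_o\|f\|_k^2 + E_{x\sim\rho}[f^2(x)]\le 1,\ f\in\mathcal{H}_k\}} f^2(x_i) - n_o E_{x\sim\rho}\left[ \sup_{\{f : \zeta^2/n_o\|f\|_k^2 + E_{x\sim\rho}[f^2(x)]\le 1,\ f\in\mathcal{H}_k\}} f^2(x) \right] \right|
\lesssim \sqrt{n_o\,\mathrm{Var}_{x\sim\rho}\left[ \sup_{\{f : \zeta^2/n_o\|f\|_k^2 + E_{x\sim\rho}[f^2(x)]\le 1,\ f\in\mathcal{H}_k\}} f^2(x) \right]} + n_o.
\]

We use that, for $f \in \mathcal{H}_k$ such that $\|f\|_k \le 1$,

\[
|f(x)| = |\langle f(\cdot), k(x,\cdot)\rangle_k| \le \|f\|_k\|k(x,\cdot)\|_k \le 1,
\]

from Assumption 52. Here, from (B.11), the expectation is upper-bounded by

\[
E_{x\sim\rho}\left[ \sup_{\{f : \zeta^2/n_o\|f\|_k^2 + E_{x\sim\rho}[f^2(x)]\le 1,\ f\in\mathcal{H}_k\}} f^2(x) \right] \le d^*.
\]

Besides, the variance is also upper-bounded by

\[
\begin{aligned}
\mathrm{Var}_{x\sim\rho}\left[ \sup_{\{f : \zeta^2/n_o\|f\|_k^2 + E_{x\sim\rho}[f^2(x)]\le 1,\ f\in\mathcal{H}_k\}} f^2(x) \right]
&\le E_{x\sim\rho}\left[ \sup_{\{f : \zeta^2/n_o\|f\|_k^2 + E_{x\sim\rho}[f^2(x)]\le 1,\ f\in\mathcal{H}_k\}} f^4(x) \right] \\
&\le E_{x\sim\rho}\left[ \sup_{\{f : \zeta^2/n_o\|f\|_k^2 + E_{x\sim\rho}[f^2(x)]\le 1,\ f\in\mathcal{H}_k\}} f^2(x) \right] && (f^2(x) \le 1\ \forall x \in \mathcal{S}\times\mathcal{A} \text{ from Assumption 52}) \\
&\le d^*. && \text{(from (B.11))}
\end{aligned}
\]

Thus, with probability $1-\delta$,

\[
\sum_{i=1}^{n_o}k_{n_o}(x_i,x_i) \lesssim \{\delta'_{n_o}\}^2 n_o(d^* + \sqrt{d^*} + 1) \lesssim c_1\{d^* + \log(c_2/\delta)\}d^*,
\]

noting $\delta'_{n_o} = \sqrt{d^*/n_o} + \sqrt{\log(c_2/\delta)/n_o}$ from Theorem 55.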



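By combining all things together, with probability $1-\delta$,

\[
\begin{aligned}
\log(\det(I + \zeta^{-2}K_{n_o}))
&\le \{\log(1 + n_o/\zeta^2) + 1\}\sum_{i=1}^{n_o}\frac{\hat\mu_i/\zeta^2}{\hat\mu_i/\zeta^2 + 1} \\
&= \{\log(1 + n_o/\zeta^2) + 1\}\frac{1}{\zeta^2}\sum_{i=1}^{n_o}k_{n_o}(x_i,x_i) \\
&\lesssim \{\log(1 + c_3 n_o)\}\{d^* + \log(c_2/\delta)\}d^*.
\end{aligned}
\]

This concludes the proof.

□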

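Proof of Theorem 55.

First statement. From Jensen's inequality, we have

\[
E_{x\sim d^{\pi_e}_P}\left[ \sqrt{k_{n_o}(x,x)} \right] \le \sqrt{E_{x\sim d^{\pi_e}_P}[k_{n_o}(x,x)]}.
\]

Thus, we focus on how to bound $E_{x\sim d^{\pi_e}_P}[k_{n_o}(x,x)]$. Before that, we show the following statement. With probability $1-\delta$, we have, for all $x \in \mathcal{S}\times\mathcal{A}$:

\[
k_{n_o}(x,x) \le c_1\zeta^2\delta'^2_{n_o}\times\sup_{\{f\in\mathcal{H}_k : \zeta^2/n_o\|f\|_k^2 + E_{x\sim\rho}[f^2(x)]\le 1\}} f^2(x), \tag{B.10}
\]

where $\delta'_{n_o} = \delta_{n_o} + \sqrt{\log(c_2/\delta)/n_o}$ and $\delta_{n_o}$ is the critical radius of $\{f \in \mathcal{H}_k : \|f\|_k \le 1\}$.

As the first step, we use Lemma 68 and Lemma 69:

\[
\begin{aligned}
k_{n_o}(x,x) &= \sup_{\{f\in\mathcal{H}_{k_{n_o}} : \|f\|^2_{k_{n_o}}\le 1\}} f^2(x) && \text{(from Lemma 68)} \\
&= \sup_{\{f\in\mathcal{H}_k : \|f\|_k^2 + \zeta^{-2}\sum_{i=1}^{n_o}f(x_i)^2\le 1\}} f^2(x). && \text{(from Lemma 69)}
\end{aligned}
\]

Next, we invoke Lemma 66, that is, with probability $1-\delta$,

\[
\frac{1}{n_o}\sum_{i=1}^{n_o}f^2(x_i) \ge 0.5\,E_{x\sim\rho}[f^2(x)] - 0.5\{\delta'_{n_o}\}^2 \quad \forall f \in \mathcal{F},
\]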

202


where

F = { f : f ∈ Hk, ∥ f ∥2k = 1}.

Here, δ′no
= δno +

√
log(c2/δ)/no, where δno is the critical radius of the function class F .

Hereafter, we condition on the above event. The uniform boundedness assumption on $\mathcal{F}$ required by Lemma 66 is satisfied because

\[
|f(x)| = |\langle f(\cdot), k(\cdot, x)\rangle_k| \le \|f\|_k \|k(\cdot, x)\|_k \le 1,
\]

by Theorem 52. Then, we have

\[
k_{n_o}(x,x) \le \sup_{\{f \in \mathcal{H}_k :\, \|f\|_k^2 + \zeta^{-2}(n_o/2)\mathbb{E}_{x\sim\rho}[f^2(x)] \le 1 + n_o \delta'^2_{n_o}/2\}} f^2(x).
\]

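$k_{n_o}(x,x)$ is further upper-bounded by

\begin{align*}
k_{n_o}(x,x) &\le \sup_{\{f \in \mathcal{H}_k :\, 2\zeta^2/n_o \|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)] \le 2\zeta^2/n_o + \zeta^2\delta'^2_{n_o}\}} f^2(x) && (\text{Multiply by } 2\zeta^2/n_o) \\
&\le (2\zeta^2/n_o + \zeta^2\delta'^2_{n_o}) \times \sup_{\{f \in \mathcal{H}_k :\, \zeta^2/n_o \|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)] \le 1\}} f^2(x) \\
&\le c_1 \zeta^2 \delta'^2_{n_o} \times \sup_{\{f \in \mathcal{H}_k :\, \zeta^2/n_o \|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)] \le 1\}} f^2(x).
\end{align*}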

This concludes (B.10).

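Next, we show

\[
\mathbb{E}_{x\sim d^{\pi_e}_P}\Big[ \sup_{\{f \in \mathcal{H}_k :\, \zeta^2/n_o \|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)] \le 1\}} f^2(x) \Big] \le 2d^* \times \sup_{\|x\|_2 \le 1} \frac{x^\top \Sigma_{\pi_e} x}{x^\top \Sigma_\rho x}.
\]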

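For $f(\cdot) = a^\top\phi(\cdot)$ (recall that $\phi(\cdot)$ is the feature mapping defined by the eigenvalues $\mu_i$ and eigenfunctions, s.t. $\phi = (\phi_1, \cdots, \phi_\infty)$), we have

\[
\|f\|_k^2 = a^\top a, \qquad \mathbb{E}_{x\sim\rho}[f^2(x)] = a^\top M a,
\]

where $M$ is a diagonal matrix in $\mathbb{R}^{\infty\times\infty}$ s.t. $M_{i,i} = \mu_i$. Thus,

\[
\mathbb{E}_{x\sim d^{\pi_e}_P}\Big[ \sup_{\{f \in \mathcal{H}_k :\, \zeta^2/n_o \|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)] \le 1\}} f^2(x) \Big]
= \mathbb{E}_{x\sim d^{\pi_e}_P}\Big[ \sup_{\{a \in \mathbb{R}^\infty :\, a^\top(\zeta^2/n_o I + M)a \le 1\}} \{a^\top\phi(x)\}^2 \Big].
\]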

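Then, letting $\Sigma_\rho = \mathbb{E}_{(s,a)\sim\rho}[\phi(s,a)\phi^\top(s,a)]$ and $\Sigma_{\pi_e} = \mathbb{E}_{(s,a)\sim d^{\pi_e}_P}[\phi(s,a)\phi^\top(s,a)]$,

\begin{align*}
\mathbb{E}_{x\sim d^{\pi_e}_P}\Big[ \sup_{\{a \in \mathbb{R}^d :\, a^\top(\zeta^2/n_o I + M)a \le 1\}} \{a^\top\phi(x)\}^2 \Big]
&= \mathbb{E}_{x\sim d^{\pi_e}_P}\big[\phi(x)^\top\{\zeta^2/n_o I + M\}^{-1}\phi(x)\big] \\
&= \operatorname{tr}\big[\mathbb{E}_{x\sim d^{\pi_e}_P}[\phi(x)\phi(x)^\top]\{\zeta^2/n_o I + M\}^{-1}\big] \\
&\le \operatorname{tr}\big[\mathbb{E}_{x\sim\rho}[\phi(x)\phi(x)^\top]\{\zeta^2/n_o I + M\}^{-1}\big] \times \sup_{\|x\|_2\le 1} \frac{x^\top\Sigma_{\pi_e}x}{x^\top\Sigma_\rho x} \\
&= \sum_{i=1}^{\infty} \frac{\mu_i}{\zeta^2/n_o + \mu_i} \times \sup_{\|x\|_2\le 1} \frac{x^\top\Sigma_{\pi_e}x}{x^\top\Sigma_\rho x}.
\end{align*}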

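Then, defining $C_{\pi_e} = \sup_{\|x\|_2\le 1} \frac{x^\top\Sigma_{\pi_e}x}{x^\top\Sigma_\rho x}$, we have

\begin{align*}
\mathbb{E}_{x\sim d^{\pi_e}_P}\Big[ \sup_{\{a \in \mathbb{R}^d :\, a^\top(\zeta^2/n_o I + M)a \le 1\}} \{a^\top\phi(x)\}^2 \Big]
&\le \min_j \Big\{ j + \frac{n_o}{\zeta^2}\sum_{i=j+1}^{\infty} \mu_i \Big\} \times C_{\pi_e} \\
&\le 2d^* \times C_{\pi_e}.
\end{align*}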

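Combining (B.10) and (B.11), the statement follows: with probability $1 - \delta$,

\begin{align*}
\mathbb{E}_{x\sim d^{\pi_e}_P}\big[\sqrt{k_{n_o}(x,x)}\big]
&\le \sqrt{ \zeta^2\delta'^2_{n_o} \times \mathbb{E}_{x\sim d^{\pi_e}_P}\Big[ \sup_{\{f \in \mathcal{H}_k :\, \zeta^2/n_o \|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)] \le 1\}} f^2(x) \Big] } \\
&\le \zeta\, \delta'_{n_o} \sqrt{C_{\pi_e} d^*},
\end{align*}

where $C_{\pi_e} = \sup_{\|x\|_2\le 1} \frac{x^\top\Sigma_{\pi_e}x}{x^\top\Sigma_\rho x}$.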

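Remark 61. In the same way, we can also prove

\[
\mathbb{E}_{x\sim\rho}\Big[ \sup_{\{f \in \mathcal{H}_k :\, \zeta^2/n_o \|f\|_k^2 + \mathbb{E}_{x\sim\rho}[f^2(x)] \le 1\}} f^2(x) \Big] \le \sum_{i=1}^{\infty} \frac{\mu_i}{\zeta^2/n_o + \mu_i} \le 2d^*. \tag{B.11}
\]

This is used in the proof of Theorem 54.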


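Remark 62. We can also use

\begin{align*}
\mathbb{E}_{x\sim d^{\pi_e}_P}\Big[ \sup_{\{a \in \mathbb{R}^d :\, a^\top(\zeta^2/n_o I + M)a \le 1\}} \{a^\top\phi(x)\}^2 \Big]
&= \mathbb{E}_{x\sim d^{\pi_e}_P}\big[\phi(x)^\top\{\zeta^2/n_o I + M\}^{-1}\phi(x)\big] \\
&= \operatorname{tr}\big[\mathbb{E}_{x\sim d^{\pi_e}_P}[\phi(x)\phi(x)^\top]\{\zeta^2/n_o I + M\}^{-1}\big] \\
&= \sum_{i=1}^{\infty} \frac{\mathbb{E}_{x\sim d^{\pi_e}_P}[\phi_i(x)\phi_i(x)^\top]}{\zeta^2/n_o + \mu_i} \\
&= \sum_{i=1}^{\infty} \frac{\mu_i}{\zeta^2/n_o + \mu_i} \times \frac{\mathbb{E}_{x\sim d^{\pi_e}_P}[\phi_i(x)\phi_i(x)^\top]}{\mu_i} \\
&\le \sum_{j=1}^{\infty} \frac{\mu_j}{\zeta^2/n_o + \mu_j} \times \max_i\big(\mathbb{E}_{x\sim d^{\pi_e}_P}[\psi_i(x)\psi_i(x)^\top]\big).
\end{align*}

Then, $C_{\pi_e}$ is replaced with $\max_i(\mathbb{E}_{x\sim d^{\pi_e}_P}[\psi_i(x)\psi_i(x)^\top])$.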

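Second statement. We use Lemma 71 to calculate the critical radius of the RKHS ball. The critical inequality is

\[
\sqrt{1/n_o}\, \sqrt{\sum_{j=1}^{n_o} \min(y^2, \mu_j)} \le y^2.
\]

We show that $y = \sqrt{d^*/n_o}$ satisfies the above. This is proved by

\begin{align*}
\sqrt{1/n_o}\, \sqrt{\sum_{j=1}^{n_o} \min(y^2, \mu_j)}
&\le \min_{1\le k\le n_o} \Big\{ \sqrt{1/n_o}\, \sqrt{k y^2 + B(k+1)} \Big\} \\
&\le \sqrt{1/n_o}\, \sqrt{d^* y^2 + B(d^*+1)} && (d^* \le n_o) \\
&\le \sqrt{1/n_o}\, \sqrt{d^* y^2 + d^*/n_o} && (B(d^*+1) \le d^*/n_o) \\
&\le \sqrt{d^* y^2/n_o} \le y^2.
\end{align*}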

□



B.4 Auxiliary Lemmas

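Lemma 63 (Simulation Lemma). Consider any two functions $f : \mathcal{S}\times\mathcal{A} \mapsto [0,1]$ and $\hat f : \mathcal{S}\times\mathcal{A} \mapsto [0,1]$, any two transitions $P$ and $\hat P$, and any policy $\pi : \mathcal{S} \mapsto \Delta(\mathcal{A})$. We have:

\begin{align*}
V^\pi_{P,f} - V^\pi_{\hat P,\hat f}
&= \sum_{h=0}^{H} \mathbb{E}_{s,a\sim d^\pi_P}\Big[ f(s,a) - \hat f(s,a) + \mathbb{E}_{s'\sim P(\cdot|s,a)}[V^\pi_{\hat P,\hat f;h}(s')] - \mathbb{E}_{s'\sim \hat P(\cdot|s,a)}[V^\pi_{\hat P,\hat f;h}(s')] \Big] \\
&\le \sum_{h=0}^{H} \mathbb{E}_{s,a\sim d^\pi_P}\Big[ f(s,a) - \hat f(s,a) + \|V^\pi_{\hat P,\hat f;h}\|_\infty \|P(\cdot|s,a) - \hat P(\cdot|s,a)\|_1 \Big],
\end{align*}

where $V^\pi_{P,f;h}$ denotes the value function at time step $h$ under $\pi$, $P$, $f$.

This simulation lemma is standard in the model-based RL literature; a derivation can be found, for instance, in the proof of Lemma 10 from Sun et al. (2019b).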

Lemma 64 ($\ell_1$ Distance between two Gaussians). Consider two Gaussian distributions $P_1 := \mathcal{N}(\mu_1, \zeta^2 I)$ and $P_2 := \mathcal{N}(\mu_2, \zeta^2 I)$. We have:

\[
\|P_1 - P_2\|_1 \le \frac{1}{\zeta}\|\mu_1 - \mu_2\|_2.
\]

This lemma is proved by Pinsker’s inequality and the closed form of the KL divergence between $P_1$ and $P_2$.
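For completeness, the two ingredients named above can be chained as follows. This is a short worked sketch rather than part of the original argument; it assumes $\|\cdot\|_1$ denotes the $\ell_1$ distance, i.e. twice the total variation distance.

\begin{align*}
\mathrm{KL}(P_1 \,\|\, P_2) &= \frac{\|\mu_1 - \mu_2\|_2^2}{2\zeta^2} && (\text{equal covariances } \zeta^2 I) \\
\|P_1 - P_2\|_1 &\le \sqrt{2\,\mathrm{KL}(P_1 \,\|\, P_2)} && (\text{Pinsker's inequality}) \\
&= \frac{1}{\zeta}\|\mu_1 - \mu_2\|_2.
\end{align*}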

Lemma 65 (Concentration on the inverse of state-action visitation). We set $\lambda = \Omega(1)$. Then, with probability $1 - \delta$,

\[
\frac{1}{N(s,a) + \lambda} \le \frac{c_1 \log(|\mathcal{S}||\mathcal{A}| c_2/\delta)}{n_o \rho(s,a) + \lambda} \quad \forall (s,a) \in \mathcal{S}\times\mathcal{A}.
\]

The extension of this lemma to linear models is stated in Equation (B.8).



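Proof. We set $\xi = c_1 \log(|\mathcal{S}||\mathcal{A}|/\delta) + 1$ ($c_1 > 4/3 + 3$). First, since $\xi \ge 1$, we have

\[
\frac{1}{N(s,a) + \lambda} \le \frac{\xi}{N(s,a) + \xi\lambda}.
\]

Here, by Bernstein’s inequality, with probability $1 - \delta$,

\[
N(s,a) \ge n_o\rho(s,a) - 2\sqrt{2 n_o \rho(s,a)(1 - \rho(s,a)) \log(|\mathcal{S}||\mathcal{A}|/\delta)} - 4\log(|\mathcal{S}||\mathcal{A}|/\delta)/3, \quad \forall (s,a).
\]

Thus, $\forall (s,a) \in \bar V$, we have

\begin{align*}
N(s,a) + \xi\lambda
&\ge n_o\rho(s,a) - 2\sqrt{2 n_o \rho(s,a)(1 - \rho(s,a)) \log(|\mathcal{S}||\mathcal{A}|/\delta)} - 4\log(|\mathcal{S}||\mathcal{A}|/\delta)/3 + \xi\lambda \\
&\ge n_o\rho(s,a) - 2\sqrt{2 n_o \rho(s,a)(1 - \rho(s,a)) \log(|\mathcal{S}||\mathcal{A}|/\delta)} + (c_1 - 4/3)\log(|\mathcal{S}||\mathcal{A}|/\delta) + \lambda \\
&\ge n_o\rho(s,a) - 2\sqrt{2 n_o \rho(s,a) \log(|\mathcal{S}||\mathcal{A}|/\delta)} + (c_1 - 4/3)\log(|\mathcal{S}||\mathcal{A}|/\delta) + \lambda \\
&\ge 0.5 n_o\rho(s,a) + \Big(\sqrt{0.5 n_o\rho(s,a)} - \sqrt{4\log(|\mathcal{S}||\mathcal{A}|/\delta)}\Big)^2 + (c_1 - 4/3 - 4)\log(|\mathcal{S}||\mathcal{A}|/\delta) + \lambda \\
&\ge 0.5 n_o\rho(s,a) + 0.5\lambda.
\end{align*}

This implies that, with probability $1 - \delta$,

\[
\frac{1}{N(s,a) + \lambda} \le \frac{2\xi}{n_o\rho(s,a) + \lambda} \quad \forall (s,a).
\]

Then, noting $c_1 \log(|\mathcal{S}||\mathcal{A}|/\delta) + 1 \le c_1 \log(|\mathcal{S}||\mathcal{A}| c_2/\delta)$ for some $c_2$, the proof is concluded.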

□

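Lemma 66 (A uniform law with localization: Theorem 14.1 in (Wainwright, 2019)). Assume $\|\mathcal{F}\|_\infty \le b$. Denote the critical radius of a function class $\mathcal{F}$ by $\delta_{n_o}$. The critical radius $\delta_{n_o}$ is defined as a solution to

\[
\mathcal{R}_{n_o}(y; \mathcal{F}) \le y^2/b
\]

with respect to $y$. Then, with probability $1 - \delta$,

\[
\frac{1}{n_o}\sum_{i=1}^{n_o} f(x_i)^2 \ge \frac{1}{2}\mathbb{E}_{x\sim\rho}[f^2(x)] - \frac{(\delta'_{n_o})^2}{2} \quad \forall f \in \mathcal{F},
\]

where $\delta'_{n_o} = \delta_{n_o} + c_1\sqrt{\log(c_2/\delta)/n_o}$.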



Lemma 67 (Critical radius of linear models). Assume $\|\phi(s,a)\|_2 \le 1$ for any $(s,a) \in \mathcal{S}\times\mathcal{A}$. Then, the critical radius of the function class $\mathcal{F} = \{(s,a) \mapsto a^\top\phi(s,a) : \|a\|_2^2 \le \alpha,\ a^\top\phi \le \beta,\ a \in \mathbb{R}^d\}$ is upper-bounded by

\[
c\sqrt{\beta\operatorname{rank}(\Sigma_\rho)/n_o},
\]

where $c$ is a universal constant.

We follow the proof of (Wainwright, 2019, Chapter 14). Their argument depends on the assumption that $\Sigma_\rho$ is full rank. We need to change the proof so that the full-rank assumption is removed and the rank $\operatorname{rank}[\Sigma_\rho]$ appears in the final bound instead of $d$. Note that the final bound does not include $\alpha$.

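Proof. Unless otherwise noted, in this proof, $\mathbb{E}[\cdot]$ is taken with respect to

\[
x_i = (s_i, a_i) \sim \rho(s,a), \qquad \epsilon_i \sim 2\,\mathrm{Ber}(0.5) - 1.
\]

Note that $x_i$ and $\epsilon_i$ are independent.

Noting $\mathbb{E}_{(s,a)\sim\rho}[(a^\top\phi(s,a))^2] = a^\top\Sigma_\rho a$, the localized Rademacher complexity of $\mathcal{F}$, $\mathcal{R}_{n_o}(\xi; \mathcal{F})$, is

\[
\mathbb{E}\Big[ \sup_{\{b \in \mathbb{R}^d :\, \|b\|_2^2 \le \alpha,\ \|b\|_{\Sigma_\rho} \le \xi,\ b^\top\phi \le \beta\}} \Big| \frac{1}{n_o}\sum_{i=1}^{n_o} \epsilon_i \{b^\top\phi(s_i, a_i)\} \Big| \Big],
\]

where $\{\epsilon_i\}_{i=1}^{n_o}$ is a set of independent Rademacher variables. This is upper-bounded by

\[
\mathbb{E}\Big[ \sup_{\{b \in \mathbb{R}^d :\, \|b\|_2^2 \le \alpha,\ \|b\|_{\Sigma_\rho} \le \xi\}} \Big| \frac{1}{n_o}\epsilon^\top\Phi b \Big| \Big],
\]

where $\Phi$ is an $n_o \times d$ design matrix s.t. the $i$-th row is $\phi^\top(s_i, a_i)$ and $\epsilon = (\epsilon_1, \cdots, \epsilon_{n_o})^\top$.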

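Here, we have $\mathbb{E}[\Phi^\top\Phi] = n_o\Sigma_\rho$. Let $UVU^\top$ be the SVD of $\Sigma_\rho$, where $U$ is a $d \times \operatorname{rank}[\Sigma_\rho]$ matrix and $V$ is a $\operatorname{rank}[\Sigma_\rho] \times \operatorname{rank}[\Sigma_\rho]$ diagonal matrix. Noting $b = UU^\top b + (I - UU^\top)b$, we have

\begin{align*}
& \mathbb{E}\Big[ \sup_{\{b \in \mathbb{R}^d :\, \|b\|_2^2 \le \alpha,\ \|b\|_{\Sigma_\rho} \le \xi\}} \Big| \frac{1}{n_o}\epsilon^\top\Phi\{UU^\top b + (I - UU^\top)b\} \Big| \Big] \\
&\le \mathbb{E}\Big[ \sup_{\{b \in \mathbb{R}^d :\, \|b\|_2^2 \le \alpha\}} \Big| \frac{1}{n_o}\epsilon^\top\Phi(I - UU^\top)b \Big| \Big] + \mathbb{E}\Big[ \sup_{\|b\|_{\Sigma_\rho} \le \xi} \Big| \frac{1}{n_o}\epsilon^\top\Phi UU^\top b \Big| \Big] \\
&\le \mathbb{E}\Big[ \sup_{\{b \in \mathbb{R}^d :\, \|b\|_2^2 \le \alpha\}} \Big| \frac{1}{n_o}\epsilon^\top\Phi(I - UU^\top)b \Big| \Big] + \mathbb{E}\Big[ \sup_{\|c\|_V \le \xi} \Big| \frac{1}{n_o}\epsilon^\top\Phi U c \Big| \Big] && (c = U^\top b) \\
&\le \mathbb{E}\Big[ \frac{\sqrt{\alpha}}{n_o}\|\epsilon^\top\Phi(I - UU^\top)\|_2 \Big] + \frac{\xi}{n_o}\mathbb{E}\big[ \|\epsilon^\top\Phi U\|_{V^{-1}} \big] && (\text{CS inequality}) \\
&\le \frac{\sqrt{\alpha}}{n_o}\sqrt{\mathbb{E}\big[ \|\epsilon^\top\Phi(I - UU^\top)\|_2^2 \big]} + \frac{\xi}{n_o}\sqrt{\mathbb{E}\big[ \|\epsilon^\top\Phi U\|^2_{V^{-1}} \big]}. && (\text{Jensen's inequality})
\end{align*}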

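We analyze the second term and the first term in turn.

Regarding the second term, we have

\[
\mathbb{E}_\epsilon[\|\epsilon^\top\Phi U\|^2_{V^{-1}}] = \mathbb{E}_\epsilon[\epsilon^\top\Phi U V^{-1} U^\top\Phi^\top\epsilon] = \operatorname{tr}(\Phi U V^{-1} U^\top\Phi^\top),
\]

where $\mathbb{E}_\epsilon[\cdot]$ is an expectation only with respect to $\epsilon$. Then, by the law of total expectation,

\begin{align*}
\mathbb{E}[\|\epsilon^\top\Phi U\|^2_{V^{-1}}] &= \mathbb{E}[\operatorname{tr}(\Phi U V^{-1} U^\top\Phi^\top)] \\
&= \mathbb{E}[\operatorname{tr}(\Phi^\top\Phi U V^{-1} U^\top)] = \operatorname{tr}(n_o\Sigma_\rho U V^{-1} U^\top) \\
&= n_o\operatorname{tr}(UVU^\top U V^{-1} U^\top) \\
&= n_o\operatorname{tr}(UU^\top) = n_o\operatorname{tr}(U^\top U) = n_o\operatorname{rank}(\Sigma_\rho).
\end{align*}

Similarly,

\[
\mathbb{E}_\epsilon\big[ \|\epsilon^\top\Phi(I - UU^\top)\|_2^2 \big] = \operatorname{tr}(\Phi(I - UU^\top)(I - UU^\top)\Phi^\top) = \operatorname{tr}(\Phi^\top\Phi(I - UU^\top)).
\]

Then, by the law of total expectation,

\begin{align*}
\mathbb{E}\big[ \|\epsilon^\top\Phi(I - UU^\top)\|_2^2 \big] &= \mathbb{E}[\operatorname{tr}(\Phi^\top\Phi(I - UU^\top))] \\
&= n_o\operatorname{tr}(\Sigma_\rho(I - UU^\top)) = n_o\operatorname{tr}(UVU^\top(I - UU^\top)) = 0.
\end{align*}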



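Combining the two terms,

\[
\mathcal{R}_{n_o}(\xi; \mathcal{F}) \le \xi\sqrt{\operatorname{rank}[\Sigma_\rho]/n_o}.
\]

Then, the critical inequality becomes

\[
y\sqrt{\operatorname{rank}(\Sigma_\rho)/n_o} \le y^2/\beta.
\]

Thus, the critical radius of $\mathcal{F}$ is $\sqrt{\beta\operatorname{rank}(\Sigma_\rho)/n_o}$.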

□

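Lemma 68 (Variational representation of kernels). We denote the RKHS associated with a kernel $k(\cdot,\cdot)$ by $\mathcal{H}_k$. Then,

\[
k(x,x) = \sup_{\{f :\, \|f\|_k \le 1,\ f \in \mathcal{H}_k\}} f^2(x).
\]

Proof. We have

\begin{align*}
\sup_{\{f :\, \|f\|_k \le 1,\ f \in \mathcal{H}_k\}} f^2(x) &= \sup_{\{f :\, \|f\|_k \le 1,\ f \in \mathcal{H}_k\}} \langle f, k(x,\cdot)\rangle_k^2 \\
&\le \sup_{\{f :\, \|f\|_k \le 1,\ f \in \mathcal{H}_k\}} \|f\|_k^2\, k(x,x) && (\text{CS inequality}) \\
&= k(x,x).
\end{align*}

Besides, equality is attained at $f(\cdot) = k(x,\cdot)/\sqrt{k(x,x)}$, noting

\[
f^2(x) = k^2(x,x)/k(x,x) = k(x,x), \qquad \|f(\cdot)\|_k = \|k(x,\cdot)\|_k/\sqrt{k(x,x)} = 1.
\]

Thus,

\[
k(x,x) = \sup_{\{f :\, \|f\|_k \le 1,\ f \in \mathcal{H}_k\}} f^2(x).
\]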

□



Lemma 69 (Relation between $\mathcal{H}_{k_{n_o}}$ and $\mathcal{H}_k$). We denote the RKHS associated with a kernel $k(\cdot,\cdot)$ by $\mathcal{H}_k$ and the RKHS associated with a kernel $k_{n_o}(\cdot,\cdot)$ by $\mathcal{H}_{k_{n_o}}$. Then, we have $\mathcal{H}_k = \mathcal{H}_{k_{n_o}}$. Besides, for $f \in \mathcal{H}_k$, we have

\[
\|f\|^2_{k_{n_o}} = \|f\|^2_k + \zeta^{-2}\sum_{i=1}^{n_o} f(x_i)^2.
\]

This is stated in (Srinivas et al., 2010, Appendix B) without proof. For completeness, we provide the proof.

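Proof. We use Mercer’s theorem (Wainwright, 2019, Theorem 12.20). Then, any element in the RKHS associated with the kernel $k(x,x)$ is represented by

\[
f(x) = \sum_{i=1}^{\infty} f_i\psi_i(x),
\]

where $\{\psi_i\}_{i=1}^{\infty}$ is an orthonormal basis for $L^2(\rho)$: $\mathbb{E}_{x\sim\rho}[\psi_i(x)\psi_j(x)] = I(i = j)$. Here, we have

\[
k(x,x) = \psi^\top(x)\Lambda\psi(x) = \phi^\top(x)\phi(x), \qquad \|f\|^2_k = \tilde f^\top\Lambda^{-1}\tilde f,
\]

where $\phi_i(x) = \sqrt{\mu_i}\psi_i(x)$ and $\tilde f = \{f_i\}_{i=1}^{\infty} \in \mathbb{R}^\infty$.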

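Then, letting $\Phi$ be an $n_o \times d$ matrix s.t. the $i$-th row is $\phi^\top(s_i, a_i)$,

\begin{align*}
k_{n_o}(x,x) &= \phi^\top(x)\phi(x) - \phi^\top(x)\Phi^\top(\Phi\Phi^\top + \zeta^2 I)^{-1}\Phi\phi(x) \\
&= \phi^\top(x)\{I - \Phi^\top(\Phi\Phi^\top + \zeta^2 I)^{-1}\Phi\}\phi(x) \\
&= \phi^\top(x)\{I + \Phi^\top\Phi/\zeta^2\}^{-1}\phi(x) && (\text{Woodbury matrix identity}) \\
&= \phi^\top(x)\Big(I + \sum_{i=1}^{n_o}\phi(x_i)\phi(x_i)^\top/\zeta^2\Big)^{-1}\phi(x).
\end{align*}

Here, let $UVU^\top$ be the eigenvalue decomposition of $\{\Lambda^{-1} + \sum_{i=1}^{n_o}\psi(x_i)\psi(x_i)^\top/\zeta^2\}^{-1}$. Then,

\begin{align*}
k_{n_o}(x,x) &= \psi^\top(x)\Big(\Lambda^{-1} + \sum_{i=1}^{n_o}\psi(x_i)\psi(x_i)^\top/\zeta^2\Big)^{-1}\psi(x) \\
&= \psi^\top(x)UVU^\top\psi(x) \\
&= \psi'^\top(x)V\psi'(x). && (U^\top\psi = \psi')
\end{align*}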

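Then, any element $f(\cdot)$ in the RKHS associated with the kernel $k_{n_o}(x,x)$ is represented as

\[
f(\cdot) = \tilde g^\top\psi'(\cdot), \qquad \tilde g \in \mathbb{R}^\infty,
\]

and the associated norm is $\|f\|^2_{k_{n_o}} = \tilde g^\top V^{-1}\tilde g$ since $\psi'(\cdot)$ is still an orthonormal basis for $L^2(\rho)$, i.e., $\mathbb{E}_{x\sim\rho}[\psi'_i(x)\psi'_j(x)] = I(i = j)$. This immediately implies $\mathcal{H}_k = \mathcal{H}_{k_{n_o}}$.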

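Finally, we check the relation of the norms:

\begin{align*}
\|f\|^2_{k_{n_o}} &= \Big\|\sum_{i=1}^{\infty} f_i\psi_i\Big\|^2_{k_{n_o}} = \|\tilde f^\top\psi\|^2_{k_{n_o}} && (\tilde f = \{f_1, f_2, \cdots\}^\top) \\
&= \|\{U^\top\tilde f\}^\top U^\top\psi\|^2_{k_{n_o}} \\
&= \|\{U^\top\tilde f\}^\top\psi'\|^2_{k_{n_o}} \\
&= \{U^\top\tilde f\}^\top V^{-1} U^\top\tilde f \\
&= \tilde f^\top\Big(\Lambda^{-1} + \sum_{i=1}^{n_o}\psi(x_i)\psi(x_i)^\top/\zeta^2\Big)\tilde f \\
&= \|f\|^2_k + \frac{1}{\zeta^2}\sum_{i=1}^{n_o}\{\tilde f^\top\psi(x_i)\}^2 = \|f\|^2_k + \zeta^{-2}\sum_{i=1}^{n_o} f^2(x_i).
\end{align*}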

□

Lemma 70. Let $\{\hat\mu_i\}_{i=1}^{n_o}$ be the eigenvalues of $K_{n_o}$. Then,

\[
\sum_{i=1}^{n_o} \frac{\hat\mu_i/\zeta^2}{\hat\mu_i/\zeta^2 + 1} = \frac{1}{\zeta^2}\sum_{i=1}^{n_o} k_{n_o}(x_i, x_i).
\]



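Proof.

\begin{align*}
\sum_{i=1}^{n_o} k_{n_o}(x_i, x_i) &= \sum_{i=1}^{n_o} \Big( k(x_i, x_i) - \bar k^\top_{n_o}(x_i)\{K_{n_o} + \zeta^2 I\}^{-1}\bar k_{n_o}(x_i) \Big) \\
&= \operatorname{tr}\Big( \sum_{i=1}^{n_o} \Big( k(x_i, x_i) - \bar k^\top_{n_o}(x_i)\{K_{n_o} + \zeta^2 I\}^{-1}\bar k_{n_o}(x_i) \Big) \Big) \\
&= \operatorname{tr}(K_{n_o}) - \operatorname{tr}\Big( \sum_{i=1}^{n_o} \bar k_{n_o}(x_i)\bar k^\top_{n_o}(x_i)\{K_{n_o} + \zeta^2 I\}^{-1} \Big) \\
&= \operatorname{tr}\big( K_{n_o} - K^2_{n_o}\{K_{n_o} + \zeta^2 I\}^{-1} \big) \\
&= \operatorname{tr}\big( \{K^2_{n_o} + \zeta^2 K_{n_o} - K^2_{n_o}\}\{K_{n_o} + \zeta^2 I\}^{-1} \big) \\
&= \operatorname{tr}\big( \zeta^2 K_{n_o}\{K_{n_o} + \zeta^2 I\}^{-1} \big) = \sum_{i=1}^{n_o} \frac{\hat\mu_i}{\hat\mu_i/\zeta^2 + 1}.
\end{align*}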

□
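The identity in Lemma 70 can be sanity-checked numerically. The sketch below is illustrative only and is not part of the original proof: it assumes an RBF kernel, two scalar inputs, and hypothetical helper names (`rbf`, `check_lemma70`), and it uses closed-form 2×2 eigenvalues and inverses so that only the standard library is needed.

```python
import math


def rbf(x, y, length_scale=1.0):
    """RBF kernel on scalars (an illustrative choice of k, not from the text)."""
    return math.exp(-((x - y) ** 2) / (2.0 * length_scale ** 2))


def check_lemma70(x1, x2, zeta):
    """Return both sides of Lemma 70 for the two-point dataset {x1, x2}."""
    # Gram matrix K = [[a, b], [b, d]].
    a, b, d = rbf(x1, x1), rbf(x1, x2), rbf(x2, x2)

    # Left-hand side: eigenvalues of a symmetric 2x2 matrix in closed form.
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    eigenvalues = [mean + radius, mean - radius]
    lhs = sum((mu / zeta ** 2) / (mu / zeta ** 2 + 1.0) for mu in eigenvalues)

    # Right-hand side: posterior variances k_no(x_i, x_i) = k(x_i, x_i)
    # - kbar(x_i)^T (K + zeta^2 I)^{-1} kbar(x_i), with kbar(x_i) the i-th column of K.
    p, q, r = a + zeta ** 2, b, d + zeta ** 2
    det = p * r - q * q
    inv = [[r / det, -q / det], [-q / det, p / det]]
    columns = [(a, b), (b, d)]
    posterior = []
    for i, kbar in enumerate(columns):
        quad = sum(kbar[u] * inv[u][v] * kbar[v] for u in range(2) for v in range(2))
        posterior.append(columns[i][i] - quad)
    rhs = sum(posterior) / zeta ** 2
    return lhs, rhs


if __name__ == "__main__":
    print(check_lemma70(0.0, 0.7, zeta=0.5))
```

Both returned values should agree up to floating-point error for any choice of inputs and $\zeta > 0$.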

Lemma 71 (Calculation of localized Rademacher complexity of RKHS balls: Corollary 14.5 in (Wainwright, 2019)). Let $\mathcal{F} = \{f \in \mathcal{H}_k : \|f\|_k \le 1\}$ be the unit ball of an RKHS with eigenvalues $\{\mu_j\}_{j=1}^{\infty}$. Then, the localized population Rademacher complexity is upper-bounded by

\[
\mathcal{R}_n(\delta; \mathcal{F}) \le \sqrt{\frac{2}{n}}\sqrt{\sum_{j=1}^{\infty} \min(\mu_j, \delta^2)}.
\]

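Lemma 72 (Upper-bound of expectation of information gains: finite-dimensional models).

\[
\mathbb{E}[\bar I_{n_o}] \le \operatorname{rank}(\Sigma_\rho)\{\log(1 + n_o/\lambda) + 1\}.
\]

Proof.

\begin{align*}
\mathbb{E}[\bar I_{n_o}] &= \mathbb{E}[\log(\det(\Sigma_{n_o}/\lambda))] \\
&\le \log\det(\mathbb{E}[\Sigma_{n_o}/\lambda]) = \log\det(I + (n_o/\lambda)\Sigma_\rho) && (\text{Jensen's inequality}) \\
&\le \operatorname{rank}(\Sigma_\rho)\{\log(1 + n_o/\lambda) + 1\}.
\end{align*}

The final line is proved as in the proof of Theorem 50. □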



Lemma 73 (Upper-bound of expectation of information gains: RKHS).

\[
\mathbb{E}[I_{n_o}] \le 2d^*\{\log(1 + n_o/\zeta^2) + 1\}.
\]

Proof.

\begin{align*}
\mathbb{E}[I_{n_o}] &= \mathbb{E}[\log(\det(I + \zeta^{-2}K_{n_o}))] \\
&\le \sum_{s=1}^{\infty} \log(1 + \zeta^{-2}\mu_s n_o) && (\text{Refer to (Seeger et al., 2008, Lemma 1)}) \\
&\le \{\log(1 + n_o/\zeta^2) + 1\}\, 2d^*.
\end{align*}

From the second line to the third line, we follow the proof of Theorem 54. □

B.5 Implementation Details

Here we detail all environment details and hyperparameters used for the experiments in

the main text.

B.5.1 Environment Details

All environments have a maximum horizon length of 500 timesteps. We achieve this by

reducing the data collection frequency of the base 1000 horizon environments. We also

remove all contact information from the observation and the reward. Finally, to be able

to compute the ground truth reward from the state, we add the velocity of the center of

mass into the state.



Table B.1: Observation and action space dimensions for each of the environments

| Environment | Observation Space Dimension | Action Space Dimension |
|---|---|---|
| Hopper | 12 | 3 |
| Walker2d | 18 | 6 |
| HalfCheetah | 18 | 6 |
| Ant | 29 | 8 |
| Humanoid | 47 | 17 |

Table B.2: Ground truth environment reward function used to train the expert and behavior policies as well as evaluate the performance in the learning curves. At time $t$, $\dot x_t$ is the velocity of the center of mass in the x-axis, $a_t$ is the action vector, and $z_t$ is the position of the center of mass in the z-axis.

| Environment | Ground Truth Reward Function |
|---|---|
| Hopper | $\dot x_t - 0.1\|a_t\|_2^2 - 3.0 \times (z_t - 1.3)^2$ |
| Walker2d | $\dot x_t - 0.1\|a_t\|_2^2 - 3.0 \times (z_t - 0.57)^2$ |
| HalfCheetah | $\dot x_t - 0.1\|a_t\|_2^2$ |
| Ant | $\dot x_t - 0.1\|a_t\|_2^2 - 3.0 \times (z_t - 1.3)^2$ |
| Humanoid | $1.25 \times \dot x_t - 0.1\|a_t\|_2^2 + 5 \times \mathrm{bool}(1.0 \le z_t \le 2.0)$ |
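As a concrete reading of Table B.2, the sketch below evaluates the Hopper and Humanoid rows from the three quantities the table names. The function names and argument layout are hypothetical and chosen for illustration; they are not taken from the experiment code.

```python
def hopper_reward(x_velocity, action, z_position):
    """Hopper row of Table B.2: x_dot - 0.1 * ||a||_2^2 - 3.0 * (z - 1.3)^2."""
    control_cost = 0.1 * sum(a * a for a in action)
    height_cost = 3.0 * (z_position - 1.3) ** 2
    return x_velocity - control_cost - height_cost


def humanoid_reward(x_velocity, action, z_position):
    """Humanoid row of Table B.2: 1.25 * x_dot - 0.1 * ||a||_2^2 + 5 * bool(1.0 <= z <= 2.0)."""
    control_cost = 0.1 * sum(a * a for a in action)
    alive_bonus = 5.0 if 1.0 <= z_position <= 2.0 else 0.0
    return 1.25 * x_velocity - control_cost + alive_bonus


if __name__ == "__main__":
    print(hopper_reward(1.0, [1.0, 0.0, 0.0], 1.3))
    print(humanoid_reward(2.0, [0.0] * 17, 1.5))
```

Because both functions depend on the center-of-mass velocity, they illustrate why that velocity is added to the state in Section B.5.1.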

B.5.2 Dynamics Ensemble Architecture and Model Learning

For all of our experiments we use an ensemble of four dynamics models with each model

parameterized by a feed-forward neural network with two hidden layers containing 1024

units. The learned model does not predict next state, but instead predicts the normalized

difference between the next state and the current state, st+1 − st. The activation function

used at each layer is ReLU. We train all of our ensembles using Adam with learning rate

5 × 10−5 and otherwise default hyperparameters. We train each dynamics model for 300

epochs on just the offline dataset for all of our experiments. Please see Table B.3 for all

values.



Table B.3: All hyperparameters used for dynamics model learning

| Hyperparameter | Value |
|---|---|
| Hidden Layers | (1024, 1024) |
| Activation | ReLU |
| Optimizer | Adam |
| Learning Rate | $5 \times 10^{-5}$ |
| Batch Size | 256 |
| Epochs | 300 |

B.5.3 Policy Architecture and TRPO Details

We use the open source NPG/TRPO implementation, MJRL (Rajeswaran et al., 2018).

The policy network and the value network are feedforward neural networks with two hidden layers containing 32 and 128 hidden units respectively. Both networks use a tanh activation function, with the policy network outputting a Gaussian distribution $\mathcal{N}(\mu(s), \sigma^2)$ where $\sigma$ is a trainable parameter. We use the Generalized Advantage Estimator (GAE) to estimate the advantages. Please see Table B.4 for all values.
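For readers unfamiliar with GAE, the following is a minimal sketch of the standard estimator using the discount and GAE $\lambda$ listed in Table B.4 as defaults. It is an illustration under stated assumptions, not the MJRL implementation: the function name `gae_advantages` and the convention that `values` carries one extra bootstrap entry are choices made here.

```python
def gae_advantages(rewards, values, gamma=0.995, lam=0.97):
    """Generalized Advantage Estimation for a single trajectory.

    rewards: list of length T.
    values:  list of length T + 1; values[T] is the bootstrap value
             (0.0 if the trajectory ended in a terminal state).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each step reuses the discounted sum of later TD residuals.
    for t in reversed(range(len(rewards))):
        td_residual = rewards[t] + gamma * values[t + 1] - values[t]
        running = td_residual + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    print(gae_advantages([1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.0]))
```

Setting `lam=1.0` recovers Monte Carlo returns minus the value baseline, while `lam=0.0` recovers the one-step TD residual.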

Table B.4: TRPO/NPG hyperparameter values used in experiments.

| Hyperparameter | Value |
|---|---|
| Policy Hidden Layers | (32, 32) |
| Critic Hidden Layers | (128, 128) |
| Batch Size | 40000 |
| Max KL Divergence | 0.01 |
| Discount $\gamma$ | 0.995 |
| CG Iterations | 25 |
| CG Damping | $1 \times 10^{-5}$ |
| GAE $\lambda$ | 0.97 |
| Critic Update Epochs | 2 |
| Critic Optimizer | Adam |
| Critic Learning Rate | $1 \times 10^{-4}$ |
| Critic L2 Regularization | $1 \times 10^{-4}$ |
| Policy Init Log Std. | -0.25 |
| Policy Min Log Std. | -2.0 |
| BC Regularization $\lambda_{BC}$ | 0.1 |



B.5.4 Discriminator Update and Cost Function Details

We parameterize our discriminator as a linear function $f(s,a) = w^\top\phi(s,a)$, where $\phi(s,a)$ are Random Fourier Features (Rahimi and Recht, 2008b) and $w$ is the vector of parameters for the discriminator. Recall our objective,

\[
\min_{\pi\in\Pi}\max_{f\in\mathcal{F}} \Big[ \mathbb{E}_{(s,a)\sim d^\pi_{\hat P}}\big(f(s,a) + b(s,a)\big) - \mathbb{E}_{(s,a)\sim\mathcal{D}_e}[f(s,a)] \Big] + \lambda_{BC}\cdot\mathbb{E}_{(s,a)\sim\mathcal{D}_e}[\ell(a, s, \pi)].
\]

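Now, given a policy $\pi$, we can compute a closed-form update for the discriminator parameters $w$ as follows:

\begin{align*}
& \max_{w :\, \|w\|_2^2 \le \eta} L(w; \pi, \hat P, b, \mathcal{D}_e) := \mathbb{E}_{(s,a)\sim d^\pi_{\hat P}}\big(f(s,a) + b(s,a)\big) - \mathbb{E}_{(s,a)\sim\mathcal{D}_e}[f(s,a)] \\
\equiv\; & \max_{w} L_\eta(w; \pi, \hat P, b, \mathcal{D}_e) = \mathbb{E}_{(s,a)\sim d^\pi_{\hat P}}\big(f(s,a) + b(s,a)\big) - \mathbb{E}_{(s,a)\sim\mathcal{D}_e}[f(s,a)] - \frac{1}{2}\cdot(\|w\|_2^2 - \eta) \\
\Rightarrow\; & \partial_w L_\eta(w; \pi, \hat P, b, \mathcal{D}_e) = \mathbb{E}_{(s,a)\sim d^\pi_{\hat P}}[\phi(s,a)] - \mathbb{E}_{(s,a)\sim\mathcal{D}_e}[\phi(s,a)] - w,
\end{align*}

where $\partial_w L_\eta(w; \pi, \hat P, b, \mathcal{D}_e)$ denotes the partial derivative of $L_\eta(\cdot)$ with respect to $w$. Setting the above expression to 0 and solving for $w$ gives the closed-form solution $w = \mathbb{E}_{(s,a)\sim d^\pi_{\hat P}}[\phi(s,a)] - \mathbb{E}_{(s,a)\sim\mathcal{D}_e}[\phi(s,a)]$. Note that this solution still holds when the BC regularization term is added to the objective, since that term does not depend on $w$.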

Now, for a given updated $w_t$, we have our cost function $c(s,a) = w_t^\top\phi(s,a) + b(s,a)$, where our penalty, $b(s,a)$, is the maximum discrepancy of our model ensemble predictions. To balance our penalty term with our cost term, we introduce a parameter $\lambda_{\text{penalty}}$ to get the cost

\[
c(s,a) = (1 - \lambda_{\text{penalty}})\cdot w_t^\top\phi(s,a) + \lambda_{\text{penalty}}\cdot b(s,a).
\]

In our experiments, λpenalty was the only parameter we varied across environments.
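The discriminator update and cost described in this section can be sketched in a few lines. This is a simplified illustration, not the experiment code: feature vectors are plain lists, the norm constraint $\eta$ is ignored, all function names are hypothetical, and measuring ensemble discrepancy by the largest pairwise Euclidean distance is an assumption about how "maximum discrepancy" is computed.

```python
import math
from itertools import combinations


def mean_features(features):
    """Average a list of equal-length feature vectors."""
    n, dim = len(features), len(features[0])
    return [sum(f[j] for f in features) / n for j in range(dim)]


def closed_form_w(policy_features, expert_features):
    """w = E_policy[phi] - E_expert[phi], the stationary point of L_eta."""
    mean_policy = mean_features(policy_features)
    mean_expert = mean_features(expert_features)
    return [p - e for p, e in zip(mean_policy, mean_expert)]


def ensemble_penalty(predictions):
    """Largest pairwise L2 distance between ensemble next-state predictions (assumed form of b)."""
    return max(math.dist(p, q) for p, q in combinations(predictions, 2))


def cost(w, phi, penalty, lam_penalty):
    """c(s, a) = (1 - lam) * w^T phi(s, a) + lam * b(s, a)."""
    linear = sum(wi * pi for wi, pi in zip(w, phi))
    return (1.0 - lam_penalty) * linear + lam_penalty * penalty


if __name__ == "__main__":
    w = closed_form_w([[1.0, 0.0], [3.0, 2.0]], [[0.0, 0.0], [2.0, 0.0]])
    b = ensemble_penalty([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
    print(w, b, cost(w, [2.0, 3.0], b, lam_penalty=1e-4))
```

With the small $\lambda_{\text{penalty}}$ values in Table B.5, the cost is dominated by the discriminator term unless the ensemble disagreement is large.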



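Table B.5: $\lambda_{\text{penalty}}$ values used for each environment.

| Environment | $\lambda_{\text{penalty}}$ |
|---|---|
| Hopper | $2.5 \times 10^{-4}$ |
| Walker2d | $1.0 \times 10^{-7}$ |
| HalfCheetah | $1.0 \times 10^{-4}$ |
| Ant | $1.0 \times 10^{-4}$ |
| Humanoid | $5.0 \times 10^{-4}$ |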

B.6 Additional Experiments

B.6.1 MILO with Expert Trajectories

Recall that in our main experiments, we create an extremely small expert dataset by randomly sampling (s, a) pairs from a larger expert dataset consisting of state-action pairs from many expert trajectories. We did this to create an expert dataset on which BC almost completely fails. One may wonder what MILO would do if we instead gave it a single complete expert trajectory. We conduct such experiments in this section. Figure B.1 shows the performance of MILO with one expert trajectory using the same hyperparameters as before. All plots are averaged across five seeds. Note that MILO still performs well with one expert trajectory, matching or nearly matching the expert performance across all 5 continuous control tasks.

Figure B.1: Performance of MILO with one expert trajectory. Note that MILO performs just as well with trajectory inputs as with state-action pair sample inputs.



B.6.2 Performance of MILO on Ant without Pessimism

Figure B.2: Performance of MILO with and without pessimism for Ant-v2.

Figure B.2 shows MILO with and without pessimism for a given set of hyperparameters.

Note that unlike the learning curves shown in Figure 7.7, MILO is still able to stably reach

expert level performance.



APPENDIX C

MISSING PROOFS AND DETAILS IN CHAPTER 4

C.1 Detailed Algorithm Pseudocode

Algorithm 13 presents a more detailed pseudocode of AILBoost. The main detail here is

the 2-step process of learning our discriminator using a weighted replay buffer of weak

learner samples and then learning a weak learner for a certain number of RL steps.

Algorithm 13 AILBOOST (Adversarial Imitation Learning via Boosting)
Require: number of iterations $T$, expert data $\mathcal{D}_e$, weighting parameter $\alpha$
1: Initialize $\pi_1$, weight $\alpha_1 = 1$, replay buffer $\mathcal{B} = \emptyset$
2: for $t = 1, \ldots, T$ do
3:   Construct the $t$-th dataset $\mathcal{D}_t = \{(s_j, a_j)\}_{j=1}^{N}$ where $s_j, a_j \sim d^{\pi_t}$ $\forall j$.
4:   Set $\mathcal{B} \leftarrow \mathcal{B} \cup \mathcal{D}_t$
5:   for # of Discriminator Updates do
6:     Sample batch from $\mathcal{B}$ with respective sample weights $\alpha_{i<t}$
7:     Update discriminator $\hat g$ via Eq. 4.5
8:   end for
9:   for # of Weak Learner Updates do
10:    Compute weak learner $\pi_{t+1}$ via an off-policy RL approach (e.g., SAC) on reward $-\hat g(s,a)$ with replay buffer $\mathcal{B}$ with uniform weights on all samples
11:  end for
12:  Set $\alpha_i \leftarrow \alpha_i(1 - \alpha)$ for $i \le t$, and $\alpha_{t+1} = \alpha$
13: end for
14: Return Ensemble $\pi = \{(\alpha_i, \pi_i)\}_{i=1}^{T}$

After learning our ensemble, we evaluate it by randomly sampling a policy, πi,

from our ensemble with probability αi. With this weighted sampling, we then collect a

trajectory. Algorithm 14 details this process.



Algorithm 14 AILBOOST EVALUATION
Require: Ensemble $\pi = \{(\alpha_i, \pi_i)\}_{i=1}^{T}$
1: for # of Evaluation Trajectories do
2:   Sample $\pi_i \sim \pi$ with probability $\alpha_i$
3:   Collect trajectory using $\pi_i$
4: end for
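The weight bookkeeping in line 12 of Algorithm 13 and the sampling step in Algorithm 14 can be sketched as follows. This is an illustrative sketch rather than the released implementation; the function names are hypothetical, and policies are represented only by their indices.

```python
import random


def boosting_weights(num_iterations, alpha):
    """Apply alpha_i <- alpha_i * (1 - alpha) for existing learners, then alpha_{t+1} = alpha."""
    weights = [1.0]  # alpha_1 = 1 for the initial policy
    for _ in range(num_iterations):
        weights = [w * (1.0 - alpha) for w in weights]
        weights.append(alpha)
    return weights


def sample_policy_index(weights, rng):
    """Pick which ensemble member collects the next evaluation trajectory."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


if __name__ == "__main__":
    weights = boosting_weights(num_iterations=3, alpha=0.5)
    print(weights, sum(weights))
    rng = random.Random(0)
    print([sample_policy_index(weights, rng) for _ in range(5)])
```

The update keeps the weights summing to one at every iteration, so they can be used directly as sampling probabilities; with a small $\alpha$ such as the 0.05 in Table C.3, older weak learners retain most of the mass.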

C.2 Implementation and Experiment Details

Here we detail all environment specifications and hyperparameters used in the main text.

C.2.1 Environment Details

Following the standards used by DrQ-v2 (Yarats et al., 2022), all environments have a

maximum horizon length of 500 timesteps. This is achieved by setting each environment’s

action repeat to be 2 frames. For image based tasks, each state is 3 stacked frames that

are each 84 × 84 dimensional RGB images (thus 9 × 84 × 84).

| Task | Action Space Dimension | Task Traits | Reward Type |
|---|---|---|---|
| Ball in Cup Catch | 2 | swing, catch | sparse |
| Walker Walk | 6 | walk | dense |
| Cheetah Run | 6 | run | dense |
| Quadruped Walk | 12 | walk | dense |
| Humanoid Stand | 21 | stand | dense |

Table C.1: Task descriptions, action space dimension, and reward type for each tested environment.



C.2.2 Dataset Details

Using the publicly released implementation for SAC and DrQ-v2, we trained high quality

expert policies for state-based and image-based environments respectively. We refer the

readers to (Yarats et al., 2022) and (Haarnoja et al., 2018a; Yarats and Kostrikov, 2020)

for exact hyperparameters.

| Task | Expert Performance | Random Performance |
|---|---|---|
| Ball in Cup Catch | 980.8 | 16.4 |
| Walker Walk | 966.6 | 19.9 |
| Cheetah Run | 910.5 | 0.2 |
| Quadruped Walk | 959.2 | 17.9 |
| Humanoid Stand | 927.8 | 3.9 |
| Walker Walk (Vision) | 823.1 | 9.6 |
| Cheetah Run (Vision) | 806.3 | 0.3 |

Table C.2: Average expert and random performance, calculated by averaging 50 trajectories collected from the expert and random policies respectively. Vision experts are denoted (Vision).

C.2.3 Hyperparameters

For ValueDICE and IQ-Learn, we used the base hyperparameters they reported for the MuJoCo benchmark suite. To ensure good performance, we tried different configurations for every environment (e.g., the Cheetah Run configuration on Walker Walk), since, despite using the same physics engine and models, there are minor differences in the DeepMind Control Suite. For DAC and AILBoost, we used our own implementations. Table C.3 details the hyperparameters used. Note that all hyperparameters are shared between DAC and AILBoost except for the update frequency of the discriminator versus the policy, which is one of the core differences between DAC and AILBoost.



For AILBoost we predominantly tested 4 hyperparameters: the # of discriminator updates, the steps to learn weak learners, the weighting parameter α, and the TD n-step. For the # of discriminator updates, we tested 10, 100, 500, 1000, and 5000. For the steps to learn weak learners, we tested 1000, 5000, 10000, 20000, and 100000. For α, we swept 0.95, 0.7, 0.4, 0.2, and 0.05. Finally, we tested TD n-step values of 1 and 3.



| Setting | Values |
|---|---|
| Policy Architecture (state) | 3 layer MLP with 1024 hidden units each |
| DAC (state) | total number of steps: 10e6 |
| | replay buffer size: 1e6 |
| | learning rate: 1e-4 |
| | action repeat: 2 |
| | batch size: 256 |
| | TD n-step: 1 |
| | discount factor: 0.99 |
| | gradient penalty coeff: 10.0 |
| | policy update frequency: 2 |
| AILBoost (state) | Samples per Weak Learner (N): 1000 |
| | # of Weak Learners (T): 100 |
| | Steps to learn Weak Learner: 1000 |
| | # of Discriminator updates: 100 |
| | Weighting Parameter (α): 0.05 |
| Policy Architecture (vision) | Model Architecture from (Yarats et al., 2022) |
| DAC (vision) | total number of steps: 20e6 |
| | replay buffer size: 1e6 |
| | learning rate: 1e-4 |
| | action repeat: 2 |
| | batch size: 512 |
| | TD n-step: 3 |
| | discount factor: 0.99 |
| | gradient penalty coeff: 10.0 |
| | policy update frequency: 2 |
| AILBoost (vision) | Samples per Weak Learner (N): 10000 |
| | # of Weak Learners (T): 100 |
| | Steps to learn Weak Learner: 20000 |
| | # of Discriminator updates: 500 |
| | Weighting Parameter (α): 0.05 |

Table C.3: Hyperparameters used for DAC and AILBoost. All of DAC’s hyperparameters are shared by AILBoost except for the parameters colored in blue. In particular, the update frequency of the discriminator versus the policy is one of the core differences between DAC and AILBoost.



C.3 Additional Results

C.3.1 Aggregate Performance Comparisons

Following the recommendations of Agarwal et al. (2021b), we additionally measure the probability of improvement between two algorithms. This metric measures how likely it is for X to outperform Y on a randomly selected task from the benchmark suite. Specifically, $P(X > Y) = \frac{1}{M}\sum_{m=1}^{M} P(X_m > Y_m)$, where $M$ is the number of tasks and $P(X_m > Y_m)$ is the probability of X outperforming Y on task $m$. Note that this measurement does not account for the size of the improvement. Figure C.1 shows the comparison. AILBoost shows significant improvement over all other algorithms other than DAC. In conjunction with Figure 4.1, we see that although the chance of AILBoost doing better than DAC is ≈ 50%, the size of the improvement AILBoost has over DAC, as measured by the IQM and mean, is significantly larger.
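As a rough illustration of the metric, the sketch below estimates $P(X_m > Y_m)$ on each task by comparing every run of X against every run of Y, counting ties as one half (the Mann-Whitney U statistic normalized by the number of pairs), and then averages over tasks. The function names and the dictionary layout are assumptions made here, and the sketch omits the bootstrap confidence intervals that a full evaluation would report.

```python
def prob_x_beats_y(x_scores, y_scores):
    """Fraction of (x, y) run pairs where x scores higher; ties count as 0.5."""
    total = 0.0
    for x in x_scores:
        for y in y_scores:
            if x > y:
                total += 1.0
            elif x == y:
                total += 0.5
    return total / (len(x_scores) * len(y_scores))


def probability_of_improvement(x_by_task, y_by_task):
    """Average the per-task probabilities over all tasks in the suite."""
    tasks = list(x_by_task)
    return sum(prob_x_beats_y(x_by_task[t], y_by_task[t]) for t in tasks) / len(tasks)


if __name__ == "__main__":
    x = {"walker_walk": [900.0, 950.0], "cheetah_run": [700.0, 700.0]}
    y = {"walker_walk": [800.0, 850.0], "cheetah_run": [700.0, 750.0]}
    print(probability_of_improvement(x, y))
```

Because only the ordering of scores enters the computation, the result is insensitive to how large each improvement is, which is why it is read alongside the IQM and mean.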

Figure C.1: Probability of improvement between all tested baselines and AILBoost.



C.3.2 Learning Curves

Here we present the complete suite of learning curves for all 5 environments.

C.3.3 Learning curves across different optimization schedules

Here we present the full suite of learning curves where we vary how often the policy

and the discriminator update relative to each other. We keep every other hyperparameter

constant in this ablation.



Figure C.2: Learning curves for AILBoost and all baselines on the DMC environments
with 10, 5, and 1 expert trajectories across 3 seeds.



[Figure C.3 plot. Panels: Ball in Cup Catch, Walker Walk, Quadruped Walk, Cheetah Run. x-axis: Samples (0.0 to 1.4, ×1e6); y-axis: Mean Score (0 to 1000). Legend: Expert; 1000 P, 100 D; 1000 P, 10 D; 1000 P, 1 D; 100 P, 100 D; 10 P, 100 D; 1 P, 100 D.]

Figure C.3: Learning curves for AILBoost on 4 out of the 5 DMC environments with
5 expert trajectories across 3 seeds, where we vary the number of policy updates and
discriminator updates the agent takes over time.



APPENDIX D

MISSING PROOFS AND DETAILS IN CHAPTER 5

D.1 Additional Related Work

LLM Alignment Using RLHF is one approach to aligning LLMs with human preferences. The RLHF objective incorporates a KL constraint and is equivalent to minimizing the reverse KL between the KL-control distribution and the learner. Minimizing some divergence between the policy used for the KL-control and the learner policy has been proposed for LLM alignment. Korbak et al. (2022), Khalifa et al. (2020), and Go et al. (2023) propose alignment ideas that attempt to minimize various divergences inspired by maximum entropy RL (Haarnoja et al., 2017, 2018a) and Distributional Policy Gradient (DPG) (Barth-Maron et al., 2018). Depending on the chosen divergence, the desired policy behavior may be easy or hard to obtain. Another collection of ideas for alignment focuses on aspects of the supervised learning data, for example curating the collected data (Zhou et al., 2023; Chung et al., 2022).

Restart Distribution On-policy RL algorithms are not able to take advantage of past visited states. But incorporating the ability to reset to any arbitrary state allows on-policy methods to create new states from past visited states (Tavakoli et al., 2018). The core of the idea is to use past visited states to modify the initial state distribution. Our work introduces PPO++, an algorithm that has no prior over past visited states, whereas Tavakoli et al. (2018) consider incorporating priors to help decide how to prioritize which past visited states to incorporate into the initial state distribution. Agarwal et al. (2020b) showed theoretically that the initial state distribution helps with exploration. Modifying the initial state distribution using restarts has seen success in Montezuma's Revenge on the Atari 2600 (a hard exploration problem) and in Atari 2600 games more broadly (Popov et al., 2017; Salimans and Chen, 2018; Ecoffet et al., 2019; Florensa et al., 2017).

NLP with Human Feedback Learning from human feedback has been studied in the past in the context of bandit feedback (Nguyen et al., 2017; Sokolov et al., 2016), pairwise feedback (Scheurer et al., 2023; Chen et al., 2023), and other feedback forms (Kreutzer et al., 2018a; Sumers et al., 2021; Hancock et al., 2018; Wu et al., 2021). RLHF has been an active area of research employing RL as the main strategy to align LMs with human preferences (Ouyang et al., 2022; Bai et al., 2022a; Bakker et al., 2022; OpenAI, 2023; Nakano et al., 2021; Wu et al., 2021; Stiennon et al., 2020; Ziegler et al., 2019). A remarkable result in this line of work is ChatGPT (OpenAI, 2023). The general process involves learning a preference reward model induced by human preferences and then finetuning with RL using this learned preference model.

LLM Finetuning from AI Feedback: Despite being easier to collect than expert data, high-quality human preference data collection is a key bottleneck of scaling RL finetuning for LLMs. A growing body of work enlists the help of LLMs to augment various parts of the RLHF procedure. ConstitutionalAI and RLAIF (Bai et al., 2022b; Lee et al., 2023a) explore using LLMs to generate preference datasets on which to train reward models, while Roit et al. (2023), Yang et al. (2023), and Kwon et al. (2023) find directly generating reward signals from another LLM to be effective. Separate from this literature, we investigate utilizing direct LLM feedback during the generation process, reminiscent of RL algorithms utilizing expert interactive feedback.

RL for Text Understanding and Generation: RL has been used to train text generation models for dialogue (Li et al., 2016), text simplification (Zhang and Lapata, 2017), machine translation (Kiegeland and Kreutzer, 2021; Wu et al., 2016; Shen et al., 2015), image captioning (Ren et al., 2017), and question generation (Pang and He, 2021). RL has also been used to create models that take actions given a text, such as for instruction following (Hermann et al., 2017; Misra et al., 2017), text games (Narasimhan et al., 2015; Côté et al., 2019; Ammanabrolu and Riedl, 2018), and code generation (Zhong et al., 2017). These methods typically use policy gradient based RL. Recently, Ramamurthy et al. (2022a) studied online RL for text generation across a wide range of tasks, specifically studying Proximal Policy Optimization (PPO) (Schulman et al., 2017b). Although their results comparing RL and SL are mixed, we build upon their work and show the benefit of RL, with RLGF ultimately outperforming both SL and RL. Separately, Snell et al. (2022) study offline RL in the context of text generation, whereas our work studies the online case.



D.2 Additional Algorithms

Algorithm 15 gives a detailed algorithm for LOLS, which combines reinforcement learning and imitation learning differently than D2LOLS. Rather than using $\alpha$ as the stopping time at which to switch from AggreVaTeD to PPO++, LOLS mixes AggreVaTeD and PPO++ at every iteration with mixing probability $\alpha$, using the advantage $\alpha A^{\pi^t_\theta} + (1 - \alpha)A^{\pi^g}(s,a)$. As discussed in Section 5.5, we find that LOLS underperforms D2LOLS, even in practice.

Algorithm 15 LOLS: combine PPO and AggreVaTeD
1: Input: $\pi_\theta$, reference $\pi^g$, iterations $T$, dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$
2: Input: mixing parameter $\beta_1 \in [0,1]$, mixing parameter $\beta_2 \in [0,1]$, mixing prob $\alpha$
3: for $t = 0, 1, \ldots, T-1$ do
4:   ▷ PPO++
5:   Rollin with $\beta_1\pi^g + (1 - \beta_1)\pi^t_\theta$ starting from $x \sim \mathcal{D}$
6:   Rollout with $\pi^t_\theta$ to collect trajectories
7:   Update $V^{\pi^t_\theta}_\phi$ with trajectories and compute advantage estimates $A^{\pi^t_\theta}$
8:   ▷ AggreVaTeD
9:   Rollin with $\beta_2\pi^t_\theta + (1 - \beta_2)\pi^g$ starting from $x \sim \mathcal{D}$
10:  Rollout with $\pi^g$ to collect trajectories
11:  Update $V^{\pi^g}_\phi$ with trajectories and compute advantage estimates $A^{\pi^g}(s,a)$
12:  ▷ Mix Update
13:  Update $\pi_\theta$ using PPO loss with $\alpha A^{\pi^t_\theta} + (1 - \alpha)A^{\pi^g}(s,a)$
14: end for



D.3 Additional Experimental Details

D.3.1 KL Reward Constraint

In addition to sequence-level task rewards, per-token KL rewards are applied to prevent the policy π from deviating too far from the pre-trained LM π0, following Ziegler et al. (2019) and Ouyang et al. (2022). Formally, the regularized reward function is defined as

\[
\hat{R}(s_t, a_t, y) = R(s_t, a_t, y) - \lambda\, \mathrm{KL}\big(\pi(a_t|s_t)\,\|\,\pi_0(a_t|s_t)\big),
\]

where $\mathrm{KL}\big(\pi(a_t|s_t)\,\|\,\pi_0(a_t|s_t)\big) = \log \pi(a_t|s_t) - \log \pi_0(a_t|s_t)$ and λ is the KL coefficient (Ouyang et al., 2022). Note that we use a fixed KL coefficient rather than an adaptive controller.
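As a minimal sketch of this reward shaping, the snippet below computes per-token regularized rewards from log-probabilities. It assumes that the sequence-level task reward is credited at the final generated token, which is a common convention but is not stated in the text; the function name `kl_regularized_rewards` is hypothetical.

```python
import numpy as np


def kl_regularized_rewards(task_reward, logp_policy, logp_ref, kl_coef):
    """Per-token rewards: -kl_coef * (log pi - log pi_0), plus the task reward at the end.

    Assumption (not from the source): the sequence-level reward is added to the
    last token only; every other token receives just the KL penalty.
    """
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    if logp_policy.shape != logp_ref.shape or logp_policy.ndim != 1:
        raise ValueError("expected two 1-D arrays of equal length")
    kl_estimate = logp_policy - logp_ref      # single-sample estimate per token
    rewards = -kl_coef * kl_estimate          # fixed coefficient, no adaptive controller
    rewards[-1] += task_reward
    return rewards


if __name__ == "__main__":
    print(kl_regularized_rewards(1.0, [-1.0, -2.0], [-1.5, -1.0], kl_coef=0.1))
```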

D.3.2 Task Details

Task              Train/Val/Test     Prompt                                             Gen. Length

IMDB              25K/5K/5K          Partial movie review up to 64 tokens               48
CommonGen         32651/993/1497     "Generate a sentence with: " set of 3-5 concepts   20
TL;DR             117000/6450/6550   "TL;DR: "                                          50
TL;DR Preference  92500/3300/8300    "TL;DR: "                                          N/A

Table D.1: Train, val, test splits, prompts, and max generation length used for each task.

IMDB: We experiment on the IMDB dataset for positive movie review generation.

As shown in Table D.1, the dataset consists of 25k training, 5k validation, and 5k test prompts of movie review text with either positive or negative sentiment labels. As input to our models, we use partial movie reviews that are at most 64 tokens long and ask the model to complete the review with a positive sentiment using at most 48 generated tokens.



CommonGen: CommonGen (Lin et al., 2020b) is a common sense text generation task where the model is given a set of concepts (e.g., hockey, rink, game) and is asked to generate a semantically correct sentence using those concepts (e.g., the hockey team played a game at the rink). We follow the same splits as the dataset creators and refer readers to Table 1 of Lin et al. (2020b) for more in-depth statistics of the dataset. In our experiments, we prompted our models with "generate a sentence with: " and generated at most 20 tokens. We chose this generation length based on the maximum token length of the references in the training dataset.

TL;DR Summarization: Following Stiennon et al. (2020), we evaluate on the summa-

rization task. We use CarperAI/openai_summarize_comparisons for the preference

reward training dataset and CarperAI/openai_summarize_tldr for the RL training

dataset. For the SFT model that we use for our starting policy and our guide policy,

we use the publicly available checkpoint CarperAI/openai_summarize_tldr_sft. We

truncated/padded each prompt to 500 tokens on the GPT-J 6B tokenizer.

We first train our reward model using LoRA adapters. We train the reward model for 1 epoch, reaching 70% accuracy on the test set. With this reward model, we run all of our experiments, where our policy and critic are both LoRA adapters trained on top of the SFT checkpoint.

Win Rate: We calculated the win rate against the dataset references using Llama2-13B-chat (Touvron et al., 2023), which is publicly available on HuggingFace. Following DPO (Rafailov et al., 2023a), we prompt the model with instructions, two summaries (A) and (B), and instructions on how to answer. We randomize which summary is (A) or (B) when calculating the win rate over the test set. Below is our prompt skeleton:



<<SYS>>

You are an expert summary evaluator and can consistently

distinguish between good and bad summaries. You provide

informative, correct evaluations.

<<\SYS>>

Task: Judge the quality of two TLDRs, choose the options

among (A) or (B)

context: [context]

tldr (A): [summary 1]

tldr (B): [summary 2]

FIRST provide a one-sentence comparison of the two summaries,

explaining which you prefer and why. SECOND, on a new line,

state only (A) or (B) to indicate your choice. Your

response should use hte format:

Comparison: <one-sentence comparison and explanation>

Preferred: <(A) or (B)>
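The randomized A/B comparison described above can be sketched as follows. This is an illustrative outline only: `judge` stands in for a call to the evaluator model and is a hypothetical callable, the template below reproduces only the task portion of the skeleton, and skipping responses that cannot be parsed is an assumption rather than the documented procedure.

```python
import random
import re

# Task portion of the prompt skeleton; the system header is omitted in this sketch.
TEMPLATE = (
    "Task: Judge the quality of two TLDRs, choose the options among (A) or (B)\n\n"
    "context: {context}\n\n"
    "tldr (A): {summary_a}\n\n"
    "tldr (B): {summary_b}\n\n"
    "Comparison: <one-sentence comparison and explanation>\n\n"
    "Preferred: <(A) or (B)>"
)


def build_prompt(context, summary_a, summary_b):
    return TEMPLATE.format(context=context, summary_a=summary_a, summary_b=summary_b)


def parse_choice(response):
    """Extract 'A' or 'B' from the judge's final 'Preferred:' line, or None."""
    matches = re.findall(r"Preferred:\s*\(?([AB])\)?", response)
    return matches[-1] if matches else None


def win_rate(examples, judge, seed=0):
    """examples: iterable of (context, model_summary, reference_summary)."""
    rng = random.Random(seed)
    wins, counted = 0, 0
    for context, model_summary, reference in examples:
        model_is_a = rng.random() < 0.5          # randomize which summary is (A)
        if model_is_a:
            prompt = build_prompt(context, model_summary, reference)
        else:
            prompt = build_prompt(context, reference, model_summary)
        choice = parse_choice(judge(prompt))
        if choice is None:                       # assumption: skip unparseable replies
            continue
        counted += 1
        wins += int((choice == "A") == model_is_a)
    return wins / counted if counted else float("nan")
```

Randomizing the slot per example guards against a judge that systematically favors position (A) or (B).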

D.3.3 IMDB - Algorithm Details

Table D.2 lists the hyperparameters used in our IMDB experiments. Note that we used

the same parameters here for all guide policies. Across all algorithms, we shared the

same parameters as the ones we used for our PPO baseline. Finally, we use top-k sampling

with K = 50 as the decoding method and for fair comparison, we keep this setting for all

methods.
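For readers unfamiliar with the decoding method, top-k sampling restricts each sampling step to the K highest-scoring tokens and renormalizes over them. The following is a minimal NumPy sketch of a single step, not the decoding code used in the experiments; `top_k_sample` is a hypothetical helper name.

```python
import numpy as np


def top_k_sample(logits, k, rng):
    """Sample one token id from the k highest-logit tokens (softmax over those k)."""
    logits = np.asarray(logits, dtype=float)
    k = min(k, logits.size)
    top_idx = np.argpartition(logits, -k)[-k:]   # indices of the k largest logits
    top_logits = logits[top_idx]
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()
    return int(rng.choice(top_idx, p=probs))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab_logits = rng.normal(size=1000)
    print(top_k_sample(vocab_logits, k=50, rng=rng))  # K = 50 as in Table D.2
```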

Setting      Values

model        GPT2

PPO          steps per update: 1280
             total number of steps: 128000
             batch size: 64
             epochs per update: 5
             learning rate: 1e-6
             discount factor: 0.99
             gae lambda: 0.95
             clip ratio: 0.2
             value function coeff: 0.5
             λ: 0.001
             η: 0.1

PPO++        Mixing Parameter (β): 0.2

AggreVaTeD   Mixing Parameter (β): 0.8

LOLS         Mixing Probability (α): 0.8

D2LOLS       Stopping Time Iteration (α): 20

decoding     sampling: true
             top k: 50
             min length: 48
             max new tokens: 48

tokenizer    padding side: left
             truncation side: left
             max length: 64

Table D.2: Hyperparameters used for IMDB. Note that PPO++, AggreVaTeD, LOLS, and D2LOLS all share the same PPO parameters. All processes use the same decoding and tokenizer parameters.

D.3.4 CommonGen - Algorithm Hyperparameters

Table D.3 lists the hyperparameters used in our CommonGen experiments. Note that we used the same parameters here for all guide policies. Across all algorithms, we shared the same parameters as the ones we used for our PPO baseline. Finally, we use beam search with the number of beams = 5 as the decoding method for inference. Note that for training, we still used softmax sampling with default temperature. For fair comparison, we keep this setting for all methods. Finally, note that for CommonGen, we set the KL coefficient to 0.

Setting      Values

model        T5

PPO          steps per update: 2560
             total number of steps: 1,280,000
             batch size: 640
             epochs per update: 4
             learning rate: Linear decay 1e-5
             discount factor: 0.99
             gae lambda: 0.95
             clip ratio: 0.2
             value function coeff: 30.0
             λ: 0.001
             η: 0.1

PPO++        Mixing Parameter (β): 0.2

AggreVaTeD   Mixing Parameter (β): 0.8

LOLS         Mixing Probability (α): 0.8

D2LOLS       Stopping Time Iteration (α): 200

decoding     num beams: 5
             min length: 5
             max new tokens: 20

tokenizer    padding side: left
             max length: 20

Table D.3: Hyperparameters used for CommonGen. Note that PPO++, AggreVaTeD, LOLS, and D2LOLS all share the same PPO parameters. All processes use the same decoding and tokenizer parameters.

D.3.5 TL;DR Summarization - Algorithm Hyperparameters

Table D.4 lists the hyperparameters used in our TL;DR summarization experiments. Note that we used the same parameters here for all guide policies. Across all algorithms, we shared the same parameters as the ones we used for our PPO baseline. The reward LoRA adapter was used to pretrain our model. During RL training, we kept the reward adapter frozen.

Setting      Values

model        GPT-J

PPO          steps per update: 70,400
             total number of steps: 7,040,000
             batch size: 128
             epochs per update: 4
             learning rate: 1e-5
             discount factor: 1.0
             gae lambda: 0.95
             clip ratio: 0.2
             value function coeff: 0.2
             λ: 0.002
             η: 0.1

PPO++        Mixing Parameter (β): 0.6

AggreVaTeD   Mixing Parameter (β): 1.0

decoding     max new tokens: 50

tokenizer    padding side: left
             truncation side: right
             max length: 50

RM LoRA      r: 32
             α: 32
             dropout: 0.0

Policy LoRA  r: 8
             α: 64
             dropout: 0.1

Critic LoRA  r: 8
             α: 64
             dropout: 0.1

Table D.4: Hyperparameters used for TL;DR summarization. Note that PPO++ and AggreVaTeD share the same PPO parameters.



D.4 IMDB Qualitative Examples

Sample 1

Prompt: not a complex character, but totally rotten. The relation between him and Gisburne is just

hilarious. Actually, just looking at de Rainault sitting in his throne, bored, glaring, makes me

laugh even before he has said anything. Another actor that deserves extra praise is Ray Winstone as

Will Scarlet. You

GPT2: know, when you think he’s a bad actor, and when you think the script is all about the bad actors.

Scarlet is the typical actor with a weak, pathetic voice, who fails to show emotion or self-respect,

and who

BC: will not be disappointed. The opening scenes are amazing and the final scene of the film is amazing.

Every single minute of this film is a treat with every emotion the characters are experiencing.

From the first frame of the

PPO: know he likes old movies. I gave this film a 3 and a half out of 10. And, of course, I don’t blame

you for falling asleep! I gave it a 10 because of my favorite

PPO++: see he seems only interested in money, which really makes the two seem like friends. One

particularly funny scene involves Winstone and the rest of the cast in their studio for the first

time since "Rampage". I absolutely LOVE the way

AggreVaTeD: will laugh heartily; he is brilliant and always entertaining. I also think the film is just

wonderful. I love what Van Dyke is able to bring to the film. If you want to go great, I’m sure Van

Dyke

LOLS: great film. I absolutely loved his style and direction. One of my favorite movies of all time. The

first 25 minutes are really good and really enjoyable. It’s actually very touching and engaging.

The cast, including James Gandolfini

D2LOLS: could never have come across this again. It was one of the funniest films I’ve ever seen, and it

remains my favorite movie ever. I am absolutely honored and very excited to see it again!!! I

really love it and look forward

Sample 2

Prompt: comes from Zekeria Ebrahimi, who plays the young Amir. It’s not that it’s bad, it’s just that it’

s not as effective as it could be. He just can’t translate the guilt that consumes Amir to the

audience. In fact, I think I might have been a little lost at

GPT2: first. The movie has too much going for it, too many unanswered questions and some of the

characters not getting the emotional payoff in a reasonable enough manner. It’s nice to see a

filmmaker who isn’t a hater, but I feel

BC: the second half as it wasn’t shown at all until the very end, at best. I liked the plot (which I

think was very well done, considering it’s a "bad cop story"). It’s not

PPO: first. After listening to the great Ghazala, who’s brilliant in her acting, I am very impressed

with her acting performance. I loved how you have a young and very attractive girl named Zekeria

who is portrayed quite well and

PPO++: last week’s deadline. It’s quite a clever and good film, as well as an entertaining one. 4/10.

Recommended. 8/10

AggreVaTeD: first. The movie really deserves the credit - it really is a very moving story and a really

moving movie that is an amazing film. I personally love Amir and think it’s an extremely powerful

film. It’s one of the things that makes

LOLS: some points. As a whole it’s pretty enjoyable and is very well-acted and well edited, and is quite

a compelling, quirky and fun film. And just as important, I found some very well-acted, and well-

acted

D2LOLS: first, when Amir realizes what’s really going on. There are many terrific performances in this

beautifully directed film. I can’t say which is one of the best, which is certainly a shame because

it’s an enjoyable

Figure D.1: Qualitative Analysis of IMDB


D.5 CommonGen Qualitative Examples

Sample 1

Prompt: generate a sentence with: dress perform stage.

T5: dress up and perform on stage

BC: A young girl in a black dress is performing on a stage.

PPO: performers dress up on the stage.

PPO++: a woman performing on the stage dressed up.

AggreVaTeD: young woman performing on stage in a black dress.

LOLS: people performing on the stage in a dressed up.

D2LOLS: A woman is performing on the stage in a dress.

Sample 2

Prompt: generate a sentence with: ball fall throw.

T5: ball falls off the ground

BC: A man throws a ball that has fallen to the ground.

PPO: the ball falling from the throw.

PPO++: the ball is thrown and the ball falls.

AggreVaTeD: A man is throwing the ball into the ground.

LOLS: throwing the ball after it fell from the sky.

D2LOLS: A man is throwing the ball after it fell.

Sample 3

Prompt: generate a sentence with: arm chest fold.

T5: arm folds in the chest

BC: He folds his arms over his chest, then he folds his arms over.

PPO: folded the arms in the chest.

PPO++: a man with his arms folded in the chest.

AggreVaTeD: folding his arm over his chest.

LOLS: A man folds his arms in the chest.

D2LOLS: A man with his arms folded in the chest.

Figure D.2: Qualitative Analysis of CommonGen


D.6 TL;DR Qualitative Examples

Content Warning: This section contains examples of harmful language.



Sample 1

Prompt: SUBREDDIT: r/AskReddit

TITLE: Wife’s "friend" is stealing money. Any ideas on how to approach this?

POST: So my wife has a friend that I will call Julie. My wife and Julie have been friends for about 6

years if I remember correctly. Julie is a shit friend in my opinion and I have told the wife my

feelings. When they lived together Julie would start physical fights with my wife and spat on my

wife’s face once. Now when Julie comes over to our house there is a 98% chance that something will

be broken and really really dirty when she leaves.

Julie does not have any other good close friends in our town and her closest family is 5 hours away. I

am thinking that is why my wife doesn’t get rid of Julie. About 7 months ago we were all outside

and Julie said that she needed to go inside to use the restroom. As we soon outside I noticed that

our bedroom light turned on, stayed on for less than a minute, and then went off.

Later that night after Julie left when the wife and I went inside and the wife discovered that her purse

had been moved and that she was missing $20. I looked at my stuff and my checkbook had been moved.

The only other person in our house that night was Julie and she was left alone.

On Saturday my wife went shopping with Julie. Again for some reason my wife left her purse with Julie

alone. Today my wife was trying to buy gas and is now missing $30.

Before Julie started to steal from us her car battery died and she called us. Stupidly I paid for and

installed a new one for her. I still haven’t been paid back for that $87.

I am trying to figure out if it is time for me to say something to Julie or do I let the wife handle it?

BC: Wife’s "friend" is stealing money. Any ideas on how to approach this?

PPO: Wife’s "friend" is stealing money. Any ideas on how to approach this? Wife’s "friend" is a shit

friend in my opinion and I have told the wife my feelings.

PPO++: Wife’s "friend" is stealing money. Any ideas on how to approach this? Wife doesn’t have any other

good close friends in our town and her closest family is 5 hours away.

AggreVaTeD: Wife’s "friend" is stealing money. Any ideas on how to approach this? Wife and I have lost

$120 in the past 2 weeks.

Sample 2

Prompt: SUBREDDIT: r/Pets

TITLE: My sister’s dog was picked up by animal control, $150 to get him out and she can’t afford it. He’

ll be euthanized in 2 days if she doesn’t pay up.

POST: My sister has had this dog for the longest time, he’s been shuffled around our family from time to

time when she couldn’t keep him due to being a renter. I had him for a year and had to give him

back to her because I moved into a smaller place. I told her to keep him until the end of this year

and I would take him back but he managed to get out of his yard today and got locked up by animal

control it’s going to cost her $150 to get him out or he’ll be put to sleep! He’s about 11 years

old, the sweetest beagle in the world! He doesn’t bark he’s great with kids and he’s just a bug ol

teddy bear! I don’t know what to do!! I can’t let him be put to sleep but I can’t afford to help

her! :-(

BC: My sister’s dog was picked up by animal control, $150 to get him out and she can’t afford it. He’ll

be put to sleep if she doesn’t pay up.

PPO: Sister’s dog was picked up by animal control, $150 to get him out or he’ll be put to sleep. He’s

about 11 years old, the sweetest beagle in the world! He doesn’t bark he’s great with

PPO++: Sister’s dog was picked up by animal control, $150 to get him out and she can’t afford it. He’ll

be euthanized in 2 days if she doesn’t pay up.

AggreVaTeD: sister’s dog got locked up by animal control, $150 to get him out or he’ll be put to sleep.

He’s about 11 years old, the sweetest beagle in the world! He doesn’t bark he’s great with kids

Figure D.3: Qualitative Analysis of TL;DR.


APPENDIX E

MISSING PROOFS AND DETAILS IN CHAPTER 6

E.1 DR-PO with NPG

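Algorithm 16 DR-PO with NPG update

1: Input: labeled preference dataset $\mathcal{D}_R$, unlabeled dataset $\mathcal{D}_{TR}$, reward function class $\mathcal{R}$, Q function class $\mathcal{F}$, regularization parameter $\lambda$, stepsize $\eta$, total number of iterations $T$.
2: Initialize: $\pi^1 = \pi^{\mathrm{SFT}}$.
3: Learn a reward model $\hat{r}$ via MLE based on Eq. (6.1).
4: Let $N_0 \leftarrow N/T$. Partition $\mathcal{D}_{TR}$ into $\{\mathcal{D}_{TR,t} := \{\tau^{t,n}\}_{n=1}^{N_0}\}_{t\in[T]}$ with an equal size.
/* Policy Evaluation with Dataset Reset */
5: for $t = 1, \cdots, T$ do
6:     for $n = 1, \cdots, N_0$ do
7:         Sample $h \sim \mathrm{Unif}([H])$ and pick the state at step $h$ from $\tau^{t,n}$, denoted by $s^{t,n}_h$.
8:         Take action $a^{t,n}_h \sim \big(\tfrac{1}{2}\pi^{\mathrm{SFT}}(s^{t,n}_h) + \tfrac{1}{2}\pi^{t}(s^{t,n}_h)\big)$ and then execute $\pi^t$ to step $H$.
9:         Denote the trajectory by $(s_h, a_h, \cdots, s_H, a_H)$ and let $y^{t,n}_h = \sum_{h'=h}^{H} \hat{r}_{h'}(s_{h'}, a_{h'})$.
10:        Add $(s^{t,n}_h, a^{t,n}_h, y^{t,n}_h)$ into $\mathcal{D}_t$.
11:    end for
12:    Compute
       \[ \hat{Q}^t = \operatorname*{argmin}_{f\in\mathcal{F}} L_{\mathcal{D}_t}(f) := \frac{1}{N_0} \sum_{(s,a,y)\in\mathcal{D}_t} \big[(f(s,a) - y)^2\big]. \]
/* NPG Update */
13:    Compute for all $s \in \mathcal{S}$:
       \[ \pi^{t+1}(s) = \operatorname*{argmin}_{p\in\Delta(\mathcal{A})} \big\langle -\hat{Q}^t(s,\cdot), p \big\rangle + \lambda\,\mathrm{KL}(p\,\|\,\pi^{\mathrm{SFT}}(s)) + \frac{1}{\eta}\,\mathrm{KL}(p\,\|\,\pi^{t}(s)). \]
14: end for
15: Output: $\hat{\pi} = \mathrm{Unif}(\{\pi^t\}_{t=1}^{T})$.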
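For a single state with a finite action set, the NPG update in line 13 has a closed form: setting the gradient of the Lagrangian to zero gives $\pi^{t+1}(a|s) \propto \exp\big((\hat{Q}^t(s,a) + \lambda \log \pi^{\mathrm{SFT}}(a|s) + \tfrac{1}{\eta}\log \pi^t(a|s)) / (\lambda + \tfrac{1}{\eta})\big)$. The sketch below implements this for tabular distributions, assuming every action has positive probability under both $\pi^t$ and $\pi^{\mathrm{SFT}}$; the function names are hypothetical and the snippet is an illustration, not the thesis implementation.

```python
import numpy as np


def npg_update(q_hat, pi_t, pi_sft, lam, eta):
    """Closed-form minimizer of <-q, p> + lam*KL(p||pi_sft) + (1/eta)*KL(p||pi_t)."""
    q_hat, pi_t, pi_sft = (np.asarray(x, dtype=float) for x in (q_hat, pi_t, pi_sft))
    logits = (q_hat + lam * np.log(pi_sft) + np.log(pi_t) / eta) / (lam + 1.0 / eta)
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits)
    return p / p.sum()


def npg_objective(p, q_hat, pi_t, pi_sft, lam, eta):
    """Objective of line 13 evaluated at a distribution p with full support."""
    def kl(a, b):
        return float(np.sum(a * (np.log(a) - np.log(b))))
    return float(-np.dot(q_hat, p)) + lam * kl(p, pi_sft) + kl(p, pi_t) / eta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pi_sft, pi_t = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
    q = rng.normal(size=4)
    p_next = npg_update(q, pi_t, pi_sft, lam=0.1, eta=0.5)
    print(p_next, npg_objective(p_next, q, pi_t, pi_sft, 0.1, 0.5))
```

Equivalently, the new policy is a geometric mixture of $\pi^t$ and $\pi^{\mathrm{SFT}}$ tilted by $\exp(\eta \hat{Q}^t / (1 + \eta\lambda))$, which makes the role of both KL terms explicit.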

E.2 Proof of Theorem 26

First, we relax the single-policy concentrability in Assumption 24 to the following assumptions.


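Assumption 74 (single-policy concentrability w.r.t. the reward class). Suppose that we have:

\[
\max\left\{0,\ \sup_{r\in\mathcal{R}} \frac{\mathbb{E}_{\tau_0\sim d^{\pi^\star},\,\tau_1\sim d^{\pi^{\mathrm{SFT}}}}\big[r^\star(\tau_0) - r^\star(\tau_1) - r(\tau_0) + r(\tau_1)\big]}{\sqrt{\mathbb{E}_{\tau_0\sim d^{\pi^{\mathrm{SFT}}},\,\tau_1\sim d^{\pi^{\mathrm{SFT}}}}\big[|r^\star(\tau_0) - r^\star(\tau_1) - r(\tau_0) + r(\tau_1)|^2\big]}}\right\} \le C_r(\mathcal{R}).
\]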

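Assumption 75 (single-policy concentrability w.r.t. Q function class). Suppose that we have for all $t \in [T]$:

\[
\sup_{h\in[H],\, f\in\mathcal{F},\, \pi\in\{\pi^t,\pi^\star\}} \frac{\Big|\mathbb{E}_{s\sim d^{\pi^\star}_h,\, a\sim\pi(s)}\big[f(s,a) - \hat{Q}^{\pi^t,\hat{r}}(s,a)\big]\Big|}{\sqrt{\mathbb{E}_{s\sim d^{\pi^{\mathrm{SFT}}}_h,\, a\sim(\frac{1}{2}\pi^{\mathrm{SFT}}(s)+\frac{1}{2}\pi^{t}(s))}\Big[\big(f(s,a) - \hat{Q}^{\pi^t,\hat{r}}(s,a)\big)^2\Big]}} \le C_{\mathrm{eval}}(\mathcal{F}).
\]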

Assumption 76 (single-policy concentrability w.r.t. KL divergence). Suppose that we have:

\[
\sum_{h=1}^{H} \mathbb{E}_{s_h\sim d^{\pi^\star}_h}\big[\mathrm{KL}(\pi^\star(s_h)\,\|\,\pi^{\mathrm{SFT}}(s_h))\big] \le C_{\mathrm{KL}}.
\]

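Note that from the Cauchy-Schwarz inequality we have the following proposition:

Proposition 77. We have

\[
C_r(\mathcal{R}) \le \sqrt{\max_{\tau} \frac{d^{\pi^\star}(\tau)}{d^{\pi^{\mathrm{SFT}}}(\tau)}},
\qquad
C_{\mathrm{eval}}(\mathcal{F}) \le \sqrt{2\cdot \max_{h\in[H],\, s\in\mathcal{S}_h,\, a\in\mathcal{A}} \frac{d^{\pi^\star}_h(s,a)}{d^{\pi^{\mathrm{SFT}}}_h(s,a)}},
\qquad
C_{\mathrm{KL}} \le H \log\left(\max_{h\in[H],\, s\in\mathcal{S}_h,\, a\in\mathcal{A}} \frac{d^{\pi^\star}_h(s,a)}{d^{\pi^{\mathrm{SFT}}}_h(s,a)}\right).
\]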
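Proof. First, from the Cauchy-Schwarz inequality, we have

\[
\mathbb{E}_{\tau_0\sim d^{\pi^\star},\,\tau_1\sim d^{\pi^{\mathrm{SFT}}}}\big[r^\star(\tau_0) - r^\star(\tau_1) - r(\tau_0) + r(\tau_1)\big] \le \sqrt{\mathbb{E}_{\tau_0\sim d^{\pi^\star},\,\tau_1\sim d^{\pi^{\mathrm{SFT}}}}\big[|r^\star(\tau_0) - r^\star(\tau_1) - r(\tau_0) + r(\tau_1)|^2\big]}.
\]

Therefore we have

\[
C_r(\mathcal{R}) \le \sqrt{\sup_{r\in\mathcal{R}} \frac{\mathbb{E}_{\tau_0\sim d^{\pi^\star},\,\tau_1\sim d^{\pi^{\mathrm{SFT}}}}\big[|r^\star(\tau_0) - r^\star(\tau_1) - r(\tau_0) + r(\tau_1)|^2\big]}{\mathbb{E}_{\tau_0\sim d^{\pi^{\mathrm{SFT}}},\,\tau_1\sim d^{\pi^{\mathrm{SFT}}}}\big[|r^\star(\tau_0) - r^\star(\tau_1) - r(\tau_0) + r(\tau_1)|^2\big]}} \le \sqrt{\max_{\tau} \frac{d^{\pi^\star}(\tau)}{d^{\pi^{\mathrm{SFT}}}(\tau)}}.
\]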


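Similarly, we have:

\[
\Big|\mathbb{E}_{s\sim d^{\pi^\star}_h,\, a\sim\pi(s)}\big[f(s,a) - \hat{Q}^{\pi^t,\hat{r}}(s,a)\big]\Big| \le \sqrt{\mathbb{E}_{s\sim d^{\pi^\star}_h,\, a\sim\pi(s)}\Big[\big|f(s,a) - \hat{Q}^{\pi^t,\hat{r}}(s,a)\big|^2\Big]}.
\]

Therefore we have

\[
C_{\mathrm{eval}}(\mathcal{F}) \le \sqrt{\sup_{h\in[H],\, f\in\mathcal{F},\, \pi\in\{\pi^t,\pi^\star\}} \frac{\mathbb{E}_{s\sim d^{\pi^\star}_h,\, a\sim\pi(s)}\Big[\big|f(s,a) - \hat{Q}^{\pi^t,\hat{r}}(s,a)\big|^2\Big]}{\mathbb{E}_{s\sim d^{\pi^{\mathrm{SFT}}}_h,\, a\sim(\frac{1}{2}\pi^{\mathrm{SFT}}(s)+\frac{1}{2}\pi^{t}(s))}\Big[\big(f(s,a) - \hat{Q}^{\pi^t,\hat{r}}(s,a)\big)^2\Big]}}.
\]

Note that we have

\[
\sup_{h\in[H],\, f\in\mathcal{F}} \frac{\mathbb{E}_{s\sim d^{\pi^\star}_h,\, a\sim\pi^\star(s)}\Big[\big|f(s,a) - \hat{Q}^{\pi^t,\hat{r}}(s,a)\big|^2\Big]}{\mathbb{E}_{s\sim d^{\pi^{\mathrm{SFT}}}_h,\, a\sim(\frac{1}{2}\pi^{\mathrm{SFT}}(s)+\frac{1}{2}\pi^{t}(s))}\Big[\big(f(s,a) - \hat{Q}^{\pi^t,\hat{r}}(s,a)\big)^2\Big]} \le \max_{h\in[H],\, s\in\mathcal{S}_h,\, a\in\mathcal{A}} 2\cdot\frac{d^{\pi^\star}_h(s,a)}{d^{\pi^{\mathrm{SFT}}}_h(s,a)}.
\]

On the other hand, we know

\[
\sup_{h\in[H],\, f\in\mathcal{F}} \frac{\mathbb{E}_{s\sim d^{\pi^\star}_h,\, a\sim\pi^t(s)}\Big[\big|f(s,a) - \hat{Q}^{\pi^t,\hat{r}}(s,a)\big|^2\Big]}{\mathbb{E}_{s\sim d^{\pi^{\mathrm{SFT}}}_h,\, a\sim(\frac{1}{2}\pi^{\mathrm{SFT}}(s)+\frac{1}{2}\pi^{t}(s))}\Big[\big(f(s,a) - \hat{Q}^{\pi^t,\hat{r}}(s,a)\big)^2\Big]} \le \max_{h\in[H],\, s\in\mathcal{S}_h} 2\cdot\frac{d^{\pi^\star}_h(s)}{d^{\pi^{\mathrm{SFT}}}_h(s)} \le \max_{h\in[H],\, s\in\mathcal{S}_h,\, a\in\mathcal{A}} 2\cdot\frac{d^{\pi^\star}_h(s,a)}{d^{\pi^{\mathrm{SFT}}}_h(s,a)}.
\]

Therefore, we have

\[
C_{\mathrm{eval}}(\mathcal{F}) \le \sqrt{2\cdot \max_{h\in[H],\, s\in\mathcal{S}_h,\, a\in\mathcal{A}} \frac{d^{\pi^\star}_h(s,a)}{d^{\pi^{\mathrm{SFT}}}_h(s,a)}}.
\]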

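For $C_{\mathrm{KL}}$, we have

\[
\begin{aligned}
\sum_{h=1}^{H} \mathbb{E}_{s_h\sim d^{\pi^\star}_h}\big[\mathrm{KL}(\pi^\star(s_h)\,\|\,\pi^{\mathrm{SFT}}(s_h))\big]
&= \sum_{h=1}^{H} \sum_{s\in\mathcal{S}_h} d^{\pi^\star}_h(s) \sum_{a\in\mathcal{A}} \pi^\star(a|s) \log\frac{\pi^\star(a|s)}{\pi^{\mathrm{SFT}}(a|s)} \\
&= \sum_{h=1}^{H} \sum_{s\in\mathcal{S}_h,\, a\in\mathcal{A}} d^{\pi^\star}_h(s,a) \log\frac{\pi^\star(a|s)}{\pi^{\mathrm{SFT}}(a|s)} \\
&\le \sum_{h=1}^{H} \sum_{s\in\mathcal{S}_h,\, a\in\mathcal{A}} d^{\pi^\star}_h(s,a) \log\frac{\pi^\star(a|s)}{\pi^{\mathrm{SFT}}(a|s)} + \sum_{h=1}^{H} \sum_{s\in\mathcal{S}_h} d^{\pi^\star}_h(s) \log\frac{d^{\pi^\star}_h(s)}{d^{\pi^{\mathrm{SFT}}}_h(s)} \\
&= \sum_{h=1}^{H} \sum_{s\in\mathcal{S}_h,\, a\in\mathcal{A}} d^{\pi^\star}_h(s,a) \log\frac{\pi^\star(a|s)}{\pi^{\mathrm{SFT}}(a|s)} + \sum_{h=1}^{H} \sum_{s\in\mathcal{S}_h,\, a\in\mathcal{A}} d^{\pi^\star}_h(s,a) \log\frac{d^{\pi^\star}_h(s)}{d^{\pi^{\mathrm{SFT}}}_h(s)} \\
&= \sum_{h=1}^{H} \sum_{s\in\mathcal{S}_h,\, a\in\mathcal{A}} d^{\pi^\star}_h(s,a) \log\frac{d^{\pi^\star}_h(s,a)}{d^{\pi^{\mathrm{SFT}}}_h(s,a)}
\le H \log\left(\max_{h\in[H],\, s\in\mathcal{S}_h,\, a\in\mathcal{A}} \frac{d^{\pi^\star}_h(s,a)}{d^{\pi^{\mathrm{SFT}}}_h(s,a)}\right).
\end{aligned}
\]

□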

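With Proposition 77, we only need to prove the following theorem to validate Theorem 26:

Theorem 78. Suppose Assumptions 22, 23, 74, 75, and 76 hold. Then with probability at least $1 - \delta$, Algorithm 9 with NPG update (Algorithm 10) returns a policy $\hat{\pi}$ which satisfies

\[
V^{\pi^\star,r^\star}(s_1) - V^{\hat{\pi},r^\star}(s_1) \le \left(\sqrt{\frac{H r_{\max} T}{\lambda}} + C_r(\mathcal{R})\right)\epsilon_{\mathrm{MLE}} + \epsilon'_{\mathrm{PMD}},
\]

where

\[
\epsilon_{\mathrm{MLE}} := O\left(\sqrt{\frac{\kappa^2}{M}\log\frac{|\mathcal{R}|}{\delta}}\right),
\qquad
\epsilon'_{\mathrm{eval}} := O\left(\sqrt{\frac{C^2_{\mathrm{eval}}(\mathcal{F})\, T\, r^2_{\max}}{N}\log\frac{T|\mathcal{F}|}{\delta}}\right),
\]

\[
\epsilon'_{\mathrm{PMD}} := \frac{C_{\mathrm{KL}}}{\eta T} + \frac{H r^2_{\max}\eta}{2} + \lambda C_{\mathrm{KL}} + 2H\epsilon'_{\mathrm{eval}}.
\]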

In this section we provide the proof of Theorem 78. Our proof consists of three steps:

we first quantify the estimation error of the Q function incurred by LSR oracles – this

step only involves standard supervised learning analysis, then study the performance

guarantee of NPG, and lastly investigate how to deal with the reward uncertainty and

obtain the final suboptimality gap.

E.2.1 Q function Estimation Error

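We have the following lemma to bound the estimation error $\big|\hat{Q}(s,a) - Q^{\pi^t,\hat{r}}(s,a)\big|$:

Lemma 79. Fix any $\delta_1 \in (0, 1]$. With Assumption 23, we have with probability at least $1 - \delta_1$ that for all $t \in [T]$,

\[
\Big|\mathbb{E}_{h\sim\mathrm{Unif}([H]),\, s\sim d^{\pi^\star}_h,\, a\sim\pi(s)}\big[\hat{Q}^t(s,a) - Q^{\pi^t,\hat{r}}(s,a)\big]\Big| \le C_{\mathrm{eval}}(\mathcal{F})\sqrt{\frac{256\, r^2_{\max}}{N_0}\log\frac{2T|\mathcal{F}|}{\delta_1}} := \epsilon'_{\mathrm{eval}},
\]

where $\pi \in \{\pi^t, \pi^\star\}$.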

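Proof. From the guarantee of least squares (Lemma 94 in Appendix E.5), for any fixed $t \in [T]$, we have with probability at least $1 - \delta_1$ that

\[
\mathbb{E}_{h\sim\mathrm{Unif}([H]),\, s\sim d^{\pi^{\mathrm{SFT}}}_h,\, a\sim(\frac{1}{2}\pi^{\mathrm{SFT}}(s)+\frac{1}{2}\pi^{t}(s))}\Big[\big(\hat{Q}^t(s,a) - Q^{\pi^t,\hat{r}}(s,a)\big)^2\Big] \le \frac{256\, r^2_{\max}}{N_0}\log\frac{2|\mathcal{F}|}{\delta_1}.
\]

Taking a union bound over $t \in [T]$ (which replaces $\delta_1$ by $\delta_1/T$) and using the definition of $C_{\mathrm{eval}}(\mathcal{F})$ in Assumption 75, we have for all $t \in [T]$ that

\[
\Big|\mathbb{E}_{h\sim\mathrm{Unif}([H]),\, s\sim d^{\pi^\star}_h,\, a\sim\pi(s)}\big[\hat{Q}^t(s,a) - Q^{\pi^t,\hat{r}}(s,a)\big]\Big| \le C_{\mathrm{eval}}(\mathcal{F})\sqrt{\frac{256\, r^2_{\max}}{N_0}\log\frac{2T|\mathcal{F}|}{\delta_1}}.
\]

□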

E.2.2 NPG Analysis

In the following discussion we use $f(s)$ to denote the vector $f(s, \cdot)$ for all functions $f$. We have the following lemma, which indicates that NPG is able to find a near-optimal policy with respect to the estimated reward $\hat{r}$ (recall that $\epsilon'_{\mathrm{eval}}$ is defined in Lemma 79):

Lemma 80. Denote the event in Lemma 79 by $\mathcal{E}_1$. Then conditioned on $\mathcal{E}_1$, with Assumptions 23 and 76, we have

\[
V^{\pi^\star,\hat{r}}(s_1) - V^{\hat{\pi},\hat{r}}(s_1) \le \frac{C_{\mathrm{KL}}}{\eta T} + \frac{H r^2_{\max}\eta}{2} + 2H\epsilon'_{\mathrm{eval}} + \lambda C_{\mathrm{KL}} := \epsilon'_{\mathrm{PMD}}.
\]

Proof. In the following proof we use $g(\pi(s))$ to denote $\mathrm{KL}(\pi(s)\,\|\,\pi^{\mathrm{SFT}}(s))$ for any policy $\pi$. First note that from the update rule in Algorithm 9, due to first-order optimality, we know for all distributions $p \in \Delta(\mathcal{A})$ and all $t \in [T]$, $s \in \mathcal{S}$ that:

\[
\big\langle -\eta\hat{Q}^t(s) + (1 + \eta\lambda)\nabla g(\pi^{t+1}(s)) - \nabla g(\pi^t(s)),\ p - \pi^{t+1}(s) \big\rangle \ge 0. \tag{E.1}
\]


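This implies that for all $t \in [T]$, $s \in \mathcal{S}$, we have

\[
\begin{aligned}
&\big\langle \eta\hat{Q}^t(s), \pi^\star(s) - \pi^t(s) \big\rangle + \eta\lambda g(\pi^t(s)) - \eta\lambda g(\pi^\star(s)) \\
&= \big\langle \eta\hat{Q}^t(s) - (1 + \eta\lambda)\nabla g(\pi^{t+1}(s)) + \nabla g(\pi^t(s)),\ \pi^\star(s) - \pi^{t+1}(s) \big\rangle + \big\langle \nabla g(\pi^{t+1}(s)) - \nabla g(\pi^t(s)),\ \pi^\star(s) - \pi^{t+1}(s) \big\rangle \\
&\quad + \big\langle \eta\hat{Q}^t(s), \pi^{t+1}(s) - \pi^t(s) \big\rangle + \big\langle \eta\lambda\nabla g(\pi^{t+1}(s)),\ \pi^\star(s) - \pi^{t+1}(s) \big\rangle + \eta\lambda g(\pi^t(s)) - \eta\lambda g(\pi^\star(s)) \\
&\le \underbrace{\big\langle \nabla g(\pi^{t+1}(s)) - \nabla g(\pi^t(s)),\ \pi^\star(s) - \pi^{t+1}(s) \big\rangle}_{(1)} + \underbrace{\big\langle \eta\hat{Q}^t(s), \pi^{t+1}(s) - \pi^t(s) \big\rangle}_{(2)} \\
&\quad + \underbrace{\big\langle \eta\lambda\nabla g(\pi^{t+1}(s)),\ \pi^\star(s) - \pi^{t+1}(s) \big\rangle + \eta\lambda g(\pi^t(s)) - \eta\lambda g(\pi^\star(s))}_{(3)},
\end{aligned}
\]

where the last step is due to Equation (E.1). Now we bound terms (1), (2), and (3) respectively.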

Bounding term (1). Note that the KL divergence is indeed the Bregman divergence induced by $g$; therefore the following three point lemma holds true:

Lemma 81 (three point lemma). For any distributions $p_1(s), p_2(s), p_3(s) \in \Delta(\mathcal{A})$, we have

\[
\langle \nabla g(p_1(s)) - \nabla g(p_2(s)),\ p_3(s) - p_1(s) \rangle = \mathrm{KL}(p_3(s)\,\|\,p_2(s)) - \mathrm{KL}(p_3(s)\,\|\,p_1(s)) - \mathrm{KL}(p_1(s)\,\|\,p_2(s)).
\]

Proof. From the definition of $g$, we know $\nabla g(p(s)) = \log p(s) - \log \pi^{\mathrm{SFT}}(s) + 1$. This implies that

\[
\langle \nabla g(p_1(s)) - \nabla g(p_2(s)),\ p_3(s) - p_1(s) \rangle = \big\langle \log p_1(s) - \log p_2(s),\ p_3(s) - p_1(s) \big\rangle.
\]

Substituting the definition of KL divergence proves the lemma. □

From Lemma 81, we can rewrite (1) as follows:

\[
(1) = \mathrm{KL}(\pi^\star(s)\,\|\,\pi^t(s)) - \mathrm{KL}(\pi^\star(s)\,\|\,\pi^{t+1}(s)) - \mathrm{KL}(\pi^{t+1}(s)\,\|\,\pi^t(s)).
\]
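The three point identity is easy to sanity-check numerically. The sketch below, which assumes distributions with full support on a finite action set, compares both sides of Lemma 81 for random draws; the helper names are hypothetical.

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for strictly positive probability vectors."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


def three_point_gap(p1, p2, p3):
    """Left-hand side minus right-hand side of the three point lemma."""
    # grad g(p1) - grad g(p2) = log p1 - log p2, since the pi_SFT terms cancel.
    lhs = float(np.dot(np.log(p1) - np.log(p2), p3 - p1))
    rhs = kl(p3, p2) - kl(p3, p1) - kl(p1, p2)
    return lhs - rhs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p1, p2, p3 = rng.dirichlet(np.ones(5), size=3)
    print(abs(three_point_gap(p1, p2, p3)))  # should be at floating-point precision
```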



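Bounding term (2). From Hölder's inequality and Young's inequality, we have

\[
(2) \le \frac{1}{2}\big\|\pi^{t+1}(s) - \pi^t(s)\big\|_1^2 + \frac{\eta^2}{2}\big\|\hat{Q}^t(s)\big\|_\infty^2 \le \frac{1}{2}\big\|\pi^{t+1}(s) - \pi^t(s)\big\|_1^2 + \frac{\eta^2 r^2_{\max}}{2}.
\]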

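Bounding term (3). Since $g$ is convex, we know

\[
\big\langle \eta\lambda\nabla g(\pi^{t+1}(s)),\ \pi^\star(s) - \pi^{t+1}(s) \big\rangle \le \eta\lambda g(\pi^\star(s)) - \eta\lambda g(\pi^{t+1}(s)).
\]

This implies that

\[
(3) \le \eta\lambda\big(g(\pi^t(s)) - g(\pi^{t+1}(s))\big).
\]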

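In summary, we have for all $t \in [T]$, $s \in \mathcal{S}$ that

\[
\begin{aligned}
&\big\langle \eta\hat{Q}^t(s), \pi^\star(s) - \pi^t(s) \big\rangle + \eta\lambda g(\pi^t(s)) - \eta\lambda g(\pi^\star(s)) \\
&\le \big(\mathrm{KL}(\pi^\star(s)\,\|\,\pi^t(s)) - \mathrm{KL}(\pi^\star(s)\,\|\,\pi^{t+1}(s))\big) + \eta\lambda\big(g(\pi^t(s)) - g(\pi^{t+1}(s))\big) + \frac{\eta^2 r^2_{\max}}{2} + \Big(\frac{1}{2}\big\|\pi^{t+1}(s) - \pi^t(s)\big\|_1^2 - \mathrm{KL}(\pi^{t+1}(s)\,\|\,\pi^t(s))\Big) \\
&\le \big(\mathrm{KL}(\pi^\star(s)\,\|\,\pi^t(s)) - \mathrm{KL}(\pi^\star(s)\,\|\,\pi^{t+1}(s))\big) + \eta\lambda\big(g(\pi^t(s)) - g(\pi^{t+1}(s))\big) + \frac{\eta^2 r^2_{\max}}{2},
\end{aligned}
\]

where the last step is due to Pinsker's inequality.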

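This implies that

\[
\begin{aligned}
&\sum_{t=1}^{T}\sum_{h=1}^{H} \mathbb{E}_{s_h\sim d^{\pi^\star}_h}\Big[\big\langle \eta\hat{Q}^t(s_h), \pi^\star(s_h) - \pi^t(s_h) \big\rangle + \eta\lambda g(\pi^t(s_h)) - \eta\lambda g(\pi^\star(s_h))\Big] \\
&\le \sum_{h=1}^{H} \mathbb{E}_{s_h\sim d^{\pi^\star}_h}\big[\mathrm{KL}(\pi^\star(s_h)\,\|\,\pi^1(s_h)) - \mathrm{KL}(\pi^\star(s_h)\,\|\,\pi^{T+1}(s_h))\big] + \eta\lambda\sum_{h=1}^{H} \mathbb{E}_{s_h\sim d^{\pi^\star}_h}\big[g(\pi^1(s_h)) - g(\pi^{T+1}(s_h))\big] + \frac{HT r^2_{\max}\eta^2}{2} \\
&\le \sum_{h=1}^{H} \mathbb{E}_{s_h\sim d^{\pi^\star}_h}\big[\mathrm{KL}(\pi^\star(s_h)\,\|\,\pi^{\mathrm{SFT}}(s_h))\big] + \frac{HT r^2_{\max}\eta^2}{2} \le C_{\mathrm{KL}} + \frac{HT r^2_{\max}\eta^2}{2}.
\end{aligned}
\tag{E.2}
\]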


Note that here we use the fact that we initialize the policy as $\pi^1 = \pi^{\mathrm{SFT}}$ and thus $g(\pi^1(s)) = 0$. On the other hand, note that we have the following performance difference lemma, whose proof is deferred to Appendix E.5.3:

Lemma 82 (performance difference lemma). For any policies $\pi$, $\pi'$ and reward function $r$, we have:

\[
V^{\pi,r}(s_1) - V^{\pi',r}(s_1) = \sum_{h=1}^{H} \mathbb{E}_{s_h\sim d^{\pi}_h}\Big[\big\langle Q^{\pi',r}(s_h),\ \pi(s_h) - \pi'(s_h) \big\rangle\Big].
\]

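Now substitute Lemma 82 into Equation (E.2), and from Lemma 79 we have

\[
\frac{1}{T}\sum_{t=1}^{T}\big(V^{\pi^\star,\hat{r}}(s_1) - V^{\pi^t,\hat{r}}(s_1)\big) \le \frac{C_{\mathrm{KL}}}{\eta T} + \frac{H r^2_{\max}\eta}{2} + 2H\epsilon'_{\mathrm{eval}} + \lambda C_{\mathrm{KL}}.
\]

This is equivalent to

\[
V^{\pi^\star,\hat{r}}(s_1) - V^{\hat{\pi},\hat{r}}(s_1) \le \frac{C_{\mathrm{KL}}}{\eta T} + \frac{H r^2_{\max}\eta}{2} + 2H\epsilon'_{\mathrm{eval}} + \lambda C_{\mathrm{KL}},
\]

which concludes our proof. □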

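We would also like to bound the KL divergence between $\hat{\pi}$ and $\pi^{\mathrm{SFT}}$, as shown in the following lemma:

Lemma 83. We have for all $t \in [T]$ that

\[
\sum_{h=1}^{H} \mathbb{E}_{s_h\sim d^{\pi^t}_h}\big[\mathrm{KL}(\pi^t(s_h)\,\|\,\pi^{\mathrm{SFT}}(s_h))\big] \le \frac{H r_{\max}(t - 1)}{\lambda}.
\]

Proof. From the NPG update and the fact that $\pi^{t+1}$ is the minimizer, we know for all $t \in [T]$, $s \in \mathcal{S}$:

\[
\mathrm{KL}(\pi^{t+1}(s)\,\|\,\pi^{\mathrm{SFT}}(s)) - \mathrm{KL}(\pi^t(s)\,\|\,\pi^{\mathrm{SFT}}(s)) \le \frac{1}{\lambda}\big\langle \hat{Q}^t(s),\ \pi^{t+1}(s) - \pi^t(s) \big\rangle \le \frac{r_{\max}}{\lambda},
\]

where we utilize Assumption 23 in the second step.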


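Note that $\mathrm{KL}(\pi^1(s)\,\|\,\pi^{\mathrm{SFT}}(s)) = 0$ since $\pi^1 = \pi^{\mathrm{SFT}}$. This implies that for all $t \in [T]$:

\[
\mathrm{KL}(\pi^t(s)\,\|\,\pi^{\mathrm{SFT}}(s)) \le \frac{r_{\max}(t - 1)}{\lambda}.
\]

Therefore, we have

\[
\sum_{h=1}^{H} \mathbb{E}_{s_h\sim d^{\pi^t}_h}\big[\mathrm{KL}(\pi^t(s_h)\,\|\,\pi^{\mathrm{SFT}}(s_h))\big] \le \frac{H r_{\max}(t - 1)}{\lambda}.
\]

□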

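E.2.3 Unregularized Suboptimality Gap w.r.t. r⋆

Now we can start to prove Theorem 26. First we have

\[
\begin{aligned}
V^{\pi^\star,r^\star}(s_1) - V^{\hat{\pi},r^\star}(s_1)
&= \big(V^{\pi^\star,r^\star}(s_1) - V^{\pi^\star,\hat{r}}(s_1)\big) + \big(V^{\pi^\star,\hat{r}}(s_1) - V^{\hat{\pi},\hat{r}}(s_1)\big) + \big(V^{\hat{\pi},\hat{r}}(s_1) - V^{\hat{\pi},r^\star}(s_1)\big) \\
&= \underbrace{\Big(\mathbb{E}_{\tau\sim d^{\pi^\star}}\big[r^\star(\tau) - \hat{r}(\tau)\big] - \mathbb{E}_{\tau\sim d^{\pi^{\mathrm{SFT}}}}\big[r^\star(\tau) - \hat{r}(\tau)\big]\Big)}_{(1)} + \underbrace{\big(V^{\pi^\star,\hat{r}}(s_1) - V^{\hat{\pi},\hat{r}}(s_1)\big)}_{(2)} \\
&\quad + \underbrace{\frac{1}{T}\sum_{t=1}^{T}\Big(\mathbb{E}_{\tau\sim d^{\pi^{\mathrm{SFT}}}}\big[r^\star(\tau) - \hat{r}(\tau)\big] - \mathbb{E}_{\tau\sim d^{\pi^t}}\big[r^\star(\tau) - \hat{r}(\tau)\big]\Big)}_{(3)}.
\end{aligned}
\]

Next we bound terms (1), (2), and (3) respectively.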

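Bounding term (1). From the guarantee of MLE (Lemma 95 in Appendix E.5) we have with probability at least $1 - \delta_2$ that

\[
\mathbb{E}_{\tau_0\sim d^{\pi^{\mathrm{SFT}}},\,\tau_1\sim d^{\pi^{\mathrm{SFT}}}}\Big[\big|r^\star(\tau_0) - r^\star(\tau_1) - \hat{r}(\tau_0) + \hat{r}(\tau_1)\big|^2\Big] \le \frac{c_1\kappa^2\log(|\mathcal{R}|/\delta_2)}{M} := \epsilon^2_{\mathrm{MLE}}, \tag{E.3}
\]

where $c_1 > 0$ is a universal constant. Denote the event of the above inequality by $\mathcal{E}_2$. Then conditioned on $\mathcal{E}_2$, we have

\[
\begin{aligned}
(1) &= \mathbb{E}_{\tau_0\sim d^{\pi^\star},\,\tau_1\sim d^{\pi^{\mathrm{SFT}}}}\big[r^\star(\tau_0) - r^\star(\tau_1) - \hat{r}(\tau_0) + \hat{r}(\tau_1)\big] \\
&\le C_r(\mathcal{R})\sqrt{\mathbb{E}_{\tau_0\sim d^{\pi^{\mathrm{SFT}}},\,\tau_1\sim d^{\pi^{\mathrm{SFT}}}}\Big[\big|r^\star(\tau_0) - r^\star(\tau_1) - \hat{r}(\tau_0) + \hat{r}(\tau_1)\big|^2\Big]} \\
&\le C_r(\mathcal{R})\,\epsilon_{\mathrm{MLE}}.
\end{aligned}
\]

Bounding term (2). From Lemma 80, conditioned on $\mathcal{E}_1$, we have

\[
V^{\pi^\star,\hat{r}}(s_1) - V^{\hat{\pi},\hat{r}}(s_1) \le \epsilon'_{\mathrm{PMD}}.
\]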

Bounding term (3). Note that we have the following lemma in Appendix E.5:

Lemma 84. For any two distributions $d_1, d_2 \in \Delta(\mathcal{X})$ and non-negative function $f$ defined on $\mathcal{X}$, we have

\[
\mathbb{E}_{x\sim d_2}[f(x)] - \mathbb{E}_{x\sim d_1}[f(x)] \le \sqrt{2\,\mathrm{Var}_{d_2}[f]\,\mathrm{KL}(d_1\,\|\,d_2)},
\]

where $\mathrm{Var}_d[f]$ is the variance of $f$ under distribution $d$.

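With Lemma 84, we have for all $t \in [T]$,

\[
\begin{aligned}
\mathbb{E}_{\tau\sim d^{\pi^{\mathrm{SFT}}}}\big[r^\star(\tau) - \hat{r}(\tau)\big] - \mathbb{E}_{\tau\sim d^{\pi^t}}\big[r^\star(\tau) - \hat{r}(\tau)\big]
&\le \sqrt{2\,\mathrm{Var}_{d^{\pi^{\mathrm{SFT}}}}\big[r^\star - \hat{r}\big]\,\mathrm{KL}(d^{\pi^t}\,\|\,d^{\pi^{\mathrm{SFT}}})} \\
&= \sqrt{\mathbb{E}_{\tau_0\sim d^{\pi^{\mathrm{SFT}}},\,\tau_1\sim d^{\pi^{\mathrm{SFT}}}}\Big[\big|r^\star(\tau_0) - r^\star(\tau_1) - \hat{r}(\tau_0) + \hat{r}(\tau_1)\big|^2\Big]\,\mathrm{KL}(d^{\pi^t}\,\|\,d^{\pi^{\mathrm{SFT}}})} \\
&\le \epsilon_{\mathrm{MLE}}\sqrt{\mathrm{KL}(d^{\pi^t}\,\|\,d^{\pi^{\mathrm{SFT}}})},
\end{aligned}
\tag{E.4}
\]

where the second step comes from the fact that $2\,\mathrm{Var}_{x\sim d}[x] = \mathbb{E}_{x\sim d,\,x'\sim d}[(x - x')^2]$ and the last step comes from Equation (E.3).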

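On the other hand, the following lemma indicates that the KL divergence over the whole trajectory is indeed the summation of the per-step KL divergence:

Lemma 85. For any policies $\pi$ and $\pi'$, we have:

\[
\mathrm{KL}(d^{\pi}\,\|\,d^{\pi'}) = \sum_{h=1}^{H} \mathbb{E}_{s_h\sim d^{\pi}_h}\big[\mathrm{KL}(\pi(s_h)\,\|\,\pi'(s_h))\big].
\]

Proof. From the definition of KL divergence, we have:

\[
\begin{aligned}
\mathrm{KL}(d^{\pi}\,\|\,d^{\pi'})
&= \sum_{\tau} d^{\pi}(\tau)\log\frac{d^{\pi}(\tau)}{d^{\pi'}(\tau)}
= \sum_{\tau} d^{\pi}(\tau)\log\frac{\prod_{h\in[H]}\pi(a_h|s_h)}{\prod_{h\in[H]}\pi'(a_h|s_h)} \\
&= \sum_{\tau} d^{\pi}(\tau)\sum_{h=1}^{H}\log\frac{\pi(a_h|s_h)}{\pi'(a_h|s_h)}
= \sum_{h=1}^{H}\sum_{s\in\mathcal{S}_h,\, a\in\mathcal{A}} \log\frac{\pi(a|s)}{\pi'(a|s)} \sum_{\tau:(s_h,a_h)=(s,a)} d^{\pi}(\tau) \\
&= \sum_{h=1}^{H}\sum_{s\in\mathcal{S}_h,\, a\in\mathcal{A}} d^{\pi}_h(s)\,\pi(a|s)\log\frac{\pi(a|s)}{\pi'(a|s)}
= \sum_{h=1}^{H} \mathbb{E}_{s_h\sim d^{\pi}_h}\big[\mathrm{KL}(\pi(s_h)\,\|\,\pi'(s_h))\big].
\end{aligned}
\]

□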
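Lemma 85 can be checked by brute force on a small tabular MDP. The sketch below enumerates all trajectories for a randomly generated MDP and compares the trajectory-level KL with the sum of per-step KL terms. It assumes time-homogeneous policies and transitions with full support purely for brevity; all names are hypothetical.

```python
import itertools

import numpy as np


def trajectory_kl(pi, pi2, P, rho, H):
    """KL(d^pi || d^pi2) by enumerating every length-H state-action trajectory."""
    S, A = pi.shape
    total = 0.0
    for states in itertools.product(range(S), repeat=H):
        for actions in itertools.product(range(A), repeat=H):
            prob, log_ratio = rho[states[0]], 0.0
            for h in range(H):
                s, a = states[h], actions[h]
                prob *= pi[s, a]
                log_ratio += np.log(pi[s, a]) - np.log(pi2[s, a])
                if h + 1 < H:
                    prob *= P[s, a, states[h + 1]]   # transitions cancel in the ratio
            total += prob * log_ratio
    return float(total)


def per_step_kl_sum(pi, pi2, P, rho, H):
    """Sum over h of E_{s ~ d_h^pi}[KL(pi(s) || pi2(s))] via forward recursion."""
    d = np.asarray(rho, dtype=float).copy()
    kl_per_state = np.sum(pi * (np.log(pi) - np.log(pi2)), axis=1)
    total = 0.0
    for _ in range(H):
        total += float(np.dot(d, kl_per_state))
        d = np.einsum("s,sa,sat->t", d, pi, P)       # next-step state distribution
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    S, A, H = 2, 2, 3
    pi, pi2 = rng.dirichlet(np.ones(A), size=S), rng.dirichlet(np.ones(A), size=S)
    P, rho = rng.dirichlet(np.ones(S), size=(S, A)), rng.dirichlet(np.ones(S))
    print(trajectory_kl(pi, pi2, P, rho, H), per_step_kl_sum(pi, pi2, P, rho, H))
```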

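Thus, substituting Lemma 83 and Lemma 85 into Equation (E.4), we know

\[
(3) \le \epsilon_{\mathrm{MLE}}\sqrt{\frac{H r_{\max} T}{\lambda}}.
\]

Overall, conditioned on event $\mathcal{E}_1 \cap \mathcal{E}_2$, we have

\[
V^{\pi^\star,r^\star}(s_1) - V^{\hat{\pi},r^\star}(s_1) \le C_r(\mathcal{R})\,\epsilon_{\mathrm{MLE}} + \epsilon'_{\mathrm{PMD}} + \epsilon_{\mathrm{MLE}}\sqrt{\frac{H r_{\max} T}{\lambda}}.
\]

We finish the proof by letting $\delta_1 = \delta_2 = \delta/2$.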

E.3 NPG with regularized Q functions

In this section we propose and analyze NPG with a KL-divergence regularized Q function, which includes the KL penalty when computing the trajectory's total reward. More specifically, we compute $y^{t,n}$ as follows in line 9 of Algorithm 16:

\[
y^{t,n} = \sum_{h'=h}^{H} \hat{r}_{h'}(s_{h'}, a_{h'}) - \lambda\sum_{h'=h+1}^{H} \mathrm{KL}(\pi(s_{h'})\,\|\,\pi^{\mathrm{SFT}}(s_{h'})).
\]

The other steps of the algorithm are the same as in Algorithm 16.
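The modified regression target can be computed directly from one rollout. The sketch below assumes the per-step rewards and the per-step KL values along the trajectory are already available as lists indexed by step; `regularized_return` is a hypothetical helper name.

```python
def regularized_return(rewards, step_kls, lam, h):
    """Regression target y for a rollout that resets at step h (1-indexed).

    rewards[i] and step_kls[i] hold the reward and KL(pi(s) || pi_SFT(s)) at step i + 1.
    Rewards are summed from step h, while KL penalties are summed from step h + 1,
    matching the index ranges in the displayed formula.
    """
    if len(rewards) != len(step_kls):
        raise ValueError("rewards and step_kls must have the same length")
    if not 1 <= h <= len(rewards):
        raise ValueError("h must be between 1 and the horizon H")
    reward_sum = sum(rewards[h - 1:])
    kl_sum = sum(step_kls[h:])
    return reward_sum - lam * kl_sum


if __name__ == "__main__":
    print(regularized_return([1.0, 2.0, 3.0], [0.1, 0.2, 0.3], lam=0.5, h=2))
```

With `lam=0` this reduces to the unregularized target $y^{t,n}_h$ from line 9 of Algorithm 16.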



E.3.1 Theoretical Analysis

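The assumptions are similar to those in Section 6.4 and Appendix E.2. However, here $\hat{Q}^t$ approximates the regularized Q function defined as follows rather than $Q^{\pi^t,\hat{r}}$:

\[
\begin{aligned}
V^{\pi,r}_{\lambda}(s) &:= \mathbb{E}_{\pi}\left[\sum_{h'=h}^{H} \Big(r(s_{h'}, a_{h'}) - \lambda\,\mathrm{KL}(\pi(s_{h'})\,\|\,\pi^{\mathrm{SFT}}(s_{h'}))\Big) \,\middle|\, s_h = s\right], \\
Q^{\pi,r}_{\lambda}(s, a) &:= r(s, a) + \mathbb{E}_{s'\sim P(\cdot|s,a)}\big[V^{\pi,r}_{\lambda}(s')\big], \quad \forall h \in [H],\ s \in \mathcal{S}_h,\ a \in \mathcal{A},
\end{aligned}
\]

where we let $V^{\pi,r}_{\lambda}(s_{H+1}) = 0$.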

Note that $Q^{\pi,r}_{\lambda}$ can be unbounded due to the KL divergence. To prevent this from happening, we assume the probability of $\pi^{\mathrm{SFT}}$ is lower bounded within its support:

Assumption 86. Suppose that we have:

\[
\min_{(s,a)\in\mathrm{Supp}(\pi^{\mathrm{SFT}})} \pi^{\mathrm{SFT}}(a|s) \ge C_{\mathrm{LO}},
\]

where $\mathrm{Supp}(\pi^{\mathrm{SFT}}) := \{(s, a) : \pi^{\mathrm{SFT}}(a|s) > 0\}$ is the support of $\pi^{\mathrm{SFT}}$.

Remark 87. Assumption 86 is a weak concentrability assumption because later we show that the sample complexity only scales with $\log\frac{1}{C_{\mathrm{LO}}}$ rather than $\mathrm{poly}(C_{\mathrm{LO}})$. This implies that $C_{\mathrm{LO}}$ can be as small as $\Theta(\epsilon)$, which can be satisfied by mixing $\pi^{\mathrm{SFT}}$ with a uniform policy, i.e., $\pi^{\mathrm{SFT},\epsilon} := (1 - \epsilon)\pi^{\mathrm{SFT}} + \epsilon\,\mathrm{Unif}(\mathrm{Supp}(\pi^{\mathrm{SFT}}))$. Note that Assumption 86 is not full-coverage concentrability because we consider the minimum probability within the support of $\pi^{\mathrm{SFT}}$.
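The mixing construction in Remark 87 can be illustrated for a tabular policy. The sketch below mixes each row of a policy matrix with the uniform distribution over that row's support, so that every supported action has probability at least $\epsilon$ divided by the support size while unsupported actions stay at zero; the function name is hypothetical.

```python
import numpy as np


def mix_with_uniform_on_support(pi_sft, eps):
    """Return (1 - eps) * pi_sft + eps * Unif(Supp(pi_sft)), applied row by row.

    pi_sft has shape (num_states, num_actions); each row is a distribution.
    """
    pi_sft = np.asarray(pi_sft, dtype=float)
    support = pi_sft > 0
    uniform = support / support.sum(axis=-1, keepdims=True)
    return (1.0 - eps) * pi_sft + eps * uniform


if __name__ == "__main__":
    pi = np.array([[0.999, 0.001, 0.0], [0.5, 0.5, 0.0]])
    mixed = mix_with_uniform_on_support(pi, eps=0.1)
    print(mixed, mixed[pi > 0].min())
```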

With Assumption 86, we can suppose $\mathcal{F}$ is bounded and capable of approximating $Q^{\pi^t,\hat{r}}_{h,\lambda}(s, a)$:

Assumption 88. Suppose that we have $Q^{\pi^t,\hat{r}}_{\lambda} \in \mathcal{F}$ for all $t \in [T]$. In addition, assume that for all $f \in \mathcal{F}_h$, $\|f\|_\infty \le C_Q := \max\big\{\lambda H\log\frac{1}{C_{\mathrm{LO}}},\ H\big\}$.


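Correspondingly, we modify the single-policy concentrability in Assumption 75 to the following assumption:

Assumption 89 (single-policy concentrability w.r.t. Q function class). Suppose that we have for all $t \in [T]$:

\[
\max_{h\in[H],\, f\in\mathcal{F},\, \pi\in\{\pi^t,\pi^\star\}} \frac{\Big|\mathbb{E}_{s\sim d^{\pi^\star}_h,\, a\sim\pi_h(s)}\big[f(s,a) - Q^{\pi^t,\hat{r}}_{\lambda}(s,a)\big]\Big|}{\sqrt{\mathbb{E}_{s\sim d^{\pi^{\mathrm{SFT}}}_h,\, a\sim(\frac{1}{2}\pi^{\mathrm{SFT}}_h(s)+\frac{1}{2}\pi^{t}_h(s))}\Big[\big(f(s,a) - Q^{\pi^t,\hat{r}}_{\lambda}(s,a)\big)^2\Big]}} \le C_{\mathrm{re}}(\mathcal{F}).
\]

Note that we still have $C_{\mathrm{re}}(\mathcal{F}) \le \sqrt{2\cdot\sup_{h\in[H],\, s\in\mathcal{S},\, a\in\mathcal{A}} \frac{d^{\pi^\star}_h(s,a)}{d^{\pi^{\mathrm{SFT}}}_h(s,a)}}$.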

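Under the above assumptions, we have the following theorem to characterize the theoretical performance of Algorithm 9 when setting the PO oracle as NPG with regularized Q function:

Theorem 90. Suppose Assumptions 22, 74, 76, 86, 88, and 89 hold and set

\[
T = \frac{H C_Q^2 C_{\mathrm{KL}}}{\epsilon^2}, \qquad
\eta = \frac{\epsilon}{H C_Q^2}, \qquad
\lambda = \frac{\epsilon}{C_{\mathrm{KL}} + C_r(\mathcal{R})\sqrt{C_{\mathrm{KL}}}}.
\]

If we have

\[
N = \Omega\left(\frac{H^3 C_{\mathrm{re}}^2(\mathcal{F})\, C_Q^4\, C_{\mathrm{KL}}}{\epsilon^4} \log\frac{T|\mathcal{F}|}{\delta}\right), \qquad
M = \Omega\left(\frac{\kappa^2 \left(C_{\mathrm{KL}} + C_r^2(\mathcal{R})\right)}{\epsilon^2} \log\frac{|\mathcal{R}|}{\delta}\right),
\]

then with probability at least $1 - \delta$, we have

\[
V^{\pi^\star, r^\star}(s_1) - V^{\hat{\pi}, r^\star}(s_1) \leq \epsilon.
\]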

The proof is deferred to Appendix E.4. Theorem 90 indicates that by adding a KL penalty to the reward, we can reduce the sample complexity of labeled trajectory pairs to $\tilde{O}(1/\epsilon^2)$, as long as the KL penalty does not blow up (Assumption 86). Specifically, compared to offline RLHF with pessimism (Zhan et al., 2023a), our sample complexity of labeled trajectory pairs only has an additional $C_{\mathrm{KL}}$ term, which is upper bounded by $H \log C_{\mathrm{ST}}$. Note that the algorithm is still oracle-efficient and does not require pessimism/optimism mechanisms that are hard to implement in practice.
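To make the parameter choices in Theorem 90 concrete, the following sketch evaluates $T$, $\eta$, and $\lambda$ for given constants. It is only an illustration: the function names are hypothetical, the numeric constants are made up, and $N$ and $M$ are not computed because the $\Omega(\cdot)$ notation hides universal constants.

```python
import math

def theorem90_settings(eps, H, C_Q, C_KL, C_r):
    """Evaluate T, eta, lambda as set in Theorem 90 (constants are inputs)."""
    T = math.ceil(H * C_Q**2 * C_KL / eps**2)
    eta = eps / (H * C_Q**2)
    lam = eps / (C_KL + C_r * math.sqrt(C_KL))
    return T, eta, lam

def regularized_q_bound(lam, H, C_LO):
    """C_Q := max{lambda * H * log(1 / C_LO), H} from Assumption 88."""
    return max(lam * H * math.log(1.0 / C_LO), H)

# Made-up constants, purely for illustration.
eps, H, C_Q, C_KL, C_r = 0.1, 10, 10.0, 4.0, 1.0
T, eta, lam = theorem90_settings(eps, H, C_Q, C_KL, C_r)

# The two optimization terms that appear in the NPG guarantee of Appendix E.4.
mirror_term = C_KL / (eta * T)        # C_KL / (eta * T)
step_term = H * eta * C_Q**2 / 2.0    # (H * eta / 2) * C_Q^2
print(T, eta, lam, mirror_term, step_term)
```

With these choices both optimization terms are of order $\epsilon$, which is the role the settings of $T$ and $\eta$ play in the proof.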



E.4 Proof of Theorem 90

The proof is similar to Theorem 78 and consists of three steps.

E.4.1 Q function Estimation Error

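We have the following lemma to bound the estimation error $\left|\hat{Q}^t(s,a) - Q^{\pi^t,\hat{r}}_{\lambda}(s,a)\right|$:

Lemma 91. Fix any $\delta_1 \in (0, 1]$. With Assumption 86 and 88, we have with probability at least $1 - \delta_1$ that for all $t \in [T]$,

\[
\left|\mathbb{E}_{h \sim \mathrm{Unif}([H]),\, s \sim d^{\pi^\star}_h,\, a \sim \pi(s)}\left[\hat{Q}^t_{\lambda}(s,a) - Q^{\pi^t,\hat{r}}_{\lambda}(s,a)\right]\right|
\leq C_{\mathrm{re}}(\mathcal{F}) \sqrt{\frac{256\, C_Q^2}{N_0} \log\frac{2T|\mathcal{F}|}{\delta_1}} := \epsilon''_{\mathrm{eval}},
\]

where $\pi \in \{\pi^t, \pi^\star\}$.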

The proof is the same as Lemma 79 and thus omitted here.

E.4.2 Regularized NPG Analysis

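We have the following lemma, which indicates that NPG is able to find a near-optimal policy with respect to the KL-penalty-regularized $\hat{r}$ (recall that $\epsilon''_{\mathrm{eval}}$ is defined in Lemma 91):

Lemma 92. Denote the event in Lemma 91 by $\mathcal{E}_3$. Then conditioned on $\mathcal{E}_3$, with Assumption 76 and 88, we have

\[
V^{\pi^\star,\hat{r}}_{\lambda}(s_1) - V^{\hat{\pi},\hat{r}}_{\lambda}(s_1) \leq \frac{C_{\mathrm{KL}}}{\eta T} + \frac{H\eta}{2} C_Q^2 + 2H\epsilon''_{\mathrm{eval}} := \epsilon''_{\mathrm{PMD}}.
\]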



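Proof. Following the same analysis in the proof of Lemma 80, we have

\[
\sum_{t=1}^{T} \sum_{h=1}^{H} \mathbb{E}_{s_h \sim d^{\pi^\star}_h}\left[\left\langle \eta \hat{Q}^t_{\lambda}(s_h),\, \pi^\star_h(s_h) - \pi^t_h(s_h)\right\rangle + \eta\lambda g(\pi^t_h(s_h)) - \eta\lambda g(\pi^\star_h(s_h))\right] \leq C_{\mathrm{KL}} + \frac{HT\eta^2}{2} C_Q^2. \tag{E.5}
\]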

On the other hand, note that we have the following performance difference lemma for the regularized Q function and value function, whose proof is deferred to Appendix E.5.3:

Lemma 93. For any policy $\pi, \pi'$ and reward function $r$, we have:

\[
V^{\pi,r}_{\lambda}(s_1) - V^{\pi',r}_{\lambda}(s_1) = \sum_{h=1}^{H} \mathbb{E}_{s_h \sim d^{\pi}_h}\left[\left\langle Q^{\pi',r}_{\lambda}(s_h),\, \pi(s_h) - \pi'(s_h)\right\rangle - \lambda g(\pi(s_h)) + \lambda g(\pi'(s_h))\right].
\]

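Now substitute Lemma 93 into Equation (E.5), and from Lemma 91 we have

\[
\frac{1}{T} \sum_{t=1}^{T} \left(V^{\pi^\star,\hat{r}}_{\lambda}(s_1) - V^{\pi^t,\hat{r}}_{\lambda}(s_1)\right) \leq \frac{C_{\mathrm{KL}}}{\eta T} + \frac{H\eta}{2} C_Q^2 + 2H\epsilon''_{\mathrm{eval}}.
\]

This is equivalent to

\[
V^{\pi^\star,\hat{r}}_{\lambda}(s_1) - V^{\hat{\pi},\hat{r}}_{\lambda}(s_1) \leq \frac{C_{\mathrm{KL}}}{\eta T} + \frac{H\eta}{2} C_Q^2 + 2H\epsilon''_{\mathrm{eval}},
\]

which concludes our proof. □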

E.4.3 Unregularized Suboptimality Gap w.r.t. r∗

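Now we can start to prove Theorem 90. If $V^{\pi^\star,r^\star}(s_1) - V^{\hat{\pi},r^\star}(s_1) \leq 0$, the theorem holds trivially. Otherwise, when $V^{\pi^\star,r^\star}(s_1) - V^{\hat{\pi},r^\star}(s_1) > 0$, we have

\[
\begin{aligned}
V^{\pi^\star,r^\star}(s_1) - V^{\hat{\pi},r^\star}(s_1)
={}& \underbrace{\left(\mathbb{E}_{\tau \sim d^{\pi^\star}}\left[r^\star(\tau) - \hat{r}(\tau)\right] - \mathbb{E}_{\tau \sim d^{\pi^{\mathrm{SFT}}}}\left[r^\star(\tau) - \hat{r}(\tau)\right]\right)}_{(1)} \\
&+ \underbrace{\left(V^{\pi^\star,\hat{r}}(s_1) - V^{\hat{\pi},\hat{r}}(s_1)\right)}_{(2)} \\
&+ \underbrace{\frac{1}{T} \sum_{t=1}^{T} \left(\mathbb{E}_{\tau \sim d^{\pi^{\mathrm{SFT}}}}\left[r^\star(\tau) - \hat{r}(\tau)\right] - \mathbb{E}_{\tau \sim d^{\pi^t}}\left[r^\star(\tau) - \hat{r}(\tau)\right]\right)}_{(3)}.
\end{aligned}
\]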



From the proof of Theorem 78, we know

\[
(1) \leq C_r(\mathcal{R})\,\epsilon_{\mathrm{MLE}}. \tag{E.6}
\]

Next we bound terms (2) and (3) in turn.

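Bounding term (2). Note that from Lemma 85, we know that for any policy $\pi$, the KL divergence between its trajectory distribution and that of $\pi^{\mathrm{SFT}}$ can be written as:

\[
\mathrm{KL}\left(d^{\pi} \,\|\, d^{\pi^{\mathrm{SFT}}}\right) = \sum_{h=1}^{H} \mathbb{E}_{s_h \sim d^{\pi}_h}\left[\mathrm{KL}\left(\pi(s_h) \,\|\, \pi^{\mathrm{SFT}}(s_h)\right)\right].
\]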

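Therefore from Lemma 92, conditioned on $\mathcal{E}_3$, we have

\[
V^{\pi^\star,\hat{r}}(s_1) - \lambda\,\mathrm{KL}\left(d^{\pi^\star} \,\|\, d^{\pi^{\mathrm{SFT}}}\right) - V^{\hat{\pi},\hat{r}}(s_1) + \frac{\lambda}{T} \sum_{t=1}^{T} \mathrm{KL}\left(d^{\pi^t} \,\|\, d^{\pi^{\mathrm{SFT}}}\right) \leq \epsilon''_{\mathrm{PMD}}. \tag{E.7}
\]

This implies that

\[
(2) \leq \epsilon''_{\mathrm{PMD}} + \lambda\,\mathrm{KL}\left(d^{\pi^\star} \,\|\, d^{\pi^{\mathrm{SFT}}}\right) \leq \epsilon''_{\mathrm{PMD}} + \lambda C_{\mathrm{KL}}.
\]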

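Bounding term (3). First note that from Equation (E.7) we have

\[
\begin{aligned}
\frac{1}{T} \sum_{t=1}^{T} \mathrm{KL}\left(d^{\pi^t} \,\|\, d^{\pi^{\mathrm{SFT}}}\right)
&\leq \frac{V^{\hat{\pi},\hat{r}}(s_1) - V^{\pi^\star,\hat{r}}(s_1)}{\lambda} + \frac{\epsilon''_{\mathrm{PMD}}}{\lambda} + C_{\mathrm{KL}} \\
&= \frac{(1) + (3) + \epsilon''_{\mathrm{PMD}}}{\lambda} + C_{\mathrm{KL}} - \frac{V^{\pi^\star,r^\star}(s_1) - V^{\hat{\pi},r^\star}(s_1)}{\lambda} \\
&\leq \frac{(1) + (3) + \epsilon''_{\mathrm{PMD}}}{\lambda} + C_{\mathrm{KL}},
\end{aligned} \tag{E.8}
\]

where the last step holds because we are considering the case where $V^{\pi^\star,r^\star}(s_1) - V^{\hat{\pi},r^\star}(s_1) > 0$.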

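With Lemma 84, we have

\[
(3) \leq \epsilon_{\mathrm{MLE}} \cdot \frac{1}{T} \sum_{t=1}^{T} \sqrt{\mathrm{KL}\left(d^{\pi^t} \,\|\, d^{\pi^{\mathrm{SFT}}}\right)} \leq \epsilon_{\mathrm{MLE}} \sqrt{\frac{1}{T} \sum_{t=1}^{T} \mathrm{KL}\left(d^{\pi^t} \,\|\, d^{\pi^{\mathrm{SFT}}}\right)},
\]

where the last step comes from the Cauchy-Schwarz inequality. Let $X = (1) + (3)$. Plugging Equations (E.6) and (E.8) into the above inequality, we have

\[
X \leq C_r(\mathcal{R})\,\epsilon_{\mathrm{MLE}} + \epsilon_{\mathrm{MLE}} \sqrt{\frac{X + \epsilon''_{\mathrm{PMD}}}{\lambda} + C_{\mathrm{KL}}}.
\]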

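This implies that

\[
X \leq c_2\,\epsilon_{\mathrm{MLE}} \left(\frac{\epsilon_{\mathrm{MLE}}}{\lambda} + \sqrt{\frac{C_r(\mathcal{R})\,\epsilon_{\mathrm{MLE}}}{\lambda}} + C_r(\mathcal{R}) + \sqrt{C_{\mathrm{KL}}} + \sqrt{\frac{\epsilon''_{\mathrm{PMD}}}{\lambda}}\right),
\]

where $c_2 > 0$ is a universal constant.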

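Overall, we have

\[
\begin{aligned}
V^{\pi^\star,r^\star}(s_1) - V^{\hat{\pi},r^\star}(s_1) \leq{}& \epsilon''_{\mathrm{PMD}} + \lambda C_{\mathrm{KL}} \\
&+ c_2\,\epsilon_{\mathrm{MLE}} \left(\frac{\epsilon_{\mathrm{MLE}}}{\lambda} + \sqrt{\frac{C_r(\mathcal{R})\,\epsilon_{\mathrm{MLE}}}{\lambda}} + C_r(\mathcal{R}) + \sqrt{C_{\mathrm{KL}}} + \sqrt{\frac{\epsilon''_{\mathrm{PMD}}}{\lambda}}\right).
\end{aligned}
\]

We finish the proof by substituting the values of $\lambda, \eta, T, N, M$ into the above inequality and letting $\delta_1 = \delta_2 = \delta/2$.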

E.5 Auxiliary Lemmas

E.5.1 Least Squares Guarantee

Lemma 94 (Lemma 15 in Song et al. (2022)). Fix any $R > 0$, $\delta \in (0, 1)$ and assume we have a class of real-valued functions $\mathcal{H} : \mathcal{X} \mapsto [-R, R]$. Suppose we have $K$ i.i.d. samples $\{(x_k, y_k)\}_{k=1}^{K}$ where $x_k \sim \rho$ and $y_k$ is sampled via the conditional probability $p(\cdot \mid x_k)$:

\[
y_k \sim p(\cdot \mid x_k) := h^*(x_k) + \epsilon_k,
\]

where $h^* \in \mathcal{H}$ and $\{\epsilon_k\}_{k=1}^{K}$ are independent random variables such that $\mathbb{E}[y_k \mid x_k] = h^*(x_k)$. Additionally, suppose that $\max_k |y_k| \leq R$ and $\max_x |h^*(x)| \leq R$. Then the least squares solution $\hat{h} \leftarrow \arg\min_{h \in \mathcal{H}} \sum_{k=1}^{K} (h(x_k) - y_k)^2$ satisfies with probability at least $1 - \delta$,

\[
\mathbb{E}_{x \sim \rho}\left[\left(\hat{h}(x) - h^*(x)\right)^2\right] \leq \frac{256 R^2 \log(2|\mathcal{H}|/\delta)}{K}.
\]

The proof is the same as in Song et al. (2022) and thus is omitted here.

E.5.2 Maximum Likelihood Estimation Guarantee

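Lemma 95. With Assumption 22, we have with probability at least $1 - \delta$ that

\[
\mathbb{E}_{\tau^0 \sim d^{\pi^{\mathrm{SFT}}},\, \tau^1 \sim d^{\pi^{\mathrm{SFT}}}}\left[\left|r^\star(\tau^0) - r^\star(\tau^1) - \hat{r}(\tau^0) + \hat{r}(\tau^1)\right|^2\right] \leq \frac{c_1 \kappa^2 \log(|\mathcal{R}|/\delta)}{M},
\]

where $c_1 > 0$ is a universal constant.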

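Proof. The proof largely follows the proof of Theorem 1 in Zhan et al. (2023a). Specifically, we have the following lemma from Zhan et al. (2023a):

Lemma 96 (Lemma 2 in Zhan et al. (2023a)). Fix any $\delta \in (0, 1]$. Then with probability at least $1 - \delta$, we have that for all reward functions $r \in \mathcal{R}$,

\[
\mathbb{E}_{\tau^0, \tau^1 \sim d^{\pi^{\mathrm{SFT}}}}\left[\left\|P_r(\cdot \mid \tau^0, \tau^1) - P_{r^\star}(\cdot \mid \tau^0, \tau^1)\right\|_1^2\right]
\leq \frac{c_1}{M}\left(\sum_{m=1}^{M} \log\left(\frac{P_{r^\star}(o^m \mid \tau^{m,0}, \tau^{m,1})}{P_r(o^m \mid \tau^{m,0}, \tau^{m,1})}\right) + \log\frac{|\mathcal{R}|}{\delta}\right).
\]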

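Then from Lemma 96, since $\sum_{m=1}^{M} \log P_{r^\star}(o^m \mid \tau^{m,0}, \tau^{m,1}) \leq \sum_{m=1}^{M} \log P_{\hat{r}}(o^m \mid \tau^{m,0}, \tau^{m,1})$, we have with probability at least $1 - \delta$:

\[
\mathbb{E}_{\tau^0, \tau^1 \sim d^{\pi^{\mathrm{SFT}}}}\left[\left\|P_{\hat{r}}(\cdot \mid \tau^0, \tau^1) - P_{r^\star}(\cdot \mid \tau^0, \tau^1)\right\|_1^2\right] \leq \frac{c_1 \log\frac{|\mathcal{R}|}{\delta}}{M}. \tag{E.9}
\]

Then under Assumption 22, we can apply the mean value theorem between $r^\star(\tau^1) - r^\star(\tau^0)$ and $\hat{r}(\tau^1) - \hat{r}(\tau^0)$ to Equation (E.9) and ensure that

\[
\mathbb{E}_{\tau^0, \tau^1 \sim d^{\pi^{\mathrm{SFT}}}}\left[\left|\left(r^\star(\tau^1) - r^\star(\tau^0)\right) - \left(\hat{r}(\tau^1) - \hat{r}(\tau^0)\right)\right|^2\right] \leq \frac{c_1 \kappa^2 \log\frac{|\mathcal{R}|}{\delta}}{M},
\]

where $\kappa := \frac{1}{\inf_{x \in [-r_{\max}, r_{\max}]} \Phi'(x)}$ measures the non-linearity of the link function $\Phi$. □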



E.5.3 Performance Difference

We restate and prove Lemma 93 as follows. Lemma 82 is a special case where λ = 0.

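Lemma 97. For any policy $\pi, \pi'$ and reward function $r$, we have:

\[
V^{\pi,r}_{1,\lambda}(s_1) - V^{\pi',r}_{1,\lambda}(s_1) = \sum_{h=1}^{H} \mathbb{E}_{s_h \sim d^{\pi}_h}\left[\left\langle Q^{\pi',r}_{h,\lambda}(s_h),\, \pi_h(s_h) - \pi'_h(s_h)\right\rangle - \lambda g(\pi_h(s_h)) + \lambda g(\pi'_h(s_h))\right],
\]

where $g(\pi_h(s)) := \mathrm{KL}\left(\pi_h(s) \,\|\, \pi^{\mathrm{SFT}}_h(s)\right)$.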

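Proof. For any two policies $\pi, \pi'$ and reward $r$, we have that

\[
\begin{aligned}
& V^{\pi,r}_{1,\lambda}(s_1) - V^{\pi',r}_{1,\lambda}(s_1) \\
&= \mathbb{E}_{\pi}\left[r_1(s_1, a_1) - \lambda g(\pi_1(s_1)) + V^{\pi,r}_{2,\lambda}(s_2)\right] - \mathbb{E}_{\pi}\left[V^{\pi',r}_{1,\lambda}(s_1)\right] \\
&= \mathbb{E}_{\pi}\left[\left(Q^{\pi',r}_{1,\lambda}(s_1, a_1) - V^{\pi',r}_{2,\lambda}(s_2)\right) - \lambda g(\pi_1(s_1)) + V^{\pi,r}_{2,\lambda}(s_2)\right] - \mathbb{E}_{\pi}\left[V^{\pi',r}_{1,\lambda}(s_1)\right] \\
&= \mathbb{E}_{\pi}\left[V^{\pi,r}_{2,\lambda}(s_2) - V^{\pi',r}_{2,\lambda}(s_2)\right] + \mathbb{E}_{\pi}\left[Q^{\pi',r}_{1,\lambda}(s_1, a_1) - V^{\pi',r}_{1,\lambda}(s_1) - \lambda g(\pi_1(s_1))\right] \\
&= \mathbb{E}_{\pi}\left[V^{\pi,r}_{2,\lambda}(s_2) - V^{\pi',r}_{2,\lambda}(s_2)\right] + \mathbb{E}_{s_1 \sim d^{\pi}_1}\left[\left\langle Q^{\pi',r}_{1,\lambda}(s_1, \cdot),\, \pi_1(\cdot \mid s_1) - \pi'_1(\cdot \mid s_1)\right\rangle - \lambda g(\pi_1(s_1)) + \lambda g(\pi'_1(s_1))\right] \\
&= \cdots = \sum_{h=1}^{H} \mathbb{E}_{s_h \sim d^{\pi}_h}\left[\left\langle Q^{\pi',r}_{h,\lambda}(s_h),\, \pi_h(s_h) - \pi'_h(s_h)\right\rangle - \lambda g(\pi_h(s_h)) + \lambda g(\pi'_h(s_h))\right].
\end{aligned}
\]

This concludes our proof. □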
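As a sanity check, the identity in Lemma 97 can be verified numerically on a small random tabular MDP. The sketch below is an illustration under stated assumptions, not part of the analysis: the function names (`regularized_values`, `state_distributions`, `pdl_gap`) are hypothetical, transitions and rewards are time-homogeneous for brevity, and all policies are drawn from a Dirichlet distribution so that every KL term is finite.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def regularized_values(P, r, pi, sft, lam):
    """Backward recursion for the KL-regularized V and Q functions.

    P: (S, A, S) transitions, r: (S, A) rewards,
    pi, sft: (H, S, A) policies. Returns V: (H + 1, S), Q: (H, S, A).
    """
    H, S, A = pi.shape
    V = np.zeros((H + 1, S))          # V[H] = 0 is the terminal condition
    Q = np.zeros((H, S, A))
    for h in reversed(range(H)):
        Q[h] = r + P @ V[h + 1]       # Q_h(s,a) = r(s,a) + E_{s'}[V_{h+1}(s')]
        for s in range(S):
            V[h, s] = pi[h, s] @ Q[h, s] - lam * kl(pi[h, s], sft[h, s])
    return V, Q

def state_distributions(P, pi, s1):
    """d[h, s]: probability of being in state s at step h under pi from s1."""
    H, S, A = pi.shape
    d = np.zeros((H, S))
    d[0, s1] = 1.0
    for h in range(H - 1):
        d[h + 1] = np.einsum("s,sa,sat->t", d[h], pi[h], P)
    return d

def pdl_gap(P, r, pi, pi_prime, sft, lam, s1=0):
    """Return (lhs, rhs) of the regularized performance difference lemma."""
    V_pi, _ = regularized_values(P, r, pi, sft, lam)
    V_pp, Q_pp = regularized_values(P, r, pi_prime, sft, lam)
    d = state_distributions(P, pi, s1)
    H, S, A = pi.shape
    rhs = 0.0
    for h in range(H):
        for s in range(S):
            inner = Q_pp[h, s] @ (pi[h, s] - pi_prime[h, s])
            reg = lam * (kl(pi_prime[h, s], sft[h, s]) - kl(pi[h, s], sft[h, s]))
            rhs += d[h, s] * (inner + reg)
    return V_pi[0, s1] - V_pp[0, s1], rhs

rng = np.random.default_rng(0)
S, A, H, lam = 3, 2, 4, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.random((S, A))
pi = rng.dirichlet(np.ones(A), size=(H, S))
pi_prime = rng.dirichlet(np.ones(A), size=(H, S))
sft = rng.dirichlet(np.ones(A), size=(H, S))
lhs, rhs = pdl_gap(P, r, pi, pi_prime, sft, lam)
print(abs(lhs - rhs))
```

Setting `lam = 0` in the same script recovers the unregularized performance difference lemma, which is the special case noted for Lemma 82.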

E.5.4 KL Divergence Property

We restate Lemma 84 as follows:

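Lemma 98 (Corollary 11 in Talebi and Maillard (2018)). For any two distributions $d_1, d_2 \in \Delta_{\mathcal{X}}$ and function $f$ defined on $\mathcal{X}$, we have

\[
\mathbb{E}_{x \sim d_2}[f(x)] - \mathbb{E}_{x \sim d_1}[f(x)] \leq \sqrt{2\,\mathrm{Var}_{d_2}[f]\,\mathrm{KL}(d_1 \,\|\, d_2)},
\]

where $\mathrm{Var}_d[f]$ is the variance of $f$ under distribution $d$.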

The proof is the same as in Talebi and Maillard (2018) and thus is omitted here.

Task             Train/Val/Test      Prompt       Gen. Length
TL;DR            117K/6.45K/6.55K    "TL;DR: "    53
CNN/DailyMail    287K/13.4K/11.4K    "TL;DR: "    64

Table E.1: Train, val, test splits, prompts, and max generation length used for each task.

E.6 Additional Experiment Details

E.6.1 Experiment Hyperparameters and Task Details

Task Details

We present dataset-specific details in Table E.1.

For both datasets we obtained the training data from https://github.com/openai/summarize-from-feedback.

E.6.2 Dataset Reset Implementation Details

Here is a code snippet of the logit processor that handles dataset resets from references

for a HuggingFace transformers model. β here represents the proportion of generations

in the batch to do resets for.

import numpy as np
import torch
from torch.nn.functional import one_hot
from transformers import LogitsProcessor


class ResetProcessor(LogitsProcessor):
    def __init__(self, references, beta, rng, seq_lens):
        self.counter = 0  # index of the token currently being generated
        self.references = references  # (batch_size, seq_len) reference token ids
        self.seq_lens = seq_lens  # length of each reference
        self.create_mask(beta, rng)

    def create_mask(self, beta, rng):
        batch_size, seq_len = self.references.shape[:2]
        # Mixin: each generation in the batch is reset with probability beta
        init_mask = rng.choice(
            [True, False], size=(batch_size, 1), p=[beta, 1 - beta]
        )
        init_mask = np.tile(init_mask, (1, seq_len))
        # Rollin Selection: row i of length_masks keeps the first i + 1 positions
        length_masks = np.tril(np.ones((seq_len, seq_len)))
        masks = []
        for length in self.seq_lens:
            if length < 2:
                # Reference too short to reset from
                masks.append(np.zeros((seq_len)).astype(bool))
            else:
                # Sample a rollin prefix of length 1, ..., length - 1
                masks.append(
                    rng.choice(length_masks[: length - 1, :]).astype(bool)
                )
        self.rollin_mask = np.stack(masks)
        self.rollin_mask[~init_mask] = False

    def __call__(
        self, input_ids: torch.LongTensor, scores: torch.FloatTensor
    ) -> torch.FloatTensor:
        vocab_size = scores.size(-1)
        # Put all probability mass on the reference token at this position
        new_scores = one_hot(
            self.references[:, self.counter], num_classes=vocab_size
        ).float()
        new_scores[new_scores == 0] = -float("inf")
        mask = self.rollin_mask[:, self.counter]
        assert scores.shape == new_scores.shape
        new_scores = new_scores.to(scores.device)
        # Only do Teacher Forcing on the rollins
        scores[mask] = new_scores[mask]
        self.counter += 1
        return scores
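The rollin mask built by `create_mask` can be examined in isolation. The following NumPy-only sketch reproduces the same idea under simplifying assumptions: `make_rollin_mask` is a hypothetical helper, not part of the snippet above, and it draws the prefix length directly instead of sampling a row of a lower-triangular matrix. Each row selected for a reset is teacher-forced on a prefix of between 1 and `length - 1` reference tokens; all other rows are generated freely.

```python
import numpy as np

def make_rollin_mask(seq_lens, seq_len, beta, rng):
    """Boolean (batch, seq_len) mask; True marks teacher-forced positions."""
    batch_size = len(seq_lens)
    # Each generation is reset with probability beta.
    reset_rows = rng.random(batch_size) < beta
    mask = np.zeros((batch_size, seq_len), dtype=bool)
    for i, length in enumerate(seq_lens):
        if reset_rows[i] and length >= 2:
            # Prefix length uniform in {1, ..., length - 1}.
            prefix_len = rng.integers(1, length)
            mask[i, :prefix_len] = True
    return mask

rng = np.random.default_rng(0)
mask = make_rollin_mask(seq_lens=[5, 1, 8], seq_len=8, beta=1.0, rng=rng)
print(mask.astype(int))
```

With `beta=1.0` every row with a long enough reference receives a reset prefix, which corresponds to the DR-PO mixing parameter reported in Table E.2; with `beta=0.0` the mask is all `False` and generation proceeds from the initial state distribution.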

Computation

Note that since we start with the references from the dataset, the computational requirements to generate with resets are the same as generating from the initial state distribution. For all of our experiments, we ran with the same per-device batch size for PPO and DR-PO. For this work, we made use of 16 A6000 GPUs with 48GB of VRAM. We used 4 GPUs for each run.

E.6.3 Details on GPT4 Winrate

For winrate calculation, we used the following prompt:

Which of the following summaries does a better job of summarizing the most important points

in the given forum post, without including unimportant or irrelevant details? Judge based

on accuracy, coverage, and coherence.

### Post:

{{post}}

### Summary A:

{{summarya}}

### Summary B:

{{summaryb}}

### Instructions:

FIRST provide a one-sentence comparison of the two summaries, explaining which you prefer

and why. SECOND, on a new line, state only "A" or "B" to indicate your choice. Your response



should use the format:

Comparison: <one-sentence comparison and explanation>

Preferred: <"A" or "B">
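For illustration, the prompt above can be filled in and the judge's reply parsed with a few lines of Python. This is a sketch under assumptions: `build_winrate_prompt` and `parse_preference` are hypothetical helper names, the call to the judge model is deliberately left out, and the parser only assumes the reply follows the `Preferred:` format requested in the prompt.

```python
import re

WINRATE_TEMPLATE = (
    "Which of the following summaries does a better job of summarizing the most "
    "important points in the given forum post, without including unimportant or "
    "irrelevant details? Judge based on accuracy, coverage, and coherence.\n\n"
    "### Post:\n{{post}}\n\n"
    "### Summary A:\n{{summarya}}\n\n"
    "### Summary B:\n{{summaryb}}\n\n"
    "### Instructions:\n"
    "FIRST provide a one-sentence comparison of the two summaries, explaining which "
    "you prefer and why. SECOND, on a new line, state only \"A\" or \"B\" to indicate "
    "your choice. Your response should use the format:\n"
    "Comparison: <one-sentence comparison and explanation>\n"
    "Preferred: <\"A\" or \"B\">"
)

def build_winrate_prompt(post, summary_a, summary_b):
    """Fill the three template slots with the post and the two summaries."""
    return (
        WINRATE_TEMPLATE.replace("{{post}}", post)
        .replace("{{summarya}}", summary_a)
        .replace("{{summaryb}}", summary_b)
    )

def parse_preference(response):
    """Return 'A' or 'B' from a reply in the requested format, else None."""
    match = re.search(r'Preferred:\s*"?([AB])\b', response)
    return match.group(1) if match else None

prompt = build_winrate_prompt("some post", "summary one", "summary two")
choice = parse_preference("Comparison: A covers more details.\nPreferred: A")
print(choice)
```

Replies that do not follow the format return `None`, so they can be counted separately rather than silently attributed to either summary.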

Win Rate Example

Here is an example of getting a one sentence explanation as to why GPT4 chose certain

generations for the winrate.

Prompt

SUBREDDIT: r/AskReddit

TITLE: How do you get someone out of your head?

POST: Hi,

I'm 22, and I have been with my girlfriend for 5 years now. We recently moved together.

We've always loved each other intensely.

Problem, I recently started to have feelings for an other person (a friend). This person

has had a boyfriend for now 3 years, and has absolutely no ideas. Those feelings were

so strong, it was hard to hide them. After 2 months of me being distant and really sad,

my girlfriend forced me to say what was bothering me. I'm not a good liar, and now she knows.

We decided to give us a week alone, I went to my parents.

Now, I'm completely lost. I keep on thinking about this person, and I hate that. I would



like for those feelings to go away, to leave me alone. But I can't.

What do I do? It's been 3 months now, and I'm just desperate.

TL;DR:

DR-PO Generation (Summary A)

I recently started to have feelings for someone else, my girlfriend knows, we decided to

give ourselves a week alone, now I'm completely lost, I hate that, what do I do

Reference (Summary B)

long relationship; fell in love with an other person; admitted it; would like it to

disappear, though it doesn't.

GPT4 Explanation for Choosing DR-PO

Summary A is more detailed and better represents the tone and content of the forum post,

while Summary B is too abbreviated and omits important details such as the fact

that the poster's girlfriend now knows about their feelings.

E.6.4 Examples from Test

Prompt 1



SUBREDDIT: r/AskReddit

TITLE: My friend works at a California Charter School and their principal is disregarding

their charter and threatening to fire anyone who questions his leadership. What can she do?

POST: I'm not an educator but my friend is having a really hard time at her job right now

and I was wondering if anyone could help.

**The Issue:**

She works for a charter school in California where the principal has gone off the deep end.

At first it started casually with cuts and a few firings (understandable at the time because

of budget cuts). It steadily got worse however as he slowly stripped the teachers of any

rights/protections they once had. Now he's threatening to fire them at any time if they

refuse to work late/take personal days/attempt talks with the teachers union/show signs of

disagreeing with the principal's leadership, etc.

This guy has continually canceled meetings by any form of teacher-staffed governing boards

that would normally provide the teachers with a voice. Now he just dictates policy for the

school with no regard or input from the teachers. He has threatened to let anyone go if they

show even the slightest dissatisfaction with his decisions.

Keep in mind this was a very good school before this guy came in. It's not like it was a

failing school in need of some tough love. As far as scholastic achievement goes, not much

has changed since this guy was hired on (which is why I suspect he still has a job despite

complaints to district).



Is there anything my friend can do to oust this guy or at least protect her job?

TL;DR:

DR-PO

My friend works at a California charter school where the principal is disregarding their

charter and threatening to fire anyone who questions his leadership. What can she do?

Prompt 2

SUBREDDIT: r/AskReddit

TITLE: I am in the financial and insurance industry, but is looking to go back to school

to become a vet. Any suggessions?

POST: I'm not 100% sure if I have what it takes to make this huge change, but I feel like

being a Veternarian is something I really want to do in life.

A bit of background. I'm currently 24, graduated with a finance degree 3 years ago.

I'm currently a financial rep focusing on selling insurance. Even though I don't hate

my job I feel like it doesn't fit my personality. But I stuck with it for a while

because I don't like to quit easily. The idea of becoming a vet happened when a

chinchilla of mine passed because I didn't take it to the vet in time, and I really love

animals. People tell me I'm a warm person and great with kids and animals. I feel like

this is something I came up with myself and not what other want me to be.



So back to reality, I did a tiny bit of research on this. The closest grad school that has

this program is Cornell university (I'm located in NY) I'd probably need to do an undergrad

in science or medical field. I'm a little unsure of 8 more years of school, but I guess if

there's no other choice. Another concern is money, I only have about 10k in bank, I cannot

touch my retirement and life insurance money. I'm not sure if I can qualify for any federal

grant. I may also need to move back with my parents. One idea I have is to just brokage

product with high residuals while I wait to get into a school. I'll also appraciate any

insight and experiences from a vet or a person going back to school.

TL;DR:

DR-PO

I'm in the financial and insurance industry, but am looking to go back to school to become

a vet. Any suggestions or experiences from a vet or someone going back to school.



E.6.5 Hyperparameters

We list the relevant hyperparameters from our experiments for DPO, PPO, SFT, and DR-PO in Table E.2.
Setting               Values

model                 Pythia 2.8B (HuggingFace Model Card: EleutherAI/pythia-2.8b-deduped)

PPO                   train epochs: 1
                      batch size: 512
                      num epochs: 4
                      num minibatches: 1
                      learning rate: 3e-6
                      schedule: linear decay
                      discount factor: 1
                      gae λ: 0.95
                      clip ratio: 0.2
                      value function coeff: 0.1
                      kl coefficient: 0.05

DR-PO                 mixing parameter (β): 1

DPO                   batch size: 64
                      β: 0.05
                      learning rate: 3e-6
                      schedule: linear decay
                      num train epochs: 1

Reward Model          batch size: 64
                      learning rate: 3e-6
                      schedule: linear decay
                      num train epochs: 1

SFT                   batch size: 64
                      learning rate: 3e-6
                      schedule: linear decay
                      num train epochs: 1

LoRA Adapter Config   r: 1024
                      α: 2048
                      dropout: 0.0
                      bias: False

Decoding              sampling: true
                      top k: 0.0
                      top p: 1.0
                      min length: 53
                      max new tokens: 53
                      temperature: 0.1

Tokenizer             padding side: left
                      truncation side: left
                      max length: 563

Table E.2: Hyperparameters used for TL;DR and CNN/DailyMail. Note that DR-PO and PPO share the same parameters (other than the mixing proportion). All processes use the same decoding, LoRA config, and tokenizer parameters.



Algorithms      RM TL;DR Accuracy    RM CNN/DM Accuracy

RM              66.21%               67.48%
DPO             65.92%               67.28%

RM w/ LoRA      62.87%               66.75%
DPO w/ LoRA     66.14%               61.78%

Table E.3: Reward Model Transfer to CNN/DM: The accuracy of the RM and of DPO's implicit learned reward in predicting preferences. We evaluate models trained with and without LoRA on TL;DR. We also report the zero-shot performance of these models on the CNN/DailyMail preference dataset from Stiennon et al. (2020).

E.6.6 Additional Experiments

As shown in Table E.3, we compare the accuracy of DPO's implicit learned reward with our RM's accuracy on both the TL;DR and CNN/DailyMail test sets. Furthermore, we report the effects of LoRA on RM and DPO performance. We see that DPO without LoRA has preference accuracy on CNN/DM comparable to that of our RM. Thus, we used the DPO policy without LoRA when comparing against PPO and DR-PO in Table 6.3.
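For reference, the preference accuracy of DPO's implicit reward can be computed as sketched below. The sketch uses the standard DPO identity, in which the implicit reward of a response is $\beta$ times the log-probability ratio between the policy and the reference model, and counts a pair as correct when the preferred summary receives the higher reward. The function names are hypothetical and the log-probabilities are made-up numbers for illustration, not values from our experiments.

```python
import numpy as np

def dpo_implicit_reward(policy_logp, ref_logp, beta):
    """Implicit reward beta * (log pi(y|x) - log pi_ref(y|x)) per response."""
    return beta * (np.asarray(policy_logp) - np.asarray(ref_logp))

def preference_accuracy(reward_chosen, reward_rejected):
    """Fraction of pairs where the preferred response gets the higher reward."""
    return float(np.mean(np.asarray(reward_chosen) > np.asarray(reward_rejected)))

beta = 0.05  # DPO beta from Table E.2

# Made-up sequence log-probabilities for three preference pairs.
chosen_policy, chosen_ref = [-10.0, -12.0, -9.0], [-11.0, -12.5, -8.0]
rejected_policy, rejected_ref = [-11.0, -11.0, -10.0], [-10.5, -12.0, -9.5]

r_chosen = dpo_implicit_reward(chosen_policy, chosen_ref, beta)
r_rejected = dpo_implicit_reward(rejected_policy, rejected_ref, beta)
acc = preference_accuracy(r_chosen, r_rejected)
print(acc)
```

The same `preference_accuracy` function applies to an explicit reward model by passing its scalar scores for the preferred and rejected summaries.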



APPENDIX F

MISSING PROOFS AND DETAILS IN CHAPTER 8

F.1 Consistency Models

We reproduce the consistency model algorithm from Song et al. (2023).

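Algorithm 17 Consistency Model Multi-step Sampling Procedure (Song et al., 2023)
1: Input: Consistency model $\pi = f_\theta(\cdot, \cdot)$, sequence of time points $\tau_1 > \tau_2 > \ldots > \tau_{N-1}$, initial noise $\hat{x}_T$
2: $x \leftarrow f_\theta(\hat{x}_T, T)$
3: for $n = 1$ to $N-1$ do
4:     $z \sim \mathcal{N}(0, I)$
5:     $\hat{x}_{\tau_n} \leftarrow x + \sqrt{\tau_n^2 - \epsilon^2}\, z$
6:     $x \leftarrow f_\theta(\hat{x}_{\tau_n}, \tau_n)$
7: end for
8: Output: $x$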
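The sampling loop of Algorithm 17 can be sketched in a few lines of NumPy. The consistency function used here, `toy_consistency_fn`, is a hypothetical stand-in chosen only because it satisfies the boundary condition $f(x, \epsilon) = x$; it is not a trained model. The values of `T`, `eps`, and the time points are likewise assumptions for illustration.

```python
import numpy as np

def multistep_consistency_sampling(f, x_T, T, taus, eps, rng):
    """Alternate denoising with f and re-noising to each time point in taus."""
    x = f(x_T, T)                                      # step 2: one-step sample
    for tau in taus:                                   # step 3: n = 1, ..., N - 1
        z = rng.standard_normal(x.shape)               # step 4
        x_tau = x + np.sqrt(tau**2 - eps**2) * z       # step 5: re-noise to tau
        x = f(x_tau, tau)                              # step 6: denoise again
    return x

def toy_consistency_fn(x, t, eps=0.002):
    """Hypothetical stand-in for f_theta; equals the identity at t = eps."""
    return x * eps / t

rng = np.random.default_rng(0)
T, eps = 80.0, 0.002
x_T = T * rng.standard_normal((4, 2))   # initial noise
taus = [40.0, 10.0, 1.0]                # decreasing time points
out = multistep_consistency_sampling(toy_consistency_fn, x_T, T, taus, eps, rng)
print(out.shape)
```

Each pass through the loop costs one call to the consistency function, so the number of time points directly sets the horizon of the sampling process.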

F.2 Experiment Details

F.2.1 Hyperparameters

We note that a fourth GPU was used for Prompt Image Alignment as a server for the LLaVA (Liu et al., 2023a) and BERT models (Zhang et al., 2019a) to form the reward function.

F.2.2 Hyperparameter Sweep Ranges

These hyperparameters were found via a sweep. In particular, we swept the learning rate over the range [1e-5, 3e-4]. Likewise, we swept the number of batches per epoch and gradient accumulation steps but found that increasing both of these values led to greater performance, at the cost of sample complexity. We also swept the hyperparameters for DDPO, our baseline, but found that the provided hyperparameters gave the best results. In particular, we tried a lower batch size to improve the sample efficiency of DDPO but found that this made the algorithm unstable. Likewise, we found that increasing the number of inner epochs did not help performance. In fact, it had quite the opposite effect.

Parameters                             Compression       Incompression     Aesthetic         Prompt Image Alignment

Advantage Clip Maximum                 10                10                10                10
Batches Per Epoch                      10                10                10                6
Clip Range                             0.0001            0.0001            0.0001            0.0001
Gradient Accumulation Steps            2                 2                 4                 20
Learning Rate                          0.0001            0.0001            0.0001            0.0001
Max Grad Norm                          5                 5                 5                 5
Pretrained Model                       Dreamshaper v7    Dreamshaper v7    Dreamshaper v7    Dreamshaper v7
Number of Epochs                       100               100               100               118
Horizon (Number of inference steps)    8                 8                 8                 16
Number of Sample Inner Epochs          1                 1                 1                 5
Sample Batch Size (per GPU)            4                 4                 8                 8
Rolling Statistics Buffer Size         16                16                32                32
Rolling Statistics Min Count           16                16                16                16
Train Batch Size (per GPU)             2                 2                 2                 2
Number of GPUs                         4                 4                 4                 3
LoRA rank                              16                16                8                 16
LoRA α                                 32                32                8                 32

Table F.1: Hyperparameters for all tasks (Compression, Incompression, Aesthetic, Prompt Image Alignment)

F.2.3 Details on Task Prompts

We followed Black et al. (2024) in forming the prompts for each of the tasks. The prompts for incompression, compression, and aesthetic took the form of [animal]. For the prompt image alignment task, the prompt took the form of a [animal] [task], where the article (a or an) was chosen to match the animal. The prompts for compression and incompression were the animal classes of ImageNet (Deng et al., 2009). Aesthetic used a set of simple animals, and prompt image alignment used the animals from the aesthetic task and chose from the tasks: riding a bike, washing the dishes, playing chess.



F.3 Additional Samples from RLCM

We provide random samples from RLCM at the end of training on aesthetic and prompt

image alignment. Images from converged compression and incompression are relatively

uninteresting and thus omitted.



F.3.1 Aesthetic Task



F.3.2 Prompt Image Alignment



BIBLIOGRAPHY

Abbasi-yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear

stochastic bandits. In Advances in Neural Information Processing Systems, volume 24.

Curran Associates, Inc.

Abbeel, P. and Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement

learning. In ICML. ACM.

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D.,

Altenschmidt, J., Altman, S., Anadkat, S., et al. (2023). Gpt-4 technical report. arXiv

preprint arXiv:2303.08774.

Agarwal, A., Henaff, M., Kakade, S., and Sun, W. (2020a). Pc-pg: Policy cover directed

exploration for provable policy gradient learning. NeurIPS.

Agarwal, A., Jiang, N., Kakade, S. M., and Sun, W. (2019). Reinforcement learning:

Theory and algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep.

Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2020b). Optimality and approx-

imation with policy gradient methods in markov decision processes. In Conference on

Learning Theory, pages 64–66. PMLR.

Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2021a). On the Theory of

Policy Gradient Methods: Optimality, Approximation, and Distribution Shift. Journal

of Machine Learning Research, 22(1):4431–4506.

Agarwal, R., Schwarzer, M., Castro, P. S., Courville, A. C., and Bellemare, M. (2021b).

Deep reinforcement learning at the edge of the statistical precipice. In Ranzato, M.,

Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W., editors, Advances in

Neural Information Processing Systems, volume 34, pages 29304–29320. Curran

Associates, Inc.



Agarwal, R., Vieillard, N., Stanczyk, P., Ramos, S., Geist, M., and Bachem, O. (2023).

Gkd: Generalized knowledge distillation for auto-regressive sequence models. arXiv

preprint arXiv:2306.13649.

Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino,

A., Plappert, M., Powell, G., Ribas, R., et al. (2019). Solving rubik’s cube with a robot

hand. arXiv preprint arXiv:1910.07113.

Ammanabrolu, P. and Riedl, M. O. (2018). Playing text-adventure games with graph-

based deep reinforcement learning. arXiv preprint arXiv:1812.01628.

Anderson, P., Fernando, B., Johnson, M., and Gould, S. (2016). Spice: Semantic

Propositional Image Caption Evaluation. In Proceedings of European Conference on

Computer Vision.

Anthropic (2023). https://www.anthropic.com/index/claude-2.

Antos, A., Szepesvári, C., and Munos, R. (2008). Learning near-optimal policies with

bellman-residual minimization based fitted policy iteration and a single sample path.

Machine Learning, 71:89–129.

Arora, K., Asri, L. E., Bahuleyan, H., and Cheung, J. C. K. (2022). Why exposure

bias matters: An imitation learning perspective of error accumulation in language

generation. arXiv preprint arXiv:2204.01171.

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed

bandit problem. Machine learning, 47(2):235–256.

Aytar, Y., Pfaff, T., Budden, D., Paine, T. L., Wang, Z., and de Freitas, N. (2018). Playing

hard exploration games by watching youtube. In NeurIPS, pages 2935–2945.



Azar, M. G., Munos, R., Ghavamzadeh, M., and Kappen, H. (2011). Reinforcement

learning with a near optimal rate of convergence. Technical report, INRIA.

Azar, M. G., Osband, I., and Munos, R. (2017). Minimax regret bounds for reinforcement

learning. In International Conference on Machine Learning, pages 263–272.

Azizzadenesheli, K., Brunskill, E., and Anandkumar, A. (2018a). Efficient exploration

through bayesian deep q-networks. In ITA, pages 1–9. IEEE.

Azizzadenesheli, K., Brunskill, E., and Anandkumar, A. (2018b). Efficient exploration

through bayesian deep q-networks. In 2018 Information Theory and Applications

Workshop (ITA), pages 1–9. IEEE.

Bach, F. (2017). On the equivalence between kernel quadrature rules and random feature

expansions. Journal of machine learning research, 18(21):1–38.

Bagnell, J., Kakade, S. M., Schneider, J., and Ng, A. (2003). Policy Search by Dynamic

Programming. In Proceedings of Neural Information Processing Systems.

Bagnell, J. A. (2004). Learning decisions: Robustness, uncertainty, and approximation.

Carnegie Mellon University.

Bagnell, J. A. and Schneider, J. (2003). Covariant policy search.

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S.,

Ganguli, D., Henighan, T., et al. (2022a). Training a helpful and harmless assistant

with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A.,

Mirhoseini, A., McKinnon, C., et al. (2022b). Constitutional ai: Harmlessness from ai

feedback. arXiv preprint arXiv:2212.08073.



Bakker, M., Chadwick, M., Sheahan, H., Tessler, M., Campbell-Gillingham, L., Balaguer,

J., McAleese, N., Glaese, A., Aslanides, J., Botvinick, M., et al. (2022). Fine-tuning

language models to find agreement among humans with diverse preferences. Advances

in Neural Information Processing Systems, 35:38176–38189.

Ball, P. J., Smith, L., Kostrikov, I., and Levine, S. (2023). Efficient online reinforcement

learning with offline data. arXiv preprint arXiv:2302.02948.

Bansal, S., Calandra, R., Xiao, T., Levine, S., and Tomlin, C. J. (2017). Goal-driven

dynamics learning via bayesian optimization. In 2017 IEEE 56th Annual Conference

on Decision and Control (CDC), pages 5168–5173. IEEE.

Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., Tb, D., Muldal,

A., Heess, N., and Lillicrap, T. (2018). Distributed distributional deterministic policy

gradients. arXiv preprint arXiv:1804.08617.

Bartlett, P. L., Bousquet, O., and Mendelson, S. (2005). Local rademacher complexities.

Annals of Statistics, 33(4):1497–1537.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learn-

ing environment: An evaluation platform for general agents. Journal of Artificial

Intelligence Research, 47:253–279.

Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015). Scheduled sampling for

sequence prediction with recurrent neural networks. In NIPS.

Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi, D.,

Fischer, Q., Hashme, S., Hesse, C., et al. (2019). Dota 2 with large scale deep

reinforcement learning. arXiv preprint arXiv:1912.06680.

Bertsekas, D. P. (2011). Approximate Policy Iteration: A Survey and Some New Methods.

Journal of Control Theory and Applications, 9(3):310–335.



Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O’Brien, K., Hallahan, E.,

Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., et al. (2023). Pythia: A suite

for analyzing large language models across training and scaling. In International

Conference on Machine Learning, pages 2397–2430. PMLR.

Black, K., Janner, M., Du, Y., Kostrikov, I., and Levine, S. (2024). Training diffusion

models with reinforcement learning.

Bradley, R. A. and Terry, M. E. (1952). Rank analysis of incomplete block designs: I.

the method of paired comparisons. Biometrika, 39(3/4):324–345.

Brafman, R. I. and Tennenholtz, M. (2001). R-max - a general polynomial time algorithm

for near-optimal reinforcement learning. J. Mach. Learn. Res., 3:213–231.

Brantley, K., Sun, W., and Henaff, M. (2019). Disagreement-regularized imitation

learning. In International Conference on Learning Representations.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and

Zaremba, W. (2016a). Openai Gym. arXiv preprint arXiv:1606.01540.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and

Zaremba, W. (2016b). Openai gym.

Brown, D., Goo, W., Nagarajan, P., and Niekum, S. (2019). Extrapolating beyond

suboptimal demonstrations via inverse reinforcement learning from observations. In

International conference on machine learning, pages 783–792. PMLR.

Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and non-stochastic

multi-armed bandit problems. Found. Trends Mach. Learn, 5(1):1–122.

Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P.,



Lee, Y. T., Li, Y., Lundberg, S., et al. (2023). Sparks of Artificial General Intelligence:

Early Experiments with Gpt-4. arXiv preprint arXiv:2303.12712.

Buciluǎ, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model compression. In Pro-

ceedings of the 12th ACM SIGKDD international conference on Knowledge discovery

and data mining, pages 535–541.

Buckman, J., Gelada, C., and Bellemare, M. G. (2020). The importance of pessimism in

fixed-dataset policy optimization. arXiv preprint arXiv:2009.06799.

Burda, Y., Edwards, H., Storkey, A., and Klimov, O. (2018). Exploration by random

network distillation. arXiv preprint arXiv:1810.12894.

Burda, Y., Edwards, H., Storkey, A. J., and Klimov, O. (2019). Exploration by random

network distillation. In ICLR. OpenReview.net.

Calandriello, D., Carratino, L., Lazaric, A., Valko, M., and Rosasco, L. (2019). Gaussian

process optimization with adaptive sketching: Scalable and no regret. In Proceedings

of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of

Machine Learning Research, pages 533–557.

Camacho, A., Gur, I., Moczulski, M. L., Nachum, O., and Faust, A. (2021). Sparsedice:

Imitation learning for temporally sparse data via regularization. In ICML 2021 Work-

shop on Unsupervised Reinforcement Learning.

CarperAI (2023). https://github.com/carperai/trlx.

Chan, A. J. and van der Schaar, M. (2021). Scalable bayesian inverse reinforcement

learning. arXiv preprint arXiv:2102.06483.

Chang, J., Uehara, M., Sreenivas, D., Kidambi, R., and Sun, W. (2021). Mitigating

covariate shift in imitation learning via offline data with partial coverage. In Ranzato,

M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W., editors, Advances in

Neural Information Processing Systems, volume 34, pages 965–979. Curran Associates,

Inc.

Chang, J. D., Brantley, K., Ramamurthy, R., Misra, D., and Sun, W. (2023). Learning to

generate better than your LLM. arXiv preprint arXiv:2306.11816.

Chang, J. D., Sreenivas, D., Huang, Y., Brantley, K., and Sun, W. (2024a). Adversarial

imitation learning via boosting. In The Twelfth International Conference on Learning

Representations.

Chang, J. D., Zhan, W., Oertell, O., Brantley, K., Misra, D., Lee, J. D., and Sun, W.

(2024b). Dataset reset policy optimization for RLHF.

Chang, K.-W., He, H., Daumé III, H., and Langford, J. (2015a). Learning to search for

dependencies. arXiv preprint arXiv:1503.05615.

Chang, K.-W., Krishnamurthy, A., Agarwal, A., Daumé III, H., and Langford, J. (2015b).

Learning to search better than your teacher. In ICML.

Chen, A., Scheurer, J., Korbak, T., Campos, J. A., Chan, J. S., Bowman, S. R., Cho, K.,

and Perez, E. (2023). Improving code generation by training with natural language

feedback. arXiv preprint arXiv:2303.16749.

Chen, J. and Jiang, N. (2019). Information-theoretic considerations in batch reinforcement

learning. In Proceedings of the 36th International Conference on Machine Learning,

volume 97, pages 1042–1051.

Chen, X., Zhong, H., Yang, Z., Wang, Z., and Wang, L. (2022). Human-in-the-loop:

Provably efficient preference-based reinforcement learning with general function

approximation. In International Conference on Machine Learning, pages 3773–3793.

PMLR.

Cheng, C.-A., Yan, X., Wagener, N., and Boots, B. (2018). Fast Policy Learning through

Imitation and Reinforcement. arXiv preprint arXiv:1805.10413.

Chowdhury, S. R. and Gopalan, A. (2019). Online learning in kernelized markov decision

processes. In Chaudhuri, K. and Sugiyama, M., editors, Proceedings of the Twenty-

Second International Conference on Artificial Intelligence and Statistics, volume 89 of

Proceedings of Machine Learning Research, pages 3197–3205. PMLR.

Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017).

Deep reinforcement learning from human preferences. Advances in neural information

processing systems, 30.

Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X.,

Dehghani, M., Brahma, S., et al. (2022). Scaling instruction-finetuned language

models. arXiv preprint arXiv:2210.11416.

Contextual.ai (2023). https://contextual.ai/better-cheaper-faster-llm-alignment-with-kto/.

Côté, M.-A., Kádár, A., Yuan, X., Kybartas, B., Barnes, T., Fine, E., Moore, J.,

Hausknecht, M., El Asri, L., Adada, M., et al. (2019). Textworld: A learning environ-

ment for text-based games. In Computer Games: 7th Workshop, CGW 2018, Held in

Conjunction with the 27th International Conference on Artificial Intelligence, IJCAI

2018, Stockholm, Sweden, July 13, 2018, Revised Selected Papers 7, pages 41–75.

Springer.

Curi, S., Berkenkamp, F., and Krause, A. (2020). Efficient model-based reinforce-

ment learning through optimistic policy search and planning. arXiv preprint

arXiv:2006.08684.

Daumé III, H., Langford, J., and Marcu, D. (2009). Search-based structured prediction.

Machine learning.

Daumé III, H. and Marcu, D. (2005). Learning as search optimization: Approximate large

margin methods for structured prediction. In Proceedings of the 22nd international

conference on Machine learning, pages 169–176.

Deisenroth, M. and Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient

approach to policy search. In International Conference on Machine Learning, pages

465–472.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A

large-scale hierarchical image database. In 2009 IEEE conference on computer vision

and pattern recognition, pages 248–255. IEEE.

Desai, S., Durugkar, I., Karnan, H., Warnell, G., Hanna, J., and Stone, P. (2020).

An imitation from observation approach to transfer learning with dynamics mismatch.

Advances in Neural Information Processing Systems, 33.

Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J.,

Sidor, S., Wu, Y., and Zhokhov, P. (2017). OpenAI baselines.

https://github.com/openai/baselines.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2016). Density estimation using real nvp.

arXiv preprint arXiv:1605.08803.

Donsker, M. and Varadhan, S. (1983). Asymptotic evaluation of certain markov process

expectations for large time. iv. Communications on Pure and Applied Mathematics,

36(2):183–212.

Du, S. S., Kakade, S. M., Lee, J. D., Lovett, S., Mahajan, G., Sun, W., and Wang, R.

(2021). Bilinear classes: A structural framework for provable generalization in rl.

ICML.

Duan, Y., Jia, Z., and Wang, M. (2020). Minimax-optimal off-policy evaluation with

linear function approximation. In Proceedings of the 37th International Conference on

Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages

2701–2709.

Duan, Y., Jin, C., and Li, Z. (2021). Risk bounds and rademacher complexity in batch

reinforcement learning. arXiv preprint arXiv:2103.13883.

Dubois, Y., Li, X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., Guestrin, C., Liang, P., and

Hashimoto, T. B. (2023). Alpacafarm: A simulation framework for methods that learn

from human feedback.

Duckworth, D., Neelakantan, A., Goodrich, B., Kaiser, L., and Bengio, S. (2019). Parallel

Scheduled Sampling. arXiv preprint arXiv:1906.04331.

Dudík, M., Hofmann, K., Schapire, R. E., Slivkins, A., and Zoghi, M. (2015). Contextual

dueling bandits. In Conference on Learning Theory, pages 563–587. PMLR.

Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K. O., and Clune, J. (2019). Go-explore: a

new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995.

Edwards, A. D., Sahni, H., Schroecker, Y., and Isbell Jr., C. L. (2019). Imitating latent policies

from observation. In ICML.

Ernst, D., Geurts, P., and Wehenkel, L. (2005). Tree-based batch mode reinforcement

learning. Journal of Machine Learning Research, 6:503–556.

Fakoor, R., Mueller, J., Chaudhari, P., and Smola, A. J. (2021). Continuous doubly

constrained batch reinforcement learning. arXiv preprint arXiv:2102.09225.

Fan, J., Wang, Z., Xie, Y., and Yang, Z. (2020). A theoretical analysis of deep q-learning.

In Proceedings of the 2nd Conference on Learning for Dynamics and Control, volume

120 of Proceedings of Machine Learning Research, pages 486–489.

Fan, Y., Watkins, O., Du, Y., Liu, H., Ryu, M., Boutilier, C., Abbeel, P., Ghavamzadeh,

M., Lee, K., and Lee, K. (2023). Dpok: Reinforcement learning for fine-tuning

text-to-image diffusion models. arXiv preprint arXiv:2305.16381.

Fedus, W., Goodfellow, I., and Dai, A. M. (2018). Maskgan: better text generation via

filling in the_. arXiv preprint arXiv:1801.07736.

Finn, C., Christiano, P., Abbeel, P., and Levine, S. (2016a). A connection between

generative adversarial networks, inverse reinforcement learning, and energy-based

models. arXiv preprint arXiv:1611.03852.

Finn, C., Levine, S., and Abbeel, P. (2016b). Guided cost learning: Deep inverse optimal

control via policy optimization. In ICML.

Fisac, J. F., Akametalu, A. K., Zeilinger, M. N., Kaynama, S., Gillula, J., and Tomlin, C. J.

(2018). A general safety framework for learning-based control in uncertain robotic

systems. IEEE Transactions on Automatic Control, 64(7):2737–2752.

Florensa, C., Held, D., Wulfmeier, M., Zhang, M., and Abbeel, P. (2017). Reverse

curriculum generation for reinforcement learning. In Conference on robot learning,

pages 482–495. PMLR.

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine.

The Annals of Statistics, 29(5):1189–1232.

Fu, J., Luo, K., and Levine, S. (2018). Learning robust rewards with adversarial inverse

reinforcement learning.

Fujimoto, S., Meger, D., and Precup, D. (2019). Off-policy deep reinforcement learning

without exploration. In Proceedings of the 36th International Conference on Machine

Learning, volume 97 of Proceedings of Machine Learning Research, pages 2052–2062.

PMLR.

Garg, D., Chakraborty, S., Cundy, C., Song, J., and Ermon, S. (2021). Iq-learn: Inverse

soft-q learning for imitation. Advances in Neural Information Processing Systems,

34:4028–4039.

Ghasemipour, S. K. S., Zemel, R., and Gu, S. (2020). A divergence minimization

perspective on imitation learning methods. In Conference on Robot Learning, pages

1259–1277. PMLR.

Github (2023). https://github.com/features/copilot. Accessed: 2023-May-13.

Glaese, A., McAleese, N., Trębacz, M., Aslanides, J., Firoiu, V., Ewalds, T., Rauh, M.,

Weidinger, L., Chadwick, M., Thacker, P., et al. (2022). Improving alignment of

dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.

Go, D., Korbak, T., Kruszewski, G., Rozen, J., Ryu, N., and Dymetman, M. (2023).

Aligning language models with preferences through f-divergence minimization. arXiv

preprint arXiv:2302.08215.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,

Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in

neural information processing systems, pages 2672–2680.

Goyal, K., Dyer, C., and Berg-Kirkpatrick, T. (2017). Differentiable Scheduled Sampling

for Credit Assignment. arXiv preprint arXiv:1704.06970.

Goyal, T., Li, J. J., and Durrett, G. (2022). News Summarization and Evaluation in the

Era of GPT-3. arXiv preprint arXiv:2209.12356.

Grover, A. and Ermon, S. (2017). Boosted generative models.

Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. (2016). Continuous deep q-learning

with model-based acceleration.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement learning with

deep energy-based policies. In International conference on machine learning, pages

1352–1361. PMLR.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018a). Soft actor-critic: Off-policy

maximum entropy deep reinforcement learning with a stochastic actor.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018b). Soft actor-critic: Off-

policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR,

abs/1801.01290.

Hafner, D., Lillicrap, T., Ba, J., and Norouzi, M. (2020). Dream to control: Learning

behaviors by latent imagination. In International Conference on Learning Representa-

tions.

Hancock, B., Bringmann, M., Varma, P., Liang, P., Wang, S., and Ré, C. (2018). Training

classifiers with natural language explanations. In Proceedings of the conference.

Association for Computational Linguistics. Meeting, volume 2018, page 1884. NIH

Public Access.

Hazan, E., Kakade, S., Singh, K., and Van Soest, A. (2019). Provably efficient maximum

entropy exploration. In International Conference on Machine Learning, pages 2681–

2691. PMLR.

Hermann, K. M., Hill, F., Green, S., Wang, F., Faulkner, R., Soyer, H., Szepesvari, D.,

Czarnecki, W. M., Jaderberg, M., Teplyashin, D., et al. (2017). Grounded Language

Learning in a Simulated 3D World. arXiv preprint arXiv:1706.06551.

Hill, A., Raffin, A., Ernestus, M., Gleave, A., Kanervisto, A., Traore, R., Dhariwal,

P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J.,

Sidor, S., and Wu, Y. (2018). Stable baselines.

https://github.com/hill-a/stable-baselines.

Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network.

arXiv preprint arXiv:1503.02531.

Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B.,

Norouzi, M., Fleet, D. J., et al. (2022). Imagen video: High definition video generation

with diffusion models. arXiv preprint arXiv:2210.02303.

Ho, J. and Ermon, S. (2016a). Generative adversarial imitation learning. In NIPS.

Ho, J. and Ermon, S. (2016b). Generative adversarial imitation learning. CoRR,

abs/1606.03476.

Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Ad-

vances in neural information processing systems, 33:6840–6851.

Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. (2019). The curious case of

neural text degeneration. arXiv preprint arXiv:1904.09751.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen,

W. (2021). LoRA: Low-rank adaptation of large language models. arXiv preprint

arXiv:2106.09685.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen,

W. (2022). LoRA: Low-rank adaptation of large language models. In International

Conference on Learning Representations.

Huang, W., Xia, F., Xiao, T., Chan, H., Liang, J., Florence, P., Zeng, A., Tompson, J.,

Mordatch, I., Chebotar, Y., et al. (2022). Inner Monologue: Embodied Reasoning

through Planning with Language Models. In Proceedings of Annual Conference on

Robot Learning.

Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement

learning. Journal of Machine Learning Research, 11(Apr):1563–1600.

Janner, M., Du, Y., Tenenbaum, J. B., and Levine, S. (2022). Planning with diffusion for

flexible behavior synthesis. arXiv preprint arXiv:2205.09991.

Janner, M., Fu, J., Zhang, M., and Levine, S. (2019). When to trust your model: Model-

based policy optimization. CoRR, abs/1906.08253.

Janz, D., Burt, D., and Gonzalez, J. (2020). Bandit optimisation of functions in the

matérn kernel rkhs. In Proceedings of the Twenty Third International Conference on

Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning

Research, pages 2486–2495.

Jarrett, D., Bica, I., and van der Schaar, M. (2020). Strictly batch imitation learning by

energy-based distribution matching. In Advances in Neural Information Processing

Systems, volume 33, pages 7354–7365.

Jiang, N. (2020). Notes on tabular methods.

Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. (2016).

Contextual decision processes with low bellman rank are pac-learnable. arXiv preprint

arXiv:1610.09512.

Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., Wang, F., and Liu, Q.

(2019). Tinybert: Distilling bert for natural language understanding. arXiv preprint

arXiv:1909.10351.

Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. (2020a). Provably efficient reinforcement

learning with linear function approximation. In Conference on Learning Theory, pages

2137–2143. PMLR.

Jin, Y., Yang, Z., and Wang, Z. (2020b). Is pessimism provably efficient for offline rl?

arXiv preprint arXiv:2012.15085.

Kakade, S., Krishnamurthy, A., Lowrey, K., Ohnishi, M., and Sun, W. (2020a). In-

formation theoretic regret bounds for online nonlinear control. arXiv preprint

arXiv:2006.12466.

Kakade, S., Krishnamurthy, A., Lowrey, K., Ohnishi, M., and Sun, W. (2020b). Infor-

mation theoretic regret bounds for online nonlinear control. In Advances in Neural

Information Processing Systems, volume 33, pages 15312–15325.

Kakade, S. and Langford, J. (2002a). Approximately optimal approximate reinforcement

learning. In ICML.

Kakade, S. M. (2001a). A natural policy gradient. Advances in neural information

processing systems, 14.

Kakade, S. M. (2001b). A natural policy gradient. In NIPS, pages 1531–1538.

Kakade, S. M. et al. (2003a). On the sample complexity of reinforcement learning. PhD

thesis, University of London, London, England.

Kakade, S. M., Kearns, M. J., and Langford, J. (2003b). Exploration in metric state

spaces. In ICML.

Kakade, S. M., Krishnamurthy, A., Lowrey, K., Ohnishi, M., and Sun, W. (2020c).

Information theoretic regret bounds for online nonlinear control. In NeurIPS.

Kakade, S. M. and Langford, J. (2002b). Approximately optimal approximate reinforce-

ment learning. In ICML.

Ke, L., Barnes, M., Sun, W., Lee, G., Choudhury, S., and Srinivasa, S. (2019). Imitation

learning as f-divergence minimization. arXiv preprint arXiv:1905.12888.

Ke, L., Choudhury, S., Barnes, M., Sun, W., Lee, G., and Srinivasa, S. (2020). Imitation

learning as f-divergence minimization.

Kearns, M. and Singh, S. (2002a). Near-optimal reinforcement learning in polynomial

time. Machine learning, 49(2-3):209–232.

Kearns, M. and Singh, S. (2002b). Near optimal reinforcement learning in polynomial

time. Machine Learning, 49(2-3):209–232.

Khalifa, M., Elsahar, H., and Dymetman, M. (2020). A distributional approach to

controlled text generation. arXiv preprint arXiv:2012.11635.

Khan Academy (2023). https://blog.khanacademy.org/harnessing-ai-so-that-all-students-benefit-a-nonprofit-approach-for-equal-access/.

Accessed: 2023-May-14.

Kidambi, R., Chang, J., and Sun, W. (2021). Optimism is all you need: Model-based

imitation learning from observation alone. arXiv preprint arXiv:2102.10769.

Kidambi, R., Rajeswaran, A., Netrapalli, P., and Joachims, T. (2020a). Morel: Model-

based offline reinforcement learning. CoRR, abs/2005.05951.

Kidambi, R., Rajeswaran, A., Netrapalli, P., and Joachims, T. (2020b). Morel: Model-

based offline reinforcement learning. In Advances in Neural Information Processing

Systems, volume 33, pages 21810–21823. Curran Associates, Inc.

Kiegeland, S. and Kreutzer, J. (2021). Revisiting the weaknesses of reinforcement

learning for neural machine translation. arXiv preprint arXiv:2106.08942.

Kim, G.-H., Seo, S., Lee, J., Jeon, W., Hwang, H., Yang, H., and Kim, K.-E. (2022).

Demodice: Offline imitation learning with supplementary imperfect demonstrations.

In International Conference on Learning Representations (ICLR).

Kim, Y. and Rush, A. M. (2016). Sequence-level knowledge distillation. arXiv preprint

arXiv:1606.07947.

Ko, J., Klein, D. J., Fox, D., and Haehnel, D. (2007). Gaussian processes and reinforce-

ment learning for identification and control of an autonomous blimp. In Proceedings

2007 IEEE International Conference on Robotics and Automation, pages 742–747. IEEE.

Korbak, T., Elsahar, H., Kruszewski, G., and Dymetman, M. (2022). On reinforcement

learning and distribution matching for fine-tuning language models with no catas-

trophic forgetting. Advances in Neural Information Processing Systems, 35:16203–

16220.

Kostrikov, I., Agrawal, K. K., Dwibedi, D., Levine, S., and Tompson, J. (2019a).

Discriminator-actor-critic: Addressing sample inefficiency and reward bias in ad-

versarial imitation learning. In ICLR. OpenReview.net.

Kostrikov, I., Agrawal, K. K., Dwibedi, D., Levine, S., and Tompson, J. (2019b).

Discriminator-actor-critic: Addressing sample inefficiency and reward bias in ad-

versarial imitation learning. In International Conference on Learning Representations.

Kostrikov, I., Nachum, O., and Tompson, J. (2019c). Imitation learning via off-policy

distribution matching. ICLR.

Kostrikov, I., Nachum, O., and Tompson, J. (2020). Imitation learning via off-policy

distribution matching. In International Conference on Learning Representations.

Kreutzer, J., Khadivi, S., Matusov, E., and Riezler, S. (2018a). Can neural machine

translation be improved with user feedback? arXiv preprint arXiv:1804.05958.

Kreutzer, J., Uyheng, J., and Riezler, S. (2018b). Reliability and learnability of human

bandit feedback for sequence-to-sequence reinforcement learning. arXiv preprint

arXiv:1805.10627.

Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020). Conservative q-learning for

offline reinforcement learning. arXiv preprint arXiv:2006.04779.

Kurutach, T., Clavera, I., Duan, Y., Tamar, A., and Abbeel, P. (2018). Model-ensemble

trust-region policy optimization. In ICLR. OpenReview.net.

Kwon, M., Xie, S. M., Bullard, K., and Sadigh, D. (2023). Reward design with language

models. arXiv preprint arXiv:2303.00001.

Lampe, T. and Riedmiller, M. A. (2014). Approximate model-assisted neural fitted

q-iteration. In IJCNN, pages 2698–2704. IEEE.

Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.

Le Gratiet, L. and Garnier, J. (2015). Asymptotic analysis of

the learning curve for gaussian process regression. Machine learning, 98(3):407–433.

Leblond, R., Alayrac, J.-B., Osokin, A., and Lacoste-Julien, S. (2017). SEARNN:

Training RNNs with Global-local losses. arXiv preprint arXiv:1706.04499.

Lee, H., Phatale, S., Mansoor, H., Lu, K., Mesnard, T., Bishop, C., Carbune, V., and

Rastogi, A. (2023a). Rlaif: Scaling reinforcement learning from human feedback with

ai feedback. arXiv preprint arXiv:2309.00267.

Lee, K., Liu, H., Ryu, M., Watkins, O., Du, Y., Boutilier, C., Abbeel, P., Ghavamzadeh,

M., and Gu, S. S. (2023b). Aligning text-to-image models using human feedback.

arXiv preprint arXiv:2302.12192.

Lee, P., Bubeck, S., and Petro, J. (2023c). Benefits, Limits, and Risks of GPT-4 as an AI

Chatbot for Medicine. New England Journal of Medicine, 388(13):1233–1239.

Li, J., Monroe, W., Ritter, A., Jurafsky, D., Galley, M., and Gao, J. (2016). Deep

Reinforcement Learning for Dialogue Generation. In Proceedings of Conference on

Empirical Methods in Natural Language Processing.

Li, W. and Todorov, E. (2004). Iterative linear quadratic regulator design for nonlinear

biological movement systems. In ICINCO, pages 222–229.

Li, Z., Yang, Z., and Wang, M. (2023). Reinforcement learning with human feedback:

Learning dynamic choices via pessimism. arXiv preprint arXiv:2305.18438.

Liao, P., Qi, Z., and Murphy, S. (2020). Batch policy learning in average reward markov

decision processes. arXiv preprint arXiv:2007.11771.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and

Wierstra, D. (2019). Continuous control with deep reinforcement learning.

Lin, A., Wohlwend, J., Chen, H., and Lei, T. (2020a). Autoregressive knowledge

distillation through imitation learning. arXiv preprint arXiv:2009.07253.

Lin, B. Y., Zhou, W., Shen, M., Zhou, P., Bhagavatula, C., Choi, Y., and Ren, X. (2020b).

CommonGen: A Constrained Text Generation Challenge for Generative Commonsense

Reasoning. In Findings of Association for Computational Linguistics: EMNLP.

Lin, C.-Y. (2004). Rouge: A package for automatic evaluation of summaries. In Text

summarization branches out, pages 74–81.

Liu, H., Li, C., Wu, Q., and Lee, Y. J. (2023a). Visual instruction tuning. In NeurIPS.

Liu, H., Sferrazza, C., and Abbeel, P. (2023b). Languages are rewards: Hindsight

finetuning using human feedback. arXiv preprint arXiv:2302.02676.

Liu, S., Zhu, Z., Ye, N., Guadarrama, S., and Murphy, K. (2017). Improved Image

Captioning via Policy Gradient Optimization of Spider. In Proceedings of International

Conference on Computer Vision.

Liu, T., Zhao, Y., Joshi, R., Khalman, M., Saleh, M., Liu, P. J., and Liu, J. (2023c).

Statistical rejection sampling improves preference optimization. arXiv preprint

arXiv:2309.06657.

Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. (2020). Provably good batch

off-policy reinforcement learning without great exploration. In Advances in Neural

Information Processing Systems, volume 33, pages 1264–1274.

Lowrey, K., Rajeswaran, A., Kakade, S., Todorov, E., and Mordatch, I. (2019). Plan

Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control.

In International Conference on Learning Representations (ICLR).

Luo, S., Tan, Y., Huang, L., Li, J., and Zhao, H. (2023). Latent consistency models:

Synthesizing high-resolution images with few-step inference.

Luo, Y., Xu, H., Li, Y., Tian, Y., Darrell, T., and Ma, T. (2018). Algorithmic framework for

model-based deep reinforcement learning with theoretical guarantees. arXiv preprint

arXiv:1807.03858.

Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. (2011). Learning

Word Vectors for Sentiment Analysis. In Proceedings of Annual Meeting of the

Association for Computational Linguistics: Human language technologies.

MacGlashan, J., Ho, M. K., Loftin, R., Peng, B., Wang, G., Roberts, D. L., Taylor, M. E.,

and Littman, M. L. (2017). Interactive learning from policy-dependent human feedback.

In International Conference on Machine Learning, pages 2285–2294. PMLR.

Mason, L., Baxter, J., Bartlett, P., and Frean, M. (1999). Boosting algorithms as gradient

descent. In Solla, S., Leen, T., and Müller, K., editors, Advances in Neural Information

Processing Systems, volume 12. MIT Press.

Matsushima, T., Furuta, H., Matsuo, Y., Nachum, O., and Gu, S. (2020). Deployment-

efficient reinforcement learning via model-based offline optimization. ICLR.

Mihaylova, T. and Martins, A. F. (2019). Scheduled Sampling for Transformers. arXiv

preprint arXiv:1906.07651.

Misra, D., Langford, J., and Artzi, Y. (2017). Mapping Instructions and Visual Obser-

vations to Actions with Reinforcement Learning. In Proceedings of Conference on

Empirical Methods in Natural Language Processing.

Mukherjee, S., Mitra, A., Jawahar, G., Agarwal, S., Palangi, H., and Awadallah, A.

(2023). Orca: Progressive learning from complex explanation traces of gpt-4. arXiv

preprint arXiv:2306.02707.

Munos, R. and Szepesvári, C. (2008). Finite-time bounds for fitted value iteration.

Journal of Machine Learning Research, 9(May):815–857.

Nachum, O., Chow, Y., Dai, B., and Li, L. (2019a). Dualdice: Behavior-agnostic

estimation of discounted stationary distribution corrections. Advances in Neural

Information Processing Systems, 32.

Nachum, O., Chow, Y., Dai, B., and Li, L. (2019b). Dualdice: Behavior-agnostic

estimation of discounted stationary distribution corrections. Advances in Neural

Information Processing Systems 2019.

Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. (2018). Neural network dynamics

for model-based deep reinforcement learning with model-free fine-tuning. In IEEE

International Conference on Robotics and Automation, pages 7559–7566.

Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. (2018). Over-

coming exploration in reinforcement learning with demonstrations. In 2018 IEEE

international conference on robotics and automation (ICRA), pages 6292–6299. IEEE.

Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain,

S., Kosaraju, V., Saunders, W., et al. (2021). Webgpt: Browser-assisted question-

answering with human feedback. arXiv preprint arXiv:2112.09332.

Narasimhan, K., Kulkarni, T., and Barzilay, R. (2015). Language Understanding for

Text-based Games using Deep Reinforcement Learning. In Proceedings of Conference

on Empirical Methods in Natural Language Processing.

Ng, A. Y. and Russell, S. (2000). Algorithms for inverse reinforcement learning. In Proc.

ICML, pages 663–670.

Nguyen, K., Daumé III, H., and Boyd-Graber, J. (2017). Reinforcement learning for

bandit neural machine translation with simulated human feedback. arXiv preprint

arXiv:1707.07402.

Novoseller, E., Wei, Y., Sui, Y., Yue, Y., and Burdick, J. (2020). Dueling posterior

sampling for preference-based reinforcement learning. In Conference on Uncertainty

in Artificial Intelligence, pages 1029–1038. PMLR.

Nowozin, S., Cseke, B., and Tomioka, R. (2016). f-gan: Training generative neural

samplers using variational divergence minimization. Advances in neural information

processing systems, 29.

Oertell, O., Chang, J. D., Zhang, Y., Brantley, K., and Sun, W. (2024). Rl for consistency

models: Faster reward guided text-to-image generation.

OpenAI (2023). GPT-4 technical report.

OpenAI (2023). https://openai.com/blog/chatgpt.

Orsini, M., Raichuk, A., Hussenot, L., Vincent, D., Dadashi, R., Girgin, S., Geist, M.,

Bachem, O., Pietquin, O., and Andrychowicz, M. (2021). What matters for adversarial

imitation learning? Advances in Neural Information Processing Systems, 34:14656–

14668.

Osband, I., Aslanides, J., and Cassirer, A. (2018a). Randomized prior functions for deep

reinforcement learning. CoRR, abs/1806.03335.

Osband, I., Aslanides, J., and Cassirer, A. (2018b). Randomized prior functions for

deep reinforcement learning. In Advances in Neural Information Processing Systems,

volume 31.

Osband, I. and Van Roy, B. (2014). Model-based reinforcement learning and the Eluder

dimension. In Advances in Neural Information Processing Systems, pages 1466–1474.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C.,

Agarwal, S., Slama, K., Ray, A., et al. (2022). Training Language Models to Follow

Instructions with Human Feedback. In Proceedings of Neural Information Processing

Systems.

Pacchiano, A., Saha, A., and Lee, J. (2021). Dueling rl: reinforcement learning with

trajectory preferences. arXiv preprint arXiv:2111.04850.

Pang, R. Y. and He, H. (2021). Text Generation by Learning from Demonstrations. In

Proceedings of International Conference on Learning Representations.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: A Method for

Automatic Evaluation of Machine Translation. In Proceedings of Association for

Computational Linguistics.

Pathak, D., Gandhi, D., and Gupta, A. (2019a). Self-supervised exploration via disagree-

ment. In ICML, pages 5062–5071.

Pathak, D., Gandhi, D., and Gupta, A. (2019b). Self-supervised exploration via dis-

agreement. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th

International Conference on Machine Learning, volume 97 of Proceedings of Machine

Learning Research, pages 5062–5071. PMLR.

Peng, X. B., Abbeel, P., Levine, S., and van de Panne, M. (2018). Deepmimic: example-

guided deep reinforcement learning of physics-based character skills. ACM Trans.

Graphics.

Pomerleau, D. A. (1989). ALVINN: An autonomous land vehicle in a neural network. In

Touretzky, D. S., editor, Advances in Neural Information Processing Systems, pages

323–331, San Mateo, CA. Morgan Kaufmann Publishers Inc.

Pomerleau, D. A. (1988). Alvinn: An autonomous land vehicle in a neural network. In

Touretzky, D., editor, Advances in Neural Information Processing Systems, volume 1.

Morgan-Kaufmann.

Pomerleau, D. A. (1989). Alvinn: An autonomous land vehicle in a neural network.

Technical report, CMU.

Popov, I., Heess, N., Lillicrap, T., Hafner, R., Barth-Maron, G., Vecerik, M., Lampe,

T., Tassa, Y., Erez, T., and Riedmiller, M. (2017). Data-efficient deep reinforcement

learning for dexterous manipulation. arXiv preprint arXiv:1704.03073.

Prabhudesai, M., Goyal, A., Pathak, D., and Fragkiadaki, K. (2023). Aligning text-to-

image diffusion models with reward backpropagation.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G.,

Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models

from natural language supervision. In International conference on machine learning,

pages 8748–8763. PMLR.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2019).

Language models are unsupervised multitask learners. OpenAI blog, 1:9.

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. (2023a).

Direct preference optimization: Your language model is secretly a reward model. arXiv

preprint arXiv:2305.18290.

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. (2023b).

Direct preference optimization: Your language model is secretly a reward model. In

Thirty-seventh Conference on Neural Information Processing Systems.

Rafailov, R., Yu, T., Rajeswaran, A., and Finn, C. (2021). Visual adversarial imitation

learning using variational models. In Ranzato, M., Beygelzimer, A., Dauphin, Y.,

Liang, P., and Vaughan, J. W., editors, Advances in Neural Information Processing

Systems, volume 34, pages 3016–3028. Curran Associates, Inc.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W.,

and Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text

transformer. Journal of Machine Learning Research, 21(140):1–67.

Rahimi, A. and Recht, B. (2008a). Random features for large-scale kernel machines. In

Advances in Neural Information Processing Systems, pages 1177–1184.

Rahimi, A. and Recht, B. (2008b). Random features for large-scale kernel machines. In

Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information

Processing Systems, volume 20. Curran Associates, Inc.

Rajaraman, N., Yang, L. F., Jiao, J., and Ramachandran, K. (2020). Toward the funda-

mental limits of imitation learning. arXiv preprint arXiv:2009.05990.

Rajeswaran, A., Kumar, V., Gupta, A., Vezzani, G., Schulman, J., Todorov, E., and

Levine, S. (2017a). Learning complex dexterous manipulation with deep reinforcement

learning and demonstrations. arXiv preprint arXiv:1709.10087.

Rajeswaran, A., Kumar, V., Gupta, A., Vezzani, G., Schulman, J., Todorov, E., and Levine,

S. (2018). Learning Complex Dexterous Manipulation with Deep Reinforcement

Learning and Demonstrations. In Proceedings of Robotics: Science and Systems

(RSS).

Rajeswaran, A., Lowrey, K., Todorov, E., and Kakade, S. (2017b). Towards Generaliza-

tion and Simplicity in Continuous Control. In NIPS.

Rajeswaran, A., Mordatch, I., and Kumar, V. (2020). A game theoretic framework for

model based reinforcement learning. ArXiv, abs/2004.07804.

Ramamurthy, R., Ammanabrolu, P., Brantley, K., Hessel, J., Sifa, R., Bauckhage, C.,

Hajishirzi, H., and Choi, Y. (2022a). Is Reinforcement Learning (Not) for Natural

Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural

Language Policy Optimization. arXiv preprint arXiv:2210.01241.

Ramamurthy, R., Ammanabrolu, P., Brantley, K., Hessel, J., Sifa, R., Bauckhage, C.,

Hajishirzi, H., and Choi, Y. (2022b). Is reinforcement learning (not) for natural

language processing?: Benchmarks, baselines, and building blocks for natural language

policy optimization. arXiv preprint arXiv:2210.01241.

Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and

Sutskever, I. (2021). Zero-shot text-to-image generation. In International Conference

on Machine Learning, pages 8821–8831. PMLR.

Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. (2015). Sequence level training

with recurrent neural networks. ICLR 2016.

Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. (2021). Bridging offline

reinforcement learning and imitation learning: A tale of pessimism. arXiv preprint

arXiv:2103.12021.

Rasmussen, C. E. and Williams, C. K. I. (2005). Gaussian Processes for Machine

Learning (Adaptive Computation and Machine Learning). The MIT Press.

Reddy, S., Dragan, A. D., and Levine, S. (2019). SQIL: Imitation learning via reinforcement

learning with sparse rewards. arXiv preprint arXiv:1905.11108.

Reddy, S., Dragan, A. D., and Levine, S. (2020). SQIL: Imitation learning via reinforcement

learning with sparse rewards. In ICLR.

Ren, Z., Wang, X., Zhang, N., Lv, X., and Li, L.-J. (2017). Deep Reinforcement

Learning-Based Image Captioning With Embedding Reward. In Proceedings of IEEE

Conference on Computer Vision and Pattern Recognition (CVPR).

Roit, P., Ferret, J., Shani, L., Aharoni, R., Cideron, G., Dadashi, R., Geist, M., Girgin, S.,

Hussenot, L., Keller, O., et al. (2023). Factually consistent summarization via reinforce-

ment learning with textual entailment feedback. arXiv preprint arXiv:2306.00186.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022). High-

resolution image synthesis with latent diffusion models. In Proceedings of the

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages

10684–10695.

Ross, S. and Bagnell, D. (2010a). Efficient reductions for imitation learning. In Teh,

Y. W. and Titterington, D. M., editors, AISTATS, JMLR Proceedings, pages 661–668.

Ross, S. and Bagnell, J. A. (2010b). Efficient reductions for imitation learning. In

AISTATS, pages 661–668.

Ross, S. and Bagnell, J. A. (2014a). Reinforcement and imitation learning via interactive

no-regret learning. arXiv preprint arXiv:1406.5979.

Ross, S. and Bagnell, J. A. (2014b). Reinforcement and Imitation Learning via Interactive

No-regret Learning. arXiv preprint arXiv:1406.5979.

Ross, S., Gordon, G., and Bagnell, D. (2011a). A reduction of imitation learning and

structured prediction to no-regret online learning. In Proceedings of the fourteenth

international conference on artificial intelligence and statistics, pages 627–635.

Ross, S., Gordon, G. J., and Bagnell, D. (2011b). A reduction of imitation learning and

structured prediction to no-regret online learning. In AISTATS, pages 627–635.

Ross, S., Gordon, G. J., and Bagnell, J. (2011c). A reduction of imitation learning and

structured prediction to no-regret online learning. In AISTATS.

Ross, S., Melik-Barkhudarov, N., Shankar, K. S., Wendel, A., Dey, D., Bagnell, J. A.,

and Hebert, M. (2013). Learning monocular reactive UAV control in cluttered natural

environments. In ICRA.

Russo, D. and Van Roy, B. (2013). Eluder dimension and the sample complexity of

optimistic exploration. In NIPS, pages 2256–2264.

Russo, D. and Van Roy, B. (2014). Learning to optimize via posterior sampling. Mathe-

matics of Operations Research, 39(4):1221–1243.

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K.,

Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al. (2022). Photorealistic text-

to-image diffusion models with deep language understanding. Advances in Neural

Information Processing Systems, 35:36479–36494.

Salimans, T. and Chen, R. (2018). Learning Montezuma’s Revenge from a single

demonstration. arXiv preprint arXiv:1812.03381.

Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). DistilBERT, a Distilled Version

of BERT: Smaller, Faster, Cheaper and Lighter. arXiv preprint arXiv:1910.01108.

Sasaki, F., Yohira, T., and Kawaguchi, A. (2019). Sample efficient imitation learning for

continuous control. In International conference on learning representations.

Scherrer, B. and Geist, M. (2014). Local policy search in a convex space and conservative

policy iteration as boosted policy search. In Machine Learning and Knowledge

Discovery in Databases: European Conference, ECML PKDD 2014, Nancy, France,

September 15-19, 2014. Proceedings, Part III 14, pages 35–50. Springer.

Scheurer, J., Campos, J. A., Korbak, T., Chan, J. S., Chen, A., Cho, K., and Perez, E.

(2023). Training language models with language feedback at scale. arXiv preprint

arXiv:2303.16755.

Schmeckpeper, K., Rybkin, O., Daniilidis, K., Levine, S., and Finn, C. (2020). Reinforce-

ment learning with videos: Combining offline observations with interaction. CoRR,

abs/2011.06507.

Schuhmann, C. (2022). LAION aesthetics. https://laion.ai/blog/laion-aesthetics/.

Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. (2015a). Trust region

policy optimization. In ICML, pages 1889–1897.

Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P. (2015b). Trust region

policy optimization. CoRR, abs/1502.05477.

Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P. (2017a). Trust region

policy optimization.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017b). Proximal

policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017c). Proximal

policy optimization algorithms. CoRR, abs/1707.06347.

See, A., Liu, P. J., and Manning, C. D. (2017). Get to the point: Summarization

with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the

Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083,

Vancouver, Canada. Association for Computational Linguistics.

Seeger, M. W., Kakade, S. M., and Foster, D. P. (2008). Information consistency of

nonparametric Gaussian process methods. IEEE Transactions on Information Theory,

54(5):2376–2382.

Shen, S., Cheng, Y., He, Z., He, W., Wu, H., Sun, M., and Liu, Y. (2015). Minimum risk

training for neural machine translation. arXiv preprint arXiv:1512.02433.

Shin, D., Dragan, A. D., and Brown, D. S. (2023). Benchmarks and algorithms for offline

preference-based reward learning. arXiv preprint arXiv:2301.01392.

Silver, D. et al. (2016a). Mastering the game of Go with deep neural networks and tree

search. Nature.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrit-

twieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D.,

Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K.,

Graepel, T., and Hassabis, D. (2016b). Mastering the game of Go with deep neural

networks and tree search. Nature, 529:484–489.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M.,

Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T. P., Simonyan, K., and Hassabis, D.

(2018). A general reinforcement learning algorithm that masters chess, shogi, and Go

through self-play. Science, 362:1140–1144.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014).

Deterministic policy gradient algorithms. In International conference on machine

learning, pages 387–395. PMLR.

Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O.,

Gafni, O., et al. (2022). Make-A-Video: Text-to-video generation without text-video

data. arXiv preprint arXiv:2209.14792.

Snell, C., Kostrikov, I., Su, Y., Yang, M., and Levine, S. (2022). Offline RL for natural

language generation with implicit language Q learning. arXiv preprint arXiv:2206.11871.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. (2015). Deep unsu-

pervised learning using nonequilibrium thermodynamics. In International conference

on machine learning, pages 2256–2265. PMLR.

Sokolov, A., Riezler, S., and Urvoy, T. (2016). Bandit structured prediction for

learning from partial feedback in statistical machine translation. arXiv preprint

arXiv:1601.04468.

Sollich, P. and Halees, A. (2002). Learning curves for Gaussian process regression:

Approximations and bounds. Neural computation, 14(6):1393–1428.

Song, Y., Dhariwal, P., Chen, M., and Sutskever, I. (2023). Consistency models. arXiv

preprint arXiv:2303.01469.

Song, Y. and Ermon, S. (2019). Generative modeling by estimating gradients of the data

distribution. Advances in neural information processing systems, 32.

Song, Y., Mavalankar, A., Sun, W., and Gao, S. (2020a). Provably efficient model-

based policy adaptation. In International Conference on Machine Learning, pages

9088–9098. PMLR.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. (2020b).

Score-based generative modeling through stochastic differential equations. arXiv

preprint arXiv:2011.13456.

Song, Y., Zhou, Y., Sekhari, A., Bagnell, J. A., Krishnamurthy, A., and Sun, W. (2022).

Hybrid RL: Using both offline and online data can make RL efficient. arXiv preprint

arXiv:2210.06718.

Spencer, J., Choudhury, S., Venkatraman, A., Ziebart, B., and Bagnell, J. A. (2021).

Feedback in imitation learning: The three regimes of covariate shift. arXiv preprint

arXiv:2102.02872.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. (2010). Gaussian process opti-

mization in the bandit setting: No regret and experimental design. In Proceedings of

the 27th International Conference on Machine Learning,

ICML’10, pages 1015–1022.

Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. (2009). Gaussian process

optimization in the bandit setting: No regret and experimental design. arXiv preprint

arXiv:0912.3995.

Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei,

D., and Christiano, P. F. (2020). Learning to Summarize with Human Feedback. In

Proceedings of Neural Information Processing Systems.

Sumers, T. R., Ho, M. K., Hawkins, R. D., Narasimhan, K., and Griffiths, T. L. (2021).

Learning rewards from linguistic feedback. In Proceedings of the AAAI Conference on

Artificial Intelligence, volume 35, pages 6002–6010.

Sun, M., Mahajan, A., Hofmann, K., and Whiteson, S. (2021). SoftDICE for imitation

learning: Rethinking off-policy distribution matching. arXiv preprint arXiv:2106.03155.

Sun, W., Bagnell, J. A., and Boots, B. (2018). Truncated Horizon Policy Search: Combin-

ing Reinforcement Learning & Imitation Learning. arXiv preprint arXiv:1805.11240.

Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., and Langford, J. (2019a).

Model-based RL in contextual decision processes: PAC bounds and exponential

improvements over model-free approaches. In Conference on Learning Theory, pages

2898–2933. PMLR.

Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., and Langford, J. (2019b).

Model-based RL in contextual decision processes: PAC bounds and exponential

improvements over model-free approaches. In Proceedings of the Thirty-Second Conference on

Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages

2898–2933.

Sun, W., Vemula, A., Boots, B., and Bagnell, D. (2019c). Provably efficient imitation

learning from observation alone. In ICML, volume 97. PMLR.

Sun, W., Vemula, A., Boots, B., and Bagnell, D. (2019d). Provably efficient imitation

learning from observation alone. In International Conference on Machine Learning,

pages 6036–6045. PMLR.

Sun, W., Venkatraman, A., Gordon, G. J., Boots, B., and Bagnell, J. A. (2017a). Deeply

Aggrevated: Differentiable Imitation Learning for Sequential Prediction. In Proceed-

ings of International Conference on Machine Learning.

Sun, W., Venkatraman, A., Gordon, G. J., Boots, B., and Bagnell, J. A. (2017b). Deeply

aggrevated: Differentiable imitation learning for sequential prediction. arXiv preprint

arXiv:1703.01030.

Sutskever, I., Martens, J., Dahl, G. E., and Hinton, G. E. (2013). On the importance of

initialization and momentum in deep learning. In ICML, volume 28.

Sutton, R. S. (1990). First results with Dyna, an integrated architecture for learning,

planning, and reacting. In Neural Networks for Control, pages 179–189. The MIT

Press: Cambridge, MA, USA.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT

Press, Cambridge, MA.

Swamy, G., Choudhury, S., Bagnell, J. A., and Wu, Z. S. (2021). Of moments and

matching: A game-theoretic framework for closing the imitation gap. In Proceedings

of the 38th International Conference on Machine Learning.

Talebi, M. S. and Maillard, O.-A. (2018). Variance-aware regret bounds for undiscounted

reinforcement learning in MDPs. In Algorithmic Learning Theory, pages 770–805.

PMLR.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki,

A., Merel, J., Lefrancq, A., et al. (2018). DeepMind Control Suite. arXiv

preprint arXiv:1801.00690.

Tavakoli, A., Levdik, V., Islam, R., Smith, C. M., and Kormushev, P. (2018). Exploring

restart distributions. arXiv preprint arXiv:1811.11298.

Tiapkin, D., Belomestny, D., Calandriello, D., Moulines, E., Naumov, A., Perrault, P.,

Valko, M., and Menard, P. (2023). Demonstration-regularized rl.

Todorov, E., Erez, T., and Tassa, Y. (2012a). MuJoCo: A physics engine for model-based

control. In IEEE International Conference on Intelligent Robots and Systems, pages

5026–5033.

Todorov, E., Erez, T., and Tassa, Y. (2012b). MuJoCo: A physics engine for model-based

control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems,

pages 5026–5033.

Tolstikhin, I., Gelly, S., Bousquet, O., Simon-Gabriel, C.-J., and Schölkopf, B. (2017).

AdaGAN: Boosting generative models.

Torabi, F., Warnell, G., and Stone, P. (2018a). Behavioral cloning from observation. In

IJCAI, pages 4950–4957.

Torabi, F., Warnell, G., and Stone, P. (2018b). Generative adversarial imitation from

observation. arXiv preprint arXiv:1807.06158.

Touati, A., Zhang, A., Pineau, J., and Vincent, P. (2020). Stable policy optimization via

off-policy divergence regularization. arXiv preprint arXiv:2003.04108.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N.,

Batra, S., Bhargava, P., Bhosale, S., et al. (2023). Llama 2: Open foundation and

fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Uchendu, I., Xiao, T., Lu, Y., Zhu, B., Yan, M., Simon, J., Bennice, M., Fu, C., Ma, C.,

Jiao, J., et al. (2023). Jump-start reinforcement learning. In International Conference

on Machine Learning, pages 34556–34583. PMLR.

Uehara, M., Huang, J., and Jiang, N. (2020). Minimax weight and Q-function learning for

off-policy evaluation. In Proceedings of the 37th International Conference on Machine

Learning, pages 9659–9668.

Uehara, M., Imaizumi, M., Jiang, N., Kallus, N., Sun, W., and Xie, T. (2021). Finite

sample analysis of minimax offline reinforcement learning: Completeness, fast rates

and first-order efficiency. arXiv preprint arXiv:2102.02981.

Uehara, M., Zhao, Y., Black, K., Hajiramezanali, E., Scalia, G., Diamant, N. L., Tseng,

A. M., Biancalani, T., and Levine, S. (2024). Fine-tuning of continuous-time diffusion

models as entropy-regularized control.

Umlauft, J., Pöhler, L., and Hirche, S. (2018). An uncertainty-based control Lyapunov

approach for control-affine systems modeled by Gaussian process. IEEE Control

Systems Letters, 2(3):483–488.

Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. (2013). Finite-

time analysis of kernelised contextual bandits. In Proceedings of the Twenty-Ninth

Conference on Uncertainty in Artificial Intelligence, UAI’13, pages 654–663, Arlington,

Virginia, USA. AUAI Press.

Vedantam, R., Lawrence Zitnick, C., and Parikh, D. (2015). CIDEr: Consensus-based

Image Description Evaluation. In Proceedings of IEEE Conference on Computer

Vision and Pattern Recognition, pages 4566–4575.

Venkatraman, A., Hebert, M., and Bagnell, J. A. (2015). Improving multi-step prediction

of learned time series models. AAAI.

Vincent, P. (2011). A connection between score matching and denoising autoencoders.

Neural computation, 23(7):1661–1674.

Völske, M., Potthast, M., Syed, S., and Stein, B. (2017). TL;DR: Mining Reddit to

learn automatic summarization. In Proceedings of the Workshop on New Frontiers in

Summarization, pages 59–63, Copenhagen, Denmark. Association for Computational

Linguistics.

Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint.

Cambridge University Press, New York.

Wang, B. and Komatsuzaki, A. (2021). GPT-J-6B: A 6 billion parameter autoregressive

language model.

Wang, R., Foster, D. P., and Kakade, S. M. (2020a). What are the statistical limits of

offline RL with linear function approximation? arXiv preprint arXiv:2010.11895.

Wang, T., Bao, X., Clavera, I., Hoang, J., Wen, Y., Langlois, E., Zhang, S., Zhang, G.,

Abbeel, P., and Ba, J. (2019). Benchmarking model-based reinforcement learning.

arXiv preprint arXiv:1907.02057.

Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., and Zhou, M. (2020b). MiniLM: Deep

self-attention distillation for task-agnostic compression of pre-trained transformers.

Advances in Neural Information Processing Systems, 33:5776–5788.

Warnell, G., Waytowich, N., Lawhern, V., and Stone, P. (2018). Deep TAMER: Interactive

agent shaping in high-dimensional state spaces. In Proceedings of the AAAI conference

on artificial intelligence, volume 32.

Williams, C. K. and Vivarelli, F. (2000). Upper and lower bounds on the learning curve

for Gaussian processes. Machine learning, 40(1):77–102.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist

reinforcement learning. Machine learning.

Wirth, C., Akrour, R., Neumann, G., Fürnkranz, J., et al. (2017). A survey of preference-

based reinforcement learning methods. Journal of Machine Learning Research,

18(136):1–46.

Wiseman, S. and Rush, A. M. (2016). Sequence-to-Sequence Learning as Beam-Search

Optimization. In Proceedings of Conference on Empirical Methods in Natural Lan-

guage Processing.

Wu, J., Ouyang, L., Ziegler, D. M., Stiennon, N., Lowe, R., Leike, J., and Christiano,

P. (2021). Recursively summarizing books with human feedback. arXiv preprint

arXiv:2109.10862.

Wu, R. and Sun, W. (2023). Making RL with preference-based feedback efficient via

randomization. arXiv preprint arXiv:2310.14554.

Wu, T., Zhu, B., Zhang, R., Wen, Z., Ramchandran, K., and Jiao, J. (2023). Pairwise

proximal policy optimization: Harnessing relative feedback for LLM alignment. arXiv

preprint arXiv:2310.00212.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M.,

Cao, Y., Gao, Q., Macherey, K., et al. (2016). Google’s neural machine translation

system: Bridging the gap between human and machine translation. arXiv preprint

arXiv:1609.08144.

Wu, Y., Tucker, G., and Nachum, O. (2019). Behavior regularized offline reinforcement

learning. arXiv preprint arXiv:1911.11361.

Xie, T. and Jiang, N. (2020). Q* approximation schemes for batch reinforcement learning:

A theoretical comparison. In UAI 2020.

Xu, M., Yu, L., Song, Y., Shi, C., Ermon, S., and Tang, J. (2022). GeoDiff: A geometric

diffusion model for molecular conformation generation. arXiv preprint

arXiv:2203.02923.

Xu, Y., Wang, R., Yang, L., Singh, A., and Dubrawski, A. (2020). Preference-based

reinforcement learning with finite-time guarantees. Advances in Neural Information

Processing Systems, 33:18784–18794.

Yang, C., Ma, X., Huang, W., Sun, F., Liu, H., Huang, J., and Gan, C. (2019). Imitation

learning from observations by minimizing inverse dynamics disagreement. In NeurIPS.

Yang, K., Klein, D., Celikyilmaz, A., Peng, N., and Tian, Y. (2023). RLCD: Reinforcement

learning from contrast distillation for language model alignment. arXiv preprint

arXiv:2307.12950.

Yang, L. F. and Wang, M. (2019). Reinforcement leaning in feature space: Matrix bandit,

kernels, and regret bound. arXiv preprint arXiv:1905.10389.

Yang, Z., Jin, C., Wang, Z., Wang, M., and Jordan, M. (2020). Provably efficient

reinforcement learning with kernel and neural function approximations. In Advances

in Neural Information Processing Systems, volume 33, pages 13903–13916.

Yarats, D., Fergus, R., Lazaric, A., and Pinto, L. (2022). Mastering visual continuous con-

trol: Improved data-augmented reinforcement learning. In International Conference

on Learning Representations.

Yarats, D. and Kostrikov, I. (2020). Soft actor-critic (SAC) implementation in PyTorch.

https://github.com/denisyarats/pytorch_sac.

Yin, D., Hao, B., Abbasi-Yadkori, Y., Lazić, N., and Szepesvári, C. (2022). Efficient

local planning with linear function approximation. In International Conference on

Algorithmic Learning Theory, pages 1165–1192. PMLR.

Yin, M., Bai, Y., and Wang, Y.-X. (2021). Near-optimal offline reinforcement learning

via double variance reduction. arXiv preprint arXiv:2102.01748.

Yin, M. and Wang, Y.-X. (2020). Asymptotically efficient off-policy evaluation for

tabular reinforcement learning. In Proceedings of the Twenty Third International

Conference on Artificial Intelligence and Statistics, pages 3948–3958.

Yu, L., Yu, T., Song, J., Neiswanger, W., and Ermon, S. (2023). Offline imitation learning

with suboptimal demonstrations via relaxed distribution matching.

Yu, T., Kumar, A., Rafailov, R., Rajeswaran, A., Levine, S., and Finn, C. (2021).

COMBO: Conservative offline model-based policy optimization. arXiv preprint

arXiv:2102.08363.

Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J. Y., Levine, S., Finn, C., and Ma, T. (2020).

MOPO: Model-based offline policy optimization. In Advances in Neural Information

Processing Systems, volume 33, pages 14129–14142.

Yuan, W., Pang, R. Y., Cho, K., Sukhbaatar, S., Xu, J., and Weston, J. (2024). Self-

rewarding language models. arXiv preprint arXiv:2401.10020.

Yuan, Z., Yuan, H., Tan, C., Wang, W., Huang, S., and Huang, F. (2023). RRHF: Rank

responses to align language models with human feedback without tears. arXiv preprint

arXiv:2304.05302.

Yue, Y., Broder, J., Kleinberg, R., and Joachims, T. (2012). The k-armed dueling bandits

problem. Journal of Computer and System Sciences, 78(5):1538–1556.

Zanette, A. (2020). Exponential lower bounds for batch reinforcement learning: Batch RL

can be exponentially harder than online RL. arXiv preprint arXiv:2012.08005.

Zhan, W., Huang, B., Huang, A., Jiang, N., and Lee, J. (2022). Offline reinforcement

learning with realizability and single-policy concentrability. In Conference on Learning

Theory, pages 2730–2775. PMLR.

Zhan, W., Uehara, M., Kallus, N., Lee, J. D., and Sun, W. (2023a). Provable offline

preference-based reinforcement learning.

Zhan, W., Uehara, M., Sun, W., and Lee, J. D. (2023b). Provable reward-agnostic

preference-based reinforcement learning.

Zhang, R., Dai, B., Li, L., and Schuurmans, D. (2020). GenDICE: Generalized offline

estimation of stationary values. In International Conference on Learning Representations.

Zhang, T. (2005). Learning bounds for kernel regression using effective data dimension-

ality. Neural computation, 17(9):2077–2098.

Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. (2019a). BERTScore:

Evaluating Text Generation with BERT. arXiv preprint arXiv:1904.09675.

Zhang, W., Feng, Y., Meng, F., You, D., and Liu, Q. (2019b). Bridging the gap between

training and inference for neural machine translation. arXiv preprint arXiv:1906.02448.

Zhang, X., Bosselut, A., Yasunaga, M., Ren, H., Liang, P., Manning, C. D., and Leskovec,

J. (2022). GreaseLM: Graph reasoning enhanced language models. In Proceedings of

International Conference on Learning Representations.

Zhang, X. and Lapata, M. (2017). Sentence Simplification with Deep Reinforcement

Learning. In Proceedings of Conference on Empirical Methods in Natural Language

Processing.

Zhao, Y., Joshi, R., Liu, T., Khalman, M., Saleh, M., and Liu, P. J. (2023). SLiC-HF: Sequence

likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425.

Zhong, V., Xiong, C., and Socher, R. (2017). Seq2SQL: Generating structured queries

from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.

Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., et al.

(2023). LIMA: Less is more for alignment. arXiv preprint arXiv:2305.11206.

Zhou, L., Du, Y., and Wu, J. (2021). 3D shape generation and completion through

point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on

Computer Vision, pages 5826–5835.

Zhu, B., Jiao, J., and Jordan, M. I. (2023a). Principled reinforcement learning with human

feedback from pairwise or k-wise comparisons. arXiv preprint arXiv:2301.11270.

Zhu, B., Sharma, H., Frujeri, F. V., Dong, S., Zhu, C., Jordan, M. I., and Jiao, J. (2023b).

Fine-tuning language models with advantage-induced policy alignment. arXiv preprint

arXiv:2306.02231.

Zhu, Z., Lin, K., Dai, B., and Zhou, J. (2020). Off-policy imitation learning from

observations. In NeurIPS.

Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. (2008). Maximum entropy

inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL,

USA.

Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano,

P., and Irving, G. (2019). Fine-tuning Language Models from Human Preferences.

arXiv preprint arXiv:1909.08593.

Zoghi, M., Whiteson, S. A., De Rijke, M., and Munos, R. (2014). Relative confidence

sampling for efficient on-line ranker evaluation. In Proceedings of the 7th ACM

international conference on Web search and data mining, pages 73–82.

	Biographical Sketch
	Dedication
	Acknowledgements
	Contents
	Introduction
	Main Contributions
	Imitation Learning from Diverse Data Sources
	Reinforcement Learning and Imitation Learning with Generative Models

	Background
	Markov Decision Process (MDP)
	Deep Policy Gradient Algorithms
	Inverse Reinforcement Learning (IRL)

	Organization
	Bibliographical Remarks

	I Imitation Learning
	Model-Based Imitation Learning From Observation Alone
	Introduction
	Related Works
	Setting
	Function Approximation Setup

	Algorithm
	Components of MobILE
	Exploration And Imitation Tradeoff

	Analysis
	Regret Bound
	Exploration in ILFO and the Exponential Gap between IL and ILFO

	Practical Instantiation of MobILE
	Experiments
	Benchmarking MobILE on MuJoCo suite
	Importance of the optimistic MDP construction
	Varying Number of Expert Samples

	Conclusion

	Model-Based Offline Imitation Learning
	Introduction
	Related work
	Setting
	Algorithm
	Specialization to offline RL

	Analysis
	Analysis: Discrete MDPs
	Analysis: KNRs and GPs for Continuous MDPs

	Practical Implementation
	Experiments
	Evaluation on MuJoCo Continuous Control Tasks
	Ablation

	Conclusion

	Model-Free Off-Policy Imitation Learning
	Introduction
	Related works
	Preliminaries
	Adversarial Imitation Learning (AIL)
	Discriminator Actor Critic (DAC)
	ValueDICE

	Algorithm
	AILBoost: Adversarial Imitation Learning via Boosting

	Experiments
	Controller State-based Experiments
	Image-based Experiments
	Sensitivity to gradient-based optimization for weak learners and discriminators

	Conclusion


	II IL and RL for Generative Models
	Learning to Generate Better Than Your LLM
	Introduction
	Related Work
	Preliminaries
	Reinforcement Learning from Guided Feedback
	Theoretical Justification
	Experiments
	Experimental Results

	Conclusion and Future Work

	Provably Efficient RL with Preference-based Feedback via Dataset Reset
	Introduction
	Related Work

	Preliminaries
	Dataset Reset Policy Optimization
	Theoretical Analysis
	Theoretical Sample Complexity

	Experiments
	How well can DR-PO optimize the RLHF objective?
	Analysis of Dataset Reset Proportion
	DR-PO Transfer Performance
	DR-PO Scaling Performance on Anthropic HH

	Conclusion

	RL for Consistency Models: Faster Reward Guided Text-to-Image Generation
	Introduction
	Related Works
	Preliminaries
	Reinforcement Learning
	Diffusion and Consistency Models
	Reinforcement Learning for Diffusion Models

	Reinforcement Learning for Consistency Models
	Experiments
	RLCM vs. DDPO Performance Comparisons
	Train and Test Time Analysis
	Ablation of Inference Horizon for RLCM
	Qualitative Effects on Generalization

	Conclusion and Future Directions

	Conclusion
	Imitation Learning
	Reinforcement Learning for Generative Models
	Concluding Remarks


	III Appendix
	Missing Proofs and Details in Chapter 2
	Analysis of Algorithm 2
	Discrete MDPs
	KNRs
	General Function Class G with Bounded Eluder dimension
	Proof of Theorem 7

	Auxiliary Lemmas
	Implementation Details
	Environment Setup and Benchmarks
	Practical Implementation of MobILE
	Hyper-parameter Details

	Additional Experimental Results
	Modified Cartpole-v0 environment with noise added to transition dynamics
	Swimmer Learning Curves
	Additional Results
	Ablation Study on Number of Models used for Strategic Exploration Bonus


	Missing Proofs and Details in Chapter 3
	Bonus Designs
	Tabular models
	KNRs
	Gaussian processes

	Proof of Theorem 10
	Finite sample error bound for each model
	Discrete MDPs
	KNRs
	Gaussian processes
	Missing Proofs

	Auxiliary Lemmas
	Implementation Details
	Environment Details
	Dynamics Ensemble Architecture and Model Learning
	Policy Architecture and TRPO Details
	Discriminator Update and Cost Function Details

	Additional Experiments
	MILO with Expert Trajectories
	Performance of MILO on Ant without Pessimism


	Missing Proofs and Details in Chapter 4
	Detailed Algorithm Pseudocode
	Implementation and Experiment Details
	Environment Details
	Dataset Details
	Hyperparameters

	Additional Results
	Aggregate Performance Comparisons
	Learning Curves
	Learning curves across different optimization schedules


	Missing Proofs and Details in Chapter 5
	Additional Related Work
	Additional Algorithms
	Additional Experimental Details
	KL Reward Constraint
	Task Details
	IMDB - Algorithm Details
	CommonGen - Algorithm Hyperparameters
	TL;DR Summarization - Algorithm Hyperparameters

	IMDB Qualitative Examples
	CommonGen Qualitative Examples
	TL;DR Qualitative Examples

	Missing Proofs and Details in Chapter 6
	DR-PO with NPG
	Proof of Theorem 26
	Q function Estimation Error
	NPG Analysis
	Unregularized Suboptimality Gap w.r.t. r*

	NPG with regularized Q functions
	Theoretical Analysis

	Proof of Theorem 90
	Q function Estimation Error
	Regularized NPG Analysis
	Unregularized Suboptimality Gap w.r.t. r*

	Auxiliary Lemmas
	Least Squares Guarantee
	Maximum Likelihood Estimation Guarantee
	Performance Difference
	KL Divergence Property

	Additional Experiment Details
	Experiment Hyperparameters and Task Details
	Dataset Reset Implementation Details
	Details on GPT4 Winrate
	Examples from Test
	Hyperparameters
	Additional Experiments


	Missing Proofs and Details in Chapter 8
	Consistency Models
	Experiment Details
	Hyperparameters
	Hyperparameter Sweep Ranges
	Details on Task Prompts

	Additional Samples from RLCM
	Aesthetic Task
	Prompt Image Alignment