Reinforcement Learning with Latent Low Rank Structure
The practicality of reinforcement learning algorithms is limited by poor scaling with respect to problem size: without structural assumptions, algorithms incur sample complexity or regret bounds that scale with the product of the sizes of the state space $|S|$ and action space $|A|$ of a finite-horizon MDP, e.g., $|S||A|$ or $\sqrt{|S||A|}$. In this work, we consider a class of MDPs with latent low rank structure and investigate the minimal low rank assumptions that admit efficient reinforcement learning. We first show that if one assumes only that $Q^*$ is low rank and is constrained to estimate the $Q$ function using observations from a subset of its entries, there is a worst-case instance in which one must incur sample complexity exponential in the horizon $H$ to learn a near-optimal policy. We subsequently show that under stronger low rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) achieve the desired sample complexity of $\tilde{O}\left((|S|+|A|)\,\mathrm{poly}(d,H)/\epsilon^2\right)$ for a rank-$d$ setting, which is minimax optimal with respect to the scaling of $|S|$, $|A|$, and $\epsilon$. In contrast to the literature on linear and low-rank MDPs, we do not require a known feature mapping, our algorithms are computationally simple, and our results hold for long time horizons.

We next consider the problem of transferring a latent low rank representation when the source and target MDPs have transition kernels with Tucker rank $(|S|, d, |A|)$, $(|S|, |S|, d)$, $(d, |S|, |A|)$, or $(d, d, d)$. In each setting, we introduce the transferability coefficient $\alpha$, which measures the difficulty of representational transfer. Our algorithm learns latent representations in each source MDP and then exploits the linear structure to remove the dependence on $|S|$, $|A|$, or $|S||A|$ from the target MDP regret bound. We complement our positive results with information-theoretic lower bounds showing that our algorithms (excluding the $(d, d, d)$ setting) are minimax optimal with respect to $\alpha$.
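For concreteness, saying $Q^*$ has rank $d$ means that at each step $h$ the $|S| \times |A|$ optimal action-value matrix factorizes into $d$ rank-one terms (the factor names below are illustrative; the abstract does not fix this notation):

$$ Q^*_h(s, a) = \sum_{i=1}^{d} f_{h,i}(s)\, g_{h,i}(a), \qquad s \in S, \; a \in A, $$

so the matrix carries roughly $d(|S|+|A|)$ degrees of freedom rather than $|S||A|$, which is the source of the $(|S|+|A|)$ scaling in the sample complexity above.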
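The abstract does not spell out the matrix-estimation subroutine inside LR-MCPI/LR-EVI, but the following minimal sketch (hypothetical, noiseless, with illustrative names throughout; not the authors' exact method) shows why rank-$d$ structure lets one recover an $|S| \times |A|$ matrix from on the order of $d(|S|+|A|)$ entries, here via a CUR-style cross approximation built from $d$ sampled rows and columns:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, d = 50, 40, 3

# Synthetic rank-d stand-in for the optimal Q-matrix at one step h.
U = rng.standard_normal((n_states, d))
V = rng.standard_normal((n_actions, d))
Q = U @ V.T

# Observe only d anchor rows (states) and d anchor columns (actions):
# d * n_actions + d * n_states entries in total, not n_states * n_actions.
rows = rng.choice(n_states, size=d, replace=False)
cols = rng.choice(n_actions, size=d, replace=False)

# CUR/Nystrom-style reconstruction from the observed cross:
# Q_hat = C @ pinv(W) @ R, with C = Q[:, cols], W = Q[rows, cols], R = Q[rows, :].
Q_hat = Q[:, cols] @ np.linalg.pinv(Q[np.ix_(rows, cols)]) @ Q[rows, :]

# Exact recovery in the noiseless rank-d case (error at machine precision).
print(np.max(np.abs(Q - Q_hat)))
```

In the noiseless rank-$d$ case the reconstruction is exact whenever the $d \times d$ intersection submatrix is invertible; the algorithms in this work must additionally contend with noisy Monte Carlo estimates of the observed entries.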
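Similarly, viewing the transition kernel as a tensor indexed by next state, current state, and action, each Tucker rank condition asserts that a particular mode has rank $d$. For example, under one indexing convention (again illustrative; the abstract does not fix notation), Tucker rank $(d, |S|, |A|)$ means

$$ P(s' \mid s, a) = \sum_{i=1}^{d} u_i(s')\, w_i(s, a), $$

so every next-state distribution lies in a common $d$-dimensional subspace. It is a shared latent representation of this kind that the transfer algorithm estimates from the source MDPs and reuses in the target.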