Slides: html pdf

Limits of deep reinforcement learning


Model-free methods (DQN, A3C, DDPG, PPO, SAC) are able to find optimal policies in complex MDPs by just sampling transitions. They suffer however from a high sample complexity, i.e. they need ridiculous amounts of samples to converge.

Model-based methods (I2A, Dreamer, MuZero) use learned dynamics to predict the future and plan the consequences of an action. The sample complexity is lower, but learning a good model can be challenging. Inference times can be prohibitive.

Overview of deep RL methods. Source: https://github.com/avillemin/RL-Personnal-Notebook

Deep RL is still very unstable. Depending on initialization, deep RL networks may or may not converge (30% of runs converge to a worse policy than a random agent). Careful optimization such as TRPO / PPO help, but not completely. You never know if failure is your fault (wrong network, bad hyperparameters, bug), or just bad luck.

As it uses neural networks, deep RL overfits its training data, i.e. the environment it is trained on. If you change anything to the environment dynamics, you need to retrain from scratch. OpenAI Five collects 900 years of game experience per day on Dota 2: it overfits the game, it does not learn how to play. Modify the map a little bit and everything is gone (but see Meta RL - RL^2 later).

Classical methods sometimes still work better. Model Predictive Control (MPC) is able to control Mujoco robots much better than RL through classical optimization techniques (e.g. iterative LQR) while needing much less computations. If you have a good physics model, do not use DRL. Reserve it for unknown systems, or when using noisy sensors (images). Genetic algorithms (CMA-ES) sometimes give better results than RL to train policy networks.

You cannot do that with deep RL (yet):

RL libraries

  • keras-rl: many deep RL algorithms implemented directly in keras: DQN, DDQN, DDPG, Continuous DQN (CDQN or NAF), Cross-Entropy Method (CEM)…


  • OpenAI Baselines from OpenAI: A2C, ACER, ACKTR, DDPG, DQN, PPO, TRPO… Not maintained anymore.


  • Stable baselines from Inria Flowers, a clean rewrite of OpenAI baselines also including SAC and TD3.


  • rlkit from Vitchyr Pong (PhD student at Berkeley) with in particular model-based algorithms (TDM).


  • chainer-rl implemented in Chainer: A3C, ACER, DQN, DDPG, PGT, PCL, PPO, TRPO.


  • RL Mushroom is a very modular library based on Pytorch allowing to implement DQN and variants, DDPG, SAC, TD3, TRPO, PPO.


  • Tensorforce implement in tensorflow: DQN and variants, A3C, DDPG, TRPO, PPO.


  • Tensorflow Agents is officially supported by tensorflow: DQN, A3C, DDPG, TD3, PPO, SAC.


  • Coach from Intel Nervana also provides many state-of-the-art algorithms.


Deep RL algorithms available in Coach. Source: https://github.com/NervanaSystems/coach

  • rllib is part of the more global ML framework Ray, which also includes Tune for hyperparameter optimization.

It has implementations in both tensorflow and Pytorch.

All major model-free algorithms are implemented (DQN, Rainbow, A3C, DDPG, PPO, SAC), including their distributed variants (Ape-X, IMPALA, TD3) but also model-based algorithms (Dreamer!)


Architecture of rllib. Source: https://docs.ray.io/en/master/rllib.html

Inverse RL - learning the reward function

RL is an optimization method: it maximizes the reward function that you provide it. If you do not design the reward function correctly, the agent may not do what you expect. In the Coast runners game, turbos provide small rewards but respawn very fast: it is more optimal to collect them repeatedly than to try to finish the race.

Defining the reward function that does what you want becomes an art. RL algorithms work better with dense rewards than sparse ones. It is tempting to introduce intermediary rewards. You end up covering so many special cases that it becomes unusable: Go as fast as you can but not in a curve, except if you are on a closed circuit but not if it rains…

In the OpenAI Lego stacking paper (Popov et al., 2017), it was perhaps harder to define the reward function than to implement DDPG.

Lego stacking handmade reward function (Popov et al., 2017).

The goal of inverse RL (see (Arora and Doshi, 2019) for a review) is to learn from demonstrations (e.g. from humans) which reward function is maximized. This is not imitation learning, where you try to learn and reproduce actions. The goal if to find a parametrized representation of the reward function:

\hat{r}(s) = \sum_{i=1}^K w_i \, \varphi_i(s)

When the reward function has been learned, you can train a RL algorithm to find the optimal policy.

Intrinsic motivation and curiosity

One fundamental problem of RL is its dependence on the reward function. When rewards are sparse, the agent does not learn much (but see successor representations) unless its random exploration policy makes it discover rewards. The reward function is handmade, what is difficult in realistic complex problems.

Human learning does not (only) rely on maximizing rewards or achieving goals. Especially infants discover the world by playing, i.e. interacting with the environment out of curiosity.

What happens if I do that? Oh, that’s fun.

This called intrinsic motivation: we are motivated by understanding the world, not only by getting rewards. Rewards are internally generated.

In intrinsic motivation, rewards are generated internally depending on the achieved states. Source: (Barto, 2013).

What is intrinsically rewarding / motivating / fun? Mostly what has unexpected consequences.

  • If you can predict what is going to happen, it becomes boring.
  • If you cannot predict, you can become curious and try to explore that action.

Intrinsic rewards are defined by the ability to predict states. Source: https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-next-state-prediction-f7f4e2f592fa

The intrinsic reward (IR) of an action is defined as the sensory prediction error:

\text{IR}(s_t, a_t, s_{t+1}) = || f(s_t, a_t) - s_{t+1}||

where f(s_t, a_t) is a forward model predicting the sensory consequences of an action. An agent maximizing the IR will tend to visit unknown / poorly predicted states (exploration).

Is it a good idea to predict frames directly? Frames are highly dimensional and there will always be a remaining error.

Intrinsic rewards are defined by the ability to predict states. Source: https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-next-state-prediction-f7f4e2f592fa

Moreover, they can be noisy and unpredictable, without being particularly interesting.

Falling leaves are hard to predict, but hardly interesting. Source: Giphy.

What can we do? As usual, predict in a latent space!

The intrinsic curiosity module (ICM, (Pathak et al., 2017)) learns to provide an intrinsic reward for a transition (s_t, a_t, s_{t+1}) by comparing the predicted latent representation \hat{\phi}(s_{t+1}) (using a forward model) to its “true” latent representation \phi(s_{t+1}). The feature representation \phi(s_t) is trained using an inverse model predicting the action leading from s_t to s_{t+1}.

intrinsic curiosity module. (Pathak et al., 2017)

Curiosity-driven RL on Atari games (Burda et al., 2018):

Hierarchical RL - learning different action levels

In all previous RL methods, the action space is fixed. When you read a recipe, the actions are “Cut carrots”, “Boil water”, etc. But how do you perform these high-level actions? Break them into subtasks iteratively until you arrive to muscle activations. But it is not possible to learn to cook a boeuf bourguignon using muscle activations as actions.

Hierarchical structure of preparing a boeuf bourguignon. Source: https://thegradient.pub/the-promise-of-hierarchical-reinforcement-learning/

Sub-policies (options) can be trained to solve simple tasks (going left, right, etc). A meta-learner or controller then learns to call each sub-policy when needed, at a much lower frequency (Frans et al., 2017).

Meta Learning Shared Hierarchies (Frans et al., 2017). Source: https://openai.com/blog/learning-a-hierarchy/

Some additional references on Hierarchical Reinforcement Learning

  • MLSH: Frans, K., Ho, J., Chen, X., Abbeel, P., and Schulman, J. (2017). Meta Learning Shared Hierarchies. arXiv:1710.09767.
  • FUN: Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., et al. (2017). FeUdal Networks for Hierarchical Reinforcement Learning. arXiv:1703.01161
  • Option-Critic architecture: Bacon, P.-L., Harb, J., and Precup, D. (2016). The Option-Critic Architecture. arXiv:1609.05140.
  • HIRO: Nachum, O., Gu, S., Lee, H., and Levine, S. (2018). Data-Efficient Hierarchical Reinforcement Learning. arXiv:1805.08296.
  • HAC: Levy, A., Konidaris, G., Platt, R., and Saenko, K. (2019). Learning Multi-Level Hierarchies with Hindsight. arXiv:1712.00948.
  • Spinal-cortical: Heess, N., Wayne, G., Tassa, Y., Lillicrap, T., Riedmiller, M., and Silver, D. (2016). Learning and Transfer of Modulated Locomotor Controllers. arXiv:1610.05182.

Meta Reinforcement learning - RL^2

Meta learning is the ability to reuse skills acquired on a set of tasks to quickly acquire new (similar) ones (generalization).

Meta Reinforcement learning. Source: https://meta-world.github.io/

Meta RL is based on the idea of fast and slow learning: * Slow learning is the adaptation of weights in the NN. * Fast learning is the adaptation to changes in the environment.

A simple strategy developed concurrently by (Wang et al., 2017) and (Duan et al., 2016)is to have a model-free algorithm (e.g. A3C) integrate with a LSTM layer not only the current state s_t, but also the previous action a_{t-1} and reward r_t.

Meta RL uses a LSTM layer to encode past actions and rewards in the state representation. Source: (Wang et al., 2017)

The policy of the agent becomes memory-guided: it selects an action depending on what it did before, not only the state.

Meta RL algorithms are trained on a set of similar MDPs. Source: (Duan et al., 2016)

The algorithm is trained on a set of similar MDPs:

  1. Select a MDP \mathcal{M}.
  2. Reset the internal state of the LSTM.
  3. Sample trajectories and adapt the weights.
  4. Repeat 1, 2 and 3.

The meta RL can be be trained an a multitude of 2-armed bandits, each giving a reward of 1 with probability p and 1-p. Left is a classical bandit algorithm, right is the meta bandit:

Classical bandit (left) and meta-bandit (right) learning a new two-armed bandit problem. Source: https://hackernoon.com/learning-policies-for-learning-policies-meta-reinforcement-learning-rl%C2%B2-in-tensorflow-b15b592a2ddf

The meta bandit has learned that the best strategy for any 2-armed bandit is to sample both actions randomly at the beginning and then stick to the best one. The meta bandit does not learn to solve each problem, it learns how to solve them.

Model-Based Meta-Reinforcement Learning for Flight with Suspended Payloads

Additional references on meta RL:

Offline RL

Even off-policy algorithms need to interact with the environment: the behavior policy is \epsilon-soft around the learned policy.

Is it possible to learn purely offline from recorded transitions using another policy (experts)? Data efficiency. This would bring safety: the agent would not explore dangerous actions.

D4RL (https://sites.google.com/view/d4rl/home) provides offline data recorded using expert policies to test offline algorithms.


As no exploration is allowed, the model is limited by the quality of the data: if the acquisition policy is random, there is not much to hope. If we have already a good policy, but slow or expensive to compute, we could try to transfer it to a fast neural network. If the policy is a human expert, it is called learning from demonstrations (lfd) or imitation learning.

The simplest approach to offline RL is behavioral cloning: simply supervised learning of (s, a) pairs…

Bojarski et al. (2016)

The main problem in offline RL is the distribution shift: what if the trained policy assigns a non-zero probability to a (s, a) pair that is outside the training data?

Most offline RL methods are conservative methods, which try to learn policies staying close to the known distribution of the data. See Levine et al. (2020) for a review. Examples:

Transformers are the new SotA method to transform sequences into sequences. Why not sequences of states into sequences of actions?

The decision transformer (Chen et al., 2021) takes complete offline trajectories as inputs (s, a, r, s…) and predicts autoregressively the next action.

Source: Chen et al. (2021)

However, transformers will mostly shine when used as World models… See Micheli et al. (2022).

Source: Micheli et al. (2022)

Source: Micheli et al. (2022)