Deep Reinforcement Learning


Julien Vitay

Professur für Künstliche Intelligenz - Fakultät für Informatik

1 - Summary of DRL

Overview of deep RL methods


  • Model-free methods (DQN, A3C, DDPG, PPO, SAC) are able to find optimal policies in complex MDPs by just sampling transitions.

  • They suffer however from a high sample complexity, i.e. they need ridiculous amounts of samples to converge.

  • Model-based methods (I2A, Dreamer, MuZero) use learned dynamics to predict the future and plan the consequences of an action.

  • The sample complexity is lower, but learning a good model can be challenging. Inference times can be prohibitive.


Deep RL is still very unstable

  • Depending on initialization, deep RL networks may or may not converge (30% of runs converge to a worse policy than a random agent).

  • Careful optimization such as TRPO / PPO helps, but not completely.

  • You never know if failure is your fault (wrong network, bad hyperparameters, bug), or just bad luck.

Deep RL lacks generalization to different environments

  • As it uses neural networks, deep RL overfits its training data, i.e. the environment it is trained on.

  • If you change anything in the environment dynamics, you need to retrain from scratch.

  • OpenAI Five collects 900 years of game experience per day on Dota 2: it overfits the game, it does not learn how to play.

  • Modify the map a little bit and everything is gone.

  • But see Meta RL - RL^2 later.

Classical methods sometimes still work better

  • Model Predictive Control (MPC) is able to control MuJoCo robots much better than RL through classical optimization techniques (e.g. iterative LQR) while needing far fewer computations.

  • If you have a good physics model, do not use DRL. Reserve it for unknown systems, or when using noisy sensors (images).

  • Evolutionary algorithms (e.g. CMA-ES) sometimes give better results than RL to train policy networks.

You cannot do that with deep RL (yet)

RL libraries

  • keras-rl: many deep RL algorithms implemented directly in keras: DQN, DDQN, DDPG, CEM…

  • OpenAI Baselines from OpenAI: A2C, ACER, ACKTR, DDPG, DQN, PPO, TRPO… Not maintained.

  • Stable baselines from Inria Flowers, a clean rewrite of OpenAI baselines including SAC and TD3.

  • chainer-rl implemented in Chainer: A3C, ACER, DQN, DDPG, PGT, PCL, PPO, TRPO.

  • MushroomRL is a very modular library based on PyTorch that implements DQN and variants, DDPG, SAC, TD3, TRPO, PPO.

  • Tensorforce, implemented in TensorFlow: DQN and variants, A3C, DDPG, TRPO, PPO.

  • Tensorflow Agents is officially supported by tensorflow: DQN, A3C, DDPG, TD3, PPO, SAC.

  • Coach from Intel Nervana also provides many state-of-the-art algorithms.


  • rllib is part of the more global ML framework Ray, which also includes Tune for hyperparameter optimization.

It has implementations in both tensorflow and Pytorch.

All major model-free algorithms are implemented (DQN, Rainbow, A3C, DDPG, TD3, PPO, SAC), including distributed variants (Ape-X, IMPALA), but also model-based algorithms (Dreamer!).

  • tianshou is a recent addition to the family. The implementation is based on pytorch and is very modular. Allows for efficient distributed RL.

Algos: DQN+/DDPG/PPO/SAC, imitation learning, offline RL…

2 - Inverse RL - learning the reward function

RL maximizes the reward function you give it

  • RL is an optimization method: it maximizes the reward function that you provide it.

  • If you do not design the reward function correctly, the agent may not do what you expect.

  • In the CoastRunners game, turbos provide small rewards but respawn very fast: collecting them repeatedly yields more reward than trying to finish the race.

Reward functions need careful engineering

  • Defining the reward function that does what you want becomes an art.

  • RL algorithms work better with dense rewards than sparse ones. It is tempting to introduce intermediary rewards.

  • You end up covering so many special cases that it becomes unusable:

    • Go as fast as you can but not in a curve, except if you are on a closed circuit but not if it rains…
  • In the DeepMind Lego stacking paper, it was perhaps harder to define the reward function than to implement DDPG.

Inverse Reinforcement Learning

  • The goal of inverse RL is to learn from demonstrations (e.g. from humans) which reward function is maximized.

  • This is not imitation learning, where you try to learn and reproduce actions.

  • The goal is to find a parametrized representation of the reward function:

\hat{r}(s) = \sum_{i=1}^K w_i \, \varphi_i(s)

  • When the reward function has been learned, you can train an RL algorithm to find the optimal policy.
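
The linear reward model above can be sketched in a few lines. This is only an illustration: the feature map `features` and the weight values are hypothetical, and the actual inverse RL step (fitting the weights to demonstrations) is not shown.

```python
import numpy as np

def features(state):
    """Hypothetical feature map phi(s) for a 2D state (x, v), with K = 3 features."""
    x, v = state
    return np.array([x, v, x * v])

def estimated_reward(state, w):
    """Linear reward model: r_hat(s) = sum_i w_i * phi_i(s)."""
    return float(np.dot(w, features(state)))

# Arbitrary weights standing in for what inverse RL would have to learn.
w = np.array([1.0, -0.5, 0.0])

r = estimated_reward((2.0, 4.0), w)  # 1.0 * 2 - 0.5 * 4 + 0.0 * 8
```

Once the weights are fixed, `estimated_reward` can replace the handmade reward function in any RL algorithm.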

3 - Intrinsic motivation and curiosity

Intrinsic motivation and curiosity

  • One fundamental problem of RL is its dependence on the reward function.
  • When rewards are sparse, the agent does not learn much (but see successor representations) unless its random exploration policy makes it discover rewards.

  • The reward function is handmade, which is difficult in realistic complex problems.

  • Human learning does not (only) rely on maximizing rewards or achieving goals.

  • Infants in particular discover the world by playing, i.e. interacting with the environment out of curiosity.

    • What happens if I do that? Oh, that’s fun.
  • This is called intrinsic motivation: we are motivated by understanding the world, not only by getting rewards.

  • Rewards are internally generated.

Intrinsic motivation and curiosity

  • What is intrinsically rewarding / motivating / fun? Mostly what has unexpected consequences.

    • If you can predict what is going to happen, it becomes boring.

    • If you cannot predict, you can become curious and try to explore that action.

  • The intrinsic reward (IR) of an action is defined as the sensory prediction error:

\text{IR}(s_t, a_t, s_{t+1}) = || f(s_t, a_t) - s_{t+1}||

where f(s_t, a_t) is a forward model predicting the sensory consequences of an action.

  • An agent maximizing the IR will tend to visit unknown / poorly predicted states (exploration).
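
A minimal sketch of this prediction-error reward, assuming a linear forward model and toy additive dynamics (both are assumptions made for the example, not part of the method):

```python
import numpy as np

def forward_model(state, action, W):
    """Hypothetical linear forward model f(s_t, a_t) = W [s_t; a_t]."""
    return W @ np.concatenate([state, action])

def intrinsic_reward(state, action, next_state, W):
    """IR(s_t, a_t, s_{t+1}) = || f(s_t, a_t) - s_{t+1} ||."""
    prediction = forward_model(state, action, W)
    return float(np.linalg.norm(prediction - next_state))

# Toy dynamics assumed for illustration: s_{t+1} = s_t + a_t.
# This W reproduces those dynamics exactly, i.e. a perfectly trained forward model.
W = np.hstack([np.eye(2), np.eye(2)])
s = np.array([1.0, 2.0])
a = np.array([0.5, -1.0])

ir_expected = intrinsic_reward(s, a, s + a, W)                         # well-predicted transition
ir_surprise = intrinsic_reward(s, a, s + a + np.array([3.0, 4.0]), W)  # unexpected outcome
```

The well-predicted transition gets no intrinsic reward, while the unexpected one gets a reward equal to the size of the prediction error.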

Intrinsic motivation and curiosity

  • Is it a good idea to predict frames directly?

  • Frames are high-dimensional and there will always be a remaining error.

  • Moreover, they can be noisy and unpredictable, without being particularly interesting.

  • What can we do? As usual, predict in a latent space!

Intrinsic curiosity module (ICM)

  • The intrinsic curiosity module (ICM) learns to provide an intrinsic reward for a transition (s_t, a_t, s_{t+1}) by comparing the predicted latent representation \hat{\phi}(s_{t+1}) (using a forward model) to its “true” latent representation \phi(s_{t+1}).

  • The feature representation \phi(s_t) is trained using an inverse model predicting the action leading from s_t to s_{t+1}.
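
The two models of the ICM can be sketched as follows. All weights are random and untrained, the layer sizes are arbitrary, and the squared error with a factor 1/2 is an assumption about how the two latent representations are compared; the sketch only shows which quantity each model produces.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, FEATURE_DIM, N_ACTIONS = 8, 4, 3

# Randomly initialised, untrained parameters (illustration only).
W_enc = rng.normal(size=(FEATURE_DIM, STATE_DIM)) * 0.1
W_fwd = rng.normal(size=(FEATURE_DIM, FEATURE_DIM + N_ACTIONS)) * 0.1
W_inv = rng.normal(size=(N_ACTIONS, 2 * FEATURE_DIM)) * 0.1

def one_hot(action):
    v = np.zeros(N_ACTIONS)
    v[action] = 1.0
    return v

def encode(state):
    """Feature representation phi(s)."""
    return np.tanh(W_enc @ state)

def icm(state, action, next_state):
    phi, phi_next = encode(state), encode(next_state)
    # Forward model: predict phi(s_{t+1}) from phi(s_t) and a_t.
    phi_next_pred = W_fwd @ np.concatenate([phi, one_hot(action)])
    intrinsic_reward = 0.5 * np.sum((phi_next_pred - phi_next) ** 2)
    # Inverse model: predict a_t from phi(s_t) and phi(s_{t+1}).
    logits = W_inv @ np.concatenate([phi, phi_next])
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    inverse_loss = -log_probs[action]  # cross-entropy that would be used to train phi
    return float(intrinsic_reward), float(inverse_loss)

r_int, loss_inv = icm(rng.normal(size=STATE_DIM), 1, rng.normal(size=STATE_DIM))
```

Because the features are shaped by the inverse model, they only need to encode what is relevant for predicting the action, not every pixel of the frame.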


4 - Hierarchical RL - learning different action levels

Hierarchical RL - learning different action levels

  • In all previous RL methods, the action space is fixed.

  • When you read a recipe, the actions are “Cut carrots”, “Boil water”, etc.

  • But how do you perform these high-level actions? Break them into subtasks iteratively until you arrive at muscle activations.

  • But it is not possible to learn to cook a boeuf bourguignon using muscle activations as actions.


Meta-Learning Shared Hierarchies

  • Sub-policies (options) can be trained to solve simple tasks (going left, right, etc).

  • A meta-learner or controller then learns to call each sub-policy when needed, at a much lower frequency.
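
The two time scales can be illustrated with a toy example on a 1D line. Everything here is hypothetical: the sub-policies and the controller are hand-written stand-ins for what would be learned, and `controller_period` is an arbitrary choice.

```python
# Hypothetical sub-policies on a 1D line: each maps a position to a primitive move.
def go_left(position):
    return -1

def go_right(position):
    return +1

SUB_POLICIES = [go_left, go_right]

def controller(position, goal):
    """Hand-written stand-in for the learned controller: returns a sub-policy index."""
    return 1 if position < goal else 0

def run_episode(start, goal, n_steps=10, controller_period=5):
    position, active, n_calls = start, 0, 0
    for t in range(n_steps):
        if t % controller_period == 0:
            # The controller only acts every `controller_period` steps.
            active = controller(position, goal)
            n_calls += 1
        # The selected sub-policy acts at every step.
        position += SUB_POLICIES[active](position)
    return position, n_calls

final_position, n_controller_calls = run_episode(start=0, goal=10)
```

The controller takes 2 decisions for 10 primitive actions: its own decision problem is much shorter than the original one.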

Hierarchical Reinforcement Learning

5 - Meta Reinforcement learning - RL^2

Meta RL: Learning to learn

  • Meta learning is the ability to reuse skills acquired on a set of tasks to quickly acquire new (similar) ones (generalization).

Meta RL: Learning to learn

  • Meta RL is based on the idea of fast and slow learning:

    • Slow learning is the adaptation of weights in the NN.

    • Fast learning is the adaptation to changes in the environment.

  • A simple strategy developed concurrently by (Wang et al. 2016) and (Duan et al. 2016) is to have a model-free algorithm (e.g. A3C) integrate with an LSTM layer not only the current state s_t, but also the previous action a_{t-1} and reward r_t.

  • The policy of the agent becomes memory-guided: it selects an action depending on what it did before, not only the state.
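
A sketch of the augmented input, assuming discrete actions encoded as one-hot vectors. To keep it short, a plain tanh recurrent cell stands in for the LSTM, and all weights are random and untrained.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS, HIDDEN_DIM = 4, 2, 8
INPUT_DIM = STATE_DIM + N_ACTIONS + 1

# Untrained parameters; a plain tanh recurrent cell stands in for the LSTM.
W_in = rng.normal(size=(HIDDEN_DIM, INPUT_DIM)) * 0.1
W_rec = rng.normal(size=(HIDDEN_DIM, HIDDEN_DIM)) * 0.1
W_pi = rng.normal(size=(N_ACTIONS, HIDDEN_DIM)) * 0.1

def policy_step(state, prev_action, reward, hidden):
    """One step of a memory-guided policy: the input is [s_t, one-hot a_{t-1}, r_t]."""
    prev_action_one_hot = np.eye(N_ACTIONS)[prev_action]
    x = np.concatenate([state, prev_action_one_hot, [reward]])
    hidden = np.tanh(W_in @ x + W_rec @ hidden)
    logits = W_pi @ hidden
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs, hidden

state = rng.normal(size=STATE_DIM)
h0 = np.zeros(HIDDEN_DIM)

# Same state, different histories: the action probabilities differ.
probs_a, h1 = policy_step(state, prev_action=0, reward=1.0, hidden=h0)
probs_b, _ = policy_step(state, prev_action=1, reward=0.0, hidden=h0)
```

The returned hidden state is passed to the next call, so the information about past actions and rewards accumulates over the episode.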

Meta RL: Learning to learn

  • The algorithm is trained on a set of similar MDPs:

    1. Select an MDP \mathcal{M}.

    2. Reset the internal state of the LSTM.

    3. Sample trajectories and adapt the weights.

    4. Repeat 1, 2 and 3.

Meta RL: Learning to learn

  • The meta RL agent can be trained on a multitude of 2-armed bandits, whose two arms give a reward of 1 with probability p and 1-p, respectively.

  • Left is a classical bandit algorithm, right is the meta bandit:


  • The meta bandit has learned that the best strategy for any 2-armed bandit is to sample both actions randomly at the beginning and then stick to the best one.

  • The meta bandit does not learn to solve each problem, it learns how to solve them.
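
To make the task distribution and the resulting strategy concrete, here is a hand-coded version of "sample both arms, then stick to the best one". It is not the learned meta bandit: the exploration length `n_explore`, the episode length and the uniform distribution of p are arbitrary assumptions.

```python
import numpy as np

def sample_bandit(rng):
    """Sample a task: arm 0 pays 1 with probability p, arm 1 with probability 1 - p."""
    p = rng.uniform()
    return np.array([p, 1.0 - p])

def explore_then_commit(arm_probs, rng, n_explore=20, n_steps=100):
    """Hand-coded version of the strategy described above (not the learned meta bandit)."""
    counts, sums, total, best = np.zeros(2), np.zeros(2), 0.0, None
    for t in range(n_steps):
        if t < n_explore:
            arm = int(rng.integers(2))  # sample both arms randomly
        else:
            if best is None:
                # Commit once to the arm with the highest observed mean reward.
                best = int(np.argmax(sums / np.maximum(counts, 1)))
            arm = best
        reward = float(rng.uniform() < arm_probs[arm])
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total / n_steps

rng = np.random.default_rng(0)
returns = [explore_then_commit(sample_bandit(rng), rng) for _ in range(500)]
mean_reward = float(np.mean(returns))  # a uniformly random policy would get 0.5 on average
```

The same strategy is applied unchanged to every sampled bandit: only the observations within an episode differ, which is what the recurrent state of the meta RL agent has to capture.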

Model-Based Meta-Reinforcement Learning for Flight with Suspended Payloads


6 - Offline RL

Offline RL

  • Even off-policy algorithms need to interact with the environment: the behavior policy is \epsilon-soft around the learned policy.

  • Is it possible to learn purely offline from transitions recorded with another policy (e.g. experts)? This would improve data efficiency.

  • This would bring safety: the agent would not explore dangerous actions.



Behavioral cloning

  • As no exploration is allowed, the model is limited by the quality of the data: if the acquisition policy is random, there is not much to hope for.

  • If we already have a good policy, but it is slow or expensive to compute, we could try to transfer it to a fast neural network.

  • If the policy is a human expert, it is called learning from demonstrations (LfD) or imitation learning.

  • The simplest approach to offline RL is behavioral cloning: simply supervised learning of (s, a) pairs…
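
A minimal sketch of behavioral cloning as plain supervised learning. The expert rule, the dataset and the logistic-regression policy are all hypothetical choices made to keep the example short; a real application would use a neural network on recorded expert data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expert dataset of (s, a) pairs: the expert picks action 1 when s[0] + s[1] > 0.
states = rng.normal(size=(500, 2))
actions = (states.sum(axis=1) > 0).astype(float)

# Policy pi(a=1 | s) = sigmoid(w . s + b), trained by supervised learning (cross-entropy).
w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(200):
    probs = 1.0 / (1.0 + np.exp(-(states @ w + b)))
    grad = probs - actions  # gradient of the cross-entropy w.r.t. the logits
    w -= lr * states.T @ grad / len(states)
    b -= lr * grad.mean()

def cloned_policy(state):
    """Greedy action of the cloned policy."""
    return int(state @ w + b > 0)

accuracy = float(np.mean([cloned_policy(s) == a for s, a in zip(states, actions)]))
```

No reward and no environment interaction are involved: the cloned policy can only be as good as the expert on the states that the dataset covers.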

Dave2 : NVIDIA’s self-driving car

Distribution shift

  • The main problem in offline RL is the distribution shift: what if the trained policy assigns a non-zero probability to a (s, a) pair that is outside the training data?

  • Most offline RL methods are conservative methods, which try to learn policies staying close to the known distribution of the data. Examples:

    • Batch-Constrained deep Q-learning (model-free), MOReL (model-based)…
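
The effect of distribution shift can be shown on a tiny tabular problem. The sketch below is a simplification inspired by the batch-constrained idea, not the Batch-Constrained deep Q-learning algorithm itself: the dataset, the optimistic initial value and the masking rule are assumptions chosen for illustration.

```python
import numpy as np

N_STATES, N_ACTIONS, GAMMA, LR = 3, 2, 0.9, 0.5

# Hypothetical offline dataset of (s, a, r, s', done) transitions.
# Action 1 is never tried in state 1, so Q[1, 1] is never updated.
dataset = [
    (0, 0, 0.0, 1, False),
    (0, 1, 0.0, 1, False),
    (1, 0, 1.0, 2, True),
]

def offline_q_learning(dataset, q_init, constrained, n_sweeps=100):
    q = q_init.copy()
    seen = np.zeros((N_STATES, N_ACTIONS), dtype=bool)
    for s, a, _, _, _ in dataset:
        seen[s, a] = True
    for _ in range(n_sweeps):
        for s, a, r, s_next, done in dataset:
            if done:
                target = r
            else:
                # Constrained: only bootstrap from actions present in the data for s_next.
                mask = seen[s_next] if constrained else np.ones(N_ACTIONS, dtype=bool)
                target = r + GAMMA * q[s_next][mask].max()
            q[s, a] += LR * (target - q[s, a])
    return q

# An over-optimistic initial value for the unseen pair (s=1, a=1).
q_init = np.zeros((N_STATES, N_ACTIONS))
q_init[1, 1] = 10.0

q_naive = offline_q_learning(dataset, q_init, constrained=False)
q_safe = offline_q_learning(dataset, q_init, constrained=True)
```

Without the constraint, the erroneous value of the unseen pair propagates to state 0 and can never be corrected, as no new data is collected. With the constraint, the values only depend on (s, a) pairs that were actually observed.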


Decision transformer

  • Transformers are the new SotA method to transform sequences into sequences.

  • Why not sequences of states into sequences of actions?

  • The decision transformer takes complete offline trajectories as inputs (s, a, r, s…) and predicts autoregressively the next action.
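
A sketch of how an offline trajectory can be turned into a token sequence. It follows the formulation of the original Decision Transformer paper, where the reward at each step is replaced by the return-to-go (the sum of future rewards), a detail not stated above. The symbolic states and actions are placeholders and the transformer itself is not shown.

```python
import numpy as np

def returns_to_go(rewards):
    """R_t = r_t + r_{t+1} + ... + r_T (undiscounted sum of future rewards)."""
    return np.cumsum(rewards[::-1])[::-1]

def build_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) tokens for each time step."""
    rtg = returns_to_go(np.asarray(rewards, dtype=float))
    tokens = []
    for R, s, a in zip(rtg, states, actions):
        tokens += [("return", float(R)), ("state", s), ("action", a)]
    return tokens

# Hypothetical 3-step trajectory with symbolic states and actions.
tokens = build_sequence(states=["s0", "s1", "s2"],
                        actions=["a0", "a1", "a2"],
                        rewards=[0.0, 1.0, 2.0])
# At training time, the model predicts each action token from all tokens before it.
```

At test time, the first return token is set to the desired return, and the actions are generated autoregressively.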

Transformers as World models


