Deep Reinforcement Learning
Abstract
This website contains the materials for the module Deep Reinforcement Learning (573140).
Information
Register on OPAL to receive updates by email.
Lectures
Below you will find the links to the slides for each lecture (html and pdf).
1 - Introduction
2 - Tabular RL
| Lecture | Slides |
| --- | --- |
| **2.1 - Sampling and Bandits.** n-armed bandits, the simplest RL setting, which can be solved by sampling. | html, pdf |
| **2.2 - Markov Decision Processes and Dynamic Programming.** MDPs are the basic RL framework. The value functions and the Bellman equations fully characterize an MDP. Dynamic programming is a model-based method that solves the Bellman equations iteratively (the core equations are summarized below the table). | html, pdf |
| **2.3 - Monte Carlo control.** Monte Carlo control estimates value functions by sampling complete episodes and infers the optimal policy through action selection, either on- or off-policy. | html, pdf |
| **2.4 - Temporal Difference.** TD algorithms learn value functions from single transitions. Q-learning is the well-known off-policy variant. | html, pdf |
| **2.5 - Function Approximation.** Value functions can be approximated by any function approximator, allowing RL to be applied to continuous state or action spaces. | html, pdf |
| **2.6 - Deep Neural Networks.** Quick overview of the main neural network architectures needed for the rest of the course. | html, pdf |
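As a compact reference for lectures 2.2 and 2.4, the Bellman optimality equation and the resulting Q-learning update are written out below in standard textbook notation (a generic formulation, not copied from the slides):

```latex
\begin{align*}
% Bellman optimality equation for the action-value function (lecture 2.2)
Q^*(s, a) &= \sum_{s'} p(s' \mid s, a)\,\Big[ r(s, a, s') + \gamma \max_{a'} Q^*(s', a') \Big] \\
% Q-learning update from a single transition (s, a, r, s') (lecture 2.4)
Q(s, a) &\leftarrow Q(s, a) + \alpha\,\Big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big]
\end{align*}
```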
3 - Model-free RL
| Lecture | Slides |
| --- | --- |
| **3.1 - DQN: Deep Q-Network.** DQN (Mnih et al., 2013) was the first successful application of deep networks to the RL problem. It was applied to Atari video games and sparked the interest in deep RL methods. | html, pdf |
| **3.2 - Beyond DQN.** Various extensions of the DQN algorithm have been proposed in the following years: distributional learning, parameter noise, distributed learning, and recurrent architectures. | html, pdf |
| **3.3 - PG: Policy Gradient.** Policy gradient methods learn the policy directly, without requiring action selection over value functions. | html, pdf |
| **3.4 - A3C: Asynchronous Advantage Actor-Critic.** A3C (Mnih et al., 2016) is an actor-critic architecture estimating the policy gradient from multiple parallel workers. | html, pdf |
| **3.5 - DDPG: Deep Deterministic Policy Gradient.** DDPG (Lillicrap et al., 2015) is an off-policy actor-critic architecture particularly suited to continuous control problems such as robotics. | html, pdf |
| **3.6 - PPO: Proximal Policy Optimization.** PPO (Schulman et al., 2017) allows stable learning by estimating trust regions for the policy updates (a small sketch of its clipped objective follows the table). | html, pdf |
| **3.7 - SAC: Soft Actor-Critic.** Maximum Entropy RL modifies the RL objective by learning optimal policies that also explore the environment as much as possible. SAC (Haarnoja et al., 2018) is an off-policy actor-critic architecture for soft RL. | html, pdf |
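To make the trust-region idea of lecture 3.6 concrete, here is a small numpy sketch of the clipped surrogate objective that PPO maximizes. The function name and argument layout are illustrative, not taken from the course material.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective of PPO (Schulman et al., 2017).

    logp_new, logp_old: log-probabilities of the sampled actions under the
    current and the old policy; advantages: estimated advantages A(s, a).
    """
    ratio = np.exp(logp_new - logp_old)             # importance sampling ratio
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)  # stay close to the old policy
    # Taking the minimum of the unclipped and clipped surrogates removes the
    # incentive to move the policy too far away in a single update.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```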
4 - Model-based RL
| Lecture | Slides |
| --- | --- |
| **4.1 - Model-based RL.** Two main paradigms in model-based RL: model-based augmentation of model-free learning (Dyna architectures) and planning (model predictive control, MPC). A minimal planning sketch follows the table. | html, pdf |
| **4.2 - Learned world models.** Learning a world model from data is much easier than learning the optimal policy, as it is just supervised learning. Modern model-based algorithms (TDM, World Models, PlaNet, Dreamer) exploit this property to reduce the sample complexity. | html, pdf |
| **4.3 - AlphaGo.** AlphaGo surprised the world in 2016 by beating Lee Sedol, the world champion of Go. It combines model-free learning through policy gradient and self-play with model-based planning using MCTS (Monte Carlo Tree Search). | html, pdf |
| **4.4 - Successor representations.** Successor representations provide a trade-off between model-free and model-based learning. | html, pdf |
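As an illustration of the planning side of lecture 4.1, here is a minimal random-shooting MPC sketch in numpy. The `model(state, action) -> (next_state, reward)` interface is a hypothetical stand-in for a learned world model, not the one used in the lectures.

```python
import numpy as np

def random_shooting_mpc(model, state, n_actions, n_candidates=500, horizon=10, gamma=0.99):
    """Return the first action of the best random action sequence under a learned model.

    `model(state, action)` is assumed to return (next_state, reward):
    a hypothetical interface, for illustration only.
    """
    best_return, best_first_action = -np.inf, 0
    for _ in range(n_candidates):
        plan = np.random.randint(n_actions, size=horizon)  # random candidate plan
        s, ret = state, 0.0
        for t, a in enumerate(plan):
            s, r = model(s, a)                             # imagine one step ahead
            ret += (gamma ** t) * r
        if ret > best_return:
            best_return, best_first_action = ret, plan[0]
    # MPC executes only this first action and replans at the next time step.
    return best_first_action
```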
5 - Outlook
Exercises
Below you will find links to download the notebooks for the exercises (which you have to fill in) and their solutions (which you can look at after you have finished the exercise). It is recommended not to look at the solution while doing the exercise unless you are stuck. Alternatively, you can run the notebooks directly on Colab (https://colab.research.google.com/) if you have a Google account.
For instructions on how to install a Python distribution on your computer, check this page.
| Exercise | Notebook | Solution |
| --- | --- | --- |
| **1 - Introduction to Python.** Introduction to the Python programming language. Optional for students who already know Python. | ipynb, colab | ipynb, colab |
| **2 - Numpy and Matplotlib.** Presentation of the numpy library for numerical computations and matplotlib for visualization. Also optional for students already familiar with them. | ipynb, colab | ipynb, colab |
| **3 - Sampling.** Simple exercise to investigate random sampling and its properties. | ipynb, colab | ipynb, colab |
| **4 - Bandits.** Implementation of various action selection methods for the n-armed bandit. | ipynb, colab | ipynb, colab |
| **5 - Bandits (part 2).** Advanced bandit methods. | ipynb, colab | ipynb, colab |
| **6 - Dynamic programming.** Calculation of the Bellman equations for the recycling robot and application of policy iteration and value iteration. | ipynb, colab | ipynb, colab |
| **7 - Gym environments.** Introduction to the gym(nasium) RL environments (a minimal interaction loop is sketched after this table). | ipynb, colab | ipynb, colab |
| **8 - Monte Carlo control.** Study of on-policy Monte Carlo control on the Taxi environment. | ipynb, colab | ipynb, colab |
| **9 - Temporal Difference, Q-learning.** Q-learning on the Taxi environment. | ipynb, colab | ipynb, colab |
| **10 - Eligibility traces.** Investigation of eligibility traces for Q-learning in a gridworld environment. | ipynb, colab | ipynb, colab |
| **11 - Keras.** Quick tutorial for Keras. | ipynb, colab | ipynb, colab |
| **12 - DQN.** Implementation of the DQN algorithm for Cartpole from scratch. | ipynb, colab | ipynb, colab |
| **13 - PPO.** DQN and PPO on Cartpole using the tianshou library. | ipynb, colab | ipynb, colab |
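As a small preview of the environment interface used from exercise 7 onwards, here is a minimal random-agent loop written against the gymnasium API (the random policy is just a placeholder; CartPole is the environment used later for DQN).

```python
import gymnasium as gym

# Minimal interaction loop with a gymnasium environment (exercise 7 onwards).
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)

done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()  # random policy as a placeholder
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Episode return with a random policy: {total_reward}")
env.close()
```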
Recommended readings
- Richard Sutton and Andrew Barto (2017). Reinforcement Learning: An Introduction. MIT Press.
http://incompleteideas.net/book/the-book-2nd.html
- CS294 course by Sergey Levine at Berkeley.
http://rll.berkeley.edu/deeprlcourse/
- Reinforcement Learning course by David Silver at UCL.