Deep Reinforcement Learning

Abstract

This website contains the materials for the module Deep Reinforcement Learning (573140).

Lectures

The slides for each lecture are linked below (html and pdf). A playlist of older, partly outdated videos is also available on YouTube.

1 - Introduction

	Slides
1.1 - Introduction Introduction to the main concepts of reinforcement learning and an overview of current applications.	html, pdf

2 - Tabular RL

	Slides
2.1 - Sampling and Bandits n-armed bandits, the simplest RL setting that can be solved by sampling.	html, pdf
2.2 - Markov Decision Processes and Dynamic Programming MDPs are the basic RL framework. The value functions and the Bellman equations fully characterize an MDP. Dynamic programming is a model-based method that solves the Bellman equations iteratively.	html, pdf
2.3 - Monte Carlo control Monte Carlo control estimates value functions through sampling of complete episodes and infers the optimal policy using action selection, either on- or off-policy.	html, pdf
2.4 - Temporal Difference TD algorithms allow the learning of value functions using single transitions. Q-learning is the famous off-policy variant.	html, pdf
2.5 - Function Approximation Value functions can be approximated by any function approximator, which makes it possible to apply RL to continuous state or action spaces.	html, pdf
2.6 - Deep Neural Networks Quick overview of the main neural network architectures needed for the rest of the course.	html, pdf
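
As an illustration of the dynamic programming idea in lecture 2.2, here is a minimal sketch of value iteration on a hypothetical two-state MDP. The transition table `P`, the discount factor `gamma`, the tolerance `tol`, and the function names are assumptions chosen for this example and are not taken from the course notebooks.

```python
# Hypothetical MDP: P[state][action] = list of (probability, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}


def q_value(P, V, s, a, gamma):
    """Expected return of taking action a in state s, then following V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma=0.9, tol=1e-8):
    """Apply the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(q_value(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place update
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}
    return V, policy


V, policy = value_iteration(P)
print(V, policy)
```

In this toy problem, staying in state 1 yields a reward of 2 at every step, so the greedy policy moves from state 0 to state 1 and then stays there.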

3 - Model-free RL

	Slides
3.1 - DQN: Deep Q-Network DQN (Mnih et al., 2013) was the first successful application of deep networks to the RL problem. It was applied to Atari video games and sparked the interest in deep RL methods.	html, pdf
3.2 - Beyond DQN Various extensions to the DQN algorithm were proposed in the years that followed: distributional learning, parameter noise, distributed learning or recurrent architectures.	html, pdf
3.3 - PG: Policy Gradient Policy gradient methods learn the policy directly, without requiring action selection over value functions.	html, pdf
3.4 - A3C: Asynchronous Advantage Actor-Critic A3C (Mnih et al., 2016) is an actor-critic architecture estimating the policy gradient from multiple parallel workers.	html, pdf
3.5 - DDPG: Deep Deterministic Policy Gradient DDPG (Lillicrap et al.) is an off-policy actor-critic architecture particularly suited for continuous control problems such as robotics.	html, pdf
3.6 - PPO: Proximal Policy Optimization PPO (Schulman et al., 2017) allows stable learning by estimating trust regions for the policy updates.	html, pdf
3.7 - SAC: Soft Actor-Critic Maximum Entropy RL modifies the RL objective by learning optimal policies that also explore the environment as much as possible. SAC (Haarnoja et al., 2018) is an off-policy actor-critic architecture for soft RL.	html, pdf
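
To make the idea of restricted policy updates in lecture 3.6 more concrete, the sketch below computes the clipped surrogate objective commonly associated with PPO (Schulman et al., 2017). It works on plain Python lists rather than tensors; the function names and the clipping range `eps=0.2` are assumptions made for illustration, not the course's implementation.

```python
import math


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Clipped objective for one sample (a quantity to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s); advantage is the estimated advantage.
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum removes any incentive to push the ratio outside
    # [1 - eps, 1 + eps] in the direction that would increase the objective.
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_objective(logp_new, logp_old, advantages, eps=0.2):
    """Average clipped objective over a batch of log-probabilities."""
    values = [
        clipped_surrogate(math.exp(ln - lo), adv, eps)
        for ln, lo, adv in zip(logp_new, logp_old, advantages)
    ]
    return sum(values) / len(values)


# A large ratio with a positive advantage is capped at (1 + eps) * advantage.
print(clipped_surrogate(1.5, 1.0))
# A large ratio with a negative advantage is not capped: the penalty is kept.
print(clipped_surrogate(1.5, -1.0))
```

In a deep RL setting the same expression would be written with tensors and its negative minimized by gradient descent.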

4 - Model-based RL

	Slides
4.1 - Model-based RL Model-based RL uses a world model to simulate the future. Dyna-like architectures use these simulated rollouts to augment model-free (MF) algorithms.	html, pdf
4.2 - Planning with learned World models Learning a world model from data and planning the optimal sequence of actions using model-predictive control is much easier than learning the optimal policy directly. Modern model-based algorithms (World models, PlaNet, Dreamer) make use of this property to reduce the sample complexity.	html, pdf
4.3 - AlphaGo AlphaGo surprised the world in 2016 by beating Lee Sedol, the world champion of Go. It combines model-free learning through policy gradient and self-play with model-based planning using MCTS (Monte Carlo Tree Search).	html, pdf
4.4 - Successor representations (optional) Successor representations provide a trade-off between model-free and model-based learning.	html, pdf
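
The Dyna idea from lecture 4.1 can be sketched in tabular form: every real transition is used both for a Q-learning update and to update a learned model, and the model then generates extra simulated updates. The chain environment, the hyperparameters, and all function names below are hypothetical choices for illustration only.

```python
import random

N_STATES = 5        # hypothetical chain: states 0..4, state 4 is terminal
ACTIONS = (-1, +1)  # move left or move right


def step(s, a):
    """Deterministic toy environment: reward 1 only when reaching the right end."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done


def q_update(Q, s, a, r, s2, done, alpha, gamma):
    """One Q-learning backup on a single transition."""
    target = r if done else r + gamma * max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def dyna_q(episodes=50, planning_steps=10, alpha=0.5, gamma=0.9,
           epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}  # learned world model: (s, a) -> (r, s2, done)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy action selection with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                best = max(Q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if Q[(s, b)] == best])
            s2, r, done = step(s, a)
            q_update(Q, s, a, r, s2, done, alpha, gamma)  # real experience
            model[(s, a)] = (r, s2, done)                 # model learning
            for _ in range(planning_steps):               # simulated experience
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                q_update(Q, ps, pa, pr, ps2, pdone, alpha, gamma)
            s = s2
    return Q


Q = dyna_q()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
print(greedy)
```

Setting `planning_steps=0` reduces the sketch to plain Q-learning, which makes it easy to compare how many real transitions each variant needs.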

5 - Outlook

	Slides
5.1 - Outlook Current RL research investigates many different directions: inverse RL, intrinsic motivation, hierarchical RL, meta RL, offline RL, multi-agent RL (MARL), etc.	html, pdf

Exercises

Below are links to download the notebooks for the exercises (which you have to fill in) and their solutions (which you can look at after you have finished the exercise). It is recommended not to look at the solution while doing the exercise unless you are lost. Alternatively, you can run the notebooks directly on Colab (https://colab.research.google.com/) if you have a Google account.

For instructions on how to install a Python distribution on your computer, check this page.

	Notebook	Solution
1 - Introduction to Python Introduction to the Python programming language. Optional for students who already know Python.	ipynb, colab	ipynb, colab
2 - Numpy and Matplotlib Presentation of the numpy library for numerical computations and matplotlib for visualization. Also optional for students already familiar with them.	ipynb, colab	ipynb, colab
3 - Sampling Simple exercise to investigate random sampling and its properties.	ipynb, colab	ipynb, colab
4 - Bandits Implementation of various action selection methods for the n-armed bandit.	ipynb, colab	ipynb, colab
5 - Bandits (part 2) Advanced bandit methods.	ipynb, colab	ipynb, colab
6 - Dynamic programming Calculation of the Bellman equations for the recycling robot and application of policy iteration and value iteration.	ipynb, colab	ipynb, colab
7 - Gym environments Introduction to the gym(nasium) RL environments.	ipynb, colab	ipynb, colab
8 - Monte Carlo control Study of on-policy Monte Carlo control on the Taxi environment.	ipynb, colab	ipynb, colab
9 - Temporal Difference, Q-learning Q-learning on the Taxi environment.	ipynb, colab	ipynb, colab
10 - Eligibility traces Investigation of eligibility traces for Q-learning in a gridworld environment.	ipynb, colab	ipynb, colab
11 - PyTorch Quick tutorial for PyTorch. In particular, it investigates why correlated inputs are bad for neural networks. The previous version using Keras is available here: Notebook: ipynb, colab, Solution: ipynb, colab.	ipynb, colab	ipynb, colab
12 - DQN Implementation of the DQN algorithm for Cartpole from scratch in PyTorch. The previous version using Keras is available here: Notebook: ipynb, colab, Solution: ipynb, colab.	ipynb, colab	ipynb, colab
13 - PPO DQN and PPO on Cartpole using the tianshou library.	ipynb, colab	ipynb, colab