# Deep Reinforcement Learning

Abstract

This website contains the materials for the module **Deep Reinforcement Learning** (573140).

## Information

Register on OPAL to receive updates by email.

## Lectures

You will find below the links to the slides for each lecture (html and pdf).

#### 1 - Introduction

#### 2 - Tabular RL

| Lecture | Slides |
|---|---|
| **2.1 - Sampling and Bandits.** n-armed bandits are the simplest RL setting and can be solved by sampling. | html, pdf |
| **2.2 - Markov Decision Processes and Dynamic Programming.** MDPs are the basic RL framework. The value functions and the Bellman equations fully characterize an MDP. Dynamic programming is a model-based method that solves the Bellman equations iteratively. | html, pdf |
| **2.3 - Monte Carlo control.** Monte Carlo control estimates value functions by sampling complete episodes and infers the optimal policy using action selection, either on- or off-policy. | html, pdf |
| **2.4 - Temporal Difference.** TD algorithms learn value functions from single transitions. Q-learning is the famous off-policy variant. | html, pdf |
| **2.5 - Function Approximation.** Value functions can be approximated by any function approximator, which allows RL to be applied to continuous state or action spaces. | html, pdf |
| **2.6 - Deep Neural Networks.** Quick overview of the main neural network architectures needed for the rest of the course. | html, pdf |
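
As a rough illustration of the temporal difference idea from lecture 2.4, the sketch below applies the tabular Q-learning update to single transitions. The toy chain environment (`chain_step`), the helper names, and the values of `alpha`, `gamma` and `epsilon` are assumptions made for this example; they are not taken from the course notebooks.

```python
import random


def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update for the transition (s, a, r, s_next)."""
    # Off-policy target: bootstrap with the greedy value of the next state.
    target = r if done else r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]


def epsilon_greedy(Q, s, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q[s]))
    best = max(Q[s])
    # Break ties randomly so that untrained states are explored.
    return rng.choice([a for a, q in enumerate(Q[s]) if q == best])


def chain_step(s, a, n_states):
    """Hypothetical toy chain: action 1 moves right, action 0 moves left.

    Reaching the last state gives a reward of 1 and ends the episode.
    """
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done


def train(n_states=5, episodes=500, epsilon=0.1, seed=0):
    """Run epsilon-greedy Q-learning on the toy chain and return the Q-table."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = epsilon_greedy(Q, s, epsilon, rng)
            s_next, r, done = chain_step(s, a, n_states)
            q_learning_update(Q, s, a, r, s_next, done)
            s = s_next
    return Q
```

Calling `train()` returns a Q-table in which the value of moving right from the state next to the goal approaches 1, since every episode ends with that transition.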

#### 3 - Model-free RL

| Lecture | Slides |
|---|---|
| **3.1 - DQN: Deep Q-Network.** DQN (Mnih et al., 2013) was the first successful application of deep networks to the RL problem. It was applied to Atari video games and sparked the interest in deep RL methods. | html, pdf |
| **3.2 - Beyond DQN.** Various extensions to the DQN algorithm have been proposed in the following years: distributional learning, parameter noise, distributed learning or recurrent architectures. | html, pdf |
| **3.3 - PG: Policy Gradient.** Policy gradient methods learn the policy directly, without requiring action selection over value functions. | html, pdf |
| **3.4 - A3C: Asynchronous Advantage Actor-Critic.** A3C (Mnih et al., 2016) is an actor-critic architecture estimating the policy gradient from multiple parallel workers. | html, pdf |
| **3.5 - DDPG: Deep Deterministic Policy Gradient.** DDPG (Lillicrap et al.) is an off-policy actor-critic architecture particularly suited for continuous control problems such as robotics. | html, pdf |
| **3.6 - PPO: Proximal Policy Optimization.** PPO (Schulman et al., 2017) allows stable learning by estimating trust regions for the policy updates. | html, pdf |
| **3.7 - SAC: Soft Actor-Critic.** Maximum Entropy RL modifies the RL objective by learning optimal policies that also explore the environment as much as possible. SAC (Haarnoja et al., 2018) is an off-policy actor-critic architecture for soft RL. | html, pdf |
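
To make the idea of constrained policy updates in lecture 3.6 more concrete, here is a minimal sketch of the clipped surrogate objective, a commonly used PPO variant that approximates a trust region by clipping the probability ratio between the new and the old policy. The function name and the default `clip_eps=0.2` are assumptions for illustration, and the ratios and advantages are taken as given rather than computed from a network.

```python
def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized).

    ratios:     pi_new(a|s) / pi_old(a|s) for each sampled transition
    advantages: advantage estimates for the same transitions
    """
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        # Restrict the ratio to the interval [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the pessimistic value, so that moving the ratio beyond the
        # clipping range brings no further improvement of the objective.
        total += min(ratio * adv, clipped * adv)
    return total / len(ratios)
```

For example, a ratio of 1.5 with an advantage of 1.0 contributes only 1.2 to the objective, so there is no incentive to push the policy further in that direction.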

#### 4 - Model-based RL

| Lecture | Slides |
|---|---|
| **4.1 - Model-based RL.** Two main paradigms in model-based RL: model-based augmentation of model-free learning (Dyna architectures) and planning (model predictive control, MPC). | html, pdf |
| **4.2 - Learned World models.** Learning a world model from data is much easier than learning the optimal policy, as it is just supervised learning. Modern model-based algorithms (TDM, World models, PlaNet, Dreamer) make use of this property to reduce the sample complexity. | html, pdf |
| **4.3 - AlphaGo.** AlphaGo surprised the world in 2016 by beating Lee Sedol, the world champion of Go. It combines model-free learning through policy gradient and self-play with model-based planning using MCTS (Monte Carlo Tree Search). | html, pdf |
| **4.4 - Successor representations.** Successor representations provide a trade-off between model-free and model-based learning. | html, pdf |
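
The Dyna idea from lecture 4.1, where a learned model augments model-free learning, can be sketched with a tabular Dyna-Q step. This is a simplified illustration under stated assumptions: the environment is deterministic, the model is a plain dictionary, terminal states keep a value of zero because they are never updated, and the function name and default parameters are hypothetical.

```python
import random


def dyna_q_step(Q, model, s, a, r, s_next, n_planning=5,
                alpha=0.1, gamma=0.9, rng=random):
    """One Dyna-Q step: direct RL update, model learning, then planning."""
    # (1) Model-free Q-learning update from the real transition.
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
    # (2) Model learning: remember the outcome (deterministic world assumed).
    model[(s, a)] = (r, s_next)
    # (3) Planning: replay transitions simulated by the learned model.
    for _ in range(n_planning):
        (ps, pa), (pr, p_next) = rng.choice(list(model.items()))
        Q[ps][pa] += alpha * (pr + gamma * max(Q[p_next]) - Q[ps][pa])
```

Each real transition is reused several times through the model, which is how this family of methods aims to reduce the number of interactions with the environment.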

#### 5 - Outlook

## Exercises

You will find below links to download the notebooks for the exercises (which you have to fill in) and their solutions (which you can look at after you have finished the exercise). It is recommended not to look at the solution while doing the exercise unless you are stuck. Alternatively, you can run the notebooks directly on Colab (https://colab.research.google.com/) if you have a Google account.

For instructions on how to install a Python distribution on your computer, check this page.

| Exercise | Notebook | Solution |
|---|---|---|
| **1 - Introduction to Python.** Introduction to the Python programming language. Optional for students who already know Python. | ipynb, colab | ipynb, colab |
| **2 - Numpy and Matplotlib.** Presentation of the numpy library for numerical computations and matplotlib for visualization. Also optional for students already familiar with them. | ipynb, colab | ipynb, colab |
| **3 - Sampling.** Simple exercise to investigate random sampling and its properties. | ipynb, colab | ipynb, colab |
| **4 - Bandits.** Implementation of various action selection methods for the n-armed bandit. | ipynb, colab | ipynb, colab |
| **5 - Bandits (part 2).** Advanced bandit methods. | ipynb, colab | ipynb, colab |
| **6 - Dynamic programming.** Calculation of the Bellman equations for the recycling robot and application of policy iteration and value iteration. | ipynb, colab | ipynb, colab |
| **7 - Gym environments.** Introduction to the gym(nasium) RL environments. | ipynb, colab | ipynb, colab |
| **8 - Monte Carlo control.** Study of on-policy Monte Carlo control on the Taxi environment. | ipynb, colab | ipynb, colab |
| **9 - Temporal Difference, Q-learning.** Q-learning on the Taxi environment. | ipynb, colab | ipynb, colab |
| **10 - Eligibility traces.** Investigation of eligibility traces for Q-learning in a gridworld environment. | ipynb, colab | ipynb, colab |
| **11 - Keras.** Quick tutorial for Keras. | ipynb, colab | ipynb, colab |
| **12 - DQN.** Implementation of the DQN algorithm for Cartpole from scratch. | ipynb, colab | ipynb, colab |
| **13 - PPO.** DQN and PPO on Cartpole using the tianshou library. | ipynb, colab | ipynb, colab |

## Recommended readings

- Richard Sutton and Andrew Barto (2017). Reinforcement Learning: An Introduction. MIT Press.

http://incompleteideas.net/book/the-book-2nd.html

- CS294 course of Sergey Levine at Berkeley.

http://rll.berkeley.edu/deeprlcourse/

- Reinforcement Learning course by David Silver at UCL.