DQN variants (Rainbow)

Double DQN

In DQN, the experience replay memory and the target network were decisive in allowing the CNN to learn the tasks through RL. Their drawback is that they drastically slow down learning and increase the sample complexity. Additionally, DQN has stability issues: the same network may not converge the same way in different runs. A first improvement over DQN was proposed by van Hasselt et al. () and called double DQN.

The idea is that the target value $y = r(s, a, s') + \gamma \, \max_{a'} Q_{\theta'}(s', a')$ frequently over-estimates the true return because of the maximum operator. Especially at the beginning of learning, when Q-values are far from being correct, if an action is over-estimated ($Q_{\theta'}(s', a)$ is higher than its true value) and selected by the target network as the next greedy action, the learned Q-value $Q_{\theta}(s, a)$ will also become over-estimated, which will propagate to all previous actions in the long term. van Hasselt () showed that this over-estimation is inevitable in regular Q-learning and proposed double learning.

Figure 9.1: Overestimation of Q-values with DQN on a few Atari games. The true Q-value can be estimated by counting the rewards until the end of the episode. Source: van Hasselt et al. ().

The idea is to train two value networks independently: one is used to find the greedy action (the action with the maximal Q-value), the other to estimate the Q-value itself. Even if the first network chooses an over-estimated action as the greedy action, the other one might provide a less over-estimated value for it, solving the problem.

Applying double learning to DQN is particularly straightforward: there are already two value networks, the trained network and the target network. Instead of using the target network to both select the greedy action in the next state and estimate its Q-value, here the trained network $\theta$ is used to select the greedy action $a^* = \text{argmax}_{a'} Q_\theta (s', a')$ while the target network only estimates its Q-value. The target value becomes:

y = r(s, a, s') + \gamma \, Q_{\theta'}(s', \text{argmax}_{a'} Q_\theta (s', a'))

This induces only a small modification of the DQN algorithm and significantly improves its performance and stability:

Double DQN algorithm
  • Every $T_\text{train}$ steps:

    • Sample a minibatch $\mathcal{D}_s$ randomly from $\mathcal{D}$.

    • For each transition $(s, a, r, s')$ in the minibatch:

      • Select the greedy action in the next state $a^* = \text{argmax}_{a'} Q_\theta (s', a')$ using the trained network.

      • Predict its Q-value $Q_{\theta'}(s', a^*)$ using the target network.

      • Compute the target value $y = r + \gamma \, Q_{\theta'}(s', a^*)$.

    • Update the trained network $\theta$ on $\mathcal{D}_s$ to minimize the mse between the predicted Q-values $Q_\theta(s, a)$ and the targets $y$.

  • Every $T_\text{target}$ steps, update the target network: $\theta' \leftarrow \theta$.
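To make the target computation concrete, here is a minimal PyTorch sketch of the double DQN target (the minibatch layout and the names `q_net` / `target_net` are assumptions, not the original implementation):

```python
import torch

def double_dqn_targets(batch, q_net, target_net, gamma=0.99):
    """Compute the double DQN target for a minibatch of transitions.

    `batch` is assumed to be a dict of tensors with keys 'rewards',
    'next_states' and 'dones' (1.0 for terminal transitions, 0.0 otherwise).
    `q_net` (trained network) and `target_net` map a batch of states to
    Q-values of shape (batch_size, nb_actions).
    """
    with torch.no_grad():
        # 1. Select the greedy action in the next state with the TRAINED network.
        greedy_actions = q_net(batch['next_states']).argmax(dim=1)
        # 2. Estimate its Q-value with the TARGET network.
        next_q = target_net(batch['next_states'])
        q_greedy = next_q.gather(1, greedy_actions.unsqueeze(1)).squeeze(1)
        # 3. Bellman target (no bootstrapping for terminal transitions).
        return batch['rewards'] + gamma * (1.0 - batch['dones']) * q_greedy
```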

Prioritized experience replay

Another drawback of the original DQN is that the experience replay memory is sampled uniformly. Novel and interesting transitions are selected with the same probability as old, well-predicted transitions, which slows down learning. The main idea of prioritized experience replay () is to order the transitions in the experience replay memory in decreasing order of their TD error:

\delta = r(s, a, s') + \gamma \, Q_{\theta'}(s', \text{argmax}_{a'} Q_\theta (s', a')) - Q_\theta(s, a)

and sample with a higher probability those surprising transitions to form a minibatch:

P(k) = \frac{(|\delta_k| + \epsilon)^\alpha}{\sum_k (|\delta_k| + \epsilon)^\alpha}

However, non-surprising transitions might become relevant again after enough training, as $Q_\theta(s, a)$ changes, so prioritized replay samples transitions stochastically according to these priorities instead of greedily replaying only the largest TD errors, ensuring some "exploration" of the memorized transitions. This data structure has of course a non-negligible computational cost, but it accelerates learning so much that it is worth it. See https://jaromiru.com/2016/11/07/lets-make-a-dqn-double-learning-and-prioritized-experience-replay/ for a presentation of double DQN with prioritized replay.
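As an illustration, here is a naive numpy sketch of this proportional sampling scheme (the function name is mine; a real implementation stores the priorities in a sum-tree so that sampling and updating are O(log N) instead of O(N)):

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, eps=1e-2):
    """Sample indices with probability (|delta_k| + eps)**alpha / sum(...).

    `td_errors` holds the last known TD error delta_k of each stored transition.
    """
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)

# Example: transitions with large TD errors are sampled much more often.
td_errors = np.array([0.01, 0.01, 2.0, 0.5])
indices = sample_prioritized(td_errors, batch_size=32)
```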

Figure 9.2: Relative improvement on the Atari benchmark brought by PER. Source: Schaul et al. ()

Duelling network

The classical DQN architecture uses a single NN to predict directly the value of all possible actions $Q_\theta(s, a)$. The value of an action depends on two factors:

  • the value of the underlying state $s$: in some states, all actions are bad, you lose whatever you do.
  • the interest of that action: some actions are better than others for a given state.
Figure 9.3: The value of an action depends on the value of the underlying state, plus its own advantage.

However, the exact Q-values of all actions are not equally important.

  • In bad states (low $V^\pi(s)$), you can do whatever you want, you will lose.
  • In neutral states, you can do whatever you want, nothing happens.
  • In good states (high $V^\pi(s)$), you need to select the right action to get rewards, otherwise you lose.

The total variance of the Q-values (over all states) is quite high: this can be a problem for the underlying neural network, which has to output very negative and very positive numbers.

This leads to the definition of the advantage $A^\pi(s,a)$ of an action:

A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s) \tag{9.1}

The advantage of the optimal action $a^*$ in $s$ is equal to zero: the expected return in $s$ is the same as the expected return when being in $s$ and taking $a^*$, as the optimal policy will choose $a^*$ in $s$ anyway. The advantage of all other actions is negative: they bring less reward than the optimal action (by definition), so they are less advantageous. Note that this is only true if your estimate of $V^\pi(s)$ is correct.

Baird () has shown that it is advantageous to decompose the Q-value of an action into the value of the state and the advantage of the action (advantage updating):

Q^\pi(s, a) = V^\pi(s) + A^\pi(s, a)

If you already know that the value of a state is very low, you do not need to bother exploring and learning the value of all actions in that state: they will not bring much. Moreover, the advantage function has less variance than the Q-values, which is a very good property when using neural networks for function approximation. The variance of the Q-values comes from the fact that they are estimated based on other estimates, which themselves evolve during learning (non-stationarity of the targets) and can drastically change during exploration (stochastic policies). The advantages only track the relative change of the value of an action compared to its state, which is going to be much more stable over time.

Figure 9.4: The advantages have a much smaller variability than the Q-values.

The range of values taken by the advantages is also much smaller than that of the Q-values. Let's suppose we have two states with values -10 and 10, and two actions with advantages 0 and -1 (it does not matter which one). The Q-values will vary between -11 (the worst action in the worst state) and 10 (the best action in the best state), while the advantages only vary between -1 and 0. It is therefore going to be much easier for a neural network to learn the advantages than the Q-values, which are theoretically not bounded.

Figure 9.5: Duelling network architecture. Top: classical feedforward architecture to predict Q-values. Bottom: Duelling networks predicting state values and advantage functions to form the Q-values. Source: Wang et al. ().

Wang et al. () incorporated the idea of advantage updating in a double DQN architecture with prioritized replay (). As in DQN, the last layer represents the Q-values of the possible actions and has to minimize the mse loss:

\mathcal{L}(\theta) = \mathbb{E}_\pi([r(s, a, s') + \gamma \, Q_{\theta', \alpha', \beta'}(s', \text{argmax}_{a'} Q_{\theta, \alpha, \beta} (s', a')) - Q_{\theta, \alpha, \beta}(s, a)]^2)

The difference is that the previous fully-connected layer is forced to represent the value of the input state $V_{\theta, \beta}(s)$ and the advantage of each action $A_{\theta, \alpha}(s, a)$ separately. There are two separate sets of weights in the network, $\alpha$ and $\beta$, to predict these two values, sharing representations from the early convolutional layers through the weights $\theta$. The output layer simply performs a parameter-free summation of both sub-networks:

Q_{\theta, \alpha, \beta}(s, a) = V_{\theta, \beta}(s) + A_{\theta, \alpha}(s, a)

The issue with this formulation is that one could add a constant to $V_{\theta, \beta}(s)$ and subtract it from $A_{\theta, \alpha}(s, a)$ while obtaining the same result. An easy way to constrain the summation is to normalize the advantages, so that the greedy action has an advantage of zero as expected:

Q_{\theta, \alpha, \beta}(s, a) = V_{\theta, \beta}(s) + (A_{\theta, \alpha}(s, a) - \max_a A_{\theta, \alpha}(s, a))

By doing this, the advantages are still free, but the state value will have to take the correct value. Wang et al. () found that it is actually better to replace the $\max$ operator by the mean of the advantages. In this case, the advantages only need to change as fast as their mean, instead of having to compensate quickly for any change in the greedy action as the policy improves:

Q_{\theta, \alpha, \beta}(s, a) = V_{\theta, \beta}(s) + (A_{\theta, \alpha}(s, a) - \frac{1}{|\mathcal{A}|} \sum_a A_{\theta, \alpha}(s, a))

Apart from this specific output layer, everything works as usual: in particular, the gradient of the mse loss can travel backwards through backpropagation to update the weights $\theta$, $\alpha$ and $\beta$. The resulting architecture outperforms double DQN with prioritized replay (DDQN-PER) on most Atari games, particularly games with repetitive actions.
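To make the architecture concrete, here is a minimal PyTorch sketch of such a duelling output layer (the feature extractor is assumed to be the usual convolutional stack; layer sizes are placeholders, not the original implementation):

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling output layer: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')."""

    def __init__(self, feature_dim, nb_actions, hidden=512):
        super().__init__()
        # beta: stream predicting the state value V(s).
        self.value_stream = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # alpha: stream predicting the advantages A(s, a).
        self.advantage_stream = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, nb_actions))

    def forward(self, features):
        value = self.value_stream(features)            # (batch, 1)
        advantages = self.advantage_stream(features)   # (batch, nb_actions)
        # Parameter-free aggregation with the mean-subtraction trick.
        return value + advantages - advantages.mean(dim=1, keepdim=True)
```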

Figure 9.6: Relative improvement of Duelling DQN over DDQN-PER on Atari games. Source: Wang et al. ().

Categorical DQN

All RL methods based on the Bellman equations use the expectation operator to average returns and compute the values of states and actions:

Q^\pi(s, a) = \mathbb{E}_{s, a \sim \pi}[R(s, a)]

The variance of the returns is not considered in the action selection scheme, and most methods actually try to reduce this variance, as it impairs the convergence of neural networks. Decision theory states that only the mean should matter in the long run, but one can imagine tasks where the variance is an important factor for the decision. Imagine you are in a game where you have two actions available: the first one brings returns of 10 and 20, with a probability of 0.5 each (to simplify), while the second one brings returns of -10 and +40, also with probability 0.5. Both actions have the same Q-value of 15 (a return which is actually never experienced), so one can theoretically pick either action: both are optimal in Bellman's sense.

However, this is only true when playing long enough. If, after learning, one is only allowed one try on that game, it is obviously safer (but less fun) to choose the first action, as one wins at worst 10, while it could be -10 with the second action. Knowing the distribution of the returns makes it easier to distinguish risky choices from safe ones and to adapt the behavior accordingly. Another advantage is that by learning the distribution of the returns instead of just their mean, one actually gathers more information about the environment dynamics: it can only help the convergence of the algorithm towards the optimal policy.

Bellemare et al. () proposed to learn the value distribution (the probability distribution of the returns) through a modification of the Bellman equation. They show that learning the complete distribution of rewards instead of their mean leads to performance improvements on Atari games over modern variants of DQN.

Their proposed categorical DQN (also called C51) has an architecture based on DQN, but where the output layer predicts the distribution of the returns for each action $a$ in state $s$, instead of its mean $Q^\pi(s, a)$. In practice, each action $a$ is represented by $N$ output neurons, which encode the support of the distribution of returns. If the returns take values between $V_\text{min}$ and $V_\text{max}$, one can represent their distribution $\mathcal{Z}$ by taking $N$ discrete "bins" (called atoms in the paper) in that range. Figure 9.7 shows how the distribution of returns between -10 and 10 can be represented using 21 atoms.

Figure 9.7: Example of a value distribution using 21 atoms between -10 and 10. The average return is 3, but its variance is explicitly represented.

Of course, the main problem is to know in advance the range of returns $[V_\text{min}, V_\text{max}]$ (it depends largely on the choice of the discount rate $\gamma$), but it can be inferred by training another algorithm such as DQN beforehand. Dabney et al. () got rid of this problem with quantile regression. In the paper, the authors found experimentally that 51 is the most efficient number of atoms (hence the name C51).

Let's note $z_i$ these atoms, with $1 \leq i \leq N$. The probability that the return associated with a state-action pair $(s, a)$ lies within the bin associated with the atom $z_i$ is noted $p_i(s, a)$. These probabilities can be predicted by a neural network, typically by using a softmax function over outputs $f_i(s, a; \theta)$:

p_i(s, a; \theta) = \frac{\exp f_i(s, a; \theta)}{\sum_{j=1}^{N} \exp f_j(s, a; \theta)}

The distribution of the returns $\mathcal{Z}$ is simply a sum over the atoms (represented by the Dirac distributions $\delta_{z_i}$):

\mathcal{Z}_\theta(s, a) = \sum_{i=1}^{N} p_i(s, a; \theta) \, \delta_{z_i}

If these probabilities are correctly estimated, the Q-value is easy to compute as the mean of the distribution:

Q_\theta(s, a) = \mathbb{E} [\mathcal{Z}_\theta(s, a)] = \sum_{i=1}^{N} p_i(s, a; \theta) \, z_i
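As a small sketch (not the original implementation), the Q-values are obtained from the logits $f_i(s, a; \theta)$ with a softmax followed by a dot product with the support; $V_\text{min}$, $V_\text{max}$ and the number of atoms are assumed hyperparameters:

```python
import numpy as np

N = 51                                # number of atoms (C51)
V_min, V_max = -10.0, 10.0            # assumed range of the returns
z = np.linspace(V_min, V_max, N)      # fixed support of the distribution

def q_values(logits):
    """Q-values from the logits f_i(s, a), of shape (nb_actions, N)."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return probs @ z                  # Q(s, a) = sum_i p_i(s, a) * z_i
```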

These Q-values can then be used for action selection as in the regular DQN. The problem is now to learn the value distribution $\mathcal{Z}_\theta$, i.e. to find a learning rule / loss function for the $p_i(s, a; \theta)$. Let's consider a single transition $(s, a, r, s')$ and select the greedy action $a'$ in $s'$ using the current policy $\pi_\theta$. The value distribution $\mathcal{Z}_\theta$ can be evaluated by applying recursively the Bellman operator $\mathcal{T}$:

\mathcal{T} \, \mathcal{Z}_\theta(s, a) = \mathcal{R}(s, a) + \gamma \, \mathcal{Z}_\theta(s', a')

where $\mathcal{R}(s, a)$ is the distribution of immediate rewards after $(s, a)$. This use of the Bellman operator is the same as in Q-learning:

\mathcal{T} \, \mathcal{Q}_\theta(s, a) = \mathbb{E}[r(s, a)] + \gamma \, \mathcal{Q}_\theta(s', a')

In Q-learning, one minimizes the difference (mse) between $\mathcal{T} \, \mathcal{Q}_\theta(s, a)$ and $\mathcal{Q}_\theta(s, a)$, which are expectations (so we only manipulate scalars). Here, we will minimize the statistical distance between the distributions $\mathcal{T} \, \mathcal{Z}_\theta(s, a)$ and $\mathcal{Z}_\theta(s, a)$ themselves, using for example the KL divergence, the Wasserstein metric or the total variation distance.

The problem is mostly that the distributions $\mathcal{T} \, \mathcal{Z}_\theta(s, a)$ and $\mathcal{Z}_\theta(s, a)$ do not have the same support: for a particular atom $z_i$, $\mathcal{T} \, \mathcal{Z}_\theta(s, a)$ can have a non-zero probability $p_i(s, a)$, while $\mathcal{Z}_\theta(s, a)$ has a zero probability. Besides, the probabilities must sum to 1, so the atom probabilities cannot be updated independently from one another.

The proposed method consists of three steps:

  1. Computation of the Bellman update $\mathcal{T} \, \mathcal{Z}_\theta(s, a)$. They simply compute translated values for each $z_i$ according to:

\mathcal{T} \, z_i = r + \gamma \, z_i

and clip the obtained values to $[V_\text{min}, V_\text{max}]$. The reward $r$ translates the distribution of atoms, while the discount rate $\gamma$ scales it. Figure 9.8 shows the distribution of $\mathcal{T} \, \mathcal{Z}_\theta(s, a)$ compared to $\mathcal{Z}_\theta(s, a)$. Note that the atoms of the two distributions are not aligned.

Figure 9.8: Computation of the Bellman update $\mathcal{T} \, \mathcal{Z}_\theta(s, a)$. The atoms of the two distributions are not aligned.
  2. Distribution of the probabilities of $\mathcal{T} \, \mathcal{Z}_\theta(s, a)$ on the support of $\mathcal{Z}_\theta(s, a)$. The projected atom $\mathcal{T} \, z_i$ lies between two "real" atoms $z_l$ and $z_u$, with a non-integer index $b$ (for example $b = 3.4$, $l = 3$ and $u = 4$). The corresponding probability $p_{b}(s', a'; \theta)$ of the next greedy action $(s', a')$ is "spread" to its neighbors through a local interpolation depending on the distances between $b$, $l$ and $u$:

\Delta p_{l}(s', a'; \theta) = p_{b}(s', a'; \theta) \, (u - b)

\Delta p_{u}(s', a'; \theta) = p_{b}(s', a'; \theta) \, (b - l)

Figure 9.9 shows how the projected update distribution $\Phi \, \mathcal{T} \, \mathcal{Z}_\theta(s, a)$ now matches the support of $\mathcal{Z}_\theta(s, a)$.

Figure 9.9: Projected update $\Phi \, \mathcal{T} \, \mathcal{Z}_\theta(s, a)$ on the support of $\mathcal{Z}_\theta(s, a)$. The atoms are now aligned, so the statistical distance between the two distributions can be minimized.

The projection of the Bellman update onto an atom $z_i$ can be summarized by the following equation:

(\Phi \, \mathcal{T} \, \mathcal{Z}_\theta(s, a))_i = \sum_{j=1}^N \big [1 - \frac{| [\mathcal{T}\, z_j]_{V_\text{min}}^{V_\text{max}} - z_i|}{\Delta z} \big ]_0^1 \, p_j (s', a'; \theta)

where $[\cdot]_a^b$ bounds its argument in $[a, b]$ and $\Delta z$ is the step size between two atoms.

  3. Minimizing the statistical distance between $\Phi \, \mathcal{T} \, \mathcal{Z}_\theta(s, a)$ and $\mathcal{Z}_\theta(s, a)$. Now that the Bellman update has the same support as the value distribution, we can minimize the KL divergence between the two for a single transition:

\mathcal{L}(\theta) = D_\text{KL} (\Phi \, \mathcal{T} \, \mathcal{Z}_{\theta'}(s, a) \| \mathcal{Z}_\theta(s, a))

using a target network $\theta'$ for the target. Note that minimizing the KL divergence is the same as minimizing the cross-entropy between the two distributions, as in classification tasks:

\mathcal{L}(\theta) = - \sum_i (\Phi \, \mathcal{T} \, \mathcal{Z}_{\theta'}(s, a))_i \log p_i (s, a; \theta)

The projected Bellman update plays the role of the target vector in classification (except that it is a full distribution instead of a one-hot vector). DQN performs a regression on the Q-values (mse loss), while categorical DQN performs a classification (cross-entropy loss). Apart from the way the target is computed, categorical DQN is very similar to DQN: architecture, experience replay memory, target networks, etc.
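The three steps above can be summarized in a short, non-vectorized numpy sketch (an illustration under the assumptions of the previous code block; terminal transitions are ignored for brevity):

```python
import numpy as np

def projected_update(next_probs, reward, gamma, z, V_min, V_max):
    """Project the Bellman update r + gamma * z_j onto the fixed support z.

    `next_probs` are the atom probabilities p_j(s', a') of the greedy action
    in the next state. Returns the target distribution (Phi T Z)(s, a).
    """
    N = len(z)
    delta_z = (V_max - V_min) / (N - 1)
    target = np.zeros(N)
    for j in range(N):
        # 1. Bellman update of each atom, clipped to [V_min, V_max].
        Tz = np.clip(reward + gamma * z[j], V_min, V_max)
        # 2. Fractional index b of the projected atom on the support.
        b = (Tz - V_min) / delta_z
        l, u = int(np.floor(b)), int(np.ceil(b))
        # 3. Spread the probability onto the two neighbouring atoms.
        if l == u:
            target[l] += next_probs[j]
        else:
            target[l] += next_probs[j] * (u - b)
            target[u] += next_probs[j] * (b - l)
    return target

def cross_entropy(target, probs):
    """Cross-entropy between the projected update and the predicted p_i(s, a)."""
    return -np.sum(target * np.log(probs + 1e-8))
```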

Figure 9.10 illustrates how the predicted value distribution changes when playing Space Invaders (also have a look at the Youtube video at https://www.youtube.com/watch?v=yFBwyPuO2Vg). C51 outperforms DQN on most Atari games, both in terms of the achieved performance and the sample complexity.

Figure 9.10: Evolution of the value distribution for the categorical DQN playing Space Invaders. Animation Source: https://deepmind.com/blog/going-beyond-average-reinforcement-learning/


Noisy DQN

DQN and its variants rely on $\epsilon$-greedy action selection over the Q-values to explore. The exploration parameter $\epsilon$ is annealed during training to reach a final minimal value. It is preferred over softmax action selection, whose temperature $\tau$ has to be scaled to the unknown magnitude of the Q-values. The problem is that $\epsilon$-greedy is a global exploration mechanism: well-learned states do not need as much exploration as poorly explored ones.

$\epsilon$-greedy and softmax add exploratory noise to the output of DQN: the Q-values predict a greedy action, but another action is taken. What about adding noise to the parameters (weights and biases) of the DQN, which would change the greedy action every time? Controlling the level of noise inside the neural network indirectly controls the exploration level.

Parameter noise builds on the idea of Bayesian deep learning. Instead of learning a single value of the parameters:

y = \theta_1 \, x + \theta_0

we learn the distribution of the parameters, for example by assuming they come from a normal distribution:

\theta \sim \mathcal{N}(\mu_\theta, \sigma_\theta^2)

For each new input, we sample a value for the parameter:

\theta = \mu_\theta + \sigma_\theta \, \epsilon

with $\epsilon \sim \mathcal{N}(0, 1)$ a random variable.

The prediction yy will vary for the same input depending on the variances:

y = (\mu_{\theta_1} + \sigma_{\theta_1} \, \epsilon_1) \, x + \mu_{\theta_0} + \sigma_{\theta_0} \, \epsilon_0

The mean and variance of each parameter can be learned through backpropagation! As the random variables $\epsilon_i \sim \mathcal{N}(0, 1)$ are not correlated with anything, the variances $\sigma_\theta^2$ should decay to 0 once enough data has been seen: they represent the uncertainty about the prediction $y$.
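Here is a minimal PyTorch sketch of such a noisy linear layer with independent Gaussian noise (the Noisy DQN paper uses factorised noise for efficiency; the initialization values are only indicative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer whose parameters are sampled as theta = mu + sigma * eps."""

    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        self.w_mu = nn.Parameter(torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.w_sigma = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.b_sigma = nn.Parameter(torch.full((out_features,), sigma_init))

    def forward(self, x):
        # A new noise sample is drawn at every forward pass: the Q-values (and
        # hence the greedy action) fluctuate as long as the sigmas are large.
        weight = self.w_mu + self.w_sigma * torch.randn_like(self.w_sigma)
        bias = self.b_mu + self.b_sigma * torch.randn_like(self.b_sigma)
        return F.linear(x, weight, bias)
```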

Applied to DQN, this means that a state which has not been visited very often will have a high uncertainty: The predicted Q-values will change a lot between two evaluations and the greedy action might change: exploration. Conversely, a well-explored state will have a low uncertainty: The greedy action stays the same: exploitation.

Noisy DQN () uses greedy action selection over noisy Q-values. The level of exploration is learned by the network on a per-state basis. No need for scheduling! Parameter noise improves the performance of ϵ\epsilon-greedy-based methods, including DQN, dueling DQN, A3C, DDPG (see later), etc.

Figure 9.12: Results of Noisy DQN. Source: Fortunato et al. ()

Rainbow DQN

As we have seen, the original formulation of DQN () has been improved in many ways over the years.

  • Double DQN () separates the selection of the greedy action in the next state from its evaluation in order to prevent over-estimation of Q-values:

\mathcal{L}(\theta) = \mathbb{E}_\mathcal{D} [(r + \gamma \, Q_{\theta'}(s', \text{argmax}_{a'} Q_{\theta}(s', a')) - Q_\theta(s, a))^2]

  • Prioritized Experience Replay () selects transitions from the ERM proportionally to their current TD error:

P(k) = \frac{(|\delta_k| + \epsilon)^\alpha}{\sum_k (|\delta_k| + \epsilon)^\alpha}

  • Dueling DQN () splits learning of Q-values into learning of advantages and state values:

Q_\theta(s, a) = V_\alpha(s) + A_\beta(s, a)

  • Categorical DQN () learns the distribution of the returns instead of their expectation, by minimizing the cross-entropy between the projected Bellman update $\mathbf{t}_k$ and the predicted distribution:

\mathcal{L}(\theta) = \mathbb{E}_{\mathcal{D}_s}[ - \mathbf{t}_k \, \log Z_\theta(s_k, a_k)]

  • n-step returns () reduce the bias of the estimation by taking the next $n$ rewards into account, at the cost of a slightly higher variance:

\mathcal{L}(\theta) = \mathbb{E}_\mathcal{D} [(\sum_{k=1}^n \gamma^{k-1} \, r_{t+k} + \gamma^n \, \max_a Q_\theta(s_{t+n+1}, a) - Q_\theta(s_t, a_t))^2]

  • Noisy DQN () ensures exploration by adding noise to the parameters of the network instead of a softmax / $\epsilon$-greedy action selection over the Q-values.

All these improvements exceed the performance of vanilla DQN on most if not all Atari games. But which ones are the most important?

Hessel et al. () designed a Rainbow DQN integrating all these improvements. Not only does the combined network outperform all the DQN variants, but each of its components contributes to the performance, as shown by ablation studies (apart from double learning and duelling networks), see Figure 9.13.

Figure 9.13: Performance of the Rainbow DQN compared to other DQN variants (left) and ablation studies. Figures Source: Hessel et al. ().

Deep Recurrent Q-learning (DRQN)

The Atari games used as a benchmark for value-based methods are partially observable MDPs (POMDP), i.e. a single frame does not contain enough information to predict what is going to happen next (e.g. the speed and direction of the ball on the screen is not known). In DQN, partial observability is solved by stacking four consecutive frames and using the resulting tensor as an input to the CNN. While this approach worked well for most Atari games, it has several limitations (as explained in https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-6-partial-observability-and-deep-recurrent-q-68463e9aeefc):

  1. It increases the size of the experience replay memory, as four video frames have to be stored for each transition.
  2. It solves only short-term dependencies (instantaneous speeds). If the partial observability has long-term dependencies (an object has been hidden a long time ago but now becomes useful), the input to the neural network will not have that information. This is the main explanation why the original DQN performed so poorly on games necessitating long-term planning like Montezuma’s revenge.
Figure 9.14: Architecture of DRQN. Source: Hausknecht and Stone ().

Building on previous ideas from Schmidhuber's group (; ), Hausknecht and Stone () replaced one of the fully-connected layers of the DQN network by an LSTM layer, while using single frames as inputs. The resulting deep recurrent Q-learning (DRQN) network became able to solve POMDPs thanks to the learning abilities of LSTMs: the LSTM layer learns to remember which parts of the sensory information will be useful to take decisions later.
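A minimal PyTorch sketch of this kind of architecture could look as follows (layer sizes follow the usual DQN convolutional stack; this is an illustration, not the original implementation):

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """DQN where one fully-connected layer is replaced by an LSTM."""

    def __init__(self, nb_actions, lstm_size=512):
        super().__init__()
        # Convolutional encoder processing a SINGLE 84x84 grayscale frame.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten())                       # -> 64 * 7 * 7 = 3136 features
        # Recurrent layer integrating information over time.
        self.lstm = nn.LSTM(3136, lstm_size, batch_first=True)
        self.head = nn.Linear(lstm_size, nb_actions)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 1, 84, 84)
        B, T = frames.shape[:2]
        features = self.conv(frames.reshape(B * T, 1, 84, 84)).reshape(B, T, -1)
        out, hidden = self.lstm(features, hidden)
        return self.head(out), hidden           # Q-values for each time step
```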

However, LSTMs are not a magical solution either. They are trained using truncated BPTT, i.e. on a limited history of states. Long-term dependencies exceeding the truncation horizon cannot be learned. Additionally, all states in that horizon (i.e. all frames) have to be stored in the ERM to train the network, drastically increasing its size. Furthermore, the training time (but not the inference time) is orders of magnitude slower than with a comparable feedforward network. Despite these limitations, DRQN is a much more elegant solution to the partial observability problem, letting the network decide which horizon it needs to solve long-term dependencies.

Figure 9.15: Performance of DRQN compared to DQN. Source: Hausknecht and Stone ().