Actor-Critic with Experience Replay (ACER)

The natural gradient methods presented previously (TRPO, PPO) are stochastic actor-critic methods and therefore strictly on-policy. Off-policy methods such as DQN or DDPG can reuse past transitions through an experience replay memory, potentially reducing the sample complexity at the cost of higher variance and worse stability. Wang et al. (2017) proposed an off-policy actor-critic architecture using variance reduction techniques, the off-policy Retrace algorithm (Munos et al., 2016), parallel training of multiple actor-learners (Mnih et al., 2016), truncated importance sampling with bias correction, stochastic duelling network architectures (Wang et al., 2016), and efficient trust region policy optimization. It can be seen as the off-policy counterpart of A3C.

The first aspect of ACER is that it interleaves on-policy and off-policy learning: the agent samples a trajectory \tau, learns from it on-policy, stores it in the replay buffer, samples n trajectories from the replay buffer and learns off-policy from each of them (a code sketch follows the algorithm):

ACER algorithm
  • Sample a trajectory \tau using the current policy.

  • Apply ACER on-policy on \tau.

  • Store \tau in the replay buffer.

  • Sample n trajectories from the replay buffer.

  • for each sampled trajectory \tau_k:

    • Apply ACER off-policy on \tau_k.
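This interleaving is easy to express in code. Below is a minimal Python sketch of one outer iteration; `sample_trajectory`, `acer_update` and `replay_buffer` are hypothetical placeholders for the environment interaction, the ACER losses described in the rest of this section and the replay memory, and are not taken from the paper.

```python
import random

def acer_iteration(sample_trajectory, acer_update, replay_buffer, n_replay=4):
    """One outer iteration of ACER: a single on-policy update followed by
    n_replay off-policy updates on replayed trajectories.
    All arguments are hypothetical callables/containers used for illustration."""
    # 1. Sample a trajectory with the current policy and learn on-policy from it.
    trajectory = sample_trajectory()
    acer_update(trajectory, on_policy=True)

    # 2. Store the trajectory for later reuse.
    replay_buffer.append(trajectory)

    # 3. Sample n trajectories from the buffer and learn off-policy from each.
    n = min(n_replay, len(replay_buffer))
    for replayed in random.sample(list(replay_buffer), n):
        acer_update(replayed, on_policy=False)
```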

Mixing on-policy and off-policy learning in this way is quite similar to the Self-Imitation Learning approach of (Oh et al., 2018) (see Section Self-Imitation Learning (SIL)).

Retrace evaluation

ACER comes in two flavors: one for discrete action spaces and one for continuous spaces. The discrete version is simpler, so let’s focus on it. Like any policy-gradient method, ACER estimates the policy gradient for each transition of a trajectory, but using importance sampling (Degris et al., 2012):

\nabla_\theta J(\theta) = \mathbb{E}_{s_t, a_t \sim \rho_b} [\frac{\pi_\theta(s_t, a_t)}{b(s_t, a_t)} \, Q_\varphi(s_t, a_t) \, \nabla_\theta \log \pi_\theta(s_t, a_t)]
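As an illustration, this expectation can be estimated over sampled transitions with a surrogate loss whose gradient matches the expression above. The sketch below assumes per-transition tensors for the taken actions (names and shapes are illustrative, not from the paper):

```python
import torch

def off_policy_pg_loss(log_pi_taken, log_b_taken, q_taken):
    """Surrogate loss whose gradient is the importance-sampled policy gradient.
    All tensors have shape (T,) and refer to the action a_t actually taken
    by the behavior policy b (a sketch, not the full ACER loss)."""
    rho = torch.exp(log_pi_taken - log_b_taken).detach()  # pi/b, treated as constant
    q = q_taken.detach()                                  # critic estimate, no gradient
    return -(rho * q * log_pi_taken).mean()               # minimizing this ascends J(theta)
```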

The problem is now to train the critic Q_\varphi(s_t, a_t) by computing the correct target. ACER learning builds on the Retrace algorithm (Munos et al., 2016):

\Delta Q_\varphi(s_t, a_t) = \alpha \, \sum_{t'=t}^T (\gamma)^{t'-t} \left(\prod_{s=t+1}^{t'} c_s \right) \, \delta_{t'}

with c_s = \lambda \min (1, \frac{\pi_\theta(s_s, a_s)}{b(s_s, a_s)}) the clipped importance sampling weight and \delta_{t'} the TD error at time t'>t:

\delta_{t'} = r_{t'+1} + \gamma \, V(s_{t'+1}) - V(s_{t'})

Denoting by Q^\text{ret}(s_t, a_t) the target value for the update of the critic (and neglecting the learning rate \alpha):

Q^\text{ret}(s_t, a_t) = Q_\varphi(s_t, a_t) + \sum_{t'=t}^T (\gamma)^{t'-t} \left(\prod_{s=t+1}^{t'} c_s \right) \, \delta_{t'}

we can transform the Retrace formula into a recursive one:

\begin{aligned} Q^\text{ret}(s_t, a_t) & = Q_\varphi(s_t, a_t) + \delta_t + \sum_{t'=t+1}^T (\gamma)^{t'-t} \left(\prod_{s=t+1}^{t'} c_s \right) \, \delta_{t'} \\ & = Q_\varphi(s_t, a_t) + \delta_t + \gamma \, c_{t+1} \, (Q^\text{ret}(s_{t+1}, a_{t+1}) - Q_\varphi(s_{t+1}, a_{t+1})) \\ \end{aligned}

Q_\varphi(s_t, a_t) + \delta_t = Q_\varphi(s_t, a_t) + r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t) can furthermore be reduced to r_{t+1} + \gamma \, V(s_{t+1}) by considering that Q_\varphi(s_t, a_t) \approx V(s_t) (the paper does not justify this assumption, but it holds in expectation over the actions sampled by the policy).

This gives us the following target value for the Q-values:

Q^\text{ret}(s_t, a_t) = r_{t+1} + \gamma \, c_{t+1} \, (Q^\text{ret}(s_{t+1}, a_{t+1}) - Q_\varphi(s_{t+1}, a_{t+1})) + \gamma \, V(s_{t+1})

One remaining issue is that the critic would also need to output the value of each state V(s_{t+1}), in addition to the Q-values Q_\varphi(s_t, a_t). In the discrete case, this is not necessary, as the value of a state is the expectation of the value of the available actions under the current policy:

V(s_{t+1}) = \mathbb{E}_{a_{t+1} \sim \pi_\theta} [Q_\varphi(s_{t+1}, a_{t+1})] = \sum_a \pi_\theta(s_{t+1}, a) \, Q_\varphi(s_{t+1}, a)

The value of the next state can be easily computed when we have the policy \pi_\theta(s, a) (actor) and the Q-value Q_\varphi(s, a) (critic) of each action a in a state s.

The actor-critic architecture needed for ACER is therefore the following:

  • The actor \pi_\theta takes a state s as input and outputs a vector of probabilities \pi_\theta for each available action.

  • The critic Q_\varphi takes a state s as input and outputs a vector of Q-values.

This is different from the architecture of A3C, where the critic “only” had to output the value of a state V_\varphi(s): it now outputs a vector of Q-values. Note that the actor and the critic can share most of their parameters: the network only needs to output two different vectors \pi_\theta(s) and Q_\varphi(s) for each input state s. This makes a two-headed NN, similar to the duelling architecture of (Wang et al., 2016).
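A minimal two-headed network for the discrete case could look as follows (a PyTorch sketch; the shared trunk and the layer sizes are assumptions, not the architecture used in the paper):

```python
import torch
import torch.nn as nn

class DiscreteACERNetwork(nn.Module):
    """Shared trunk with a policy head (pi_theta) and a Q-value head (Q_varphi)."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.pi_head = nn.Linear(hidden, n_actions)  # logits, turned into probabilities
        self.q_head = nn.Linear(hidden, n_actions)   # one Q-value per action

    def forward(self, state: torch.Tensor):
        h = self.trunk(state)
        pi = torch.softmax(self.pi_head(h), dim=-1)  # pi_theta(s, a)
        q = self.q_head(h)                           # Q_varphi(s, a)
        v = (pi * q).sum(dim=-1)                     # V(s) = sum_a pi(s, a) Q(s, a)
        return pi, q, v
```

The last line of `forward` already computes V(s) as the expectation of the Q-values under the policy, which is exactly the quantity needed below.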

The target Q-value Q^\text{ret}(s, a) can be found recursively by iterating backwards over the episode (see the sketch after the algorithm):

Retrace evaluation
  • Initialize Q^\text{ret}(s_T, a_T), Q_\varphi(s_T, a_T) and V(s_T) to 0, as the terminal state has no value.

  • for t \in [T-1, \ldots, 0]:

    • Update the target Q-value using the received reward, the critic and the previous target value:

    Q^\text{ret}(s_t, a_t) = r_{t+1} + \gamma \, c_{t+1} \, (Q^\text{ret}(s_{t+1}, a_{t+1}) - Q_\varphi(s_{t+1}, a_{t+1})) + \gamma \, V(s_{t+1})

    • Evaluate the critic on the current state-action pair:

    Q_\varphi(s_t, a_t)

    • Estimate the value of the state using the critic:

    V(s_t) = \sum_a \pi_\theta(s_t, a) \, Q_\varphi(s_t, a)
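In code, this backward recursion could be sketched as follows (Python; the tensor layout, the \lambda parameter and the default values are assumptions):

```python
import torch

def retrace_targets(rewards, q_taken, values, rhos, gamma=0.99, lam=1.0):
    """Compute the Retrace targets Q^ret backwards over one episode.

    rewards[t] = r_{t+1}, q_taken[t] = Q_varphi(s_t, a_t), values[t] = V(s_t),
    rhos[t] = pi_theta(s_t, a_t) / b(s_t, a_t); all tensors have shape (T,).
    The terminal state s_T is assumed to have Q^ret = Q_varphi = V = 0."""
    T = rewards.shape[0]
    q_ret = torch.zeros(T)
    next_q_ret, next_q, next_v = 0.0, 0.0, 0.0  # values at the terminal state
    for t in reversed(range(T)):
        # c_{t+1} = lambda * min(1, rho_{t+1}); irrelevant for the last transition
        c_next = lam * torch.clamp(rhos[t + 1], max=1.0) if t + 1 < T else 0.0
        q_ret[t] = rewards[t] + gamma * c_next * (next_q_ret - next_q) + gamma * next_v
        next_q_ret, next_q, next_v = q_ret[t], q_taken[t], values[t]
    return q_ret
```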

As the target value Q^\text{ret}(s, a) uses multiple “real” rewards r_{t+1}, it is less biased than the critic Q_\varphi(s, a). It is therefore better to use it to update the actor:

\nabla_\theta J(\theta) = \mathbb{E}_{s_t, a_t \sim \rho_b} [\frac{\pi_\theta(s_t, a_t)}{b(s_t, a_t)} \, Q^\text{ret}(s_t, a_t) \, \nabla_\theta \log \pi_\theta(s_t, a_t)]

The critic just has to minimize the mean square error (mse) with the target value:

\mathcal{L}(\varphi) = \mathbb{E}_{s_t, a_t \sim \rho_b} [(Q^\text{ret}(s_t, a_t) - Q_\varphi(s_t, a_t))^2]
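In code, using the targets computed above, this is a one-liner (detaching the target is an implementation choice made here for the sketch, reflecting that the gradient should only flow through the critic's prediction):

```python
def critic_loss(q_taken, q_ret):
    """Mean square error between Q_varphi(s_t, a_t) and the Retrace target
    Q^ret(s_t, a_t), the latter being treated as a constant."""
    return ((q_ret.detach() - q_taken) ** 2).mean()
```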

Importance weight truncation with bias correction

When updating the actor, we rely on the importance sampling weight \rho_t = \frac{\pi_\theta(s_t, a_t)}{b(s_t, a_t)} which can vary a lot and destabilize learning.

\nabla_\theta J(\theta) = \mathbb{E}_{s_t, a_t \sim \rho_b} [\rho_t \, Q^\text{ret}(s_t, a_t) \, \nabla_\theta \log \pi_\theta(s_t, a_t)]

PPO solved this problem by clipping the importance sampling weight between 1-\epsilon and 1+\epsilon. ACER uses a similar strategy, but only with an upper bound c = 10 on the weight:

\bar{\rho}_t = \min(c, \rho_t)

Using \bar{\rho}_t directly in the policy gradient would introduce a bias: actions whose importance sampling weight \rho_t is higher than c would contribute to the policy gradient with a smaller value than they should.

The solution in ACER is to add a bias-correction term that corrects the policy gradient when an action has a weight higher than c:

\begin{aligned} \nabla_\theta J(\theta) = & \mathbb{E}_{s_t \sim \rho_b} [\mathbb{E}_{a_t \sim b} [\bar{\rho}_t \, Q^\text{ret}(s_t, a_t) \, \nabla_\theta \log \pi_\theta(s_t, a_t)] \\ & + \mathbb{E}_{a \sim \pi_\theta}[(\frac{\rho_t(a) - c}{\rho_t(a)})^+ \, Q_\varphi(s_t, a) \, \nabla_\theta \log \pi_\theta(s_t, a)]] \\ \end{aligned}

The first part of this equation is the same policy gradient as before, but using the truncated importance sampling weight \bar{\rho}_t.

The second part requires integrating over all possible actions in the state s_t according to the learned policy \pi_\theta, although only the action a_t was actually selected by the behavior policy b. The term (\frac{\rho_t(a) - c}{\rho_t(a)})^+ is zero for all actions whose importance sampling weight is smaller than c, and is bounded by 1. In practice, this correction term can be computed from the vectors \pi_\theta(s, a) and Q_\varphi(s, a), which are the outputs of the actor and the critic, respectively.

Finally, the Q-values Q^\text{ret}(s_t, a_t) and Q_\varphi(s_t, a) are transformed into advantages Q^\text{ret}(s_t, a_t) - V_\varphi(s_t) and Q_\varphi(s_t, a) - V_\varphi(s_t) by subtracting the value of the state, in order to reduce the variance of the policy gradient.

In short, we now have an estimator of the policy gradient which is unbiased and has a smaller variance.
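Putting together the truncated term, the bias correction and the advantage transformation, the discrete-action policy loss could be sketched as follows (the tensor shapes, the small constants and the detaching strategy are assumptions made for this example):

```python
import torch

def acer_actor_loss(pi, q, v, behavior_probs, actions, q_ret, c=10.0):
    """Discrete ACER policy loss with truncated importance sampling and bias correction.

    pi, q:          (T, n_actions) outputs of the actor / critic
    v:              (T,)           V(s_t) = sum_a pi(s_t, a) Q(s_t, a)
    behavior_probs: (T, n_actions) b(s_t, .) stored when acting
    actions:        (T,)           actions a_t taken by the behavior policy (int64)
    q_ret:          (T,)           Retrace targets for the taken actions
    """
    log_pi = torch.log(pi + 1e-8)
    rho = (pi / (behavior_probs + 1e-8)).detach()          # rho_t(a) for every action
    rho_taken = rho.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Truncated term: taken action, Retrace advantage, weight clipped at c.
    log_pi_taken = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)
    adv_ret = (q_ret - v).detach()
    loss_truncated = -(torch.clamp(rho_taken, max=c) * adv_ret * log_pi_taken).mean()

    # Bias correction: expectation over all actions under pi_theta.
    coef = ((rho - c) / (rho + 1e-8)).clamp(min=0.0)       # ((rho(a) - c) / rho(a))^+
    adv = (q - v.unsqueeze(1)).detach()
    loss_correction = -(pi.detach() * coef * adv * log_pi).sum(dim=1).mean()

    return loss_truncated + loss_correction
```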

Efficient trust region policy optimization

However, the variance is still too high. The last important step of ACER is an efficient TRPO update for the parameters of the actor.

A first component of their TRPO update is that they use a target actor network \theta' (called the averaged policy network in the paper) which slowly tracks the actor \theta after each update:

\theta' \leftarrow \alpha \, \theta' + (1 - \alpha) \, \theta
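In code, this is a standard soft update of the target parameters (a sketch; the value of \alpha is illustrative):

```python
import torch

@torch.no_grad()
def update_average_policy(avg_actor, actor, alpha=0.99):
    """theta' <- alpha * theta' + (1 - alpha) * theta, applied parameter-wise."""
    for p_avg, p in zip(avg_actor.parameters(), actor.parameters()):
        p_avg.mul_(alpha).add_((1.0 - alpha) * p)
```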

A second component is that the actor is decomposed into two parts:

  1. a distribution f.
  2. the statistics \Phi_\theta(s) of this distribution.

This is what you do when you apply softmax action selection on Q-values: the distribution is the Gibbs (or Boltzmann) distribution and the Q-values are its statistics. In the discrete case, ACER takes a categorical (or multinoulli) distribution: \Phi_\theta(s) gives the probability of each action and the distribution selects one of them. Think of a die with one side per action, with probabilities governed by the policy. In the continuous case, it could be anything, for example a normal distribution.
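In the discrete case, this factorization maps directly onto `torch.distributions` (the numbers below are purely illustrative):

```python
import torch
from torch.distributions import Categorical

probs = torch.tensor([0.7, 0.2, 0.1])  # Phi_theta(s): one probability per action
f = Categorical(probs=probs)           # the distribution f
action = f.sample()                    # a ~ f(. | Phi_theta(s))
log_prob = f.log_prob(action)          # log f(a | Phi_theta(s))
```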

Let’s rewrite the policy gradient with this formulation (we omit the bias correction here, but ACER uses it), taking it only w.r.t. the output of the actor \Phi_\theta(s_t) for a state s_t:

\hat{g_t}^\text{ACER} = \nabla_{\Phi_\theta(s_t)} J(\theta) = \bar{\rho}_t \, (Q^\text{ret}(s_t, a_t) - V_\varphi(s_t) ) \, \nabla_{\Phi_\theta(s_t)} \log f(a_t | \Phi_\theta(s_t))

To compute the policy gradient, we would only need to apply the chain rule:

\nabla_\theta J(\theta) = \mathbb{E}_{s_t, a_t \sim \rho_b} [ \hat{g_t}^\text{ACER} \, \nabla_\theta \Phi_\theta(s_t) ]

The variance of \hat{g_t}^\text{ACER} is too high. ACER defines the following TRPO problem: we search for a gradient z solution of:

\min_z ||\hat{g_t}^\text{ACER} - z ||^2 \\ \text{s.t.} \quad \nabla_{\Phi_\theta(s_t)} D_{KL}( f(\cdot | \Phi_{\theta'}(s_t) ) || f(\cdot | \Phi_{\theta}(s_t)) )^T \times z \leq \delta

The exact meaning of the constraint is hard to grasp, but here is some intuition: the change of policy z (remember that \hat{g_t}^\text{ACER} is defined w.r.t the output of the actor) should be as orthogonal as possible (within a margin \delta) to the change of the Kullback-Leibler divergence between the policy defined by the actor (\theta) and the one defined by the target actor (\theta'). In other words, we want to update the actor, but without making the new policy too different from its past values (the target actor).

The advantage of this formulation is that the objective function is quadratic in z and the constraint is linear. We can therefore easily find its solution using the KKT conditions:

z^* = \hat{g_t}^\text{ACER} - \max(0, \frac{k^T \, \hat{g_t}^\text{ACER} - \delta}{||k||^2}) \, k

where k = \nabla_{\Phi_\theta(s_t)} D_{KL}( f(\cdot | \Phi_{\theta'}(s_t) ) || f(\cdot | \Phi_{\theta}(s_t)) ).

Having obtained z^*, we can safely update the parameters of the actor in the direction of:

\nabla_\theta J(\theta) = \mathbb{E}_{s_t, a_t \sim \rho_b} [ z^* \, \nabla_\theta \Phi_\theta(s_t) ]
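A sketch of this two-step update (project the gradient w.r.t. the statistics, then backpropagate it through the actor) is given below; \delta and the usage shown in the comments are assumptions made for illustration:

```python
import torch

def trust_region_grad(g_acer, k, delta=1.0):
    """z* = g - max(0, (k^T g - delta) / ||k||^2) * k, where g and k are the
    ACER gradient and the KL gradient w.r.t. the statistics Phi_theta(s_t)."""
    kg = (k * g_acer).sum()
    scale = torch.clamp((kg - delta) / (k * k).sum().clamp(min=1e-8), min=0.0)
    return g_acer - scale * k

# Usage sketch: with phi = actor_statistics(state) and z = trust_region_grad(g, k),
#     phi.backward(-z)
# accumulates -z^T dPhi/dtheta into the parameter gradients, so that a gradient
# descent step on the parameters ascends J(theta).
```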

As noted in the paper:

“The trust region step is carried out in the space of the statistics of the distribution f, and not in the space of the policy parameters. This is done deliberately so as to avoid an additional back-propagation step through the policy network”.

We indeed need only one backpropagation pass through the actor per update. If the trust region were defined directly on the parameters of \pi_\theta, computing the gradient of the KL divergence would require an additional backpropagation step through the policy network.

The target network \theta' is furthermore used as the behavior policy b(s, a). There is also a target critic network \varphi', which is primarily used to compute the value of the states V_{\varphi'}(s) for variance reduction.

For a complete description of the algorithm, refer to the paper. To summarize, ACER is an actor-critic architecture using Retrace estimated values, importance weight truncation with bias correction and efficient TRPO. Its variant for continuous action spaces furthermore uses a Stochastic Dueling Network (SDN) in order to estimate both Q_\varphi(s, a) and V_\varphi(s). This is straightforward in the discrete case (the value is the sum of the Q-values weighted by the policy probabilities) but harder in the continuous case.

ACER improved the performance and/or the sample efficiency of the state of the art (A3C, DDPG, etc.) on a variety of tasks (Atari, Mujoco). Ablation studies showed that, apart from the truncation with bias correction, all aspects of the algorithm are essential to obtain this improvement.