Off-policy Actor-Critic
On-policy vs. off-policy
Actor-critic architectures are generally on-policy algorithms: the actions used to explore the environment must have been generated by the actor, otherwise the feedback provided by the critic (the advantage) introduces a large bias (i.e. a systematic error) into the policy gradient. This comes from the definition of the policy gradient theorem:
\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, Q^{\pi_\theta}(s, a)]
The state distribution \rho^\pi defines the ensemble of states that can be visited using the actor policy \pi_\theta. If, during Monte Carlo sampling of the policy gradient, the states s do not come from this state distribution, the approximated policy gradient will be wrong (high bias) and the resulting policy will be suboptimal.
The major drawback of on-policy methods is their sample complexity: it is difficult to ensure that the “interesting” regions of the policy space are actually discovered by the actor (see Figure 13.1). If the actor is initialized in a flat region of the reward landscape (where rewards are scarce), policy gradient updates will only change the policy slightly, and it may take many iterations until interesting policies are discovered and fine-tuned.
The problem becomes even worse when the state or action spaces are highly dimensional, or when rewards are sparse. Imagine the scenario where you are searching for your lost keys at home (a sparse reward is delivered only once you find them): you could spend hours randomly trying each action at your disposal (looking in your jackets, on your counter, but also jumping around, cooking something, watching TV…) until you finally explore the action “look behind the curtains” and find them. (Note: with deep RL, you would even have to do that a million times in order to allow gradient descent to train your brain…). If you had somebody telling you “if I were you, I would first search in your jackets, then on your counter and finally behind the curtains, but forget about watching TV, you will never find anything by doing that”, this would certainly reduce your exploration time.
This is somehow the idea behind off-policy algorithms: they use a behavior policy b(s, a) to explore the environment and train the target policy \pi(s, a) to reproduce the best explored actions by estimating how good they are. This does not come without caveats: if the behavior policy does not explore the optimal actions, the target policy will likely not be able to find them by itself, except by chance. But if the behavior policy is good enough, this can drastically reduce the amount of exploration needed to obtain a satisfying policy. Sutton and Barto (2017) noted that:
On-policy methods are generally simpler and are considered first. Off-policy methods require additional concepts and notation, and because the data is due to a different policy, off-policy methods are often of greater variance and are slower to converge.
The most famous off-policy method is Q-learning. The reason why it is off-policy is that it does not use the next executed action (a_{t+1}) to update the value of an action, but the greedy action in the next state, which is independent of exploration:
\delta = r(s, a, s') + \gamma \, \max_{a'} Q^\pi(s', a') - Q^\pi(s, a)
The only condition for Q-learning to work (in the tabular case) is that the behavior policy b(s,a) must be able to explore actions which are selected by the target policy:
\pi(s, a) > 0 \rightarrow b(s, a) > 0
Actions which would be selected by the target policy should be selected at least from time to time by the behavior policy in order to allow their update: if the target policy thinks this action should be executed, the behavior policy should try it to confirm or refute this assumption. In mathematical terms, this is a coverage assumption of \pi by b (the support of b includes the one of \pi).
There are mostly two ways to create the behavior policy:
1. Use expert knowledge / human demonstrations. Not all available actions need to be explored: the programmer already knows some of them do not belong to the optimal policy. When an agent learns to play chess, for example, the behavior policy could consist of the moves typically played by human experts: if chess masters play a move, it is likely to be a good action, so it should be tried out, valued and incorporated into the target policy if it is indeed a good action (experts might be wrong). A similar idea was used to bootstrap early versions of AlphaGo (Silver et al., 2016). In robotics, one could for example use “classical” engineering methods to control the exploration of the robot, while (hopefully) learning a better policy. It is also possible to perform imitation learning, where the agent learns from human demonstrations (e.g. Levine and Koltun, 2013).
2. Derive it from the target policy. In Q-learning, the target policy can be deterministic, i.e. always select the greedy action (the one with the maximum Q-value). The behavior policy can then be derived from the target policy by making it \epsilon-soft, for example using an \epsilon-greedy or softmax action selection scheme on the Q-values learned by the target policy.
The second option allows controlling the level of exploration during learning (by adjusting \epsilon or the softmax temperature) while making sure that the target policy (the one used in production) stays deterministic and optimal. It furthermore ensures that the coverage assumption is respected: the greedy action of the target policy always has a non-zero probability of being selected by an \epsilon-greedy or softmax action selection. This is harder to guarantee using expert knowledge.
Q-learning methods such as DQN use this second option. The target policy in DQN is a greedy policy with respect to the Q-values (i.e. the action with the maximum Q-value is deterministically chosen), but an \epsilon-soft behavior policy is derived from it to ensure exploration. This explains the following line in the description of the DQN algorithm:
- Select the action a_t based on the behavior policy derived from Q_\theta(s_t, a) (e.g. softmax).
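As an illustration, here is a minimal sketch (in Python/NumPy, assuming a discrete action space and Q-values that have already been computed for one state) of how an \epsilon-greedy or softmax behavior policy can be derived from the Q-values of a greedy target policy:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """Behavior policy: random action with probability epsilon, greedy action otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax(q_values, temperature=1.0):
    """Behavior policy: sample an action with probability proportional to exp(Q / T)."""
    logits = np.asarray(q_values) / temperature
    logits -= logits.max()                          # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))

q = np.array([1.0, 2.5, 0.3])                       # hypothetical Q-values for one state
target_action = int(np.argmax(q))                   # deterministic target policy (greedy)
behavior_action = softmax(q, temperature=0.5)       # stochastic behavior policy
```

Both behavior policies give the greedy action a non-zero probability of being selected, so the coverage assumption is satisfied.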
Off-policy learning furthermore allows the use of an experience replay memory: in that case, the transitions used to train the target policy were generated by an older version of it (sometimes much older). Only off-policy methods can work with a replay buffer. A3C, for example, is on-policy: it relies on multiple parallel learners to fight the correlation of inputs and outputs.
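For completeness, an experience replay memory is nothing more than a finite buffer of past transitions from which minibatches are sampled; a minimal sketch (the transition format is arbitrary here):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay memory storing (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are discarded automatically

    def append(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Sampled transitions were generated by older versions of the policy: off-policy data.
        return random.sample(list(self.buffer), batch_size)
```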
Importance sampling
Off-policy methods learn a target policy \pi(s,a) while exploring with a behavior policy b(s,a). The environment is sampled using the behavior policy to form estimates of the state or action values (for value-based methods) or of the policy gradient (for policy gradient methods). But is it mathematically correct?
In policy gradient methods, we want to maximize the expected return of trajectories:
J(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}[R(\tau)] = \int_\tau \rho_\theta(\tau) \, R(\tau) \, d\tau \approx \frac{1}{N} \sum_{i=1}^N R(\tau_i)
where \rho_\theta is the distribution of trajectories \tau generated by the target policy \pi_\theta. Mathematical expectations can be approximated by averaging a sufficient number of samples of the estimator (Monte Carlo). In policy gradient methods we estimate the gradient, but let’s consider that we sample the objective function for now. If we use a behavior policy to generate the trajectories, what we are actually estimating is:
\hat{J}(\theta) = \mathbb{E}_{\tau \sim \rho_b}[R(\tau)] = \int_\tau \rho_b(\tau) \, R(\tau) \, d\tau
where \rho_b is the distribution of trajectories generated by the behavior policy. In the general case, there is no reason why \hat{J}(\theta) should be close to J(\theta), even when taking their gradients.
As seen in Section Importance sampling, importance sampling is a classical statistical method used to estimate properties of a distribution (here the expected return of the trajectories of the target policy) while only having samples generated from a different distribution (here the trajectories of the behavior policy). See for example https://statweb.stanford.edu/~owen/mc/Ch-var-is.pdf and http://timvieira.github.io/blog/post/2014/12/21/importance-sampling for more generic explanations.
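Before applying it to trajectories, here is a minimal numerical illustration (pure NumPy, with arbitrarily chosen Gaussian distributions) of estimating an expectation under one distribution using only samples from another:

```python
import numpy as np

rng = np.random.default_rng(42)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Target distribution pi = N(1, 1), behavior distribution b = N(0, 2).
x = rng.normal(0.0, 2.0, size=100000)                            # samples from b only
weights = gaussian_pdf(x, 1.0, 1.0) / gaussian_pdf(x, 0.0, 2.0)  # importance weights pi(x)/b(x)

# E_{x ~ pi}[x^2] = mu^2 + sigma^2 = 2, estimated without ever sampling from pi:
estimate = np.mean(weights * x ** 2)
print(estimate)   # close to 2.0
```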
The trick is simply to rewrite the objective function as:
\begin{aligned} J(\theta) & = \mathbb{E}_{\tau \sim \rho_\theta}[R(\tau)] \\ & = \int_\tau \rho_\theta(\tau) \, R(\tau) \, d\tau \\ & = \int_\tau \frac{\rho_b(\tau)}{\rho_b(\tau)} \, \rho_\theta(\tau) \, R(\tau) \, d\tau \\ & = \int_\tau \rho_b(\tau) \frac{\rho_\theta(\tau)}{\rho_b(\tau)} \, R(\tau) \, d\tau \\ & = \mathbb{E}_{\tau \sim \rho_b}[\frac{\rho_\theta(\tau)}{\rho_b(\tau)} \, R(\tau)] \\ \end{aligned}
The ratio \frac{\rho_\theta(\tau)}{\rho_b(\tau)} is called the importance sampling weight for the trajectory. If a trajectory generated by b is associated with a lot of rewards R(\tau) (with \rho_b(\tau) significantly high), the actor should learn to reproduce that trajectory with a high probability \rho_\theta(\tau), as its goal is to maximize J(\theta). Conversely, if the associated reward is low (R(\tau)\approx 0), the target policy can forget about it (by setting \rho_\theta(\tau) = 0), even though the behavior policy still generates it!
The problem is now to estimate the importance sampling weight. Using the definition of the likelihood of a trajectory, the importance sampling weight only depends on the policies, not the dynamics of the environment (they cancel out):
\frac{\rho_\theta(\tau)}{\rho_b(\tau)} = \frac{p_0 (s_0) \, \prod_{t=0}^T \pi_\theta(s_t, a_t) p(s_{t+1} | s_t, a_t)}{p_0 (s_0) \, \prod_{t=0}^T b(s_t, a_t) p(s_{t+1} | s_t, a_t)} = \frac{\prod_{t=0}^T \pi_\theta(s_t, a_t)}{\prod_{t=0}^T b(s_t, a_t)} = \prod_{t=0}^T \frac{\pi_\theta(s_t, a_t)}{b(s_t, a_t)}
This allows estimating the objective function J(\theta) using Monte Carlo sampling (Meuleau et al., 2000; Peshkin and Shelton, 2002):
J(\theta) \approx \frac{1}{m} \, \sum_{i=1}^m \frac{\rho_\theta(\tau_i)}{\rho_b(\tau_i)} \, R(\tau_i)
All one needs to do is to repeatedly sample trajectories with the behavior policy, weight their return by the importance sampling ratio and average the results.
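A hedged sketch of this procedure (the helpers `sample_trajectory`, `prob_pi` and `prob_b` are hypothetical: they should return one trajectory generated by the behavior policy, and the probabilities of an action under the target and behavior policies, respectively):

```python
def estimate_objective(n_trajectories, sample_trajectory, prob_pi, prob_b, gamma=0.99):
    """Monte Carlo estimate of J(theta) from trajectories generated by the behavior policy b."""
    total = 0.0
    for _ in range(n_trajectories):
        states, actions, rewards = sample_trajectory()        # generated by b, not by pi_theta
        # Importance sampling weight: product of per-step probability ratios.
        weight = 1.0
        for s, a in zip(states, actions):
            weight *= prob_pi(s, a) / prob_b(s, a)
        # Discounted return of the trajectory.
        ret = sum(gamma ** t * r for t, r in enumerate(rewards))
        total += weight * ret
    return total / n_trajectories
```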
Tang and Abbeel (2010) showed that the same idea can be applied to the policy gradient, under assumptions often met in practice:
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \rho_b}[ \nabla_\theta \log \rho_\theta(\tau) \, \frac{\rho_\theta(\tau)}{\rho_b(\tau)} \, R(\tau)]
When decomposing the policy gradient for each state encountered, one can also use the causality principle to simplify the terms:
- The return obtained after being in a state s_t only depends on future states and rewards.
- The importance sampling weight (the relative probability of arriving in s_t using the behavior and target policies) only depends on the past steps (t' \leq t).
This gives the following approximation of the policy gradient, used for example in Guided policy search (Levine and Koltun, 2013):
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \rho_b}[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(s_t, a_t) \, \left(\prod_{t'=0}^t \frac{\pi_\theta(s_{t'}, a_{t'})}{b(s_{t'}, a_{t'})} \right) \, \left(\sum_{t'=t}^T \gamma^{t'-t} \, r(s_{t'}, a_{t'}) \right)]
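A minimal NumPy sketch of this estimator for a single trajectory, assuming the per-step gradients of \log \pi_\theta, the probability ratios and the rewards have already been collected:

```python
import numpy as np

def off_policy_gradient(log_grads, ratios, rewards, gamma=0.99):
    """Causality-corrected policy gradient estimate for one trajectory.

    log_grads[t]: gradient of log pi_theta(s_t, a_t) w.r.t. theta, shape (n_params,)
    ratios[t]:    pi_theta(s_t, a_t) / b(s_t, a_t)
    rewards[t]:   reward r(s_t, a_t)
    """
    T = len(rewards)
    grad = np.zeros_like(log_grads[0])
    for t in range(T):
        weight = np.prod(ratios[:t + 1])                               # only past ratios
        ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))  # only future rewards
        grad += log_grads[t] * weight * ret
    return grad
```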
Linear Off-Policy Actor-Critic (Off-PAC)
The first off-policy actor-critic method was proposed by Degris et al. (2012) for linear approximators. Another way to express the objective function in policy search is by using the Bellman equation (here in the off-policy setting):
J(\theta) = \mathbb{E}_{s \sim \rho_b} [V^{\pi_\theta}(s)] = \mathbb{E}_{s \sim \rho_b} [\sum_{a\in\mathcal{A}} \pi(s, a) \, Q^{\pi_\theta}(s, a)]
Maximizing the value of all states reachable by the policy is the same as finding the optimal policy: the encountered states bring the maximum return. The policy gradient becomes:
\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho_b} [\sum_{a\in\mathcal{A}} \nabla_\theta (\pi_\theta(s, a) \, Q^{\pi_\theta}(s, a))]
Because both \pi(s, a) and Q^\pi(s, a) depend on the target policy \pi_\theta (hence its parameters \theta), one should normally write:
\nabla_\theta (\pi_\theta(s, a) \, Q^{\pi_\theta}(s, a)) = Q^{\pi_\theta}(s, a) \, \nabla_\theta \pi_\theta(s, a) + \pi_\theta(s, a) \, \nabla_\theta Q^{\pi_\theta}(s, a)
The second term depends on \nabla_\theta Q^{\pi_\theta}(s, a), which is very difficult to estimate. Degris et al. (2012) showed that when the Q-values are estimated by an unbiased critic Q_\varphi(s, a), this second term can be omitted. Using the log-trick and importance sampling, the policy gradient can be expressed as:
\begin{aligned} \nabla_\theta J(\theta) & = \mathbb{E}_{s \sim \rho_b} [\sum_{a\in\mathcal{A}} Q_\varphi(s, a) \, \nabla_\theta \pi_\theta(s, a)] \\ & = \mathbb{E}_{s \sim \rho_b} [\sum_{a\in\mathcal{A}} b(s, a) \, \frac{\pi_\theta(s, a)}{b(s, a)} \, Q_\varphi(s, a) \, \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)}] \\ & = \mathbb{E}_{s,a \sim \rho_b} [\frac{\pi_\theta(s, a)}{b(s, a)} \, Q_\varphi(s, a) \, \nabla_\theta \log \pi_\theta(s, a)] \\ \end{aligned}
We now have an actor-critic architecture (actor \pi_\theta(s, a), critic Q_\varphi(s, a)) able to learn from single transitions (s,a) (online update instead of complete trajectories) generated off-policy (behavior policy b(s,a) and importance sampling weight \frac{\pi_\theta(s, a)}{b(s, a)}). The off-policy actor-critic (Off-PAC) algorithm of Degris et al. (2012) furthermore uses eligibility traces to stabilize learning. However, it was limited to linear function approximators because its variance is too high to train deep neural networks.
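For a single transition, the resulting actor update could look like the following sketch (eligibility traces and the critic update are omitted; all quantities are assumed to be provided by the function approximators):

```python
import numpy as np

def offpac_actor_update(theta, grad_log_pi, pi_prob, b_prob, q_value, lr=0.01):
    """One Off-PAC-style actor update for a single transition (s, a).

    grad_log_pi: gradient of log pi_theta(s, a) w.r.t. theta, shape (n_params,)
    pi_prob:     pi_theta(s, a), probability of the executed action under the target policy
    b_prob:      b(s, a), probability of the executed action under the behavior policy
    q_value:     critic estimate Q_phi(s, a)
    """
    rho = pi_prob / b_prob                      # importance sampling weight
    return theta + lr * rho * q_value * grad_log_pi
```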
Retrace
For a good deep RL algorithm, we need the two following properties:
- Off-policy learning: it allows learning from transitions stored in a replay buffer (i.e. generated with an older version of the policy). As neural networks need many iterations to converge, it is important to be able to re-use old transitions for training instead of constantly sampling new ones (sample complexity). Multiple parallel actors as in A3C mitigate this problem, but the sample complexity stays high.
- Multi-step returns: the two extremes of RL are TD (a single “real” reward is used for the update, the rest is estimated) and Monte Carlo (only “real” rewards are used, nothing is estimated). TD has a smaller variance but a high bias (errors in the estimates propagate to all other values), while MC has a small bias but a high variance (it learns from many real rewards, but the returns may vary a lot between two almost identical episodes). Eligibility traces and n-step returns (used in A3C) are the most common trade-offs between TD and MC.
The Retrace algorithm (Munos et al., 2016) is designed to exhibit both properties when learning Q-values. It can therefore be used to train the critic (instead of classical Q-learning) and provide the actor with safe, efficient and low-variance values.
In the generic form, Q-learning updates the Q-value of a transition (s_t, a_t) using the TD error:
\Delta Q^\pi(s_t, a_t) = \alpha \, \delta_t = \alpha \, (r_{t+1} + \gamma \, \max_{a'} Q^\pi(s_{t+1}, a') - Q^\pi(s_t, a_t))
When using eligibility traces in the forward view, the change in Q-value depends also on the TD error of future transitions at times t' > t. A parameter \lambda ensures the stability of the update:
\Delta Q^\pi(s_t, a_t) = \alpha \, \sum_{t'=t}^T (\gamma \lambda)^{t'-t} \delta_{t'}
The Retrace algorithm proposes to generalize this formula using a parameter c_s for each time step between t and t':
\Delta Q^\pi(s_t, a_t) = \alpha \, \sum_{t'=t}^T (\gamma)^{t'-t} \left(\prod_{s=t+1}^{t'} c_s \right) \, \delta_{t'}
Depending on the choice of c_s, the formula covers different existing methods:
- c_s = \lambda is the classical eligibility trace mechanism (Q(\lambda)) in its forward view, which is not safe: the behavior policy b must be very close to the target policy \pi:
|| \pi - b ||_1 \leq \frac{1 - \gamma}{\lambda \gamma}
As \gamma is typically chosen very close to 1 (e.g. 0.99), this does not leave much room for the target policy to differ from the behavior policy (see Harutyunyan et al., 2016 for the proof).
- c_s = \frac{\pi(s_s, a_s)}{b(s_s, a_s)} is the importance sampling weight. Importance sampling is unbiased in off-policy settings, but can have a very large variance: the product of ratios \prod_{s=t+1}^{t'} \frac{\pi(s_s, a_s)}{b(s_s, a_s)} can quickly vary between two episodes.
- c_s = \pi(s_s, a_s) corresponds to the tree-backup algorithm TB(\lambda) (Precup et al., 2000). It has the advantage of working for arbitrary policies \pi and b, but the product of such probabilities decays very fast to zero when the time difference t' - t increases: TD errors will only be shared efficiently over a couple of steps.
For Retrace, Munos et al. (2016) showed that a much better value for c_s is:
c_s = \lambda \min (1, \frac{\pi(s_s, a_s)}{b(s_s, a_s)})
The importance sampling weight is clipped at 1 and decays exponentially with the parameter \lambda. It can be seen as a trade-off between importance sampling and eligibility traces. The authors showed that Retrace(\lambda) has a low variance (as it uses multiple returns), is safe (it works for any \pi and b) and efficient (it can propagate rewards over many time steps). They used Retrace to learn Atari games and compared it favorably with DQN, both in terms of optimality and speed of learning. These properties make Retrace particularly suited for deep RL and actor-critic architectures: it is for example used in ACER and the Reactor.
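As an illustration, here is a minimal tabular sketch of the forward-view Retrace update for one sequence sampled with the behavior policy (NumPy Q-table with integer states and actions; the TD error uses the greedy bootstrap defined above, whereas the original paper bootstraps on the expectation of Q under \pi):

```python
import numpy as np

def retrace_update(Q, states, actions, rewards, next_states, pi_probs, b_probs,
                   gamma=0.99, lam=1.0, alpha=0.1):
    """Forward-view Retrace update of Q(s_t, a_t) for every step of a sampled sequence.

    Q:           array of shape (n_states, n_actions), tabular critic for clarity
    pi_probs[t]: pi(s_t, a_t) for the executed action (target policy)
    b_probs[t]:  b(s_t, a_t) for the executed action (behavior policy)
    """
    T = len(rewards)
    # TD errors delta_t = r_{t+1} + gamma * max_a' Q(s_{t+1}, a') - Q(s_t, a_t)
    deltas = np.array([rewards[t] + gamma * Q[next_states[t]].max() - Q[states[t], actions[t]]
                       for t in range(T)])
    # Truncated importance sampling coefficients c_s = lambda * min(1, pi/b)
    c = lam * np.minimum(1.0, np.array(pi_probs) / np.array(b_probs))
    for t in range(T):
        trace, total = 1.0, 0.0
        for tp in range(t, T):
            if tp > t:
                trace *= c[tp]                  # product of c_s for s = t+1 .. t'
            total += gamma ** (tp - t) * trace * deltas[tp]
        Q[states[t], actions[t]] += alpha * total
    return Q
```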
- Rémi Munos uploaded some slides explaining Retrace in a simpler manner than in the original paper: https://ewrl.files.wordpress.com/2016/12/munos.pdf.
Self-Imitation Learning (SIL)
We have so far only discussed strictly on-policy or strictly off-policy methods. Off-policy methods are much more sample efficient, but they generally learn a deterministic policy, which can be problematic in stochastic environments (e.g. two-player games: being predictable is clearly an issue). Hybrid methods combining on- and off-policy mechanisms therefore have a great potential.
Oh et al. (2018) proposed a Self-Imitation Learning (SIL) method that can extend on-policy actor-critic algorithms (e.g. A2C) with a replay buffer able to feed past good experiences to the NN to speed up learning.
The main idea is to use prioritized experience replay (Schaul et al., 2015) to select only transitions whose actual return is higher than their current expected value. This defines two additional losses for the actor and the critic:
\mathcal{L}^\text{SIL}_\text{actor}(\theta) = \mathbb{E}_{s, a \in \mathcal{D}}[\log \pi_\theta(s, a) \, (R(s, a) - V_\varphi(s))^+]

\mathcal{L}^\text{SIL}_\text{critic}(\varphi) = \mathbb{E}_{s, a \in \mathcal{D}}[((R(s, a) - V_\varphi(s))^+)^2]
where (x)^+ = \max(0, x) is the positive part function. Transitions sampled from the replay buffer participate in the off-policy learning only if their return is higher than the current value of the state, i.e. if they are good experiences compared to what is currently known (V_\varphi(s)). The pseudo-algorithm is actually quite simple: it extends the A2C procedure with additional updates on these good past experiences sampled from the prioritized replay buffer.
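Below is a minimal NumPy sketch of the two self-imitation losses for one mini-batch sampled from the replay buffer (written as quantities to be minimized, hence the minus sign on the actor term):

```python
import numpy as np

def sil_losses(returns, values, log_probs):
    """Self-imitation losses for a batch of (s, a, R) tuples from the prioritized replay buffer.

    returns:   observed returns R(s, a) stored in the buffer
    values:    current critic estimates V_phi(s)
    log_probs: log pi_theta(s, a) of the stored actions under the current actor
    """
    advantage = np.maximum(returns - values, 0.0)    # (R - V)^+ : keep only good experiences
    actor_loss = -np.mean(log_probs * advantage)     # imitate actions that did better than expected
    critic_loss = np.mean(advantage ** 2)            # push V_phi(s) towards the higher observed return
    return actor_loss, critic_loss
```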
In the paper, they furthermore used entropy regularization as in A3C. They showed that A2C+SIL performs better both on Atari games and on continuous control problems (Mujoco) than state-of-the-art methods (A3C, TRPO, Reactor, PPO). This shows that self-imitation learning can be very useful in problems where exploration is hard: a proper level of exploitation of past experiences actually fosters a deeper exploration of the environment.
The Reactor
The Reactor (Retrace Actor) of Gruslys et al. (2017) combines many architectural and algorithmic contributions in order to provide an algorithm that is both sample efficient and fast at run-time. A3C, for example, has a better run-time performance (smaller wall-clock training time) than DQN or Categorical DQN thanks to the use of multiple actor-learners in parallel, but its sample complexity is actually higher (as it is on-policy).
The Reactor combines and improves on:
- An actor-critic architecture using policy gradient with importance sampling,
- Off-policy corrected returns computed by the Retrace algorithm,
- Distributional learning of the Q-values in the critic,
- Prioritized experience replay for sequences.
One could consider the Reactor as the distributional version of ACER. We will not go into all the details here, but simply outline the main novelties.
The Reactor is composed of an actor \pi_\theta(s, a) and a critic Q_\varphi(s, a). The actor is trained using policy gradient with importance sampling, as in Off-PAC. For a single state s and an action \hat{a} sampled by the behavior policy b, the gradient of the objective is defined as:
\begin{aligned} \nabla_\theta J(\theta) = \frac{\pi_\theta(s, \hat{a})}{b(s, \hat{a})} & \, (R(s, \hat{a}) - Q_\varphi(s, \hat{a})) \, \nabla_\theta \log \pi_\theta(s, \hat{a}) \\ & + \sum_a Q_\varphi(s, a) \, \nabla_\theta \pi_\theta(s, a) \\ \end{aligned}
The first term comes from Off-PAC and only concerns the action \hat{a} chosen by the behavior policy. The actual return R(s, \hat{a}) is compared to its estimate Q_\varphi(s, \hat{a}) in order to reduce its variance. The second term \sum_a Q_\varphi(s, a) \, \nabla_\theta \pi_\theta(s, a) depends on all actions available in s. Its role is to reduce the bias of the first term, without adding any variance, as it is only based on estimates. As the value of the state is defined by V^\pi(s) = \sum_a \pi(s, a) \, Q^\pi(s, a), maximizing this term maximizes the value of the state, i.e. the associated returns. This rule is called leave-one-out (LOO), as one action is left out of the sum and estimated from actual returns instead of other estimates.
For a better control of the variance, the behavior probability b(s, a) is replaced by a parameter \beta:
\nabla_\theta J(\theta) = \beta \, (R(s, \hat{a}) - Q_\varphi(s, \hat{a})) \, \nabla_\theta \pi_\theta(s, \hat{a}) + \sum_a Q_\varphi(s, a) \, \nabla_\theta \pi_\theta(s, a)
\beta is defined as \min (c, \frac{1}{b(s, \hat{a})}), where c > 1 is a constant. This truncation is similar to the one used in ACER. The rule is now called \beta-LOO and is a novel contribution of the Reactor.
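A sketch of the \beta-LOO estimate for a single state, assuming the critic outputs Q_\varphi(s, a) for every action and the gradients of \pi_\theta(s, a) are available (e.g. via automatic differentiation):

```python
import numpy as np

def beta_loo_gradient(grad_pi, a_hat, q_values, ret, b_prob, c=2.0):
    """beta-LOO policy gradient estimate for a single state.

    grad_pi:  array of shape (n_actions, n_params), gradients of pi_theta(s, a) for every action
    a_hat:    index of the action executed by the behavior policy
    q_values: array of critic estimates Q_phi(s, a) for every action
    ret:      multi-step return R(s, a_hat), e.g. estimated with Retrace
    b_prob:   probability b(s, a_hat) of the executed action under the behavior policy
    """
    beta = min(c, 1.0 / b_prob)                                # truncated importance weight
    first = beta * (ret - q_values[a_hat]) * grad_pi[a_hat]    # term for the executed action
    second = (q_values[:, None] * grad_pi).sum(axis=0)         # leave-one-out term over all actions
    return first + second
```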
The second important contribution of the Reactor is the way it combines the Retrace algorithm (Munos et al., 2016) for estimating the return R(s, \hat{a}) over multiple steps with the distributional learning method of Categorical DQN. As Retrace uses n-step returns iteratively, the n-step distributional Bellman target can be updated using the n future rewards:
z_i^n = \mathcal{T}^n \, z_i = \sum_{k=t}^{t+n} \gamma^{k-t} r_k + \gamma^n \, z_i
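Following the formula above, shifting the atoms of the value distribution by an n-step return is straightforward (a sketch; the shifted atoms must then be projected back onto the fixed support, as in Categorical DQN, which is not shown here):

```python
import numpy as np

def nstep_atoms(atoms, rewards, gamma=0.99):
    """Shift the support {z_i} of the value distribution by an n-step discounted return.

    atoms:   array containing the fixed support {z_i} of the categorical distribution
    rewards: the rewards r_t, ..., r_{t+n} observed along the stored sequence
    """
    n = len(rewards) - 1
    nstep_return = sum(gamma ** k * r for k, r in enumerate(rewards))
    return nstep_return + gamma ** n * np.asarray(atoms)   # z_i^n for every atom, before projection
```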
We leave out the details of how Retrace is combined with these distributional Bellman updates: the notation is complicated but the idea is simple. The last important contribution of the paper is the use of prioritized sequence replay. Prioritized experience replay preferentially selects transitions from the replay buffer that are the most surprising, i.e. those with the highest TD error. These transitions are the ones carrying the most information. A similar principle can be applied to sequences of transitions, which are needed by the n-step updates. The authors devised a specific sampling algorithm to achieve this and reduce the variance of the samples.
A final particularity of the Reactor is that it uses an LSTM layer to make the problem Markovian (instead of stacking four frames as in DQN) and trains multiple actor-learners as in A3C. The algorithm is trained on CPU, with 10 or 20 actor-learners. The Reactor outperforms DQN and its variants, A3C and ACER on Atari games. Importantly, the Reactor only needs one day of training on CPU, compared to the 8 days of GPU training needed by DQN.