Policy gradient
Professur für Künstliche Intelligenz - Fakultät für Informatik
Directly learning the Q-values in value-based methods (DQN) suffers from several problems:
The Q-values are unbounded: they can take any value (positive or negative), so the output layer must be linear.
The Q-values have a high variability: some (s,a) pairs have very negative values, others have very positive values. This is difficult for a NN to learn.
It only works for small discrete action spaces: one needs to iterate over all actions to find the greedy action.
Instead of learning the Q-values, one could directly approximate the policy \pi_\theta(s, a) with a neural network.
\pi_\theta(s, a) is called a parameterized policy: it depends directly on the parameters \theta of the NN.
For discrete action spaces, the output of the NN can be a softmax layer, directly giving the probability of selecting an action.
For continuous action spaces, the output layer can directly control the effector (joint angles).
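As a minimal sketch of the discrete case, the snippet below turns unbounded network outputs into action probabilities with a softmax and samples an action from them. The logits are made-up numbers standing in for the output of a last linear layer; `softmax` and `sample_action` are illustrative helper names, not part of any library.

```python
import math
import random

def softmax(logits):
    """Turn unbounded network outputs (logits) into action probabilities."""
    m = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng):
    """Sample an action index according to the given probabilities."""
    u = rng.random()
    cumulative = 0.0
    for action, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return action
    return len(probs) - 1  # guard against rounding errors

logits = [2.0, 0.5, -1.0]  # hypothetical outputs of the last linear layer
probs = softmax(logits)    # pi_theta(s, .) for the current state
action = sample_action(probs, random.Random(0))
```

The probabilities sum to one and preserve the ordering of the logits, so the policy stays stochastic while favoring the actions with the largest outputs.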
\mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}[R(\tau)] = \int_{\tau} \rho_\theta(\tau) \; R(\tau) \; d\tau
\rho_\theta(\tau) is the likelihood of the trajectory \tau under the policy \pi_\theta.
This means that the optimal policy should only select actions that maximize the expected return: exactly what we want.
\rho_\theta(\tau) = p_\theta(s_0, a_0, \ldots, s_T) = p_0 (s_0) \, \prod_{t=0}^T \pi_\theta(s_t, a_t) \, p(s_{t+1} | s_t, a_t)
The objective function is, however, not directly computable:
There is an infinite number of possible trajectories to integrate over if the action space is continuous.
Even if we sample trajectories, we would need a huge number of them to correctly estimate the objective function (sample complexity) because of the huge variance of the returns.
\mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}[R(\tau)] \approx \frac{1}{M} \, \sum_{i=1}^M R(\tau_i)
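The sampling estimate can be illustrated with a toy sketch. Here `sample_return` is a hypothetical stand-in for running one episode with the current policy and observing R(\tau): it simply draws from a Gaussian with mean 1 and a large standard deviation, which is an assumption chosen only to mimic high-variance returns.

```python
import random

def sample_return(rng):
    """Hypothetical stand-in for running one episode and observing R(tau)."""
    return rng.gauss(1.0, 5.0)  # true expected return is 1.0, with a large spread

def estimate_objective(num_trajectories, rng):
    """Monte Carlo estimate of J: average the returns of M sampled trajectories."""
    returns = [sample_return(rng) for _ in range(num_trajectories)]
    return sum(returns) / num_trajectories

rng = random.Random(0)
small = estimate_objective(10, rng)       # few trajectories: unreliable estimate
large = estimate_objective(100_000, rng)  # many trajectories: close to 1.0
```

With only a handful of trajectories the estimate can be far from the true value, which is the sample complexity problem mentioned above.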
\Delta \theta = \eta \, \nabla_\theta \mathcal{J}(\theta)
g = \nabla_\theta \mathcal{J}(\theta)
\mathcal{J}'(\theta) = \alpha \, \mathcal{J}(\theta) + \beta \; \Rightarrow \; \nabla_\theta \mathcal{J}'(\theta) \propto \nabla_\theta \mathcal{J}(\theta) \; \Rightarrow \; \Delta \theta = \eta \, \nabla_\theta \mathcal{J}'(\theta)
This is called surrogate optimization: we actually want to maximize \mathcal{J}(\theta) but we cannot compute it.
We instead create a surrogate objective \mathcal{J}'(\theta) which is locally the same as \mathcal{J}(\theta) and tractable.
Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 229–256.
\nabla_\theta \, \mathcal{J}(\theta) = \nabla_\theta \, \int_\tau \rho_\theta (\tau) \, R(\tau) \, d\tau = \int_\tau (\nabla_\theta \, \rho_\theta (\tau)) \, R(\tau) \, d\tau
by noting that the return R(\tau) of a given trajectory does not depend on the weights \theta (the agent only controls its actions, not the environment). We then use the log-trick, a simple identity based on the fact that:
\frac{d \log f(x)}{dx} = \frac{f'(x)}{f(x)}
or:
f'(x) = f(x) \times \frac{d \log f(x)}{dx}
to rewrite the gradient of the likelihood of a single trajectory:
\nabla_\theta \, \rho_\theta (\tau) = \rho_\theta (\tau) \times \nabla_\theta \log \rho_\theta (\tau)
\nabla_\theta \, \mathcal{J}(\theta) = \int_\tau (\nabla_\theta \, \rho_\theta (\tau)) \, R(\tau) \, d\tau = \int_\tau \rho_\theta (\tau) \, \nabla_\theta \log \rho_\theta (\tau) \, R(\tau) \, d\tau
which now has the form of a mathematical expectation:
\nabla_\theta \, \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}[ \nabla_\theta \log \rho_\theta (\tau) \, R(\tau) ]
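The log-trick can be checked numerically on a problem small enough to enumerate. The sketch below assumes a one-parameter policy over two one-step "trajectories" (a sigmoid over \theta) with made-up returns, and compares a finite-difference derivative of the objective with the expectation of \nabla_\theta \log \rho_\theta(\tau) \, R(\tau). All names and numbers are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

RETURNS = {0: 1.0, 1: 3.0}  # hypothetical returns of the two trajectories

def policy(theta):
    """Probability of each trajectory under the one-parameter policy."""
    p1 = sigmoid(theta)
    return {0: 1.0 - p1, 1: p1}

def objective(theta):
    """J(theta): expected return, computed exactly by enumeration."""
    pi = policy(theta)
    return sum(pi[a] * RETURNS[a] for a in pi)

def grad_log_policy(theta, a):
    """d/dtheta of log pi_theta(a) for the sigmoid policy."""
    p1 = sigmoid(theta)
    return (1.0 - p1) if a == 1 else -p1

def score_function_gradient(theta):
    """E[ grad log pi(a) * R(a) ], computed exactly by enumeration."""
    pi = policy(theta)
    return sum(pi[a] * grad_log_policy(theta, a) * RETURNS[a] for a in pi)

theta = 0.3
eps = 1e-5
finite_difference = (objective(theta + eps) - objective(theta - eps)) / (2 * eps)
log_trick = score_function_gradient(theta)
```

Both quantities agree up to the finite-difference error, although the second one never differentiates the returns themselves.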
\rho_\theta(\tau) = p_\theta(s_0, a_0, \ldots, s_T) = p_0 (s_0) \, \prod_{t=0}^T \pi_\theta(s_t, a_t) p(s_{t+1} | s_t, a_t)
\log \rho_\theta(\tau) = \log p_0 (s_0) + \sum_{t=0}^T \log \pi_\theta(s_t, a_t) + \sum_{t=0}^T \log p(s_{t+1} | s_t, a_t)
\nabla_\theta \log \rho_\theta(\tau) = \sum_{t=0}^T \nabla_\theta \log \pi_\theta(s_t, a_t)
The transition dynamics p(s_{t+1} | s_t, a_t) disappear from the gradient.
The Policy Gradient does not depend on the dynamics of the environment:
\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(s_t, a_t) \, R(\tau) ]
The REINFORCE algorithm is a policy-based variant of Monte Carlo control:
while not converged:
Sample M trajectories \{\tau_i\} using the current policy \pi_\theta and observe the returns \{R(\tau_i)\}.
Estimate the policy gradient as an average over the trajectories:
\nabla_\theta \mathcal{J}(\theta) \approx \frac{1}{M} \sum_{i=1}^M \sum_{t=0}^T \nabla_\theta \log \pi_\theta(s_t, a_t) \, R(\tau_i)
Update the policy using gradient ascent:
\theta \leftarrow \theta + \eta \, \nabla_\theta \mathcal{J}(\theta)
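The loop above can be sketched on the simplest possible task. The example below makes strong simplifying assumptions: episodes last a single step, there is no state (a two-armed bandit), rewards are deterministic, and the policy is a softmax over two logits. It is meant to show the structure of REINFORCE, not a general implementation.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce(num_iterations=200, batch_size=10, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    rewards = [0.0, 1.0]  # hypothetical environment: action 1 is the better one
    theta = [0.0, 0.0]    # one logit per action, no state input
    for _ in range(num_iterations):
        probs = softmax(theta)
        grad = [0.0, 0.0]
        # Sample M one-step trajectories with the current policy
        for _ in range(batch_size):
            a = 1 if rng.random() < probs[1] else 0
            ret = rewards[a]  # R(tau) of this one-step episode
            for k in range(2):
                # Gradient of log softmax w.r.t. logit k: 1[k == a] - pi(k)
                grad_log = (1.0 if k == a else 0.0) - probs[k]
                grad[k] += grad_log * ret / batch_size
        # Gradient ascent on the estimated policy gradient
        theta = [theta[k] + learning_rate * grad[k] for k in range(2)]
    return softmax(theta)

final_probs = reinforce()
```

After training, the policy puts most of its probability mass on the rewarded action. The number of iterations, the batch size and the learning rate are arbitrary choices for this toy problem.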
Advantages
The policy gradient is model-free.
Works with partially observable problems (POMDP): as the return is computed over complete trajectories, it does not matter whether the states are Markov or not.
Drawbacks
Only for episodic tasks.
The gradient has a high variance: returns may change a lot during learning.
It therefore has a high sample complexity: we need to sample many episodes to correctly estimate the policy gradient.
Strictly on-policy: trajectories must be frequently sampled and immediately used to update the policy.
To reduce the variance of the gradient, one can subtract a baseline b from the return:
\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(s_t, a_t) \, (R(\tau) - b) ]
For a constant baseline b, this does not introduce any bias:
\begin{aligned} \mathbb{E}_{\tau \sim \rho_\theta}[\nabla_\theta \log \rho_\theta (\tau) \, b ] & = \int_\tau \rho_\theta (\tau) \nabla_\theta \log \rho_\theta (\tau) \, b \, d\tau \\ & = \int_\tau \nabla_\theta \rho_\theta (\tau) \, b \, d\tau \\ &= b \, \nabla_\theta \int_\tau \rho_\theta (\tau) \, d\tau \\ &= b \, \nabla_\theta 1 \\ &= 0 \end{aligned}
A popular choice of baseline is the value of the state V^\pi(s_t), which depends on the state but not on the action:
\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(s_t, a_t) \, (R(\tau) - V^\pi(s_t)) ]
As in dueling networks, it reduces the variance of the returns.
Problem: the value of each state has to be learned separately (see actor-critic architectures).
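The effect of a baseline can be verified exactly on a small example. The sketch below assumes a stateless softmax policy over three actions with made-up returns that are all large and positive, and uses the expected return as the baseline (a stand-in for a learned state value). It computes the expected gradient and the variance of the single-sample gradient by enumeration, with and without the baseline.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_softmax(probs, a):
    """Gradient of log pi(a) w.r.t. the logits: one-hot(a) - probs."""
    return [(1.0 if k == a else 0.0) - probs[k] for k in range(len(probs))]

def expected_gradient(probs, returns, baseline):
    n = len(probs)
    grad = [0.0] * n
    for a in range(n):
        g = grad_log_softmax(probs, a)
        for k in range(n):
            grad[k] += probs[a] * g[k] * (returns[a] - baseline)
    return grad

def gradient_variance(probs, returns, baseline):
    """Variance of the single-sample estimate, summed over the components."""
    n = len(probs)
    mean = expected_gradient(probs, returns, baseline)
    var = 0.0
    for a in range(n):
        g = grad_log_softmax(probs, a)
        for k in range(n):
            sample = g[k] * (returns[a] - baseline)
            var += probs[a] * (sample - mean[k]) ** 2
    return var

probs = softmax([0.2, -0.1, 0.4])
returns = [10.0, 11.0, 12.0]  # hypothetical returns, all large and positive
baseline = sum(p * r for p, r in zip(probs, returns))  # expected return

g_plain = expected_gradient(probs, returns, 0.0)
g_base = expected_gradient(probs, returns, baseline)
var_plain = gradient_variance(probs, returns, 0.0)
var_base = gradient_variance(probs, returns, baseline)
```

The two expected gradients coincide, while the variance with the baseline is much smaller: only the differences between the returns matter, not their common offset.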
REINFORCE with baseline can be used to allocate resources (CPU cores, memory, etc) when scheduling jobs on a cloud of compute servers.
The policy is approximated by a shallow NN (one hidden layer with 20 neurons).
The state space is the current occupancy of the cluster as well as the job waiting list.
The action space is sending a job to a particular resource.
The reward is the negative job slowdown: how much longer the job needs to complete compared to the optimal case.
The resulting scheduler, DeepRM, is reported to perform comparably to or better than the heuristic job schedulers it is compared with.
Mao et al. (2016) Resource Management with Deep Reinforcement Learning. HotNets ’16 doi:10.1145/3005745.3005750.
Sutton et al. (1999) Policy gradient methods for reinforcement learning with function approximation. NIPS.
\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(s_t, a_t) \, R(\tau) ] = \mathbb{E}_{\tau \sim \rho_\theta}[\sum_{t=0}^T (\nabla_\theta \log \pi_\theta(s_t, a_t)) \, (\sum_{t'=0}^T \gamma^{t'} \, r_{t'+1}) ]
R(\tau) = \sum_{t'=0}^T \gamma^{t'} \, r_{t'+1}
The causality principle states that the rewards obtained before time t are not caused by the action a_t taken at time t, so they should not influence its gradient.
The policy gradient can be rewritten as:
\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(s_t, a_t) \, (\sum_{t'=t}^T \gamma^{t' - t} \, r_{t'+1}) ] = \mathbb{E}_{\tau \sim \rho_\theta}[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(s_t, a_t) \, R_t ]
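The returns R_t of all steps of an episode can be computed in a single backward pass, since R_t = r_{t+1} + \gamma \, R_{t+1}. A minimal sketch, assuming the rewards of one finished episode are stored in a list where `rewards[t]` holds r_{t+1}:

```python
def rewards_to_go(rewards, gamma):
    """Return the list of R_t, where rewards[t] holds r_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards through the episode: R_t = r_{t+1} + gamma * R_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

episode_rewards = [1.0, 1.0, 1.0]  # hypothetical three-step episode
returns = rewards_to_go(episode_rewards, gamma=0.5)
# returns == [1.75, 1.5, 1.0]
```

Each action is then only credited with the rewards that follow it, instead of the full return R(\tau).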
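\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(s_t, a_t) \, R_t ]
As the Q-value of an action is defined as the expectation of the return R_t obtained after taking that action:
Q^\pi(s, a) = \mathbb{E}_\pi [R_t | s_t =s; a_t =a]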
we can replace R_t with Q^{\pi_\theta}(s_t, a_t) without introducing any bias:
\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(s_t, a_t) \, Q^{\pi_\theta}(s_t, a_t) ]
\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(s_t, a_t) \, Q^{\pi_\theta}(s_t, a_t) ]
However, \nabla_\theta \log \pi_\theta(s_t, a_t) \, Q^{\pi_\theta}(s_t, a_t) now only depends on (s_t, a_t), not the future nor the past.
The steps of the episode are now independent of each other (if we have the Markov property).
We can then sample single transitions instead of complete episodes:
\nabla_\theta \mathcal{J}(\theta) \propto \mathbb{E}_{s \sim \rho_\theta, a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, Q^{\pi_\theta}(s, a) ]
For any MDP, the policy gradient is:
g = \nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{s \sim \rho_\theta, a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, Q^{\pi_\theta}(s, a) ]
\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{s \sim \rho_\theta, a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, Q_\varphi(s, a) ]
Q_\varphi(s, a) \approx Q^{\pi_\theta}(s, a) \; \forall s, a
\mathcal{L}(\varphi) = \mathbb{E}_{s \sim \rho_\theta, a \sim \pi_\theta}[(Q^{\pi_\theta}(s, a) - Q_\varphi(s, a))^2]
We obtain an actor-critic architecture:
the actor \pi_\theta(s, a) implements the policy and selects an action a in a state s.
the critic Q_\varphi(s, a) estimates the value of that action and drives learning in the actor.
But how to train the critic? We do not know Q^{\pi_\theta}(s, a).
As always, we can estimate it through sampling:
Monte Carlo critic: sampling the complete return.
\mathcal{L}(\varphi) = \mathbb{E}_{s \sim \rho_\theta, a \sim \pi_\theta}[(R(s, a) - Q_\varphi(s, a))^2]
SARSA critic: sampling (s, a, r, s', a') transitions.
\mathcal{L}(\varphi) = \mathbb{E}_{s, s' \sim \rho_\theta, a, a' \sim \pi_\theta}[(r + \gamma \, Q_\varphi(s', a') - Q_\varphi(s, a))^2]
Q-learning critic: sampling (s, a, r, s') transitions.
\mathcal{L}(\varphi) = \mathbb{E}_{s, s' \sim \rho_\theta, a \sim \pi_\theta}[(r + \gamma \, \max_{a'} Q_\varphi(s', a') - Q_\varphi(s, a))^2]
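The three losses above differ only in the target that the critic regresses towards: a sampled return (Monte Carlo), a bootstrapped target using the next sampled action (SARSA), or a bootstrapped target using the greedy next action (Q-learning). The sketch below computes these targets for a single made-up transition; the critic outputs are plain numbers standing in for Q_\varphi, and all function names are illustrative.

```python
def monte_carlo_target(rewards, gamma):
    """Sampled return from the current step: sum_k gamma^k * r_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def sarsa_target(r, gamma, q_next, a_next):
    """Bootstrap with the action a' actually sampled in the next state."""
    return r + gamma * q_next[a_next]

def q_learning_target(r, gamma, q_next):
    """Bootstrap with the greedy action in the next state."""
    return r + gamma * max(q_next)

def squared_error(target, q_value):
    """Per-sample critic loss for a given target."""
    return (target - q_value) ** 2

gamma = 0.9
q_next = [2.0, 3.0]  # hypothetical critic outputs Q_phi(s', .) for two actions
q_current = 2.5      # hypothetical critic output Q_phi(s, a)

mc = monte_carlo_target([1.0, 0.0, 2.0], gamma)     # 1 + 0.9 * 0 + 0.81 * 2
sarsa = sarsa_target(1.0, gamma, q_next, a_next=0)  # 1 + 0.9 * 2
qlearn = q_learning_target(1.0, gamma, q_next)      # 1 + 0.9 * 3
losses = [squared_error(t, q_current) for t in (mc, sarsa, qlearn)]
```

In a real implementation the targets are treated as constants when differentiating the loss with respect to \varphi, which this numeric sketch does not need to handle.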
As with REINFORCE, the PG actor suffers from the high variance of the Q-values.
It is possible to use a baseline in the PG without introducing a bias:
\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{s \sim \rho_\theta, a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, (Q^{\pi_\theta}(s, a) -b)]
\begin{aligned} \nabla_\theta \mathcal{J}(\theta) &= \mathbb{E}_{s \sim \rho_\theta, a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, (Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s))] \\ &\\ &= \mathbb{E}_{s \sim \rho_\theta, a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, A^{\pi_\theta}(s, a)] \\ \end{aligned}
The critic can either:
learn to approximate both Q^{\pi_\theta}(s, a) and V^{\pi_\theta}(s) with two different NN (SAC).
replace one of them with a sampling estimate (A3C, DDPG)
learn the advantage A^{\pi_\theta}(s, a) directly (GAE, PPO)
\nabla_\theta J(\theta) = \mathbb{E}_{s_t \sim \rho_\theta, a_t \sim \pi_\theta}[\nabla_\theta \log \pi_\theta (s_t, a_t) \, \psi_t ]
where:
\psi_t = R_t is the REINFORCE algorithm (MC sampling).
\psi_t = R_t - b is the REINFORCE with baseline algorithm.
\psi_t = Q^\pi(s_t, a_t) is the policy gradient theorem.
\psi_t = A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t) is the advantage actor-critic.
\psi_t = r_{t+1} + \gamma \, V^\pi(s_{t+1}) - V^\pi(s_t) is the TD actor-critic.
\psi_t = \displaystyle\sum_{k=0}^{n-1} \gamma^{k} \, r_{t+k+1} + \gamma^n \, V^\pi(s_{t+n}) - V^\pi(s_t) is the n-step advantage.
and many others…
\nabla_\theta J(\theta) = \mathbb{E}_{s_t \sim \rho_\theta, a_t \sim \pi_\theta}[\nabla_\theta \log \pi_\theta (s_t, a_t) \, \psi_t ]
the more \psi_t relies on sampled rewards (e.g. R_t), the more the gradient will be correct on average (small bias), but the more it will vary (high variance).
the more \psi_t relies on estimations (e.g. the TD error), the more stable the gradient (small variance), but the more incorrect it is (high bias).
A^n_t = \sum_{k=0}^{n-1} \gamma^{k} \, r_{t+k+1} + \gamma^n \, V(s_{t+n}) - V (s_t)
can be written as a function of the TD errors \delta_{t+l} = r_{t+l+1} + \gamma \, V(s_{t+l+1}) - V(s_{t+l}) of the next n transitions:
A^{n}_t = \sum_{l=0}^{n-1} \gamma^l \, \delta_{t+l}
Proof with n=2:
\begin{aligned} A^2_t &= r_{t+1} + \gamma \, r_{t+2} + \gamma^2 \, V(s_{t+2}) - V(s_{t}) \\ &\\ &= (r_{t+1} - V(s_t)) + \gamma \, (r_{t+2} + \gamma \, V(s_{t+2}) ) \\ &\\ &= (r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t)) + \gamma \, (r_{t+2} + \gamma \, V(s_{t+2}) - V(s_{t+1})) \\ &\\ &= \delta_t + \gamma \, \delta_{t+1} \end{aligned}
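The same telescoping argument holds for any n, which can be checked numerically. The sketch below uses made-up rewards and value estimates, with `rewards[k]` holding r_{k+1} and `values[k]` holding V(s_k), and compares the direct n-step advantage with the discounted sum of TD errors.

```python
def n_step_advantage(rewards, values, t, n, gamma):
    """Direct definition: discounted rewards + bootstrapped value - V(s_t)."""
    discounted = sum((gamma ** k) * rewards[t + k] for k in range(n))
    return discounted + (gamma ** n) * values[t + n] - values[t]

def td_error(rewards, values, t, gamma):
    """delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)."""
    return rewards[t] + gamma * values[t + 1] - values[t]

def sum_of_td_errors(rewards, values, t, n, gamma):
    """Discounted sum of the TD errors of the next n transitions."""
    return sum((gamma ** l) * td_error(rewards, values, t + l, gamma)
               for l in range(n))

rewards = [1.0, 0.0, 2.0, -1.0]      # hypothetical r_1 .. r_4
values = [0.5, 1.0, -0.3, 0.8, 0.2]  # hypothetical V(s_0) .. V(s_4)
gamma = 0.9

direct = [n_step_advantage(rewards, values, 0, n, gamma) for n in (1, 2, 3, 4)]
telescoped = [sum_of_td_errors(rewards, values, 0, n, gamma) for n in (1, 2, 3, 4)]
```

The intermediate values V(s_{t+1}), ..., V(s_{t+n-1}) cancel out in the sum, so both lists agree up to rounding errors.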
Schulman et al. (2015) High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438.
A^n_t = \sum_{k=0}^{n-1} \gamma^{k} \, r_{t+k+1} + \gamma^n \, V(s_{t+n}) - V (s_t)
The generalized advantage estimate (GAE) is an exponentially weighted average of all n-step advantages:
A_t^{\text{GAE}(\gamma, \lambda)} = (1 - \lambda) \sum_{n=1}^\infty \lambda^{n-1} \, A^n_t
This is just a forward eligibility trace over distant n-step advantages: the 1-step advantage is weighted more than the 1000-step advantage, which has too much variance.
We can show that the GAE can be expressed as a function of the future 1-step TD errors: A_t^{\text{GAE}(\gamma, \lambda)} = \sum_{k=0}^\infty (\gamma \, \lambda)^k \, \delta_{t+k}
A_t^{\text{GAE}(\gamma, \lambda)} = (1 - \lambda) \sum_{n=1}^\infty \lambda^{n-1} \, A^n_t = \sum_{k=0}^\infty (\gamma \, \lambda)^k \, \delta_{t+k}
The parameter \lambda controls the bias-variance trade-off.
When \lambda=0, the generalized advantage is the TD error:
A_t^{\text{GAE}(\gamma, 0)} = r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t) = \delta_{t}
When \lambda=1, the generalized advantage is the MC advantage:
A_t^{\text{GAE}(\gamma, 1)} = \sum_{k=0}^\infty \gamma^k \, r_{t+k+1} - V(s_t) = R_t - V(s_t)
Any value in between controls the bias-variance trade-off: from the high bias / low variance of TD to the small bias / high variance of MC.
In practice, it leads to a better estimation than n-step advantages, but is more computationally expensive.
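For a finite episode, the sum of discounted TD errors is often computed with a backward recursion, A_t = \delta_t + \gamma \, \lambda \, A_{t+1}. The sketch below is one such implementation under stated assumptions: the episode has terminated, TD errors after the last step count as zero, `rewards[t]` holds r_{t+1} and `values` holds V(s_0) to V(s_T) with a terminal value of 0. The numbers are made up.

```python
def td_errors(rewards, values, gamma):
    """delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t) for each step."""
    return [rewards[t] + gamma * values[t + 1] - values[t]
            for t in range(len(rewards))]

def generalized_advantages(rewards, values, gamma, lam):
    """GAE for a finite episode via A_t = delta_t + gamma * lam * A_{t+1}."""
    deltas = td_errors(rewards, values, gamma)
    advantages = [0.0] * len(rewards)
    running = 0.0  # TD errors past the end of the episode count as zero
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [1.0, 0.0, 2.0]        # hypothetical r_1 .. r_3
values = [0.5, 1.0, -0.3, 0.0]   # hypothetical V(s_0) .. V(s_3), terminal value 0
gamma = 0.9

gae_td = generalized_advantages(rewards, values, gamma, lam=0.0)  # TD errors
gae_mc = generalized_advantages(rewards, values, gamma, lam=1.0)  # R_t - V(s_t)
```

With `lam=0.0` the result is the list of TD errors, and with `lam=1.0` it is the Monte Carlo advantage R_t - V(s_t) of each step, matching the two limit cases above.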