Deep Reinforcement Learning
Policy gradient
Julien Vitay
Professur für Künstliche Intelligenz - Fakultät für Informatik
Policy search
Instead of learning the Q-values, one can directly approximate the policy $\pi_\theta(s, a)$ with a neural network.
$\pi_\theta(s, a)$ is called a parameterized policy: it depends directly on the parameters $\theta$ of the NN.
For discrete action spaces, the output of the NN can be a softmax layer, directly giving the probability of selecting an action.
For continuous action spaces, the output layer can directly control the effector (joint angles).
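The discrete-action case above can be sketched with a softmax over per-action scores. This is only an illustrative stand-in for the NN's output layer: a single linear scoring layer, with made-up shapes.

```python
import numpy as np

def softmax_policy(theta, state):
    """Return pi_theta(s, a) for each discrete action: one linear score per
    action, followed by a softmax (stand-in for the NN's output layer)."""
    logits = theta @ state            # one logit per action
    logits -= logits.max()            # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))       # 3 actions, 4 state features
state = rng.normal(size=4)
pi = softmax_policy(theta, state)     # a valid probability distribution
```

The output is directly the probability of selecting each action, so sampling from it gives a stochastic policy.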
- Parameterized policies can represent continuous policies and avoid the curse of dimensionality.
- Policy search methods aim at maximizing directly the expected return over all possible trajectories (episodes) τ=(s0,a0,…,sT,aT)
$$J(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}[R(\tau)] = \int_\tau \rho_\theta(\tau) \, R(\tau) \, d\tau$$
- To maximize this objective function, all trajectories $\tau$ selected by the policy $\pi_\theta$ should be associated with a high return $R(\tau)$.
- $\rho_\theta(\tau)$ is the likelihood of the trajectory $\tau$ under the policy $\pi_\theta$.
- This means that the optimal policy should only select actions that maximize the expected return: exactly what we want.
- Objective function to be maximized:
$$J(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}[R(\tau)] = \int_\tau \rho_\theta(\tau) \, R(\tau) \, d\tau$$
- The objective function is however not model-free, as the likelihood of a trajectory depends on the environment's dynamics:
$$\rho_\theta(\tau) = p_\theta(s_0, a_0, \ldots, s_T) = p_0(s_0) \, \prod_{t=0}^{T} \pi_\theta(s_t, a_t) \, p(s_{t+1} | s_t, a_t)$$
- The expectation itself can nevertheless be estimated by Monte Carlo sampling of $M$ trajectories:
$$J(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}[R(\tau)] \approx \frac{1}{M} \sum_{i=1}^{M} R(\tau_i)$$
Policy gradient
- All we need to find is a computable gradient ∇θJ(θ) to apply gradient ascent and backpropagation.
$$\Delta \theta = \eta \, \nabla_\theta J(\theta)$$
- Policy Gradient (PG) methods only try to estimate this gradient, but do not care about the objective function itself…
$$g = \nabla_\theta J(\theta)$$
- In particular, any function J′(θ) whose gradient is locally the same (or has the same direction) will do:
$$J'(\theta) = \alpha \, J(\theta) + \beta \Rightarrow \nabla_\theta J'(\theta) \propto \nabla_\theta J(\theta) \Rightarrow \Delta \theta = \eta \, \nabla_\theta J'(\theta)$$
This is called surrogate optimization: we actually want to maximize J(θ) but we cannot compute it.
We instead create a surrogate objective J′(θ) which is locally the same as J(θ) and tractable.
REINFORCE
- The REINFORCE algorithm (Williams, 1992) proposes an unbiased estimate of the policy gradient:
$$\nabla_\theta J(\theta) = \nabla_\theta \int_\tau \rho_\theta(\tau) \, R(\tau) \, d\tau = \int_\tau (\nabla_\theta \rho_\theta(\tau)) \, R(\tau) \, d\tau$$
by noting that the return of a trajectory does not depend on the weights θ (the agent only controls its actions, not the environment).
- We now use the log-trick, a simple identity based on the fact that:
$$\frac{d \log f(x)}{dx} = \frac{f'(x)}{f(x)}$$
or:
$$f'(x) = f(x) \times \frac{d \log f(x)}{dx}$$
to rewrite the gradient of the likelihood of a single trajectory:
$$\nabla_\theta \rho_\theta(\tau) = \rho_\theta(\tau) \times \nabla_\theta \log \rho_\theta(\tau)$$
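The log-trick is easy to verify numerically, here with central finite differences on the arbitrary positive function $f(x) = x^2 + 1$ (the function, point and step size are illustrative choices):

```python
import numpy as np

# Numerical check of the log-trick f'(x) = f(x) * (log f)'(x),
# on the arbitrary positive function f(x) = x^2 + 1.
f = lambda x: x**2 + 1.0
x, h = 1.3, 1e-6

f_prime = (f(x + h) - f(x - h)) / (2 * h)                    # direct f'(x)
log_prime = (np.log(f(x + h)) - np.log(f(x - h))) / (2 * h)  # (log f)'(x)

print(f_prime, f(x) * log_prime)   # both are close to f'(1.3) = 2.6
```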
- The policy gradient becomes:
$$\nabla_\theta J(\theta) = \int_\tau (\nabla_\theta \rho_\theta(\tau)) \, R(\tau) \, d\tau = \int_\tau \rho_\theta(\tau) \, \nabla_\theta \log \rho_\theta(\tau) \, R(\tau) \, d\tau$$
which now has the form of a mathematical expectation:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}[\nabla_\theta \log \rho_\theta(\tau) \, R(\tau)]$$
- The policy gradient is, in expectation, the gradient of the log-likelihood of a trajectory multiplied by its return.
- The advantage of REINFORCE is that it is model-free:
$$\rho_\theta(\tau) = p_\theta(s_0, a_0, \ldots, s_T) = p_0(s_0) \, \prod_{t=0}^{T} \pi_\theta(s_t, a_t) \, p(s_{t+1} | s_t, a_t)$$
$$\log \rho_\theta(\tau) = \log p_0(s_0) + \sum_{t=0}^{T} \log \pi_\theta(s_t, a_t) + \sum_{t=0}^{T} \log p(s_{t+1} | s_t, a_t)$$
$$\nabla_\theta \log \rho_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(s_t, a_t)$$
The transition dynamics $p(s_{t+1} | s_t, a_t)$ disappear from the gradient.
The Policy Gradient does not depend on the dynamics of the environment:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(s_t, a_t) \, R(\tau)\right]$$
REINFORCE algorithm
The REINFORCE algorithm is a policy-based variant of Monte Carlo control:

- while not converged:
    - Sample $M$ trajectories $\{\tau_i\}$ using the current policy $\pi_\theta$ and observe the returns $\{R(\tau_i)\}$.
    - Estimate the policy gradient as an average over the trajectories:
    $$\nabla_\theta J(\theta) \approx \frac{1}{M} \sum_{i=1}^{M} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(s_t, a_t) \, R(\tau_i)$$
    - Update the policy using gradient ascent:
    $$\theta \leftarrow \theta + \eta \, \nabla_\theta J(\theta)$$
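The loop above can be sketched on the smallest possible task, a two-armed bandit where each episode is a single action. The reward distributions, hyperparameters and the softmax policy with its analytical score $\nabla_\theta \log \pi_\theta(a) = \text{one-hot}(a) - \pi$ are all illustrative choices, not part of the original algorithm statement:

```python
import numpy as np

rng = np.random.default_rng(42)
theta = np.zeros(2)              # one logit per action (two-armed bandit)
eta, M = 0.1, 10                 # learning rate, trajectories per update

def policy(theta):
    """Softmax over the logits theta."""
    p = np.exp(theta - theta.max())
    return p / p.sum()

for _ in range(200):                             # "while not converged"
    grad = np.zeros(2)
    for _ in range(M):                           # sample M length-1 episodes
        pi = policy(theta)
        a = rng.choice(2, p=pi)
        R = rng.normal(1.0 if a == 0 else 0.2, 0.1)   # return of the episode
        score = -pi                              # grad of log pi(a) wrt theta:
        score[a] += 1.0                          #   one_hot(a) - pi
        grad += score * R
    theta += eta * grad / M                      # gradient ascent step

print(policy(theta))             # the policy ends up preferring action 0
```

Action 0 yields a higher return on average, so its log-likelihood is pushed up more strongly at each update.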
REINFORCE
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(s_t, a_t) \, R(\tau)\right]$$
Advantages
The policy gradient is model-free.
Works with partially observable problems (POMDP): as the return is computed over complete trajectories, it does not matter whether the states are Markov or not.
Drawbacks
Only for episodic tasks.
The gradient has a high variance: returns may change a lot during learning.
It has therefore a high sample complexity: we need to sample many episodes to correctly estimate the policy gradient.
Strictly on-policy: trajectories must be frequently sampled and immediately used to update the policy.
REINFORCE with baseline
- To reduce the variance of the estimated gradient, a baseline is often subtracted from the return:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(s_t, a_t) \, (R(\tau) - b)\right]$$
- As long as the baseline $b$ is independent of $\theta$, it does not introduce any bias:
$$\begin{aligned}
\mathbb{E}_{\tau \sim \rho_\theta}[\nabla_\theta \log \rho_\theta(\tau) \, b] &= \int_\tau \rho_\theta(\tau) \, \nabla_\theta \log \rho_\theta(\tau) \, b \, d\tau = \int_\tau \nabla_\theta \rho_\theta(\tau) \, b \, d\tau \\
&= b \, \nabla_\theta \int_\tau \rho_\theta(\tau) \, d\tau = b \, \nabla_\theta 1 = 0
\end{aligned}$$
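A quick numerical illustration of this, with entirely made-up numbers (a single state, two actions, a fixed softmax policy, and returns sharing a large common offset): the baseline leaves the mean of the estimator unchanged but shrinks its variance dramatically.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.7])            # fixed policy over two actions
returns = np.array([11.0, 10.0])     # returns with a large common offset

def grad_sample(baseline):
    """One sample of the score-function gradient estimator with a baseline."""
    a = rng.choice(2, p=pi)
    score = -pi.copy()
    score[a] += 1.0                  # grad of log pi(a) for softmax logits
    return score * (returns[a] - baseline)

no_b = np.array([grad_sample(0.0) for _ in range(20000)])
with_b = np.array([grad_sample(returns @ pi) for _ in range(20000)])

print(no_b.mean(axis=0), with_b.mean(axis=0))  # nearly identical means
print(no_b.var(axis=0), with_b.var(axis=0))    # far smaller variance with b
```

Here the baseline is the average return under the policy, i.e. the value of the single state.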
- In practice, a baseline that works well is the value of the encountered states:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(s_t, a_t) \, (R(\tau) - V^\pi(s_t))\right]$$
- $R(\tau) - V^\pi(s_t)$ becomes the advantage of the action $a_t$ in $s_t$: how much more return it provides compared to what can be expected on average in $s_t$:
As in dueling networks, it reduces the variance of the returns.
Problem: the value of each state has to be learned separately (see actor-critic architectures).
Application of REINFORCE to resource management
REINFORCE with baseline can be used to allocate resources (CPU cores, memory, etc) when scheduling jobs on a cloud of compute servers.
The policy is approximated by a shallow NN (one hidden layer with 20 neurons).
The state space is the current occupancy of the cluster as well as the job waiting list.
The action space is sending a job to a particular resource.
The reward is the negative job slowdown: how much longer the job needs to complete compared to the optimal case.
In this task, DeepRM outperforms the alternative job schedulers it was compared to.
3 - Policy Gradient Theorem
- The REINFORCE gradient estimate is the following:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(s_t, a_t) \, R(\tau)\right] = \mathbb{E}_{\tau \sim \rho_\theta}\left[\sum_{t=0}^{T} (\nabla_\theta \log \pi_\theta(s_t, a_t)) \left(\sum_{t'=0}^{T} \gamma^{t'} \, r_{t'+1}\right)\right]$$
- For each state-action pair (st,at) encountered during the episode, the gradient of the log-likelihood of the policy is multiplied by the complete return of the episode:
$$R(\tau) = \sum_{t'=0}^{T} \gamma^{t'} \, r_{t'+1}$$
- As the action $a_t$ cannot influence the rewards received before time $t$ (causality), these past rewards can be removed from the sum without introducing any bias:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(s_t, a_t) \left(\sum_{t'=t}^{T} \gamma^{t'-t} \, r_{t'+1}\right)\right] = \mathbb{E}_{\tau \sim \rho_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(s_t, a_t) \, R_t\right]$$
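The reward-to-go terms $R_t$ of an episode can be computed in a single backward pass over its rewards, as in this small helper sketch (function name and inputs are illustrative):

```python
import numpy as np

def rewards_to_go(rewards, gamma):
    """Discounted return R_t from each step t to the end of the episode."""
    R = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):   # accumulate from the last step
        running = rewards[t] + gamma * running
        R[t] = running
    return R

print(rewards_to_go([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]
```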
- The return at time t (reward-to-go) multiplies the gradient of the log-likelihood of the policy (the score) for each transition in the episode:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(s_t, a_t) \, R_t\right]$$
- As the Q-value of the action $a_t$ in $s_t$ is by definition the expectation of $R_t$:
$$Q^\pi(s, a) = \mathbb{E}_\pi[R_t | s_t = s; a_t = a]$$
we can replace $R_t$ with $Q^{\pi_\theta}(s_t, a_t)$ without introducing any bias:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(s_t, a_t) \, Q^{\pi_\theta}(s_t, a_t)\right]$$
- This is true on average (no bias if the Q-value estimates are correct) and has a much lower variance!
- The policy gradient is defined over complete trajectories:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(s_t, a_t) \, Q^{\pi_\theta}(s_t, a_t)\right]$$
However, $\nabla_\theta \log \pi_\theta(s_t, a_t) \, Q^{\pi_\theta}(s_t, a_t)$ now depends only on $(s_t, a_t)$, not on the past nor the future of the episode.
Each step of the episode is therefore independent of the others (given the Markov property).
We can then sample single transitions instead of complete episodes:
$$\nabla_\theta J(\theta) \propto \mathbb{E}_{s \sim \rho_\theta, a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, Q^{\pi_\theta}(s, a)]$$
- Note that this is not exactly the gradient of $J(\theta)$: as the expectation is computed over single transitions instead of complete episodes, its magnitude changes, but both gradients point in the same direction!
Policy Gradient Theorem
For any MDP, the policy gradient is:
$$g = \nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho_\theta, a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, Q^{\pi_\theta}(s, a)]$$
Policy Gradient Theorem with function approximation
- Better yet, Sutton et al. (1999) showed that we can replace the true Q-value $Q^{\pi_\theta}(s, a)$ with an estimate $Q_\varphi(s, a)$, as long as this estimate is unbiased:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho_\theta, a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, Q_\varphi(s, a)]$$
$$Q_\varphi(s, a) \approx Q^{\pi_\theta}(s, a) \quad \forall s, a$$
- The approximated Q-values can for example minimize the mean square error with the true Q-values:
$$\mathcal{L}(\varphi) = \mathbb{E}_{s \sim \rho_\theta, a \sim \pi_\theta}[(Q^{\pi_\theta}(s, a) - Q_\varphi(s, a))^2]$$
Policy Gradient : Actor-critic
But how do we train the critic? We do not know $Q^{\pi_\theta}(s, a)$.
As always, we can estimate it through sampling:
- Monte Carlo critic: sampling the complete episode.
$$\mathcal{L}(\varphi) = \mathbb{E}_{s \sim \rho_\theta, a \sim \pi_\theta}[(R(s, a) - Q_\varphi(s, a))^2]$$
- SARSA critic: sampling (s,a,r,s′,a′) transitions.
$$\mathcal{L}(\varphi) = \mathbb{E}_{s, s' \sim \rho_\theta, a, a' \sim \pi_\theta}[(r + \gamma \, Q_\varphi(s', a') - Q_\varphi(s, a))^2]$$
- Q-learning critic: sampling (s,a,r,s′) transitions.
$$\mathcal{L}(\varphi) = \mathbb{E}_{s, s' \sim \rho_\theta, a \sim \pi_\theta}[(r + \gamma \, \max_{a'} Q_\varphi(s', a') - Q_\varphi(s, a))^2]$$
Policy Gradient : reducing the variance
As with REINFORCE, the PG actor suffers from the high variance of the Q-values.
It is possible to use a baseline in the PG without introducing a bias:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho_\theta, a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, (Q^{\pi_\theta}(s, a) - b)]$$
- In particular, the advantage actor-critic uses the value of a state as the baseline:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho_\theta, a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, (Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s))] = \mathbb{E}_{s \sim \rho_\theta, a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, A^{\pi_\theta}(s, a)]$$
The critic can either:

- learn to approximate both $Q^{\pi_\theta}(s, a)$ and $V^{\pi_\theta}(s)$ with two different NNs (SAC).
- replace one of them with a sampling estimate (A3C, DDPG).
- learn the advantage $A^{\pi_\theta}(s, a)$ directly (GAE, PPO).
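To make the actor-critic interplay concrete, here is a minimal tabular sketch where the critic's TD error drives both its own update and the actor's. The task (a single state, two actions, episodes of length 1), the hyperparameters and learning rates are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.zeros(2)        # actor: one logit per action
V = 0.0                    # critic: value of the single state
eta, alpha = 0.2, 0.1      # actor / critic learning rates

for _ in range(500):
    pi = np.exp(theta - theta.max()); pi /= pi.sum()  # softmax policy
    a = rng.choice(2, p=pi)
    r = 1.0 if a == 0 else 0.0            # action 0 is the rewarding one
    delta = r - V                         # TD error (terminal state: V(s')=0)
    score = -pi.copy(); score[a] += 1.0   # grad of log pi(a)
    theta += eta * score * delta          # actor: policy gradient, psi = delta
    V += alpha * delta                    # critic: TD(0) update
```

The critic's value converges to the expected reward under the current policy, so the TD error acts as a sampled advantage: positive for the good action, negative for the bad one.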
Many variants of the Policy Gradient
- Policy Gradient methods can take many forms:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s_t \sim \rho_\theta, a_t \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s_t, a_t) \, \psi_t]$$
where:
- $\psi_t = R_t$ is the REINFORCE algorithm (MC sampling).
- $\psi_t = R_t - b$ is the REINFORCE with baseline algorithm.
- $\psi_t = Q^\pi(s_t, a_t)$ is the policy gradient theorem.
- $\psi_t = A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$ is the advantage actor-critic.
- $\psi_t = r_{t+1} + \gamma \, V^\pi(s_{t+1}) - V^\pi(s_t)$ is the TD actor-critic.
- $\psi_t = \sum_{k=0}^{n-1} \gamma^k \, r_{t+k+1} + \gamma^n \, V^\pi(s_{t+n}) - V^\pi(s_t)$ is the n-step advantage.
and many others…
Bias and variance of Policy Gradient methods
- The different variants of PG deal with the bias/variance trade-off.
$$\nabla_\theta J(\theta) = \mathbb{E}_{s_t \sim \rho_\theta, a_t \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s_t, a_t) \, \psi_t]$$
- The more $\psi_t$ relies on sampled rewards (e.g. $R_t$), the more correct the gradient is on average (small bias), but the more it varies between samples (high variance).
    - This increases the sample complexity: we need to average more samples to correctly estimate the gradient.
- The more $\psi_t$ relies on estimates (e.g. the TD error), the more stable the gradient is (small variance), but the more incorrect it is (high bias).
    - This can lead to suboptimal policies, i.e. local optima of the objective function.
- All the methods we will see in the rest of the course are attempts at finding the best trade-off.
4 - Generalized advantage estimation
Generalized advantage estimation (GAE)
- The n-step advantage at time $t$:
$$A_t^n = \sum_{k=0}^{n-1} \gamma^k \, r_{t+k+1} + \gamma^n \, V(s_{t+n}) - V(s_t)$$
can be written as a function of the TD errors of the next $n$ transitions:
$$A_t^n = \sum_{l=0}^{n-1} \gamma^l \, \delta_{t+l}$$
- For example, with $n = 2$:
$$\begin{aligned}
A_t^2 &= r_{t+1} + \gamma \, r_{t+2} + \gamma^2 \, V(s_{t+2}) - V(s_t) \\
&= (r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t)) + \gamma \, (r_{t+2} + \gamma \, V(s_{t+2}) - V(s_{t+1})) \\
&= \delta_t + \gamma \, \delta_{t+1}
\end{aligned}$$
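The identity between the n-step advantage and the discounted sum of TD errors can be checked numerically on random rewards and values ($\gamma$, $n$ and the array sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n = 0.9, 3
r = rng.normal(size=10)      # r[k] stands for r_{t+k+1}, with t = 0
V = rng.normal(size=11)      # V[k] stands for V(s_{t+k})

# n-step advantage, directly from its definition
A_n = sum(gamma**k * r[k] for k in range(n)) + gamma**n * V[n] - V[0]

# same quantity as a discounted sum of 1-step TD errors
delta = [r[l] + gamma * V[l + 1] - V[l] for l in range(n)]
A_td = sum(gamma**l * delta[l] for l in range(n))

print(abs(A_n - A_td))       # ~0: the two expressions coincide
```

The equality holds because the intermediate value terms telescope away in the sum of TD errors.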
- The n-step advantage realizes a bias/variance trade-off, but which value of n should we choose?
$$A_t^n = \sum_{k=0}^{n-1} \gamma^k \, r_{t+k+1} + \gamma^n \, V(s_{t+n}) - V(s_t)$$
- Schulman et al. (2015) proposed the generalized advantage estimate (GAE) $A_t^{\text{GAE}(\gamma, \lambda)}$, an exponentially-weighted average of all possible n-step advantages with a discount parameter $\lambda$:
$$A_t^{\text{GAE}(\gamma, \lambda)} = (1 - \lambda) \, \sum_{n=1}^{\infty} \lambda^{n-1} \, A_t^n$$
- This is a forward eligibility trace over the n-step advantages: the 1-step advantage is weighted more than the 1000-step advantage, which has too much variance.
- One can show that the GAE can be expressed as a function of the future 1-step TD errors:
$$A_t^{\text{GAE}(\gamma, \lambda)} = \sum_{k=0}^{\infty} (\gamma \lambda)^k \, \delta_{t+k}$$
- Generalized advantage estimate (GAE):
$$A_t^{\text{GAE}(\gamma, \lambda)} = (1 - \lambda) \, \sum_{n=1}^{\infty} \lambda^{n-1} \, A_t^n = \sum_{k=0}^{\infty} (\gamma \lambda)^k \, \delta_{t+k}$$
The parameter λ controls the bias-variance trade-off.
- When $\lambda = 0$, the generalized advantage reduces to the TD error:
$$A_t^{\text{GAE}(\gamma, 0)} = r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t) = \delta_t$$
- When $\lambda = 1$, the generalized advantage is the MC advantage:
$$A_t^{\text{GAE}(\gamma, 1)} = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1} - V(s_t) = R_t - V(s_t)$$
Any value in between controls the bias-variance trade-off: from the high bias / low variance of TD to the small bias / high variance of MC.
In practice, it leads to a better estimation than n-step advantages, but is more computationally expensive.
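Over a finite episode, the GAE can be computed with a single backward recursion $A_t = \delta_t + \gamma \lambda \, A_{t+1}$, as in this sketch (function name and array shapes are illustrative; `values` carries one extra entry for the state reached after the last transition):

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Generalized advantage estimates over one episode, via the backward
    recursion A_t = delta_t + gamma * lam * A_{t+1}."""
    deltas = rewards + gamma * values[1:] - values[:-1]   # 1-step TD errors
    A = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        A[t] = running
    return A

rng = np.random.default_rng(0)
rewards = rng.normal(size=5)
values = rng.normal(size=6)    # V(s_0), ..., V(s_5)
adv = gae(rewards, values, gamma=0.99, lam=0.95)
```

With $\lambda = 0$ this returns the 1-step TD errors, and with $\lambda = 1$ the MC advantages $R_t - V(s_t)$ (bootstrapped with the final value), matching the two limit cases.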