Policy optimization (TRPO, PPO)

Trust Region Policy Optimization (TRPO)

Principle

Schulman et al. () extended the idea of natural gradients to allow their use with non-linear function approximators (e.g. deep networks), as the previous algorithms only worked efficiently for linear approximators. The proposed algorithm, Trust Region Policy Optimization (TRPO), has now been replaced in practice by Proximal Policy Optimization (PPO, see next section), but the novel ideas it introduced remain important to understand.

Let’s denote the expected return of a policy $\pi$ as:

$$\eta(\pi) = \mathbb{E}_{s \sim \rho_\pi, a \sim \pi}[\sum_{t=0}^\infty \gamma^t \, r(s_t, a_t, s_{t+1})]$$

where $\rho_\pi$ is the discounted visitation frequency distribution (the probability that a state $s$ will be visited at some point in time by the policy $\pi$):

$$\rho_\pi(s) = P(s_0=s) + \gamma \, P(s_1=s) + \gamma^2 \, P(s_2=s) + \ldots$$
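
To make this definition concrete, here is a minimal Python sketch estimating $\rho_\pi$ by Monte Carlo from sampled trajectories, assuming a small discrete state space; the trajectory format and numbers are purely illustrative:

```python
import numpy as np

def discounted_visitation(trajectories, n_states, gamma=0.99):
    """Monte-Carlo estimate of the (unnormalized) discounted visitation
    frequency rho_pi(s), averaged over sampled trajectories."""
    rho = np.zeros(n_states)
    for states in trajectories:       # each trajectory is a list of state indices
        for t, s in enumerate(states):
            rho[s] += gamma ** t      # contributes gamma^t to P(s_t = s)
    return rho / len(trajectories)

# Illustrative usage: two short trajectories in a 4-state MDP
trajectories = [[0, 1, 2, 3], [0, 2, 2, 3]]
print(discounted_visitation(trajectories, n_states=4, gamma=0.9))
```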

Kakade and Langford () had shown that it is possible to relate the expected return of two policies $\pi_\theta$ and $\pi_{\theta_\text{old}}$ using advantages (omitting $\pi$ in the notations):

$$\eta(\theta) = \eta(\theta_\text{old}) + \mathbb{E}_{s \sim \rho_{\pi_\theta}, a \sim \pi_\theta} [A_{\pi_{\theta_\text{old}}}(s, a)]$$

The advantage $A_{\pi_{\theta_\text{old}}}(s, a)$ denotes the change in the expected return obtained after $(s, a)$ when using the new policy $\pi_\theta$, compared to the one obtained with the old policy $\pi_{\theta_\text{old}}$. While this formula seems appealing (it measures how good the new policy is with regard to the average performance of the old policy, so we could optimize it directly), it is difficult to estimate, as the expectation depends on state-action pairs generated by the new policy: $s \sim \rho_{\pi_\theta}, a \sim \pi_\theta$.

Schulman et al. () propose an approximation to this formula: if the two policies $\pi_\theta$ and $\pi_{\theta_\text{old}}$ are not very different from one another, one can sample the states from the old distribution:

η(θ)η(θold)+Esρπθold,aπθ[Aπθold(s,a)] \eta(\theta) \approx \eta(\theta_\text{old}) + \mathbb{E}_{s \sim \rho_{\pi_{\theta_\text{old}}}, a \sim \pi_\theta} [A_{\pi_{\theta_\text{old}}}(s, a)]

One can already recognize the main motivation behind natural gradients: finding a weight update that moves the policy in the right direction (getting more rewards) while keeping the change in the policy distribution as small as possible (to keep the assumption correct).

Let’s now define the following objective function:

$$J_{\theta_\text{old}}(\theta) = \eta(\theta_\text{old}) + \mathbb{E}_{s \sim \rho_{\pi_{\theta_\text{old}}}, a \sim \pi_\theta} [A_{\pi_{\theta_\text{old}}}(s, a)]$$

It is easy to see that $J_{\theta_\text{old}}(\theta_\text{old}) = \eta(\theta_\text{old})$ by definition of the advantage (the expected advantage of a policy over itself is zero), and that its gradient w.r.t. $\theta$ taken at $\theta_\text{old}$ is the same as that of $\eta(\theta)$:

$$\nabla_\theta J_{\theta_\text{old}}(\theta)|_{\theta = \theta_\text{old}} = \nabla_\theta \eta(\theta)|_{\theta = \theta_\text{old}}$$

This means that, at least locally, one maximization step of $J_{\theta_\text{old}}(\theta)$ goes in the same direction as maximizing $\eta(\theta)$, as long as we do not go too far. $J$ is called a surrogate objective function: it is not what we actually want to optimize, but locally it leads to the same result. TRPO belongs to the class of minorization-maximization (MM) algorithms: we first find a local lower bound and then maximize it, iteratively.

Let’s now suppose that we can find its maximum, i.e. a policy $\pi'$ that maximizes the advantage of each state-action pair over $\pi_{\theta_\text{old}}$. There would be no guarantee that $\pi'$ and $\pi_{\theta_\text{old}}$ are close enough for the assumption to hold. We could therefore make only a small step in its direction and hope for the best:

$$\pi_\theta(s, a) = (1-\alpha) \, \pi_{\theta_\text{old}}(s, a) + \alpha \, \pi'(s,a) \tag{16.1}$$

This is the conservative policy iteration method of Kakade and Langford (), where a bound on the difference between $\eta(\pi_{\theta_\text{old}})$ and $J_{\theta_\text{old}}(\theta)$ can be derived.
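
As a quick illustration of the mixture update in Equation (16.1), the following sketch mixes two tabular policies over discrete actions; the probabilities and the value of $\alpha$ are arbitrary:

```python
import numpy as np

def conservative_update(pi_old, pi_prime, alpha=0.1):
    """Small step from pi_old towards pi_prime (Eq. 16.1).
    Both arguments are arrays of action probabilities, so the result
    stays a valid probability distribution."""
    return (1.0 - alpha) * pi_old + alpha * pi_prime

pi_old = np.array([0.5, 0.3, 0.2])      # current policy in some state s
pi_prime = np.array([0.0, 0.0, 1.0])    # greedy policy w.r.t. the advantage
print(conservative_update(pi_old, pi_prime, alpha=0.2))  # [0.4, 0.24, 0.36]
```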

Schulman et al. () propose instead to penalize the objective function with the KL divergence between the new and old policies. There are basically two ways to introduce such a penalty:

  1. Adding a hard constraint on the KL divergence, leading to a constrained optimization problem (where Lagrange methods can be applied):

$$\text{maximize}_\theta \qquad J_{\theta_\text{old}}(\theta) \\ \qquad \text{subject to} \qquad D_{KL}(\pi_{\theta_\text{old}} || \pi_\theta) \leq \delta$$

  2. Regularizing the objective function with the KL divergence:

$$\text{maximize}_\theta \qquad L(\theta) = J_{\theta_\text{old}}(\theta) - C \, D_{KL}(\pi_{\theta_\text{old}} || \pi_\theta)$$

In the first case, we force the KL divergence to stay below a certain threshold $\delta$. In the second case, we penalize solutions that would maximize $J_{\theta_\text{old}}(\theta)$ but would be too different from the previous policy. In both cases, we want to find a policy $\pi_\theta$ maximizing the expected return (the objective), while staying close (in terms of KL divergence) to the current one. Both methods are however sensitive to the choice of the parameters $\delta$ and $C$.

Formally, the KL divergence $D_{KL}(\pi_{\theta_\text{old}} || \pi_\theta)$ should be the maximum KL divergence over the state space:

$$D^\text{max}_{KL}(\pi_{\theta_\text{old}} || \pi_\theta) = \max_s D_{KL}(\pi_{\theta_\text{old}}(s, .) || \pi_\theta(s, .))$$

This maximum KL divergence over the state space would be very hard to compute. Empirical evaluations showed however that it is safe to use the mean KL divergence, or even to sample it:

$$\bar{D}_{KL}(\pi_{\theta_\text{old}} || \pi_\theta) = \mathbb{E}_s [D_{KL}(\pi_{\theta_\text{old}}(s, .) || \pi_\theta(s, .))] \approx \frac{1}{N} \sum_{i=1}^N D_{KL}(\pi_{\theta_\text{old}}(s_i, .) || \pi_\theta(s_i, .))$$
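
For discrete action spaces, this sampled mean KL divergence is easy to compute from the action probabilities of both policies on the visited states. A minimal NumPy sketch (all names and numbers are illustrative):

```python
import numpy as np

def mean_kl(probs_old, probs_new, eps=1e-8):
    """Sampled mean KL divergence between two discrete policies.
    probs_old, probs_new: arrays of shape (N, n_actions), one row per sampled state s_i."""
    kl_per_state = np.sum(probs_old * np.log((probs_old + eps) / (probs_new + eps)), axis=1)
    return kl_per_state.mean()

# Action probabilities of the old and new policies for N=2 sampled states and 3 actions
p_old = np.array([[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]])
p_new = np.array([[0.25, 0.45, 0.3], [0.5, 0.4, 0.1]])
print(mean_kl(p_old, p_new))
```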

Trust regions

Figure 16.1: Graphical illustration of trust regions. From the current parameters $\theta_\text{old}$, we search for the maximum $\theta^*$ of the real objective $\eta(\theta)$. The unconstrained objective $J_{\theta_\text{old}}(\theta)$ is locally similar to $\eta(\theta)$ but quickly diverges from it as $\pi_\theta$ and $\pi_{\theta_\text{old}}$ become very different. The surrogate objective $L(\theta) = J_{\theta_\text{old}}(\theta) - C \, D_{KL}(\pi_{\theta_\text{old}} || \pi_\theta)$ is always smaller than $\eta(\theta)$ and has a maximum close to $\theta^*$ which keeps $\pi_\theta$ and $\pi_{\theta_\text{old}}$ close to each other in terms of KL divergence. The region around $\theta_\text{old}$ where big optimization steps can be taken without changing the policy too much is called the trust region.

Before diving further into how these optimization problems can be solved, let’s ask why the algorithm is called trust region policy optimization when using the regularized objective. Figure 16.1 illustrates the idea. The “real” objective function $\eta(\theta)$ should be maximized (with gradient ascent or similar) starting from the parameters $\theta_\text{old}$. We cannot estimate this objective function directly, so we build a surrogate objective function $L(\theta) = J_{\theta_\text{old}}(\theta) - C \, D_{KL}(\pi_{\theta_\text{old}} || \pi_\theta)$. We know that:

  1. The two objectives have the same value at $\theta_\text{old}$: $L(\theta_\text{old}) = J_{\theta_\text{old}}(\theta_\text{old}) - C \, D_{KL}(\pi_{\theta_\text{old}} || \pi_{\theta_\text{old}}) = \eta(\theta_\text{old})$
  2. Their gradients w.r.t. $\theta$ are the same at $\theta_\text{old}$: $\nabla_\theta L(\theta)|_{\theta = \theta_\text{old}} = \nabla_\theta \eta(\theta)|_{\theta = \theta_\text{old}}$
  3. The surrogate objective is always smaller than the real objective, as the KL divergence is positive: $\eta(\theta) \geq J_{\theta_\text{old}}(\theta) - C \, D_{KL}(\pi_{\theta_\text{old}} || \pi_\theta)$

Under these conditions, the surrogate objective is also called a lower bound of the primary objective. The interesting fact is that the value of $\theta$ that maximizes $L(\theta)$ is at the same time:

  • A big step in the parameter space towards the maximum of $\eta(\theta)$, as $\theta$ and $\theta_\text{old}$ can be very different.
  • A small step in the policy distribution space, as the KL divergence between the previous and the new policies is kept small.

Exactly what we needed! The parameter region around $\theta_\text{old}$ where the KL divergence is kept small is called the trust region. This means that we can safely take big optimization steps (e.g. with a high learning rate, or even analytically) without risking a violation of the initial assumptions.

Sample-based formulation

Although the theoretical proofs in Schulman et al. () used the regularized optimization method, the practical implementation uses the constrained optimization problem:

$$\text{maximize}_\theta \qquad J_{\theta_\text{old}}(\theta) = \eta(\theta_\text{old}) + \mathbb{E}_{s \sim \rho_{\pi_{\theta_\text{old}}}, a \sim \pi_\theta} [A_{\pi_{\theta_\text{old}}}(s, a)] \\ \qquad \text{subject to} \qquad D_{KL}(\pi_{\theta_\text{old}} || \pi_\theta) \leq \delta$$

The first thing to notice is that $\eta(\theta_\text{old})$ does not depend on $\theta$, so it is a constant in the optimization problem. We only need to maximize the advantage of the actions taken by $\pi_\theta$ in each state visited by $\pi_{\theta_\text{old}}$. The problem is that $\pi_\theta$ is what we are searching for, so we cannot sample actions from it. The solution is to use importance sampling, which allows sampling actions from $\pi_{\theta_\text{old}}$ as long as we correct the objective with the importance sampling weight:

$$\text{maximize}_\theta \qquad \mathbb{E}_{s \sim \rho_{\pi_{\theta_\text{old}}}, a \sim \pi_{\theta_\text{old}}} [\frac{\pi_\theta(s, a)}{\pi_{\theta_\text{old}}(s, a)} \, A_{\pi_{\theta_\text{old}}}(s, a)] \\ \qquad \text{subject to} \qquad D_{KL}(\pi_{\theta_\text{old}} || \pi_\theta) \leq \delta$$

Now that states and actions are generated by the old policy, we can safely sample many trajectories using $\pi_{\theta_\text{old}}$ (Schulman et al. () propose two sampling schemes called single path and vine, but we ignore them here), compute the advantages of all state-action pairs (using the real rewards obtained along the trajectories), form the surrogate objective function and optimize it using second-order optimization methods.

One last thing to notice is that the advantages $A_{\pi_{\theta_\text{old}}}(s, a) = Q_{\pi_{\theta_\text{old}}}(s, a) - V_{\pi_{\theta_\text{old}}}(s)$ depend on the values of the states encountered by $\pi_{\theta_\text{old}}$. These state values do not depend on $\theta$: they are constant for each optimization step, so they can safely be dropped from the objective:

$$\text{maximize}_\theta \qquad \mathbb{E}_{s \sim \rho_{\pi_{\theta_\text{old}}}, a \sim \pi_{\theta_\text{old}}} [\frac{\pi_\theta(s, a)}{\pi_{\theta_\text{old}}(s, a)} \, Q_{\pi_{\theta_\text{old}}}(s, a)] \\ \qquad \text{subject to} \qquad D_{KL}(\pi_{\theta_\text{old}} || \pi_\theta) \leq \delta$$

Here we go, that’s TRPO: find a policy that maximizes the Q-value of sampled actions while not being too different from the previous one. It may seem a bit disappointing to end up with such a simple formulation after so many mathematical steps, but that is also the beauty of it: not only does it work, it is guaranteed to work. With TRPO, each optimization step brings the policy closer to an optimum, a property called monotonic improvement.
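
A minimal PyTorch sketch of this sample-based surrogate is shown below; the tensors of log-probabilities and Q-value estimates are assumed to come from trajectories collected with $\pi_{\theta_\text{old}}$, and the KL constraint is not handled here (it is enforced by the second-order machinery described next):

```python
import torch

def trpo_surrogate(log_probs_new, log_probs_old, q_values):
    """Sample-based TRPO objective: E[ pi_theta / pi_theta_old * Q_pi_theta_old ].
    log_probs_new: log pi_theta(s_i, a_i), differentiable w.r.t. theta.
    log_probs_old: log pi_theta_old(s_i, a_i), treated as constants.
    q_values:      Monte-Carlo estimates of Q_pi_theta_old(s_i, a_i)."""
    ratio = torch.exp(log_probs_new - log_probs_old.detach())  # importance sampling weight
    return (ratio * q_values).mean()
```

This quantity is then maximized subject to the KL constraint, which is the job of the natural gradient step presented in the next section.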

Practical implementation

Now, how do we solve the constrained optimization problem? And what is the link with natural gradients?

To solve the constrained optimization problem, we can form a Lagrangian with an additional multiplier $\lambda$ and search for its maximum:

$$\mathcal{L}(\theta, \lambda) = J_{\theta_\text{old}}(\theta) - \lambda \, (D_{KL}(\pi_{\theta_\text{old}} || \pi_\theta) - \delta)$$

Notice how close the Lagrangian is to the regularized problem used in the theory. We can take a first-order approximation of $J_{\theta_\text{old}}(\theta)$ around $\theta_\text{old}$:

$$J_{\theta_\text{old}}(\theta) \approx \nabla_\theta J_{\theta_\text{old}}(\theta) \, (\theta - \theta_\text{old})$$

as $J_{\theta_\text{old}}(\theta_\text{old}) = 0$ (once the constant $\eta(\theta_\text{old})$ is dropped, the expected advantage of the old policy over itself is zero). $g = \nabla_\theta J_{\theta_\text{old}}(\theta)$ is the now familiar policy gradient with importance sampling. Higher-order terms do not matter, as they are going to be dominated by the KL divergence term.

We will use a second-order approximation of the KL divergence term, based on the Fisher information matrix (the constant factor $\frac{1}{2}$ of the Taylor expansion is absorbed into $\lambda$):

$$D_{KL}(\pi_{\theta_\text{old}} || \pi_\theta) \approx (\theta - \theta_\text{old})^T \, F(\theta_\text{old}) \, (\theta - \theta_\text{old})$$

We get the following Lagrangian function:

$$\mathcal{L}(\theta, \lambda) = \nabla_\theta J_{\theta_\text{old}}(\theta) \, (\theta - \theta_\text{old}) - \lambda \, (\theta - \theta_\text{old})^T \, F(\theta_\text{old}) \, (\theta - \theta_\text{old})$$

which is quadratic in $\Delta \theta = \theta - \theta_\text{old}$. It therefore has a unique maximum, characterized by a first-order derivative equal to 0:

$$\nabla_\theta J_{\theta_\text{old}}(\theta) = \lambda \, F(\theta_\text{old}) \, \Delta \theta$$

or:

$$\Delta \theta = \frac{1}{\lambda} \, F(\theta_\text{old})^{-1} \, \nabla_\theta J_{\theta_\text{old}}(\theta)$$

which is exactly the natural gradient update! The size of the step $\frac{1}{\lambda}$ still has to be determined, but it can also be replaced by a fixed hyperparameter.

The main problem is now to compute and invert the Fisher information matrix, whose size is quadratic in the number of parameters $\theta$, i.e. in the number of weights of the NN. Schulman et al. () propose to use the conjugate gradient method to iteratively approximate the natural gradient direction $F^{-1} \, g$ without explicitly forming or inverting the Fisher matrix, a second-order method which will not be presented here (see https://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf for a detailed introduction). After the conjugate gradient step, the constraint $D_{KL}(\pi_{\theta_\text{old}} || \pi_\theta) \leq \delta$ is however not necessarily satisfied anymore, so a line search is performed until that criterion is met.
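
As an illustration of these two ingredients, here is a minimal PyTorch sketch: a Fisher-vector product obtained with two backward passes through the sampled mean KL (so the Fisher matrix is never built explicitly), and a conjugate gradient loop solving $F \, x = g$ for the natural gradient direction. The names (`mean_kl`, `params`, the damping coefficient) are assumptions for the example, not the reference implementation:

```python
import torch

def fisher_vector_product(mean_kl, params, v, damping=1e-2):
    """Computes F v without forming F, where mean_kl is the sampled mean KL
    divergence (a scalar tensor) and params is the list of policy parameters."""
    grads = torch.autograd.grad(mean_kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_v = (flat_grad * v).sum()                       # scalar: (d KL / d theta) . v
    hvp = torch.autograd.grad(grad_v, params, retain_graph=True)
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])   # Hessian-vector product = F v at theta_old
    return flat_hvp + damping * v                        # damping improves conditioning

def conjugate_gradient(fvp, g, n_iters=10, tol=1e-10):
    """Iteratively solves F x = g, so that x approximates F^-1 g."""
    x = torch.zeros_like(g)
    r = g.clone()                                        # residual
    p = g.clone()                                        # search direction
    rs_old = r.dot(r)
    for _ in range(n_iters):
        Fp = fvp(p)
        alpha = rs_old / p.dot(Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Hypothetical usage: g is the flat policy gradient of the surrogate objective
# step_dir = conjugate_gradient(lambda v: fisher_vector_product(mean_kl, params, v), g)
```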

Summary

TRPO is a policy gradient method using natural gradients to monotonically improve the expected return associated with the policy. As a minorization-maximization (MM) method, it uses a surrogate objective function (a lower bound on the expected return) to iteratively change the parameters of the policy using large steps, but without changing the policy itself too much (as measured by the KL divergence). Its main advantage over DDPG is that it is much less sensitive to the choice of the learning rate.

However, it has several limitations:

  • It is hard to use with neural networks having multiple outputs (e.g. the policy and the value function, as in actor-critic methods) as natural gradients are dependent on the policy distribution and its relationship with the parameters.
  • It works well when the NN has only fully-connected layers, but empirically performs poorly on tasks requiring convolutional layers or recurrent layers.
  • The use of conjugate gradients makes the implementation much more complicated and less flexible than regular SGD.

Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) was proposed by Schulman et al. () to overcome the problems of TRPO (complexity, inability to share parameters or to use complex NN architectures) while increasing the range of tasks learnable by the system (compared to DQN) and improving the sample complexity (compared to online PG methods, which perform only one update per step).

For that, they investigated various surrogate objectives (lower bounds) that could be solved using first-order optimization techniques (gradient descent). Let’s rewrite the surrogate loss of TRPO in the following manner:

$$L^\text{CPI}(\theta) = \mathbb{E}_{t} [\frac{\pi_\theta(s_t, a_t)}{\pi_{\theta_\text{old}}(s_t, a_t)} \, A_{\pi_{\theta_\text{old}}}(s_t, a_t)] = \mathbb{E}_{t} [\rho_t(\theta) \, A_{\pi_{\theta_\text{old}}}(s_t, a_t)]$$

by making the dependency on time explicit and denoting the importance sampling weight by $\rho_t(\theta)$. The superscript CPI refers to conservative policy iteration (). Without a constraint, the maximization of $L^\text{CPI}$ would lead to excessively large policy updates. The authors therefore searched for a way to modify the objective in order to penalize changes to the policy that make $\rho_t(\theta)$ very different from 1, i.e. situations where the KL divergence between the new and old policies would become high. They ended up with the following surrogate loss:

$$L^\text{CLIP}(\theta) = \mathbb{E}_{t} [ \min (\rho_t(\theta) \, A_{\pi_{\theta_\text{old}}}(s_t, a_t), \text{clip}(\rho_t(\theta) , 1- \epsilon, 1+\epsilon) \, A_{\pi_{\theta_\text{old}}}(s_t, a_t))]$$

The left part of the min operator is the surrogate objective of TRPO, $L^\text{CPI}(\theta)$. The right part restricts the importance sampling weight to lie between $1-\epsilon$ and $1+\epsilon$. Let’s consider two cases (depicted in Figure 16.2):

Figure 16.2: Illustration of the effect of clipping the importance sampling weight. Source: Schulman et al. ().
  1. The transition $(s_t, a_t)$ has a positive advantage, i.e. the action was better than expected. The probability of selecting that action again should be increased, i.e. $\pi_\theta(s_t, a_t) > \pi_{\theta_\text{old}}(s_t, a_t)$. However, the importance sampling weight could become very high (a change from 0.01 to 0.05 is a ratio of $\rho_t(\theta) = 5$). In that case, $\rho_t(\theta)$ will be clipped to $1+\epsilon$, for example 1.2. As a consequence, the parameters $\theta$ will move in the right direction, but the distance between the new and the old policies will stay small.

  2. The transition $(s_t, a_t)$ has a negative advantage, i.e. the action was worse than expected. Its probability will be decreased and the importance sampling weight might become much smaller than 1. Clipping it to $1-\epsilon$ avoids drastic changes to the policy, while still going in the right direction.

Finally, taking the minimum of the clipped and unclipped terms ensures that the final objective is a lower bound of the unclipped objective. In the original paper, they use generalized advantage estimation (GAE, section ) to estimate $A_{\pi_{\theta_\text{old}}}(s_t, a_t)$, but anything else could be used (n-step advantages, etc.). Transitions are sampled by multiple actors in parallel, as in A2C.
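
The clipped surrogate itself is only a few lines of code, which is a large part of PPO’s appeal. A minimal PyTorch sketch, assuming the log-probabilities and advantage estimates come from the collected transitions:

```python
import torch

def clipped_surrogate(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """PPO clipped objective L^CLIP (to be maximized).
    log_probs_new: log pi_theta(s_t, a_t), differentiable w.r.t. theta.
    log_probs_old: log pi_theta_old(s_t, a_t), treated as constants.
    advantages:    estimates of A_pi_theta_old(s_t, a_t), e.g. from GAE."""
    ratio = torch.exp(log_probs_new - log_probs_old.detach())            # rho_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return torch.min(unclipped, clipped).mean()
```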

The pseudo-algorithm of PPO is as follows:

PPO algorithm
  • Initialize an actor $\pi_\theta$ and a critic $V_\varphi$ with random weights.

  • while not converged:

    • for $N$ actors in parallel:

      • Collect $T$ transitions using $\pi_{\theta_\text{old}}$.

      • Compute the generalized advantage of each transition using the critic.

    • for $K$ epochs:

      • Sample $M$ transitions from the ones previously collected.

      • Train the actor to maximize the clipped surrogate objective:

      $$L^\text{CLIP}(\theta) = \mathbb{E}_{t} [ \min (\rho_t(\theta) \, A_{\pi_{\theta_\text{old}}}(s_t, a_t), \text{clip}(\rho_t(\theta) , 1- \epsilon, 1+\epsilon) \, A_{\pi_{\theta_\text{old}}}(s_t, a_t))]$$

      • Train the critic to minimize the MSE using TD learning.

    • $\theta_\text{old} \leftarrow \theta$
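
The inner optimization phase (the two loops over epochs and minibatches) could look as follows; this is a hypothetical PyTorch sketch where the `actor`, `critic` and optimizer objects, as well as the hyperparameter values, are assumptions for illustration. The critic is regressed here towards precomputed return targets, one possible instantiation of the TD learning step above.

```python
import torch

def ppo_update(actor, critic, actor_opt, critic_opt,
               states, actions, log_probs_old, advantages, returns,
               epochs=10, batch_size=64, epsilon=0.2):
    """One PPO optimization phase over previously collected transitions.
    actor(states) is assumed to return a torch.distributions object and
    critic(states) a value estimate of shape (batch, 1); log_probs_old,
    advantages and returns are precomputed constants (no gradient)."""
    n = states.shape[0]
    for _ in range(epochs):                              # K epochs over the same data
        perm = torch.randperm(n)
        for start in range(0, n, batch_size):            # minibatches of M transitions
            idx = perm[start:start + batch_size]
            dist = actor(states[idx])
            log_probs_new = dist.log_prob(actions[idx])
            ratio = torch.exp(log_probs_new - log_probs_old[idx])   # rho_t(theta)
            surrogate = torch.min(ratio * advantages[idx],
                                  torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages[idx])
            actor_loss = -surrogate.mean()               # maximize L^CLIP <=> minimize its negative
            actor_opt.zero_grad()
            actor_loss.backward()
            actor_opt.step()

            value_loss = (critic(states[idx]).squeeze(-1) - returns[idx]).pow(2).mean()
            critic_opt.zero_grad()
            value_loss.backward()
            critic_opt.step()
```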

The main advantage of PPO with respect to TRPO is its simplicity: the clipped objective can be directly maximized using first-order methods like stochastic gradient descent or Adam. It does not rely on assumptions about the parameter space: CNNs and RNNs can be used for the policy. It is sample-efficient, as several epochs of parameter updates are performed between two transition samplings: the policy network therefore needs fewer fresh samples than strictly on-policy algorithms to converge.

The main drawbacks of PPO are that there is no convergence guarantee (although in practice it converges more often than other state-of-the-art methods) and that the right value for $\epsilon$ has to be determined. PPO has improved the state of the art on Atari games and MuJoCo robotic tasks, and has become the go-to method for continuous control problems.

Additional resources