Temporal Difference learning

Julien Vitay

Professur für Künstliche Intelligenz - Fakultät für Informatik

- MC methods wait until the end of the episode to compute the obtained return:

V(s_t) = V(s_t) + \alpha (R_t - V(s_t))

- If the episode is very long, learning might be very slow. If the task is continuing, it is impossible.

- Considering that the return at time t is the immediate reward plus the return in the next step:

R_t = r_{t+1} + \gamma \, R_{t+1}

we could replace R_{t+1} by an estimate, which is the value of the next state V^\pi(s_{t+1}) = \mathbb{E}_\pi [R_{t+1} | s_{t+1}=s]:

R_t \approx r_{t+1} + \gamma \, V^\pi(s_{t+1})

**Temporal-Difference (TD)**methods simply replace the actual return by an estimation in the update rule:

V(s_t) = V(s_t) + \alpha \, (r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t))

where r_{t+1} + \gamma\, V(s_{t+1}) is a sampled estimate of the return.

- The quantity

\delta_t = r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t)

is called equivalently the **reward prediction error** (RPE), the **TD error** or the **advantage** of the action a_t.

It is the difference between:

the estimated return in state s_t: V(s_t).

the actual return r_{t+1} + \gamma \, V(s_{t+1}), computed with an estimation.

If \delta_t > 0, it means that:

we received more reward r_{t+1} than expected, or:

we arrive in a state s_{t+1} that is better than expected.

we should increase the value of s_t as we

**underestimate**it.

If \delta_t < 0, we should decrease the value of s_t as we

**overestimate**it.

- The learning procedure in TD is then possible after each transition: the backup diagram is limited to only one state and its follower.

**Backup diagram of TD(0)**

**while**True:Start from an initial state s_0.

**foreach**step t of the episode:Select a_t using the current policy \pi in state s_t.

Apply a_t, observe r_{t+1} and s_{t+1}.

Compute the TD error:

\delta_t = r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t)

- Update the state-value function of s_t:

V(s_t) = V(s_t) + \alpha \, \delta_t

**if**s_{t+1} is terminal:**break**

- TD learns from experience in a fully incremental manner. It does not need to wait until the end of an episode. It is therefore possible to learn continuing tasks. TD converges to V^{\pi} if the step-size parameter \alpha is small enough.

- The
**TD error**is used to evaluate the policy:

V(s_t) = V(s_t) + \alpha \, (r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t)) = V(s_t) + \alpha \, \delta_t

- The estimates converge to:

V^\pi(s) = \mathbb{E}_\pi [r(s, a, s') + \gamma \, V^\pi(s')]

By using an

**estimate of the return**R_t instead of directly the return as in MC,we

**increase the bias**(estimates are always wrong, especially at the beginning of learning)but we

**reduce the variance**: only r(s, a, s') is stochastic, not the value function V^\pi.

We can therefore expect

**less optimal solutions**, but we will also need**less samples**.better

**sample efficiency**than MC.worse

**convergence**(suboptimal).

- Q-values can be estimated in the same way:

Q(s_t, a_t) = Q(s_t, a_t) + \alpha \, (r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))

Like for MC, the exploration/exploitation trade-off has to be managed: what is the next action a_{t+1}?

There are therefore two classes of TD control algorithms:

**on-policy**(SARSA)**off-policy**(Q-learning).

**SARSA**(state-action-reward-state-action) updates the value of a state-action pair by using the predicted value of the next state-action pair according to the current policy.

- When arriving in s_{t+1} from (s_t, a_t), we already sample the next action:

a_{t+1} \sim \pi(s_{t+1}, a)

- We can now update the value of (s_t, a_t):

Q(s_t, a_t) = Q(s_t, a_t) + \alpha \, (r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))

The next action a_{t+1} will

**have to**be executed next: SARSA is**on-policy**. You cannot change your mind and execute another a_{t+1}.The learned policy must be \epsilon-soft (stochastic) to ensure exploration.

SARSA converges to the optimal policy if \alpha is small enough and if \epsilon (or \tau) slowly decreases to 0.

**while**True:Start from an initial state s_0 and select a_0 using the current policy \pi.

**foreach**step t of the episode:Apply a_{t}, observe r_{t+1} and s_{t+1}.

Select a_{t+1} using the current

**stochastic**policy \pi.Update the action-value function of (s_t, a_t):

Q(s_t, a_t) = Q(s_t, a_t) + \alpha \, (r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))

- Improve the stochastic policy, e.g:

\pi(s_t, a) = \begin{cases} 1 - \epsilon \; \text{if} \; a = \text{argmax} \, Q(s_t, a) \\ \frac{\epsilon}{|\mathcal{A}(s_t) -1|} \; \text{otherwise.} \\ \end{cases}

**if**s_{t+1} is terminal:**break**

**Q-learning**directly approximates the optimal action-value function Q^* independently of the current policy, using the greedy action in the next state.

Q(s_t, a_t) = Q(s_t, a_t) + \alpha \, (r_{t+1} + \gamma \, \max_a Q(s_{t+1}, a) - Q(s_t, a_t))

The next action a_{t+1} can be generated by a behavior policy: Q-learning is

**off-policy**.The learned policy can be deterministic.

The behavior policy can be an \epsilon-soft policy derived from Q or expert knowledge.

The behavior policy only needs to visit all state-action pairs during learning to ensure optimality.

**while**True:Start from an initial state s_0.

**foreach**step t of the episode:Select a_{t} using the behavior policy b (e.g. derived from \pi).

Apply a_t, observe r_{t+1} and s_{t+1}.

Update the action-value function of (s_t, a_t):

Q(s_t, a_t) = Q(s_t, a_t) + \alpha \, (r_{t+1} + \gamma \, \max_a Q(s_{t+1}, a) - Q(s_t, a_t))

- Improve greedily the learned policy:

\pi(s_t, a) = \begin{cases} 1\; \text{if} \; a = \text{argmax} \, Q(s_t, a) \\ 0 \; \text{otherwise.} \\ \end{cases}

**if**s_{t+1} is terminal:**break**

- In off-policy Monte-Carlo, Q-values are estimated using the return of the rest of the episode on average:

Q^\pi(s, a) = \mathbb{E}_{\tau \sim \rho_b}[\rho_{0:T-1} \, R(\tau) | s_0 = s, a_0=a]

As the rest of the episode is generated by b, we need to correct the returns using the importance sampling weight.

In Q-learning, Q-values are estimated using other estimates:

Q^\pi(s, a) = \mathbb{E}_{s_t \sim \rho_b, a_t \sim b}[ r_{t+1} + \gamma \, \max_a Q^\pi(s_{t+1}, a) | s_t = s, a_t=a]

As we only sample

**transitions**using b and not episodes, there is no need to correct the returns:The returns use estimates Q^\pi, which depend on \pi and not b.

The immediate reward r_{t+1} is stochastic, but is the same whether you sample a_t from \pi or from b.

**States:**position of the red rectangle in the grid.**Action:**left, right, up, down.**Rewards:**+100 for the blue circle

-100 for the green triangles

-1 otherwise.

You will implement this in the exercises.

**Temporal Difference**allow to learn Q-values from single transitions instead of complete episodes.MC methods can only be applied to episodic problems, while TD works for continuing tasks.

MC and TD methods are

**model-free**: you do not need to know anything about the environment (p(s' |s, a) and r(s, a, s')) to learn.The

**exploration-exploitation**dilemma must be dealt with:**On-policy**TD (SARSA) follows the learned stochastic policy.

Q(s, a) = Q(s, a) + \alpha \, (r(s, a, s') + \gamma \, Q(s', a') - Q(s, a))

**Off-policy**TD (Q-learning) follows a behavior policy and learns a deterministic policy.

Q(s, a) = Q(s, a) + \alpha \, (r(s, a, s') + \gamma \, \max_a Q(s', a) - Q(s, a))

TD uses

**bootstrapping**like DP: it uses other estimates to update one estimate.Q-learning is the go-to method in tabular RL.

Actor-critic methods are TD methods that have a separate memory structure to explicitly represent the policy independent of the value function.

The policy \pi is implemented by the

**actor**, because it is used to select actions.The estimated values V(s) are implemented by the

**critic**, because it criticizes the actions made by the actor.Learning is always

**on-policy**: the critic must learn about and critique whatever policy is currently being followed by the actor.

- The critic computes the
**TD error**or**1-step advantage**:

\delta_t = r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t)

- This scalar signal is the output of the critic and drives learning in both the actor and the critic.

- The TD error after each transition (s_t, a_t, r_{t+1}, s_{t+1}):

\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)

tells us how good the action a_t was compared to our expectation V(s_t).

When the advantage \delta_t > 0, this means that the action lead to a better reward or a better state than what was expected by V(s_t), which is a

**good surprise**, so the action should be reinforced (selected again) and the value of that state increased.When \delta_t < 0, this means that the previous estimation of (s_t, a_t) was too high (

**bad surprise**), so the action should be avoided in the future and the value of the state reduced.

- TD error after each transition:

\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)

- The critic is updated using this scalar signal:

V(s_t) \leftarrow V(s_t) + \alpha \, \delta_t

- The actor is updated according to this TD error signal. For example a softmax actor over preferences:

p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \, \delta_t

\pi(s, a) = \frac{\exp{p(s, a)}}{\sum_b \exp{p(s, b)}}

When \delta_t >0, the preference is increased, so the probability of selecting it again increases.

When \delta_t <0, the preference is decreased, so the probability of selecting it again decreases.

This is the equivalent of

**reinforcement comparison**for bandits.

Start in s_0. Initialize the preferences p(s,a) for each state action pair and the critic V(s) for each state.

**foreach**step t:- Select a_t using the
**actor**\pi in state s_t:

\pi(s_t, a) = \frac{\exp{p(s, a)}}{\sum_b \exp{p(s, b)}}

Apply a_t, observe r_{t+1} and s_{t+1}.

Compute the TD error in s_t using the

**critic**:

\delta_t = r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t)

- Update the
**actor**:

p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \, \delta_t

- Update the
**critic**:

V(s_t) \leftarrow V(s_t) + \alpha \, \delta_t

- Select a_t using the

The advantage of the separation between the actor and the critic is that now the actor can take any form (preferences, linear approximation, deep networks).

It requires minimal computation in order to select the actions, in particular when the action space is huge or even continuous.

It can learn stochastic policies, which is particularly useful in non-Markov problems.

**It is obligatory to learn on-policy:**the critic must evaluate the actions taken by the current actor.

the actor must learn from the current critic, not “old” V-values.

**Value-based**methods use value estimates Q_t(s, a) to infer a policy:**On-policy**methods learn and use a stochastic policy to explore.**Off-policy**methods learn a deterministic policy but use a (stochastic) behavior policy to explore.

**Policy-based**methods directly learn the policy \pi_t(s, a) (**actor**) using preferences or function approximators.A

**critic**learning values is used to improve the policy w.r.t a performance baseline.Actor-critic architectures are strictly

**on-policy**.

Bandits | MDP | |
---|---|---|

Value-based |
||

\qquadOn-policy | \epsilon-greedy, softmax | SARSA |

\qquadOff-policy | greedy | Q-learning |

Policy-based |
||

\qquadOn-policy | Reinforcement comparison | Actor-critic |

MC has

**high variance, zero bias**:Good convergence properties. We are more likely to find the optimal policy.

Not very sensitive to initial estimates.

Very simple to understand and use.

TD has

**low variance, some bias**:Usually more

**sample efficient**than MC.TD(0) converges to V^\pi(s) (but not always with function approximation). The policy might be suboptimal.

More sensitive to initial values (bootstrapping).

- When the reward function is sparse (e.g. only at the end of a game), only the last action, leading to that reward, will be updated the first time in TD.

Q(s, a) = Q(s, a) + \alpha \, (r(s, a, s') + \gamma \, \max_a Q(s', a) - Q(s, a))

The previous actions, which were equally important in obtaining the reward, will only be updated the next time they are visited.

This makes learning very slow: if the path to the reward has n steps, you will need to repeat the same episode at least n times to learn the Q-value of the first action.

Optimally, we would like a trade-off between:

TD (only one state/action is updated each time, small variance but significant bias)

Monte-Carlo (all states/actions in an episode are updated, no bias but huge variance).

In

**n-step TD prediction**, the next n rewards are used to estimate the return, the rest is approximated.

- The
**n-step return**is the discounted sum of the n next rewards is computed as in MC plus the predicted value at step t+n which replaces the rest as in TD.

R^n_t = \sum_{k=0}^{n-1} \gamma^{k} \, r_{t+k+1} + \gamma^n \, V(s_{t+n})

- We can update the value of the state with this n-step return:

V(s_t) = V(s_t) + \alpha \, (R^n_t - V (s_t))

- The
**n-step advantage**at time t is:

A^n_t = \sum_{k=0}^{n-1} \gamma^{k} \, r_{t+k+1} + \gamma^n \, V(s_{t+n}) - V (s_t)

- It is easy to check that the
**TD error**is the 1-step advantage:

\delta_t = A^1_t = r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t)

As you use more “real” rewards, you

**reduce the bias**of Q-learning.As you use estimates for the rest of the episode, you

**reduce the variance**of MC methods.But how to choose n?

- One solution is to
**average**the n-step returns, using a discount factor \lambda :

R^\lambda_t = (1 - \lambda) \, \sum_{n=1}^\infty \lambda^{n-1} \, R^n_t

- The term 1- \lambda is there to ensure that the coefficients \lambda^{n-1} sum to one.

\sum_{n=1}^\infty \lambda^{n-1} = \dfrac{1}{1 - \lambda}

Each reward r_{t+k+1} will count multiple times in the \lambda-return. Distant rewards are discounted by \lambda^k in addition to \gamma^k.

Large n-step returns (MC) should not have as much importance as small ones (TD), as they have a high variance.

- To understand the role of \lambda, let’s split the infinite sum w.r.t the end of the episode at time T. n-step returns with n \geq T all have a MC return of R_t:

R^\lambda_t = (1 - \lambda) \, \sum_{n=1}^{T-t-1} \lambda^{n-1} \, R^n_t + \lambda^{T-t-1} \, R_t

\lambda controls the bias-variance trade-off:

If \lambda=0, the \lambda-return is equal to R^1_t = r_{t+1} + \gamma \, V(s_{t+1}), i.e. TD: high bias, low variance.

If \lambda=1, the \lambda-return is equal to R_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}, i.e. MC: low bias, high variance.

This

**forward view**of eligibility traces implies to look at all future rewards until the end of the episode to perform a value update. This prevents online learning using single transitions.

- Another view on eligibility traces is that the
**TD reward prediction error**at time t is sent backwards in time:

\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)

- Every state s previously visited during the episode will be updated proportionally to the current TD error and an
**eligibility trace**e_t(s):

V(s) \leftarrow V(s) + \alpha \, \delta_t \, e_t(s)

- The eligibility trace defines since how long the state was visited:

e_t(s) = \begin{cases} \gamma \, \lambda \, e_{t-1}(s) \qquad\qquad \text{if} \quad s \neq s_t \\ e_{t-1}(s) + 1 \qquad \text{if} \quad s = s_t \\ \end{cases}

- \lambda defines how important is a future TD error for the current state.

**foreach**step t of the episode:Select a_t using the current policy \pi in state s_t, observe r_{t+1} and s_{t+1}.

Compute the TD error in s_t:

\delta_t = r_{t+1} + \gamma \, V_k(s_{t+1}) - V_k(s_t)

- Increment the trace of s_t:

e_{t+1}(s_t) = e_t(s_t) + 1

**foreach**state s \in [s_o, \ldots, s_t] in the episode:- Update the state value function:

V_{k+1}(s) = V_k(s) + \alpha \, \delta_t \, e_t(s)

- Decay the eligibility trace:

e_{t+1}(s) = \lambda \, \gamma \, e_t(s)

**if**s_{t+1} is terminal:**break**

The backward view of eligibility traces can be applied on single transitions, given we know the history of visited states and maintain a trace for each of them.

Eligibility traces are a very useful way to speed learning up in TD methods and control the bias/variance trade-off.

This modification can be applied to all TD methods: TD(\lambda) for states, SARSA(\lambda) and Q(\lambda) for actions.

The main drawback is that we need to keep a trace for ALL possible state-action pairs: memory consumption. Clever programming can limit this issue.

The value of \lambda has to be carefully chosen for the problem: perhaps initial actions are random and should not be reinforced.

If your problem is not strictly Markov (POMDP), eligibility traces can help as they update the history!

- The
**n-step advantage**at time t:

A^n_t = \sum_{k=0}^{n-1} \gamma^{k} \, r_{t+k+1} + \gamma^n \, V(s_{t+n}) - V (s_t)

can be written as function of the TD error of the next n transitions:

A^{n}_t = \sum_{l=0}^{n-1} \gamma^l \, \delta_{t+l}

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438.

- The
**n-step advantage**realizes a bias/variance trade-off, but which value of n should we choose?

A^n_t = \sum_{k=0}^{n-1} \gamma^{k} \, r_{t+k+1} + \gamma^n \, V(s_{t+n}) - V (s_t)

- Schulman et al. (2015) proposed a
**generalized advantage estimate**(GAE) A_t^{\text{GAE}(\gamma, \lambda)} summing all possible n-step advantages with a discount parameter \lambda:

A_t^{\text{GAE}(\gamma, \lambda)} = (1 - \lambda) \sum_{n=1}^\infty \lambda^n \, A^n_t

This is just a forward eligibility trace over distant n-step advantages: the 1-step advantage is more important the the 1000-step advantage (too much variance).

We can show that the GAE can be expressed as a function of the future 1-step TD errors: A_t^{\text{GAE}(\gamma, \lambda)} = \sum_{k=0}^\infty (\gamma \, \lambda)^k \, \delta_{t+k}

**Generalized advantage estimate**(GAE) :

A_t^{\text{GAE}(\gamma, \lambda)} = (1 - \lambda) \sum_{n=1}^\infty \lambda^n \, A^n_t = \sum_{k=0}^\infty (\gamma \, \lambda)^k \, \delta_{t+k}

The parameter \lambda controls the

**bias-variance**trade-off.When \lambda=0, the generalized advantage is the TD error:

A_t^{\text{GAE}(\gamma, 0)} = r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t) = \delta_{t}

- When \lambda=1, the generalized advantage is the MC advantage:

A_t^{\text{GAE}(\gamma, 1)} = \sum_{k=0}^\infty \gamma^k \, r_{t+k+1} - V(s_t) = R_t - V(s_t)

Any value in between controls the bias-variance trade-off: from the high bias / low variance of TD to the small bias / high variance of MC.

In practice, it leads to a better estimation than n-step advantages, but is more computationally expensive.