# Monte-Carlo (MC) methods

## Monte-Carlo policy evaluation

The value of a state s is defined as the mathematical expectation of the return obtained starting from that state and thereafter following the policy \pi:

V^{\pi} (s) = \mathbb{E}_{\rho_\pi} ( R_t | s_t = s) = \mathbb{E}_{\rho_\pi} ( \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} | s_t = s )

**Monte-Carlo methods** (MC) approximate this mathematical expectation by **sampling** M trajectories \tau_i starting from s and computing the sampling average of the obtained returns:

V^{\pi}(s) = \mathbb{E}_{\rho_\pi} (R_t | s_t = s) \approx \frac{1}{M} \sum_{i=1}^M R(\tau_i)

The sampling average is an unbiased estimator of the value function: given enough trajectories, the approximation becomes arbitrarily precise. The advantage of Monte-Carlo methods is that they require only **experience**, not the complete dynamics p(s' | s,a) and r(s, a, s').

The idea of MC policy evaluation is to repeatedly sample **episodes** starting from each possible state s_0 and to maintain a **running average** of the obtained returns for each state.
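A minimal every-visit MC evaluation sketch is given below, assuming an environment object `env` with `reset()` returning a state and `step(a)` returning `(next_state, reward, done)`, and a function `policy(s)` returning an action (all names are illustrative):

```python
from collections import defaultdict

def mc_policy_evaluation(env, policy, nb_episodes=1000, gamma=0.99):
    """Every-visit Monte-Carlo evaluation of V^pi with running averages."""
    V = defaultdict(float)   # value estimate per state
    N = defaultdict(int)     # number of visits per state

    for _ in range(nb_episodes):
        # Generate one episode following the policy pi.
        transitions = []
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            transitions.append((s, r))
            s = s_next

        # Compute the returns backwards and update the running averages.
        R = 0.0
        for s, r in reversed(transitions):
            R = r + gamma * R
            N[s] += 1
            V[s] += (R - V[s]) / N[s]   # incremental running average
    return V
```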

The same method can be used to estimate Q-values.

There are many more values to estimate (one per state-action pair), but the policy will then be easier to derive.

## Monte-Carlo policy improvement

After each episode, the state or action values of the visited (s, a) pairs have changed, so the current policy might not be optimal anymore. As in DP, the policy can then be improved in a greedy manner:

\begin{aligned} \pi'(s) & = \text{argmax}_a \, Q(s, a) \\ & = \text{argmax}_a \, \sum_{s' \in \mathcal{S}} p(s'|s, a) \, [r(s, a, s') + \gamma \, V(s')] \\ \end{aligned}

Estimating the Q-values allows us to act greedily, while estimating the V-values still requires the dynamics p(s' | s,a) and r(s, a, s').
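With Q-values, this improvement step does not require the model; a minimal sketch, where `Q` is assumed to map each state to an array of action values (illustrative names):

```python
import numpy as np

def greedy_policy_from_Q(Q):
    """Return the deterministic policy pi'(s) = argmax_a Q(s, a)."""
    return {s: int(np.argmax(q_s)) for s, q_s in Q.items()}
```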

## On-policy Monte-Carlo control

**Monte-Carlo control** alternates between **MC policy evaluation** and **policy improvement** until the optimal policy is found: generalized policy iteration (GPI).

The problem with MC control is that we need a policy to generate the sample episodes, but it is that policy that we want to learn. We have the same **exploration/exploitation** problem as in bandits:

- If I trust my estimates too much (**exploitation**), I may miss more interesting solutions by repeatedly generating the same episodes.
- If I act randomly (**exploration**), I will find more interesting solutions, but I won’t keep doing them.

**Exploitation** is using the current estimated values to select the greedy action: the estimated values represent how good we think an action is, so we have to use them to update the policy.

**Exploration** is executing non-greedy actions to try to reduce our uncertainty about the true values: the values are only estimates, they may be wrong, so we cannot trust them completely.

If you only **exploit** your estimates, you may miss interesting solutions. If you only **explore**, you do not use what you know: you act randomly and do not obtain as much reward as you could. **You can’t exploit all the time; you can’t explore all the time.** You can never stop exploring; but you can reduce it if your performance is good enough.

An easy solution to ensure exploration is to assume **exploring starts**, where every state-action pair has a non-zero probability of being selected as the start of an episode.

Exploration can be ensured by forcing the learned policy to be **stochastic**, aka **\epsilon-soft**.

**\epsilon-greedy action selection** randomly selects non-greedy actions with a small probability \epsilon:

\pi(s, a) = \begin{cases} 1 - \epsilon \; \text{if} \; a = \text{argmax}_{a'} \, Q(s, a') \\ \frac{\epsilon}{|\mathcal{A}| - 1} \; \text{otherwise.} \end{cases}
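A possible implementation of \epsilon-greedy action selection, assuming `Q[s]` is an array of action values (a sketch, names are illustrative):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1, rng=np.random.default_rng()):
    """With probability 1 - epsilon, take the greedy action;
    otherwise pick uniformly among the non-greedy actions."""
    greedy = int(np.argmax(Q[s]))
    if rng.random() < epsilon:
        others = [a for a in range(len(Q[s])) if a != greedy]
        return int(rng.choice(others))
    return greedy
```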

**Softmax action selection** uses a Gibbs (or Boltzmann) distribution to represent the probability of choosing the action a in state s:

\pi(s, a) = \frac{\exp \frac{Q(s, a)}{\tau}}{\sum_b \exp \frac{Q(s, b)}{\tau}}
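A corresponding sketch for softmax action selection, subtracting the maximum Q-value before the exponential for numerical stability:

```python
import numpy as np

def softmax_action(Q, s, tau=1.0, rng=np.random.default_rng()):
    """Sample an action from the Gibbs/Boltzmann distribution over Q[s]."""
    q = np.asarray(Q[s]) / tau
    probs = np.exp(q - q.max())   # max-subtraction avoids overflow
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```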

\epsilon-greedy chooses non-greedy actions uniformly at random, while softmax favors the best alternatives.

In **on-policy** control methods, the learned policy has to be \epsilon-soft, which means all actions have a probability of at least \frac{\epsilon}{|\mathcal{A}|} of being selected. \epsilon-greedy and softmax policies meet this criterion. Each sample episode is generated using this policy, which ensures exploration, while the control method still converges towards the optimal \epsilon-soft policy.

Another option to ensure exploration is to generate the sample episodes using a policy b(s, a) different from the current policy \pi(s, a) of the agent. The **behavior policy** b(s, a) used to generate the episodes is only required to select at least occasionally the same actions as the **learned policy** \pi(s, a) (coverage assumption):

\pi(s,a) > 0 \Rightarrow b(s,a) > 0

There are essentially two choices for the behavior policy:

- An \epsilon-soft behavior policy over the **Q-values**, as in on-policy MC, is often enough, while a deterministic (greedy) policy can be learned implicitly.
- The behavior policy could also come from **expert knowledge**, i.e. known episodes from the MDP generated by somebody else (human demonstrator, classical algorithm).

But are we mathematically allowed to do this?

## Importance sampling

We search for the optimal policy that maximizes in expectation the return of each **trajectory** (episode) possible under the learned policy \pi:

\mathcal{J}(\pi) = \mathbb{E}_{\tau \sim \rho_\pi} [R(\tau)]

\rho_\pi denotes the probability distribution of trajectories achievable using the policy \pi. If we generate the trajectories from the behavior policy b(s, a), we end up maximizing something else:

\mathcal{J}'(\pi) = \mathbb{E}_{\tau \sim \rho_b} [R(\tau)]

The policy that maximizes \mathcal{J}'(\pi) is **not** the optimal policy of the MDP.

If you try to estimate a parameter of a random distribution \pi using samples of another distribution b, the sample average will have a strong **bias**. We need to **correct** the samples from b in order to be able to estimate the parameters of \pi correctly: this is called **importance sampling** (IS).

We want to estimate the expected return of the trajectories generated by the policy \pi:

\mathcal{J}(\pi) = \mathbb{E}_{\tau \sim \rho_\pi} [R(\tau)]

We start by using the definition of the mathematical expectation:

\mathcal{J}(\pi) = \int_\tau \rho_\pi(\tau) \, R(\tau) \, d\tau

The expectation is the integral over all possible trajectories of their return R(\tau), weighted by the likelihood \rho_\pi(\tau) that a trajectory \tau is generated by the policy \pi.

The trick is to introduce the behavior policy b in what we want to estimate:

\mathcal{J}(\pi) = \int_\tau \frac{\rho_b(\tau)}{\rho_b(\tau)} \, \rho_\pi(\tau) \, R(\tau) \, d\tau

\rho_b(\tau) is the likelihood that a trajectory \tau is generated by the behavior policy b. We rearrange the terms:

\mathcal{J}(\pi) = \int_\tau \rho_b(\tau) \, \frac{\rho_\pi(\tau)}{\rho_b(\tau)} \, R(\tau) \, d\tau

and notice that it has the form of an expectation over trajectories generated by b:

\mathcal{J}(\pi) = \mathbb{E}_{\tau \sim \rho_b} [\frac{\rho_\pi(\tau)}{\rho_b(\tau)} \, R(\tau)]

This means that we can sample trajectories from b, but we need to **correct** the observed return by the **importance sampling weight** \frac{\rho_\pi(\tau)}{\rho_b(\tau)}.

The importance sampling weight corrects the mismatch between \pi and b.

If the two distributions are the same (on-policy), the IS weight is 1: there is no need to correct the returns. If a sample is likely under b but not under \pi, we should not care about its return: \frac{\rho_\pi(\tau)}{\rho_b(\tau)} \ll 1. If a sample is likely under \pi but not much under b, we increase its importance in estimating the return: \frac{\rho_\pi(\tau)}{\rho_b(\tau)} \gg 1. The sampling average of the corrected samples will be closer to the true value (unbiased).
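As a quick sanity check of this idea outside RL, one can estimate the mean of a target distribution \pi from samples of a different distribution b; a minimal sketch with two Gaussians (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Target pi = N(1, 1), behavior b = N(0, 2). We want E_pi[x] = 1.
def pi_pdf(x): return np.exp(-0.5 * (x - 1.0)**2) / np.sqrt(2 * np.pi)
def b_pdf(x):  return np.exp(-0.5 * (x / 2.0)**2) / (2.0 * np.sqrt(2 * np.pi))

x = rng.normal(0.0, 2.0, size=100_000)  # samples from b
w = pi_pdf(x) / b_pdf(x)                # importance sampling weights

print(x.mean())        # biased: estimates E_b[x] = 0
print((w * x).mean())  # corrected: estimates E_pi[x], close to 1
```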

How do we compute these probability distributions \rho_\pi(\tau) and \rho_b(\tau) for a trajectory \tau? A trajectory \tau is a sequence of state-action transitions (s_0, a_0, s_1, a_1, \ldots, s_T) whose probability depends on:

- the probability of choosing an action a_t in state s_t: the **policy** \pi(s, a);
- the probability of arriving in the state s_{t+1} from the state s_t with the action a_t: the **transition probability** p(s_{t+1} | s_t, a_t).

The **likelihood** of a trajectory \tau = (s_0, a_0, s_1, a_1, \ldots, s_T) under a policy \pi depends on the policy and the transition probabilities (Markov property):

\rho_\pi(\tau) = p_\pi(s_0, a_0, s_1, a_1, \ldots, s_T) = p(s_0) \, \prod_{t=0}^{T-1} \pi(s_t, a_t) \, p(s_{t+1} | s_t, a_t)

p(s_0) is the probability of starting an episode in s_0; we have no control over it.

What is interesting is that the transition probabilities disappear when calculating the **importance sampling weight**:

\rho_{0:T-1} = \frac{\rho_\pi(\tau)}{\rho_b(\tau)} = \frac{p(s_0) \, \prod_{t=0}^{T-1} \pi(s_t, a_t) \, p(s_{t+1} | s_t, a_t)}{p(s_0) \, \prod_{t=0}^{T-1} b(s_t, a_t) \, p(s_{t+1} | s_t, a_t)} = \frac{\prod_{t=0}^{T-1} \pi(s_t, a_t)}{\prod_{t=0}^{T-1} b(s_t, a_t)} = \prod_{t=0}^{T-1} \frac{\pi(s_t, a_t)}{b(s_t, a_t)}

The importance sampling weight is simply the product over the length of the episode of the ratio between \pi(s_t, a_t) and b(s_t, a_t).
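Computing the weight of a recorded episode therefore only requires the two policies, not the transition model; a sketch where `pi` and `b` are assumed to be functions mapping (s, a) to a probability:

```python
import numpy as np

def importance_weight(episode, pi, b):
    """episode: list of (s, a) pairs.
    Returns rho_{0:T-1} = prod_t pi(s_t, a_t) / b(s_t, a_t)."""
    return float(np.prod([pi(s, a) / b(s, a) for s, a in episode]))
```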

## Off-policy Monte-Carlo control

In **off-policy MC control**, we generate episodes using the behavior policy b and update **greedily** the learned policy \pi. For the state s_t, the obtained returns just need to be weighted by the relative probability of occurrence of the **rest of the episode** following the policies \pi and b:

\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(s_k, a_k)}{b(s_k, a_k)}

V^\pi(s_t) = \mathbb{E}_{\tau \sim \rho_b} [\rho_{t:T-1} \, R_t]

This gives us the updates:

V(s_t) = V(s_t) + \alpha \, \rho_{t:T-1} \, (R_t - V(s_t))

and:

Q(s_t, a_t) = Q(s_t, a_t) + \alpha \, \rho_{t:T-1} \, (R_t - Q(s_t, a_t))

Unlikely episodes under \pi are barely used for learning, while likely ones are used a lot.
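A minimal sketch of these updates at the end of an episode, assuming the episode is stored as (s_t, a_t, r_{t+1}) triples and `pi_prob` / `b_prob` return the action probabilities under \pi and b (illustrative names). Iterating backwards lets both the return R_t and the weight \rho_{t:T-1} be accumulated incrementally:

```python
def off_policy_mc_update(Q, episode, pi_prob, b_prob, alpha=0.1, gamma=0.99):
    """Weighted every-visit MC update of Q from one episode generated by b."""
    R, rho = 0.0, 1.0
    for s, a, r in reversed(episode):        # from t = T-1 down to 0
        R = r + gamma * R                    # return R_t
        rho *= pi_prob(s, a) / b_prob(s, a)  # importance weight rho_{t:T-1}
        Q[(s, a)] += alpha * rho * (R - Q[(s, a)])
        # If pi is greedy and a is non-greedy, rho becomes 0 and the earlier
        # transitions of the episode stop contributing (see Problem 1 below).
    return Q
```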

**Problem 1:** if the learned policy is greedy, the IS weight becomes 0 as soon as a non-greedy action a_t is taken:

\pi(s_t, a_t) = 0 \rightarrow \rho_{0:T-1} = \prod_{k=0}^{T-1} \frac{\pi(s_k, a_k)}{b(s_k, a_k)} = 0

Off-policy MC control then only learns from the last greedy actions of an episode, which is slow at the beginning of learning.

**Solution:** \pi and b should not be very different. Usually \pi is greedy and b is a softmax (or \epsilon-greedy) over it.

**Problem 2:** if the learned policy is stochastic, the IS weights can quickly **vanish** to 0 or **explode** to infinity:

\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(s_k, a_k)}{b(s_k, a_k)}

If the ratios \dfrac{\pi(s_k, a_k)}{b(s_k, a_k)} are typically smaller than 1, the product vanishes to 0; if they are typically bigger than 1, it explodes to infinity.

**Solution:** one can normalize the IS weight between different episodes (see Sutton and Barto) or **clip** it (e.g. restrict it to [0.9, 1.1], see PPO later in this course).
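A crude sketch of the clipping idea on the per-step ratio, using `np.clip` (the [0.9, 1.1] range is just the example above):

```python
import numpy as np

def clipped_ratio(pi_prob, b_prob, low=0.9, high=1.1):
    """Per-step importance ratio restricted to [low, high], to limit how
    fast the product over an episode can vanish or explode."""
    return float(np.clip(pi_prob / b_prob, low, high))
```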

The main advantage of **off-policy** strategies is that you can learn from others' actions: you do not have to rely on your initially wrong policies to discover the solution by chance. Example: learning to play chess by studying thousands or millions of games by chess masters. In a given state, only a subset of the possible actions is actually executed by experts: the others may be too obviously wrong. The exploration is then guided by this expert knowledge, not performed randomly among all possible actions.

Off-policy methods greatly reduce the number of transitions needed to learn a policy: very bad actions are not even considered, while the learned policy still extracts an optimal strategy from the “classical” moves. Drawback: if a good move is never tried by the behavior policy, the learned policy will never discover it.