# Dynamic programming

import numpy as np
import matplotlib.pyplot as plt

The goal of this exercise is to find the optimal policy for the recycling robot.

In this problem, a recycling robot has to search for empty cans to collect (each can defines a “reward” given to the robot). It can also decide to stay where it is to save its battery and wait that somebody brings it a can (which gives less cans in average than actively searching for them).

The robot has two battery levels, high and low.

• In the high level, the robot can either search or wait.

• In the low state, three actions are possible: search, wait and recharge.

State-action transitions are probabilistic, i.e. they bring the robot in different states based on different probabilities \alpha and \beta.

This problem defines a finite MDP, with two states high and low corresponding to the battery level. The actions search and wait are possible in the high and low states, while the action recharge is only possible in the low state.

\begin{aligned} \mathcal{S} &=& \{ \text{high}, \text{low} \} \\ \mathcal{A}(\text{high} ) &=& \{ \text{search}, \text{wait} \} \\ \mathcal{A}(\text{low} ) &=& \{ \text{search}, \text{wait}, \text{recharge} \} \end{aligned}

The action search brings on average a reward of \mathcal{R}^\text{search}, the action wait a reward of \mathcal{R}^\text{wait}, the action recharge brings no reward, but allows to get in the high state.

Note that if the robot decides to search in the low state, there is a probability 1 - \beta that it totally empties its battery, requiring human intervention. This is punished with a negative reward of -3.

The transition and reward probabilities of each transition is defined in the following table, completely defining a MDP.

s s’ a p(s’ / s, a) r(s, a, s’)
high high search \alpha \mathcal{R}^\text{search}
high low search 1 - \alpha \mathcal{R}^\text{search}
low high search 1 - \beta -3
low low search \beta \mathcal{R}^\text{search}
high high wait 1 \mathcal{R}^\text{wait}
high low wait 0 \mathcal{R}^\text{wait}
low high wait 0 \mathcal{R}^\text{wait}
low low wait 1 \mathcal{R}^\text{wait}
low high recharge 1 0
low low recharge 0 0

The goal of this exercise is to find the optimal policy \pi^* of the robot, i.e to find for each state the action that should be performed systematically in order to gather the maximum of reward on the long term.

We will apply here two dynamic programming methods, policy iteration and value iteration, to solve the Bellman equations.

The Bellman equation for the state function is:

V^{\pi} (s) = \sum_{a \in \mathcal{A}(s)} \pi(s, a) \, \sum_{s' \in \mathcal{S}} p(s' | s, a) \, [ r(s, a, s') + \gamma \, V^{\pi} (s') ]

Q: On paper, adapt the Bellman equation to the problem. First, for every state s and possible action a, find the optimal value of the action with the form:

Q^{\pi} (s, a) = f( V^\pi (\text{high}), V^\pi (\text{low}), \alpha, \beta, \gamma, \mathcal{R}^{\text{search}}, \mathcal{R}^{\text{wait}} )

Deduce the Bellman equation for the two states V^\pi (\text{high}) and V^\pi (\text{low}).

A:

Q^\pi(\text{high}, \text{search}) = \alpha \, (\mathcal{R }^\text{search} + \gamma \, V^\pi(\text{high} )) + (1- \alpha)\, (\mathcal{R}^\text{search} + \gamma \, V^\pi(\text{low}))

Q^\pi(\text{high}, \text{wait}) = \mathcal{R}^\text{wait} + \gamma \, V^\pi(\text{high})

Q^\pi(\text{low}, \text{search}) = \beta * (\mathcal{R }^\text{search} + \gamma \, V^\pi(\text{low})) + (1- \beta) \, (-3 + \gamma \, V^\pi(\text{high}))

Q^\pi(\text{low}, \text{wait}) = \mathcal{R}^\text{wait} + \gamma \, V^\pi(\text{low})

Q^\pi(\text{low}, \text{recharge}) = \gamma \, V^\pi(\text{high})

V^\pi(\text{high}) = \pi(\text{high}, \text{search}) \, Q^\pi(\text{high}, \text{search}) + \pi(\text{high}, \text{wait}) \, Q^\pi(\text{high}, \text{wait})

V^\pi(\text{low}) = \pi(\text{low}, \text{search}) \, Q^\pi(\text{low}, \text{search}) + \pi(\text{low}, \text{wait}) \, Q^\pi(\text{low}, \text{wait}) + \pi(\text{low}, \text{recharge}) \, Q^\pi(\text{low}, \text{recharge})

## Policy Iteration

Now that we have the Bellman equations for the two states high and low, we can solve them using iterative policy evaluation for a fixed policy \pi.

### Iterative policy evaluation

Let’s start by setting the parameters of the MDP. In the rest of the exercise, you will modify these parameters to investigate how it changes the optimal policy.

# Transition probabilities
alpha = 0.3
beta = 0.2

# Discount parameter
gamma = 0.7

# Expected rewards
r_search = 6.0
r_wait = 2.0

There are many ways to represent states and actions in a MDP. The suggestion for this exercise is to use dictionaries here the keys are the actions’ name and the vaues are indices:

nb_states = 2
nb_actions = 3

s = {'high': 0, 'low': 1}
a = {'search': 0, 'wait': 1, 'recharge': 2}

Using dictionaries, you can access numpy arrays with s['high'] or a['recharge'] instead of 0 and 2, what will make the code readable.

The next step is to initialize numpy arrays where we will store the V and Q values. V will have only two elements for high and low, while Q will be a 2x3 matrix with one element for each state-action pair. Notice that (high, recharge) is not a possible action, so this element will not be be updated.

V = np.zeros(nb_states)
Q = np.zeros((nb_states, nb_actions))

You can then access the individual values with V[s['high']] or Q[s['low'], a['wait']].

We can now evaluate a policy \pi. In dynamic programming, the policies are deterministic, as we want to estimate the optimal policy.

To implement the policy, we just need to assign the index of an action to each state, i.e. \pi(s). The following cell creates an initial policy \pi where the agent searches in both states high and low. We here make sure that the array contains integers (0, 1 or 2), but that is not even necessary.

pi = np.array([a['search'], a['search']], dtype=int)

Q: Evaluate this policy using iterative policy evaluation.

We would normally only need to update the V-value of the two states using:

V (s) \leftarrow \sum_{a \in \mathcal{A}(s)} \pi(s, a) \, \sum_{s' \in \mathcal{S}} p(s' | s, a) \, [ r(s, a, s') + \gamma \, V (s') ] \quad \forall s \in \mathcal{S}

The code will be more readable if you first update the Q-values of the 5 state-action pairs:

Q (s, a) \leftarrow \sum_{s' \in \mathcal{S}} p(s' | s, a) \, [ r(s, a, s') + \gamma \, V (s') ] \quad \forall s \in \mathcal{S}

and only then update the two V-values:

V (s) \leftarrow \sum_{a \in \mathcal{A}(s)} \pi(s, a) \, Q(s, a)

These updates should normally be applied until the V-values converge. For simplicity, we could decide to simply apply 50 updates or so, and hope that it is enough.

Record the V-value of the two states after each update and plot them.

V = np.zeros(nb_states)
Q = np.zeros((nb_states, nb_actions))

V_high_history = []
V_low_history = []

for k in range(50):

Q[s['high'], a['search']] = alpha * (r_search + gamma * V[s['high']]) + (1 - alpha) * (r_search + gamma*V[s['low']])
Q[s['high'], a['wait']] = r_wait + gamma * V[s['high']]

Q[s['low'], a['search']] = beta * (r_search + gamma * V[s['low']]) + (1 - beta) * (-3 + gamma*V[s['high']])
Q[s['low'], a['wait']] = r_wait + gamma * V[s['low']]
Q[s['low'], a['recharge']] = gamma * V[s['high']]

V[s['high']] = Q[s['high'], pi[s['high']]]
V[s['low']] = Q[s['low'], pi[s['low']]]

V_high_history.append(V[s['high']])
V_low_history.append(V[s['low']])

print(Q)

plt.figure(figsize=(10, 6))
plt.plot(V_high_history, label="high")
plt.plot(V_low_history, label="low")
plt.legend()
plt.show()
[[11.28888873  9.90222206  0.        ]
[ 5.9555554   6.16888873  7.90222206]]

Q: Do the V-values converge? How fast? What do the final values represent? Change the value of \gamma and conclude on its importance (do not forget to reset the V and Q arrays to 0!).

A: The V-values converge quite fast (~15 iterations) to their true value. The high state has a higher value than the low state, as there is no risk in that state to receive the punishment of -3.

The final value is the expected return in that state, that is:

R_t = \sum_{k=0}^\infty \gamma^k \, r_{t+k+1}

\gamma completely changes the scale of the return. Small values of \gamma lead to small returns (only a couple of rewards count in the sum), while high values lead to high returns (there are a lot of rewards to be summed, especially because the task is continuing).

Q: Print the Q-values at the end of the policy evaluation. What would the greedy policy with respect to these Q-values?

Q: Change the initial policy to this policy and evaluate it. What happens? Compare the final value of the states under both policies. Which one is the best?

A: The greedy policy w.r.t. the Q-values is searching in high, recharging in low, as the Q-values are maximal for these actions.

If we evaluate this policy, we observe that:

• the value of both states is higher: this is a better policy, as we collect more return on average.
• the greedy policy does not change after the evaluation: we have found the optimal policy already!

### Policy iteration

Improving the policy is now straightforward. We just to look at the Q-values in each state, and change the policy so that it takes the action with the maximal Q-value. If this does not change the policy (we still take the same actions), we have found the optimal policy, we can stop.

Q: Implement policy iteration.

Do not forget to reset the V and Q arrays at the beginning of the cell, as well as the original policy.

Use an infinite loop that you will quit when the policy has not changed between two iterations. Something like:

while True:
# 1 - Policy evaluation
for k in range(50):
# Update the values

# 2 - Policy improvement

if pi != pi_old:
break

Beware: if you simply assign the policy to another array and modify the policy:

pi_old = pi
pi[s['high']] = a['search']

pi_old will also change! You need to .copy() the policy.

V = np.zeros(nb_states)
Q = np.zeros((nb_states, nb_actions))

pi = np.array([a['search'], a['search']], dtype=int)

V_high_history = []
V_low_history = []

t = 1
while True:
# Policy evaluation
for k in range(50):

Q[s['high'], a['search']] = alpha * (r_search + gamma * V[s['high']]) + (1 - alpha) * (r_search + gamma*V[s['low']])
Q[s['high'], a['wait']] = r_wait + gamma * V[s['high']]

Q[s['low'], a['search']] = beta * (r_search + gamma * V[s['low']]) + (1 - beta) * (-3 + gamma*V[s['high']])
Q[s['low'], a['wait']] = r_wait + gamma * V[s['low']]
Q[s['low'], a['recharge']] = gamma * V[s['high']]

V[s['high']] = Q[s['high'], pi[s['high']]]
V[s['low']] = Q[s['low'], pi[s['low']]]

V_high_history.append(V[s['high']])
V_low_history.append(V[s['low']])

# Policy improvement
pi_old = pi.copy()
pi[s['high']] = Q[s['high'], :2].argmax()
pi[s['low']] = Q[s['low'], :].argmax()

print('Greedy policy after iteration', t)
print('pi(high)=', pi[s['high']])
print('pi(low)=', pi[s['low']])
print('-')

# Exit if the policy does not change
if pi[s['high']] == pi_old[s['high']] and pi[s['low']] == pi_old[s['low']]:
break
t += 1

plt.figure(figsize=(10, 6))
plt.plot(V_high_history, label="high")
plt.plot(V_low_history, label="low")
plt.legend()
plt.show()
Greedy policy after iteration 1
pi(high)= 0
pi(low)= 2
-
Greedy policy after iteration 2
pi(high)= 0
pi(low)= 2
-

## Value iteration

In value iteration, we merge the policy evaluation and improvement in a single update rule:

V (s) \leftarrow \max_{a \in \mathcal{A}(s)} \sum_{s' \in \mathcal{S}} p(s' | s, a) \, [ r(s, a, s') + \gamma \, V (s') ]

The value of state takes the value of its greedy action. The policy is therefore implicitly greedy w.r.t the Q-values.

The algorithm becomes:

• while not converged:

• for all states s:

• Update the value estimates with:

V (s) \leftarrow \max_{a \in \mathcal{A}(s)} \sum_{s' \in \mathcal{S}} p(s' | s, a) \, [ r(s, a, s') + \gamma \, V (s') ]

Q: Modify your previous code to implement value iteration. Use a fixed number of iterations (e.g. 50) as in policy evaluation. Visualize the evolution of the V-values and print the greedy policy after each iteration. Conclude.

V = np.zeros(nb_states)
Q = np.zeros((nb_states, nb_actions))

pi = np.array([a['search'], a['search']], dtype=int)

V_high_history = []
V_low_history = []

for k in range(50):

# Policy evaluation
Q[s['high'], a['search']] = alpha * (r_search + gamma * V[s['high']]) + (1 - alpha) * (r_search + gamma*V[s['low']])
Q[s['high'], a['wait']] = r_wait + gamma * V[s['high']]

Q[s['low'], a['search']] = beta * (r_search + gamma * V[s['low']]) + (1 - beta) * (-3 + gamma*V[s['high']])
Q[s['low'], a['wait']] = r_wait + gamma * V[s['low']]
Q[s['low'], a['recharge']] = gamma * V[s['high']]

V[s['high']] = Q[s['high'], :2].max()
V[s['low']] = Q[s['low'], :].max()

V_high_history.append(V[s['high']])
V_low_history.append(V[s['low']])

# Compute the greedy policy
pi_old = pi.copy()
pi[s['high']] = Q[s['high'], :2].argmax()
pi[s['low']] = Q[s['low'], :].argmax()

print('Greedy policy after iteration', k)
print('pi(high)=', pi[s['high']])
print('pi(low)=', pi[s['low']])
print("V=", V)
print("Q=", Q)

plt.figure(figsize=(10, 6))
plt.plot(V_high_history, label="high")
plt.plot(V_low_history, label="low")
plt.legend()
plt.show()
Greedy policy after iteration 49
pi(high)= 0
pi(low)= 2
V= [13.4228186   9.39597296]
Q= [[13.4228186  11.39597296  0.        ]
[ 7.63221457  8.57718102  9.39597296]]

Q: Change the value of the discount factor \gamma =0.3 so that the agent becomes short-sighted: it only takes into account the immediate rewards, but forgets about the long-term. Does it change the strategy? Explain why.

# Transition probabilities
alpha = 0.3
beta = 0.2

# Discount parameter
gamma = 0.3

# Expected rewards
r_search = 6.0
r_wait = 2.0

V = np.zeros(nb_states)
Q = np.zeros((nb_states, nb_actions))

pi = np.array([a['search'], a['search']], dtype=int)

V_high_history = []
V_low_history = []

for k in range(50):

# Policy evaluation
Q[s['high'], a['search']] = alpha * (r_search + gamma * V[s['high']]) + (1 - alpha) * (r_search + gamma*V[s['low']])
Q[s['high'], a['wait']] = r_wait + gamma * V[s['high']]

Q[s['low'], a['search']] = beta * (r_search + gamma * V[s['low']]) + (1 - beta) * (-3 + gamma*V[s['high']])
Q[s['low'], a['wait']] = r_wait + gamma * V[s['low']]
Q[s['low'], a['recharge']] = gamma * V[s['high']]

V[s['high']] = Q[s['high'], :2].max()
V[s['low']] = Q[s['low'], :].max()

V_high_history.append(V[s['high']])
V_low_history.append(V[s['low']])

# Compute the greedy policy
pi_old = pi.copy()
pi[s['high']] = Q[s['high'], :2].argmax()
pi[s['low']] = Q[s['low'], :].argmax()

print('Greedy policy after iteration', k)
print('pi(high)=', pi[s['high']])
print('pi(low)=', pi[s['low']])
print("V=", V)
print("Q=", Q)

plt.figure(figsize=(10, 6))
plt.plot(V_high_history, label="high")
plt.plot(V_low_history, label="low")
plt.legend()
plt.show()
Greedy policy after iteration 49
pi(high)= 0
pi(low)= 1
V= [7.25274725 2.85714286]
Q= [[7.25274725 4.17582418 0.        ]
[0.71208791 2.85714286 2.17582418]]

A: The agent now decides to wait in the low state (r=2) instead of recharging (r=0) and then be in the high state (r=6). The agent is so greedy that it cannot stand not getting reward for one step, although he will collect much more reard later.

Q: Change \gamma to 0.99 (far-sighted agent). What does it change and why?

# Transition probabilities
alpha = 0.3
beta = 0.2

# Discount parameter
gamma = 0.99

# Expected rewards
r_search = 6.0
r_wait = 2.0

V = np.zeros(nb_states)
Q = np.zeros((nb_states, nb_actions))

pi = np.array([a['search'], a['search']], dtype=int)

V_high_history = []
V_low_history = []

for k in range(50):

# Policy evaluation
Q[s['high'], a['search']] = alpha * (r_search + gamma * V[s['high']]) + (1 - alpha) * (r_search + gamma*V[s['low']])
Q[s['high'], a['wait']] = r_wait + gamma * V[s['high']]

Q[s['low'], a['search']] = beta * (r_search + gamma * V[s['low']]) + (1 - beta) * (-3 + gamma*V[s['high']])
Q[s['low'], a['wait']] = r_wait + gamma * V[s['low']]
Q[s['low'], a['recharge']] = gamma * V[s['high']]

V[s['high']] = Q[s['high'], :2].max()
V[s['low']] = Q[s['low'], :].max()

V_high_history.append(V[s['high']])
V_low_history.append(V[s['low']])

# Compute the greedy policy
pi_old = pi.copy()
pi[s['high']] = Q[s['high'], :2].argmax()
pi[s['low']] = Q[s['low'], :].argmax()

print('Greedy policy after iteration', k)
print('pi(high)=', pi[s['high']])
print('pi(low)=', pi[s['low']])
print("V=", V)
print("Q=", Q)

plt.figure(figsize=(10, 6))
plt.plot(V_high_history, label="high")
plt.plot(V_low_history, label="low")
plt.legend()
plt.show()
Greedy policy after iteration 49
pi(high)= 0
pi(low)= 2
V= [141.37219244 137.82818773]
Q= [[141.37219244 139.82818773   0.        ]
[135.92647479 136.31962304 137.82818773]]

A: The optimal policy stays the same (search in high, recharge in low) but the V values grow very high. The difference between the values of the high and low state is comparatively very small: the high state is always only one action away from the low state, it is nothing with such a high gamma.

Q: Change the parameters to:

Find the optimal policy. What is the optimal action to be taken in the state high, although the probability to stay in this state is very small? Why?

# Transition probabilities
alpha = 0.01
beta = 0.2

# Discount parameter
gamma = 0.7

# Expected rewards
r_search = 6.0
r_wait = 5.0

V = np.zeros(nb_states)
Q = np.zeros((nb_states, nb_actions))

pi = np.array([a['search'], a['search']], dtype=int)

V_high_history = []
V_low_history = []

for k in range(50):

# Policy evaluation
Q[s['high'], a['search']] = alpha * (r_search + gamma * V[s['high']]) + (1 - alpha) * (r_search + gamma*V[s['low']])
Q[s['high'], a['wait']] = r_wait + gamma * V[s['high']]

Q[s['low'], a['search']] = beta * (r_search + gamma * V[s['low']]) + (1 - beta) * (-3 + gamma*V[s['high']])
Q[s['low'], a['wait']] = r_wait + gamma * V[s['low']]
Q[s['low'], a['recharge']] = gamma * V[s['high']]

V[s['high']] = Q[s['high'], :2].max()
V[s['low']] = Q[s['low'], :].max()

V_high_history.append(V[s['high']])
V_low_history.append(V[s['low']])

# Compute the greedy policy
pi_old = pi.copy()
pi[s['high']] = Q[s['high'], :2].argmax()
pi[s['low']] = Q[s['low'], :].argmax()

print('Greedy policy after iteration', k)
print('pi(high)=', pi[s['high']])
print('pi(low)=', pi[s['low']])
print("V=", V)
print("Q=", Q)

plt.figure(figsize=(10, 6))
plt.plot(V_high_history, label="high")
plt.plot(V_low_history, label="low")
plt.legend()
plt.show()
Greedy policy after iteration 49
pi(high)= 0
pi(low)= 1
V= [17.67371571 16.66666637]
Q= [[17.67371571 17.37160091  0.        ]
[11.030614   16.66666637 12.37160091]]

A: The agent now decides to wait in the low state and accumulate quite a lot of rewards (r=5, compared to 6 while searching). But it is still worth searching in the high state: even if we transition immediately into low with 99% probability, one still gets 6 instead of 5, so the return is higher than when waiting in high.

Q: Find a set of parameters where it would be optimal to search while in the low state.

# Transition probabilities
alpha = 0.01
beta = 0.8

# Discount parameter
gamma = 0.7

# Expected rewards
r_search = 10.0
r_wait = 5.0

V = np.zeros(nb_states)
Q = np.zeros((nb_states, nb_actions))

pi = np.array([a['search'], a['search']], dtype=int)

V_high_history = []
V_low_history = []

for k in range(50):

# Policy evaluation
Q[s['high'], a['search']] = alpha * (r_search + gamma * V[s['high']]) + (1 - alpha) * (r_search + gamma*V[s['low']])
Q[s['high'], a['wait']] = r_wait + gamma * V[s['high']]

Q[s['low'], a['search']] = beta * (r_search + gamma * V[s['low']]) + (1 - beta) * (-3 + gamma*V[s['high']])
Q[s['low'], a['wait']] = r_wait + gamma * V[s['low']]
Q[s['low'], a['recharge']] = gamma * V[s['high']]

V[s['high']] = Q[s['high'], :2].max()
V[s['low']] = Q[s['low'], :].max()

V_high_history.append(V[s['high']])
V_low_history.append(V[s['low']])

# Compute the greedy policy
pi_old = pi.copy()
pi[s['high']] = Q[s['high'], :2].argmax()
pi[s['low']] = Q[s['low'], :].argmax()

print('Greedy policy after iteration', k)
print('pi(high)=', pi[s['high']])
print('pi(low)=', pi[s['low']])
print("V=", V)
print("Q=", Q)

plt.figure(figsize=(10, 6))
plt.plot(V_high_history, label="high")
plt.plot(V_low_history, label="low")
plt.legend()
plt.show()
Greedy policy after iteration 49
pi(high)= 0
pi(low)= 0
V= [28.03236199 25.7375694 ]
Q= [[28.03236199 24.62265325  0.        ]
[25.7375694  23.01629844 19.62265325]]