import numpy as np
import matplotlib.pyplot as plt
Sampling
In this first exercise, we will investigate how to evaluate the Q-value of each action available in a 5-armed bandit. The goal is mainly to build intuition about the limits of sampling and the central limit theorem.
Let’s start with importing numpy and matplotlib:
Sampling an n-armed bandit
Let’s now create the n-armed bandit. The only thing we need to do is to randomly choose 5 true Q-values Q^*(a).
To be generic, let's define nb_actions=5
and create an array corresponding to the index of each action (0, 1, 2, 3, 4) for plotting purposes.
nb_actions = 5
actions = np.arange(nb_actions)
Q: Create a numpy array Q_star
with nb_actions
values, normally distributed with a mean of 0 and standard deviation of 1 (as in the lecture).
rng = np.random.default_rng()
Q_star = rng.normal(0, 1, nb_actions)
Q: Plot the Q-values. Identify the optimal action a^*.
Tip: you could plot the array Q_star
with plt.plot
, but that would be ugly. Check the documentation of the plt.bar
method.
print("Optimal action:", Q_star.argmax())
plt.figure(figsize=(10, 6))
plt.bar(actions, Q_star)
plt.xlabel('Actions')
plt.ylabel('$Q^*(a)$')
plt.show()
Optimal action: 0
Great, now let’s start evaluating these Q-values with random sampling.
Q: Define an action sampling method get_reward
taking as arguments: * The array Q_star
. * The index a
of the action you want to sample (between 0 and 4). * An optional variance argument var
, which should have the value 1.0 by default.
It should return a single value, sampled from the normal distribution with mean Q_star[a]
and variance var
.
def get_reward(Q_star, a, var=1.0):
    # Note: rng.normal() expects a standard deviation as its scale argument,
    # so var is passed directly as the scale here.
    return float(rng.normal(Q_star[a], var, 1))
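As a quick sanity check (a standalone sketch with its own generator and hypothetical demo names, not part of the exercise), averaging many samples returned by such a function should recover the underlying Q-value:

```python
import numpy as np

demo_rng = np.random.default_rng(42)  # separate generator for this check

def demo_get_reward(Q_star, a, var=1.0):
    # Same signature as get_reward above, using the demo generator
    return float(demo_rng.normal(Q_star[a], var, 1))

demo_Q_star = np.array([0.5, -1.0, 2.0])
samples = [demo_get_reward(demo_Q_star, 2) for _ in range(10_000)]
print(np.mean(samples))  # should be close to 2.0 (within roughly 0.03)
```

With 10,000 samples and unit variance, the standard deviation of the sample average is 1/√10000 = 0.01, so the printed mean should lie very close to the true value 2.0.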
Q: For each possible action a, draw nb_samples=10
samples from the reward distribution and store them in a numpy array. Compute the mean of the samples for each action separately in a new array Q_t
. Make a bar plot of these estimated Q-values.
nb_samples = 10
rewards = np.zeros((nb_actions, nb_samples))

for a in actions:
    for play in range(nb_samples):
        rewards[a, play] = get_reward(Q_star, a, var=1.0)

Q_t = np.mean(rewards, axis=1)

plt.figure(figsize=(10, 6))
plt.bar(actions, Q_t)
plt.xlabel('Actions')
plt.ylabel('$Q_t(a)$')
plt.show()
Q: Make a bar plot of the difference between the true values Q_star
and the estimates Q_t
. Conclude. Re-run the sampling cell with different numbers of samples.
plt.figure(figsize=(10, 6))
plt.bar(actions, Q_t - Q_star)
plt.xlabel('Actions')
plt.ylabel('$Q_t(a) - Q^*(a)$')
plt.show()
Q: To better understand the influence of the number of samples on the accuracy of the sample average, create a for
loop over the preceding code, with a number of samples increasing from 1 to 100. For each value, compute the mean square error (mse) between the estimates Q_t
and the true values Q^*
.
The mean square error is simply defined over the N = nb_actions
actions as:
\epsilon = \frac{1}{N} \, \sum_{a=0}^{N-1} (Q_t(a) - Q^*(a))^2
At the end of the loop, plot the evolution of the mean square error with the number of samples. You can append each value of the mse in an empty list and then plot it with plt.plot
, for example.
errors = []
for nb_sample in range(1, 100):
    rewards = np.zeros((nb_actions, nb_sample))
    for a in actions:
        for play in range(nb_sample):
            rewards[a, play] = get_reward(Q_star, a, var=1.0)
    Q_t = np.mean(rewards, axis=1)
    error = np.mean((Q_star - Q_t)**2)
    errors.append(error)

plt.figure(figsize=(10, 6))
plt.plot(errors)
plt.show()
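The double loop above can also be vectorized: rng.normal accepts a size argument and broadcasts its mean over it, so all rewards can be drawn in a single call and the running means computed with a cumulative sum. A sketch under those assumptions (variable names are my own, not from the exercise):

```python
import numpy as np

rng_v = np.random.default_rng(0)
nb_actions = 5
Q_star = rng_v.normal(0, 1, nb_actions)

max_samples = 100
# One (nb_actions, max_samples) matrix of rewards, drawn in a single call:
# the column of means Q_star[:, None] broadcasts against the requested shape.
rewards = rng_v.normal(Q_star[:, None], 1.0, (nb_actions, max_samples))

# Cumulative mean over the sample axis gives Q_t for every sample count at once
Q_t = np.cumsum(rewards, axis=1) / np.arange(1, max_samples + 1)

# MSE between estimates and true values, one entry per number of samples
errors = np.mean((Q_t - Q_star[:, None])**2, axis=0)
print(errors.shape)
```

This computes the same curve without any Python loop, which matters once the number of samples grows large.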
The plot should give you an indication of at least how many samples you need to correctly estimate each action (around 30). But according to the central limit theorem (CLT), the variance of the sample average also depends on the variance of the distribution itself.
The distribution of sample averages is normally distributed with mean \mu and variance \frac{\sigma^2}{N}:
S_N \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{N}\right)
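This prediction can be checked empirically (a quick standalone sketch, not part of the exercise): the standard deviation of many independent sample averages should match \sigma / \sqrt{N}:

```python
import numpy as np

rng_clt = np.random.default_rng(1)
sigma, N = 2.0, 50

# 10,000 independent sample averages of N draws each
averages = rng_clt.normal(0.0, sigma, (10_000, N)).mean(axis=1)

print(averages.std())       # empirical std of the sample average
print(sigma / np.sqrt(N))   # CLT prediction: sigma / sqrt(N)
```

The two printed values should agree to about two decimal places, confirming that the spread of the sample average shrinks as 1/\sqrt{N}.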
Q: Vary the variance of the reward distribution (as an argument to get_reward
) and re-run the previous experiment. Do not hesitate to take more samples. Conclude.
errors = []
for nb_sample in range(1, 1000):
    rewards = np.zeros((nb_actions, nb_sample))
    for a in actions:
        for play in range(nb_sample):
            rewards[a, play] = get_reward(Q_star, a, var=10.0)
    Q_t = np.mean(rewards, axis=1)
    error = np.mean((Q_star - Q_t)**2)
    errors.append(error)
print(error)

plt.figure(figsize=(10, 6))
plt.plot(errors)
plt.show()
0.18921383192673544
A: The higher the variance of the reward distribution, the more samples we need to get correct estimates.
Bandit environment
In order to prepare for the next exercise, let's now implement the n-armed bandit as a Python class. As a reminder from the tutorial on Python, a class is defined using this structure:
class MyClass:
    """
    Documentation of the class.
    """
    def __init__(self, param1, param2):
        """
        Constructor of the class.

        :param param1: first parameter.
        :param param2: second parameter.
        """
        self.param1 = param1
        self.param2 = param2

    def method(self, another_param):
        """
        Method to do something.

        :param another_param: another parameter.
        """
        return (another_param + self.param1)/self.param2
You can then create an object of the type MyClass
:
my_object = MyClass(param1=1.0, param2=2.0)
and call any method of the class on the object:
result = my_object.method(3.0)
Q: Create a Bandit
class taking as arguments:
- nb_actions: number of arms.
- mean: mean of the normal distribution for Q^*.
- std_Q: standard deviation of the normal distribution for Q^*.
- std_r: standard deviation of the normal distribution for the sampled rewards.
The constructor should initialize a Q_star
array accordingly and store it as an attribute. It should also store the optimal action.
Add a method step(action)
that samples a reward for a particular action and returns it.
class Bandit:
    """
    n-armed bandit.
    """
    def __init__(self, nb_actions, mean=0.0, std_Q=1.0, std_r=1.0):
        """
        :param nb_actions: number of arms.
        :param mean: mean of the normal distribution for $Q^*$.
        :param std_Q: standard deviation of the normal distribution for $Q^*$.
        :param std_r: standard deviation of the normal distribution for the sampled rewards.
        """
        # Store parameters
        self.nb_actions = nb_actions
        self.mean = mean
        self.std_Q = std_Q
        self.std_r = std_r
        # Initialize the true Q-values
        self.Q_star = rng.normal(self.mean, self.std_Q, self.nb_actions)
        # Optimal action
        self.a_star = self.Q_star.argmax()

    def step(self, action):
        """
        Sample a single reward from the bandit.

        :param action: the selected action.
        :return: a reward.
        """
        return float(rng.normal(self.Q_star[action], self.std_r, 1))
Q: Create a 5-armed bandit and sample each action multiple times. Compare the mean rewards to the ground truth as before.
nb_actions = 5
bandit = Bandit(nb_actions)

all_rewards = []
for t in range(1000):
    rewards = []
    for a in range(nb_actions):
        rewards.append(bandit.step(a))
    all_rewards.append(rewards)

mean_reward = np.mean(all_rewards, axis=0)

plt.figure(figsize=(20, 6))
plt.subplot(131)
plt.bar(range(nb_actions), bandit.Q_star)
plt.xlabel("Actions")
plt.ylabel("$Q^*(a)$")
plt.subplot(132)
plt.bar(range(nb_actions), mean_reward)
plt.xlabel("Actions")
plt.ylabel("$Q_t(a)$")
plt.subplot(133)
plt.bar(range(nb_actions), np.abs(bandit.Q_star - mean_reward))
plt.xlabel("Actions")
plt.ylabel("Absolute error")
plt.show()