Function approximation
Limits of tabular RL
All the methods seen so far belong to tabular RL. Q-learning necessitates to store in a Q-table one Q-value per state-action pair (s, a).
If a state has never been visited during learning, the Q-values will still be at their initial value (0.0), no policy can be derived.
Similar states likely have the same optimal action: we want to be able to generalize the policy between states. For most realistic problems, the size of the Q-table becomes quickly untractable. If you use black-and-white 256x256 images as inputs, you have 2^{256 * 256} = 10^{19728} possible states! Tabular RL is therefore limited to toy problems.
Tabular RL also only works for small discrete action spaces. Robots have continuous action spaces, where the actions are changes in joint angles or torques. A joint angle could for example take any value in [0, \pi]. A solution would be to discretize the action space (one action per degree), but we would fall into the curse of dimensionality.
The more degrees of freedom, the more discrete actions, the more entries in the Q-table… Tabular RL cannot deal with continuous action spaces, unless we approximate the policy with an actor-critic architecture.
Function approximation
Feature vectors
Let’s represent a state s by a vector of d features \phi(s) = [\phi_1(s), \phi_2(s), \ldots, \phi_d(s)]^T. For the cartpole, the feature vector would be:
\phi(s) = \begin{bmatrix}x \\ \dot{x} \\ \theta \\ \dot{\theta} \end{bmatrix}
where x is the position, \theta the angle, \dot{x} and \dot{\theta} their derivatives. We are able to represent any state s of the cartpole problem using these four variables.
For more complex problems, the feature vector should include all the necessary information (Markov property). Example of breakout:
\phi(s) = \begin{bmatrix} x \, \text{position of the paddle} \\ x \, \text{position of the ball} \\ y \, \text{position of the ball} \\ x \, \text{speed of the ball} \\ y \, \text{speed of the position} \\ \text{presence of brick 1} \\ \text{presence of brick 2} \\ \vdots \\ \end{bmatrix}
In deep RL, we will learn these feature vectors, but let’s suppose for now that we have them.
Note that we can always fall back to the tabular case using one-hot encoding of the states:
\phi(s_1) = \begin{bmatrix}1\\0\\0\\ \ldots\\ 0\end{bmatrix} \qquad \phi(s_2) = \begin{bmatrix}0\\1\\0\\ \ldots\\ 0\end{bmatrix}\qquad \phi(s_3) = \begin{bmatrix}0\\0\\1\\ \ldots\\ 0\end{bmatrix} \qquad \ldots
But the idea is that we can represent states with much less values than the number of states:
d \ll |\mathcal{S}|
We can also represent continuous state spaces with feature vectors, as in cartpole.
State value approximation
In state value approximation, we want to approximate the state value function V^\pi(s) with a parameterized function V_\varphi(s):
V_\varphi(s) \approx V^\pi(s)
The parameterized function can have any form. Its has a set of parameters \varphi used to transform the feature vector \phi(s) into an approximated value V_\varphi(s).
The simplest function approximator (FA) is the linear approximator.
The approximated value is a linear combination of the features:
V_\varphi(s) = \sum_{i=1}^d w_i \, \phi_i(s) = \mathbf{w}^T \times \phi(s)
The weight vector \mathbf{w} = [w_1, w_2, \ldots, w_d]^Tis the set of parameters \varphi of the function. A linear approximator is a single artificial neuron (linear regression) without a bias.
Regardless the form of the function approximator, we want to find the parameters \varphi making the approximated values V_\varphi(s) as close as possible from the true values V^\pi(s) for all states s.
This is a regression problem, so we want to minimize the mean-square error between the two quantities:
\min_\varphi \mathcal{L}(\varphi) = \mathbb{E}_{s \in \mathcal{S}} [ (V^\pi(s) - V_\varphi(s))^2]
The loss function \mathcal{L}(\varphi) is minimal when the predicted values are close to the true ones on average over the state space.
Let’s suppose for now that we know the true state values V^\pi(s) for all states and that the parametrized function is differentiable. We can find the minimum of the loss function by applying gradient descent (GD) iteratively:
\Delta \varphi = - \eta \, \nabla_\varphi \mathcal{L}(\varphi)
\nabla_\varphi \mathcal{L}(\varphi) is the gradient of the loss function w.r.t to the parameters \varphi.
\nabla_\varphi \mathcal{L}(\varphi) = \begin{bmatrix} \frac{\partial \mathcal{L}(\varphi)}{\partial \varphi_1} \\ \frac{\partial \mathcal{L}(\varphi)}{\partial \varphi_2} \\ \ldots \\ \frac{\partial \mathcal{L}(\varphi)}{\partial \varphi_K} \\ \end{bmatrix}
When applied repeatedly, GD converges to a local minimum of the loss function.
In order to minimize the mean square error, we will iteratively modify the parameters \varphi according to:
\begin{aligned} \Delta \varphi = \varphi_{k+1} - \varphi_n & = - \eta \, \nabla_\varphi \mathcal{L}(\varphi) \\ &\\ & = - \eta \, \nabla_\varphi \mathbb{E}_{s \in \mathcal{S}} [ (V^\pi(s) - V_\varphi(s))^2] \\ &\\ & = \mathbb{E}_{s \in \mathcal{S}} [- \eta \, \nabla_\varphi (V^\pi(s) - V_\varphi(s))^2] \\ &\\ & = \mathbb{E}_{s \in \mathcal{S}} [\eta \, (V^\pi(s) - V_\varphi(s)) \, \nabla_\varphi V_\varphi(s)] \\ \end{aligned}
As it would be too slow to compute the expectation on the whole state space (batch algorithm), we will sample the quantity:
\delta_\varphi = \eta \, (V^\pi(s) - V_\varphi(s)) \, \nabla_\varphi V_\varphi(s)
and update the parameters with stochastic gradient descent (SGD).
If we sample K states s_i from the state space, we get:
\Delta \varphi = \eta \, \frac{1}{K} \sum_{k=1}^K (V^\pi(s_k) - V_\varphi(s_k)) \, \nabla_\varphi V_\varphi(s_k)
We can also sample a single state s (online algorithm):
\Delta \varphi = \eta \, (V^\pi(s) - V_\varphi(s)) \, \nabla_\varphi V_\varphi(s)
Unless stated otherwise, we will sample single states in this section, but beware that the parameter updates will be noisy (high variance).
The approximated value is a linear combination of the features:
V_\varphi(s) = \sum_{i=1}^d w_i \, \phi_i(s) = \mathbf{w}^T \times \phi(s)
The weights are updated using stochastic gradient descent:
\Delta \mathbf{w} = \eta \, (V^\pi(s) - V_\varphi(s)) \, \phi(s)
This is the delta learning rule of linear regression and classification, with \phi(s) being the input vector and V^\pi(s) - V_\varphi(s) the prediction error.
The rule can be used with any function approximator, we only need to be able to differentiate it:
\Delta \varphi = \eta \, (V^\pi(s) - V_\varphi(s)) \, \nabla_\varphi V_\varphi(s)
The problem is that we do not know V^\pi(s), as it is what we are trying to estimate. We can replace V^\pi(s) by a sampled estimate using Monte-Carlo or TD:
- Monte-Carlo function approximation:
\Delta \varphi = \eta \, (R_t - V_\varphi(s)) \, \nabla_\varphi V_\varphi(s)
- Temporal Difference function approximation:
\Delta \varphi = \eta \, (r_{t+1} + \gamma \, V_\varphi(s') - V_\varphi(s)) \, \nabla_\varphi V_\varphi(s)
As in tabular RL, Gradient Monte-Carlo has no bias (real returns) but a high variance. Semi-gradient TD has less variance, but a significant bias as V_\varphi(s_{t+1}) is initially wrong. You can never trust these estimates completely.
Note that for Temporal Difference, we actually want to minimize the TD reward-prediction error for all states, i.e. the surprise:
\mathcal{L}(\varphi) = \mathbb{E}_{s \in \mathcal{S}} [ (r_{t+1} + \gamma \, V_\varphi(s') - V_\varphi(s))^2]= \mathbb{E}_{s \in \mathcal{S}} [ \delta_t^2]
Action value approximation
Q-values can be approximated by a parameterized function Q_\theta(s, a) in the same manner. There are basically two options for the structure of the function approximator:
- The FA takes a feature vector for both the state s and the action a (which can be continuous) as inputs, and outputs a single Q-value Q_\theta(s ,a).
- The FA takes a feature vector for the state s as input, and outputs one Q-value Q_\theta(s ,a) per possible action (the action space must be discrete).
In both cases, we minimize the mse between the true value Q^\pi(s, a) and the approximated value Q_\theta(s, a).
Feature construction
Linear features
Before we dive into deep RL (i.e. RL with non-linear FA), let’s see how we can design good feature vectors for linear function approximation. The problem with deep NN is that they need a lot of samples to converge, what worsens the fundamental problem of RL: sample efficiency. By engineering the right features, we could use linear approximators, which converge much faster. The convergence of linear FA is guaranteed, not (always) non-linear ones.
Why do we need to choose features? For the cartpole, the feature vector \phi(s) could be:
\phi(s) = \begin{bmatrix}x \\ \dot{x} \\ \theta \\ \dot{\theta} \end{bmatrix}
where x is the position, \theta the angle, \dot{x} and \dot{\theta} their derivatives. Can we predict the value of a state linearly?
V_\varphi(s) = \sum_{i=1}^d w_i \, \phi_i(s) = \mathbf{w}^T \times \phi(s)
This answer is no, as a high angular velocity \dot{\theta} is good when the pole is horizontal (going up) but bad if the pole is vertical (will not stop). The value would depends linearly on something like \dot{\theta} \, \sin \theta, which is a non-linear combination of features.
Let’s suppose we have a simple problem where the state s is represented by two continuous variables x and y. The true value function V^\pi(s) is a non-linear function of x and y.
If we apply linear FA directly on the feature vector [x, y], we catch the tendency of V^\pi(s) but we make a lot of bad predictions: high bias (underfitting).
Polynomial features
To introduce non-linear relationships between continuous variables, a simple method is to construct the feature with polynomials of the variables.
Example with polynomials of order 2:
\phi(s) = \begin{bmatrix}1 & x & y & x\, y & x^2 & y^2 \end{bmatrix}^T
We transform the two input variables x and y into a vector with 6 elements. The 1 (order 0) is there to learn the offset / bias.
Example with polynomials of order 3:
\phi(s) = \begin{bmatrix}1 & x & y & x\, y & x^2 & y^2 & x^2 \, y & x \, y^2 & x^3 & y^3\end{bmatrix}^T
We then just need to apply linear FA on these feature vectors (polynomial regression).
V_\varphi(s) = w_0 + w_1 \, x + w_2 \, y + w_3 \, x \, y + w_4 \, x^2 + w_5 \, y^2 + \ldots
The higher the degree of the polynomial, the better the fit, but the number of features grows exponentially. This adds to the computational complexity and leads to overfitting: if we only sample some states, high-order polynomials will not interpolate correctly.
Fourier transforms
Instead of approximating a state variable x by a polynomial:
V_\varphi(s) = w_0 + w_1 \, x + w_2 \, x^2 + w_3 \, x^3 + \ldots
we could also use its Fourier decomposition (here DCT, discrete cosine transform):
V_\varphi(s) = w_0 + w_1 \, \cos(\pi \, x) + w_2 \, \cos( 2 \, \pi \, x) + w_3 \, \cos(3 \, \pi \, x) + \ldots
The Fourier theorem tells us that, if we take enough frequencies, we can reconstruct the signal V_\varphi(s) perfectly.
It is just a change of basis, the problem stays a linear regression to find w_0, w_1, w_2, etc.
Fourier transforms can be applied on multivariate functions as well.
A Fourier basis tends to work better than a polynomial basis. The main problem is that the number of features increases very fast with the number of input dimensions and the desired precision (higher-order polynomials, more frequencies).
Discrete coding
An obvious solution for continuous state variables is to discretize the input space. The input space is divided into a grid of non-overlapping tiles.
The feature vector is a binary vector with a 1 when the input is inside a tile, 0 otherwise.
\phi(s) = \begin{bmatrix}0 & 0 & \ldots & 0 & 1 & 0 & \ldots & 0 \\ \end{bmatrix}^T
This ensures generalization inside a tile: you only need a couple of samples inside a tile to know the mean value of all the states. Drawbacks: the value function is step-like (discontinuous), the correct size of a tile is not known, we fall into the curse of dimensionality.
Coarse coding
A more efficient solution is coarse coding. The tiles (rectangles, circles, or what you need) need to overlap.
A state s is encoded by a binary vector, but with several 1, for each tile it belongs.
\phi(s) = \begin{bmatrix}0 & 1 & 0 & \ldots & 1 & 1 & 0 & \ldots & 0 \\ \end{bmatrix}^T
This allows generalization inside a tile, but also across tiles. The size and shape of the “receptive field” influences the generalization properties.
Tile coding
A simple way to ensure that tiles overlap is to use several regular grids with an offset. Each tiling will be coarse, but the location of a state will be quite precise as it may belong to many tiles.
This helps against the curse of dimensionality: high precision, but the number of tiles does not grow exponentially.
Radial-basis functions (RBF)
The feature vector in tile coding is a binary vector: there will be discontinuous jumps in the approximated value function when moving between tiles. We can use radial-basis functions (RBF) such as Gaussians to map the state space.
We set a set of centers \{c_i\}_{i=1}^K in the input space on a regular grid (or randomly). Each element of the feature vector will be a Gaussian function of the distance between the state s and one center:
\phi_i(s) = \exp \frac{-(s - c_i)^2}{2\, \sigma_i^2}
The approximated value function now represents continuously the states:
V_\varphi(s) = \sum_{i=1}^d w_i \, \phi_i(s) = \sum_{i=1}^d w_i \, \exp \frac{-(s - c_i)^2}{2\, \sigma_i^2}
If you have enough centers and they overlap sufficiently, you can even decode the original state perfectly:
\hat{s} = \sum_{i=1}^d \phi_i(s) \, c_i