Linear regression
Professur für Künstliche Intelligenz - Fakultät für Informatik
We have a training set of N examples \mathcal{D} = (x_i, t_i)_{i=1..N}.
In linear regression, we want to learn a linear model (hypothesis) y that is linearly dependent on the input x:
y = f_{w, b}(x) = w \, x + b
The free parameters of the model are
the slope w,
the intercept b.
The data \mathcal{D} = (x_i, t_i)_{i=1..N} is given (fixed).
y = f_{w, b}(x) = w \, x + b
This corresponds to a single artificial neuron y with:
one input x,
one weight w,
one bias b,
a linear activation function.
We will see that this generalizes to multiple inputs and outputs.
The goal of the linear regression (or least mean squares - LMS) is to minimize the mean square error (mse) between the targets and the predictions.
It is defined as the mathematical expectation of the quadratic error over the training data:
\mathcal{L}(w, b) = \mathbb{E}_{x_i, t_i \in \mathcal{D}} [ (t_i - y_i )^2 ]
\mathcal{L}(w, b) = \frac{1}{N} \, \sum_{i=1}^{N} (t_i - y_i )^2
The minimum of the mse is achieved when the prediction y_i = f_{w, b}(x_i) is equal to the ground truth t_i for all training examples.
In other words, we want to minimize the residual error of the model on the data.
It is usually not possible to reach the global minimum of 0 (the data is noisy), but the closer the loss gets to it, the better.
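As a quick illustration, here is a minimal numpy sketch (with made-up data) that evaluates the mse of a given (w, b) on a toy training set:

```python
import numpy as np

# Made-up toy training set, roughly t = 2 x + 1 plus noise
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.1, 2.9, 5.2, 6.8])

def mse(w, b):
    y = w * x + b                 # predictions of the linear model
    return np.mean((t - y)**2)    # mean squared error over the training set

print(mse(2.0, 1.0))   # close to 0: a good fit
print(mse(0.0, 0.0))   # much larger: a bad fit
```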
\mathcal{L}(w, b) = \frac{1}{N} \, \sum_{i=1}^{N} (t_i - y_i )^2
To minimize the loss, gradient descent modifies the parameters in the opposite direction of the gradient of the loss function, with a small learning rate \eta:
\Delta w = - \eta \, \frac{\partial \mathcal{L}(w, b)}{\partial w}
\Delta b = - \eta \, \frac{\partial \mathcal{L}(w, b)}{\partial b}
\frac{\partial \mathcal{L}(w, b)}{\partial w} = \frac{\partial}{\partial w} [\frac{1}{N} \, \sum_{i=1}^{N} (t_i - y_i )^2]
\frac{\partial \mathcal{L}(w, b)}{\partial w} = \frac{1}{N} \, \sum_{i=1}^{N} \frac{\partial}{\partial w} (t_i - y_i )^2
\frac{\partial \mathcal{L}(w, b)}{\partial w} = \frac{1}{N} \, \sum_{i=1}^{N} \frac{\partial}{\partial w} \mathcal{l}_i(w, b) \qquad \text{with} \qquad \mathcal{l}_i(w, b) = (t_i - y_i )^2
The individual loss \mathcal{l}_i(w, b) = (t_i - y_i )^2 is the composition of two functions:
a square error function g_i(y_i) = (t_i - y_i)^2.
the prediction y_i = f_{w, b}(x_i) = w \, x_i + b.
The chain rule tells us how to derive such composite functions:
\frac{ d f(g(x))}{dx} = \frac{ d f(g(x))}{d g(x)} \times \frac{ d g(x)}{dx} = \frac{ d f(y)}{dy} \times \frac{ d g(x)}{dx}
The first derivative considers g(x) to be a single variable.
Applied to our problem, this gives:
\frac{\partial}{\partial w} \mathcal{l}_i(w, b) = \frac{\partial g_i(y_i)}{\partial y_i} \times \frac{\partial y_i}{\partial w}
\frac{\partial g_i(y_i)}{\partial y_i} = - 2 \, (t_i - y_i)
\frac{\partial y_i}{\partial w} = x_i
\frac{\partial y_i}{\partial b} = 1
\frac{\partial \mathcal{l}_i(w, b)}{\partial w} = - 2 \, (t_i - y_i) \, x_i
\frac{\partial \mathcal{l}_i(w, b)}{\partial b} = - 2 \, (t_i - y_i)
\frac{\partial \mathcal{L}(w, b)}{\partial w} = - \frac{2}{N} \sum_{i=1}^{N} (t_i - y_i) \, x_i
\frac{\partial \mathcal{L}(w, b)}{\partial b} = - \frac{2}{N} \sum_{i=1}^{N} (t_i - y_i)
\Delta w = \eta \, \frac{1}{N} \sum_{i=1}^{N} (t_i - y_i) \, x_i
\Delta b = \eta \, \frac{1}{N} \sum_{i=1}^{N} (t_i - y_i)
\begin{cases} \Delta w = \eta \, \dfrac{1}{N} \displaystyle\sum_{i=1}^{N} (t_i - y_i) \, x_i \\ \\ \Delta b = \eta \, \dfrac{1}{N} \displaystyle\sum_{i=1}^{N} (t_i - y_i) \\ \end{cases}
The parameter changes have to be applied multiple times (epochs) in order for the parameters to converge.
One can stop when the parameters do not change much, or after a fixed number of epochs.
Least Mean Squares algorithm
w=0 \quad;\quad b=0
for M epochs:
dw=0 \quad;\quad db=0
for each sample (x_i, t_i):
y_i = w \, x_i + b
dw = dw + (t_i - y_i) \, x_i
db = db + (t_i - y_i)
\Delta w = \eta \, \frac{1}{N} \, dw
\Delta b = \eta \, \frac{1}{N} \, db
w = w + \Delta w \quad;\quad b = b + \Delta b
\begin{cases} \Delta w = \eta \, \dfrac{1}{N} \displaystyle\sum_{i=1}^{N} (t_i - y_i) \, x_i \\ \\ \Delta b = \eta \, \dfrac{1}{N} \displaystyle\sum_{i=1}^{N} (t_i - y_i) \\ \end{cases}
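A direct Python translation of this batch algorithm could look as follows (a sketch with made-up data; the learning rate \eta and the number of epochs M are arbitrary choices):

```python
import numpy as np

# Made-up toy data, roughly t = 2 x + 1 plus noise
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.1, 2.9, 5.2, 6.8])
N = len(x)

w, b = 0.0, 0.0
eta, M = 0.1, 500                 # learning rate and number of epochs (arbitrary)
for epoch in range(M):
    dw, db = 0.0, 0.0
    for x_i, t_i in zip(x, t):
        y_i = w * x_i + b
        dw += (t_i - y_i) * x_i
        db += (t_i - y_i)
    w += eta * dw / N             # Delta w = eta * (1/N) * sum (t_i - y_i) x_i
    b += eta * db / N             # Delta b = eta * (1/N) * sum (t_i - y_i)

print(w, b)                       # should be close to the slope 2 and intercept 1
```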
Online version of LMS: the delta learning rule
w=0 \quad;\quad b=0
for M epochs:
for each sample (x_i, t_i):
y_i = w \, x_i + b
\Delta w = \eta \, (t_i - y_i ) \, x_i
\Delta b = \eta \, (t_i - y_i)
w = w + \Delta w \quad;\quad b = b + \Delta b
\begin{cases} \Delta w = \eta \, (t_i - y_i) \, x_i \\ \\ \Delta b = \eta \, (t_i - y_i)\\ \end{cases}
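The online version is equally short in Python (again a sketch on made-up data; the update is now applied after every sample):

```python
import numpy as np

# Same made-up toy data, roughly t = 2 x + 1 plus noise
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.1, 2.9, 5.2, 6.8])

w, b = 0.0, 0.0
eta, M = 0.05, 500                     # smaller learning rate, as updates are per sample
for epoch in range(M):
    for x_i, t_i in zip(x, t):
        y_i = w * x_i + b
        w += eta * (t_i - y_i) * x_i   # Delta w = eta (t_i - y_i) x_i
        b += eta * (t_i - y_i)         # Delta b = eta (t_i - y_i)

print(w, b)                            # fluctuates slightly around the batch solution
```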
Multiple Linear Regression (MLR) predicts several output variables based on several explanatory variables or features: \begin{cases} y_1 = w_1 \, x_1 + w_2 \, x_2 + b_1\\ \\ y_2 = w_3 \, x_1 + w_4 \, x_2 + b_2\\ \end{cases}
All we have are some samples: we want to find the model that best fits the data.
Let’s suppose you have 13971 measurements in some Excel file, linking engine size, number of cylinders, fuel consumption and CO2 emissions of various cars.
You want to predict fuel consumption and CO2 emissions when you know the engine size and the number of cylinders.
Engine size | Cylinders | Fuel consumption | CO2 emissions |
---|---|---|---|
2 | 4 | 8.5 | 196 |
2.4 | 4 | 9.6 | 221 |
1.5 | 4 | 5.9 | 136 |
3.5 | 6 | 11 | 255 |
3.5 | 6 | 11 | 244 |
3.5 | 6 | 10 | 230 |
3.5 | 6 | 10 | 232 |
3.7 | 6 | 11 | 255 |
3.7 | 6 | 12 | 267 |
… | … | … | … |
We can define a linear model with two inputs and two outputs:
\begin{cases} y_1 = w_1 \, x_1 + w_2 \, x_2 + b_1\\ \\ y_2 = w_3 \, x_1 + w_4 \, x_2 + b_2\\ \end{cases}
and use the least mean squares method to obtain the values of the parameters.
Using scikit-learn (https://scikit-learn.org), this is done in two lines of code:
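A sketch of what those two lines could look like, using a few rows of the table above as a stand-in for the full dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# A few rows of the table: inputs [engine size, cylinders], targets [fuel consumption, CO2]
X = np.array([[2.0, 4], [2.4, 4], [1.5, 4], [3.5, 6], [3.7, 6]])
t = np.array([[8.5, 196], [9.6, 221], [5.9, 136], [11, 255], [12, 267]])

# The two lines: fit the multiple linear regression, then predict for a new car
reg = LinearRegression().fit(X, t)
print(reg.predict([[3.0, 6]]))    # predicted [fuel consumption, CO2 emissions]
```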
The system of equations
\begin{cases} y_1 = w_1 \, x_1 + w_2 \, x_2 + b_1\\ \\ y_2 = w_3 \, x_1 + w_4 \, x_2 + b_2\\ \end{cases}
can be put in a matrix-vector form:
\begin{bmatrix} y_1 \\ y_2 \\\end{bmatrix} = \begin{bmatrix} w_1 & w_2 \\ w_3 & w_4 \\\end{bmatrix} \times \begin{bmatrix} x_1 \\ x_2 \\\end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \\\end{bmatrix}
\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\\end{bmatrix} \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\\end{bmatrix} \qquad \mathbf{t} = \begin{bmatrix} t_1 \\ t_2 \\\end{bmatrix} \qquad \mathbf{b} = \begin{bmatrix} b_1 \\ b_2 \\\end{bmatrix} \qquad W = \begin{bmatrix} w_1 & w_2 \\ w_3 & w_4 \\\end{bmatrix}
\mathbf{x} is the input vector, \mathbf{y} is the output vector, \mathbf{t} is the target vector.
W is called the weight matrix and \mathbf{b} the bias vector.
\mathbf{y} = f_{W, \mathbf{b}}(\mathbf{x}) = W \times \mathbf{x} + \mathbf{b}
\mathbf{y} = f_{W, \mathbf{b}}(\mathbf{x}) = W \times \mathbf{x} + \mathbf{b}
The problem is exactly the same as before, except that we use vectors and matrices instead of scalars: \mathbf{x} and \mathbf{y} can have any number of dimensions, the same procedure will apply.
This corresponds to a linear neural network (or linear perceptron), with one output neuron per predicted value y_i using the linear activation function.
\min_{W, \mathbf{b}} \, \mathcal{L}(W, \mathbf{b}) = \mathbb{E}_\mathcal{D} [ ||\mathbf{t} - \mathbf{y}||^2 ] = \mathbb{E}_\mathcal{D} [ ((t_1 - y_1)^2 + (t_2 - y_2)^2) ]
\begin{cases} \Delta W = - \eta \, \nabla_W \, \mathcal{L}(W, \mathbf{b}) \\ \\ \Delta \mathbf{b} = - \eta \, \nabla_\mathbf{b} \, \mathcal{L}(W, \mathbf{b}) \\ \end{cases}
https://web.stanford.edu/class/cs224n/readings/gradient-notes.pdf
\min_{W, \mathbf{b}} \, \mathcal{L}(W, \mathbf{b}) = \mathbb{E}_\mathcal{D} [ ||\mathbf{t} - \mathbf{y}||^2 ] \approx \frac{1}{N} \, \sum_{i=1}^N ||\mathbf{t}_i - \mathbf{y}_i||^2 = \frac{1}{N} \, \sum_{i=1}^N \mathcal{l}_i(W, \mathbf{b})
\mathcal{l}_i(W, \mathbf{b}) = ||\mathbf{t}_i - \mathbf{y}_i||^2 = \langle \mathbf{t}_i - \mathbf{y}_i \cdot \mathbf{t}_i - \mathbf{y}_i \rangle = (\mathbf{t}_i - \mathbf{y}_i)^T \times (\mathbf{t}_i - \mathbf{y}_i)
\mathbf{x}^T \times \mathbf{x} = \begin{bmatrix} x_1 & x_2 & \ldots & x_n \end{bmatrix} \times \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = x_1 \, x_1 + x_2 \, x_2 + \ldots + x_n \, x_n = \langle \mathbf{x} \cdot \mathbf{x} \rangle = ||\mathbf{x}||^2_2
\nabla_{W} \, \mathcal{l}_i(W, \mathbf{b}) = \nabla_{\mathbf{y}_i} \, \mathcal{l}_i(W, \mathbf{b}) \times \nabla_{W} \, \mathbf{y}_i
\nabla_{\mathbf{y}_i} \, \mathcal{l}_i(W, \mathbf{b}) = \nabla_{\mathbf{y}_i} \, (\mathbf{t}_i - \mathbf{y}_i)^T \times (\mathbf{t}_i - \mathbf{y}_i)
\begin{aligned} \nabla_{\mathbf{y}_i} \, (\mathbf{t}_i - \mathbf{y}_i)^T \times (\mathbf{t}_i - \mathbf{y}_i) & = ( \nabla_{\mathbf{y}_i} \, (\mathbf{t}_i - \mathbf{y}_i) ) \times (\mathbf{t}_i - \mathbf{y}_i) + (\mathbf{t}_i - \mathbf{y}_i) \times \nabla_{\mathbf{y}_i} \, (\mathbf{t}_i - \mathbf{y}_i) \\ &\\ &= - (\mathbf{t}_i - \mathbf{y}_i) - (\mathbf{t}_i - \mathbf{y}_i) \\ &\\ &= - 2 \, (\mathbf{t}_i - \mathbf{y}_i) \\ \end{aligned}
Note: We use the properties \nabla_{\mathbf{x}}\, \mathbf{x}^T \times \mathbf{z} = \mathbf{z} and \nabla_{\mathbf{z}} \, \mathbf{x}^T \times \mathbf{z} = \mathbf{x} to get rid of the transpose.
The “problem” is when computing \nabla_{W} \, \mathbf{y}_i = \nabla_{W} \, (W \times \mathbf{x}_i + \mathbf{b}):
\mathbf{y}_i is a vector and W a matrix.
\nabla_{W} \, \mathbf{y}_i is then a Jacobian (matrix), not a gradient (vector).
Intuitively, differentiating W \times \mathbf{x}_i + \mathbf{b} w.r.t W should return \mathbf{x}_i, but it is a vector, not a matrix…
The gradient (or Jacobian) of \mathcal{l}_i(W, \mathbf{b}) w.r.t W should be a matrix of the same size as W so that we can apply gradient descent:
\Delta W = - \eta \, \nabla_W \, \mathcal{L}(W, \mathbf{b})
\nabla_{W} \, \mathcal{l}_i(W, \mathbf{b}) = - 2\, (\mathbf{t}_i - \mathbf{y}_i) \times \nabla_{W} \, \mathbf{y}_i
If \mathbf{x}_i has n elements and \mathbf{y}_i has m elements, W is an m \times n matrix.
Remember the outer product between two vectors:
\mathbf{u} \times \mathbf{v}^\textsf{T} = \begin{bmatrix}u_1 \\ u_2 \\ u_3 \\ u_4\end{bmatrix} \begin{bmatrix}v_1 & v_2 & v_3\end{bmatrix} = \begin{bmatrix} u_1v_1 & u_1v_2 & u_1v_3 \\ u_2v_1 & u_2v_2 & u_2v_3 \\ u_3v_1 & u_3v_2 & u_3v_3 \\ u_4v_1 & u_4v_2 & u_4v_3 \end{bmatrix}.
\nabla_W \, \mathcal{l}_i(W, \mathbf{b}) = - 2 \, (\mathbf{t}_i - \mathbf{y}_i) \times \mathbf{x}_i^T
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\\end{bmatrix} = W \times \mathbf{x} + \mathbf{b} = \begin{bmatrix} w_1 & w_2 \\ w_3 & w_4 \\\end{bmatrix} \times \begin{bmatrix} x_1 \\ x_2 \\\end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \\\end{bmatrix}
\mathcal{l}(W, \mathbf{b}) = (\mathbf{t} - \mathbf{y})^T \times (\mathbf{t} - \mathbf{y}) = \begin{bmatrix} t_1 - y_1 & t_2 - y_2 \\\end{bmatrix} \times \begin{bmatrix} t_1 - y_1 \\ t_2 - y_2 \\\end{bmatrix} = (t_1 - y_1)^2 + (t_2 - y_2)^2
\nabla_W \, \mathcal{l}(W, \mathbf{b}) = \begin{bmatrix} \dfrac{\partial \mathcal{l}(W, \mathbf{b})}{\partial w_1} & \dfrac{\partial \mathcal{l}(W, \mathbf{b})}{\partial w_2} \\ \\ \dfrac{\partial \mathcal{l}(W, \mathbf{b})}{\partial w_3} & \dfrac{\partial \mathcal{l}(W, \mathbf{b})}{\partial w_4} \\ \end{bmatrix} = \begin{bmatrix} -2 \, (t_1 - y_1) \, x_1 & -2 \, (t_1 - y_1) \, x_2 \\ \\ -2 \, (t_2 - y_2) \, x_1 & -2 \, (t_2 - y_2) \, x_2 \\ \end{bmatrix}
\nabla_W \, \mathcal{l}(W, \mathbf{b}) = -2 \, \begin{bmatrix} t_1 - y_1 \\ t_2 - y_2 \\ \end{bmatrix} \times \begin{bmatrix} x_1 & x_2 \\ \end{bmatrix} = - 2 \, (\mathbf{t} - \mathbf{y}) \times \mathbf{x}^T
\begin{cases} \Delta W = \eta \, \dfrac{1}{N} \displaystyle\sum_{i=1}^N \, (\mathbf{t}_i - \mathbf{y}_i ) \times \mathbf{x}_i^T \\ \\ \Delta \mathbf{b} = \eta \, \dfrac{1}{N} \displaystyle\sum_{i=1}^N \, (\mathbf{t}_i - \mathbf{y}_i) \\ \end{cases}
\begin{cases} \Delta W = \eta \, (\mathbf{t}_i - \mathbf{y}_i ) \times \mathbf{x}_i^T \\ \\ \Delta \mathbf{b} = \eta \, (\mathbf{t}_i - \mathbf{y}_i) \\ \end{cases}
\begin{cases} \Delta w_1 = \eta \, (t_1 - y_1) \, x_1 \\ \Delta w_2 = \eta \, (t_1 - y_1) \, x_2 \\ \Delta w_3 = \eta \, (t_2 - y_2) \, x_1 \\ \Delta w_4 = \eta \, (t_2 - y_2) \, x_2 \\ \end{cases} \qquad \begin{cases} \Delta b_1 = \eta \, (t_1 - y_1) \\ \Delta b_2 = \eta \, (t_2 - y_2) \\ \end{cases}
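A minimal numpy sketch of this online update in matrix form, on made-up data generated by a known 2x2 weight matrix (so we can check that W and \mathbf{b} are recovered):

```python
import numpy as np

# Made-up ground truth for a model with 2 inputs and 2 outputs
W_true = np.array([[1.0, 2.0], [3.0, 4.0]])
b_true = np.array([0.5, -0.5])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
T = X @ W_true.T + b_true              # targets t = W x + b for every sample

W = np.zeros((2, 2))
b = np.zeros(2)
eta = 0.1
for epoch in range(100):
    for x, t in zip(X, T):
        y = W @ x + b
        W += eta * np.outer(t - y, x)  # Delta W = eta (t - y) x^T (outer product)
        b += eta * (t - y)             # Delta b = eta (t - y)

print(W)                               # should be close to W_true
print(b)                               # should be close to b_true
```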
Let’s suppose we want to perform a regression, but where the outputs t_i are bounded between 0 and 1.
We could use a logistic (or sigmoid) function instead of a linear function in order to transform the input into an output:
y = \sigma(w \, x + b ) = \frac{1}{1+\exp(-w \, x - b )}
The logistic function
\sigma(x)=\frac{1}{1+\exp(-x)}
has the nice property that
\sigma'(x) = \sigma(x) \, (1 - \sigma(x) )
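For completeness, this property follows directly from the chain rule applied to (1+\exp(-x))^{-1}:
\begin{aligned} \sigma'(x) &= \frac{\exp(-x)}{(1+\exp(-x))^2} = \frac{1}{1+\exp(-x)} \times \frac{\exp(-x)}{1+\exp(-x)} \\ &\\ &= \sigma(x) \, (1 - \sigma(x)) \\ \end{aligned}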
l_i(w, b) = (t_i - \sigma(w \, x_i + b) )^2
\begin{aligned} \frac{\partial l_i(w, b)}{\partial w} &= 2 \, (t_i - y_i) \, \frac{\partial}{\partial w} (t_i - \sigma(w \, x_i + b ))\\ &\\ &= - 2 \, (t_i - y_i) \, \sigma'(w \, x_i + b ) \, x_i \\ \end{aligned}
\Delta w = \eta \, (t_i - y_i) \, \sigma'(w \, x_i + b ) \, x_i
Using the property of the logistic function, \sigma'(w \, x_i + b) can be replaced by y_i \, (1 - y_i):
\Delta w = \eta \, (t_i - y_i) \, y_i \, ( 1 - y_i ) \, x_i
so we do not even need to evaluate the derivative explicitly: the output y_i is all we need.
y = \sigma(w \, x + b ) = \frac{1}{1+\exp(-w \, x - b )}
\begin{cases} \Delta w = \eta \, (t_i - y_i) \, y_i \, ( 1 - y_i ) \, x_i \\ \\ \Delta b = \eta \, (t_i - y_i) \, y_i \, ( 1 - y_i ) \\ \end{cases}
\mathbf{y} = f(W \times \mathbf{x} + \mathbf{b} )
\mathcal{L}(W, \mathbf{b}) = \mathbb{E}_{\mathbf{x}, \mathbf{t} \in \mathcal{D}}[||\mathbf{t} - \mathbf{y}||^2]
\begin{cases} \Delta W = \eta \, [(\mathbf{t} - \mathbf{y}) \odot f'(W \times \mathbf{x} + \mathbf{b}) ] \times \mathbf{x}^T \\ \\ \Delta \mathbf{b} = \eta \, (\mathbf{t} - \mathbf{y}) \odot f'(W \times \mathbf{x} + \mathbf{b}) \\ \end{cases}
\odot denotes element-wise multiplication, i.e. (\mathbf{t} - \mathbf{y}) \odot f'(W \times \mathbf{x} + \mathbf{b}) is also a vector.
In the linear case, f'(x) = 1.
One can use any non-linear function, e.g. the hyperbolic tangent tanh(), ReLU, etc.
Transfer functions are chosen for neural networks so that we can compute their derivative easily.
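As a sketch, the generalized update rule with a logistic transfer function, on made-up data whose targets are generated by a known model (\eta and the number of epochs are arbitrary choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up data: 1D input, targets bounded between 0 and 1
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
T = sigmoid(1.5 * X + 0.5)             # generated by a known (w, b) = (1.5, 0.5)

W = np.zeros((1, 1))
b = np.zeros(1)
eta = 0.5
for epoch in range(200):
    for x, t in zip(X, T):
        y = sigmoid(W @ x + b)
        delta = (t - y) * y * (1 - y)  # (t - y) element-wise multiplied by f'(W x + b)
        W += eta * np.outer(delta, x)  # Delta W = eta [(t - y) . f'(...)] x^T
        b += eta * delta               # Delta b = eta (t - y) . f'(...)

print(W, b)                            # should approach 1.5 and 0.5
```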
The functions underlying real data are rarely just linear plus some noise around the ideal value.
In the figure, the input/output function would be better modeled by a second-order polynomial (or higher):
y = f_{\mathbf{w}, b}(x) = w_1 \, x + w_2 \, x^2 +b
y = f_{\mathbf{w}, b}(x) = w_1 \, x + w_2 \, x^2 +b
\mathbf{x} = \begin{bmatrix} x \\ x^2 \\ \end{bmatrix} \qquad \mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \end{bmatrix}
y = \langle \mathbf{w} \cdot \mathbf{x} \rangle + b = \sum_j w_j \, x_j + b
\begin{cases} \Delta \mathbf{w} = \eta \, (t - y) \, \mathbf{x} \\ \\ \Delta b = \eta \, (t - y) \\ \end{cases}
y = f_{\mathbf{w}, b}(x) = w_1 \, x + w_2 \, x^2 + \ldots + w_p \, x^p + b
\mathbf{x} = \begin{bmatrix} x \\ x^2 \\ \vdots \\ x^p \end{bmatrix} \qquad \mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_p \end{bmatrix}
\begin{cases} \Delta \mathbf{w} = \eta \, (t - y) \, \mathbf{x} \\ \\ \Delta b = \eta \, (t - y) \\ \end{cases}
Non-linear problem solved! The only unknown is which order of the polynomial best matches the data.
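A sketch with scikit-learn on made-up quadratic data: the polynomial features [x, x^2, \ldots, x^p] are built explicitly, and an ordinary linear regression is fitted on them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up non-linear data: t = 2 x - 3 x^2 plus noise
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=(50, 1))
t = 2 * x - 3 * x**2 + 0.1 * rng.normal(size=(50, 1))

# Expand the input into [x, x^2] and fit a plain linear regression on the new features
p = 2
X = PolynomialFeatures(degree=p, include_bias=False).fit_transform(x)
reg = LinearRegression().fit(X, t)
print(reg.coef_, reg.intercept_)   # should be close to [2, -3] and 0
```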
One can perform regression with any kind of parameterized function using gradient descent.
y_i = w \, x_i + b
\epsilon_\mathcal{D} = \frac{1}{N} \, \sum_{i=1}^{N} (t_i - y_i )^2
R^2 = 1 - \frac{\text{Var}(\text{residuals})}{\text{Var}(\text{data})} = 1 - \frac{\sum_{i=1}^N (t_i- y_i)^2}{\sum_{i=1}^N (t_i - \hat{t})^2}
where \hat{t} is the mean of the targets t_i.
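A small numpy check of both quantities on hypothetical targets and predictions (the manual R^2 should match scikit-learn's r2_score):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions, for illustration only
t = np.array([196.0, 221.0, 136.0, 255.0, 244.0])
y = np.array([200.0, 215.0, 140.0, 250.0, 248.0])

mse = np.mean((t - y)**2)                                  # residual error
r2 = 1 - np.sum((t - y)**2) / np.sum((t - t.mean())**2)    # coefficient of determination

print(mse, mean_squared_error(t, y))   # both give the same mse
print(r2, r2_score(t, y))              # both give the same R^2
```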
When only looking at the residual mse on the training data, one could think that the higher the order of the polynomial, the better.
But it is obvious that the interpolation quickly becomes very bad when the order is too high.
A complex model (with a lot of parameters) is useless for predicting new values.
We actually do not care about the error on the training set.
We care about generalization.
Let’s suppose we have m models \mathcal{M} = \{ M_1, ..., M_m\} that could be used to fit (or classify) some data \mathcal{D} = \{x_i, t_i\}_{i=1}^N.
Such a set of models could be polynomials of different orders, different algorithms (NN, SVM) or the same algorithm with different values of its hyperparameters (learning rate, regularization parameters…).
The naive and wrong method to find the best hypothesis would be:
Wrong method!
For all models M_i:
Train M_i on \mathcal{D} to obtain an hypothesis h_i.
Compute the training error \epsilon_\mathcal{D}(h_i) of h_i on \mathcal{D} :
\epsilon_\mathcal{D}(h_i) = \mathbb{E}_{(\mathbf{x}, t) \in \mathcal{D}} [(h_i(\mathbf{x}) - t)^2]
Select the hypothesis h_{i}^* with the minimal training error : h_{i}^* = \text{argmin}_{h_i \in \mathcal{M}} \quad \epsilon_\mathcal{D}(h_i)
The solution is to randomly take some samples out of the training set to form the test set.
Typically, 20 or 30 % of the samples are put in the test set.
Method:
Train the model on the training set (70% of the data).
Test the performance of the model on the test set (30% of the data).
The test performance will better measure how well the model generalizes to new examples.
Algorithm
Split the training data \mathcal{D} into \mathcal{S}_{\text{train}} and \mathcal{S}_{\text{test}}.
For all models M_i:
Train M_i on \mathcal{S}_{\text{train}} to obtain an hypothesis h_i.
Compute the empirical error \epsilon_{\text{test}}(h_i) of h_i on \mathcal{S}_{\text{test}} :
\epsilon_{\text{test}}(h_i) = \mathbb{E}_{(\mathbf{x}, t) \in \mathcal{S}_{\text{test}}} [(h_i(\mathbf{x}) - t)^2]
Select the hypothesis h_{i}^* with the minimal empirical error : h_{i}^* = \text{argmin}_{h_i \in \mathcal{M}} \quad \epsilon_{\text{test}}(h_i)
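With scikit-learn, the split and the evaluation could look like this (a sketch on made-up data; test_size=0.3 keeps 30 % of the samples for the test set):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Made-up data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=100)

# Hold out 30 % of the samples as the test set
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.3, random_state=0)

reg = LinearRegression().fit(X_train, t_train)             # train on S_train
print(mean_squared_error(t_test, reg.predict(X_test)))     # empirical error on S_test
```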
Disadvantages:
20 or 30% of the data is wasted and not used for learning. It may be a problem when data is rare or expensive.
The test set must be representative of the difficulty of the training set (same distribution).
Idea:
build several different training/test sets with the same data.
train and test each model repeatedly on each partition.
choose the hypothesis that works best on average.
Algorithm
Randomly split the data \mathcal{D} into k subsets of \frac{N}{k} examples \{ \mathcal{S}_{1}, \dots , \mathcal{S}_{k}\}
For all models M_i:
For all k subsets \mathcal{S}_j:
Train M_i on \mathcal{D} - \mathcal{S}_j to obtain an hypothesis h_{ij}
Compute the empirical error \epsilon_{\mathcal{S}_j}(h_{ij}) of h_{ij} on \mathcal{S}_j
The empirical error of the model M_i on \mathcal{D} is the average of empirical errors made on (\mathcal{S}_j)_{j=1}^{k}
\epsilon_{\mathcal{D}} (M_i) = \frac{1}{k} \cdot \sum_{j=1}^{k} \epsilon_{\mathcal{S}_j}(h_{ij})
Select the model M_{i}^* with the minimal empirical error on \mathcal{D}.
In general, k=10. The extreme case k=N is called leave-one-out cross-validation.
k-fold cross-validation works well, but requires a lot of repeated training.
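A sketch with scikit-learn's cross_val_score, which trains and tests the model on each of the k partitions and returns the k scores (made-up data again):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Made-up data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=100)

# 10-fold cross-validation: the average error over the folds estimates the generalization error
scores = cross_val_score(LinearRegression(), X, t,
                         scoring="neg_mean_squared_error", cv=10)
print(-scores.mean())
```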
The bare minimum in ML is to have separate training and test sets. However, the test set should only be used once:
A third validation set is typically used to track overfitting during training and perform model selection.
The test set is ultimately used to report the final performance.
While the training mse always decreases with more complex models, the validation/test mse increases after a while.
This is called overfitting: learning the data by heart without caring about generalization.
The two curves suggest that we should choose a polynomial order between 2 and 9.
A model not complex enough for the data will underfit: its training error is high.
A model too complex for the data will overfit: its test error is high.
In between, there is the right complexity for the model: it learns the data correctly but does not overfit.
y = f_{\mathbf{w}, b}(x) = \sum_{k=1}^p w_k \, x^k + b
\mathbf{x} = \begin{bmatrix} x \\ x^2 \\ \vdots \\ x^p \end{bmatrix} \qquad \mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_p \end{bmatrix}
A polynomial of order p has p+1 unknown parameters (free parameters): the p weights and the bias.
Generally, the complexity of a model relates to its number of free parameters:
(Figure: examples of underfitting and overfitting.)
When linear regression underfits (both training and test errors are high), the data is not linear: we need to use a neural network.
When linear regression overfits (the test error is higher than the training error), we would like to decrease its complexity.
y = \sum_{i=1}^d w_i \, x_i + b
We must find a way to reduce the complexity of the linear regression without changing the number of parameters, which seems impossible.
The solution is to constrain the values that the parameters can take: regularization.
Regularization reduces the variance at the cost of increasing the bias.
Using L2 regularization for linear regression leads to the Ridge regression algorithm.
The individual loss function is defined as:
\mathcal{l}_i(\mathbf{w}, b) = (t_i - y_i)^2 + \lambda \, ||\mathbf{w}||^2
The first part of the loss function is the classical mse on the training set: its role is to reduce the bias.
The second part minimizes the L2 norm of the weight vector (or matrix), reducing the variance:
||\mathbf{w}||^2 = \sum_{i=1}^d w_i^2
\Delta w_i = \eta \, ((t_i - y_i) \, x_i - \lambda \, w_i)
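A quick sketch with scikit-learn's Ridge estimator on made-up data (its alpha parameter plays the role of \lambda):

```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

# Made-up data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
t = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)

print(LinearRegression().fit(X, t).coef_)   # unregularized weights
print(Ridge(alpha=10.0).fit(X, t).coef_)    # L2-regularized weights, shrunk towards 0
```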
Using L1 regularization for linear regression leads to the LASSO regression algorithm (least absolute shrinkage and selection operator).
The individual loss function is defined as:
\mathcal{l}_i(\mathbf{w}, b) = (t_i - y_i)^2 + \lambda \, |\mathbf{w}|
|\mathbf{w}| = \sum_{i=1}^d |w_i|
\Delta w_i = \eta \, ((t_i - y_i) \, x_i - \lambda \, \text{sign}(w_i))
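The same kind of sketch with scikit-learn's Lasso estimator: with L1 regularization, the useless weights are driven exactly to 0 (feature selection):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Made-up data: only the first two features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
t = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=100)

reg = Lasso(alpha=0.1).fit(X, t)   # alpha plays the role of lambda
print(reg.coef_)                   # the three useless weights should be exactly 0
```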
Both methods depend on the regularization parameter \lambda. Its value determines how important the regularization term should be.
Regularization introduces a bias, as the solution found is not the minimum of the mse, but it reduces the variance of the estimation, as small weights are less sensitive to noise.
(Figure: fits obtained with linear regression and LASSO.)
The two penalties can also be combined (ElasticNet regularization):
\mathcal{l}_i(\mathbf{w}, b) = (t_i - y_i)^2 + \lambda_1 \, |\mathbf{w}| + \lambda_2 \, ||\mathbf{w}||^2
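A final sketch with scikit-learn's ElasticNet estimator; note that scikit-learn parameterizes the combined penalty with an overall strength alpha and a mixing ratio l1_ratio rather than \lambda_1 and \lambda_2 directly:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Made-up data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
t = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)

# alpha controls the overall strength, l1_ratio the share of the L1 term
reg = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, t)
print(reg.coef_)   # small weights are shrunk, some are driven exactly to 0
```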