Learning theory
Professur für Künstliche Intelligenz - Fakultät für Informatik
We have seen so far linear learning algorithms for regression and classification.
Most interesting problems are non-linear: classes are not linearly separable, the output is not a linear function of the input, etc…
Do we need totally new methods, or can we re-use our linear algorithms?
How many data examples can be correctly classified by a linear model in \Re^d?
In \Re^2, all dichotomies of three non-aligned examples can be correctly classified by a linear model (y = w_0 + w_1 \cdot x_1 + w_2 \cdot x_2).
However, there exist sets of four examples in \Re^2 which can NOT be correctly classified by a linear model, i.e. they are not linearly separable.
The classical example is the XOR function:

x_1 | x_2 | y |
---|---|---|
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 0 |
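As a quick numerical illustration (a sketch, not a proof), a coarse brute-force search over linear classifiers finds no hyperplane that classifies the four XOR points correctly:

```python
import itertools
import numpy as np

# The four XOR points from the table above.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 1, 1, 0])

# Coarse grid search over (w1, w2, b): is there a hyperplane classifying all four points?
found = False
for w1, w2, b in itertools.product(np.linspace(-2.0, 2.0, 41), repeat=3):
    y = (w1 * X[:, 0] + w2 * X[:, 1] + b > 0).astype(int)
    if np.array_equal(y, t):
        found = True
        break

print("Separating hyperplane found:", found)  # False
```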
The probability that a set of 3 (non-aligned) points in \Re^2 is linearly separable is 1, but the probability that a set of four points is linearly separable is smaller than 1 (but not zero).
When a hypothesis class \mathcal{H} can correctly classify all possible dichotomies (labelings) of a training set \mathcal{D}, we say that \mathcal{H} shatters \mathcal{D}.
The Vapnik-Chervonenkis dimension \text{VC}_\text{dim} (\mathcal{H}) of a hypothesis class \mathcal{H} is defined as the maximal number of training examples that \mathcal{H} can shatter.
We saw that in \Re^2, this dimension is 3:
\text{VC}_\text{dim} (\text{Linear}(\Re^2) ) = 3
\text{VC}_\text{dim} (\text{Linear}(\Re^d) ) = d+1
This corresponds to the number of free parameters of the linear classifier:
Given any set of (d+1) examples in \Re^d, there exists a linear classifier able to classify them perfectly.
For other types of (non-linear) hypotheses, the VC dimension is generally proportional to the number of free parameters.
But regularization reduces the VC dimension of the classifier.
Vapnik's bound relates the generalization error \epsilon(h) of a hypothesis h to its training error \hat{\epsilon}_{\mathcal{S}}(h) on a dataset \mathcal{S} of N samples:

\epsilon(h) \leq \hat{\epsilon}_{\mathcal{S}}(h) + \sqrt{\frac{\text{VC}_\text{dim} (\mathcal{H}) \cdot (1 + \log(\frac{2\cdot N}{\text{VC}_\text{dim} (\mathcal{H})})) - \log(\frac{\delta}{4})}{N}}

with probability 1-\delta, provided that \text{VC}_\text{dim} (\mathcal{H}) \ll N.
Vapnik, Vladimir (2000). The nature of statistical learning theory. Springer.
The generalization error increases with the VC dimension, while the training error decreases.
Structural risk minimization is an alternative to cross-validation for model selection: one chooses the hypothesis class that minimizes the sum of the training error and the complexity term.
The VC dimensions of various hypothesis classes are already known (roughly, the number of free parameters).
This bound tells how many training samples are needed by a given hypothesis class in order to obtain a satisfying generalization error.
\epsilon(h) \approx \frac{\text{VC}_\text{dim} (\mathcal{H})}{N}
A learning algorithm should only try to minimize the training error, as the VC complexity term only depends on the model.
This term is only an upper bound: in practice, the true generalization error is often much smaller than the bound (sometimes by a factor of 100).
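As an illustration of how the bound behaves, the sketch below evaluates its right-hand side for a linear classifier in \Re^{10} (VC dimension 11); the training error of 0.1 and the values of N and \delta are arbitrary placeholders.

```python
import numpy as np

def vc_bound(train_error, vc_dim, N, delta=0.05):
    """Vapnik's upper bound on the generalization error, valid with probability 1 - delta."""
    complexity = vc_dim * (1.0 + np.log(2.0 * N / vc_dim)) - np.log(delta / 4.0)
    return train_error + np.sqrt(complexity / N)

# The bound shrinks towards the training error as N grows (for a fixed VC dimension).
for N in [100, 1000, 10000, 100000]:
    print(N, round(vc_bound(train_error=0.1, vc_dim=11, N=N), 3))
```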
\text{VC}_\text{dim} (\text{Linear}(\Re^d) ) = d+1
Given any set of (d+1) examples in \Re^d, there exists a linear classifier able to classify them perfectly.
For N \gg d, the probability of making training errors becomes very high: the data is generally not linearly separable in the original input space.
The solution is to project the data into a space with many more dimensions, where it is more likely to be linearly separable. But if that space has too many dimensions, the VC dimension increases, and with it the generalization error.
Basic principle of all non-linear methods: multi-layer perceptron, radial-basis-function networks, support-vector machines…
A complex pattern-classification problem, cast in a high dimensional space non-linearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated.
The highly dimensional space where the input data is projected is called the feature space.
When the number of dimensions of the feature space increases:
the training error decreases (the pattern is more likely linearly separable);
the generalization error increases (the VC dimension increases).
In polynomial regression of a single variable x,

y = f_{\mathbf{w}, b}(x) = w_1 \, x + w_2 \, x^2 + \ldots + w_p \, x^p + b

the vector \mathbf{x} = \begin{bmatrix} x \\ x^2 \\ \ldots \\ x^p \end{bmatrix} defines a feature space for the input x.
The elements of the feature space are called polynomial features.
We can define polynomial features of more than one variable, e.g. x^2 \, y, x^3 \, y^4, etc.
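As a sketch, scikit-learn's PolynomialFeatures can generate such multi-variable polynomial features automatically (degree 3 is an arbitrary choice here):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two variables (x, y); expand into all monomials up to degree 3.
X = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=3, include_bias=False)
print(poly.fit_transform(X))  # [x, y, x^2, x*y, y^2, x^3, x^2*y, x*y^2, y^3]
```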
We then apply multiple linear regression (MLR) on the polynomial feature space to find the parameters:
\Delta \mathbf{w} = \eta \, (t - y) \, \mathbf{x}
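A minimal numpy sketch of this approach, assuming a single input variable, degree p = 3, synthetic data, and an arbitrary learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic non-linear 1D regression problem.
x = rng.uniform(-1.0, 1.0, size=200)
t = np.cos(3.0 * x) + 0.1 * rng.normal(size=200)

# Polynomial feature space: each scalar x becomes the vector [x, x^2, x^3].
p = 3
X = np.stack([x ** k for k in range(1, p + 1)], axis=1)

# Online multiple linear regression (delta rule) on the polynomial features.
w, b, eta = np.zeros(p), 0.0, 0.05
for epoch in range(100):
    for xi, ti in zip(X, t):
        y = w @ xi + b
        w += eta * (ti - y) * xi   # Delta w = eta * (t - y) * x
        b += eta * (ti - y)

print("learned weights:", w, "bias:", b)
```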
Radial-basis-function (RBF) networks use a feature space built from the distances of the input to a set of K reference vectors (the centers):

\phi(\mathbf{x}) = \begin{bmatrix} \varphi(\mathbf{x} - \mathbf{x}_1) \\ \varphi(\mathbf{x} - \mathbf{x}_2) \\ \ldots \\ \varphi(\mathbf{x} - \mathbf{x}_K) \end{bmatrix}

with \varphi(\mathbf{x} - \mathbf{x}_i) = \exp(- \beta \, ||\mathbf{x} - \mathbf{x}_i||^2) decreasing with the distance between the two vectors.
By applying a linear model on these RBF features,

\mathbf{y} = f(W \times \phi(\mathbf{x}) + \mathbf{b})
we obtain a smooth non-linear partition of the input space.
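A sketch of this construction, assuming the K centers are sampled from a synthetic dataset and the linear readout is fitted by least squares instead of online learning:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic binary classification data: two Gaussian blobs with labels -1 / +1.
X = np.concatenate([rng.normal(-1.0, 0.7, size=(100, 2)),
                    rng.normal(+1.0, 0.7, size=(100, 2))])
t = np.concatenate([-np.ones(100), np.ones(100)])

# RBF feature space: one Gaussian bump per center.
K, beta = 20, 2.0
centers = X[rng.choice(len(X), size=K, replace=False)]

def phi(X):
    # phi_k(x) = exp(-beta * ||x - x_k||^2)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-beta * d2)

# Linear readout on the K-dimensional feature space (plus a bias column).
F = np.c_[phi(X), np.ones(len(X))]
w, *_ = np.linalg.lstsq(F, t, rcond=None)
print("training accuracy:", (np.sign(F @ w) == t).mean())
```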
What happens during online Perceptron learning?
If an example \mathbf{x}_i is correctly classified (y_i = t_i), the weight vector does not change.
\mathbf{w} \leftarrow \mathbf{w}
If it is misclassified (y_i \neq t_i), the weight vector is updated:

\mathbf{w} \leftarrow \mathbf{w} + 2 \, \eta \, t_i \, \mathbf{x}_i
Primal form of the online Perceptron algorithm
for M epochs:
    for each sample (\mathbf{x}_i, t_i):
        y_i = \text{sign}( \langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b)
        \Delta \mathbf{w} = \eta \, (t_i - y_i) \, \mathbf{x}_i
        \Delta b = \eta \, (t_i - y_i)
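A direct numpy transcription of this primal form (a sketch; targets are assumed to be ±1 and the hyperparameters are placeholders):

```python
import numpy as np

def perceptron_primal(X, t, eta=0.1, epochs=10):
    """Online primal Perceptron. X has shape (N, d), t contains +1/-1 labels."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for xi, ti in zip(X, t):
            yi = 1.0 if (w @ xi + b) > 0 else -1.0
            w += eta * (ti - yi) * xi   # zero update when the sample is correctly classified
            b += eta * (ti - yi)
    return w, b
```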
If the weight vector is initialized to zero, it stays at all times a linear combination of the training examples (\alpha_i being proportional to the number of errors made on example i):

\mathbf{w} = \sum_{i=1}^N \alpha_i \, t_i \, \mathbf{x}_i
y = \text{sign}( \sum_{i=1}^N \alpha_i \, t_i \, \langle \mathbf{x}_i \cdot \mathbf{x} \rangle)
To make a prediction y, we need the dot product between the input \mathbf{x} and all training examples \mathbf{x}_i.
We ignore the bias here, but it can be added back.
Dual form of the online Perceptron algorithm
for M epochs:
    for each sample (\mathbf{x}_i, t_i):
        y_i = \text{sign}( \sum_{j=1}^N \alpha_j \, t_j \, \langle \mathbf{x}_j \cdot \mathbf{x}_i \rangle)
        if y_i \neq t_i :
            \alpha_i \leftarrow \alpha_i + 1
This dual form of the Perceptron algorithm is strictly equivalent to its primal form.
It needs one parameter \alpha_i per training example instead of a weight vector (usually N \gg d), but it only relies on dot products between input vectors.
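The dual form translates just as directly; here the Gram matrix of all pairwise dot products is precomputed for clarity (an implementation choice, not a requirement):

```python
import numpy as np

def perceptron_dual(X, t, epochs=10):
    """Online dual Perceptron. alpha_i counts how often example i was misclassified."""
    N = len(X)
    alpha = np.zeros(N)
    G = X @ X.T                                  # Gram matrix of dot products <x_j . x_i>
    for _ in range(epochs):
        for i in range(N):
            yi = np.sign(np.sum(alpha * t * G[:, i]))
            if yi != t[i]:
                alpha[i] += 1.0
    return alpha

def predict_dual(X_train, t, alpha, x):
    """Prediction for a new input x using the dual coefficients."""
    return np.sign(np.sum(alpha * t * (X_train @ x)))
```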
The prediction only involves dot products between the new input and the training examples:

y = \text{sign}( \sum_{i=1}^N \alpha_i \, t_i \, \langle \mathbf{x}_i \cdot \mathbf{x} \rangle)

If the inputs are first projected into a feature space \phi, the prediction becomes:

y = \text{sign}( \sum_{i=1}^N \alpha_i \, t_i \, \langle \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}) \rangle)

The dot product in the feature space defines a kernel:

K(\mathbf{x}_i, \mathbf{x}) = \langle \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}) \rangle
\begin{aligned} \forall (\mathbf{x}, \mathbf{z}) \in \Re^3 \times \Re^3 \qquad K(\mathbf{x}, \mathbf{z}) &= ( \langle \mathbf{x} \cdot \mathbf{z} \rangle)^2 \\ &= (\sum_{i=1}^3 x_i \cdot z_i) \cdot (\sum_{j=1}^3 x_j \cdot z_j) \\ &= \sum_{i=1}^3 \sum_{j=1}^3 (x_i \cdot x_j) \cdot ( z_i \cdot z_j) \\ &= \langle \phi(\mathbf{x}) \cdot \phi(\mathbf{z}) \rangle \end{aligned}
\text{with:} \qquad \phi(\mathbf{x}) = \begin{bmatrix} x_1 \cdot x_1 \\ x_1 \cdot x_2 \\ x_1 \cdot x_3 \\ x_2 \cdot x_1 \\ x_2 \cdot x_2 \\ x_2 \cdot x_3 \\ x_3 \cdot x_1 \\ x_3 \cdot x_2 \\ x_3 \cdot x_3 \end{bmatrix}
More generally, the polynomial kernel of order p,

\begin{align*} \forall (\mathbf{x}, \mathbf{z}) \in \Re^d \times \Re^d \qquad K(\mathbf{x}, \mathbf{z}) &= ( \langle \mathbf{x} \cdot \mathbf{z} \rangle)^p \\ &= \langle \phi(\mathbf{x}) \cdot \phi(\mathbf{z}) \rangle \end{align*}

implicitly transforms the input from a space with d dimensions into a feature space with d^p dimensions.
While the inner product in the feature space would require O(d^p) operations, the calculation of the kernel directly in the input space only requires O(d) operations.
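A quick numerical check of this equivalence for the quadratic kernel in \Re^3, using the outer product as the explicit feature map \phi:

```python
import numpy as np

rng = np.random.default_rng(2)
x, z = rng.normal(size=3), rng.normal(size=3)

# Kernel evaluated directly in the input space: O(d) operations.
k_input = (x @ z) ** 2

# Same value computed explicitly in the d^2-dimensional feature space.
phi = lambda v: np.outer(v, v).ravel()   # [v1*v1, v1*v2, ..., v3*v3]
k_feature = phi(x) @ phi(z)

print(np.isclose(k_input, k_feature))    # True
```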
This is called the kernel trick: when a linear algorithm relies only on dot products between input vectors, it can safely be applied in a higher-dimensional feature space through a kernel function, without significantly increasing its computational complexity and without ever computing the coordinates in the feature space.
Kernel Perceptron
for M epochs:
    for each sample (\mathbf{x}_i, t_i):
        y_i = \text{sign}( \sum_{j=1}^N \alpha_j \, t_j \, K(\mathbf{x}_j, \mathbf{x}_i))
        if y_i \neq t_i :
            \alpha_i \leftarrow \alpha_i + 1
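The same sketch as the dual Perceptron, with the Gram matrix replaced by a kernel matrix (here a Gaussian kernel with an arbitrary width):

```python
import numpy as np

def gaussian_kernel_matrix(A, B, sigma=1.0):
    """K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_perceptron(X, t, epochs=10, sigma=1.0):
    """Online kernel Perceptron. alpha_i counts how often example i was misclassified."""
    N = len(X)
    alpha = np.zeros(N)
    K = gaussian_kernel_matrix(X, X, sigma)      # kernel matrix K(x_j, x_i)
    for _ in range(epochs):
        for i in range(N):
            yi = np.sign(np.sum(alpha * t * K[:, i]))
            if yi != t[i]:
                alpha[i] += 1.0
    return alpha
```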
The most common kernels are:

Linear kernel: K(\mathbf{x},\mathbf{z}) = \langle \mathbf{x} \cdot \mathbf{z} \rangle

Polynomial kernel: K(\mathbf{x},\mathbf{z}) = (\langle \mathbf{x} \cdot \mathbf{z} \rangle)^p

Gaussian (RBF) kernel: K(\mathbf{x},\mathbf{z}) = \exp(-\frac{\| \mathbf{x} - \mathbf{z} \|^2}{2\sigma^2})

Sigmoid kernel: K(\mathbf{x},\mathbf{z}) = \tanh(\kappa \, \langle \mathbf{x} \cdot \mathbf{z} \rangle + c)
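Written as plain functions of two vectors (the parameter values below are placeholders), these kernels are, as a sketch:

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, p=3):
    return (x @ z) ** p

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, z, kappa=1.0, c=0.0):
    return np.tanh(kappa * (x @ z) + c)
```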
In practice, the choice of the kernel family depends more on the nature of data (text, image…) and its distribution than on the complexity of the learning problem.
RBF kernels tend to “group” positive examples together.
Polynomial kernels are more like “distorted” hyperplanes.
Kernels have parameters (p, \sigma…) which have to be found using cross-validation.
Support vector machines (SVM) extend the idea of a kernel perceptron using a different linear learning algorithm, the maximum margin classifier.
Using Lagrange optimization and regularization, the maximum margin classifier tries to maximize the “safety zone” (geometric margin) between the classifier and the training examples.
It also tries to reduce the number of non-zero \alpha_i coefficients to keep the complexity of the classifier bounded, thereby improving the generalization:
\mathbf{y} = \text{sign}(\sum_{i=1}^{N_{SV}} \alpha_i \, t_i \, K(\mathbf{x}_i, \mathbf{x}) + b)
Coupled with a good kernel, an SVM can efficiently solve non-linear classification problems without overfitting.
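As a usage sketch with scikit-learn (the dataset is synthetic and the hyperparameters C and gamma are placeholders that would normally be selected by cross-validation):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Synthetic non-linearly separable problem: the label depends on the distance to the origin.
X = rng.normal(size=(300, 2))
t = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

# Soft-margin SVM with a Gaussian (RBF) kernel.
svm = SVC(kernel="rbf", C=1.0, gamma=1.0)
svm.fit(X, t)

print("training accuracy:", svm.score(X, t))
print("number of support vectors:", svm.n_support_.sum())
```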
SVMs were the weapon of choice before the deep learning era; deep networks deal better with huge datasets.