Linear classification
Professur für Künstliche Intelligenz - Fakultät für Informatik
The training data \mathcal{D} is composed of N examples (\mathbf{x}_i, t_i)_{i=1..N} , with a d-dimensional input vector \mathbf{x}_i \in \Re^d and a binary output t_i \in \{-1, +1\}
The data points where t = + 1 are called the positive class, the other the negative class.



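For a point \mathbf{x} \in \mathcal{D}, \langle \mathbf{w} \cdot \mathbf{x} \rangle +b is the projection of \mathbf{x} onto the normal vector \mathbf{w} of the hyperplane (\mathbf{w}, b), shifted by b; it is proportional to the signed distance between \mathbf{x} and the hyperplane.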
If \langle \mathbf{w} \cdot \mathbf{x} \rangle +b > 0, the point is above the hyperplane.
If \langle \mathbf{w} \cdot \mathbf{x} \rangle +b < 0, the point is below the hyperplane.
If \langle \mathbf{w} \cdot \mathbf{x} \rangle +b = 0, the point is on the hyperplane.
By looking at the sign of \langle \mathbf{w} \cdot \mathbf{x} \rangle +b, we can predict the class of the input:
\text{sign}(\langle \mathbf{w} \cdot \mathbf{x} \rangle +b) = \begin{cases} +1 \; \text{if} \; \langle \mathbf{w} \cdot \mathbf{x} \rangle +b \geq 0 \\ -1 \; \text{if} \; \langle \mathbf{w} \cdot \mathbf{x} \rangle +b < 0 \\ \end{cases}
y = f_{\mathbf{w}, b} (\mathbf{x}) = \text{sign} ( \langle \mathbf{w} \cdot \mathbf{x} \rangle +b ) = \text{sign} ( \sum_{j=1}^d w_j \, x_j +b )
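The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming vectors are stored as plain Python lists; the function name `predict` and the example values are hypothetical and not part of the original material.

```python
def predict(w, b, x):
    """Return +1 if <w . x> + b >= 0, otherwise -1."""
    activation = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 1 if activation >= 0 else -1


# Hypothetical hyperplane in d = 2 dimensions.
w, b = [1.0, -2.0], 0.5
print(predict(w, b, [3.0, 1.0]))  # activation = 3 - 2 + 0.5 = 1.5
print(predict(w, b, [0.0, 1.0]))  # activation = -2 + 0.5 = -1.5
```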


Linear classification is the process of finding a hyperplane (\mathbf{w}, b) that correctly separates the two classes.
If such a hyperplane can be found, the training set is said to be linearly separable.
Otherwise, the problem is not linearly separable and other methods have to be applied (MLP, SVM…).
\mathcal{L}(\mathbf{w}, b) = \mathbb{E}_\mathcal{D} [(t_i - y_i)^2] \approx \frac{1}{N} \, \sum_{i=1}^{N} (t_i - y_i)^2
When the prediction y_i is the same as the data t_i for all examples in the training set (perfect classification), the mse is minimal and equal to 0.
We can apply gradient descent to find this minimum.
\begin{cases} \Delta \mathbf{w} = - \eta \, \nabla_\mathbf{w} \, \mathcal{L}(\mathbf{w}, b)\\ \\ \Delta b = - \eta \, \nabla_b \, \mathcal{L}(\mathbf{w}, b)\\ \end{cases}
\nabla_\mathbf{w} \, \mathcal{L}(\mathbf{w}, b) = \nabla_\mathbf{w} \, \frac{1}{N} \, \sum_{i=1}^{N} (t_i - y_i )^2 = \frac{1}{N} \, \sum_{i=1}^{N} \nabla_\mathbf{w} \, (t_i - y_i )^2 = \frac{1}{N} \, \sum_{i=1}^{N} \nabla_\mathbf{w} \, \mathcal{l}_i (\mathbf{w}, b)
\nabla_\mathbf{w} \, \mathcal{l}_i (\mathbf{w}, b) = - 2 \, (t_i - y_i) \, \nabla_\mathbf{w} \, \text{sign}( \langle \mathbf{w} \cdot \mathbf{x}_i \rangle +b)
\nabla_\mathbf{w} \, \mathcal{l}_i (\mathbf{w}, b) = - 2 \, (t_i - y_i) \, \text{sign}'( \langle \mathbf{w} \cdot \mathbf{x}_i \rangle +b) \, \mathbf{x}_i

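The sign function is not differentiable at 0 and its derivative is 0 everywhere else, so the exact gradient is useless for learning. The term \text{sign}' is therefore simply dropped (treated as 1):
\nabla_\mathbf{w} \, \mathcal{l}_i (\mathbf{w}, b) = - 2 \, (t_i - y_i) \, \mathbf{x}_i
Absorbing the factor 2 into the learning rate \eta, the updates become:
\begin{cases} \Delta \mathbf{w} = \eta \, \dfrac{1}{N} \, \displaystyle\sum_{i=1}^{N} (t_i - y_i) \, \mathbf{x}_i\\ \\ \Delta b = \eta \, \dfrac{1}{N} \, \displaystyle\sum_{i=1}^{N} (t_i - y_i )\\ \end{cases}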
Batch linear classification
for M epochs:
    \mathbf{dw} = 0 \qquad db = 0
    for each sample (\mathbf{x}_i, t_i):
        y_i = \text{sign}( \langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b)
        \mathbf{dw} = \mathbf{dw} + (t_i - y_i) \, \mathbf{x}_i
        db = db + (t_i - y_i)
    \Delta \mathbf{w} = \eta \, \frac{1}{N} \, \mathbf{dw}
    \Delta b = \eta \, \frac{1}{N} \, db
This is called the batch version of the Perceptron algorithm.
If the data is linearly separable and \eta is well chosen, it converges to the minimum of the mean square error.
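The batch algorithm above could be written as follows. This is only a sketch under stated assumptions: inputs are lists of lists, the parameters start at zero, and the update is applied as \mathbf{w} \leftarrow \mathbf{w} + \Delta \mathbf{w} (which the pseudocode leaves implicit). The names `batch_perceptron`, `sign` and `dot` and the toy data are hypothetical.

```python
def sign(a):
    return 1 if a >= 0 else -1


def dot(w, x):
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def batch_perceptron(X, T, eta=0.1, epochs=100):
    d, N = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        dw, db = [0.0] * d, 0.0
        # Accumulate the errors over the whole training set.
        for x, t in zip(X, T):
            y = sign(dot(w, x) + b)
            for j in range(d):
                dw[j] += (t - y) * x[j]
            db += t - y
        # Apply the averaged update once per epoch.
        w = [w_j + eta * dw_j / N for w_j, dw_j in zip(w, dw)]
        b += eta * db / N
    return w, b


# Toy linearly separable data (assumed for illustration).
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]]
T = [1, 1, -1, -1]
w, b = batch_perceptron(X, T, eta=0.1, epochs=10)
print([sign(dot(w, x) + b) for x in X])
```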


Perceptron algorithm
for M epochs:
    for each sample (\mathbf{x}_i, t_i):
        y_i = \text{sign}( \langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b)
        \Delta \mathbf{w} = \eta \, (t_i - y_i) \, \mathbf{x}_i
        \Delta b = \eta \, (t_i - y_i)
This algorithm iterates over all examples of the training set and applies the delta learning rule to each of them immediately, not at the end on the whole training set.
One could check at the end of each epoch whether there are still classification errors on the training set, and stop the algorithm when none remain.
The delta learning rule depends on the learning rate \eta, the error made by the prediction (t_i - y_i) and the input \mathbf{x}_i.
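A possible Python version of the online algorithm, including the early stopping check mentioned above. It is a sketch, not a reference implementation: the function name `online_perceptron`, the zero initialization, the value \eta = 0.5 and the logical-AND toy data are assumptions made for illustration.

```python
def sign(a):
    return 1 if a >= 0 else -1


def online_perceptron(X, T, eta=0.5, max_epochs=1000):
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for epoch in range(max_epochs):
        errors = 0
        for x, t in zip(X, T):
            y = sign(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
            if y != t:
                errors += 1
            # Delta learning rule, applied immediately (zero when y == t).
            w = [w_j + eta * (t - y) * x_j for w_j, x_j in zip(w, x)]
            b += eta * (t - y)
        # A full pass without any error means nothing changed: stop.
        if errors == 0:
            break
    return w, b, epoch + 1


# Logical AND, a linearly separable toy problem.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
T = [-1, -1, -1, 1]
w, b, n_epochs = online_perceptron(X, T)
print(w, b, n_epochs)
```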


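The true loss is an expectation over the data distribution:
\mathcal{L}(\mathbf{w}, b) = \mathbb{E}_\mathcal{D} [(t_i - y_i)^2]
Batch learning approximates it with the whole training set and applies one update per epoch:
\mathcal{L}(\mathbf{w}, b) \approx \frac{1}{N} \, \sum_{i=1}^{N} (t_i - y_i)^2
\Delta \mathbf{w} = \eta \, \frac{1}{N} \sum_{i=1}^{N} (t_i - y_i ) \, \mathbf{x}_i
Online learning approximates it with a single example and applies one update per example:
\mathcal{L}(\mathbf{w}, b) \approx (t_i - y_i)^2
\Delta \mathbf{w} = \eta \, (t_i - y_i) \, \mathbf{x}_i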
Batch learning has less bias (central limit theorem) and is less sensitive to noise in the data, but is very slow.
Online learning converges faster, but can be unstable and tends to overfit (high variance).
Minibatch stochastic gradient descent (SGD) averages the update over K randomly chosen examples:
\Delta \mathbf{w} = \eta \, \frac{1}{K} \sum_{i=1}^{K} (t_i - y_i) \, \mathbf{x}_i
If the batch size K is well chosen, SGD is as stable as batch learning and as fast as online learning.
The minibatches are randomly selected at each epoch (i.i.d.).
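The minibatch variant can be sketched as below. The shuffling strategy (one random permutation per epoch, cut into consecutive chunks of size K), the function name `minibatch_sgd` and the synthetic two-cluster data are assumptions chosen for illustration.

```python
import random


def sign(a):
    return 1 if a >= 0 else -1


def minibatch_sgd(X, T, eta=0.1, K=4, epochs=20, seed=0):
    rng = random.Random(seed)
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random minibatches at each epoch
        for start in range(0, len(indices), K):
            batch = indices[start:start + K]
            dw, db = [0.0] * d, 0.0
            for i in batch:
                y = sign(sum(w_j * x_j for w_j, x_j in zip(w, X[i])) + b)
                for j in range(d):
                    dw[j] += (T[i] - y) * X[i][j]
                db += T[i] - y
            # Averaged update over the minibatch only.
            w = [w_j + eta * dw_j / len(batch) for w_j, dw_j in zip(w, dw)]
            b += eta * db / len(batch)
    return w, b


# Two well separated clusters around (2, 2) and (-2, -2).
data_rng = random.Random(1)
X = [[2 + data_rng.uniform(-1, 1), 2 + data_rng.uniform(-1, 1)] for _ in range(20)]
X += [[-2 + data_rng.uniform(-1, 1), -2 + data_rng.uniform(-1, 1)] for _ in range(20)]
T = [1] * 20 + [-1] * 20
w, b = minibatch_sgd(X, T)
errors = sum(sign(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b) != t
             for x, t in zip(X, T))
print("training errors:", errors)
```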

Let’s consider N samples \{x_i\}_{i=1}^N independently taken from a normal distribution X.
The probability density function (pdf) of a normal distribution is:
f(x ; \mu, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}} \, \exp{- \frac{(x - \mu)^2}{2\sigma^2}}
where \mu is the mean of the distribution and \sigma its standard deviation.

L(\mu, \sigma) = P( \mathbf{x} ; \mu , \sigma ) = \prod_{i=1}^{N} f(x_i ; \mu, \sigma )

The likelihood function reflects how well the parameters \mu and \sigma explain the observations \{x_i\}_{i=1}^N.
Note: the samples must be i.i.d. so that the likelihood is a product.
\text{max}_{\mu, \sigma} \quad L(\mu, \sigma) = \prod_{i=1}^{N} f(x_i ; \mu, \sigma )
\begin{aligned} L(\mu, \sigma) & = \prod_{i=1}^{N} f(x_i ; \mu, \sigma ) \\ & = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi \sigma^2}} \, \exp{- \frac{(x_i - \mu)^2}{2\sigma^2}}\\ & = (\frac{1}{\sqrt{2\pi \sigma^2}})^N \, \prod_{i=1}^{N} \exp{- \frac{(x_i - \mu)^2}{2\sigma^2}}\\ & = (\frac{1}{\sqrt{2\pi \sigma^2}})^N \, \exp{- \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{2\sigma^2}}\\ \end{aligned}
\begin{cases} \dfrac{\partial L(\mu, \sigma)}{\partial \mu} = 0 \\ \dfrac{\partial L(\mu, \sigma)}{\partial \sigma} = 0 \\ \end{cases}
\begin{aligned} l(\mu, \sigma) & = \log(L(\mu, \sigma)) \\ & = \log \left((\frac{1}{\sqrt{2\pi \sigma^2}})^N \, \exp{- \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{2\sigma^2}} \right)\\ & = - \frac{N}{2} \log (2\pi \sigma^2) - \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{2\sigma^2}\\ \end{aligned}
\begin{aligned} \frac{\partial l(\mu, \sigma)}{\partial \mu} & = \frac{\sum_{i=1}^{N}(x_i - \mu)}{\sigma^2} = 0 \\ \frac{\partial l(\mu, \sigma)}{\partial \sigma} & = - \frac{N}{2} \frac{4 \pi \sigma}{2 \pi \sigma^2} + \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{\sigma^3} \\ & = - \frac{N}{\sigma} + \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{\sigma^3} = 0\\ \end{aligned}
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad\qquad \sigma^2 = \frac{1}{N} \sum_{i=1}^{N}(x_i - \mu)^2
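As a quick numerical illustration of these two estimators, the sketch below computes them on a small hand-made sample and checks that the log-likelihood is lower when the parameters are moved away from the estimates. The sample values and the function names `gaussian_mle` and `log_likelihood` are hypothetical.

```python
import math
import statistics


def gaussian_mle(samples):
    """Maximum likelihood estimates of the mean and variance."""
    n = len(samples)
    mu = sum(samples) / n
    sigma2 = sum((x - mu) ** 2 for x in samples) / n  # divides by N, not N - 1
    return mu, sigma2


def log_likelihood(samples, mu, sigma2):
    n = len(samples)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in samples) / (2 * sigma2))


samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma2 = gaussian_mle(samples)
print(mu, sigma2)  # 5.0 4.0
print(log_likelihood(samples, mu, sigma2) > log_likelihood(samples, mu + 0.5, sigma2))
```

Note that the variance estimate divides by N, so it matches `statistics.pvariance` (population variance) rather than `statistics.variance`.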
The same principle can be applied to estimate the parameters of any distribution: normal, exponential, Bernoulli, Poisson, etc.
When a machine learning method has a probabilistic interpretation (i.e. it outputs probabilities), MLE can be used to find its parameters.
One can solve for the parameters analytically as done here, or use gradient descent to estimate them iteratively.

\begin{aligned} y = \sigma(w \, x + b ) = \frac{1}{1+\exp(-w \, x - b )} \end{aligned}
P(t = 1 | x; w, b) = y ; \qquad P(t = 0 | x; w, b) = 1 - y
f(t | x; w, b) = y^t \, (1- y)^{1-t}
If we consider our training samples (x_i, t_i) as independently taken from this distribution, our task is:
to find the parameterized distribution that best explains the data, which means:
to find the parameters w and b maximizing the likelihood that the samples t come from a Bernoulli distribution when x, w and b are given.
We only need to apply Maximum Likelihood Estimation (MLE) on this Bernoulli distribution!
\begin{aligned} L( w, b) &= P( t | x; w, b ) = \prod_{i=1}^{N} f(t_i | x_i; w, b ) \\ &= \prod_{i=1}^{N} y_i^{t_i} \, (1- y_i)^{1-t_i} \end{aligned}
\begin{aligned} l( w, b) &= \log L( w, b) \\ &= \sum_{i=1}^{N} [t_i \, \log y_i + (1 - t_i) \, \log( 1- y_i)]\\ \end{aligned}
\mathcal{L}( w, b) = - \sum_{i=1}^{N} [t_i \, \log y_i + (1 - t_i) \, \log( 1- y_i)]
\begin{aligned} \frac{\partial \mathcal{l}_i(w, b)}{\partial w} &= -\frac{\partial}{\partial w} [ t_i \, \log y_i + (1 - t_i) \, \log( 1- y_i) ] \\ &= - t_i \, \frac{\partial}{\partial w} \log y_i - (1 - t_i) \, \frac{\partial}{\partial w}\log( 1- y_i) \\ &= - t_i \, \frac{\frac{\partial}{\partial w} y_i}{y_i} - (1 - t_i) \, \frac{\frac{\partial}{\partial w}( 1- y_i)}{1- y_i} \\ &= - t_i \, \frac{y_i \, (1 - y_i) \, x_i}{y_i} + (1 - t_i) \, \frac{y_i \, (1-y_i) \, x_i}{1 - y_i}\\ &= - ( t_i - y_i ) \, x_i\\ \end{aligned}
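The result of this derivation can be checked numerically. The sketch below compares the closed-form gradient - (t_i - y_i) \, x_i with a central finite difference of the loss, for a scalar input; the function names and the test values are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def loss(w, b, x, t):
    """Negative log-likelihood of one sample (x, t) with t in {0, 1}."""
    y = sigmoid(w * x + b)
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))


def analytic_grad_w(w, b, x, t):
    """Closed-form gradient from the derivation: -(t - y) x."""
    return -(t - sigmoid(w * x + b)) * x


def numeric_grad_w(w, b, x, t, eps=1e-6):
    """Central finite difference approximation of d loss / d w."""
    return (loss(w + eps, b, x, t) - loss(w - eps, b, x, t)) / (2 * eps)


w, b, x, t = 0.3, -0.1, 1.5, 1
print(analytic_grad_w(w, b, x, t), numeric_grad_w(w, b, x, t))
```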
y_i = \sigma(\langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b )
P(t_i = 1 | \mathbf{x}_i; \mathbf{w}, b) = y_i ; \qquad P(t_i = 0 | \mathbf{x}_i; \mathbf{w}, b) = 1 - y_i
\mathcal{L}(\mathbf{w}, b) = - \sum_{i=1}^{N} [t_i \, \log y_i + (1 - t_i) \, \log( 1- y_i)]
\begin{cases} \Delta \mathbf{w} = \eta \, ( t_i - y_i ) \, \mathbf{x}_i \\ \\ \Delta b = \eta \, ( t_i - y_i ) \\ \end{cases}
Logistic regression
\mathbf{w} = 0 \qquad b = 0
for M epochs:
    for each sample (\mathbf{x}_i, t_i):
        y_i = \sigma( \langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b)
        \Delta \mathbf{w} = \eta \, (t_i - y_i) \, \mathbf{x}_i
        \Delta b = \eta \, (t_i - y_i)
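The logistic regression algorithm above might look as follows in Python. This is a sketch assuming targets in {0, 1}, list-based vectors and updates applied as \mathbf{w} \leftarrow \mathbf{w} + \Delta \mathbf{w}; the function names and the toy data are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def logistic_regression(X, T, eta=0.1, epochs=100):
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for x, t in zip(X, T):
            y = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
            # Same delta learning rule as the perceptron, with a soft output y.
            w = [w_j + eta * (t - y) * x_j for w_j, x_j in zip(w, x)]
            b += eta * (t - y)
    return w, b


def probability(w, b, x):
    """P(t = 1 | x; w, b)."""
    return sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)


# Toy separable data with targets in {0, 1}.
X = [[2.0, 2.0], [3.0, 3.0], [1.0, 2.0], [-2.0, -2.0], [-3.0, -1.0], [-1.0, -2.0]]
T = [1, 1, 1, 0, 0, 0]
w, b = logistic_regression(X, T)
print([round(probability(w, b, x), 3) for x in X])
```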

Logistic regression works just like linear classification, except in the way the prediction is done.
To know to which class \mathbf{x}_i belongs, simply draw a random number between 0 and 1:
if it is smaller than y_i (probability y_i), it belongs to the positive class.
if it is bigger than y_i (probability 1-y_i), it belongs to the negative class.
Alternatively, you can put a hard limit at 0.5:
if y_i > 0.5 then the class is positive.
if y_i < 0.5 then the class is negative.
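Both ways of turning the probability y_i into a class can be sketched as follows. The function names are hypothetical, and the behaviour at exactly y_i = 0.5 (assigned here to the negative class) is an arbitrary choice, since the text above does not specify it.

```python
import random


def predict_sample(y, rng):
    """Stochastic prediction: class 1 with probability y, class 0 otherwise."""
    return 1 if rng.random() < y else 0


def predict_threshold(y):
    """Deterministic prediction with a hard limit at 0.5."""
    return 1 if y > 0.5 else 0


rng = random.Random(0)
y = 0.7
draws = [predict_sample(y, rng) for _ in range(10000)]
print(sum(draws) / len(draws))  # empirical frequency, close to y
print(predict_threshold(y))
```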


Logistic regression also provides a confidence score: the closer y_i is to 0 or 1, the more confident the prediction.
This is particularly important in safety critical applications:
If you detect the positive class but with a confidence of 0.51, you should perhaps not trust the prediction.
If the confidence score is 0.99, you can probably trust the prediction.

Two main solutions:
One-vs-All (or One-vs-the-rest): one trains a binary (linear) classifier for each class, all simultaneously. The examples belonging to this class form the positive class, all others form the negative class:
If multiple classes are predicted for a single example, one needs a confidence level for each classifier saying how sure it is of its prediction.
One-vs-One: one trains a classifier for each pair of classes:
A majority vote is then performed to find the correct class.

Suppose we have C classes (dog vs. cat vs. ship vs…).
The One-vs-All scheme involves C binary classifiers (\mathbf{w}_i, b_i), each with a weight vector and a bias, working on the same input \mathbf{x}.
y_i = f(\langle \mathbf{w}_i \cdot \mathbf{x} \rangle + b_i)
\mathbf{y} = f(W \times \mathbf{x} + \mathbf{b})
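A small sketch of the One-vs-All prediction, assuming f is the logistic function so that each classifier outputs a confidence in (0, 1), and assuming the class with the highest confidence is selected. The weights, the function name `one_vs_all_predict` and the input are made up for illustration.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def one_vs_all_predict(W, b, x):
    """W holds one weight vector per class (one row each), b one bias per class.

    Returns the index of the most confident classifier and all confidences.
    """
    y = [sigmoid(sum(w_j * x_j for w_j, x_j in zip(w_i, x)) + b_i)
         for w_i, b_i in zip(W, b)]
    best = max(range(len(y)), key=lambda i: y[i])
    return best, y


# C = 3 hypothetical classifiers on a 2-dimensional input.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
best, y = one_vs_all_predict(W, b, [0.5, 2.0])
print(best, [round(y_i, 3) for y_i in y])
```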

\mathbf{z} = f_{W, \mathbf{b}}(\mathbf{x}) = W \times \mathbf{x} + \mathbf{b}
Each element z_j of the vector \mathbf{z} is called the logit score of class j.
The logit scores are not probabilities, as they can be negative and do not sum to 1.
How do we represent the ground truth \mathbf{t} for each neuron?
The target vector \mathbf{t} is represented using one-hot encoding.

The binary vector has one element per class: only one element is 1, the others are 0.
Example:
\mathbf{t} = [\text{cat}, \text{dog}, \text{ship}, \text{house}, \text{car}] = [0, 1, 0, 0, 0]
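One-hot encoding is straightforward to implement. The helper name `one_hot` below is hypothetical; the class list is the one from the example.

```python
def one_hot(label, classes):
    """Binary vector with a single 1 at the position of `label`."""
    return [1 if c == label else 0 for c in classes]


classes = ["cat", "dog", "ship", "house", "car"]
print(one_hot("dog", classes))  # [0, 1, 0, 0, 0]
```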
The labels can be seen as a probability distribution over the training set, in this case a multinomial distribution (a die with C sides).
For a given image \mathbf{x} (e.g. a picture of a dog), the conditional pmf is defined by the one-hot encoded vector \mathbf{t}:
P(\mathbf{t} | \mathbf{x}) = [P(\text{cat}| \mathbf{x}), P(\text{dog}| \mathbf{x}), P(\text{ship}| \mathbf{x}), P(\text{house}| \mathbf{x}), P(\text{car}| \mathbf{x})] = [0, 1, 0, 0, 0]

y_j = P(\text{class = j} | \mathbf{x}) = \mathcal{S}(z_j) = \frac{\exp(z_j)}{\sum_k \exp(z_k)}

The higher z_j, the higher the probability that the example belongs to class j.
This is very similar to logistic regression for soft classification, except that we have multiple classes.
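A minimal softmax sketch in Python. Subtracting the maximum logit before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the result unchanged because the softmax is invariant to adding a constant to all logits. The logit values are hypothetical.

```python
import math


def softmax(z):
    m = max(z)  # shift for numerical stability, does not change the output
    exps = [math.exp(z_j - m) for z_j in z]
    total = sum(exps)
    return [e / total for e in exps]


z = [2.0, 1.0, -1.0]  # hypothetical logit scores
y = softmax(z)
print([round(y_j, 3) for y_j in y], sum(y))
```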
\text{mse}(W, \mathbf{b}) = \sum_j (t_{j} - \frac{\exp(z_j)}{\sum_k \exp(z_k)})^2

We actually want to minimize the statistical distance between two distributions:
The model outputs a multinomial probability distribution \mathbf{y} for an input \mathbf{x}: P(\mathbf{y} | \mathbf{x}; W, \mathbf{b}).
The one-hot encoded classes also come from a multinomial probability distribution P(\mathbf{t} | \mathbf{x}).
We search which parameters (W, \mathbf{b}) make the two distributions P(\mathbf{y} | \mathbf{x}; W, \mathbf{b}) and P(\mathbf{t} | \mathbf{x}) close.
The training data \{\mathbf{x}_i, \mathbf{t}_i\} represents samples from P(\mathbf{t} | \mathbf{x}).
P(\mathbf{y} | \mathbf{x}; W, \mathbf{b}) is a good model of the data when the two distributions are close, i.e. when the negative log-likelihood of each sample under the model is small.
\mathcal{l}(W, \mathbf{b}) = \mathcal{H}(\mathbf{t} | \mathbf{x}, \mathbf{y} | \mathbf{x}) = \mathbb{E}_{t \sim P(\mathbf{t} | \mathbf{x})} [ - \log P(\mathbf{y} = t | \mathbf{x})]

\mathcal{l}(W, \mathbf{b}) = \mathcal{H}(\mathbf{t} | \mathbf{x}, \mathbf{y} | \mathbf{x}) = \mathbb{E}_{t \sim P(\mathbf{t} | \mathbf{x})} [ - \log P(\mathbf{y} = t | \mathbf{x})] = - \sum_{j=1}^C P(t_j | \mathbf{x}) \, \log P(y_j = t_j | \mathbf{x})
\mathcal{l}(W, \mathbf{b}) = - \log P(\mathbf{y} = t^* | \mathbf{x})
\mathcal{l}(W, \mathbf{b}) = - \log y_{j^*}

The minimum of - \log y is obtained when y =1:
Because of the softmax activation function, the probabilities of the other classes then have to get closer to 0.
y_j = P(\text{class = j}) = \frac{\exp(z_j)}{\sum_k \exp(z_k)}

\mathcal{l}(W, \mathbf{b}) = - \langle \mathbf{t} \cdot \log \mathbf{y} \rangle = - \sum_{j=1}^C t_j \, \log y_j = - \log y_{j^*}
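The equality between the dot-product form and - \log y_{j^*} can be illustrated with a short sketch. The function name `cross_entropy` and the example vectors are hypothetical; terms with t_j = 0 are skipped, which gives the same sum while avoiding \log 0.

```python
import math


def cross_entropy(t, y):
    """- sum_j t_j log y_j for a one-hot target t and predicted probabilities y."""
    return -sum(t_j * math.log(y_j) for t_j, y_j in zip(t, y) if t_j > 0)


t = [0, 1, 0]
y = [0.2, 0.7, 0.1]
print(cross_entropy(t, y), -math.log(0.7))  # both values are identical
```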
\begin{split} \frac{\partial \mathcal{l}(W, \mathbf{b})}{\partial z_i} & = - \sum_j \frac{\partial}{\partial z_i} t_j \log(y_j)= - \sum_j t_j \frac{\partial \log(y_j)}{\partial z_i} = - \sum_j t_j \frac{1}{y_j} \frac{\partial y_j}{\partial z_i} \\ & = - \frac{t_i}{y_i} \frac{\partial y_i}{\partial z_i} - \sum_{j \neq i}^C \frac{t_j}{y_j} \frac{\partial y_j}{\partial z_i} = - \frac{t_i}{y_i} y_i (1-y_i) - \sum_{j \neq i}^C \frac{t_j}{y_j} (-y_j \, y_i) \\ & = - t_i + t_i \, y_i + \sum_{j \neq i}^C t_j \, y_i = - t_i + \sum_{j = 1}^C t_j y_i = -t_i + y_i \sum_{j = 1}^C t_j \\ & = - (t_i - y_i) \end{split}
i.e. the same as with the mse in linear regression!
\frac{\partial \mathcal{l}(W, \mathbf{b})}{\partial \mathbf{z}} = - (\mathbf{t} - \mathbf{y} )
See https://peterroelants.github.io/posts/cross-entropy-softmax/ for more explanations on the differentiation.
Using the chain rule with
\mathbf{z} = W \times \mathbf{x} + \mathbf{b}
we can obtain the partial derivatives:
\begin{cases} \dfrac{\partial \mathcal{l}(W, \mathbf{b})}{\partial W} = \dfrac{\partial \mathcal{l}(W, \mathbf{b})}{\partial \mathbf{z}} \times \dfrac{\partial \mathbf{z}}{\partial W} = - (\mathbf{t} - \mathbf{y} ) \times \mathbf{x}^T \\ \\ \dfrac{\partial \mathcal{l}(W, \mathbf{b})}{\partial \mathbf{b}} = \dfrac{\partial \mathcal{l}(W, \mathbf{b})}{\partial \mathbf{z}} \times \dfrac{\partial \mathbf{z}}{\partial \mathbf{b}} = - (\mathbf{t} - \mathbf{y} ) \\ \end{cases}
\begin{cases} \Delta W = \eta \, (\mathbf{t} - \mathbf{y} ) \times \mathbf{x}^T \\ \\ \Delta \mathbf{b} = \eta \, (\mathbf{t} - \mathbf{y} ) \\ \end{cases}
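One step of this learning rule can be sketched as follows, with W stored as a list of rows (one per class). The function names, the zero initialization and the example input are assumptions; the final comparison simply checks that the cross-entropy loss decreases after the update on this example.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(z_j - m) for z_j in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(W, b, x):
    """y = softmax(W x + b)."""
    z = [sum(w_jk * x_k for w_jk, x_k in zip(row, x)) + b_j
         for row, b_j in zip(W, b)]
    return softmax(z)


def loss(W, b, x, t):
    y = forward(W, b, x)
    return -sum(t_j * math.log(y_j) for t_j, y_j in zip(t, y) if t_j > 0)


def delta_rule_step(W, b, x, t, eta=0.1):
    """Delta W = eta (t - y) x^T and Delta b = eta (t - y)."""
    y = forward(W, b, x)
    err = [t_j - y_j for t_j, y_j in zip(t, y)]
    W_new = [[w_jk + eta * e_j * x_k for w_jk, x_k in zip(row, x)]
             for row, e_j in zip(W, err)]
    b_new = [b_j + eta * e_j for b_j, e_j in zip(b, err)]
    return W_new, b_new


W = [[0.0, 0.0] for _ in range(3)]  # C = 3 classes, d = 2 inputs
b = [0.0, 0.0, 0.0]
x, t = [1.0, 2.0], [0, 1, 0]
before = loss(W, b, x, t)
W, b = delta_rule_step(W, b, x, t)
after = loss(W, b, x, t)
print(before, after)
```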
\mathbf{z} = W \times \mathbf{x} + \mathbf{b}
y_j = \frac{\exp(z_j)}{\sum_k \exp(z_k)}
\mathcal{L}(W, \mathbf{b}) = \mathbb{E}_{\mathbf{x}, \mathbf{t} \sim \mathcal{D}} [ - \langle \mathbf{t} \cdot \log \mathbf{y} \rangle]
which simplifies into the delta learning rule:
\begin{cases} \Delta W = \eta \, (\mathbf{t} - \mathbf{y} ) \times \mathbf{x}^T \\ \\ \Delta \mathbf{b} = \eta \, (\mathbf{t} - \mathbf{y} ) \\ \end{cases}
\mathcal{L}(W, \mathbf{b}) = \mathbb{E}_{\mathbf{x}, \mathbf{t} \sim \mathcal{D}} [ ||\mathbf{t} - \mathbf{y}||^2 ]
\mathcal{L}(W, \mathbf{b}) = \mathbb{E}_{\mathbf{x}, \mathbf{t} \sim \mathcal{D}} [ - \langle \mathbf{t} \cdot \log \mathbf{y} \rangle]
What if there is more than one label on the image?
The target vector \mathbf{t} does not represent a probability distribution anymore:
\mathbf{t} = [\text{cat}, \text{dog}, \text{ship}, \text{house}, \text{car}] = [1, 1, 0, 0, 0]
\mathbf{t} = [\text{cat}, \text{dog}, \text{ship}, \text{house}, \text{car}] = [0.5, 0.5, 0, 0, 0]
\mathbf{y} = \sigma(W \times \mathbf{x} + \mathbf{b})
y_j = P(\text{class} = j | \mathbf{x})
\mathcal{l}_j(W, \mathbf{b}) = - t_j \, \log y_j - (1 - t_j) \, \log( 1- y_j)
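For the multi-label case, each output is treated as an independent logistic classifier. The sketch below computes the per-class loss as a negative log-likelihood, with the same sign convention as the logistic regression loss earlier in this section. The logit values and function names are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def binary_cross_entropy(t, y):
    """Per-class loss: -[t_j log y_j + (1 - t_j) log(1 - y_j)]."""
    return [-(t_j * math.log(y_j) + (1 - t_j) * math.log(1 - y_j))
            for t_j, y_j in zip(t, y)]


t = [1, 1, 0, 0, 0]                # cat and dog are both present
z = [2.0, 1.0, -1.0, -3.0, -2.0]   # hypothetical logit scores
y = [sigmoid(z_j) for z_j in z]    # independent probabilities, need not sum to 1
losses = binary_cross_entropy(t, y)
print([round(l_j, 3) for l_j in losses], sum(losses))
```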

The training error is the error made on the training set.
\epsilon_\mathcal{D} = \dfrac{\text{number of misclassifications}}{\text{number of examples}}
What matters is the generalization error, which is the error that will be made on new examples (not used during learning).
It is much harder to measure (there is a potentially infinite number of new examples, and their correct answers may be unknown).
Often approximated by the empirical error on the test set: one keeps a number of training examples out of the learning phase and one tests the performance on them.
Need for cross-validation to detect overfitting.
Overfitting in regression

Overfitting in classification

Confusion matrix
Classification errors can also depend on the class:
A False Positive error (FP, false alarm, type I) occurs when the classifier predicts a positive class for a negative example.
A False Negative error (FN, miss, type II) occurs when the classifier predicts a negative class for a positive example.
True Positive (TP) and True Negative (TN) are correctly classified examples.
Is it better to fail to detect a cancer (FN) or to incorrectly predict one (FP)?
\epsilon = \frac{\text{FP} + \text{FN}}{\text{TP} + \text{FP} + \text{TN} + \text{FN}}
\text{acc} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{TN} + \text{FN}}
Recall R and precision P:
R = \frac{\text{TP}}{\text{TP} + \text{FN}} \qquad P = \frac{\text{TP}}{\text{TP} + \text{FP}}
\text{F1} = \frac{2\, P \, R}{P + R}
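These metrics can be computed directly from the four counts. The sketch below assumes all denominators are non-zero; the function name and the counts are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    total = tp + fp + tn + fn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "error": (fp + fn) / total,
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for 100 test examples.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(name, round(value, 3))
```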
For multiclass classification problems, the confusion matrix tells how many examples are correctly classified and where confusion happens.
One axis is the predicted class, the other is the target class.
Each element of the matrix tells how many examples are classified or misclassified.
The matrix should be as diagonal as possible.
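A multiclass confusion matrix can be built by counting (target, prediction) pairs. In this sketch, rows index the target class and columns the predicted class, which is one possible convention; the function name and the labels are hypothetical.

```python
def confusion_matrix(targets, predictions, n_classes):
    """matrix[t][y] counts the examples of class t predicted as class y."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, y in zip(targets, predictions):
        matrix[t][y] += 1
    return matrix


targets = [0, 0, 1, 1, 2, 2, 2]
predictions = [0, 1, 1, 1, 2, 0, 2]
matrix = confusion_matrix(targets, predictions, 3)
for row in matrix:
    print(row)

# Correct classifications lie on the diagonal.
accuracy = sum(matrix[c][c] for c in range(3)) / len(targets)
print(accuracy)
```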

scikit-learn: