6.1. Linear classification¶
The goal of this exercise is to study hard and soft binary linear classification.
import numpy as np
import matplotlib.pyplot as plt
6.1.1. Hard linear classification: perceptron algorithm¶
6.1.1.1. Linearly separable data¶
We start by generating a binary linear classification dataset with 100 examples (roughly 50 in each class). You do not need to read the data-generating code in detail: we simply draw 100 points uniformly at random in \([0, 1]^2\) and assign each one a label depending on its position.
def create_dataset(n_samples):
    rng = np.random.default_rng()
    X = rng.uniform(0.0, 1.0, (n_samples, 2))
    t = np.array([1 if X[i, 0] + X[i, 1] > 1. else -1 for i in range(n_samples)])
    return X, t
X, t = create_dataset(100)
Let’s now visualize the training set in the following cell. Samples of the positive class (t=1) are shown as blue points, samples of the negative class (t=-1) as orange ones.
plt.figure(figsize=(10, 6))
plt.scatter(X[t==1, 0], X[t==1, 1])
plt.scatter(X[t==-1, 0], X[t==-1, 1])
plt.show()
You will now implement the online version of the Perceptron algorithm to classify this data set.
As a reminder from the course, we will use a hyperplane \((\mathbf{w}, b)\) to predict whether an input \(\mathbf{x}_i\) belongs to the positive class (+1) or the negative class (-1) using the following function:
\(y_i = \text{sign}( \langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b)\)
Our goal is to minimize the mean square error (mse) of the hyperplane on the training set:
\(\mathcal{L}(\mathbf{w}, b) = \frac{1}{N} \, \sum_{i=1}^{N} (t_i - y_i)^2\)
By applying gradient descent on this loss function, we obtain the delta learning rule:
\(\Delta \mathbf{w} = \eta \, (t_i - y_i) \, \mathbf{x}_i\)
\(\Delta b = \eta \, (t_i - y_i)\)
The online version of the Perceptron is given by the following algorithm:
\(\text{Initialize the weight vector } \mathbf{w} \text{ and the bias } b.\)
\(\textbf{for } M \text{ epochs:}\)
\(\qquad \textbf{forall } \text{examples } (\mathbf{x}_i, t_i) :\)
\(\qquad \qquad y_i = \text{sign}( \langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b)\)
\(\qquad \qquad \mathbf{w} \gets \mathbf{w} + \eta \, (t_i - y_i) \, \mathbf{x}_i\)
\(\qquad \qquad b \gets b + \eta \, (t_i - y_i)\)
Q: Implement the algorithm, based on the linear regression algorithm of exercise 3. The only difference is that the weights now form a vector instead of a single scalar… You will need to use np.dot. Use 20 epochs and a learning rate of 0.01 at first, but you can vary it later. Initialize the weight vector and the bias to 0. Make a plot of the mse during training.
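If you are stuck, here is one possible sketch (the names train_perceptron and losses are ours, not imposed by the exercise); it assumes the arrays X and t created above:
def train_perceptron(X, t, eta=0.01, nb_epochs=20):
    N, d = X.shape
    w = np.zeros(d)      # weight vector initialized to 0
    b = 0.               # bias initialized to 0
    losses = []
    for epoch in range(nb_epochs):
        mse = 0.
        for i in range(N):
            # Prediction with the sign transfer function
            y_i = np.sign(np.dot(w, X[i, :]) + b)
            # Online delta learning rule: update after every example
            w += eta * (t[i] - y_i) * X[i, :]
            b += eta * (t[i] - y_i)
            mse += (t[i] - y_i)**2
        losses.append(mse / N)
    return w, b, losses

w, b, losses = train_perceptron(X, t, eta=0.01, nb_epochs=20)

plt.figure(figsize=(10, 6))
plt.plot(losses)
plt.xlabel("Epoch")
plt.ylabel("mse")
plt.show()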
Q: Visualize the hyperplane. If we call \(x_0\) and \(x_1\) the two coordinates of the inputs, the equation of the hyperplane is:
\(w_0 \, x_0 + w_1 \, x_1 + b = 0\)
which takes the form:
\(x_1 = - \dfrac{w_0 \, x_0 + b}{w_1}\)
You just need to draw a line between the two extremities of the hyperplane, for example between 0 and 1, or between X[:, 0].min() and X[:, 0].max().
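For example, with the w and b returned by the sketch above (and assuming \(w_1 \neq 0\)):
# Endpoints of the separating line x1 = -(w0 * x0 + b) / w1
x0 = np.array([X[:, 0].min(), X[:, 0].max()])
x1 = -(w[0] * x0 + b) / w[1]

plt.figure(figsize=(10, 6))
plt.scatter(X[t==1, 0], X[t==1, 1])
plt.scatter(X[t==-1, 0], X[t==-1, 1])
plt.plot(x0, x1, color="black")
plt.show()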
Before going further, let’s track the evolution of the classification error during learning, defined as the fraction of incorrectly classified examples during one epoch:
\(\epsilon = \dfrac{\text{number of misclassified examples}}{N}\)
Q: Modify your algorithm to compute the training error and the mse loss for each epoch. How do the training error and loss evolve during learning? Do you really need both?
Tips: When iterating over each training example, you will need to increment a counter for misclassifications when your prediction y_i is different from t[i] (use != for “not equal”).
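One possible way to extend the previous sketch (the name train_perceptron_with_error is ours) is to count the misclassifications inside the epoch loop:
def train_perceptron_with_error(X, t, eta=0.01, nb_epochs=20):
    N, d = X.shape
    w = np.zeros(d)
    b = 0.
    losses, errors = [], []
    for epoch in range(nb_epochs):
        mse, misclassified = 0., 0
        for i in range(N):
            y_i = np.sign(np.dot(w, X[i, :]) + b)
            if y_i != t[i]:              # count misclassified examples
                misclassified += 1
            w += eta * (t[i] - y_i) * X[i, :]
            b += eta * (t[i] - y_i)
            mse += (t[i] - y_i)**2
        losses.append(mse / N)
        errors.append(misclassified / N)
    return w, b, losses, errors

w, b, losses, errors = train_perceptron_with_error(X, t, eta=0.01, nb_epochs=20)

plt.figure(figsize=(10, 6))
plt.plot(losses, label="mse")
plt.plot(errors, label="training error")
plt.xlabel("Epoch")
plt.legend()
plt.show()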
Q: Now is the time to play with the hyperparameters:
- Vary the learning rate eta between extreme values (from 0.000001 to 100.0).
- Increase the number of epochs nb_epochs.
What does it change?
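One simple way to explore this, reusing the hypothetical train_perceptron_with_error sketched above (the value nb_epochs=100 is our choice):
plt.figure(figsize=(10, 6))
for eta in [0.000001, 0.01, 1.0, 100.0]:
    w, b, losses, errors = train_perceptron_with_error(X, t, eta=eta, nb_epochs=100)
    plt.plot(errors, label=f"eta={eta}")
plt.xlabel("Epoch")
plt.ylabel("Training error")
plt.legend()
plt.show()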
Q: Change the initial value of the weight vector \(\mathbf{w}\) to something different from 0 (e.g. [1, -1], [-5, 5], etc). What does it change? Vary the learning rate again and conclude on the importance of weight initialization.
6.1.1.2. Non-linearly separable data¶
The generated dataset was obviously linearly separable, because you found a hyperplane able to classify it… Let’s now see what happens when you apply your algorithm to a non-linearly separable dataset. The data is generated in basically the same way as before, except that we add one outlier at the end.
def create_dataset(n_samples):
    rng = np.random.default_rng()
    X = rng.uniform(0.0, 1.0, (n_samples-1, 2))
    t = np.array([1 if X[i, 0] + X[i, 1] > 1. else -1 for i in range(n_samples-1)])
    # Outlier
    X = np.append(X, np.array([0.1, 0.1]).reshape((1, 2)), axis=0)
    t = np.append(t, [1])
    return X, t
X, t = create_dataset(100)
plt.figure(figsize=(10, 6))
plt.scatter(X[t==1, 0], X[t==1, 1])
plt.scatter(X[t==-1, 0], X[t==-1, 1])
plt.show()
Q: Apply your online Perceptron algorithm (with the default values eta = 0.1, nb_epochs = 20, \(\mathbf{w}\) and \(b\) initialized to 0) to the non-linear data. At the end of learning, compute the final error on the training set. What do you observe? Is it a satisfying result? Does it get better when you change the learning rate or the weight initialization?
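Reusing the hypothetical train_perceptron_with_error from the previous sketch, this could look like:
w, b, losses, errors = train_perceptron_with_error(X, t, eta=0.1, nb_epochs=20)
print("Final training error:", errors[-1])

plt.figure(figsize=(10, 6))
plt.plot(errors)
plt.xlabel("Epoch")
plt.ylabel("Training error")
plt.show()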
6.1.2. Soft linear classification: logistic regression¶
Let’s now see whether logistic regression helps us with outliers. The following cell implements the logistic function:
def logistic(x):
    return 1. / (1 + np.exp(-x))
In logistic regression, the prediction \(y = \sigma(\langle \mathbf{w} \cdot \mathbf{x} \rangle + b)\) represents the probability of belonging to the positive class. If \(y > 0.5\), we can say that the example belongs to the positive class.
As seen in the course, there is absolutely no difference in the learning algorithm apart from using the logistic function instead of the sign. One thing to take care of, though, is that the targets \(t_i\) should be 0 and 1 in the logistic regression algorithm, while they are -1 and 1 in the current vector \(\mathbf{t}\). The following cell transforms the array t to match the output of the logistic function.
t[t==-1] = 0
Q: Implement the logistic regression algorithm on the non-linear data. When computing the error, you will need to predict the class based on the probability \(y\). What do you observe? Conclude.
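A minimal sketch, assuming the non-linear X and the converted 0/1 targets t from the cells above (the name train_logistic_regression and the hyperparameter values are our choices):
def train_logistic_regression(X, t, eta=0.1, nb_epochs=100):
    N, d = X.shape
    w = np.zeros(d)
    b = 0.
    errors = []
    for epoch in range(nb_epochs):
        misclassified = 0
        for i in range(N):
            # Probability of belonging to the positive class
            y_i = logistic(np.dot(w, X[i, :]) + b)
            # Predicted class: 1 if the probability exceeds 0.5, else 0
            prediction = 1 if y_i > 0.5 else 0
            if prediction != t[i]:
                misclassified += 1
            # Same delta learning rule, now with the logistic output
            w += eta * (t[i] - y_i) * X[i, :]
            b += eta * (t[i] - y_i)
        errors.append(misclassified / N)
    return w, b, errors

w, b, errors = train_logistic_regression(X, t, eta=0.1, nb_epochs=100)

plt.figure(figsize=(10, 6))
plt.plot(errors)
plt.xlabel("Epoch")
plt.ylabel("Training error")
plt.show()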