Linear regression

The goal of this exercise is to implement the least mean squares algorithm (LMS) for linear regression seen in the course.

We start by importing numpy and matplotlib.

import numpy as np
import matplotlib.pyplot as plt

Least mean squares

To generate the data for the exercise, we will use the scikit-learn library https://scikit-learn.org. It provides a huge selection of already implemented machine learning algorithms for classification, regression or clustering.

If you use Colab, scikit-learn should already be installed. Otherwise, install it with either conda or pip (you may need to restart this notebook afterwards):

conda install scikit-learn
# or:
pip install scikit-learn

We will use the method sklearn.datasets.make_regression to generate the data. The documentation of this method is available at https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html.

The following cell imports the method:

from sklearn.datasets import make_regression

We can now generate the data. We start with the simplest case where the inputs have only one dimension. We will generate 100 samples (x_i, t_i) linked by a linear relationship, plus some noise.

The following code generates the data:

N = 100
X, t = make_regression(n_samples=N, n_features=1, noise=15.0)

n_samples is the number of generated samples, n_features is the number of input variables and noise quantifies how much the points deviate from the linear relationship.

Q: Print the shape of the arrays X and t to better understand what is generated. Visualize the dataset using matplotlib (plt.scatter). Vary the value of the noise argument in the previous cell and visualize the data again.

print(X.shape)
print(t.shape)

plt.figure(figsize=(10, 5))
plt.scatter(X, t)
plt.xlabel("x")
plt.ylabel("t")
plt.show()
(100, 1)
(100,)
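
To see the effect of the noise argument, one possibility (a minimal sketch; the chosen values are arbitrary, and each call to make_regression also draws a new random slope) is to regenerate and plot the data for several noise levels:

for noise in [0.0, 15.0, 50.0]:
    X_noise, t_noise = make_regression(n_samples=N, n_features=1, noise=noise)
    plt.figure(figsize=(10, 5))
    plt.scatter(X_noise, t_noise)
    plt.title("noise = " + str(noise))
    plt.xlabel("x")
    plt.ylabel("t")
    plt.show()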

Now is the time to implement the LMS algorithm with numpy.

Remember the LMS algorithm from the course:

  • w=0 \quad;\quad b=0

  • for M epochs:

    • dw=0 \quad;\quad db=0

    • for each sample (x_i, t_i):

      • y_i = w \, x_i + b

      • dw = dw + (t_i - y_i) \, x_i

      • db = db + (t_i - y_i)

    • \Delta w = \eta \, \frac{1}{N} dw

    • \Delta b = \eta \, \frac{1}{N} db

    • w = w + \Delta w \quad;\quad b = b + \Delta b

Our linear model y = w \, x + b predicts outputs for an input x. The error t-y between the prediction and the data is used to adapt the weight w and the bias b at the end of each epoch.

Q: Implement the LMS algorithm and apply it to the generated data. The Python code that you will write is almost a line-by-line translation of the pseudo-code above. You will use a learning rate eta = 0.1 at first, but you will vary this value later. Start by running a single epoch, as it will be easier to debug it, and then increase the number of epochs to 100. Print the value of the weight and bias at the end.

w = 0
b = 0

eta = 0.1

for epoch in range(100):
    dw = 0
    db = 0.0
    
    for i in range(N):
        # Prediction
        y = w * X[i] + b
        
        # LMS
        dw += (t[i] - y) * X[i]
        db += (t[i] - y)
        
    # Parameter updates
    w += eta * dw / N
    b += eta * db / N
    
print(w, b)
[35.129774] [0.43380744]

Q: Visualize the quality of the fit by superposing the learned model to the data with matplotlib.

Tip: you can get the extreme values of the x-axis with X.min() and X.max(). To visualize the model, you just need to plot a line between the points (X.min(), w*X.min()+b) and (X.max(), w*X.max()+b).

plt.figure(figsize=(10, 5))

plt.scatter(X, t)

x_axis = np.array([X.min(), X.max()])
plt.plot(x_axis, w*x_axis + b)

plt.xlabel("x")
plt.ylabel("y")
plt.show()

Another option is to predict a value for all inputs and plot this vector y against the desired values t.

Q: Make a scatter plot where t is the x-axis and y = w\, x + b is the y-axis. How should the points be arranged in the ideal case? Also plot what this ideal relationship should be.

y = w * X + b

plt.figure(figsize=(10, 5))

plt.scatter(t, y)

x_axis = [t.min(), t.max()]
plt.plot(x_axis, x_axis)

plt.xlabel("t")
plt.ylabel("y")
plt.show()

A: The points (t, y) should be on a line with slope 1 and intercept 0 (i.e. t=y).

A much better method to analyse the result of the learning algorithm is to track the mean squared error (mse) after each epoch, i.e. the loss function which we actually want to minimize. The MSE is defined as:

\text{mse} = \frac{1}{N} \, \sum_{i=1}^N (t_i - y_i)^2
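
As a side note, this sum can be computed in a single vectorized numpy expression. The helper below is only a sketch with a hypothetical name (it assumes X has shape (N, 1) and t has shape (N,), as generated above):

def compute_mse(w, b, X, t):
    # Hypothetical helper: predictions for all samples at once, then the mean squared error
    y = w * X[:, 0] + b
    return np.mean((t - y)**2)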

Q: Modify your LMS algorithm (either directly or copy it in the next cell) to track the mse after each epoch. After each epoch, append the mse on the training set to a list (initially empty) and plot it at the end. How does the mse evolve? Which value does it get in the end? Why? How many epochs do you actually need?

w = 0
b = 0

eta = 0.1

losses = []
for epoch in range(100):
    dw = 0
    db = 0.0
    mse = 0.0
    
    for i in range(N):
        # Prediction
        y = w * X[i] + b
        
        # LMS
        dw += (t[i] - y) * X[i]
        db += (t[i] - y)
        
        # mse
        mse += (t[i] - y)**2
    
    # Parameter updates
    w += eta*dw/N
    b += eta*db/N
    losses.append(mse/N)
    
print('mse:', mse/N)
    
plt.figure(figsize=(10, 5))
plt.plot(losses)
plt.xlabel("Epochs")
plt.ylabel("mse")
plt.show()
mse: [255.93046608]

A: The mse decreases exponentially with the epochs. It never reaches 0 because of the noise in the data (try setting the noise argument in the data generator to 0 and check that the mse then reaches 0). 30 or 40 epochs seem sufficient to solve the problem; nothing changes afterwards.
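
As a quick check of this claim, one can regenerate the data without noise and rerun LMS (a sketch; the learned parameters will differ from above because the data changes):

X0, t0 = make_regression(n_samples=N, n_features=1, noise=0.0)

w = 0.0
b = 0.0
eta = 0.1

for epoch in range(100):
    dw = 0.0
    db = 0.0
    for i in range(N):
        y = w * X0[i] + b
        dw += (t0[i] - y) * X0[i]
        db += (t0[i] - y)
    w += eta * dw / N
    b += eta * db / N

# Without noise, the final mse should be close to 0 (increase the number of epochs if it is not)
print('mse:', np.mean((t0 - (w * X0[:, 0] + b))**2))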

Let’s now study the influence of the learning rate. eta=0.1 seemed to work, but is it the best value?

Q: Iterate over multiple values of eta using a logarithmic scale and plot the final mse after 100 epochs as a function of the learning rate. Conclude.

Hint: the logarithmic scale means that you will try values such as 10^{-5}, 10^{-4}, 10^{-3}, etc. up to 1.0. In Python, you can either write explicitly 0.0001 or use the notation 1e-4. For the plot, use np.log10(eta) to only display the exponent on the x-axis.
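
Alternatively, numpy can generate these values directly (a small sketch using np.logspace):

etas = np.logspace(-6, 0, 7)   # 1e-06, 1e-05, ..., 1e-01, 1.0
print(etas)
print(np.log10(etas))          # the exponents, useful for the x-axis of the plot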

losses = []

etas = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]

for eta in etas:
    w = 0
    b = 0
    
    for epoch in range(100):
        dw = 0
        db = 0.0
        mse = 0.0

        for i in range(N):
            # Prediction
            y = w*X[i] + b
            
            # LMS
            dw += (t[i] - y)*X[i]
            db += (t[i] - y)
            
            # mse
            mse += (t[i] - y)**2

        # Parameter updates
        w += eta*dw/N
        b += eta*db/N
    
    losses.append(mse/N)
    
print(losses)

plt.figure(figsize=(10, 5))
plt.plot(np.log10(etas), losses)
plt.xlabel("log(eta)")
plt.ylabel("mse")
plt.show()
[array([2502.38125727]), array([2498.82326114]), array([2463.56399936]), array([2141.11879663]), array([670.57755267]), array([276.42556726]), array([276.42533428])]

A: With small values of the learning rate, LMS does not have time to converge within 100 epochs. A learning rate of 1 is actually possible in this simple problem. Try setting eta to 2.0 and observe what happens.
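
For reference, here is one way to do it (a sketch reusing the data generated above; the exact behaviour depends on the data, but the mse typically stops decreasing and may oscillate or blow up):

w = 0.0
b = 0.0
eta = 2.0

diverging_losses = []
for epoch in range(100):
    dw = 0.0
    db = 0.0
    mse = 0.0
    for i in range(N):
        y = w * X[i] + b
        dw += (t[i] - y) * X[i]
        db += (t[i] - y)
        mse += (t[i] - y)**2
    w += eta * dw / N
    b += eta * db / N
    diverging_losses.append(mse / N)

plt.figure(figsize=(10, 5))
plt.plot(diverging_losses)
plt.xlabel("Epochs")
plt.ylabel("mse")
plt.show()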

Scikit-learn

The code that you have written is functional, but extremely slow, as it uses Python for loops. For so few data samples, it does not make a difference, but if you had millions of samples, this would become a problem.
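
As a side note, the inner Python loop over the samples can already be removed with numpy broadcasting. Here is a minimal vectorized sketch of the same batch LMS updates (it assumes X and t as generated above):

w = 0.0
b = 0.0
eta = 0.1

x = X[:, 0]  # inputs as a flat vector of shape (N,)

for epoch in range(100):
    y = w * x + b                   # predictions for all samples at once
    error = t - y                   # prediction errors
    w += eta * np.mean(error * x)   # same updates as the loop version
    b += eta * np.mean(error)

print(w, b)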

The solution is to use optimized implementations of the algorithms, running in C++ or FORTRAN under the hood. We will use here the linear regression provided by scikit-learn, as you have already installed it and it is very simple to use. Note that one could use tensorflow too, but that would be using a sledgehammer to crack a nut.

scikit-learn provides a LinearRegression object that performs linear regression (unlike our LMS implementation, it computes the least-squares solution directly instead of iterating). The documentation is at: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html.

You simply import it with:

from sklearn.linear_model import LinearRegression

You create the object with:

reg = LinearRegression()

reg is now an object with different methods (fit(), predict()) that accept any kind of data and perform linear regression.

To train the model on the data (X, t), simply use:

reg.fit(X, t)

The parameters of the model are obtained with reg.coef_ for w and reg.intercept_ for b.

You can predict outputs for new inputs using:

y = reg.predict(X)

Q: Apply linear regression on the data using scikit-learn. Check the model parameters after learning and compare them to what you obtained previously. Print the mse and make a plot comparing the predictions with the data.

from sklearn.linear_model import LinearRegression

# Linear regression
reg = LinearRegression()
reg.fit(X, t)

print(reg.coef_, reg.intercept_)

# Prediction
y = reg.predict(X)

# mse
mse = np.mean((t - y)**2)
print('mse:', mse)

plt.figure(figsize=(10, 5))
plt.scatter(t, y)
x_axis = [t.min(), t.max()]
plt.plot(x_axis, x_axis)
plt.xlabel("t")
plt.ylabel("y")
plt.show()
[50.61163981] -2.15821943523261
mse: 276.42533428232537

A: There is no learning rate to select: the fit is fast and directly returns the optimal parameters (the mse matches the best value obtained with LMS).

Delta learning rule

Let’s now implement the online version of LMS, the delta learning rule. The only difference is that the parameter updates are applied immediately after each example is evaluated, not at the end of each epoch.

  • w=0 \quad;\quad b=0

  • for M epochs:

    • for each sample (x_i, t_i):

      • y_i = w \, x_i + b

      • \Delta w = \eta \, (t_i - y_i) \, x_i

      • \Delta b = \eta \, (t_i - y_i)

      • w = w + \Delta w \quad;\quad b = b + \Delta b

Q: Implement the delta learning rule for the regression problem with eta = 0.1. Plot the evolution of the mse and compare it to LMS.

w = 0
b = 0

eta = 0.1

losses = []
for epoch in range(100):
    
    mse = 0.0
    
    for i in range(N):
        # Prediction
        y = w * X[i] + b
        
        # Delta learning rule
        w += eta * (t[i] - y) * X[i]
        b += eta * (t[i] - y)
        
        # mse
        mse += (t[i] - y)**2

    losses.append(mse/N)
    
print('mse:', mse/N)
    
plt.figure(figsize=(10, 5))
plt.plot(losses)
plt.xlabel("Epochs")
plt.ylabel("mse")
plt.show() 
mse: [317.20325456]

y = w * X + b

plt.figure(figsize=(10, 5))

plt.scatter(t, y)

x_axis = [t.min(), t.max()]
plt.plot(x_axis, x_axis)

plt.xlabel("t")
plt.ylabel("y")
plt.show()

A: The delta learning rule converges faster, but the final mse is higher. This is due to the learning rate as shown in the next question.

Q: Vary the learning rate logarithmically as for LMS and conclude.

losses = []

etas = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]

for eta in etas:
    
    w = 0
    b = 0

    for epoch in range(100):
    
        mse = 0.0
    
        for i in range(N):
            # Prediction
            y = w * X[i] + b
        
            # Delta learning rule
            w += eta * (t[i] - y) * X[i]
            b += eta * (t[i] - y)
        
            # mse
            mse += (t[i] - y)**2

    losses.append(mse/N)
    
print(losses)

plt.figure(figsize=(10, 5))
plt.plot(np.log10(etas), losses)
plt.xlabel("log(eta)")
plt.ylabel("mse")
plt.show()
[array([2463.37614216]), array([2139.66406069]), array([670.31330378]), array([276.95458754]), array([281.52657112]), array([317.20325456])]

A: The optimal value of eta is now 0.001; higher learning rates lead to a worse mse. One explanation is that the effective learning rate of LMS is eta/N = 0.001, not eta=0.1, because the accumulated updates are divided by N. When using 0.001 as the learning rate, the delta learning rule behaves almost exactly like LMS:

w = 0
b = 0

eta = 0.001

losses = []
for epoch in range(100):
    
    mse = 0.0
    
    for i in range(N):
        # Prediction
        y = w * X[i] + b
        
        # Delta learning rule
        w += eta * (t[i] - y) * X[i]
        b += eta * (t[i] - y)
        
        # mse
        mse += (t[i] - y)**2

    losses.append(mse/N)
    
print('mse:', mse/N)
    
plt.figure(figsize=(10, 5))
plt.plot(losses)
plt.xlabel("Epochs")
plt.ylabel("mse")
plt.show() 
mse: [276.95458754]