import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
= fetch_california_housing()
dataset
= dataset.data
X = dataset.target
t
print(X.shape)
print(t.shape)
(20640, 8)
(20640,)
Let’s now investigate multiple linear regression (MLR), where several output depends on more than one input variable:
\begin{cases} y_1 = w_1 \, x_1 + w_2 \, x_2 + b_1\\ \\ y_2 = w_3 \, x_1 + w_4 \, x_2 + b_2\\ \end{cases}
The California Housing Dataset consists of price of houses in various places in California. Alongside with the price of over 20000 houses, the dataset provides 8 features:
MedInc
: median income in block groupHouseAge
: median house age in block groupAveRooms
: average number of rooms per householdAveBedrms
: average number of bedrooms per householdPopulation
: block group populationAveOccup
: average number of household membersLatitude
: block group latitudeLongitude
: block group longitudeThe California housing dataset can be directly downloaded from scikit-learn:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
= fetch_california_housing()
dataset
= dataset.data
X = dataset.target
t
print(X.shape)
print(t.shape)
(20640, 8)
(20640,)
There are 20640 samples with 8 input features and one output (the price). The following cell decribes the dataset:
print(dataset.DESCR)
.. _california_housing_dataset:
California Housing dataset
--------------------------
**Data Set Characteristics:**
:Number of Instances: 20640
:Number of Attributes: 8 numeric, predictive attributes and the target
:Attribute Information:
- MedInc median income in block group
- HouseAge median house age in block group
- AveRooms average number of rooms per household
- AveBedrms average number of bedrooms per household
- Population block group population
- AveOccup average number of household members
- Latitude block group latitude
- Longitude block group longitude
:Missing Attribute Values: None
This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).
This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).
An household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surpinsingly large values for block groups with few households
and many empty houses, such as vacation resorts.
It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.
.. topic:: References
- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297
The following cell allows to visualize how each feature influences the price individually:
=(12, 15))
plt.figure(figsize
for i in range(8):
4, 2 , i+1)
plt.subplot(
plt.scatter(X[:, i], t)
plt.title(dataset.feature_names[i]) plt.show()
Q: Apply MLR on the California data using the same LinearRegression
method of scikit-learn
as last time. Print the mse, plot how the predictions predict the price for each feature, and plot the prediction y against the true value t for each sample as in the last exercise. Does it work well?
You will also plot the weights of the model (reg.coef_
) and conclude on the relative importance of the different features: which feature has the stronger weight and why?
from sklearn.linear_model import LinearRegression
# Linear regression
= LinearRegression()
reg
reg.fit(X, t)
# Prediction
= reg.predict(X)
y
# mse
= np.mean((t - y)**2)
mse print("MSE:", mse)
=(8, 6))
plt.figure(figsize
plt.scatter(t, y)min(), t.max()], [t.min(), t.max()], c="red")
plt.plot([t."t")
plt.xlabel("y")
plt.ylabel(
=(12, 15))
plt.figure(figsizefor i in range(8):
4, 2 , i+1)
plt.subplot(
plt.scatter(X[:, i], t)
plt.scatter(X[:, i], y)
plt.title(dataset.feature_names[i])
plt.show()
=(12, 5))
plt.figure(figsizeabs(reg.coef_))
plt.plot(np."Feature")
plt.xlabel("Weight")
plt.ylabel( plt.show()
MSE: 0.5243209861846072
A: Feature 3 (AveBedrms, average number of bedrooms per household) got a very strong weight, although it is not a very good predictor of the price. The main reason is that the features are not normalized: AveBedrms varies between 0 an 135, while population varies between 0 and 35000. The weight on population does not need to be very high to influence the price prediction, although it might be very important.
A good practice in machine learning is to normalize the inputs, i.e. to make sure that the features have a mean of 0 and a standard deviation of 1. The formula is:
X^\text{normalized} = \dfrac{X - \mathbb{E}[X]}{\text{std}(X)}
i.e. you compute the mean and standard deviation of each column of X
and apply the formula on each column.
Q: Normalize the dataset. Make sure that the new mean and std is correct.
Tip: X.mean(axis=0)
and X.std(axis=0)
should be useful.
= (X - X.mean(axis=0))/X.std(axis=0)
X_normalized
print("Old mean:", X.mean(axis=0))
print("New mean:", X_normalized.mean(axis=0))
print("Old std:", X.std(axis=0))
print("New std:", X_normalized.std(axis=0))
Old mean: [ 3.87067100e+00 2.86394864e+01 5.42899974e+00 1.09667515e+00
1.42547674e+03 3.07065516e+00 3.56318614e+01 -1.19569704e+02]
New mean: [ 6.60969987e-17 5.50808322e-18 6.60969987e-17 -1.06030602e-16
-1.10161664e-17 3.44255201e-18 -1.07958431e-15 -8.52651283e-15]
Old std: [1.89977569e+00 1.25852527e+01 2.47411320e+00 4.73899376e-01
1.13243469e+03 1.03857980e+01 2.13590065e+00 2.00348319e+00]
New std: [1. 1. 1. 1. 1. 1. 1. 1.]
Q: Apply MLR again on X^\text{normalized}, print the mse and visualize the weights. What has changed?
# Linear regression
= LinearRegression()
reg
reg.fit(X_normalized, t)
# Prediction
= reg.predict(X_normalized)
y
# mse
= np.mean((t - y)**2)
mse print("mse:", mse)
=(8, 6))
plt.figure(figsize
plt.scatter(t, y)min(), t.max()], [t.min(), t.max()], c="red")
plt.plot([t."t")
plt.xlabel("y")
plt.ylabel(
=(12, 15))
plt.figure(figsizefor i in range(8):
4, 2 , i+1)
plt.subplot(
plt.scatter(X_normalized[:, i], t)
plt.scatter(X_normalized[:, i], y)
plt.title(dataset.feature_names[i])
plt.show()
=(12, 5))
plt.figure(figsizeabs(reg.coef_))
plt.plot(np."Feature")
plt.xlabel("Weight")
plt.ylabel( plt.show()
mse: 0.5243209861846072
A: The mse does not change (it may in more complex problems), but the weights are more interpretable. Now strong weights in absolute value mean that the features are very predictive of the price. In particular, the feature populations
Now is time to investigate regularization: 1. MLR with L2 regularization is called Ridge regression 2. MLR with L1 regularization is called Lasso regression
Fortunately, scikit-learn
provides these methods with a similar interface to LinearRegression
. The Ridge
and Lasso
objects take an additional argument alpha
which represents the regularization parameter:
= Ridge(alpha=1.0)
reg = Lasso(alpha=1.0) reg
from sklearn.linear_model import Ridge, Lasso
Q: Apply Ridge and Lasso regression on the scaled data, vary the regularization parameter to understand its function and comment on the results. In particular, vary the regularization parameter for LASSO and identify the features which are the most predictive of the price. Does it make sense?
= Ridge(alpha=10.0)
reg
reg.fit(X_normalized, t)= reg.predict(X_normalized)
y = np.mean((t - y)**2)
mse print(mse)
print(reg.coef_)
=(8, 6))
plt.figure(figsize
plt.scatter(t, y)min(), t.max()], [t.min(), t.max()], c="red")
plt.plot([t."t")
plt.xlabel("y")
plt.ylabel(
=(12, 15))
plt.figure(figsizefor i in range(8):
4, 2 , i+1)
plt.subplot(
plt.scatter(X_normalized[:, i], t)
plt.scatter(X_normalized[:, i], y)
plt.title(dataset.feature_names[i])
plt.show()
=(12, 5))
plt.figure(figsizeabs(reg.coef_))
plt.plot(np."Feature")
plt.xlabel("Weight")
plt.ylabel( plt.show()
0.5243267377872187
[ 0.8293461 0.11939823 -0.26422311 0.30398067 -0.00427544 -0.03936068
-0.8937389 -0.86433656]
= Lasso(alpha=0.1)
reg
reg.fit(X_normalized, t)= reg.predict(X_normalized)
y = np.mean((t - y)**2)
mse print(mse)
print(reg.coef_)
=(8, 6))
plt.figure(figsize
plt.scatter(t, y)min(), t.max()], [t.min(), t.max()], c="red")
plt.plot([t."t")
plt.xlabel("y")
plt.ylabel(
=(12, 15))
plt.figure(figsizefor i in range(8):
4, 2 , i+1)
plt.subplot(
plt.scatter(X_normalized[:, i], t)
plt.scatter(X_normalized[:, i], y)
plt.title(dataset.feature_names[i])
plt.show()
=(12, 5))
plt.figure(figsizeabs(reg.coef_))
plt.plot(np."Feature")
plt.xlabel("Weight")
plt.ylabel( plt.show()
0.6741415809786988
[ 0.70571337 0.10601099 -0. -0. -0. -0.
-0.01121267 -0. ]
A: Ridge regression does not have a big effect for this data. By increasing the regularization parameter in Lasso regression, the MSE worsens slightly, but the number of non-zero weights becomes very small. With alpha=1.0
, all weights are set to 0… With alpha=0.1
, there are only 3 non-zero parameters: the corresponding features are the most predictive of the prices. You can identify them with:
for i, w in enumerate(reg.coef_):
if w != 0.0:
print(dataset.feature_names[i], w)
MedInc 0.705713372959335
HouseAge 0.10601099216033393
Latitude -0.011212667061584703