10.1. Convolutional neural networks

The goal of this exercise is to train a convolutional neural network on MNIST and better understand what is happening during training.

10.1.1. Training a CNN on MNIST

import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf

print(tf.__version__)

Tip: CNNs are much slower to train on CPU than the DNN of the last exercise. It is feasible to do this exercise on a normal computer, but if you have a Google account, we suggest using Colab to run this notebook on a GPU for free (training time should be divided by a factor of 5 or so).

Then go to the menu “Runtime” and select “Change runtime type”. There you can change the “Hardware accelerator” to GPU. Do not choose TPU: it will be as slow as the CPU for the small networks we are using.
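To verify that the GPU runtime is actually active, you can ask TensorFlow which devices it sees (a quick sanity check; on a CPU-only machine the list is simply empty):

```python
import tensorflow as tf

# List the GPUs visible to TensorFlow.
# On Colab with the GPU runtime enabled, this should print one device;
# on a CPU-only machine it prints an empty list.
gpus = tf.config.list_physical_devices('GPU')
print("GPUs available:", gpus)
```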

We import and normalize the MNIST data like last time, except we do not reshape the images: they stay with the shape (28, 28, 1):

# Fetch the MNIST data
(X_train, t_train), (X_test, t_test) = tf.keras.datasets.mnist.load_data()
print("Training data:", X_train.shape, t_train.shape)
print("Test data:", X_test.shape, t_test.shape)

# Normalize the values
X_train = X_train.reshape(-1, 28, 28, 1).astype('float32') / 255.
X_test = X_test.reshape(-1, 28, 28, 1).astype('float32') / 255.

# Mean removal
X_mean = np.mean(X_train, axis=0)
X_train -= X_mean
X_test -= X_mean

# One-hot encoding
T_train = tf.keras.utils.to_categorical(t_train, 10)
T_test = tf.keras.utils.to_categorical(t_test, 10)
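The mean removal above subtracts one mean value per pixel, not a single global mean. A minimal sketch of what this does, using random stand-in data (the array X below is a hypothetical placeholder for X_train, so the snippet runs without downloading MNIST):

```python
import numpy as np

# Random stand-in for X_train: 100 images of shape (28, 28, 1)
rng = np.random.default_rng(0)
X = rng.random((100, 28, 28, 1)).astype('float32')

# One mean value per pixel, shape (28, 28, 1)
X_mean = np.mean(X, axis=0)
X -= X_mean

# After centering, the per-pixel mean over the dataset is numerically zero
print(np.abs(X.mean(axis=0)).max())
```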

We can now define the CNN shown in the first image:

  • a convolutional layer with 16 3x3 filters, using valid padding and ReLU transfer functions,

  • a max-pooling layer over 2x2 regions,

  • a fully-connected layer with 100 ReLU neurons,

  • a softmax layer with 10 neurons.

The CNN will be trained on MNIST using SGD with momentum.

The following code defines this basic network in keras:

# Delete all previous models to free memory
tf.keras.backend.clear_session()

# Sequential model
model = tf.keras.models.Sequential()

# Input layer representing the (28, 28) image
model.add(tf.keras.layers.Input(shape=(28, 28, 1)))

# Convolutional layer with 16 feature maps using 3x3 filters
model.add(tf.keras.layers.Conv2D(16, (3, 3), padding='valid'))
model.add(tf.keras.layers.Activation('relu')) 

# Max-pooling layer over 2x2 regions
model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))

# Flatten the feature maps into a vector
model.add(tf.keras.layers.Flatten())

# Fully-connected layer
model.add(tf.keras.layers.Dense(units=100))
model.add(tf.keras.layers.Activation('relu')) 

# Softmax output layer over 10 classes
model.add(tf.keras.layers.Dense(units=10))
model.add(tf.keras.layers.Activation('softmax')) 

# Learning rule
# learning_rate replaces the deprecated lr argument; the old decay argument is no longer supported
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9, nesterov=True)

# Loss function
model.compile(
    loss='categorical_crossentropy', # loss function
    optimizer=optimizer, # learning rule
    metrics=['accuracy'] # show accuracy
)

print(model.summary())

Note the use of Flatten() to transform the 13x13x16 tensor representing the max-pooling layer into a vector of 2704 elements.

Note also the use of padding='valid' and its effect on the size of the tensor corresponding to the convolutional layer. Change it to padding='same' and conclude on its effect.
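The effect of the padding mode on the tensor sizes can also be computed by hand (a small sketch, assuming stride 1, as in the model above):

```python
def conv_out(n, k, padding):
    """Output size of a k x k convolution on an n x n input, stride 1."""
    # 'valid': no zero-padding, the filter must fit entirely inside the image
    # 'same': zero-padding so that the output keeps the size of the input
    return n - k + 1 if padding == 'valid' else n

for padding in ('valid', 'same'):
    conv = conv_out(28, 3, padding)   # convolutional layer
    pool = conv // 2                  # 2x2 max-pooling
    flat = pool * pool * 16           # flattened vector, 16 feature maps
    print(f"{padding}: conv {conv}x{conv}, pool {pool}x{pool}, flatten {flat}")
```

With valid padding you recover the 13x13x16 tensor and the 2704-element vector mentioned above; with same padding the flattened vector grows to 3136 elements.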

Q: Which layer has the most parameters? Why? Compare with the fully-connected MLPs you obtained during exercise 5.

Let’s now train this network on MNIST for 10 epochs, using minibatches of 64 images:

# History tracks the evolution of the metrics during learning
history = tf.keras.callbacks.History()

# Training procedure
model.fit(
    X_train, T_train, # training data
    batch_size=64,  # batch size
    epochs=10, # Maximum number of epochs
    validation_split=0.1, # Percentage of training data used for validation
    callbacks=[history] # Track the metrics at the end of each epoch
)

As in the previous exercise, the next cells compute the test loss and accuracy and display the evolution of the training and validation accuracies:

score = model.evaluate(X_test, T_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
plt.figure(figsize=(15, 6))

plt.subplot(121)
plt.plot(history.history['loss'], '-r', label="Training")
plt.plot(history.history['val_loss'], '-b', label="Validation")
plt.xlabel('Epoch #')
plt.ylabel('Loss')
plt.legend()

plt.subplot(122)
plt.plot(history.history['accuracy'], '-r', label="Training")
plt.plot(history.history['val_accuracy'], '-b', label="Validation")
plt.xlabel('Epoch #')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

Q: What do you think of 1) the final accuracy and 2) the training time, compared to the MLP of last time?

Q: When does your network start to overfit? How to recognize it?

Q: Try different values for the batch size (16, 32, 64, 128..). What is its influence?

Q: Improve the CNN to avoid overfitting. The test accuracy should be around 99%.

You can:

  • change the learning rate

  • add another block on convolution + max-pooling before the fully-connected layer to reduce the number of parameters,

  • add dropout after some of the layers,

  • use L2 regularization,

  • use a different optimizer,

  • do whatever you want.

Beware: training is now relatively slow, so keep your number of tries limited. Once you find a good architecture that does not overfit, train it for 20 epochs and proceed to the next questions.
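As an illustration only, here is one possible way to combine several of these ideas (a sketch: the second convolutional block, the dropout rates and the Adam learning rate are arbitrary starting points, not a verified 99% solution):

```python
import tensorflow as tf

tf.keras.backend.clear_session()

# Two conv + max-pooling blocks reduce the flattened vector (and the parameter
# count of the dense layer); dropout regularizes the fully-connected part.
model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(16, (3, 3), padding='valid', activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), padding='valid', activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Adam as an alternative optimizer to SGD with momentum
model.compile(
    loss='categorical_crossentropy',
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    metrics=['accuracy']
)

model.summary()
```

Treat this as a baseline to tweak: check with model.summary() how the second block shrinks the flattened vector compared to the original network.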

10.1.2. Analysing the CNN

Once a network has been trained, let’s see what has happened internally.

10.1.2.1. Accessing trained weights

Each layer of the network can be addressed individually. For example, model.layers[0] represents the first layer of your network (the first convolutional one, as the input layer does not count). The index of the other layers can be found by looking at the output of model.summary().

You can obtain the parameters of each layer (if any) with:

W = model.layers[0].get_weights()[0]

Q: Print the shape of these weights and relate them to the network.

Q: Visualize with imshow() each of the 16 filters of the first convolutional layer. Interpret what kind of operation they perform on the image.

Hint: subplot() is going to be useful here. If you have 16 images img[i], you can visualize them in a 4x4 grid with:

for i in range(16):
    plt.subplot(4, 4, i+1)
    plt.imshow(img[i], cmap=plt.cm.gray)

10.1.2.2. Visualizing the feature maps

Let’s take a random image from the training set and visualize it:

idx = 31727 # or any other digit
x = X_train[idx, :, :, :].reshape(1, 28, 28, 1)
t = t_train[idx]

print(t)

plt.figure(figsize=(6, 6))
plt.imshow(x[0, :, :, 0] + X_mean[:, :, 0], cmap=plt.cm.gray)
plt.colorbar()
plt.show()

This example could be a 1 or 7. That is why you will never get 100% accuracy on MNIST: some examples are hard even for humans…

Q: Print what the model predicts for it, its true label, and visualize the probabilities in the softmax output layer (look at the doc of model.predict()):

Depending on how your network converged, you may have the correct prediction or not.

Q: Visualize the output of the network for different examples. Do these ambiguities happen often?

Now let’s look inside the network. We will first visualize the 16 feature maps of the first convolutional layer.

This is actually very simple in TensorFlow 2.x: one only needs to create a new model (class tf.keras.models.Model, not Sequential) taking the same inputs as the original model, but returning the output of the first layer (model.layers[0] is the first convolutional layer of the model, as the input layer does not count):

model_conv = tf.keras.models.Model(inputs=model.inputs, outputs=model.layers[0].output)

To get the tensor corresponding to the first convolutional layer, one simply needs to call predict() on the new model:

feature_maps = model_conv.predict(x)
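With 3x3 filters and valid padding, the returned tensor should have shape (1, 26, 26, 16): one 26x26 map per filter. A self-contained sketch of the mechanism, using an untrained stand-in model with the same first layer (so it runs independently of your trained network):

```python
import numpy as np
import tensorflow as tf

tf.keras.backend.clear_session()

# Stand-in model with the same first convolutional layer as in the exercise
m = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(16, (3, 3), padding='valid', activation='relu'),
])

# New model returning the output of the first layer instead of the prediction
m_conv = tf.keras.models.Model(inputs=m.inputs, outputs=m.layers[0].output)

x = np.zeros((1, 28, 28, 1), dtype='float32')  # dummy input image
print(m_conv.predict(x).shape)  # (1, 26, 26, 16)
```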

Q: Visualize the 16 feature maps using subplot(). Relate these activations to the filters you have visualized previously.

Q: Do the same with the output of the first max-pooling layer.

Hint: you need to find the index of that layer in model.summary().

Bonus question: if you had several convolutional layers in your network, visualize them too. What do you think of the specificity of some features?