Multi-layer Perceptron
Professur für Künstliche Intelligenz - Fakultät für Informatik
A Multi-Layer Perceptron (MLP) or feedforward neural network is composed of:
an input layer for the input vector \mathbf{x}
one or several hidden layers allowing to project non-linearly the input into a space of higher dimensions \mathbf{h}_1, \mathbf{h}_2, \mathbf{h}_3, \ldots.
an output layer for the output \mathbf{y}.
If there is a single hidden layer \mathbf{h}, it corresponds to the feature space.
Each layer takes inputs from the previous layer.
If the hidden layer is adequately chosen, the output neurons can learn to replicate the desired output \mathbf{t}.
\mathbf{h} = f(\textbf{net}_\mathbf{h}) = f(W^1 \, \mathbf{x} + \mathbf{b}^1) \mathbf{y} = f(\textbf{net}_\mathbf{y}) = f(W^2 \, \mathbf{h} + \mathbf{b}^2)
Fully-connected layers (FC) transform an input vector \mathbf{x} into a new vector \mathbf{h} by multiplying it by a weight matrix W and adding a bias vector \mathbf{b}.
A non-linear activation function transforms each element of the net activation.
f(x) = \max(0, x) = \begin{cases} x \quad \text{if} \quad x \geq 0 \\ 0 \quad \text{otherwise.} \end{cases}
f(x) = \begin{cases} x \quad \text{if} \quad x \geq 0 \\ \alpha \, x \quad \text{otherwise.}\end{cases}
y_j = P(\text{class = j}) = \frac{\exp(\text{net}_j)}{\sum_k \exp(\text{net}_k)}
The higher the net activation \text{net}_j, the higher the probability that the example belongs to class j.
Softmax is not per se a transfer function (not local to each neuron), but the idea is similar.
\mathbf{h} = W^1 \, \mathbf{x} + \mathbf{b}^1 \mathbf{y} = W^2 \, \mathbf{h} + \mathbf{b}^2
\begin{align*} \mathbf{y} &= W^2 \, (W^1 \, \mathbf{x} + \mathbf{b}^1) + \mathbf{b}^2 \\ &= (W^2 \, W^1) \, \mathbf{x} + (W^2 \, \mathbf{b}^1 + \mathbf{b}^2) \\ &= W \, \mathbf{x} + \mathbf{b} \\ \end{align*}
A complex pattern-classification problem, cast in a high dimensional space non-linearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated.
Optimization problem
What are the free parameters \theta (weights W^1, W^2 and biases \textbf{b}^1, \textbf{b}^2) making the prediction \mathbf{y} as close as possible from the desired output \mathbf{t}?
We define a loss function \mathcal{L}(\theta) of the free parameters which should be minimized:
\mathcal{L}_\text{reg}(\theta) = \mathbb{E}_{\mathbf{x}, \mathbf{t} \in \mathcal{D}} [ ||\mathbf{t} - \mathbf{y}||^2 ]
\mathcal{L}_\text{class}(\theta) = \mathbb{E}_{\mathbf{x}, \mathbf{t} \sim \mathcal{D}} [ - \langle \mathbf{t} \cdot \log \mathbf{y} \rangle]
\begin{cases} \Delta W^1 = - \eta \, \nabla_{W^1} \, \mathcal{L}(\theta) \\ \\ \Delta \mathbf{b}^1 = - \eta \, \nabla_{\mathbf{b}^1} \, \mathcal{L}(\theta) \\ \\ \Delta W^2 = - \eta \, \nabla_{W^2} \, \mathcal{L}(\theta) \\ \\ \Delta \mathbf{b}^2 = - \eta \, \nabla_{\mathbf{b}^2} \, \mathcal{L}(\theta)\\ \end{cases}
We will see later that other optimizers than SGD can be used.
The question is now how to compute efficiently these gradients w.r.t all the weights and biases.
The algorithm to achieve this is called backpropagation, which is simply a smart implementation of the chain rule.
Rumelhart, Hinton, Williams (1986). “Learning representations by back-propagating errors”. Nature.
\mathbf{h} = f(\textbf{net}_\mathbf{h}) = f(W^1 \, \mathbf{x} + \mathbf{b}^1) \mathbf{y} = f(\textbf{net}_\mathbf{y}) = f(W^2 \, \mathbf{h} + \mathbf{b}^2)
\frac{\partial \mathcal{L}(\theta)}{\partial W^2} = \frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{y}} \times \frac{\partial \mathbf{y}}{\partial \textbf{net}_\mathbf{y}} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial W^2}
\frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{b}^2} = \frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{y}} \times \frac{\partial \mathbf{y}}{\partial \textbf{net}_\mathbf{y}} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial \mathbf{b}^2}
\frac{\partial \mathcal{L}(\theta)}{\partial W^1} = \frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{y}} \times \frac{\partial \mathbf{y}}{\partial \textbf{net}_\mathbf{y}} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial \mathbf{h}} \times \frac{\partial \mathbf{h}}{\partial \textbf{net}_\mathbf{h}} \times \frac{\partial \textbf{net}_\mathbf{h}}{\partial W^1}
\frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{b}^1} = \frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{y}} \times \frac{\partial \mathbf{y}}{\partial \textbf{net}_\mathbf{y}} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial \mathbf{h}} \times \frac{\partial \mathbf{h}}{\partial \textbf{net}_\mathbf{h}} \times \frac{\partial \textbf{net}_\mathbf{h}}{\partial \mathbf{b}^1}
We have already seen for the linear algorithms that the derivative of the loss function w.r.t the net activation of the output \textbf{net}_\mathbf{y} is proportional to the prediction error \mathbf{t} - \mathbf{y}:
\mathbf{\delta_y} = - \frac{\partial \mathcal{l}_\text{reg}(\theta)}{\partial \textbf{net}_\mathbf{y}} = - \frac{\partial \mathcal{l}_\text{reg}(\theta)}{\partial \mathbf{y}} \times \frac{\partial \mathbf{y}}{\partial \textbf{net}_\mathbf{y}} = 2 \, (\mathbf{t} - \mathbf{y}) \, f'(\textbf{net}_\mathbf{y})
\mathbf{\delta_y} = - \frac{\partial \mathcal{l}_\text{class}(\theta)}{\partial \textbf{net}_\mathbf{y}} = (\mathbf{t} - \mathbf{y})
\mathbf{\delta_y} = - \dfrac{\partial \mathcal{l}(\theta)}{\partial \textbf{net}_\mathbf{y}} is called the output error.
The output error is going to appear in all partial derivatives, i.e. in all learning rules.
The backpropagation algorithm is sometimes called backpropagation of the error.
\frac{\partial \mathcal{l}(\theta)}{\partial W^2} = \frac{\partial \mathcal{l}(\theta)}{\partial \textbf{net}_\mathbf{y}} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial W^2} = - \mathbf{\delta_y} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial W^2}
\frac{\partial \mathcal{l}(\theta)}{\partial \mathbf{b}^2} = \frac{\partial \mathcal{l}(\theta)}{\partial \textbf{net}_\mathbf{y}} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial \mathbf{b}^2} = - \mathbf{\delta_y} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial \mathbf{b}^2}
\frac{\partial \mathcal{l}(\theta)}{\partial W^2} = - \mathbf{\delta_y} \times \mathbf{h}^T \;\; \text{and} \;\; \frac{\partial \mathcal{l}(\theta)}{\partial \mathbf{b}^2} = - \mathbf{\delta_y}
i.e. exactly the same delta learning rule as a softmax linear classifier or multiple linear regression using the vector \mathbf{h} as an input.
\begin{cases} \Delta W^2 = \eta \, \mathbf{\delta_y} \times \mathbf{h}^T = \eta \, (\mathbf{t} - \mathbf{y} ) \times \mathbf{h}^T \\ \\ \Delta \mathbf{b}^2 = \eta \, \mathbf{\delta_y} = \eta \, (\mathbf{t} - \mathbf{y} ) \\ \end{cases}
\mathbf{\delta_h} = - \frac{\partial \mathcal{l}(\theta)}{\partial \textbf{net}_\mathbf{h}} = - \frac{\partial \mathcal{l}(\theta)}{\partial \textbf{net}_\mathbf{y}} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial \mathbf{h}} \times \frac{\partial \mathbf{h}}{\partial \textbf{net}_\mathbf{h}} = \mathbf{\delta_y} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial \mathbf{h}} \times \frac{\partial \mathbf{h}}{\partial \textbf{net}_\mathbf{h}}
\frac{\partial \mathcal{l}(\theta)}{\partial W^1} = \frac{\partial \mathcal{l}(\theta)}{\partial \textbf{net}_\mathbf{h}} \times \frac{\partial \textbf{net}_\mathbf{h}}{\partial W^1} = - \mathbf{\delta_h} \times \frac{\partial \textbf{net}_\mathbf{h}}{\partial W^1}
\frac{\partial \mathcal{l}(\theta)}{\partial \mathbf{b}^1} = \frac{\partial \mathcal{l}(\theta)}{\partial \textbf{net}_\mathbf{h}} \times \frac{\partial \textbf{net}_\mathbf{h}}{\partial \mathbf{b}^1} = - \mathbf{\delta_h} \times \frac{\partial \textbf{net}_\mathbf{h}}{\partial \mathbf{b}^1}
\frac{\partial \mathcal{l}(\theta)}{\partial W^1} = - \mathbf{\delta_h} \times \mathbf{x}^T
\frac{\partial \mathcal{l}(\theta)}{\partial \mathbf{b}^1} = - \mathbf{\delta_h}
\frac{\partial \mathcal{l}(\theta)}{\partial W^1} = - \mathbf{\delta_h} \times \mathbf{x}^T
\frac{\partial \mathcal{l}(\theta)}{\partial \mathbf{b}^1} = - \mathbf{\delta_h}
\begin{cases} \Delta W^1 = \eta \, \mathbf{\delta_h} \times \mathbf{x}^T \\ \\ \Delta \mathbf{b}^1 = \eta \, \mathbf{\delta_h} \\ \end{cases}
This is the classical form eta * error * input.
All we need to know is the backpropagated error \mathbf{\delta_h} and we can apply the delta learning rule!
\mathbf{\delta_h} = - \frac{\partial \mathcal{l}(\theta)}{\partial \textbf{net}_\mathbf{h}} = \mathbf{\delta_y} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial \mathbf{h}} \times \frac{\partial \mathbf{h}}{\partial \textbf{net}_\mathbf{h}}
\textbf{net}_\mathbf{y} = W^2 \, \mathbf{h} + \mathbf{b}^2
\mathbf{h} = f(\textbf{net}_\mathbf{h})
we obtain:
\mathbf{\delta_h} = f'(\textbf{net}_\mathbf{h}) \, (W^2)^T \times \mathbf{\delta_y}
If \mathbf{h} and \mathbf{\delta_h} have K elements and \mathbf{y} and \mathbf{\delta_y} have C elements, the matrix W^2 is C \times K as W^2 \times \mathbf{h} must be a vector with C elements.
(W^2)^T \times \mathbf{\delta_y} is therefore a vector with K elements, which is then multiplied element-wise with the derivative of the transfer function to obtain \mathbf{\delta_h}.
\mathbf{h} = f(\textbf{net}_\mathbf{h}) = f(W^1 \, \mathbf{x} + \mathbf{b}^1) \mathbf{y} = f(\textbf{net}_\mathbf{y}) = f(W^2 \, \mathbf{h} + \mathbf{b}^2)
the output error:
\mathbf{\delta_y} = - \frac{\partial \mathcal{l}(\theta)}{\partial \textbf{net}_\mathbf{y}} = (\mathbf{t} - \mathbf{y})
is backpropagated to the hidden layer:
\mathbf{\delta_h} = f'(\textbf{net}_\mathbf{h}) \, (W^2)^T \times \mathbf{\delta_y}
what allows to apply the delta learning rule to all parameters:
\begin{cases} \Delta W^2 = \eta \, \mathbf{\delta_y} \times \mathbf{h}^T \\ \Delta \mathbf{b}^2 = \eta \, \mathbf{\delta_y} \\ \Delta W^1 = \eta \, \mathbf{\delta_h} \times \mathbf{x}^T \\ \Delta \mathbf{b}^1 = \eta \, \mathbf{\delta_h} \\ \end{cases}
f(x) = \begin{cases} 1 \quad \text{if} \quad x \geq 0 \\ 0 \text{ or } 1 \quad \text{otherwise.} \end{cases} \qquad \rightarrow \qquad f'(x) = 1
f(x) = \frac{1}{1+\exp(-x)} \qquad \rightarrow \qquad f'(x) = f(x) \, (1 - f(x))
f(x) = \tanh(x) \qquad \rightarrow \qquad f'(x) = 1 - f(x)^2
f(x) = \max(0, x) = \begin{cases} x \quad \text{if} \quad x \geq 0 \\ 0 \quad \text{otherwise.} \end{cases} \qquad \rightarrow \qquad f'(x) = \begin{cases} 1 \quad \text{if} \quad x \geq 0 \\ 0 \quad \text{otherwise.}\end{cases}
Let’s have a closer look at what is backpropagated using single neurons and weights.
The output neuron y_k computes:
y_k = f(\sum_{j=1}^K W^2_{jk} \, h_j + b^2_k)
\Delta W^2_{jk} = \eta \, \delta_{{y}_k} \, h_j = \eta \, (t_k - y_k) \, h_j
h_j = f(\sum_{i=1}^d W^1_{ij} \, x_i + b^1_j)
\Delta W^1_{ij} = \eta \, \delta_{{h}_j} \, x_i
but we do not know the ground truth of the hidden neuron in the data:
\delta_{{h}_j} = (? - h_j)
\mathbf{\delta_h} = f'(\textbf{net}_\mathbf{h}) \, (W^2)^T \times \mathbf{\delta_y}
\delta_{{h}_j} = - \sum_{k=1}^C W^2_{jk} \, \delta_{{y}_k}
The backpropagated error is the contribution of each hidden neuron h_j to the output error:
If there is no output error, there is no hidden error.
If a hidden neuron sends strong weights |W^2_{jk}| to an output neuron y_k with a strong prediction error \delta_{{y}_k}, this means that it participates strongly to the output error and should learn from it.
If the weight |W^2_{jk}| is small, it means that the hidden neuron does not take part in the output error.
Universal approximation theorem
Cybenko, 1989
Let \varphi() be a nonconstant, bounded, and monotonically-increasing continuous function. Let I_{m_0} denote the m_0-dimensional unit hypercube [0,1]^{m_0}. The space of continuous functions on I_{m_0} is denoted by C(I_{m_0}). Then, given any function f \in C(I_{m_0}) and \epsilon > 0, there exists an integer m_1 and sets of real constants \alpha_i, b_i and w_{ij} \in \Re, where i = 1, ..., m_1 and j = 1, ..., m_0 such that we may define:
F(\mathbf{x}) = \sum_{i=1}^{m_1} \alpha_i \cdot \varphi \left( \sum_{j=1}^{m_0} w_{ij} \cdot x_j + b_i \right)
as an approximate realization of the function f; that is,
| F(\mathbf{x}) - f(\mathbf{x})| < \epsilon
for all x \in I_m.
The universal approximation theorem only proves the existence of a shallow MLP with m_1 neurons in the hidden layer that can approximate any function, but it does not tell how to find this number.
A rule of thumb to find this number is that the generalization error is empirically close to:
\epsilon = \frac{\text{VC}_{\text{dim}}(\text{MLP})}{N}
where \text{VC}_{\text{dim}}(\text{MLP}) is the total number of weights and biases in the model, and N the number of training samples.
The more neurons in the hidden layer, the better the training error, but the worse the generalization error (overfitting).
The optimal number should be found with cross-validation methods.
For most functions, the optimal number m_1 is high and becomes quickly computationally untractable. We need to go deep!
Source: David Silver, Tutorial: Deep Reinforcement Learning
\mathbf{h}_{k} = f(\textbf{net}_{\mathbf{h}^k}) = f(W^k \, \mathbf{h}_{k-1} + \mathbf{b}^k)
\frac{\partial \mathbf{h}_{k}}{\partial \mathbf{h}_{k-1}} = f'(\textbf{net}_{\mathbf{h}^k}) \, W^k
\frac{\partial \mathbf{h}_{k}}{\partial W^{k}} = f'(\textbf{net}_{\mathbf{h}^k}) \, \mathbf{h}_{k-1}
\frac{\partial \mathbf{h}_{k}}{\partial \mathbf{b}_{k}} = f'(\textbf{net}_{\mathbf{h}^k})
\frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k}}
\frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k-1}} = \frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k}} \times \frac{\partial \mathbf{h}_{k}}{\partial \mathbf{h}_{k-1}} = f'(\textbf{net}_{\mathbf{h}^k}) \, (W^k)^T \times \frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k}}
\begin{cases} \dfrac{\partial \mathcal{L}(\theta)}{\partial W^{k}} = \dfrac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k}} \times \dfrac{\partial \mathbf{h}_{k}}{\partial W^{k}} = f'(\textbf{net}_{\mathbf{h}^k}) \, \dfrac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k}} \times \mathbf{h}_{k-1}^T \\ \\ \dfrac{\partial \mathcal{L}(\theta)}{\partial \mathbf{b}_{k}} = \dfrac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k}} \times \dfrac{\partial \mathbf{h}_{k}}{\partial \mathbf{b}_{k}} = f'(\textbf{net}_{\mathbf{h}^k}) \, \dfrac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k}} \\ \end{cases}
\mathbf{x} \rightarrow \mathbf{h}_1 \rightarrow \mathbf{h}_2 \rightarrow \ldots \rightarrow \mathbf{h}_n \rightarrow \mathbf{y}
\frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k-1}} = \frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k}} \times \frac{\partial \mathbf{h}_{k}}{\partial \mathbf{h}_{k-1}}
\Delta W^k = - \eta \, \frac{\partial \mathcal{L}(\theta)}{\partial W^{k}} = - \eta \, \frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k}} \times \frac{\partial \mathbf{h}_{k}}{\partial W^{k}}
Rumelhart, Hinton, Williams (1986). “Learning representations by back-propagating errors”. Nature.
We can create a shallow MLP with:
Two input neurons x_1, x_2 for the two input variables.
Enough hidden neurons (e.g. 20), with a sigmoid or ReLU activation function.
One output neuron with the logistic activation function.
The cross-entropy (negative log-likelihood) loss function.
We train it on the input data using the backpropagation algorithm and the SGD optimizer.
Current:
Tensorflow https://www.tensorflow.org/ released by Google in 2015 is one of the two standard DL frameworks.
Keras https://keras.io/ is a high-level Python API over tensorflow (but also theano, CNTK and MxNet) written by Francois Chollet.
PyTorch http://pytorch.org by Facebook is the other standard framework.
Historical:
Theano http://deeplearning.net/software/theano/ released by U Toronto in 2010 is the predecessor of tensorflow. Now abandoned.
Caffe http://caffe.berkeleyvision.org/ by U Berkeley was long the standard library for convolutional networks.
CNTK https://github.com/Microsoft/CNTK (Microsoft Cognitive Toolkit) is a free library by Microsoft!
MxNet https://github.com/apache/incubator-mxnet from Apache became the DL framework at Amazon.
Let’s implement the previous MLP using keras.
We first need to generate the data using scikit-learn
:
import sklearn.datasets
X, t = sklearn.datasets.make_circles(n_samples=100, shuffle=True, noise=0.15, factor=0.3)
tensorflow
:Sequential
model in keras:model.add(
tf.keras.layers.Dense(
20, # Number of hidden neurons
activation='relu' # Activation function
)
)
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 20) 60
_________________________________________________________________
dense_1 (Dense) (None, 1) 21
=================================================================
Total params: 81
Trainable params: 81
Non-trainable params: 0
_________________________________________________________________
None
With keras (and the other automatic differentiation frameworks), you only need to define the structure of the network.
The rest (backpropagation, SGD) is done automatically.
To make predictions on new data, just do: