Neurocomputing

Multi-layer Perceptron

Julien Vitay

Professur für Künstliche Intelligenz - Fakultät für Informatik

1 - Multi-layer perceptron

Multi-layer perceptron

  • A Multi-Layer Perceptron (MLP) or feedforward neural network is composed of:

    • an input layer for the input vector \mathbf{x}

    • one or several hidden layers allowing to project non-linearly the input into a space of higher dimensions \mathbf{h}_1, \mathbf{h}_2, \mathbf{h}_3, \ldots.

    • an output layer for the output \mathbf{y}.

  • If there is a single hidden layer \mathbf{h}, it corresponds to the feature space.

  • Each layer takes inputs from the previous layer.

  • If the hidden layer is adequately chosen, the output neurons can learn to replicate the desired output \mathbf{t}.

Fully-connected layer

  • The operation performed by each layer can be written in the form of a matrix-vector multiplication:

\mathbf{h} = f(\textbf{net}_\mathbf{h}) = f(W^1 \, \mathbf{x} + \mathbf{b}^1) \mathbf{y} = f(\textbf{net}_\mathbf{y}) = f(W^2 \, \mathbf{h} + \mathbf{b}^2)

  • Fully-connected layers (FC) transform an input vector \mathbf{x} into a new vector \mathbf{h} by multiplying it by a weight matrix W and adding a bias vector \mathbf{b}.

  • A non-linear activation function transforms each element of the net activation.

Activation functions

Modern activation functions

  • Rectified linear function - ReLU (output is continuous and positive).

f(x) = \max(0, x) = \begin{cases} x \quad \text{if} \quad x \geq 0 \\ 0 \quad \text{otherwise.} \end{cases}

  • Parametric Rectifier Linear Unit - PReLU (output is continuous).

f(x) = \begin{cases} x \quad \text{if} \quad x \geq 0 \\ \alpha \, x \quad \text{otherwise.}\end{cases}

Softmax activation function

Source http://cs231n.github.io/linear-classify

  • For classification problems, the softmax activation function can be used in the output layer to make sure that the sum of the outputs \mathbf{y} = \{y_j\} over all output neurons is one.

y_j = P(\text{class = j}) = \frac{\exp(\text{net}_j)}{\sum_k \exp(\text{net}_k)}

  • The higher the net activation \text{net}_j, the higher the probability that the example belongs to class j.

  • Softmax is not per se a transfer function (not local to each neuron), but the idea is similar.

Why non-linear activation functions?

  • Why not use the linear function f(x) = x in the hidden layer?

\mathbf{h} = W^1 \, \mathbf{x} + \mathbf{b}^1 \mathbf{y} = W^2 \, \mathbf{h} + \mathbf{b}^2

  • The equivalent function would be linear…

\begin{align*} \mathbf{y} &= W^2 \, (W^1 \, \mathbf{x} + \mathbf{b}^1) + \mathbf{b}^2 \\ &= (W^2 \, W^1) \, \mathbf{x} + (W^2 \, \mathbf{b}^1 + \mathbf{b}^2) \\ &= W \, \mathbf{x} + \mathbf{b} \\ \end{align*}

  • Remember Cover’s theorem:

A complex pattern-classification problem, cast in a high dimensional space non-linearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated.

  • In practice it does not matter how non-linear the function is (e.g PReLU is almost linear), but there must be at least one non-linearity.

Training a MLP : loss functions

  • We have a training set composed of N input/output pairs (\mathbf{x}_i, \mathbf{t}_i)_{i=1..N}.

Optimization problem

What are the free parameters \theta (weights W^1, W^2 and biases \textbf{b}^1, \textbf{b}^2) making the prediction \mathbf{y} as close as possible from the desired output \mathbf{t}?

  • We define a loss function \mathcal{L}(\theta) of the free parameters which should be minimized:

    • For regression problems, we take the mean square error (mse):

    \mathcal{L}_\text{reg}(\theta) = \mathbb{E}_{\mathbf{x}, \mathbf{t} \in \mathcal{D}} [ ||\mathbf{t} - \mathbf{y}||^2 ]

    • For classification problems, we take the cross-entropy or negative log-likelihood on a softmax output layer:

    \mathcal{L}_\text{class}(\theta) = \mathbb{E}_{\mathbf{x}, \mathbf{t} \sim \mathcal{D}} [ - \langle \mathbf{t} \cdot \log \mathbf{y} \rangle]

Training a MLP : optimizer

  • To minimize the chosen loss function, we are going to use stochastic gradient descent iteratively until the network converges:

\begin{cases} \Delta W^1 = - \eta \, \nabla_{W^1} \, \mathcal{L}(\theta) \\ \\ \Delta \mathbf{b}^1 = - \eta \, \nabla_{\mathbf{b}^1} \, \mathcal{L}(\theta) \\ \\ \Delta W^2 = - \eta \, \nabla_{W^2} \, \mathcal{L}(\theta) \\ \\ \Delta \mathbf{b}^2 = - \eta \, \nabla_{\mathbf{b}^2} \, \mathcal{L}(\theta)\\ \end{cases}

  • We will see later that other optimizers than SGD can be used.

  • The question is now how to compute efficiently these gradients w.r.t all the weights and biases.

  • The algorithm to achieve this is called backpropagation, which is simply a smart implementation of the chain rule.

2 - Backpropagation

Backpropagation on a shallow network

\mathbf{h} = f(\textbf{net}_\mathbf{h}) = f(W^1 \, \mathbf{x} + \mathbf{b}^1) \mathbf{y} = f(\textbf{net}_\mathbf{y}) = f(W^2 \, \mathbf{h} + \mathbf{b}^2)

  • The chain rule gives us for the parameters of the output layer:

\frac{\partial \mathcal{L}(\theta)}{\partial W^2} = \frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{y}} \times \frac{\partial \mathbf{y}}{\partial \textbf{net}_\mathbf{y}} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial W^2}

\frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{b}^2} = \frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{y}} \times \frac{\partial \mathbf{y}}{\partial \textbf{net}_\mathbf{y}} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial \mathbf{b}^2}

  • and for the hidden layer:

\frac{\partial \mathcal{L}(\theta)}{\partial W^1} = \frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{y}} \times \frac{\partial \mathbf{y}}{\partial \textbf{net}_\mathbf{y}} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial \mathbf{h}} \times \frac{\partial \mathbf{h}}{\partial \textbf{net}_\mathbf{h}} \times \frac{\partial \textbf{net}_\mathbf{h}}{\partial W^1}

\frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{b}^1} = \frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{y}} \times \frac{\partial \mathbf{y}}{\partial \textbf{net}_\mathbf{y}} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial \mathbf{h}} \times \frac{\partial \mathbf{h}}{\partial \textbf{net}_\mathbf{h}} \times \frac{\partial \textbf{net}_\mathbf{h}}{\partial \mathbf{b}^1}

  • If we can compute all these partial derivatives / gradients individually, the problem is solved.

Gradient of the loss function

  • We have already seen for the linear algorithms that the derivative of the loss function w.r.t the net activation of the output \textbf{net}_\mathbf{y} is proportional to the prediction error \mathbf{t} - \mathbf{y}:

    • mse for regression:

    \mathbf{\delta_y} = - \frac{\partial \mathcal{l}_\text{reg}(\theta)}{\partial \textbf{net}_\mathbf{y}} = - \frac{\partial \mathcal{l}_\text{reg}(\theta)}{\partial \mathbf{y}} \times \frac{\partial \mathbf{y}}{\partial \textbf{net}_\mathbf{y}} = 2 \, (\mathbf{t} - \mathbf{y}) \, f'(\textbf{net}_\mathbf{y})

    • cross-entropy using a softmax output layer:

    \mathbf{\delta_y} = - \frac{\partial \mathcal{l}_\text{class}(\theta)}{\partial \textbf{net}_\mathbf{y}} = (\mathbf{t} - \mathbf{y})

  • \mathbf{\delta_y} = - \dfrac{\partial \mathcal{l}(\theta)}{\partial \textbf{net}_\mathbf{y}} is called the output error.

  • The output error is going to appear in all partial derivatives, i.e. in all learning rules.

  • The backpropagation algorithm is sometimes called backpropagation of the error.

Gradient in the output layer

  • We now have everything we need to train the output layer:

\frac{\partial \mathcal{l}(\theta)}{\partial W^2} = \frac{\partial \mathcal{l}(\theta)}{\partial \textbf{net}_\mathbf{y}} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial W^2} = - \mathbf{\delta_y} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial W^2}

\frac{\partial \mathcal{l}(\theta)}{\partial \mathbf{b}^2} = \frac{\partial \mathcal{l}(\theta)}{\partial \textbf{net}_\mathbf{y}} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial \mathbf{b}^2} = - \mathbf{\delta_y} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial \mathbf{b}^2}

  • As \textbf{net}_\mathbf{y} = W^2 \, \mathbf{h} + \mathbf{b}^2, we get for the cross-entropy loss:

\frac{\partial \mathcal{l}(\theta)}{\partial W^2} = - \mathbf{\delta_y} \times \mathbf{h}^T \;\; \text{and} \;\; \frac{\partial \mathcal{l}(\theta)}{\partial \mathbf{b}^2} = - \mathbf{\delta_y}

i.e. exactly the same delta learning rule as a softmax linear classifier or multiple linear regression using the vector \mathbf{h} as an input.

\begin{cases} \Delta W^2 = \eta \, \mathbf{\delta_y} \times \mathbf{h}^T = \eta \, (\mathbf{t} - \mathbf{y} ) \times \mathbf{h}^T \\ \\ \Delta \mathbf{b}^2 = \eta \, \mathbf{\delta_y} = \eta \, (\mathbf{t} - \mathbf{y} ) \\ \end{cases}

Gradient in the hidden layer

  • Let’s note \mathbf{\delta_h} the hidden error, i.e. minus the gradient of the loss function w.r.t the net activation of the hidden layer:

\mathbf{\delta_h} = - \frac{\partial \mathcal{l}(\theta)}{\partial \textbf{net}_\mathbf{h}} = - \frac{\partial \mathcal{l}(\theta)}{\partial \textbf{net}_\mathbf{y}} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial \mathbf{h}} \times \frac{\partial \mathbf{h}}{\partial \textbf{net}_\mathbf{h}} = \mathbf{\delta_y} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial \mathbf{h}} \times \frac{\partial \mathbf{h}}{\partial \textbf{net}_\mathbf{h}}

  • Using this hidden error, we can compute the gradients w.r.t W^1 and \mathbf{b}^1:

\frac{\partial \mathcal{l}(\theta)}{\partial W^1} = \frac{\partial \mathcal{l}(\theta)}{\partial \textbf{net}_\mathbf{h}} \times \frac{\partial \textbf{net}_\mathbf{h}}{\partial W^1} = - \mathbf{\delta_h} \times \frac{\partial \textbf{net}_\mathbf{h}}{\partial W^1}

\frac{\partial \mathcal{l}(\theta)}{\partial \mathbf{b}^1} = \frac{\partial \mathcal{l}(\theta)}{\partial \textbf{net}_\mathbf{h}} \times \frac{\partial \textbf{net}_\mathbf{h}}{\partial \mathbf{b}^1} = - \mathbf{\delta_h} \times \frac{\partial \textbf{net}_\mathbf{h}}{\partial \mathbf{b}^1}

  • As \textbf{net}_\mathbf{h} = W^1 \, \mathbf{x} + \mathbf{b}^1, we get:

\frac{\partial \mathcal{l}(\theta)}{\partial W^1} = - \mathbf{\delta_h} \times \mathbf{x}^T

\frac{\partial \mathcal{l}(\theta)}{\partial \mathbf{b}^1} = - \mathbf{\delta_h}

Gradient in the hidden layer

\frac{\partial \mathcal{l}(\theta)}{\partial W^1} = - \mathbf{\delta_h} \times \mathbf{x}^T

\frac{\partial \mathcal{l}(\theta)}{\partial \mathbf{b}^1} = - \mathbf{\delta_h}

  • If we know the hidden error \mathbf{\delta_h}, the update rules for the input weights W^1 and \mathbf{b}^1 also take the form of the delta learning rule:

\begin{cases} \Delta W^1 = \eta \, \mathbf{\delta_h} \times \mathbf{x}^T \\ \\ \Delta \mathbf{b}^1 = \eta \, \mathbf{\delta_h} \\ \end{cases}

  • This is the classical form eta * error * input.

  • All we need to know is the backpropagated error \mathbf{\delta_h} and we can apply the delta learning rule!

Backpropagated error

  • The backpropagated error \mathbf{\delta_h} is a vector assigning an error to each of the hidden neurons:

\mathbf{\delta_h} = - \frac{\partial \mathcal{l}(\theta)}{\partial \textbf{net}_\mathbf{h}} = \mathbf{\delta_y} \times \frac{\partial \textbf{net}_\mathbf{y}}{\partial \mathbf{h}} \times \frac{\partial \mathbf{h}}{\partial \textbf{net}_\mathbf{h}}

  • As :

\textbf{net}_\mathbf{y} = W^2 \, \mathbf{h} + \mathbf{b}^2

\mathbf{h} = f(\textbf{net}_\mathbf{h})

we obtain:

\mathbf{\delta_h} = f'(\textbf{net}_\mathbf{h}) \, (W^2)^T \times \mathbf{\delta_y}

  • If \mathbf{h} and \mathbf{\delta_h} have K elements and \mathbf{y} and \mathbf{\delta_y} have C elements, the matrix W^2 is C \times K as W^2 \times \mathbf{h} must be a vector with C elements.

  • (W^2)^T \times \mathbf{\delta_y} is therefore a vector with K elements, which is then multiplied element-wise with the derivative of the transfer function to obtain \mathbf{\delta_h}.

Backpropagation for a shallow MLP

  • For a shallow MLP with one hidden layer:

\mathbf{h} = f(\textbf{net}_\mathbf{h}) = f(W^1 \, \mathbf{x} + \mathbf{b}^1) \mathbf{y} = f(\textbf{net}_\mathbf{y}) = f(W^2 \, \mathbf{h} + \mathbf{b}^2)

the output error:

\mathbf{\delta_y} = - \frac{\partial \mathcal{l}(\theta)}{\partial \textbf{net}_\mathbf{y}} = (\mathbf{t} - \mathbf{y})

is backpropagated to the hidden layer:

\mathbf{\delta_h} = f'(\textbf{net}_\mathbf{h}) \, (W^2)^T \times \mathbf{\delta_y}

what allows to apply the delta learning rule to all parameters:

\begin{cases} \Delta W^2 = \eta \, \mathbf{\delta_y} \times \mathbf{h}^T \\ \Delta \mathbf{b}^2 = \eta \, \mathbf{\delta_y} \\ \Delta W^1 = \eta \, \mathbf{\delta_h} \times \mathbf{x}^T \\ \Delta \mathbf{b}^1 = \eta \, \mathbf{\delta_h} \\ \end{cases}

Derivative of the activation functions

  • Threshold and sign functions are not differentiable, we simply consider the derivative is 1.

f(x) = \begin{cases} 1 \quad \text{if} \quad x \geq 0 \\ 0 \text{ or } 1 \quad \text{otherwise.} \end{cases} \qquad \rightarrow \qquad f'(x) = 1

  • The logistic or sigmoid function has the nice property that its derivative can be expressed as a function of itself:

f(x) = \frac{1}{1+\exp(-x)} \qquad \rightarrow \qquad f'(x) = f(x) \, (1 - f(x))

  • The hyperbolic tangent function too:

f(x) = \tanh(x) \qquad \rightarrow \qquad f'(x) = 1 - f(x)^2

  • ReLU is even simpler:

f(x) = \max(0, x) = \begin{cases} x \quad \text{if} \quad x \geq 0 \\ 0 \quad \text{otherwise.} \end{cases} \qquad \rightarrow \qquad f'(x) = \begin{cases} 1 \quad \text{if} \quad x \geq 0 \\ 0 \quad \text{otherwise.}\end{cases}

What is backpropagated?

  • Let’s have a closer look at what is backpropagated using single neurons and weights.

  • The output neuron y_k computes:

y_k = f(\sum_{j=1}^K W^2_{jk} \, h_j + b^2_k)

  • All output weights W^2_{jk} are updated proportionally to the output error of the neuron y_k:

\Delta W^2_{jk} = \eta \, \delta_{{y}_k} \, h_j = \eta \, (t_k - y_k) \, h_j

  • This is possible because we know the output error directly from the data t_k.

What is backpropagated?

  • The hidden neuron h_j computes:

h_j = f(\sum_{i=1}^d W^1_{ij} \, x_i + b^1_j)

  • We want to learn the hidden weights W^1_{ij} using the delta learning rule:

\Delta W^1_{ij} = \eta \, \delta_{{h}_j} \, x_i

but we do not know the ground truth of the hidden neuron in the data:

\delta_{{h}_j} = (? - h_j)

  • We need to estimate the backpropagated error using the output error.

What is backpropagated?

\mathbf{\delta_h} = f'(\textbf{net}_\mathbf{h}) \, (W^2)^T \times \mathbf{\delta_y}

  • If we omit the derivative of the transfer function, the backpropagated error for the hidden neuron h_j is:

\delta_{{h}_j} = - \sum_{k=1}^C W^2_{jk} \, \delta_{{y}_k}

  • The backpropagated error is an average of the output errors \delta_{{y}_k}, weighted by the output weights between the hidden neuron h_j and the output neurons y_k.
  • The backpropagated error is the contribution of each hidden neuron h_j to the output error:

    • If there is no output error, there is no hidden error.

    • If a hidden neuron sends strong weights |W^2_{jk}| to an output neuron y_k with a strong prediction error \delta_{{y}_k}, this means that it participates strongly to the output error and should learn from it.

    • If the weight |W^2_{jk}| is small, it means that the hidden neuron does not take part in the output error.

MLP: the universal approximation theorem

Universal approximation theorem

Cybenko, 1989

Let \varphi() be a nonconstant, bounded, and monotonically-increasing continuous function. Let I_{m_0} denote the m_0-dimensional unit hypercube [0,1]^{m_0}. The space of continuous functions on I_{m_0} is denoted by C(I_{m_0}). Then, given any function f \in C(I_{m_0}) and \epsilon > 0, there exists an integer m_1 and sets of real constants \alpha_i, b_i and w_{ij} \in \Re, where i = 1, ..., m_1 and j = 1, ..., m_0 such that we may define:

F(\mathbf{x}) = \sum_{i=1}^{m_1} \alpha_i \cdot \varphi \left( \sum_{j=1}^{m_0} w_{ij} \cdot x_j + b_i \right)

as an approximate realization of the function f; that is,

| F(\mathbf{x}) - f(\mathbf{x})| < \epsilon

for all x \in I_m.

  • This theorem shows that for any input/output mapping function f in supervised learning, there exists a MLP with m_1 neurons in the hidden layer which is able to approximate it with a desired precision!

Properties of MLP

  • The universal approximation theorem only proves the existence of a shallow MLP with m_1 neurons in the hidden layer that can approximate any function, but it does not tell how to find this number.

  • A rule of thumb to find this number is that the generalization error is empirically close to:

\epsilon = \frac{\text{VC}_{\text{dim}}(\text{MLP})}{N}

where \text{VC}_{\text{dim}}(\text{MLP}) is the total number of weights and biases in the model, and N the number of training samples.

  • The more neurons in the hidden layer, the better the training error, but the worse the generalization error (overfitting).

  • The optimal number should be found with cross-validation methods.

  • For most functions, the optimal number m_1 is high and becomes quickly computationally untractable. We need to go deep!

3 - Deep neural networks

Deep Neural Network

  • A MLP with more than one hidden layer is a deep neural network.

Backpropagation for deep neural networks

  • Backpropagation still works if we have many hidden layers \mathbf{h}_1, \ldots, \mathbf{h}_n:

  • If each layer is differentiable, i.e. one can compute its gradient \frac{\partial \mathbf{h}_{k}}{\partial \mathbf{h}_{k-1}}, we can chain backwards each partial derivatives to know how to update each layer:

  • Backpropagation is simply an efficient implementation of the chain rule: the partial derivatives are iteratively reused in the backwards phase.

Gradient of a fully connected layer

  • A fully connected layer transforms an input vector \mathbf{h}_{k-1} into an output vector \mathbf{h}_{k} using a weight matrix W^k, a bias vector \mathbf{b}^k and a non-linear activation function f:

\mathbf{h}_{k} = f(\textbf{net}_{\mathbf{h}^k}) = f(W^k \, \mathbf{h}_{k-1} + \mathbf{b}^k)

  • The gradient of its output w.r.t the input \mathbf{h}_{k-1} is (using the chain rule):

\frac{\partial \mathbf{h}_{k}}{\partial \mathbf{h}_{k-1}} = f'(\textbf{net}_{\mathbf{h}^k}) \, W^k

  • The gradients of its output w.r.t the free parameters W^k and \mathbf{b}_{k} are:

\frac{\partial \mathbf{h}_{k}}{\partial W^{k}} = f'(\textbf{net}_{\mathbf{h}^k}) \, \mathbf{h}_{k-1}

\frac{\partial \mathbf{h}_{k}}{\partial \mathbf{b}_{k}} = f'(\textbf{net}_{\mathbf{h}^k})

Gradient of a fully connected layer

  • A fully connected layer \mathbf{h}_{k} = f(W^k \, \mathbf{h}_{k-1} + \mathbf{b}^k) receives the gradient of the loss function w.r.t. its output \mathbf{h}_{k} from the layer above:

\frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k}}

  • It adds to this gradient its own contribution and transmits it to the previous layer:

\frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k-1}} = \frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k}} \times \frac{\partial \mathbf{h}_{k}}{\partial \mathbf{h}_{k-1}} = f'(\textbf{net}_{\mathbf{h}^k}) \, (W^k)^T \times \frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k}}

  • It then updates its parameters W^k and \mathbf{b}_{k} with:

\begin{cases} \dfrac{\partial \mathcal{L}(\theta)}{\partial W^{k}} = \dfrac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k}} \times \dfrac{\partial \mathbf{h}_{k}}{\partial W^{k}} = f'(\textbf{net}_{\mathbf{h}^k}) \, \dfrac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k}} \times \mathbf{h}_{k-1}^T \\ \\ \dfrac{\partial \mathcal{L}(\theta)}{\partial \mathbf{b}_{k}} = \dfrac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k}} \times \dfrac{\partial \mathbf{h}_{k}}{\partial \mathbf{b}_{k}} = f'(\textbf{net}_{\mathbf{h}^k}) \, \dfrac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k}} \\ \end{cases}

Training a deep neural network with backpropagation

  • A feedforward neural network is an acyclic graph of differentiable and parameterized layers.

\mathbf{x} \rightarrow \mathbf{h}_1 \rightarrow \mathbf{h}_2 \rightarrow \ldots \rightarrow \mathbf{h}_n \rightarrow \mathbf{y}

  • The backpropagation algorithm is used to assign the gradient of the loss function \mathcal{L}(\theta) to each layer using backward chaining:

\frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k-1}} = \frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k}} \times \frac{\partial \mathbf{h}_{k}}{\partial \mathbf{h}_{k-1}}

  • Stochastic gradient descent is then used to update the parameters of each layer:

\Delta W^k = - \eta \, \frac{\partial \mathcal{L}(\theta)}{\partial W^{k}} = - \eta \, \frac{\partial \mathcal{L}(\theta)}{\partial \mathbf{h}_{k}} \times \frac{\partial \mathbf{h}_{k}}{\partial W^{k}}

4 - Example

MLP example

  • Let’s try to solve this non-linear binary classification problem:

MLP example

  • We can create a shallow MLP with:

    • Two input neurons x_1, x_2 for the two input variables.

    • Enough hidden neurons (e.g. 20), with a sigmoid or ReLU activation function.

    • One output neuron with the logistic activation function.

    • The cross-entropy (negative log-likelihood) loss function.

  • We train it on the input data using the backpropagation algorithm and the SGD optimizer.

MLP example

Automatic differentiation Deep Learning frameworks

Current:

  • Tensorflow https://www.tensorflow.org/ released by Google in 2015 is one of the two standard DL frameworks.

  • Keras https://keras.io/ is a high-level Python API over tensorflow (but also theano, CNTK and MxNet) written by Francois Chollet.

  • PyTorch http://pytorch.org by Facebook is the other standard framework.

Historical:

Example of a shallow neural network with keras

  • Let’s implement the previous MLP using keras.

  • We first need to generate the data using scikit-learn:

import sklearn.datasets
X, t = sklearn.datasets.make_circles(n_samples=100, shuffle=True, noise=0.15, factor=0.3)
  • We then import tensorflow:
import tensorflow as tf
  • The neural network is called a Sequential model in keras:
model = tf.keras.Sequential()

Example of a shallow neural network with keras

  • Creating a NN is simply stacking layers in the model. The input layer is just a placeholder for the data:
model.add( tf.keras.layers.Input(shape=(2, )) )
  • The hidden layer has 20 neurons, the ReLU activation and takes input from the previous layer:
model.add(
    tf.keras.layers.Dense(
        20, # Number of hidden neurons
        activation='relu' # Activation function
    )
)
  • The output layer has 1 neuron with the logistic/sigmoid activation function:
model.add(
    tf.keras.layers.Dense(
        1, # Number of output neurons
        activation='sigmoid' # Soft classification
    )
)

Example of a shallow neural network with keras

  • We now choose an optimizer (SGD) with a learning rate \eta = 0.001:
optimizer = tf.keras.optimizers.SGD(lr=0.001)
  • We choose a loss function (binary cross-entropy, aka negative log-likelihood):
loss = tf.keras.losses.binary_crossentropy
  • We compile the model (important!) and tell it to track the accuracy of the model:
model.compile(
    loss=loss,
    optimizer=optimizer, 
    metrics=tf.keras.metrics.categorical_accuracy
)

Example of a shallow neural network with keras

  • Et voilà! The network has been created.
print(model.summary())
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 20)                60        
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 21        
=================================================================
Total params: 81
Trainable params: 81
Non-trainable params: 0
_________________________________________________________________
None

Example of a shallow neural network with keras

  • We now train the model on the data for 100 epochs using a batch size of 10 and wait for it to finish:
model.fit(X, t, batch_size=10, nb_epoch=100)
  • With keras (and the other automatic differentiation frameworks), you only need to define the structure of the network.

  • The rest (backpropagation, SGD) is done automatically.

  • To make predictions on new data, just do:

model.predict(X_test)