Recurrent neural networks
Professur für Künstliche Intelligenz - Fakultät für Informatik
\mathbf{y} = F_\theta(\mathbf{x})
\mathbf{y}_0 = F_\theta(\mathbf{x}_0) \quad \mathbf{y}_1 = F_\theta(\mathbf{x}_1) \quad \dots \quad \mathbf{y}_t = F_\theta(\mathbf{x}_t)
x_{t+1} = F_\theta(x_0, x_1, \ldots, x_t)
\mathbf{X} = \begin{bmatrix}\mathbf{x}_{t-T} & \mathbf{x}_{t-T+1} & \ldots & \mathbf{x}_t \\ \end{bmatrix}
\mathbf{y}_t = F_\theta(\mathbf{X})
Problem 1: How long should the window be?
Problem 2: Having more input dimensions dramatically increases the complexity of the classifier (VC dimension), and hence the number of training examples required to avoid overfitting.
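The windowing approach above can be sketched as follows. This is only an illustration: the helper name `make_windows` and the toy series are assumptions, not part of the lecture. The window length `T` follows the notation above, so each window stacks the T+1 vectors \mathbf{x}_{t-T}, \ldots, \mathbf{x}_t.

```python
import numpy as np

def make_windows(series, T):
    """Stack the inputs x_{t-T}, ..., x_t into one window per valid time step t.

    series: array of shape (n_steps, n_features)
    returns: array of shape (n_steps - T, T + 1, n_features)
    """
    series = np.asarray(series)
    n_steps = series.shape[0]
    if T >= n_steps:
        raise ValueError("window is longer than the series")
    # One window per t = T, ..., n_steps - 1
    return np.stack([series[t - T : t + 1] for t in range(T, n_steps)])

# Toy series: 6 time steps of 2-dimensional inputs (hypothetical data)
series = np.arange(12, dtype=float).reshape(6, 2)
windows = make_windows(series, T=2)

# A feedforward model sees each window flattened into one long vector,
# so its input dimension grows linearly with the window length.
flat = windows.reshape(windows.shape[0], -1)
print(windows.shape, flat.shape)
```

Doubling `T` roughly doubles the input dimension of the model, which is the cost described in Problem 2.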
A recurrent neural network (RNN) uses its previous output as an additional input (context).
All vectors have a time index t denoting the time at which this vector was computed.
The input vector at time t is \mathbf{x}_t, the output vector is \mathbf{h}_t:
\mathbf{h}_t = \sigma(W_x \times \mathbf{x}_t + W_h \times \mathbf{h}_{t-1} + \mathbf{b})
\sigma is a transfer function, usually logistic or tanh.
The input \mathbf{x}_t and previous output \mathbf{h}_{t-1} are multiplied by learnable weights:
W_x is the input weight matrix.
W_h is the recurrent weight matrix.
\begin{aligned} \mathbf{h}_t & = \sigma(W_x \times \mathbf{x}_t + W_h \times \mathbf{h}_{t-1} + \mathbf{b}) \\ & = \sigma(W_x \times \mathbf{x}_t + W_h \times \sigma(W_x \times \mathbf{x}_{t-1} + W_h \times \mathbf{h}_{t-2} + \mathbf{b}) + \mathbf{b}) \\ & = f_{W_x, W_h, \mathbf{b}} (\mathbf{x}_0, \mathbf{x}_1, \dots,\mathbf{x}_t) \\ \end{aligned}
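The recurrence above can be sketched in NumPy. This is a minimal sketch under stated assumptions: tanh is chosen as the transfer function, the initial state before the first input is taken to be zero (the slides do not specify it), and the function name `rnn_forward`, the sizes and the random weights are purely illustrative.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, b, h0=None):
    """Apply h_t = tanh(W_x x_t + W_h h_{t-1} + b) over a whole sequence."""
    n_hidden = W_h.shape[0]
    # Assumption: the state before the first input is zero unless h0 is given.
    h = np.zeros(n_hidden) if h0 is None else h0
    hs = []
    for x in xs:  # the same weights are reused at every time step
        h = np.tanh(W_x @ x + W_h @ h + b)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(0)
n_in, n_hidden, n_steps = 3, 4, 5
W_x = rng.normal(scale=0.5, size=(n_hidden, n_in))      # input weights
W_h = rng.normal(scale=0.5, size=(n_hidden, n_hidden))  # recurrent weights
b = np.zeros(n_hidden)
xs = rng.normal(size=(n_steps, n_in))

hs = rnn_forward(xs, W_x, W_h, b)
print(hs.shape)  # one output vector per time step
```

Because each output is fed back into the next step, the last row of `hs` depends on every input of the sequence, including the first one.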
An RNN is considered part of deep learning, as there are many layers of weights between the first input \mathbf{x}_0 and the output \mathbf{h}_t.
The only difference from a feedforward deep network (DNN) is that the weights W_x and W_h are reused at each time step.
\mathbf{h}_t = f_{W_x, W_h, \mathbf{b}} (\mathbf{x}_0, \mathbf{x}_1, \dots,\mathbf{x}_t)
The function mapping the history of inputs to the output at time t is differentiable: we can simply apply gradient descent to find the weights!
This variant of backpropagation is called Backpropagation Through Time (BPTT).
Once the loss between \mathbf{h}_t and its desired value is computed, one applies the chain rule to find out how to modify the weights W_x and W_h using the history (\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_t).
\begin{aligned} \mathbf{h}_{t} & = \sigma(W_x \times \mathbf{x}_{t} + W_h \times \mathbf{h}_{t-1} + \mathbf{b}) \\ \end{aligned}
\frac{\partial \mathcal{L}(W_x, W_h)}{\partial W_x} = \frac{\partial \mathcal{L}(W_x, W_h)}{\partial \mathbf{h}_t} \times \frac{\partial \mathbf{h}_t}{\partial W_x}
\frac{\partial \mathcal{L}(W_x, W_h)}{\partial W_h} = \frac{\partial \mathcal{L}(W_x, W_h)}{\partial \mathbf{h}_t} \times \frac{\partial \mathbf{h}_t}{\partial W_h}
\frac{\partial \mathcal{L}(W_x, W_h)}{\partial \mathbf{h}_t} = - (\mathbf{t}_{t}- \mathbf{h}_{t})
\begin{aligned} \frac{\partial \mathbf{h}_t}{\partial W_x} & = \mathbf{h'}_{t} \times (\mathbf{x}_t + W_h \times \frac{\partial \mathbf{h}_{t-1}}{\partial W_x})\\ & \\ \frac{\partial \mathbf{h}_t}{\partial W_h} & = \mathbf{h'}_{t} \times (\mathbf{h}_{t-1} + W_h \times \frac{\partial \mathbf{h}_{t-1}}{\partial W_h})\\ \end{aligned}
\mathbf{h'}_{t} = \begin{cases} \mathbf{h}_{t} \, (1 - \mathbf{h}_{t}) \quad \text{ for logistic}\\ (1 - \mathbf{h}_{t}^2) \quad \text{ for tanh.}\\ \end{cases}
\begin{aligned} \frac{\partial \mathbf{h}_t}{\partial W_x} & = \mathbf{h'}_{t} \, (\mathbf{x}_t + W_h \times \mathbf{h'}_{t-1} \, (\mathbf{x}_{t-1} + W_h \times \mathbf{h'}_{t-2} \, (\mathbf{x}_{t-2} + W_h \times \ldots (\mathbf{x}_0))))\\ & \\ \frac{\partial \mathbf{h}_t}{\partial W_h} & = \mathbf{h'}_{t} \, (\mathbf{h}_{t-1} + W_h \times \mathbf{h'}_{t-1} \, (\mathbf{h}_{t-2} + W_h \times \mathbf{h'}_{t-2} \, \ldots (\mathbf{h}_{0})))\\ \end{aligned}
When updating the weights at time t, we need to store in memory:
the complete history of inputs \mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_t.
the complete history of outputs \mathbf{h}_0, \mathbf{h}_1, \ldots, \mathbf{h}_t.
the complete history of derivatives \mathbf{h'}_0, \mathbf{h'}_1, \ldots, \mathbf{h'}_t.
before computing the gradients iteratively, starting from time t and accumulating gradients backwards in time until t=0.
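The backward accumulation described above can be sketched for a scalar RNN, where all products are ordinary multiplications. This is an illustrative sketch, not the lecture's implementation: tanh is assumed as the transfer function, the state before the first input is assumed to be 0, and the names `forward` and `bptt` are hypothetical. The result is compared against a finite-difference estimate as a sanity check.

```python
import math

def forward(xs, w_x, w_h, b):
    """Scalar RNN: returns the outputs h_0..h_t (h_{-1} is assumed to be 0)."""
    hs, h = [], 0.0
    for x in xs:
        h = math.tanh(w_x * x + w_h * h + b)
        hs.append(h)
    return hs

def bptt(xs, w_x, w_h, b):
    """Return (dh_t/dw_x, dh_t/dw_h) for the last step, going backwards in time."""
    hs = forward(xs, w_x, w_h, b)     # stored history of outputs
    dhs = [1.0 - h * h for h in hs]   # stored derivatives h'_k (tanh)
    grad_wx = grad_wh = 0.0
    factor = 1.0                      # running product of the h'_j and w_h terms
    for k in range(len(xs) - 1, -1, -1):  # k = t, t-1, ..., 0
        factor *= dhs[k]
        h_prev = hs[k - 1] if k > 0 else 0.0
        grad_wx += factor * xs[k]
        grad_wh += factor * h_prev
        factor *= w_h                 # one more multiplication by w_h per step back
    return grad_wx, grad_wh

xs = [0.5, -1.0, 0.8, 0.3]
w_x, w_h, b = 0.7, 0.9, 0.1
g_wx, g_wh = bptt(xs, w_x, w_h, b)

# Finite-difference check of both derivatives
eps = 1e-6
num_wx = (forward(xs, w_x + eps, w_h, b)[-1] - forward(xs, w_x - eps, w_h, b)[-1]) / (2 * eps)
num_wh = (forward(xs, w_x, w_h + eps, b)[-1] - forward(xs, w_x, w_h - eps, b)[-1]) / (2 * eps)
print(g_wx, num_wx)
print(g_wh, num_wh)
```

The lists `hs` and `dhs` are exactly the histories that have to be kept in memory before the backward loop can start.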
In practice, going back to t=0 at each time step requires too many computations, which may not be needed.
Truncated BPTT only propagates the gradients back over the last T steps: the gradients are computed backwards from t to t-T. The partial derivative at t-T-1 is taken to be 0.
This limits the horizon of BPTT: dependencies longer than T will not be learned, so T has to be chosen carefully for the task.
T becomes yet another hyperparameter of your algorithm…
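The effect of the horizon can be illustrated on a scalar RNN. This is a sketch under assumptions: tanh transfer function, zero state before the first input, and a hypothetical helper name `truncated_bptt_wx`. It only computes the derivative of the last output with respect to the input weight.

```python
import math

def truncated_bptt_wx(xs, w_x, w_h, b, T):
    """dh_t/dw_x for a scalar tanh RNN, going back at most T steps from the last input."""
    # Forward pass, storing the outputs (h_{-1} is assumed to be 0).
    hs, h = [], 0.0
    for x in xs:
        h = math.tanh(w_x * x + w_h * h + b)
        hs.append(h)
    t = len(xs) - 1
    grad, factor = 0.0, 1.0
    # Backward pass over k = t, t-1, ..., t-T (or down to 0 if the sequence is shorter).
    for k in range(t, max(t - T, 0) - 1, -1):
        factor *= 1.0 - hs[k] ** 2  # h'_k for tanh
        grad += factor * xs[k]
        factor *= w_h
    return grad  # everything older than t-T is treated as contributing 0

xs = [0.5, -1.0, 0.8, 0.3, -0.2, 0.9]
w_x, w_h, b = 0.7, 0.9, 0.1
full = truncated_bptt_wx(xs, w_x, w_h, b, T=len(xs))
for T in (0, 1, 2, 5):
    print(T, truncated_bptt_wx(xs, w_x, w_h, b, T), full)
```

With `T=0` only the current input contributes; as `T` grows, the truncated value approaches the full gradient, at the price of a longer backward loop.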
In practice, an RNN trained with BPTT fails to detect long-term dependencies because of:
the truncated horizon T (for computational reasons).
the vanishing gradient problem.
\begin{aligned} \frac{\partial \mathbf{h}_t}{\partial W_x} & = \mathbf{h'}_{t} \, (\mathbf{x}_t + W_h \times \frac{\partial \mathbf{h}_{t-1}}{\partial W_x}) \end{aligned}
At each iteration backwards in time, the gradients are multiplied by W_h.
If you look at how \frac{\partial \mathbf{h}_t}{\partial W_x} depends on \mathbf{x}_0, you obtain something like:
\begin{aligned} \frac{\partial \mathbf{h}_t}{\partial W_x} & \approx \prod_{k=0}^t \mathbf{h'}_{k} \, ((W_h)^t \, \mathbf{x}_0 + \dots) \\ \end{aligned}
If |W_h| > 1, |(W_h)^t| increases exponentially with t: the gradient explodes.
If |W_h| < 1, |(W_h)^t| decreases exponentially with t: the gradient vanishes.
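The exponential behaviour of (W_h)^t can be checked numerically. The sketch below makes simplifying assumptions: W_h is a scaled random orthogonal matrix, so that its spectral norm is exactly the scale factor, and the derivative terms \mathbf{h'}_k are ignored (they are at most 1 for logistic and tanh, so they can only shrink the product further). The helper name `power_norms` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix

def power_norms(W_h, n_steps):
    """Spectral norm of (W_h)^t for t = 1..n_steps."""
    norms, P = [], np.eye(W_h.shape[0])
    for _ in range(n_steps):
        P = P @ W_h
        norms.append(np.linalg.norm(P, 2))  # largest singular value
    return norms

small = power_norms(0.9 * Q, 50)  # |W_h| < 1: the norm shrinks like 0.9**t
large = power_norms(1.1 * Q, 50)  # |W_h| > 1: the norm grows like 1.1**t
print(small[-1], large[-1])
```

After only 50 steps the two norms differ by more than four orders of magnitude, although the two matrices differ only by a factor of about 1.2.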
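Exploding gradients can be handled by clipping the norm of the gradient to a maximal value T (a threshold here, not the truncation horizon):
|| \frac{\partial \mathcal{L}(W_x, W_h)}{\partial W_x}|| \gets \min(||\frac{\partial \mathcal{L}(W_x, W_h)}{\partial W_x}||, T)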
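The clipping rule above can be sketched as follows. This is a minimal sketch, assuming the Frobenius norm for a gradient matrix; the function name `clip_by_norm` is hypothetical, and the threshold written T in the rule is called `max_norm` here to avoid confusion with the truncation horizon.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its norm is at most max_norm; the direction is unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([[3.0, 4.0], [0.0, 12.0]])  # Frobenius norm is 13
clipped = clip_by_norm(g, max_norm=1.0)
print(np.linalg.norm(clipped))
```

Only the length of the gradient is changed, so the weight update still points in the same direction but cannot become arbitrarily large.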
But there is no solution to the vanishing gradient problem for regular RNNs: the gradient fades over time (backwards) and no long-term dependency can be learned.
This is the same problem as for feedforward deep networks: an RNN is just a deep network unrolled over time.
Its depth (number of layers) corresponds to the maximal number of steps back in time.
In order to limit vanishing gradients and learn long-term dependencies, one has to use a more complex structure for the layer.
This is the idea behind long short-term memory (LSTM) networks.
S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut f. Informatik, Technische Univ. Munich, 1991.
An LSTM layer is an RNN layer with the ability to control what it memorizes.
In addition to the input \mathbf{x}_t and output \mathbf{h}_t, it also has a state \mathbf{C}_t which is maintained over time.
The state is the memory of the layer (sometimes called context).
It also contains three multiplicative gates:
The input gate controls which inputs should enter the memory.
The forget gate controls which memory should be forgotten.
The output gate controls which part of the memory should be used to produce the output.
The state \mathbf{C}_t can be seen as an accumulator integrating inputs (and previous outputs) over time.
The input gate allows inputs to be stored.
The forget gate “empties” the accumulator.
The output gate allows the accumulator to be used for the output.
The gates learn to open and close through learnable weights.
By default, the cell state \mathbf{C}_t stays the same over time (conveyor belt).
In the formulation used here, it has the same number of dimensions as the output \mathbf{h}_t, since the output is computed element-wise from it.
Its content can be erased by multiplying it with a vector of 0s, or preserved by multiplying it by a vector of 1s.
We can use a sigmoid to achieve this:
\mathbf{f}_t = \sigma(W_f \times [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_f)
[\mathbf{h}_{t-1}; \mathbf{x}_t] is simply the concatenation of the two vectors \mathbf{h}_{t-1} and \mathbf{x}_t.
\mathbf{f}_t is a vector of values between 0 and 1, one per dimension of the cell state \mathbf{C}_t.
\mathbf{i}_t = \sigma(W_i \times [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_i)
\tilde{\mathbf{C}}_t = \text{tanh}(W_C \times [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_c)
\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{C}}_t
\mathbf{o}_t = \sigma(W_o \times [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_o)
\mathbf{h}_t = \mathbf{o}_t \odot \text{tanh} (\mathbf{C}_t)
Forget gate
\mathbf{f}_t = \sigma(W_f \times [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_f)
Input gate
\mathbf{i}_t = \sigma(W_i \times [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_i)
Output gate
\mathbf{o}_t = \sigma(W_o \times [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_o)
Candidate state
\tilde{\mathbf{C}}_t = \text{tanh}(W_C \times [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_c)
New state
\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{C}}_t
Output
\mathbf{h}_t = \mathbf{o}_t \odot \text{tanh} (\mathbf{C}_t)
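The six equations above translate almost line by line into NumPy. This is a sketch under assumptions, not a reference implementation: the function name `lstm_step`, the parameter dictionary `p`, the sizes and the random initialization are all illustrative, and the initial output and state are taken to be zero.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, C_prev, p):
    """One LSTM step; p holds the weights acting on the concatenation [h_{t-1}; x_t]."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(p["W_f"] @ z + p["b_f"])        # forget gate
    i = sigmoid(p["W_i"] @ z + p["b_i"])        # input gate
    o = sigmoid(p["W_o"] @ z + p["b_o"])        # output gate
    C_tilde = np.tanh(p["W_C"] @ z + p["b_c"])  # candidate state
    C = f * C_prev + i * C_tilde                # new state
    h = o * np.tanh(C)                          # output
    return h, C

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
p = {}
for name in ("f", "i", "o", "C"):
    p["W_" + name] = rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in))
for name in ("f", "i", "o", "c"):
    p["b_" + name] = np.zeros(n_hidden)

# Run over a short random sequence, starting from zero output and zero state.
h, C = np.zeros(n_hidden), np.zeros(n_hidden)
for x in rng.normal(size=(6, n_in)):
    h, C = lstm_step(x, h, C, p)
print(h.shape, C.shape)
```

Setting the forget-gate bias to a large positive value and the input-gate bias to a large negative value makes `C` pass through a step essentially unchanged, which is the "conveyor belt" behaviour of the cell state.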
Not all inputs are remembered by the LSTM: the input gate controls what comes in.
If only \mathbf{x}_0 and \mathbf{x}_1 are needed to produce \mathbf{h}_{t+1}, they will be the only ones stored in the state; the other inputs are ignored.
\mathbf{C}_t = \mathbf{C}_{t-1} \rightarrow \frac{\partial \mathbf{C}_t}{\partial \mathbf{C}_{t-1}} = 1
LSTMs are particularly good at learning long-term dependencies, because the gates protect the cell from vanishing gradients.
The difficulty is to find out which inputs (e.g. \mathbf{x}_0 and \mathbf{x}_1) should enter or leave the state memory.
Truncated BPTT is used to train all weights: the weights for the candidate state (as for an RNN), and the weights of the three gates.
LSTMs are also subject to overfitting. Regularization (including dropout) can be used.
The weights (also for the gates) can be convolutional.
The gates also have a bias, which can be fixed instead of learned (but a good value is hard to find).
LSTM layers can be stacked to detect dependencies at different scales (deep LSTM network).
Hochreiter and Schmidhuber (1997). Long short-term memory. Neural computation, 9(8).
\mathbf{f}_t = \sigma(W_f \times [\mathbf{C}_{t-1}; \mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_f)
\mathbf{i}_t = \sigma(W_i \times [\mathbf{C}_{t-1}; \mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_i)
\mathbf{o}_t = \sigma(W_o \times [\mathbf{C}_{t}; \mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_o)
Gers and Schmidhuber (2000). Recurrent nets that time and count. IJCNN.
\mathbf{z}_t = \sigma(W_z \times [\mathbf{h}_{t-1}; \mathbf{x}_t])
\mathbf{r}_t = \sigma(W_r \times [\mathbf{h}_{t-1}; \mathbf{x}_t])
\tilde{\mathbf{h}}_t = \text{tanh} (W_h \times [\mathbf{r}_t \odot \mathbf{h}_{t-1}; \mathbf{x}_t])
\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t
The GRU (gated recurrent unit) does not even need biases (mostly useless in LSTMs anyway).
It is much simpler to train than the LSTM, and almost as powerful.
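The four GRU equations above can be sketched in NumPy as follows. This is an illustration under assumptions: the function name `gru_step`, the sizes and the random weights are hypothetical, the initial output is taken to be zero, and no biases are used, as in the equations.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, W_z, W_r, W_h):
    """One GRU step without biases; each weight matrix acts on a concatenated vector."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ hx)  # update gate
    r = sigmoid(W_r @ hx)  # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]))  # candidate output
    return (1.0 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
shape = (n_hidden, n_hidden + n_in)
W_z, W_r, W_h = (rng.normal(scale=0.1, size=shape) for _ in range(3))

h = np.zeros(n_hidden)  # assumed initial output
for x in rng.normal(size=(6, n_in)):
    h = gru_step(x, h, W_z, W_r, W_h)
print(h.shape)
```

The GRU needs three weight matrices and no separate cell state, against four matrices plus the state \mathbf{C}_t for the LSTM.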
Chung, Gulcehre, Cho, Bengio (2014). “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling”. arXiv:1412.3555
A bidirectional LSTM learns to predict the output in two directions:
The forward line learns using the past context (classical LSTM).
The backward line learns using the future context (inputs are reversed).
The two state vectors are then concatenated at each time step to produce the output.
Only possible offline, as the future inputs must be known.
Works better than a unidirectional LSTM on many problems, but is slower.
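The bidirectional wiring can be sketched independently of the cell type. To keep the example short, a plain tanh RNN cell is used as a stand-in for the LSTM cell; this substitution, the function names `run_rnn` and `bidirectional`, the sizes and the random weights are all assumptions made for illustration.

```python
import numpy as np

def run_rnn(xs, W_x, W_h):
    """Plain tanh RNN (stand-in for an LSTM), returning one state per time step."""
    h, hs = np.zeros(W_h.shape[0]), []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        hs.append(h)
    return np.stack(hs)

def bidirectional(xs, fwd, bwd):
    """Concatenate, at each time step, a forward state and a backward state."""
    h_fwd = run_rnn(xs, *fwd)              # uses the past context
    h_bwd = run_rnn(xs[::-1], *bwd)[::-1]  # reversed inputs, then re-aligned with time
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
n_in, n_hidden, n_steps = 3, 4, 5

def random_weights():
    return (rng.normal(scale=0.5, size=(n_hidden, n_in)),
            rng.normal(scale=0.5, size=(n_hidden, n_hidden)))

fwd, bwd = random_weights(), random_weights()  # two independent sets of weights
xs = rng.normal(size=(n_steps, n_in))
out = bidirectional(xs, fwd, bwd)
print(out.shape)  # (n_steps, 2 * n_hidden)
```

The backward pass needs the whole sequence before it can produce the state for the first time step, which is why this scheme only works offline.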
http://colah.github.io/posts/2015-08-Understanding-LSTMs
https://medium.com/@shiyan/understanding-lstm-and-its-diagrams-37e2f46f1714#.m7fxgvjwf