Reservoir computing
Professur für Künstliche Intelligenz - Fakultät für Informatik
The concept of Reservoir Computing (RC) was developed independently by two researchers at the beginning of the 2000s.
Herbert Jaeger (Bremen) introduced echo-state networks (ESN) using rate-coded neurons (Jaeger, 2001).
Wolfgang Maass (TU Graz) introduced liquid state machines (LSM) using spiking neurons (Maass et al., 2002).
The recurrent networks from ML (LSTM, GRU) are very powerful thanks to the backpropagation through time (BPTT) algorithm.
However, they suffer from several problems: BPTT is computationally expensive, and training can be slow and unstable (vanishing or exploding gradients over long sequences).
Rate-coded neurons in the reservoir integrate inputs and recurrent connections using an ODE: \tau \, \frac{d \mathbf{x}(t)}{dt} + \mathbf{x}(t) = W^\text{IN} \times \mathbf{I}(t) + W \times \mathbf{r}(t)
The output of a neuron typically uses the tanh function (bounded between -1 and 1):
\mathbf{r}(t) = \tanh \, \mathbf{x}(t)
\mathbf{z}(t) = W^\text{OUT} \times \mathbf{r}(t)
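As an illustration, these equations can be simulated with a simple Euler discretization. The following NumPy sketch uses illustrative names (`W_in`, `W`, `W_out`) and hyperparameters that are not prescribed by the slides:

```python
import numpy as np

def reservoir_step(x, I, W_in, W, W_out, tau=10.0, dt=1.0):
    """One Euler step of the rate-coded reservoir dynamics."""
    r = np.tanh(x)                          # firing rates r(t)
    dx = (-x + W_in @ I + W @ r) / tau      # tau dx/dt + x = W_in I + W r
    x = x + dt * dx                         # Euler integration
    z = W_out @ np.tanh(x)                  # linear readout z(t)
    return x, z
```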
In the classical version of the ESN, only the readout weights are learned, not the recurrent ones.
One can use supervised learning to train the readout neurons to reproduce desired targets.
Tanaka et al. (2019) Recent advances in physical reservoir computing: A review. Neural Networks 115, 100–123.
Inputs \mathbf{I}(t) bring the recurrent units in a given state or trajectory.
The recurrent connections inside the reservoir create different dynamics \mathbf{r}(t) depending on the strength of the weight matrix.
Readout neurons linearly transform the recurrent dynamics into temporal outputs \mathbf{z}(t).
Supervised learning (delta learning rule, OLS) trains the readout weights to reproduce a target \mathbf{t}(t).
It is similar to an MLP with one hidden layer, except that the hidden layer has its own dynamics.
Reservoirs only need a few hundred units to learn complex functions (e.g. N=200).
The recurrent weights are initialized randomly using a normal distribution with mean 0 and deviation \frac{g}{\sqrt{N}}:
w_{ij} \sim \mathcal{N}(0, \frac{g}{\sqrt{N}})
g is a scaling factor characterizing the strength of the recurrent connections, which leads to different dynamics.
g is linked to the spectral radius of the recurrent weight matrix (its largest absolute eigenvalue).
The recurrent weight matrix is often sparse:
A subset of the possible connections N \times N has non-zero weights.
Typically, only 10% of the possible connections are created.
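A possible initialization of such a sparse recurrent matrix, following the formula above (the 10% connectivity and the random seed are illustrative choices; some implementations rescale the standard deviation to account for sparsity):

```python
import numpy as np

def init_reservoir_weights(N=200, g=1.5, sparseness=0.1, seed=42):
    """Sparse random recurrent weights with std g/sqrt(N)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, g / np.sqrt(N), (N, N))   # dense Gaussian weights
    mask = rng.random((N, N)) < sparseness        # keep ~10% of the connections
    return W * mask
```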
Depending on the value of g, the dynamics of the reservoir can exhibit different stable or cyclic attractors.
Let’s have a look at the activity of a few neurons after the presentation of a short input.
The chaotic regime appears for g > 1.5.
g=1.5 is the edge of chaos: the dynamics are very rich, but the network is not chaotic yet.
The Lorenz attractor is a famous example of a chaotic attractor.
The position x, y, z of a particle is described by a set of 3 deterministic ordinary differential equations:
\begin{cases} \dfrac{dx}{dt} = \sigma \, (y - x) \\ \\ \dfrac{dy}{dt} = x \, (\rho - z) - y \\ \\ \dfrac{dz}{dt} = x\, y - \beta \, z \\ \end{cases}
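For illustration, the Lorenz system can be integrated with a simple Euler scheme; sigma=10, rho=28, beta=8/3 is the classical chaotic parameter setting (step size and initial condition are illustrative):

```python
import numpy as np

def lorenz_trajectory(T=10000, dt=0.01, sigma=10.0, rho=28.0, beta=8.0/3.0):
    """Euler integration of the Lorenz system."""
    xyz = np.zeros((T, 3))
    xyz[0] = (1.0, 1.0, 1.0)                       # initial condition
    for t in range(T - 1):
        x, y, z = xyz[t]
        dx = sigma * (y - x)
        dy = x * (rho - z) - y
        dz = x * y - beta * z
        xyz[t + 1] = xyz[t] + dt * np.array([dx, dy, dz])
    return xyz
```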
\mathbf{z}(t) = W^\text{OUT} \times \mathbf{r}(t)
W^\text{OUT} = (\mathbf{R}^T \times \mathbf{R})^{-1} \times \mathbf{R}^T \times \mathbf{T}
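Given a matrix R of recorded reservoir states (one row per time step) and a matrix T of the corresponding targets, the readout weights can be obtained in closed form. A sketch using `np.linalg.lstsq`, which is numerically more stable than forming the explicit inverse:

```python
import numpy as np

def train_readout(R, T):
    """Least-squares readout: solves R @ X = T for X."""
    # Equivalent to W_out = (R^T R)^{-1} R^T T, but numerically more stable.
    X, *_ = np.linalg.lstsq(R, T, rcond=None)
    return X.T                                   # shape (n_outputs, N) for 2D targets
```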
Given enough neurons in the reservoir and dynamics at the edge of chaos, an RC network can approximate any non-linear function between an input signal \mathbf{I}(t) and a target signal \mathbf{t}(t).
The reservoir projects a low-dimensional input into a high-dimensional spatio-temporal feature space where trajectories become linearly separable.
The reservoir increases the distance between the input patterns.
Input patterns are separated in both space (neurons) and time: the readout neurons need far fewer weights than the equivalent MLP, leading to better generalization and faster learning.
The main drawback is that it does not deal very well with high-dimensional inputs (e.g. images).
Seoane (2019) Evolutionary aspects of reservoir computing. Philosophical Transactions of the Royal Society B.
See Zhang and Vargas (2023).
Pathak et al. (2018) Model-Free Prediction of Large Spatiotemporally Chaotic Systems from Data: A Reservoir Computing Approach. Physical Review Letters 120, 024102.
NLP: RC networks can grasp the dynamics of language, i.e. its grammar.
RC networks can be trained to produce predicates (“hit(Mary, John)”) from sentences (“Mary hit John” or “John was hit by Mary”).
Hinaut and Dominey (2013) Real-Time Parallel Processing of Grammatical Structure in the Fronto-Striatal System. PLOS ONE 8, e52946.
The cool thing with reservoirs is that they do not have to be simulated by classical von Neumann architectures (CPU, GPU).
Anything able to exhibit dynamics at the edge of chaos can be used:
This can drastically limit the energy consumption of ML algorithms (a single GPU consumes around 200 W).
Even biological or physical systems can be used…
Tanaka et al. (2019) Recent advances in physical reservoir computing: A review. Neural Networks 115, 100–123.
A bucket of water can be used as a reservoir.
Different motors provide inputs to the reservoir by creating waves on the water surface.
The surface of the water is recorded with a camera and used as input to a linear readout algorithm.
It can learn non-linear operations (XOR) or even speech recognition.
Fernando and Sojakka (2003) Pattern Recognition in a Bucket. In Advances in Artificial Life, Lecture Notes in Computer Science.
Frega et al. (2014) Network dynamics of 3D engineered neuronal cultures: a new experimental model for in-vitro electrophysiology. Scientific Reports 4, 1–14.
Escherichia coli bacteria change their mRNA in response to various external factors (temperature, chemical products, etc.) and interact with each other.
Their mRNA encodes a dynamical trajectory reflecting the inputs.
By measuring their state on a microarray, a simple linear readout can learn to perform non-linear operations on the inputs.
Jones et al. (2007) Is there a Liquid State Machine in the Bacterium Escherichia Coli? in 2007 IEEE Symposium on Artificial Life, 187–191.
Sussillo and Abbott (2009) Generating coherent patterns of activity from chaotic neural networks. Neuron, 63(4), 544–557.
\tau \, \dfrac{d \mathbf{x}(t)}{dt} + \mathbf{x}(t) = W^\text{IN} \times \mathbf{I}(t) + W^\text{REC} \times \mathbf{r}(t) + W^\text{FB} \times \mathbf{z}(t)
This makes the reservoir much more robust to perturbations, especially at the edge of chaos.
The trajectories are more stable (but still highly dynamical), making the job of the readout neurons easier.
Using feedback, there is even no need for an input I(t): the readout layer only has to learn to minimize the difference between its output z(t) and the target t(t).
A reservoir with feedback can perform autoregression, for example time series prediction.
Sussillo and Abbott (2009) Generating coherent patterns of activity from chaotic neural networks. Neuron, 63(4), 544–557.
\begin{cases} \tau \, \dfrac{d \mathbf{x}(t)}{dt} + \mathbf{x}(t) = W^\text{IN} \times \mathbf{I}(t) + W^\text{REC} \times \mathbf{r}(t) + W^\text{FB} \times \mathbf{z}(t) \\ \\ \mathbf{r}(t) = \tanh \, \mathbf{x}(t) \\ \\ \mathbf{z}(t) = W^\text{OUT} \times \mathbf{r}(t) \\ \end{cases}
\mathcal{L}(W^\text{OUT}) = \mathbb{E}_t [||\mathbf{t}(t) - W^\text{OUT} \times \mathbf{r}(t)||^2]
W^\text{OUT} = (\mathbf{R}^T \times \mathbf{R})^{-1} \times \mathbf{R}^T \times \mathbf{T}
\Delta W^\text{OUT} = \eta \, (\mathbf{t}(t) - \mathbf{z}(t)) \times \mathbf{r}^T(t)
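A sketch of the corresponding online delta rule update of the readout weights at a single time step (the learning rate `eta` is illustrative):

```python
import numpy as np

def delta_rule_update(W_out, r, target, eta=0.01):
    """Online delta rule: W_out += eta * (t - z) r^T."""
    z = W_out @ r                           # current prediction z(t)
    error = target - z                      # instantaneous error t(t) - z(t)
    W_out += eta * np.outer(error, r)       # rank-one update of the readout
    return W_out, z
```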
If we wanted to take the temporal dependencies into account, we would need to apply BPTT to the recurrent weights: bad.
There exists an online version of OLS, the recursive least squares (RLS) algorithm, which performs linear regression incrementally.
The derivation is somewhat involved; see Haykin (2002) for the details.
Let’s say we have already learned from the first t-1 samples and receive a new sample (\mathbf{r}_t, \mathbf{t}_t).
How do we update the readout weights?
The readout weights will be updated online using the error, the input and P:
\Delta W^\text{OUT} = \eta \, (\mathbf{t}_t - \mathbf{z}_t) \times P \times \mathbf{r}_t
where P is a running estimate of the inverse correlation matrix of the inputs:
P = (\mathbf{R}^T \times \mathbf{R})^{-1}
Haykin (2002) Adaptive filter theory. Prentice Hall.
\Delta P = - \dfrac{(P \times \mathbf{r}_t) \times (P \times \mathbf{r}_t)^T}{1 + \mathbf{r}_t^T \times P \times \mathbf{r}_t}
and the weight updates are also normalized:
\Delta W^\text{OUT} = \eta \, (\mathbf{t}_t - \mathbf{z}_t) \times \dfrac{P}{1 + \mathbf{r}_t^T \times P \times \mathbf{r}_t} \times \mathbf{r}_t
Updating P requires n \times n operations per step, compared with the n operations needed by the delta learning rule, but at least it works…
This is the formula of a single readout neuron (\mathbf{t}_t and \mathbf{z}_t are actually scalars). If you have several output neurons, you need to update several P matrices…
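A minimal sketch of one RLS step for a single readout neuron, following the update rules above (P would be initialized, e.g., to the identity matrix; the learning rate η is set to 1 here, as is common in FORCE):

```python
import numpy as np

def rls_update(w_out, P, r, target):
    """One RLS step for a single readout neuron (w_out is a vector, target a scalar)."""
    Pr = P @ r                              # P * r_t
    k = Pr / (1.0 + r @ Pr)                 # normalized gain vector
    P -= np.outer(k, Pr)                    # update of the inverse correlation matrix
    error = target - w_out @ r              # error before the update
    w_out += error * k                      # normalized weight update
    return w_out, P
```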
FORCE learning (first-order reduced and controlled error) consists of feeding back the readout into the reservoir and using RLS.
FORCE learning stabilizes trajectories in the chaotic reservoir and generates complex patterns in an autoregressive manner.
Sussillo and Abbott (2009) Generating coherent patterns of activity from chaotic neural networks. Neuron, 63(4), 544–557.
Laje and Buonomano (2013) Robust timing and motor patterns by taming chaos in recurrent neural networks. Nature Neuroscience, 16(7), 925–933.
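Putting the pieces together, a FORCE training loop might look like the following sketch: the readout is fed back into the reservoir and updated with RLS at every step. All hyperparameters are illustrative and `rls_update` is the helper sketched above:

```python
import numpy as np

def force_train(target, N=1000, g=1.5, tau=10.0, dt=1.0, delta=1.0, seed=0):
    """Train a feedback reservoir to autonomously generate a 1D target signal."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, g / np.sqrt(N), (N, N))   # chaotic recurrent weights (fixed)
    W_fb = rng.uniform(-1.0, 1.0, N)              # feedback weights (fixed)
    w_out = np.zeros(N)                           # readout weights (learned)
    P = np.eye(N) / delta                         # inverse correlation estimate
    x = rng.normal(0.0, 0.5, N)                   # reservoir state
    z = 0.0
    for t_t in target:                            # one pass over the target signal
        r = np.tanh(x)
        x += dt * (-x + W @ r + W_fb * z) / tau   # leaky dynamics with feedback
        r = np.tanh(x)
        w_out, P = rls_update(w_out, P, r, t_t)   # RLS update of the readout
        z = w_out @ r                             # fed back at the next step
    return W, W_fb, w_out
```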
In classical RC networks, the recurrent weights are fixed and only the readout weights are trained.
The reservoir dynamics are fixed by the recurrent weights: we cannot change them.
Dynamics can be broken by external perturbations or high-amplitude noise.
The edge of chaos is sometimes too close.
If we could learn the recurrent weights, we could force the reservoir to have fixed and robust trajectories, while keeping interesting dynamics.
However, learning in a chaotic system, even with BPTT, is very hard.
Sussillo and Abbott (2009) Generating coherent patterns of activity from chaotic neural networks. Neuron, 63(4), 544–557. doi:10.1016/j.neuron.2009.07.018
A classical reservoir network is trained to reproduce handwriting.
The two readout neurons produce a sequence of (x, y) positions for the pen.
It works quite well when the input is not perturbed.
If some perturbation enters the reservoir, the trajectory is lost.
We have an output error signal \mathbf{t}_t - \mathbf{z}_t at each time step.
Why can’t we just apply backpropagation (through time) on the recurrent weights?
\mathcal{L}(W, W^\text{OUT}) = \mathbb{E}_{t} [(\mathbf{t}_t - \mathbf{z}_t)^2]
With FORCE learning, we have an error term for the readout weights.
For the recurrent weights, we would also need an error term.
It can be computed by recording the dynamics during an initialization trial \mathbf{r}^*_t and forcing the recurrent weights to reproduce these dynamics in the learning trials:
\Delta W = - \eta \, (\mathbf{r}^*_t - \mathbf{r}_t) \times P \times \mathbf{r}_t
This is equivalent to having a fixed, external reservoir providing the targets.
See https://github.com/ReScience-Archives/Vitay-2016 for a reimplementation.
Sussillo and Abbott (2009) Generating coherent patterns of activity from chaotic neural networks. Neuron, 63(4), 544–557.
Laje and Buonomano (2013) Robust timing and motor patterns by taming chaos in recurrent neural networks. Nature Neuroscience, 16(7), 925–933.
Miconi (2017) Biologically plausible learning in recurrent neural networks reproduces neural dynamics observed during cognitive tasks. eLife 6:e20899.
\tau \frac{d \mathbf{x}(t)}{dt} + \mathbf{x}(t) = W^\text{IN} \times \mathbf{I}(t) + W \times \mathbf{r}(t)
\mathbf{r}(t) = \tanh \mathbf{x}(t)
However, there are NO readout neurons: a random neuron of the reservoir is picked as the output neuron.
Its activity at the end of a trial determines whether a reward is delivered or not.
Miconi (2017) Biologically plausible learning in recurrent neural networks reproduces neural dynamics observed during cognitive tasks. eLife 6:e20899.
Delayed non-match-to-sample (DNMS) task:
It is a task involving working memory: the first item must be actively remembered in order to produce the response later.
The response is calculated as the mean activity y of the output neuron over the last 200 ms.
The “reward” used is simply the difference between the desired value (t=+1 or -1) and the response:
r = - |t - y|
Miconi (2017) Biologically plausible learning in recurrent neural networks reproduces neural dynamics observed during cognitive tasks. eLife 6:e20899.
\Delta w_{ij} = \eta \, r_i \, r_j \, (R - \bar{R})
\bar{R} \leftarrow \alpha \, \bar{R} + (1-\alpha) \, R
If more reward than usual is received, the weights between correlated neurons should be increased.
If less reward than usual is received, the weights between correlated neurons should be decreased.
This does not work well with sparse rewards (at the end of a complex sequence).
Kuśmierz et al. (2017) Learning with three factors: modulating Hebbian plasticity with errors. Current Opinion in Neurobiology 46, 170–177.
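A minimal sketch of this three-factor Hebbian update, applied once per trial (the function name, the learning rate and the value of alpha are illustrative):

```python
import numpy as np

def reward_modulated_hebbian(W, r, R, R_mean, eta=0.001, alpha=0.8):
    """Three-factor update: pre/post correlation times reward prediction error."""
    W += eta * np.outer(r, r) * (R - R_mean)     # Delta w_ij = eta * r_i * r_j * (R - R_mean)
    R_mean = alpha * R_mean + (1.0 - alpha) * R  # running average of the reward
    return W, R_mean
```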
Perturbation-based methods provide a better learning signal by estimating how each weight influences the reward. Weight perturbation applies a small random perturbation to a single weight:
w_{ij} \rightarrow w_{ij} + \xi
and observes the change of reward intake at the end of the episode:
\Delta R = R - \bar{R}
\Delta w_{ij} = \eta \, (R - \bar{R}) \, \xi
\frac{\partial R(\theta)}{\partial w_{ij}} \approx \frac{\Delta R}{\Delta w_{ij}} = \frac{R - \bar{R}}{ w_{ij} + \xi - w_{ij} } = \frac{R - \bar{R}}{\xi}
In node perturbation, the perturbation \xi_j is instead applied to the net activation of the post-synaptic neuron:
x_j = \ldots + w_{ij} \, r_i + \ldots
x_j \rightarrow x_j + \xi_j
\Delta w_{ij} = \eta \, (\sum_t r_i \, \xi_j) \, (R - \bar{R})
A trace of the perturbations must be maintained, as learning occurs only at the end of the trial.
Still not biologically plausible: a synapse cannot directly access and store the perturbations, which may come from other neurons.
Fiete and Seung (2006) Gradient learning in spiking neural networks by dynamic perturbation of conductances. Physical Review Letters 97:048104.
\Delta w_{ij} = \eta \, r_i \, (x_j - \bar{x_j}) \, (R - \bar{R})
where \bar{x_j} is a running average of the postsynaptic activity (a trace of its activity).
The difference x_j - \bar{x_j} contains information about the perturbation, but is local to the synapse (biologically realistic).
However, the effect of the perturbation is canceled out by the relaxation of x_j, hence the need for a non-linearity:
\Delta w_{ij} = \eta \, r_i \, (x_j - \bar{x_j})^3 \, (R - \bar{R})
Legenstein et al. (2010) A reward-modulated hebbian learning rule can explain experimentally observed network reorganization in a brain control task. Journal of Neuroscience.
During a trial, perturbations are applied to the post-synaptic activities:
x_j \rightarrow x_j + \xi_j
A Hebbian eligibility trace accumulates the correlations during the trial:
e_{ij} = e_{ij} + r_i \, (x_j - \bar{x_j})^3
At the end of the trial, the weights are updated proportionally to the trace and to the reward prediction error:
\Delta w_{ij} = \eta \, e_{ij} \, (R - \bar{R})
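A sketch of one trial of this exploratory learning scheme, under the assumption of rare random perturbations and a single update at the end of the trial (all names and hyperparameters are illustrative):

```python
import numpy as np

def miconi_trial(W, W_in, inputs, reward_fn, R_mean,
                 tau=10.0, dt=1.0, eta=0.1, perturb_prob=0.01, perturb_amp=0.5, seed=None):
    """One trial of node-perturbation learning with an eligibility trace."""
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    x = np.zeros(N)
    x_mean = np.zeros(N)                             # running average of the activities
    e = np.zeros_like(W)                             # eligibility trace
    for I in inputs:                                 # inputs: one vector per time step
        r = np.tanh(x)
        xi = perturb_amp * rng.normal(0.0, 1.0, N) * (rng.random(N) < perturb_prob)
        x += dt * (-x + W_in @ I + W @ r) / tau + xi # dynamics plus rare perturbations
        x_mean = 0.95 * x_mean + 0.05 * x            # trace of the post-synaptic activity
        e += np.outer((x - x_mean) ** 3, r)          # supra-linear Hebbian trace
    R = reward_fn(x)                                 # sparse reward at the end of the trial
    W += eta * e * (R - R_mean)                      # three-factor weight update
    R_mean = 0.8 * R_mean + 0.2 * R                  # running average of the reward
    return W, R_mean
```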
Learning is quite slow (around 1000 trials), but it relies only on sparse rewards delivered at the end of each trial.
The power of the network does not lie in the readout neurons, but in the dynamics of the reservoir: trajectories are discovered and stabilized using RL.
The only “imperfection” is that learning is actually error-driven, not success-driven:
r = - |t - y|
16 motor neurons to control the muscles of an arm.
2 inputs: left / right.
The error is the remaining distance to the target at the end of the trial.