Introduction
Professur für Künstliche Intelligenz - Fakultät für Informatik
"The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it." (McCarthy, Minsky, Rochester and Shannon, 1955, proposal for the Dartmouth Summer Research Project on Artificial Intelligence)
Good old-fashioned AI (GOFAI) approaches were purely symbolic (logical systems, knowledge-based systems) or used simple linear neural networks.
They were able to play checkers, prove mathematical theorems, make simple conversations (ELIZA), translate languages…
Machine learning (ML) is a branch of AI that focuses on learning from examples (data-driven).
ML algorithms include:
Neural Networks (multi-layer perceptrons)
Statistical analysis (Bayesian modeling, PCA)
Clustering algorithms (k-means, GMM, spectral clustering)
Support vector machines
Decision trees, random forests
Other names: big data, data science, operational research, pattern recognition…
Deep Learning is a recent re-branding of neural networks.
Deep learning focuses on learning high-level representations of the data, using:
Deep neural networks (DNN)
Convolutional neural networks (CNN)
Recurrent neural networks (RNN)
Generative models (GAN, VAE)
Deep reinforcement learning (DQN, PPO, AlphaGo)
Transformers
Graph neural networks
Neurocomputing is at the intersection between computational neuroscience and artificial neural networks (deep learning).
Computational neuroscience studies the functioning of the brain through detailed models.
Neurocomputing aims at bringing the mechanisms underlying human cognition into artificial intelligence.
Supervised learning: The program is trained on a pre-defined set of training examples and used to make correct predictions when given new data.
Unsupervised learning: The program is given a bunch of data and must find patterns and relationships therein.
Reinforcement learning: The program explores its environment by producing actions and receiving rewards.
Supervised learning can be formalized as finding the parameters \theta of a model f mapping inputs to predictions,
y_i = f_\theta(x_i)
that minimize the total error between the predictions y_i and the targets t_i over the training set:
\theta^* = \text{argmin}_\theta \sum_{i=1}^N || t_i - y_i||
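As a minimal sketch of this objective, a linear model f_\theta can be fitted by gradient descent in NumPy (the dataset and all values below are synthetic and hypothetical; the squared error is used so the objective is differentiable):

```python
import numpy as np

# Synthetic dataset: N noisy samples of a hypothetical linear mapping.
rng = np.random.default_rng(42)
N = 100
X = rng.uniform(-1.0, 1.0, N)
T = 3.0 * X + 1.0 + rng.normal(0.0, 0.1, N)   # targets t_i

# Linear model y_i = f_theta(x_i) = w * x_i + b, with theta = (w, b).
w, b = 0.0, 0.0
eta = 0.1  # learning rate
for epoch in range(500):
    Y = w * X + b                      # predictions y_i
    error = Y - T                      # y_i - t_i
    w -= eta * np.mean(error * X)      # gradient of the mean squared error
    b -= eta * np.mean(error)

print(f"w = {w:.2f}, b = {b:.2f}")     # should approach w = 3, b = 1
```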
When learning is successful, the model can be used on novel examples (generalisation).
The modality of the inputs and outputs does not really matter: images, text or sound are all fed to the network as vectors of numbers.
The elementary building block is the artificial neuron, which applies a transfer function f to a weighted sum of its d inputs:
y = f( \sum_{i=1}^d w_i \, x_i + b)
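In NumPy, such a neuron is a one-liner; the sigmoid transfer function and the numerical values are only illustrative:

```python
import numpy as np

def neuron(x, w, b):
    """Artificial neuron: weighted sum of the inputs plus a bias,
    passed through a non-linear transfer function (here a sigmoid)."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

x = np.array([0.5, -1.0, 2.0])  # d = 3 inputs x_i
w = np.array([0.2, 0.4, 0.1])   # weights w_i
b = -0.05                       # bias b
print(neuron(x, w, b))          # output y in (0, 1)
```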
A convolutional neural network (CNN) is a cascade of convolution and pooling operations, extracting layer by layer increasingly complex features.
It can be trained on huge datasets of annotated examples.
The MNIST database of handwritten digits is the simplest benchmark for object recognition (state-of-the-art accuracy above 99.5 %).
One of the first functional CNNs was LeNet5, able to classify digits.
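As an illustration, a small LeNet-style CNN can be written in a few lines of Keras; the layer sizes below are illustrative rather than the exact LeNet5 architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers

# LeNet-style CNN: alternating convolution and pooling layers extract
# increasingly complex features, followed by a small classifier.
model = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(6, kernel_size=5, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(16, kernel_size=5, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(120, activation="relu"),
    layers.Dense(10, activation="softmax"),  # one probability per digit
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# MNIST ships with Keras; pixel values are rescaled to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
model.fit(x_train[..., None] / 255.0, y_train, epochs=5,
          validation_data=(x_test[..., None] / 255.0, y_test))
```

A few epochs of training are typically enough to reach around 99% test accuracy on MNIST.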
Typical computer vision tasks include:
Object recognition
Object detection
Object segmentation
On the ImageNet recognition challenge, classical computer vision methods obtained moderate results, with error rates around 30%.
In 2012, Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton (University of Toronto) entered a CNN (AlexNet) without any hand-crafted preprocessing, feeding the raw images directly as inputs.
To everybody's surprise, they won with an error rate of 15%, half of what the other methods could achieve.
Since then, everybody uses deep neural networks for object recognition.
The deep learning hype had just begun…
Deep learning is now the state-of-the-art approach in many domains:
Computer vision
Natural language processing
Speech processing
Robotics, control
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NIPS.
It turns out object detection is both a classification (what) and regression (where) problem.
Neural networks can be trained to do it given enough annotated data.
NVIDIA trained a CNN to reproduce the steering commands of experienced drivers using only a front camera.
After training, the CNN took control of the car.
Bojarski M, Del Testa D, Dworakowski D, Firner B, et al. (2016). End to End Learning for Self-Driving Cars. arXiv:1604.07316.
Facebook used 4.4 million annotated faces from 4030 users to train DeepFace.
Accuracy of 97.35% for recognizing faces, on par with humans.
Used now to recognize new faces from single examples (transfer learning, one-shot learning).
Taigman, Yang, Ranzato, Wolf (2014). DeepFace: Closing the Gap to Human-Level Performance in Face Verification. CVPR.
A recurrent neural network (RNN) uses its previous output as an additional input (context).
The inputs are integrated over time to deliver a response at the correct moment.
This makes it possible to process time series (texts, videos) without increasing the input dimensionality.
The input to the RNN can even be the output of a pre-trained CNN.
The most effective RNN architecture is the LSTM (long short-term memory network; Hochreiter and Schmidhuber, 1997).
Google AI blog (2018). Smart Compose: Using Neural Networks to Help Write Emails. http://ai.googleblog.com/2018/05/smart-compose-using-neural-networks-to.html
PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain'd into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.
Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.
DUKE VINCENTIO:
Well, your wit is in the care of side and that.
Second Lord:
They would be ruled after this chamber, and
my fair nues begun out of the fact, to be conveyed,
Whose noble souls I'll have the heart of the wars.
Clown:
Come, sir, I will make did behold your worship.
Characters or words are fed one by one into an LSTM.
The desired output is the next character or word in the text.
Example:
Inputs: To, be, or, not, to
Output: be
The text above was generated by an LSTM that had read the entire writings of William Shakespeare.
Each generated word is used as the next input.
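A minimal character-level version of this setup in Keras might look as follows (the file name and all hyperparameters are hypothetical):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical corpus; in the example above it would be Shakespeare's works.
# Truncated here to keep the sketch light on memory.
text = open("shakespeare.txt").read()[:200_000]
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}

# Each training example: `seq_len` characters as input, the next one as target.
seq_len = 40
encoded = np.array([char_to_idx[c] for c in text])
X = np.stack([encoded[i:i + seq_len] for i in range(len(encoded) - seq_len)])
y = encoded[seq_len:]

model = tf.keras.Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Embedding(len(chars), 64),
    layers.LSTM(128),                                # integrates the sequence over time
    layers.Dense(len(chars), activation="softmax"),  # next-character distribution
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, batch_size=128, epochs=10)
```

Generation then consists of sampling a character from the output distribution and feeding it back as the next input.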
Two LSTMs can be stacked to perform sequence-to-sequence translation (seq2seq).
One is the encoder, the other the decoder.
Same idea, but with many more layers…
Can translate any pair of languages!
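The encoder-decoder wiring can be sketched in Keras as follows; the vocabulary sizes and dimensions are placeholders, and the attention mechanisms used in practice are omitted:

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_src, vocab_tgt, latent = 5000, 5000, 256  # hypothetical sizes

# Encoder: reads the source sentence and summarizes it in its final states.
enc_inputs = layers.Input(shape=(None,))
enc_emb = layers.Embedding(vocab_src, latent)(enc_inputs)
_, state_h, state_c = layers.LSTM(latent, return_state=True)(enc_emb)

# Decoder: generates the target sentence, initialized with the encoder states.
dec_inputs = layers.Input(shape=(None,))
dec_emb = layers.Embedding(vocab_tgt, latent)(dec_inputs)
dec_outputs, _, _ = layers.LSTM(latent, return_sequences=True,
                                return_state=True)(dec_emb,
                                                   initial_state=[state_h, state_c])
outputs = layers.Dense(vocab_tgt, activation="softmax")(dec_outputs)

model = tf.keras.Model([enc_inputs, dec_inputs], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```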
GitHub and OpenAI trained a GPT-3-like architecture on the available open-source code.
Copilot is able to “autocomplete” the code based on a simple comment/docstring.
CNNs are not limited to images: voice signals can also be recognized from their mel-spectrogram.
Siri, Alexa, Google Now, etc. use recurrent CNNs to recognize vocal commands and respond.
DeepSpeech from Baidu is one of the state-of-the-art approaches.
Hannun et al (2014). Deep Speech: Scaling up end-to-end speech recognition. arXiv:1412.5567
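As an illustration, the mel-spectrogram of a recording can be computed with librosa and then fed to a CNN like an image (the file name and parameters are hypothetical):

```python
import librosa

# Load a waveform and compute its mel-spectrogram (parameters are illustrative).
y, sr = librosa.load("command.wav", sr=16000)   # hypothetical audio file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)              # log scale, as usually fed to a CNN

# The result is a 2D array (mel bands x time frames) that a CNN can
# process exactly like an image.
print(log_mel.shape)
```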
The goal of unsupervised learning is to build a model or find useful representations of the data, for example:
finding groups of similar data and modelling their density (clustering).
reducing the redundancy of the input dimensions (dimensionality reduction).
finding good explanations / representations of the data (latent data modeling).
generating new data (generative models).
Images have a lot of dimensions (pixels), most of which are redundant.
Dimensionality reduction techniques reduce the number of dimensions by projecting the data into a lower-dimensional latent space.
Autoencoders are neural networks that learn to reproduce their inputs while compressing the information through a bottleneck.
Classical machine learning algorithms include PCA (principal component analysis) or t-SNE.
Non-linear embedding techniques such as UMAP are also widely used for visualization.
McInnes L, Healy J, Melville J. (2020). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426.
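A minimal autoencoder in Keras, compressing flattened MNIST digits through a 32-dimensional bottleneck (the sizes are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Autoencoder: compress 784-dimensional inputs (flattened 28x28 digits)
# through a 32-dimensional bottleneck, then reconstruct them.
inputs = layers.Input(shape=(784,))
latent = layers.Dense(32, activation="relu")(inputs)        # encoder / bottleneck
outputs = layers.Dense(784, activation="sigmoid")(latent)   # decoder

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

(x_train, _), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784) / 255.0
autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)  # target = input
```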
Le QV, et al. (2012). Building high-level features using large scale unsupervised learning. ICML.
A Generative Adversarial Network (GAN) is composed of two networks:
The generator learns to produce realistic images.
The discriminator learns to differentiate real data from generated data.
Both compete to reach a Nash equilibrium:
\min_G \max_D \, V(D, G) = \mathbb{E}_{x \sim P_\text{data}(x)} [\log D(x)] + \mathbb{E}_{z \sim P_z(z)} [\log(1 - D(G(z)))]
Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. 2014. Generative Adversarial Networks. arXiv:1406.2661.
Radford, Metz and Chintala (2015). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434
Reed S, Akata Z, Yan X, Logeswaran L, Schiele B, Lee H. 2016. Generative Adversarial Text to Image Synthesis. arXiv:160505396.
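This objective can be sketched as two alternating gradient updates. The fully-connected networks below are a simplified illustration (all sizes are hypothetical), and the generator uses the common non-saturating loss, maximizing log D(G(z)), instead of minimizing the theoretical log(1 - D(G(z))) term:

```python
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 100  # dimension of the noise vector z (illustrative)

# Generator G: noise z -> flattened 28x28 image. Discriminator D: image -> P(real).
G = tf.keras.Sequential([layers.Input(shape=(latent_dim,)),
                         layers.Dense(256, activation="relu"),
                         layers.Dense(784, activation="sigmoid")])
D = tf.keras.Sequential([layers.Input(shape=(784,)),
                         layers.Dense(256, activation="relu"),
                         layers.Dense(1, activation="sigmoid")])

bce = tf.keras.losses.BinaryCrossentropy()
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(real_images):
    z = tf.random.normal((tf.shape(real_images)[0], latent_dim))
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake = G(z)
        d_real = D(real_images)
        d_fake = D(fake)
        # Discriminator maximizes log D(x) + log(1 - D(G(z))).
        d_loss = bce(tf.ones_like(d_real), d_real) + \
                 bce(tf.zeros_like(d_fake), d_fake)
        # Generator (non-saturating variant) maximizes log D(G(z)).
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, D.trainable_variables),
                              D.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, G.trainable_variables),
                              G.trainable_variables))
```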
Supervised learning makes it possible to learn complex input/output mappings, provided there is enough data.
Sometimes we do not know the correct output, only whether the proposed output is correct or not (partial feedback).
Reinforcement Learning (RL) can be used to learn an optimal policy \pi(s, a) by trial and error.
Each action (= output) is associated with a reward.
The goal of the system is to find a policy that maximizes the sum of rewards in the long term (the return):
R(s_t, a_t) = \sum_{k=0}^\infty \gamma^k\, r_{t+k+1}
Sutton and Barto (1998). Reinforcement Learning: An Introduction. MIT Press. http://incompleteideas.net/sutton/book/the-book.html
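Computing the return of a finite episode is straightforward; the reward sequence below is hypothetical:

```python
def discounted_return(rewards, gamma=0.99):
    """R = sum_k gamma^k * r_{t+k+1} over a finite episode."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Hypothetical episode: a single reward of +1 at the last of 100 steps.
rewards = [0.0] * 99 + [1.0]
print(discounted_return(rewards))  # 0.99^99, approximately 0.37
```

The discount factor gamma < 1 makes rewards obtained far in the future count less than immediate ones.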
A CNN takes raw images as inputs and outputs the probabilities of taking particular actions.
Learning is only based on trial and error: what happens if I do that?
The goal is simply to maximize the final score.
Mnih et al. (2015). Playing Atari with Deep Reinforcement Learning. NIPS. https://www.youtube.com/watch?v=rQIShnTz1kU
In 2015, Google DeepMind surprised everyone by publishing AlphaGo, a Go-playing AI able to beat the world's best players, including Lee Sedol, 19-time world champion, in 2016.
The RL agent discovers new strategies through self-play: during the games against Lee Sedol, it played novel moves that had never been seen before and surprised its opponent.
The newer version AlphaZero also plays chess and shogi at master level.
Silver D, et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.
Silver D, et al. (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv:1712.01815.
Goodfellow, Bengio and Courville. Deep Learning. http://www.deeplearningbook.org
Haykin. Neural Networks and Learning Machines. http://www.pearsonhighered.com/haykin
Chollet. Deep Learning with Python. https://www.manning.com/books/deep-learning-with-python
Hastie, Tibshirani and Friedman. The Elements of Statistical Learning. https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print12.pdf
Machine Learning (Coursera): https://www.coursera.org/learn/machine-learning
Deep Learning specialization (Coursera): https://www.coursera.org/specializations/deep-learning
Machine Learning (edX): https://www.edx.org/course/machine-learning
Towards Data Science (blog): https://towardsdatascience.com/