Basics in mathematics
Professur für Künstliche Intelligenz - Fakultät für Informatik
Linear algebra
Calculus
Probability theory
Statistics
Information theory
Scalars x are 0-dimensional values. They can either take real values (x \in \Re, e.g. x = 1.4573, floats in CS) or natural values (x \in \mathbb{N}, e.g. x = 3, integers in CS).
Vectors \mathbf{x} are 1-dimensional arrays of length d.
The bold notation \mathbf{x} will be used in this course, but you may also be accustomed to the arrow notation \overrightarrow{x} used on the blackboard. When using real numbers, the vector space with d dimensions is denoted \Re^d, so we can write \mathbf{x} \in \Re^d.
Vectors are typically represented vertically to outline their d elements x_1, x_2, \ldots, x_d:
\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix}
Matrices A are 2-dimensional arrays of size (or shape) m \times n (m rows, n columns, A \in \Re^{m \times n}).
They are represented by a capital letter to distinguish them from scalars (classically also in bold \mathbf{A} but not here). The element a_{ij} of a matrix A is the element on the i-th row and j-th column.
A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \\ \end{bmatrix}
Arrays with more than two dimensions are called tensors (hence the name of the tensorflow library).
A vector can be thought of as the coordinates of a point in a Euclidean space (such as the 2D space), relative to the origin.
A vector space relies on two fundamental operations: vectors can be added together, and a vector can be multiplied by a scalar a:
\mathbf{x} + \mathbf{y} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} + \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_d \end{bmatrix} = \begin{bmatrix} x_1 + y_1 \\ x_2 + y_2 \\ \vdots \\ x_d + y_d \end{bmatrix}
a \, \mathbf{x} = a \, \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} = \begin{bmatrix} a \, x_1 \\ a \, x_2 \\ \vdots \\ a \, x_d \end{bmatrix}
These two operations generate a lot of nice properties (see https://en.wikipedia.org/wiki/Vector_space for a full list), including:
\mathbf{x} + (\mathbf{y} + \mathbf{z}) = (\mathbf{x} + \mathbf{y}) + \mathbf{z}
\mathbf{x} + \mathbf{y} = \mathbf{y} + \mathbf{x}
\mathbf{x} + \mathbf{0} = \mathbf{x}
\mathbf{x} + (-\mathbf{x}) = \mathbf{0}
a \, (\mathbf{x} + \mathbf{y}) = a \, \mathbf{x} + a \, \mathbf{y}
||\mathbf{x}||_2 = \sqrt{x_1^2 + x_2^2 + \ldots + x_d^2}
||\mathbf{x}||_1 = |x_1| + |x_2| + \ldots + |x_d|
||\mathbf{x}||_p = (|x_1|^p + |x_2|^p + \ldots + |x_d|^p)^{\frac{1}{p}}
||\mathbf{x}||_\infty = \max(|x_1|, |x_2|, \ldots, |x_d|)
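The four norms above can be checked numerically with a minimal sketch in plain Python. The helper names `p_norm` and `max_norm` are hypothetical and only serve the illustration; vectors are represented as plain lists.

```python
def p_norm(x, p):
    """L_p norm of a vector given as a list of numbers (assumes p >= 1)."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

def max_norm(x):
    """L_infinity norm: the largest absolute value among the elements."""
    return max(abs(xi) for xi in x)

x = [3.0, -4.0]
l1 = p_norm(x, 1)    # |3| + |-4| = 7
l2 = p_norm(x, 2)    # sqrt(9 + 16) = 5
linf = max_norm(x)   # max(3, 4) = 4
```

The L1 and L2 norms are simply the special cases p = 1 and p = 2 of the general p-norm.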
\langle \mathbf{x} \cdot \mathbf{y} \rangle = \langle \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \cdot \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_d \end{bmatrix} \rangle = x_1 \, y_1 + x_2 \, y_2 + \ldots + x_d \, y_d
The dot product sums the element-wise products of the two vectors. The angular brackets are sometimes omitted (\mathbf{x} \cdot \mathbf{y}) but we will use them in this course for clarity.
One can notice immediately that the dot product is symmetric:
\langle \mathbf{x} \cdot \mathbf{y} \rangle = \langle \mathbf{y} \cdot \mathbf{x} \rangle
and linear:
\langle (a \, \mathbf{x} + b\, \mathbf{y}) \cdot \mathbf{z} \rangle = a\, \langle \mathbf{x} \cdot \mathbf{z} \rangle + b \, \langle \mathbf{y} \cdot \mathbf{z} \rangle
\langle \mathbf{x} \cdot \mathbf{y} \rangle = ||\mathbf{x}||_2 \, ||\mathbf{y}||_2 \, \cos(\theta)
If you normalize the two vectors by dividing them by their norm (which is a scalar), their dot product is exactly the cosine of the angle between them.
The higher the normalized dot product, the more the two vectors point in the same direction (this is the cosine similarity between two vectors):
\langle \displaystyle\frac{\mathbf{x}}{||\mathbf{x}||_2} \cdot \frac{\mathbf{y}}{||\mathbf{y}||_2} \rangle = \cos(\theta)
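As an illustration, the dot product and the normalized dot product can be sketched in a few lines of plain Python. The function names `dot`, `norm2` and `cosine` are hypothetical, and the example vectors are chosen so that the angle between them is 45 degrees.

```python
import math

def dot(x, y):
    """Dot product: sum of the element-wise products."""
    return sum(xi * yi for xi, yi in zip(x, y))

def norm2(x):
    """Euclidean (L2) norm, i.e. the square root of <x . x>."""
    return math.sqrt(dot(x, x))

def cosine(x, y):
    """Normalized dot product: cosine of the angle between x and y."""
    return dot(x, y) / (norm2(x) * norm2(y))

x = [1.0, 0.0]
y = [1.0, 1.0]
theta = math.acos(cosine(x, y))         # angle in radians, close to pi / 4
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])  # perpendicular vectors
```

Perpendicular vectors have a normalized dot product of 0, while vectors pointing in the same direction reach the maximal value 1.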
A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \\ a_{41} & a_{42} & a_{43} \\ \end{bmatrix}
\mathbf{a}_1 = \begin{bmatrix} a_{11} \\ a_{21} \\ a_{31} \\ a_{41} \\ \end{bmatrix} \qquad \mathbf{a}_2 = \begin{bmatrix} a_{12} \\ a_{22} \\ a_{32} \\ a_{42} \\ \end{bmatrix} \qquad \mathbf{a}_3 = \begin{bmatrix} a_{13} \\ a_{23} \\ a_{33} \\ a_{43} \\ \end{bmatrix} \qquad
A = \begin{bmatrix} \mathbf{a}_1 & \mathbf{a}_2 & \mathbf{a}_3\\ \end{bmatrix}
\alpha \, A + \beta \, B = \begin{bmatrix} \alpha\, a_{11} + \beta \, b_{11} & \alpha\, a_{12} + \beta \, b_{12} & \alpha\, a_{13} + \beta \, b_{13} \\ \alpha\, a_{21} + \beta \, b_{21} & \alpha\, a_{22} + \beta \, b_{22} & \alpha\, a_{23} + \beta \, b_{23} \\ \alpha\, a_{31} + \beta \, b_{31} & \alpha\, a_{32} + \beta \, b_{32} & \alpha\, a_{33} + \beta \, b_{33} \\ \alpha\, a_{41} + \beta \, b_{41} & \alpha\, a_{42} + \beta \, b_{42} & \alpha\, a_{43} + \beta \, b_{43} \\ \end{bmatrix}
Note: Beware, you can only add matrices of the same dimensions m\times n. You cannot add a 2\times 3 matrix to a 5 \times 4 one.
A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \\ \end{bmatrix}, \qquad A^T = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{mn} \\ \end{bmatrix}
\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix}, \qquad \mathbf{x}^T = \begin{bmatrix} x_1 & x_2 & \ldots & x_d \end{bmatrix}
A=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \\ \end{bmatrix},\quad B=\begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1p} \\ b_{21} & b_{22} & \cdots & b_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ b_{n1} & b_{n2} & \cdots & b_{np} \\ \end{bmatrix}
we can multiply them to obtain an m \times p matrix:
C = A \times B =\begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1p} \\ c_{21} & c_{22} & \cdots & c_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ c_{m1} & c_{m2} & \cdots & c_{mp} \\ \end{bmatrix}
where each element c_{ij} is the dot product of the ith row of A and jth column of B:
c_{ij} = \langle A_{i, :} \cdot B_{:, j} \rangle = a_{i1}b_{1j} + a_{i2}b_{2j} +\cdots + a_{in}b_{nj}= \sum_{k=1}^n a_{ik}b_{kj}
Note: the number of columns of A must be equal to the number of rows of B (both are n here), otherwise the product is not defined!
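The definition of c_{ij} translates directly into a triple loop. The following is only a sketch on nested lists (the function name `matmul` is hypothetical); in practice a numerical library would be used instead.

```python
def matmul(A, B):
    """Multiply an m x n matrix A by an n x p matrix B (nested lists)."""
    m, n, p = len(A), len(B), len(B[0])
    if any(len(row) != n for row in A):
        raise ValueError("columns of A must match rows of B")
    # c_ij is the dot product of the i-th row of A and the j-th column of B
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]

A = [[1, 2, 3],
     [4, 5, 6]]      # 2 x 3
B = [[1, 0],
     [0, 1],
     [1, 1]]         # 3 x 2
C = matmul(A, B)     # 2 x 2 matrix [[4, 5], [10, 11]]
```

Multiplying in the other order, `matmul(B, A)`, yields a 3 x 3 matrix: matrix multiplication is not commutative.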
\mathbf{y} = A \times \mathbf{x} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \\ \end{bmatrix} \times \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}
The result \mathbf{y} is a vector of size m.
In that sense, a matrix A can transform a vector of size n into a vector of size m.
\mathbf{x}^T \times \mathbf{y} = \begin{bmatrix} x_1 & x_2 & \ldots & x_n \end{bmatrix} \times \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = x_1 \, y_1 + x_2 \, y_2 + \ldots + x_n \, y_n = \langle \mathbf{x} \cdot \mathbf{y} \rangle
A \times A^{-1} = A^{-1} \times A = I
where I is the identity matrix (a matrix with ones on the diagonal and zeros elsewhere).
\begin{cases} a_{11} \, x_1 + a_{12} \, x_2 + \ldots + a_{1n} \, x_n = b_1 \\ a_{21} \, x_1 + a_{22} \, x_2 + \ldots + a_{2n} \, x_n = b_2 \\ \ldots \\ a_{n1} \, x_1 + a_{n2} \, x_2 + \ldots + a_{nn} \, x_n = b_n \\ \end{cases}
which is equivalent to:
A \times \mathbf{x} = \mathbf{b}
\mathbf{x} = A^{-1} \times \mathbf{b}
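For a small system, the solution \mathbf{x} = A^{-1} \times \mathbf{b} can be sketched by hand. The example below assumes a 2 \times 2 matrix and uses the standard closed-form inverse based on the determinant, which is not derived in these notes; the helper names and the numerical values are hypothetical.

```python
def inverse_2x2(A):
    """Closed-form inverse of a 2 x 2 matrix (only valid if det != 0)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular, no inverse exists")
    return [[d / det, -b / det],
            [-c / det, a / det]]

def matvec(A, x):
    """Matrix-vector product: one dot product per row of A."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

# System: 2 x1 + x2 = 3 and x1 + 3 x2 = 5
A = [[2.0, 1.0],
     [1.0, 3.0]]
b = [3.0, 5.0]
x = matvec(inverse_2x2(A), b)   # close to [0.8, 1.4]
```

Plugging `x` back into `matvec(A, x)` recovers `b` up to rounding errors. The inverse only exists when the determinant is non-zero.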
\begin{align} f\colon \quad \Re &\to \Re\\ x &\mapsto f(x),\end{align}
\begin{align} f\colon \quad \Re^n &\to \Re\\ \mathbf{x} &\mapsto f(\mathbf{x}),\end{align}
The variables of the function are the elements of the vector.
For low-dimensional vector spaces, it is possible to represent each element explicitly, for example:
\begin{align} f\colon \quad\Re^3 &\to \Re\\ x, y, z &\mapsto f(x, y, z),\end{align}
\begin{align} \overrightarrow{f}\colon \quad \Re^n &\to \Re^m\\ \mathbf{x} &\mapsto \overrightarrow{f}(\mathbf{x}),\end{align}
Note: The matrix-vector multiplication \mathbf{y} = A \times \mathbf{x} is a linear vector field, mapping any vector \mathbf{x} into another vector \mathbf{y}.
Differential calculus deals with the derivative of a function, a process called differentiation.
The derivative f'(x) or \displaystyle\frac{d f(x)}{dx} of a univariate function f(x) is defined as the local slope of the tangent to the function for a given value of x:
f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}
The sign of the derivative tells you how the function behaves locally:
If the derivative is positive, increasing x a little increases the function f(x), so the function is locally increasing.
If the derivative is negative, increasing x a little decreases the function f(x), so the function is locally decreasing.
It basically allows you to measure the local influence of x on f(x): if I change the value of x a little, what happens to f(x)? This will be very useful in machine learning.
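The limit in the definition of the derivative can be approximated numerically by choosing a small but finite h. This is a sketch with an arbitrary example function f(x) = x^2 and an arbitrary step size h, both chosen for illustration.

```python
def numerical_derivative(f, x, h=1e-6):
    """Forward finite difference: (f(x + h) - f(x)) / h for a small h."""
    return (f(x + h) - f(x)) / h

def f(x):
    return x ** 2

slope_right = numerical_derivative(f, 1.0)    # close to 2: locally increasing
slope_left = numerical_derivative(f, -1.0)    # close to -2: locally decreasing
slope_zero = numerical_derivative(f, 0.0)     # close to 0: flat at the minimum
```

The sign of the result matches the local behavior of the function described above.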
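A special case is when the derivative is equal to 0 in x. x is then called a stationary (or critical) point of the function. It can be an extremum (or optimum), i.e. a maximum or a minimum, but it does not have to be.
You can often tell whether a stationary point is a maximum or a minimum by looking at the second-order derivative:
If f''(x) > 0, the stationary point is a minimum.
If f''(x) < 0, the stationary point is a maximum.
If f''(x) = 0, the test is inconclusive: the point can be a minimum (e.g. x^4 in 0), a maximum (e.g. -x^4 in 0) or a saddle (inflection) point (e.g. x^3 in 0).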
\nabla_\mathbf{x} \, f(\mathbf{x}) = \begin{bmatrix} \displaystyle\frac{\partial f(\mathbf{x})}{\partial x_1} \\ \displaystyle\frac{\partial f(\mathbf{x})}{\partial x_2} \\ \ldots \\ \displaystyle\frac{\partial f(\mathbf{x})}{\partial x_n} \\ \end{bmatrix}
f(x, y) = x^2 + 3 \, x \, y + 4 \, x \, y^2 - 1
can be partially differentiated w.r.t. x and y as:
\begin{cases} \displaystyle\frac{\partial f(x, y)}{\partial x} = 2 \, x + 3\, y + 4 \, y^2 \\ \\ \displaystyle\frac{\partial f(x, y)}{\partial y} = 3 \, x + 8\, x \, y \end{cases}
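The two partial derivatives above can be verified numerically. The sketch below compares the analytical gradient with central finite differences (a slightly more accurate variant of the forward difference used in the definition of the derivative); the evaluation point (1, 2) is arbitrary.

```python
def f(x, y):
    return x ** 2 + 3 * x * y + 4 * x * y ** 2 - 1

def grad_f(x, y):
    """Analytical partial derivatives of f w.r.t. x and y."""
    return (2 * x + 3 * y + 4 * y ** 2, 3 * x + 8 * x * y)

def numerical_grad(f, x, y, h=1e-6):
    """Central differences: vary one variable, keep the other one fixed."""
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return (df_dx, df_dy)

analytic = grad_f(1.0, 2.0)            # (2 + 6 + 16, 3 + 16) = (24, 19)
numeric = numerical_grad(f, 1.0, 2.0)  # close to (24, 19)
```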
J = \begin{bmatrix} \dfrac{\partial \mathbf{f}}{\partial x_1} & \cdots & \dfrac{\partial \mathbf{f}}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n}\\ \vdots & \ddots & \vdots\\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix}
h(x) = a \, f(x) + b \, g(x)
its derivative is:
h'(x) = a \, f'(x) + b \, g'(x)
(f(x) \times g(x))' = f'(x) \times g(x) + f(x) \times g'(x)
Example:
f(x) = x^2 \, e^x
f'(x) = 2 \, x \, e^x + x^2 \cdot e^x
(f \circ g) (x) = f(g(x))
(f \circ g)' (x) = (f' \circ g) (x) \times g'(x)
\frac{d (f \circ g) (x)}{dx} = \frac{d f (g (x))}{d g(x)} \times \frac{d g (x)}{dx}
\frac{d f(y)}{dx} = \frac{d f(y)}{dy} \times \frac{dy}{dx}
h(x) = \frac{1}{2 \, x + 1}
is the function composition of g(x) = 2 \, x + 1 and f(x) = \displaystyle\frac{1}{x}, whose derivatives are known:
g'(x) = 2 \qquad f'(x) = -\displaystyle\frac{1}{x^2}
h'(x) = f'(g(x)) \times g'(x) = -\displaystyle\frac{1}{(2 \, x + 1)^2} \times 2
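The chain rule for this example can be checked with a short sketch: the derivative is assembled from f', g and g' exactly as in the formula, then compared with a finite-difference estimate. The evaluation point x0 = 1 and the step size are arbitrary choices.

```python
def g(x):
    return 2.0 * x + 1.0

def g_prime(x):
    return 2.0

def f(u):
    return 1.0 / u

def f_prime(u):
    return -1.0 / u ** 2

def h(x):
    """Composition h = f o g, i.e. h(x) = 1 / (2 x + 1)."""
    return f(g(x))

def h_prime(x):
    """Chain rule: h'(x) = f'(g(x)) * g'(x)."""
    return f_prime(g(x)) * g_prime(x)

x0 = 1.0
analytic = h_prime(x0)                               # -2 / 9
eps = 1e-6
numeric = (h(x0 + eps) - h(x0 - eps)) / (2 * eps)    # close to -2 / 9
```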
\displaystyle\frac{\partial f \circ g (x, y)}{\partial x} = \frac{\partial f \circ g (x, y)}{\partial g (x, y)} \times \frac{\partial g (x, y)}{\partial x}
and gradients:
\nabla_\mathbf{x} \, f \circ g (\mathbf{x}) = \nabla_{g(\mathbf{x})} \, f \circ g (\mathbf{x}) \times \nabla_\mathbf{x} \, g (\mathbf{x})
F'(x) = f(x)
F(x) = \int f(x) \, dx
dx being an infinitesimal interval (similar to h in the definition of the derivative).
For now, the most important thing to understand is that the integral of a function is the area under its curve.
The area under the curve of a function f on the interval [a, b] is:
\mathcal{S} = \int_a^b f(x) \, dx
One way to approximate this surface is to split the interval [a, b] into n intervals of width dx with the points x_1, x_2, \ldots, x_n.
This defines n rectangles of width dx and height f(x_i), so their surface is f(x_i) \, dx.
The area under the curve can then be approximated by the sum of the surfaces of all these rectangles.
\int_a^b f(x) \, dx = \lim_{dx \to 0} \sum_{i=1}^n f(x_i) \, dx
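The rectangle approximation can be sketched as follows. The choice of the left edge of each interval as x_i is an assumption of this example (the notes do not specify where x_i lies in the interval), and the test function x^2 on [0, 1], whose integral is 1/3, is arbitrary.

```python
def riemann_sum(f, a, b, n):
    """Approximate the integral of f on [a, b] with n rectangles of width dx."""
    dx = (b - a) / n
    # Each rectangle has height f(x_i), with x_i the left edge of interval i
    return sum(f(a + i * dx) * dx for i in range(n))

coarse = riemann_sum(lambda x: x ** 2, 0.0, 1.0, 10)
fine = riemann_sum(lambda x: x ** 2, 0.0, 1.0, 100_000)   # close to 1 / 3
```

The finer the intervals (the smaller dx), the closer the sum gets to the exact area.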
Let X denote a discrete random variable with n realizations (or outcomes) x_1, \ldots, x_n.
The probability that X takes the value x_i is defined by the relative frequency of occurrence, i.e. the proportion of samples having the value x_i, when the total number N of samples tends to infinity:
P(X = x_i) = \frac{\text{Number of favorable cases}}{\text{Total number of samples}}
The set of probabilities \{P(X = x_i)\}_{i=1}^n defines the probability distribution for the random variable (or probability mass function, pmf).
By definition, we have 0 \leq P(X = x_i) \leq 1 and the probabilities have to respect:
\sum_{i=1}^n P(X = x_i) = 1
\mathbb{E}[X] = \sum_{i=1}^n P(X = x_i) \, x_i
\mathbb{E}[\text{Coin}] = \frac{1}{2} \, 0 + \frac{1}{2} \, 1 = 0.5
\mathbb{E}[\text{Dice}] = \frac{1}{6} \, (1 + 2 + 3 + 4 + 5 + 6) = 3.5
\mathbb{E}[f(X)] = \sum_{i=1}^n P(X = x_i) \, f(x_i)
\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \sum_{i=1}^n P(X = x_i) \, (x_i - \mathbb{E}[X])^2
\text{Var}(\text{Coin}) = \frac{1}{2} \, (0 - 0.5)^2 + \frac{1}{2} \, (1 - 0.5)^2 = 0.25
\text{Var}(\text{Dice}) = \frac{1}{6} \, ((1-3.5)^2 + (2-3.5)^2 + (3-3.5)^2 + (4-3.5)^2 + (5-3.5)^2 + (6-3.5)^2) = \frac{105}{36}
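The coin and dice examples can be reproduced exactly with rational arithmetic. This sketch represents a pmf as a dictionary mapping each outcome to its probability; this representation and the function names are choices made for the illustration.

```python
from fractions import Fraction

def expectation(pmf):
    """E[X] = sum of P(X = x_i) * x_i over all outcomes."""
    return sum(p * x for x, p in pmf.items())

def variance(pmf):
    """Var(X) = sum of P(X = x_i) * (x_i - E[X])^2 over all outcomes."""
    mean = expectation(pmf)
    return sum(p * (x - mean) ** 2 for x, p in pmf.items())

coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}
dice = {x: Fraction(1, 6) for x in range(1, 7)}

coin_mean, coin_var = expectation(coin), variance(coin)   # 1/2 and 1/4
dice_mean, dice_var = expectation(dice), variance(dice)   # 7/2 and 105/36
```

`Fraction` reduces 105/36 to 35/12, which is the same value (about 2.92).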
Continuous random variables can take an infinity of continuous values, e.g. \Re or some subset.
The closed set of values they can take is called the support \mathcal{D}_X of the probability distribution.
The probability distribution is described by a probability density function (pdf) f(x).
The pdf of a distribution must be non-negative (f(x) \geq 0 \, \forall x \in \mathcal{D}_X) and its integral must be equal to 1:
\int_{x \in \mathcal{D}_X} f(x) \, dx = 1
P(a \leq X \leq b) = \int_{a}^b f(x) \, dx
\mathbb{E}[X] = \int_{x \in \mathcal{D}_X} f(x) \, x \, dx
the variance:
\text{Var}(X) = \int_{x \in \mathcal{D}_X} f(x) \, (x - \mathbb{E}[X])^2 \, dx
or a function of the random variable:
\mathbb{E}[g(X)] = \int_{x \in \mathcal{D}_X} f(x) \, g(x) \, dx
\mathbb{E}[a \, X + b \, Y] = a \, \mathbb{E}[X] + b \, \mathbb{E}[Y]
Probability distributions can in principle have any form: f(x) is unknown.
However, specific parameterized distributions can be very useful: their pmf/pdf is fully determined by a couple of parameters.
The Bernoulli distribution is a binary (discrete, 0 or 1) distribution with a parameter p specifying the probability to obtain the outcome 1:
P(X = 1) = p \; \text{and} \; P(X=0) = 1 - p
P(X=x) = p^x \, (1-p)^{1-x}
\mathbb{E}[X] = p
P(X = x_i) = p_i
The uniform distribution has an equal and constant probability of returning values between a and b, never outside this range.
It is parameterized by two parameters:
the start of the range a.
the end of the range b.
Its support is [a, b].
f(x; a, b) = \frac{1}{b - a}
For continuous distributions, the normal distribution is the most frequently encountered one.
It is parameterized by two parameters:
the mean \mu.
the variance \sigma^2 (or standard deviation \sigma).
Its support is \Re.
f(x; \mu, \sigma) = \frac{1}{\sqrt{2\,\pi\,\sigma^2}} \, e^{-\displaystyle\frac{(x - \mu)^2}{2\,\sigma^2}}
The exponential distribution is the probability distribution of the time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate.
It is parameterized by one parameter, the rate \lambda.
Its support is \Re^+ (x > 0).
f(x; \lambda) = \lambda \, e^{-\lambda \, x}
Let’s now suppose that we have two random variables X and Y with different probability distributions P(X) and P(Y).
The joint probability P(X, Y) denotes the probability of observing the realizations x and y at the same time:
P(X=x, Y=y)
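If the two random variables are independent, the joint probability is simply the product of the individual probabilities:
P(X=x, Y=y) = P(X=x) \, P(Y=y)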
P(X=x) = \sum_y P(X=x, Y=y)
f(x) = \int f(x, y) \, dy
A useful piece of information relating two random variables is the conditional probability.
P(X=x | Y=y) is the conditional probability that X=x, given that Y=y is observed.
Y=y is not random anymore: it is a fact (at least theoretically).
You wonder what happens to the probability distribution of X now that you know the value of Y.
Conditional probabilities are linked to the joint probability by:
P(X=x | Y=y) = \frac{P(X=x, Y=y)}{P(Y=y)}
If X and Y are independent, we have P(X=x | Y=y) = P(X=x) (knowing Y does not change the probability distribution of X).
We can use the same notation for the complete probability distributions:
P(X | Y) = \frac{P(X, Y)}{P(Y)}
You ask 50 people whether they like cats or dogs: 18 like both, 21 like only dogs, 5 like only cats and the remaining 6 like neither.
We consider loving cats and dogs as random variables (and that our sample size is big enough to use probabilities…)
We have P(\text{dog}) = \dfrac{18+21}{50} and P(\text{cat}) = \dfrac{18+5}{50}.
Among the 23 who love cats, which proportion also loves dogs?
The joint probability of loving both cats and dogs is P(\text{cat}, \text{dog}) = \dfrac{18}{50}.
The conditional probability of loving dogs given one loves cats is:
P(\text{dog} | \text{cat}) = \dfrac{P(\text{cat}, \text{dog})}{P(\text{cat})} = \dfrac{\dfrac{18}{50}}{\dfrac{23}{50}} = \dfrac{18}{23}
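The same computation can be written as a short sketch, using the counts that appear in the fractions above (18 people love both, 21 only dogs, 5 only cats, out of 50). The variable names are hypothetical.

```python
from fractions import Fraction

n_total = 50
n_both, n_dog_only, n_cat_only = 18, 21, 5

p_dog = Fraction(n_both + n_dog_only, n_total)
p_cat = Fraction(n_both + n_cat_only, n_total)
p_cat_and_dog = Fraction(n_both, n_total)      # joint probability

# Conditional probability = joint probability / probability of the condition
p_dog_given_cat = p_cat_and_dog / p_cat        # 18/23
p_cat_given_dog = p_cat_and_dog / p_dog        # 18/39
```

Note that the two conditional probabilities differ: conditioning is not symmetric.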
P(X, Y) = P(X | Y) \, P(Y) = P(Y | X) \, P(X)
we obtain Bayes’ rule:
P(Y | X) = \frac{P(X|Y) \, P(Y)}{P(X)}
It is very useful when you already know P(X|Y) and want to obtain P(Y|X) (Bayesian inference).
P(Y | X) is called the posterior probability.
P(X | Y) is called the likelihood.
P(Y) is called the prior probability (belief).
P(X) is called the model evidence or marginal likelihood.
P(D=1)= 0.1 \qquad \qquad P(D=0)=0.9
P(T=1 | D=1) = 0.8 \qquad \qquad P(T=0 | D=1) = 0.2
P(T=1 | D=0) = 0.1 \qquad \qquad P(T=0 | D=0) = 0.9
\begin{aligned} P(D=1|T=1) &= \frac{P(T=1 | D=1) \, P(D=1)}{P(T=1)} \\ &\\ &= \frac{P(T=1 | D=1) \, P(D=1)}{P(T=1 | D=1) \, P(D=1) + P(T=1 | D=0) \, P(D=0)} \\ &\\ &= \frac{0.8 \times 0.1}{0.8 \times 0.1 + 0.1 \times 0.9} \\ &\\ & = 0.47 \\ \end{aligned}
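The computation above can be sketched as a small function. The interpretation of D as a condition and T as a test result, as well as the parameter names, are assumptions made for readability; only the numerical values come from the example.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(D=1 | T=1) using Bayes' rule.

    prior: P(D=1)
    likelihood: P(T=1 | D=1)
    false_positive_rate: P(T=1 | D=0)
    """
    # Evidence P(T=1), obtained by marginalizing over the two values of D
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence

p = posterior(prior=0.1, likelihood=0.8, false_positive_rate=0.1)  # about 0.47
```

Even with a positive observation T=1, the posterior stays below 0.5 because the prior P(D=1) is low.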
In ML, we will deal with random variables whose exact probability distribution is unknown, but we are interested in their expectation or variance anyway.
Random sampling or Monte Carlo sampling (MC) consists of taking N samples x_i out of the distribution X (discrete or continuous) and computing the sample average:
\mathbb{E}[X] = \mathbb{E}_{x \sim X} [x] \approx \frac{1}{N} \, \sum_{i=1}^N x_i
Law of large numbers
As the number of identically distributed, randomly generated variables increases, their sample mean (average) approaches their theoretical mean.
MC estimates are only correct when:
the samples are i.i.d (independent and identically distributed):
independent: the samples must be unrelated to each other.
identically distributed: the samples must come from the same distribution X.
the number of samples is large enough. A common rule of thumb is N > 30 for simple distributions.
\mathbb{E}[f(X)] = \mathbb{E}_{x \sim X} [f(x)] \approx \frac{1}{N} \, \sum_{i=1}^N f(x_i)
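As an illustration, Monte Carlo estimates of \mathbb{E}[X] and \mathbb{E}[f(X)] for a fair dice can be sketched as follows. The sampler `roll`, the seed and the number of samples are arbitrary choices of this example.

```python
import random

def mc_expectation(sample, f=lambda x: x, n=100_000):
    """Average f over n i.i.d. samples drawn by calling sample()."""
    return sum(f(sample()) for _ in range(n)) / n

rng = random.Random(0)   # fixed seed, only to make the sketch reproducible

def roll():
    """One throw of a fair six-sided dice."""
    return rng.randint(1, 6)

mean_estimate = mc_expectation(roll)                       # close to 3.5
second_moment = mc_expectation(roll, f=lambda x: x ** 2)   # close to 91 / 6
```

The estimates fluctuate around the true values and get closer as the number of samples grows.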
Suppose we have an unknown distribution X with expected value \mu = \mathbb{E}[X] and variance \sigma^2.
We can take randomly N samples from X to compute the sample average:
S_N = \frac{1}{N} \, \sum_{i=1}^N x_i
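The central limit theorem (CLT) states that, for a large enough N, the sample averages are approximately normally distributed with mean \mu and variance \frac{\sigma^2}{N} (i.e. standard deviation \frac{\sigma}{\sqrt{N}}).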
S_N \sim \mathcal{N}(\mu, \frac{\sigma}{\sqrt{N}})
If we perform the sampling multiple times, even with few samples, the average of the sampling averages will be very close to the expected value.
The more samples we get, the smaller the variance of the estimates.
Although the distribution X can be anything, the sampling averages are approximately normally distributed (for a large enough N).
\mathbb{E}(S_N) = \mathbb{E}(X)
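The behavior of the sample averages can be observed in a small simulation. This sketch assumes X is uniform on [0, 1] (mean 1/2, variance 1/12), which is not normal at all; the values of N, the number of repetitions and the seed are arbitrary.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed for reproducibility
N = 30                    # samples per average
n_repeats = 10_000        # number of sample averages S_N

# Each entry is one sample average S_N of N uniform samples
averages = [statistics.fmean(rng.random() for _ in range(N))
            for _ in range(n_repeats)]

mean_of_averages = statistics.fmean(averages)    # close to mu = 0.5
var_of_averages = statistics.variance(averages)  # close to (1/12) / N
```

A histogram of `averages` would look bell-shaped, even though the individual samples are uniformly distributed.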
An estimator is a random variable used to measure parameters of a distribution (e.g. its expectation). The problem is that estimators can generally be biased.
Take the example of a thermometer M measuring the temperature T. T is a random variable (normally distributed with \mu=20 and \sigma=10) and the measurements M relate to the temperature with the relation:
M = 0.95 \, T + 0.65
The thermometer is not perfect, but do random measurements allow us to estimate the expected value of the temperature?
We could repeatedly take 100 random samples of the thermometer and see what the distribution of sample averages looks like:
\mathbb{E}[M] = \mathbb{E}[0.95 \, T + 0.65] = 0.95 \, \mathbb{E}[T] + 0.65 = 19.65 \neq \mathbb{E}[T]
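The thermometer experiment can be simulated with a short sketch. The number of repetitions (2000) and the seed are arbitrary; the parameters of T and the relation between M and T are those of the example.

```python
import random
import statistics

rng = random.Random(1)        # fixed seed for reproducibility
mu_T, sigma_T = 20.0, 10.0    # true temperature distribution

def measure():
    """One thermometer reading M = 0.95 T + 0.65 for a random temperature T."""
    T = rng.gauss(mu_T, sigma_T)
    return 0.95 * T + 0.65

# 2000 sample averages, each computed over 100 measurements
sample_averages = [statistics.fmean(measure() for _ in range(100))
                   for _ in range(2000)]

estimated_mean = statistics.fmean(sample_averages)   # close to 19.65
bias = estimated_mean - mu_T                         # close to -0.35
```

Taking more samples reduces the spread of the sample averages, but not the systematic error: the estimates concentrate around 19.65 instead of 20.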
Let’s note \theta a parameter of a probability distribution X that we want to estimate (it does not have to be its mean).
An estimator \hat{\theta} is a random variable mapping the sample space of X to a set of sample estimates.
The bias of an estimator is the mean error made by the estimator:
\mathcal{B}(\hat{\theta}) = \mathbb{E}[\hat{\theta} - \theta] = \mathbb{E}[\hat{\theta}] - \theta
\text{Var}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}] )^2]
Ideally, we would like estimators with:
low bias: the estimations are correct on average (= equal to the true parameter).
low variance: we do not need many estimates to get a correct estimate (CLT: \frac{\sigma}{\sqrt{N}})
Unfortunately, the perfect estimator does not exist.
Estimators will have a bias and a variance:
Bias: the estimated values will be wrong, and the policy not optimal.
Variance: we will need a lot of samples (trial and error) to have correct estimates.
One usually talks of a bias/variance trade-off: reducing the bias of an estimator typically increases its variance, and vice versa.
In machine learning, a high bias corresponds to underfitting, a high variance to overfitting.
Information theory (Claude Shannon) asks how much information is contained in a probability distribution.
Information is related to surprise or uncertainty: are the outcomes of a random variable surprising?
Almost certain outcomes (P \sim 1) are not surprising because they happen all the time.
Almost impossible outcomes (P \sim 0) are very surprising because they are very rare.
I (x) = - \log P(X = x)
Depending on which log is used, self-information has different units:
\log_2: bits or shannons.
\log_e = \ln: nats.
Changing the base only rescales the values by a constant factor, so the choice of the base does not matter in practice.
H(X) = \mathbb{E}_{x \sim X} [I(x)] = \mathbb{E}_{x \sim X} [- \log P(X = x)]
H(X) = - \sum_x P(x) \, \log P(x)
H(X) = - \int_x f(x) \, \log f(x) \, dx
The entropy of a Bernoulli variable is maximal when both outcomes are equiprobable.
If a variable is deterministic, its entropy is minimal and equal to zero.
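Both statements can be checked with a minimal sketch of the discrete entropy. It uses the usual convention that outcomes with zero probability contribute nothing to the sum; the function name and the example probabilities are choices made for the illustration.

```python
import math

def entropy(probs, base=2):
    """H(X) = - sum_x P(x) log P(x), skipping outcomes with P(x) = 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

h_fair = entropy([0.5, 0.5])            # 1 bit: maximal for a binary variable
h_biased = entropy([0.9, 0.1])          # less than 1 bit
h_deterministic = entropy([1.0, 0.0])   # 0: no uncertainty at all
```

Passing `base=math.e` gives the same quantities in nats instead of bits.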
H(X, Y) = \mathbb{E}_{x \sim X, y \sim Y} [- \log P(X=x, Y=y)]
H(X | Y) = \mathbb{E}_{x \sim X, y \sim Y} [- \log P(X=x | Y=y)] = \mathbb{E}_{x \sim X, y \sim Y} [- \log \frac{P(X=x , Y=y)}{P(Y=y)}]
H(X, Y) = H(X) + H(Y) \qquad \text{or} \qquad H(X | Y) = H(X)
H(X | Y) = H(X, Y) - H(Y)
H(Y |X) = H(X |Y) + H(Y) - H(X)
I(X, Y) = H(X) - H(X | Y) = H(Y) - H(Y | X)
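It measures how much information the variable X holds on Y:
If the two variables are independent, knowing Y gives no information on X (H(X | Y) = H(X)): I (X, Y) = 0
If the two variables are dependent, knowing Y reduces the uncertainty on X: I (X, Y) > 0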
If you can fully predict X when you know Y, it becomes deterministic (H(X|Y)=0) so the mutual information is maximal (I(X, Y) = H(X)).
H(X, Y) = \mathbb{E}_{x \sim X}[- \log P(Y=x)]
Beware that the notation H(X, Y) is the same as the joint entropy, but it is a different concept!
The cross-entropy measures the negative log-likelihood that a sample x taken from the distribution X could also come from the distribution Y.
More exactly, it measures how many bits of information one would need to distinguish the two distributions X and Y.
If the two distributions are the same almost everywhere, one cannot distinguish samples from the two distributions: the cross-entropy is minimal and equal to the entropy H(X).
If the two distributions are completely different, one can tell whether a sample Z comes from X or Y: the cross-entropy is much larger than H(X).
\text{KL}(X ||Y) = \mathbb{E}_{x \sim X}[- \log \frac{P(Y=x)}{P(X=x)}]
\text{KL}(X ||Y) = H(X, Y) - H(X)
If the two distributions are the same almost everywhere: \text{KL}(X ||Y) = 0
If the two distributions are different: \text{KL}(X ||Y) > 0
Minimizing the KL between two distributions is the same as making the two distributions “equal”.
Note that the KL is not a metric, as it is not symmetric: in general, \text{KL}(X ||Y) \neq \text{KL}(Y ||X).
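The relation between entropy, cross-entropy and KL divergence can be illustrated on two discrete distributions. This is a sketch using natural logarithms (nats); the two example distributions p and q are arbitrary, and q is assumed to be non-zero wherever p is non-zero.

```python
import math

def entropy(p):
    """H(X) = - sum_x P(x) log P(x), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(X, Y) = - sum_x P_X(x) log P_Y(x), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(X || Y) = H(X, Y) - H(X)."""
    return cross_entropy(p, q) - entropy(p)

p = [0.5, 0.5]
q = [0.9, 0.1]
kl_pq = kl(p, q)   # positive: the distributions differ
kl_qp = kl(q, p)   # positive too, but a different value (no symmetry)
kl_pp = kl(p, p)   # zero: identical distributions
```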