Neurocomputing

Optimization

Julien Vitay

Professur für Künstliche Intelligenz - Fakultät für Informatik

1 - Optimization

Machine learning = Optimization

  • Machine learning is all about optimization:

    • Supervised learning minimizes the error between the prediction and the data.

    • Unsupervised learning maximizes the fit between the model and the data.

    • Reinforcement learning maximizes the collection of rewards.

  • The function to be optimized is called the objective function, cost function or loss function.

  • ML searches for the values of the free parameters that optimize the objective function on the data set.

  • The simplest optimization method is the gradient descent (or ascent) method.

Analytical optimization

  • The easiest method to find the extremum of a function f(x) is to look where its first derivative is equal to 0:

x^* = \arg\min_x f(x) \Leftrightarrow f'(x^*) = 0 \; \text{and} \; f''(x^*) > 0

x^* = \arg\max_x f(x) \Leftrightarrow f'(x^*) = 0 \; \text{and} \; f''(x^*) < 0

  • The sign of the second-order derivative tells us whether the extremum is a maximum or a minimum (see the sketch below).
  • There can be multiple minima or maxima (or none) depending on the function.

    • The “best” minimum (with the lowest value among all minima) is called the global minimum.

    • The others are called local minima.
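As an illustration, here is a minimal sketch of analytical optimization in Python using sympy (assuming it is available); the function f(x) = x^3 - 3x is an arbitrary example with one local maximum and one local minimum:

```python
import sympy as sp

x = sp.Symbol('x')
f = x**3 - 3*x  # example function with one local maximum and one local minimum

# Candidate extrema: solve f'(x) = 0
candidates = sp.solve(sp.diff(f, x), x)  # [-1, 1]

# Classify each candidate with the sign of the second derivative
for c in candidates:
    curvature = sp.diff(f, x, 2).subs(x, c)
    kind = "minimum" if curvature > 0 else "maximum"
    print(f"x = {c}: {kind} (f'' = {curvature})")
```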

Multivariate optimization

  • A multivariate function is a function of more than one variable, e.g. f(x, y).
  • A point (x^*, y^*) can be an extremum of f only if all partial derivatives are zero at the same time:

\begin{cases} \dfrac{\partial f(x^*, y^*)}{\partial x} = 0 \\ \\ \dfrac{\partial f(x^*, y^*)}{\partial y} = 0 \\ \end{cases}

  • The vector of partial derivatives is called the gradient of the function:

\nabla_{x, y} \, f(x, y) = \begin{bmatrix} \dfrac{\partial f(x, y)}{\partial x} \\ \\ \dfrac{\partial f(x, y)}{\partial y} \end{bmatrix}

  • Finding the extremum of f is searching for the values of (x, y) where the gradient of the function is the zero vector:

\nabla_{x, y} \, f(x^*, y^*) = \begin{bmatrix} 0 \\ 0 \end{bmatrix}

Multivariate optimization: example

  • Let’s consider this function:

f(x, y) = (x - 1)^2 + y^2 + 1

  • Its gradient is:

\nabla_{x, y} \, f(x, y) = \begin{bmatrix} 2 (x -1) \\ 2 y \end{bmatrix}

  • The gradient is equal to 0 when:

\begin{cases} 2 \, (x -1) = 0 \\ 2 \, y = 0 \\ \end{cases}

  • \begin{bmatrix} 1 \\ 0 \end{bmatrix} is the minimum of f.

  • One should check the second-order derivatives (the Hessian matrix) to know whether it is a minimum or a maximum… A numerical sanity check is sketched below.
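As a sanity check, a minimal Python sketch (assuming numpy is available; the finite-difference step h is an arbitrary choice) that approximates the gradient of this example numerically:

```python
import numpy as np

def f(x, y):
    return (x - 1)**2 + y**2 + 1

def gradient(x, y, h=1e-6):
    """Central finite-difference approximation of the gradient of f."""
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return np.array([df_dx, df_dy])

print(gradient(1.0, 0.0))  # ~[0, 0]: the gradient vanishes at the minimum (1, 0)
print(gradient(0.0, 1.0))  # ~[-2, 2]: matches the analytical gradient [2(x-1), 2y]
```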

2 - Gradient descent

Problem with analytical optimization

  • In machine learning, we generally do not have access to the analytical form of the objective function.

  • We therefore cannot compute its derivative and search for where it is 0.

  • However, we have access to its value (and derivative) for certain values, for example:

f(0, 1) = 2 \qquad f'(0, 1) = -1.5

  • We can “ask” the model for as many values as we want, but we never get its analytical form.

  • For most useful problems, the function would be too complex to differentiate anyway.

Euler method

  • Let’s remember the definition of the derivative of a function. The derivative f'(x) is the slope of the tangent to the function at x:

\begin{aligned} f'(x) &= \lim_{h \to 0} \frac{f(x + h) - f(x)}{(x + h) - x} \\ &= \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \end{aligned}

  • If we take h small enough, we have the following approximation:

f(x + h) - f(x) \approx h \, f'(x)

  • We make an error, but it is negligible if h is small enough (Taylor series), as the sketch below illustrates.
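A minimal sketch of this approximation in Python (the function cos and the step sizes are arbitrary choices):

```python
import numpy as np

def f(x):
    return np.cos(x)              # example function

def f_prime(x):
    return -np.sin(x)             # its exact derivative

x = 1.0
for h in [0.1, 0.01, 0.001]:
    exact = f(x + h) - f(x)       # true change of the function
    approx = h * f_prime(x)       # first-order approximation h * f'(x)
    print(f"h = {h}: error = {abs(exact - approx):.2e}")  # shrinks like h**2
```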

Euler method

  • First order approximation:

f(x + h) - f(x) \approx h \, f'(x)

  • If we want x+h to be closer to the minimum than x, we want:

f(x + h) < f(x)

  • We therefore want that:

h \, f'(x) < 0

  • The change h in the value of x must have the opposite sign of f'(x).

    • If the function is increasing at x, the minimum lies to the left of x.

    • If the function is decreasing at x, the minimum lies to the right of x.

Gradient descent

  • Gradient descent (GD) is a first-order method to iteratively find the minimum of a function f(x).
  • It creates a sequence of estimates [x_0, x_1, x_2, \ldots] that converges to a local minimum of f.

  • Each element of the sequence is computed from the previous element and the derivative of the function at that point:

x_{n+1} = x_n + \Delta x = x_n - \eta \, f'(x_n)

  • \eta is a small parameter between 0 and 1 called the learning rate.


Gradient descent algorithm

  • We start with an initially wrong estimate of x: x_0

  • for n \in [0, \infty]:

    • We compute or estimate the derivative of the loss function at x_{n}: f'(x_{n})

    • We compute a new value x_{n+1} for the estimate using the gradient descent update rule:

    \Delta x = x_{n+1} - x_n = - \eta \, f'(x_n)

  • There is theoretically no end to the GD algorithm: we iterate forever and always get closer to the minimum.

  • In practice, the algorithm is stopped when the change \Delta x falls below a threshold, as in the sketch below.
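A minimal sketch of this loop in Python on f(x) = (x - 1)^2 (the learning rate, the initial estimate and the threshold are arbitrary choices):

```python
def f_prime(x):
    return 2 * (x - 1)          # derivative of f(x) = (x - 1)**2

eta = 0.1                       # learning rate
x = 5.0                         # initial (wrong) estimate x_0
threshold = 1e-6

while True:
    delta = -eta * f_prime(x)   # gradient descent update rule
    x += delta
    if abs(delta) < threshold:  # stop when the change is small enough
        break

print(x)  # ~1.0, the minimum
```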


Multivariate gradient descent

  • Gradient descent can be applied to multivariate functions:

\min_{x, y, z} \qquad f(x, y, z)

  • Each variable is updated independently using partial derivatives:

\begin{cases} \Delta x = x_{n+1} - x_{n} = - \eta \, \dfrac{\partial f(x_n, y_n, z_n)}{\partial x} \\ \\ \Delta y = y_{n+1} - y_{n} = - \eta \, \dfrac{\partial f(x_n, y_n, z_n)}{\partial y} \\ \\ \Delta z = z_{n+1} - z_{n} = - \eta \, \dfrac{\partial f(x_n, y_n, z_n)}{\partial z} \end{cases}

  • We can also use vector notation with the gradient operator:

\mathbf{x}_n = \begin{bmatrix} x_n \\ y_n \\ z_n \end{bmatrix} \quad \text{and} \quad \nabla_\mathbf{x} \, f(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial f(x, y, z)}{\partial x} \\ \\ \dfrac{\partial f(x, y, z)}{\partial y} \\ \\ \dfrac{\partial f(x, y, z)}{\partial z} \end{bmatrix}

which gives the update rule implemented in the sketch below:

\Delta \mathbf{x} = - \eta \, \nabla_\mathbf{x} \, f(\mathbf{x}_n)
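A minimal numpy sketch of this vector update (the gradient is hard-coded for the earlier example f(x, y) = (x - 1)^2 + y^2 + 1; eta and the iteration count are arbitrary choices):

```python
import numpy as np

def grad_f(v):
    """Gradient of f(x, y) = (x - 1)**2 + y**2 + 1."""
    x, y = v
    return np.array([2 * (x - 1), 2 * y])

eta = 0.1
v = np.array([4.0, -3.0])     # initial estimate [x_0, y_0]

for _ in range(100):
    v -= eta * grad_f(v)      # Delta x = -eta * grad f(x_n)

print(v)  # ~[1, 0], the minimum
```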


Influence of the learning rate

  • The parameter \eta is called the learning rate (or step size) and regulates the speed of convergence.

  • The choice of the learning rate \eta is critical:

    • If it is too small, the algorithm will need a lot of iterations to converge.

    • If it is too big, the algorithm can oscillate around the desired values without ever converging, or even diverge, as the sketch below illustrates.
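A minimal sketch of this effect on f(x) = (x - 1)^2 (the three learning rates are arbitrary choices):

```python
def step(x, eta):
    return x - eta * 2 * (x - 1)      # one GD step on f(x) = (x - 1)**2

for eta in [0.01, 0.5, 1.1]:          # too small / adequate / too big
    x = 5.0
    for _ in range(20):
        x = step(x, eta)
    print(f"eta = {eta}: x after 20 steps = {x:.4f}")
# eta = 0.01 converges very slowly, eta = 0.5 converges, eta = 1.1 diverges
```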

Optimality of gradient descent

  • Gradient descent is not optimal: it always finds a local minimum, but there is no guarantee that it is the global minimum.

  • The found solution depends on the initial choice of x_0 (see the sketch below). If you initialize the parameters near the global minimum, you are lucky. But how would you know where it is?

  • This will be a big issue in neural networks.
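A minimal sketch of this dependence on x_0, using a function with two minima (the function, learning rate and starting points are arbitrary choices):

```python
def f_prime(x):
    return 4 * x**3 - 4 * x        # f(x) = x**4 - 2*x**2 has minima at -1 and +1

for x0 in [-0.5, 0.5, 2.0]:
    x = x0
    for _ in range(100):
        x -= 0.05 * f_prime(x)     # gradient descent
    print(f"x_0 = {x0}: converges to {x:.3f}")
# x_0 = -0.5 ends in the minimum at -1, the two others in the minimum at +1
```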

3 - Regularization

Regularization

  • Most of the time, a function has many minima, if not infinitely many.

  • As GD only converges to the “closest” local minimum, you are never sure that you get a good solution.

  • Consider the following function:

f(x, y) = (x -1)^2

  • As it does not depend on y, any value of y is a solution: GD will simply keep the initial value y_0.

  • As we will see later, this is something we do not want.


L2 - Regularization

  • We may want to put the additional constraint that x and y should be as small as possible.

  • One possibility is to also minimize the Euclidean norm (or L2-norm) of the vector \mathbf{x} = [x, y]:

\min_{x, y} ||\mathbf{x}||^2 = x^2 + y^2

  • Note that this objective is in contradiction with the original objective: (0, 0) minimizes the norm, but not the function f(x, y).

  • We construct a new function as the sum of f(x, y) and the norm of \mathbf{x}, weighted by the regularization parameter \lambda:

\mathcal{L}(x, y) = f(x, y) + \lambda \, (x^2 + y^2)

L2 - Regularization

  • For a fixed value of \lambda, for example 0.1, we now minimize the following loss function using gradient descent:

\mathcal{L}(x, y) = f(x, y) + \lambda \, (x^2 + y^2)

  • We just need to compute its gradient:

\nabla_{x, y} \, \mathcal{L}(x, y) = \begin{bmatrix} \dfrac{\partial f(x, y)}{\partial x} + 2\, \lambda \, x \\ \\ \dfrac{\partial f(x, y)}{\partial y} + 2\, \lambda \, y \end{bmatrix}

and apply gradient descent iteratively, as in the sketch below:

\Delta \begin{bmatrix} x \\ y \end{bmatrix} = - \eta \, \nabla_{x, y} \, \mathcal{L}(x, y) = - \eta \, \begin{bmatrix} \dfrac{\partial f(x, y)}{\partial x} + 2\, \lambda \, x \\ \\ \dfrac{\partial f(x, y)}{\partial y} + 2\, \lambda \, y \end{bmatrix}
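A minimal sketch of this regularized descent on the degenerate example f(x, y) = (x - 1)^2 from above (eta, lambda and the starting point are arbitrary choices):

```python
import numpy as np

lam, eta = 0.1, 0.1

def grad_L(v):
    """Gradient of L(x, y) = (x - 1)**2 + lam * (x**2 + y**2)."""
    x, y = v
    return np.array([2 * (x - 1) + 2 * lam * x,
                     2 * lam * y])

v = np.array([3.0, 2.0])        # initial estimate [x_0, y_0]
for _ in range(500):
    v -= eta * grad_L(v)

print(v)  # ~[0.909, 0]: pulled slightly away from (1, 0) towards the origin
```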

L2 - Regularization

  • You may notice that the result of the optimization is a bit off: it is not exactly (1, 0).

  • This is because we do not optimize f(x, y) directly, but \mathcal{L}(x, y).

  • Let’s look at the real landscape of the function.

\mathcal{L}(x, y) = f(x, y) + \lambda \, (x^2 + y^2)

L2 - Regularization

  • The optimization with GD works; it is just that the function being optimized is different.

  • The constraint on the Euclidean norm “attracts” or “distorts” the function towards (0, 0).

  • This may seem counter-intuitive, but we will see with deep networks that we can live with it.

  • Let’s now look at what happens when we increase \lambda (to 5.0).

L2 - Regularization

  • Now the result of the optimization is totally wrong: the constraint on the norm completely dominates the optimization process.

\mathcal{L}(x, y) = f(x, y) + \lambda \, (x^2 + y^2)

  • \lambda controls which of the two objectives, f(x, y) or x^2 + y^2, has the priority:

    • When \lambda is small, f(x, y) dominates and the norm of \mathbf{x} can be anything.

    • When \lambda is big, x^2 + y^2 dominates: the parameters will be very small, but f(x, y) can take any value.

  • The right value for \lambda is hard to find. We will later see methods to find an adequate value experimentally.

L1 - Regularization

  • Another form of regularization is L1-regularization, which uses the L1-norm (the sum of absolute values):

\mathcal{L}(x, y) = f(x, y) + \lambda \, (|x| + |y|)

  • Its gradient only depends on the signs of x and y:

\nabla_{x, y} \, \mathcal{L}(x, y) = \begin{bmatrix} \dfrac{\partial f(x, y)}{\partial x} + \lambda \, \text{sign}(x) \\ \\ \dfrac{\partial f(x, y)}{\partial y} + \lambda \, \text{sign}(y) \end{bmatrix}

  • It tends to lead to sparser values of (x, y), i.e. either x or y will end up (close to) 0, as the sketch below illustrates.
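A minimal sketch of the corresponding update, again on f(x, y) = (x - 1)^2 (the parameter values are arbitrary choices). Note that the plain subgradient step makes y oscillate around 0 instead of landing exactly on it; proximal methods give exact zeros, but the sparsifying pull is already visible:

```python
import numpy as np

lam, eta = 0.1, 0.1

def grad_L(v):
    """(Sub)gradient of L(x, y) = (x - 1)**2 + lam * (|x| + |y|)."""
    x, y = v
    return np.array([2 * (x - 1) + lam * np.sign(x),
                     lam * np.sign(y)])

v = np.array([3.0, 2.0])        # initial estimate [x_0, y_0]
for _ in range(1000):
    v -= eta * grad_L(v)

print(v)  # x ~0.95, y within ~eta*lam of 0: a sparse solution
```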
