Neurocomputing

Diffusion Probabilistic Models

Julien Vitay

Professur für Künstliche Intelligenz - Fakultät für Informatik

1 - Denoising Diffusion Probabilistic Model (DDPM)

Generative modeling

  • Generative modeling consists of transforming a simple probability distribution (e.g. a Gaussian) into a more complex one (e.g. the distribution of natural images).

  • Learning such a model makes it easy to sample complex images.

VAEs and GANs transform simple noise into complex distributions

Destroying information is easier than creating it

  • The task of the generator in GANs or VAEs is very hard: going from noise to images in a single pass through a few layers.

  • The opposite direction, degrading images into noise, is extremely easy.

Stochastic processes can destroy information

  • Iteratively adding normal noise to a signal defines a discrete-time stochastic process, whose continuous-time limit is a stochastic differential equation (SDE).

X_t = \sqrt{1 - p} \, X_{t-1} + \sqrt{p} \, \epsilon \qquad\qquad \text{where} \qquad\qquad \epsilon \sim \mathcal{N}(0, 1)

  • Under mild conditions, iterating this process transforms any initial probability distribution into a standard normal distribution.
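
This convergence can be checked numerically. A minimal NumPy sketch, where the noise rate p, the number of steps T and the initial uniform distribution are all arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.02     # noise rate per step (arbitrary choice for illustration)
T = 1000     # number of noising steps

# Start from an arbitrary, clearly non-Gaussian distribution.
X = rng.uniform(-5.0, 5.0, size=100_000)

for t in range(T):
    eps = rng.standard_normal(X.shape)
    X = np.sqrt(1 - p) * X + np.sqrt(p) * eps

# After enough steps, the samples are close to a standard normal.
print(X.mean(), X.std())   # approximately 0.0 and 1.0
```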

Diffusion process

  • A diffusion process can iteratively destroy all information in an image through a Markov chain.

  • In a Markov chain, each step depends only on the previous state, through a transition probability p(X_t | X_{t-1}).

Probabilistic diffusion models

  • It should be possible to reverse each diffusion step by removing the noise using a form of denoising autoencoder.

Reminder: Denoising autoencoder

  • A denoising autoencoder (DAE) is trained with noisy inputs but clean desired outputs: it learns to suppress the noise (Vincent et al., 2010).
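
For concreteness, here is a minimal PyTorch sketch of a DAE training step; the architecture, input size and noise level are placeholders:

```python
import torch
import torch.nn as nn

# Minimal DAE: the architecture, input size (e.g. flattened 28x28 images)
# and noise level are placeholders.
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 784),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x_clean, noise_std=0.3):
    # Corrupt the input, but regress towards the clean target.
    x_noisy = x_clean + noise_std * torch.randn_like(x_clean)
    loss = nn.functional.mse_loss(model(x_noisy), x_clean)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```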

Forward diffusion process

  • The forward process iteratively corrupts the image using q(x_t | x_{t-1}) for T steps (e.g. T=1000).

  • The goal is to learn a reverse process p_\theta(x_{t-1} | x_t) that approximates the true q(x_{t-1} | x_t).

Forward diffusion process

  • The forward diffusion process iteratively adds Gaussian noise with a fixed schedule \beta_t:

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t I)

x_t = \sqrt{1 - \beta_t} \, x_{t-1} + \sqrt{\beta_t} \, \epsilon \;\; \text{where} \; \epsilon \sim \mathcal{N}(0, I)

  • \mu_t = \sqrt{1 - \beta_t} \, x_{t-1} is the mean of the distribution and \Sigma_t = \beta_t \, I its covariance.

  • The parameter \beta_t follows a fixed schedule, typically increasing with t: the last steps of the forward process are already close to pure noise, so adding more noise there does not destroy much more information. This motivates the smoother cosine schedule of Nichol and Dhariwal (2021).

Source: Nichol and Dhariwal (2021) Improved Denoising Diffusion Probabilistic Models. arXiv:2102.09672

  • Nice property: each image x_t is also directly a noisy version of the original image x_0:

q(x_t | x_{0}) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, (1 - \bar{\alpha}_t) \, I)

x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon_t \;\; \text{where} \; \epsilon_t \sim \mathcal{N}(0, I)

with \alpha_t = 1 - \beta_t and \bar{\alpha}_t = \prod_{s=1}^t \alpha_s depending only on the history of \beta_t.

  • We do not need to perform t noising steps on x_0 to obtain x_t!
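
This closed form makes the forward process trivial to simulate. A minimal PyTorch sketch, assuming the linear schedule of Ho et al. (2020), where \beta goes from 10^{-4} to 0.02 over T = 1000 steps:

```python
import torch

T = 1000
# Linear schedule of Ho et al. (2020): beta goes from 1e-4 to 0.02.
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas                          # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)     # \bar{\alpha}_t = prod_s alpha_s

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in one shot, without t sequential noising steps."""
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(alpha_bars[t]) * x0 + torch.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps
```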

Reverse diffusion model

  • The goal of the reverse diffusion process is to find a parameterized model p_\theta explaining the sequence of images backwards in time:

p_\theta(x_{0:T}) = p(x_T) \, \prod_{t=1}^T p_\theta(x_{t-1} | x_t)

where:

p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))

  • The reverse process is also normally distributed, provided the noise increments \beta_t are small enough.

Source: Ho et al. (2020) Denoising Diffusion Probabilistic Models arXiv:2006.11239

Denoising Diffusion Probabilistic Model

  • By doing Bayesian inference on the true posterior q(x_{t-1} | x_t, x_0) = \mathcal{N}(x_{t-1}; \mu_t, \Sigma_t), Ho et al. (2020) showed that:

\begin{cases} \mu_t = \dfrac{1}{\sqrt{\alpha_t}} \, \left(x_t - \dfrac{1-\alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_t\right) \\ \\ \Sigma_t = \dfrac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_{t}} \, \beta_t \, I = \bar{\beta}_t \, I \\ \end{cases}

  • The reverse process is also normally distributed, provided the forward noise \beta_t was not too big.

  • The reverse variance depends only on the schedule of \beta_t, so it can be pre-computed.

Source: Ho et al. (2020) Denoising Diffusion Probabilistic Models arXiv:2006.11239
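
Since these quantities depend only on the \beta_t schedule, they can be tabulated once before training or sampling. A sketch under the same linear-schedule assumption as before:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)         # same linear schedule assumption
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)
# \bar{\alpha}_{t-1}, with the convention \bar{\alpha}_0 = 1
alpha_bars_prev = torch.cat([torch.ones(1), alpha_bars[:-1]])

# Reverse variance \bar{\beta}_t = (1 - \bar{\alpha}_{t-1}) / (1 - \bar{\alpha}_t) * \beta_t
posterior_variance = (1.0 - alpha_bars_prev) / (1.0 - alpha_bars) * betas
```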

Denoising Diffusion Probabilistic Model

  • The reverse model p_\theta(x_{t-1} | x_t) only needs to approximate the mean:

\mu_\theta(x_t, t) = \dfrac{1}{\sqrt{\alpha_t}} \, (x_t - \dfrac{1-\alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t))

  • x_t is an input to the model, so it does not have to be predicted.

  • All we need to learn is the noise \epsilon_\theta(x_t, t) \approx \epsilon_t that was added to the original image x_0 to obtain x_t:

x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon_t

Source: Ho et al. (2020) Denoising Diffusion Probabilistic Models arXiv:2006.11239

Denoising Diffusion Probabilistic Model

  • We want to predict the added noise in the image space:

\epsilon_\theta(x_t, t) = \epsilon_\theta(\sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon_t, t) \approx \epsilon_t

  • We can simply minimize the mean squared error (MSE) with the true noise:

\begin{aligned} \mathcal{L}(\theta) &= \mathbb{E}_{t \sim [1, T], x_0, \epsilon_t} [(\epsilon_t - \epsilon_\theta(x_t, t))^2] \\ &= \mathbb{E}_{t \sim [1, T], x_0, \epsilon_t} [(\epsilon_t - \epsilon_\theta(\sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon_t, t) )^2] \\ \end{aligned}

  • We only need to sample an image x_0, a time step t and a noise \epsilon_t \sim \mathcal{N}(0, I), predict the noise \epsilon_\theta(x_t, t) and minimize the MSE, as sketched below!
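
A sketch of this training step in PyTorch; the model is assumed to be any network (e.g. a U-Net) taking (x_t, t) and returning a tensor of the same shape as x_t:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, alpha_bars):
    """One DDPM training step on a minibatch x0: regress the added noise."""
    T = len(alpha_bars)
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)      # random time steps
    eps = torch.randn_like(x0)                                     # true noise
    a_bar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))        # broadcast over image dims
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps   # closed-form q(x_t | x_0)
    return F.mse_loss(model(x_t, t), eps)                          # predict the noise
```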

  • The neural network used for the reverse diffusion is usually some kind of U-Net with attentional layers, or even a vision Transformer.

Source: Ronneberger et al. (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597

Denoising Diffusion Probabilistic Model

  • Training can be done on individual samples: there is no need to unroll the whole Markov chain to create minibatches.

Denoising Diffusion Probabilistic Model

  • The reverse diffusion occurs iteratively backwards in time:

x_{t-1} = \dfrac{1}{\sqrt{\alpha_t}} \, \left(x_t - \dfrac{1-\alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t)\right) + \sigma_t \, z \;\; \text{where} \; z \sim \mathcal{N}(0, I)
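
A sketch of the full sampling loop, using the simple choice \sigma_t^2 = \beta_t proposed by Ho et al. (2020) and the same assumed model interface as in the training sketch:

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas):
    """Reverse diffusion: start from pure noise x_T and denoise step by step."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                        # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x) # no noise at the last step
        eps = model(x, torch.full((shape[0],), t))
        mean = (x - (1.0 - alphas[t]) / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        x = mean + torch.sqrt(betas[t]) * z                       # simple choice sigma_t^2 = beta_t
    return x
```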

2 - Dall-e 2

Dall-e 2

  • Text-to-image generators such as Dall-e, Midjourney, or Stable Diffusion combine large language models (for text embeddings) with diffusion models (for image generation).

  • CLIP embeddings of texts and images are first learned using contrastive learning.

  • A conditional diffusion process (GLIDE) then uses the image embeddings to produce images.

Source: Ramesh et al. (2022)

CLIP: Contrastive Language-Image Pre-training

  • Embeddings for text and images are learned using Transformer encoders and contrastive learning.

  • For each (text, image) pair in the training set, the two representations are trained to be similar to each other, while being dissimilar from those of all other pairs.
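
A sketch of this symmetric contrastive loss; the fixed temperature value is a simplification, as CLIP actually learns it during training:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching (image, text) pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature             # pairwise cosine similarities
    targets = torch.arange(len(logits), device=logits.device)   # matching pairs are on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```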

GLIDE

  • DDPMs generate images from raw noise, but there is no control over which image will emerge.

  • GLIDE (Guided Language to Image Diffusion for Generation and Editing) is a DDPM conditioned on a latent representation of a caption c.

  • As with cGANs and cVAEs, the caption c is provided as an additional input to the learned model (see the sketch after this list):

\epsilon_\theta(x_t, t, c) \approx \epsilon_t

  • Text embeddings can be obtained from any NLP model, for example a Transformer.
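
A sketch of the conditional loss; compared to the unconditional DDPM loss sketched earlier, the only change is that the (assumed) model also takes the caption embedding as input:

```python
import torch
import torch.nn.functional as F

def glide_loss(model, x0, caption_emb, alpha_bars):
    """Conditional DDPM loss: identical to the unconditional one,
    except that the model also receives the caption embedding c."""
    T = len(alpha_bars)
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    return F.mse_loss(model(x_t, t, caption_emb), eps)   # epsilon_theta(x_t, t, c)
```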

Dall-e 2

  • In Dall-e 2, the prior network learns to map text embeddings to image embeddings.

  • After CLIP training, the two embeddings are already close to each other, but the authors found that the diffusion process works better when the image embeddings change during the diffusion.

  • The image embedding is then used as the condition for GLIDE.

References

Ho, J., Jain, A., and Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. doi:10.48550/arXiv.2006.11239.
Nichol, A., and Dhariwal, P. (2021). Improved Denoising Diffusion Probabilistic Models. doi:10.48550/arXiv.2102.09672.
Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., et al. (2022). GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. doi:10.48550/arXiv.2112.10741.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. (2022). Hierarchical Text-Conditional Image Generation with CLIP Latents. doi:10.48550/arXiv.2204.06125.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. doi:10.48550/arXiv.1505.04597.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. J. Mach. Learn. Res. 11, 3371–3408.