Diffusion Probabilistic Models
Professur für Künstliche Intelligenz - Fakultät für Informatik
Ho et al. (2020) Denoising Diffusion Probabilistic Models. arXiv:2006.11239
Generative modeling consists of transforming a simple probability distribution (e.g. a Gaussian) into a more complex one (e.g. the distribution of natural images).
Learning such a model makes it easy to sample complex images.
The task of the generator in a GAN or VAE is very hard: going from noise to images in only a few layers.
The opposite direction, adding noise to an image, is extremely easy:
X_t = \sqrt{1 - p} \, X_{t-1} + \sqrt{p} \, \sigma \qquad\qquad \text{where} \qquad\qquad \sigma \sim \mathcal{N}(0, 1)
A diffusion process can iteratively destroy all information in an image through a Markov chain.
A Markov chain implies that each step depends only on the previous one, governed by the transition probability p(X_t | X_{t-1}).
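As an illustration, here is a minimal sketch of this iterative noising (assuming PyTorch; the noise proportion p and the image size are arbitrary choices, not values from the source):

```python
import torch

p = 0.02                      # fixed noise proportion per step (illustrative value)
x = torch.rand(3, 64, 64)     # stand-in for an image X_0 with values in [0, 1]

for t in range(1000):
    sigma = torch.randn_like(x)                    # sigma ~ N(0, 1), same shape as the image
    x = (1 - p) ** 0.5 * x + p ** 0.5 * sigma      # X_t depends only on X_{t-1} (Markov step)

# after many steps, x is statistically close to pure Gaussian noise
```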
Vincent et al. (2010) “Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion”. JMLR.
The forward process iteratively corrupts the image using q(x_t | x_{t-1}) for T steps (e.g. T=1000).
The goal is to learn a reverse model p_\theta(x_{t-1} | x_t) that approximates the true q(x_{t-1} | x_t).
q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t \, I) \qquad\qquad x_t = \sqrt{1 - \beta_t} \, x_{t-1} + \sqrt{\beta_t} \, \epsilon \;\; \text{where} \; \epsilon \sim \mathcal{N}(0, I)
The parameter \beta_t follows an increasing schedule: more noise can be added towards the end of the chain, as there is little information left to destroy.
Note that each image x_t is also a Gaussian noisy version of the original image x_0:
q(x_t | x_{0}) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, (1 - \bar{\alpha}_t) \, I) \qquad\qquad x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon_t \;\; \text{where} \; \epsilon_t \sim \mathcal{N}(0, I)
with \alpha_t = 1 - \beta_t and \bar{\alpha}_t = \prod_{s=1}^t \alpha_s only depending on the history of \beta_t.
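As a sketch (assuming PyTorch and the linear schedule of Ho et al. (2020), going from \beta_1 = 10^{-4} to \beta_T = 0.02), this closed form allows sampling x_t directly from x_0 in one step:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_t schedule (increasing)
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    return alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * eps

x0 = torch.rand(3, 64, 64)
x_500 = q_sample(x0, t=500, eps=torch.randn_like(x0))   # noisy image halfway through the chain
```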
The learned reverse process is also a Markov chain, running backwards from pure noise x_T \sim \mathcal{N}(0, I):
p_\theta(x_{0:T}) = p(x_T) \, \prod_{t=1}^T p_\theta(x_{t-1} | x_t)
where:
p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))
The reverse transitions are also normally distributed, provided the forward noise steps \beta_t are small enough, with mean and variance:
\begin{cases} \mu_t = \dfrac{1}{\sqrt{\alpha_t}} \, \left(x_t - \dfrac{1-\alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_t\right)\\ \\ \sigma_t^2 \, I = \dfrac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_{t}} \, \beta_t \, I = \bar{\beta}_t \, I \\ \end{cases}
The reverse variance only depends on the schedule of \beta_t, so it can be pre-computed.
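For example (a sketch assuming PyTorch and the same schedule as above), the reverse variances for all steps can be tabulated once:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)
alpha_bars_prev = torch.cat([torch.ones(1), alpha_bars[:-1]])   # alpha_bar_{t-1}, with alpha_bar_0 = 1

# sigma_t^2 = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t
posterior_variance = (1.0 - alpha_bars_prev) / (1.0 - alpha_bars) * betas
sigmas = posterior_variance.sqrt()            # standard deviations used at sampling time
```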
\mu_\theta(x_t, t) = \dfrac{1}{\sqrt{\alpha_t}} \, (x_t - \dfrac{1-\alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t))
x_t is an input to the model, so it does not have to be predicted.
All we need to learn is the noise \epsilon_\theta(x_t, t) \approx \epsilon_t that was added to the original image x_0 to obtain x_t:
x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon_t
\epsilon_\theta(x_t, t) = \epsilon_\theta(\sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon_t, t) \approx \epsilon_t
\begin{aligned} \mathcal{L}(\theta) &= \mathbb{E}_{t \sim [1, T], x_0, \epsilon_t} [(\epsilon_t - \epsilon_\theta(x_t, t))^2] \\ &= \mathbb{E}_{t \sim [1, T], x_0, \epsilon_t} [(\epsilon_t - \epsilon_\theta(\sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon_t, t) )^2] \\ \end{aligned}
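A minimal training step implementing this loss might look as follows (a sketch assuming PyTorch; `eps_model` is a hypothetical network predicting the noise, e.g. a U-Net):

```python
import torch

def training_step(eps_model, x0, alpha_bars, T=1000):
    t = torch.randint(0, T, (x0.shape[0],))                  # t ~ Uniform over the T steps
    eps = torch.randn_like(x0)                               # eps_t ~ N(0, I)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)                  # broadcast over (C, H, W)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps     # closed-form forward sample
    loss = ((eps - eps_model(x_t, t)) ** 2).mean()           # MSE between true and predicted noise
    return loss
```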
Once trained, sampling simply iterates the learned reverse transitions from t = T down to t = 1:
x_{t-1} = \dfrac{1}{\sqrt{\alpha_t}} \, \left(x_t - \dfrac{1-\alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t)\right) + \sigma_t \, z \qquad\qquad \text{where} \qquad\qquad z \sim \mathcal{N}(0, I)
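The full sampling loop, as a sketch (assuming PyTorch, the precomputed schedule tensors `alphas`, `alpha_bars` and `sigmas` from above, and a trained hypothetical noise model `eps_model`):

```python
import torch

@torch.no_grad()
def sample(eps_model, shape, alphas, alpha_bars, sigmas, T=1000):
    x = torch.randn(shape)                                         # start from pure noise x_T ~ N(0, I)
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)  # no extra noise at the last step
        eps_hat = eps_model(x, torch.full((shape[0],), t))         # epsilon_theta(x_t, t)
        mean = (x - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        x = mean + sigmas[t] * z                                   # x_{t-1} = mu_theta(x_t, t) + sigma_t z
    return x                                                       # approximate sample of p(x_0)
```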
Diffusion probabilistic models (DPMs) generate images from raw noise, but there is no control over which image will emerge.
GLIDE (Guided Language to Image Diffusion for Generation and Editing) is a DPM conditioned on a latent representation of a caption c.
As with cGANs and cVAEs, the caption c is provided as an additional input to the learned model:
\epsilon_\theta(x_t, t, c) \approx \epsilon_t
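In practice, the only change with respect to the unconditional loss is that the noise model also receives the caption embedding (a sketch; `eps_model` taking a third argument c is a hypothetical text-conditioned network, not the exact GLIDE architecture):

```python
import torch

def conditional_training_step(eps_model, x0, c, alpha_bars, T=1000):
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return ((eps - eps_model(x_t, t, c)) ** 2).mean()    # epsilon_theta(x_t, t, c) should match eps_t
```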
Nichol et al. (2022) GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv:2112.10741
Ramesh et al. (2022) Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125
CLIP embeddings are first learned using contrastive learning.
A conditional diffusion process (GLIDE) uses the image embeddings to produce images.
DALL-E 3, Midjourney, Stable Diffusion, etc. work on similar principles.
Embeddings for text and images are learned using Transformer encoders and contrastive learning.
For each (text, image) pair in the training set, the two representations should be made similar to each other, while being different from the representations of the other pairs.
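A sketch of such a contrastive objective (a symmetric InfoNCE-style loss over a batch of matching text/image embeddings, assuming PyTorch; the encoders themselves and the exact CLIP training details are omitted):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, img_emb, temperature=0.07):
    text_emb = F.normalize(text_emb, dim=-1)             # unit-norm embeddings
    img_emb = F.normalize(img_emb, dim=-1)
    logits = text_emb @ img_emb.t() / temperature        # cosine similarities between all pairs
    targets = torch.arange(text_emb.shape[0])            # matching pairs lie on the diagonal
    # pull each matching (text, image) pair together, push it away from the other pairs in the batch
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```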