Diffusion Probabilistic Models
Professur für Künstliche Intelligenz - Fakultät für Informatik
Ho et al. (2020). Denoising Diffusion Probabilistic Models. arXiv:2006.11239.
Generative modeling consists of transforming a simple probability distribution (e.g. a Gaussian) into a more complex one (e.g. the distribution of natural images).
Once this model is learned, sampling complex images becomes easy.
The task of the generator in a GAN or of the decoder in a VAE is very hard: going from noise to images in a few layers.
The opposite direction, going from images to noise, is extremely easy.
X_t = \sqrt{1 - p} \, X_{t-1} + \sqrt{p} \, \sigma \qquad\qquad \text{where} \qquad\qquad \sigma \sim \mathcal{N}(0, 1)
A diffusion process can iteratively destroy all information in an image through a Markov chain.
A Markov chain means that each step depends only on the previous state, governed by a transition distribution p(X_t | X_{t-1}).
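A minimal numpy sketch of this recursion (the fixed noise level p and the random vector standing in for an image are illustrative assumptions):

```python
import numpy as np

# Iterate the recursion X_t = sqrt(1-p) X_{t-1} + sqrt(p) sigma with a fixed p.
rng = np.random.default_rng(0)
p = 0.02
x = rng.standard_normal(64 * 64)          # stand-in for a flattened image X_0

for t in range(1000):
    sigma = rng.standard_normal(x.shape)  # sigma ~ N(0, 1), drawn fresh at every step
    x = np.sqrt(1.0 - p) * x + np.sqrt(p) * sigma

# After enough steps, x is indistinguishable from pure Gaussian noise.
print(round(x.mean(), 2), round(x.std(), 2))   # roughly 0.0 and 1.0
```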
Vincent et al. (2010). Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. JMLR.
The forward process iteratively corrupts the image using q(x_t | x_{t-1}) for T steps (e.g. T=1000).
The goal is to learn a reverse process p_\theta(x_{t-1} | x_t) that approximates the true q(x_{t-1} | x_t).
q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t I)
x_t = \sqrt{1 - \beta_t} \, x_{t-1} + \sqrt{\beta_t} \, \epsilon \;\; \text{where} \; \epsilon \sim \mathcal{N}(0, I)
The parameter \beta_t follows an increasing schedule (e.g. linearly from 10^{-4} to 0.02 in Ho et al., 2020), as adding more noise at the end of the chain does not destroy much more information.
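A sketch of the forward corruption under these assumptions (linear schedule from 10^{-4} to 0.02, T = 1000, a random array standing in for an image):

```python
import numpy as np

# Forward corruption q(x_t | x_{t-1}) with a linear beta schedule.
T = 1000
betas = np.linspace(1e-4, 0.02, T)

def forward_step(x_prev, t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - betas[t]) * x_prev + np.sqrt(betas[t]) * eps

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 32, 32))      # stand-in for an image x_0
for t in range(T):                        # run the whole chain x_0 -> x_T
    x = forward_step(x, t, rng)
```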
Nice property: each image x_t is also a noisy version of the original image x_0:
q(x_t | x_{0}) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, (1 - \bar{\alpha}_t) \, I)
x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon_t \;\; \text{where} \; \epsilon_t \sim \mathcal{N}(0, I)
with \alpha_t = 1 - \beta_t and \bar{\alpha}_t = \prod_{s=1}^t \alpha_s only depending on the history of \beta_t.
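This closed form allows sampling x_t from x_0 in a single step. A sketch, reusing the same illustrative schedule:

```python
import numpy as np

# Sample x_t directly from x_0 using the closed form above.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)           # alpha_bar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, rng):
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 32, 32))
x500, eps = q_sample(x0, 500, rng)        # one shot, no need to iterate 500 times
```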
p_\theta(x_{0:T}) = p(x_T) \, \prod_{t=1}^T p_\theta(x_{t-1} | x_t)
where:
p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))
\begin{cases} \mu_t = \dfrac{1}{\sqrt{\alpha_t}} \, (x_t - \dfrac{1-\alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_t)\\ \\ \sigma_t^2 \, I = \dfrac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_{t}} \, \beta_t \, I = \bar{\beta}_t \, I \\ \end{cases}
The reverse process is also normally distributed, provided the forward noise \beta_t was not too big.
The reverse variance only depends on the schedule of \beta_t, so it can be pre-computed.
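A sketch of this pre-computation from the (assumed linear) schedule of \beta_t:

```python
import numpy as np

# The reverse variances can be tabulated once from the beta schedule,
# before any training or sampling.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
alpha_bars_prev = np.concatenate([[1.0], alpha_bars[:-1]])

# sigma_t^2 = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t
posterior_variance = (1.0 - alpha_bars_prev) / (1.0 - alpha_bars) * betas
```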
\mu_\theta(x_t, t) = \dfrac{1}{\sqrt{\alpha_t}} \, (x_t - \dfrac{1-\alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t))
x_t is an input to the model, so it does not have to be predicted.
All we need to learn is the noise \epsilon_\theta(x_t, t) \approx \epsilon_t that was added to the original image x_0 to obtain x_t:
x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon_t
\epsilon_\theta(x_t, t) = \epsilon_\theta(\sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon_t, t) \approx \epsilon_t
\begin{aligned} \mathcal{L}(\theta) &= \mathbb{E}_{t \sim [1, T], x_0, \epsilon_t} [(\epsilon_t - \epsilon_\theta(x_t, t))^2] \\ &= \mathbb{E}_{t \sim [1, T], x_0, \epsilon_t} [(\epsilon_t - \epsilon_\theta(\sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon_t, t) )^2] \\ \end{aligned}
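A sketch of one training step of this objective; eps_model is a hypothetical placeholder for the network \epsilon_\theta(x_t, t), not an actual architecture:

```python
import numpy as np

# One training step of the noise-prediction objective.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_model(x_t, t):
    return np.zeros_like(x_t)             # placeholder network, for illustration only

def training_loss(x0, rng):
    t = rng.integers(0, T)                             # t drawn uniformly over the T steps
    eps = rng.standard_normal(x0.shape)                # the noise to be predicted
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)     # MSE on the noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 32, 32))
print(training_loss(x0, rng))
```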
x_{t-1} = \dfrac{1}{\sqrt{\alpha_t}} \, (x_t - \dfrac{1-\alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t)) + \sigma_t \, z \qquad\qquad \text{where} \qquad\qquad z \sim \mathcal{N}(0, I)
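Putting it together, a sketch of the full sampling loop (same illustrative schedule, same placeholder network as in the training sketch):

```python
import numpy as np

# Full reverse (sampling) loop, from pure noise x_T down to x_0.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
alpha_bars_prev = np.concatenate([[1.0], alpha_bars[:-1]])
sigmas = np.sqrt((1.0 - alpha_bars_prev) / (1.0 - alpha_bars) * betas)

def eps_model(x_t, t):
    return np.zeros_like(x_t)             # placeholder network, for illustration only

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 32, 32))      # start from pure noise x_T ~ N(0, I)
for t in reversed(range(T)):
    z = rng.standard_normal(x.shape) if t > 0 else np.zeros_like(x)
    mean = (x - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t]) * eps_model(x, t)) \
        / np.sqrt(alphas[t])
    x = mean + sigmas[t] * z              # x_{t-1}
# x is now a sample from the learned distribution p_theta(x_0)
```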
Ramesh et al. (2022). Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125.
Text-to-image generators such as DALL-E, Midjourney, or Stable Diffusion combine LLMs for text embedding with diffusion models for image generation.
CLIP embeddings of texts and images are first learned using contrastive learning.
A conditional diffusion process (GLIDE) then uses the image embeddings to produce images.
Embeddings for text and images are learned using Transformer encoders and contrastive learning.
For each (text, image) pair in the training set, their representations should be made similar to each other, while being dissimilar from those of all other pairs.
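A sketch of such a symmetric contrastive loss on a batch of pre-computed embeddings; the fixed temperature is an assumption standing in for CLIP's learned one:

```python
import numpy as np

# Symmetric contrastive (InfoNCE-style) loss on a batch of text/image embeddings.
def contrastive_loss(text_emb, image_emb, temperature=0.07):
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = text_emb @ image_emb.T / temperature        # pairwise cosine similarities
    labels = np.arange(len(logits))                      # matching pairs lie on the diagonal
    log_softmax = lambda m: m - np.log(np.exp(m).sum(axis=1, keepdims=True))
    loss_text = -log_softmax(logits)[labels, labels].mean()     # text -> image
    loss_image = -log_softmax(logits.T)[labels, labels].mean()  # image -> text
    return (loss_text + loss_image) / 2

rng = np.random.default_rng(0)
print(contrastive_loss(rng.standard_normal((8, 512)), rng.standard_normal((8, 512))))
```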
DDPMs generate images from raw noise, but there is no control over which image will emerge.
GLIDE (Guided Language to Image Diffusion for Generation and Editing) is a DDPM conditioned on a latent representation of a caption c.
As with cGAN and cVAE, the caption c is provided as an additional input to the learned model:
\epsilon_\theta(x_t, t, c) \approx \epsilon_t
Nichol et al. (2022). GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv:2112.10741.
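A sketch of how conditioning changes the training objective: the noise predictor simply receives the caption embedding c as an extra input (the placeholder model below is a stand-in, not GLIDE's actual U-Net with a Transformer text encoder):

```python
import numpy as np

# Conditional noise-prediction loss: same objective as before, with c as extra input.
def eps_model_cond(x_t, t, c):
    return np.zeros_like(x_t)             # placeholder for epsilon_theta(x_t, t, c)

def conditional_loss(x0, c, alpha_bars, rng):
    t = rng.integers(0, len(alpha_bars))
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model_cond(x_t, t, c)) ** 2)
```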
After CLIP training, the text and image embeddings are already close to each other, but the authors find that generation works better when the image embedding is itself produced by a diffusion prior from the text embedding.
The image embedding is then used as the condition for GLIDE.
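A hedged sketch of the resulting two-stage pipeline; clip_text_encoder, prior, and glide_decoder are hypothetical stand-ins for the trained components:

```python
# Two-stage text-to-image generation as described above.
def generate_image(caption, clip_text_encoder, prior, glide_decoder):
    text_emb = clip_text_encoder(caption)   # CLIP text embedding of the caption
    image_emb = prior(text_emb)             # predict a CLIP image embedding from it
    return glide_decoder(image_emb)         # diffusion decoder conditioned on the embedding
```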
Ramesh et al. (2022). Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125.