Diffusion Probabilistic Models
Professur für Künstliche Intelligenz - Fakultät für Informatik

Ho et al. (2020) Denoising Diffusion Probabilistic Models. arXiv:2006.11239
Generative modeling consists of transforming a simple probability distribution (e.g. a Gaussian) into a more complex one (e.g. the distribution of natural images).
Learning this transformation allows us to easily sample complex images.
The task of the generator in a GAN or of the decoder in a VAE is very hard: going from noise to images in a few layers.
The other direction (adding noise to an image) is extremely easy.
The forward diffusion process iteratively destroys all information in the image through a Markov chain that adds white noise.
The Markov property means that each step only depends on the previous one, governed by the transition probability q(x_t | x_{t-1}).
\begin{aligned} \text{discrete:} \qquad& x_t = \sqrt{1 - \beta_t} \, x_{t-1} + \sqrt{\beta_t} \, \epsilon \\ & \qquad\qquad\text{where} \qquad \epsilon \sim \mathcal{N}(0, I) \\ &\\ \text{continuous:} \qquad & dx = - \dfrac{1}{2} \, \beta(t) \, x \, dt + \sqrt{\beta(t)} \, dW \\ \end{aligned}
dx = \underbrace{\mu(x, t) \, dt}_{\text{drift}} + \underbrace{\sigma(x, t) \, dW}_{\text{diffusion}}
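As an illustration, here is a minimal sketch (in PyTorch) of one discrete noising step x_{t-1} \rightarrow x_t; the function name and shapes are illustrative assumptions:

```python
import torch

# Minimal sketch of one discrete forward (noising) step, assuming a current
# sample x_prev of shape (batch, channels, height, width) and a noise level
# beta_t in (0, 1).
def forward_step(x_prev: torch.Tensor, beta_t: float) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    eps = torch.randn_like(x_prev)                              # epsilon ~ N(0, I)
    return (1.0 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * eps
```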

Vincent et al. (2010) “Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion”. JMLR.
The forward process iteratively corrupts the image using q(x_t | x_{t-1}) for T steps (e.g. T=1000).
The goal is to learn a reverse process p_\theta(x_{t-1} | x_t) that approximates the true posterior q(x_{t-1} | x_t).
x_t = \sqrt{1 - \beta_t} \, x_{t-1} + \sqrt{\beta_t} \, \epsilon \;\; \text{where} \; \epsilon \sim \mathcal{N}(0, I)
q(x_t | x_{t-1}) = \mathcal{N}(\mu_t = \sqrt{1 - \beta_t} \, x_{t-1}, \Sigma_t^2 = \beta_t \, I)
\mu_t = \sqrt{1 - \beta_t} \, x_{t-1} is the mean of the distribution, \Sigma_t^2 = \beta_t \, I its variance.
The parameter \beta_t follows an increasing schedule (e.g. linearly from 10^{-4} to 0.02 in DDPM): towards the end the sample is almost pure noise, so adding more noise does not destroy much more information.

q(x_t | x_{0}) = \mathcal{N}(\sqrt{\bar{\alpha}_t} \, x_0 , (1 - \bar{\alpha}_t) \, I)
x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon_t \;\; \text{where} \; \epsilon_t \sim \mathcal{N}(0, I)
with \alpha_t = 1 - \beta_t and \bar{\alpha}_t = \prod_{s=1}^t \alpha_s only depending on the history of \beta_t.
The sum of independent Gaussian variables is still Gaussian, so the whole chain of noising steps collapses into a single Gaussian.
We do not need to perform t noising steps on x_0 to obtain x_t!
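A minimal sketch of this closed-form sampling, assuming the linear schedule of Ho et al. (2020) (\beta_t from 10^{-4} to 0.02, T = 1000); all names are illustrative:

```python
import torch

# Jump directly from x_0 to x_t using q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # beta_t, increasing with t
alphas = 1.0 - betas                           # alpha_t
alphas_bar = torch.cumprod(alphas, dim=0)      # alpha_bar_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor):
    """Sample x_t from q(x_t | x_0) in one shot; also return the noise used."""
    eps = torch.randn_like(x0)                                  # epsilon_t ~ N(0, I)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)                     # broadcast over the image
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps, eps
```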
p_\theta(x_{0:T}) = p(x_T) \, \prod_{t=1}^T p_\theta(x_{t-1} | x_t)
where:
p_\theta(x_{t-1} | x_t) = \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t))

\begin{cases} \mu_t = \dfrac{1}{\sqrt{\alpha_t}} \, (x_t - \dfrac{1-\alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_t)\\ \\ \Sigma_t = \dfrac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_{t}} \beta_t \, I = \bar{\beta}_t \, I \\ \end{cases}

\mu_\theta(x_t, t) = \dfrac{1}{\sqrt{\alpha_t}} \, (x_t - \dfrac{1-\alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t))
x_t is an input to the model, so it does not have to be predicted.
All we need to learn is the noise \epsilon_\theta(x_t, t) \approx \epsilon_t that was added to the original image x_0 to obtain x_t:
x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon_t

\epsilon_\theta(x_t, t) = \epsilon_\theta(\sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon_t, t) \approx \epsilon_t
\begin{aligned} \mathcal{L}(\theta) &= \mathbb{E}_{t \sim [1, T], x_0, \epsilon_t} [(\epsilon_t - \epsilon_\theta(x_t, t))^2] \\ &= \mathbb{E}_{t \sim [1, T], x_0, \epsilon_t} [(\epsilon_t - \epsilon_\theta(\sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon_t, t) )^2] \\ \end{aligned}
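A minimal sketch of this training objective, assuming a noise-prediction network model(x_t, t) (e.g. a U-Net) and the linear schedule introduced above; all names are illustrative:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """Simplified DDPM objective: predict the noise added to x_0."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)    # t ~ U(1, T)
    eps = torch.randn_like(x0)                                   # epsilon_t ~ N(0, I)
    a_bar = alphas_bar.to(x0.device)[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps         # closed-form x_t
    return F.mse_loss(model(x_t, t), eps)                        # (epsilon_t - epsilon_theta)^2
```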

Ho et al. (2020) Denoising Diffusion Probabilistic Models. arXiv:2006.11239

p_\theta(x_{t-1} | x_t) = \mathcal{N}(\mu_\theta(x_t, t), \bar{\beta}_t \, I)
x_{t-1} = \dfrac{1}{\sqrt{\alpha_t}} \, (x_t - \dfrac{1-\alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t)) + \sqrt{\bar{\beta}_t} \, \epsilon \;\; \text{where} \; \epsilon \sim \mathcal{N}(0, I)
The last step x_1 \rightarrow x_0 can be done deterministically.
It is possible to use fewer iterations (e.g. 200) by taking bigger steps, but this remains expensive and generates lower-quality images.
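A minimal sketch of the full sampling loop, assuming a trained noise-prediction network model(x_t, t); the schedule, names, and shapes are illustrative:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)
alphas_bar_prev = torch.cat([torch.ones(1), alphas_bar[:-1]])
betas_tilde = (1.0 - alphas_bar_prev) / (1.0 - alphas_bar) * betas   # posterior variance

@torch.no_grad()
def sample(model, shape=(1, 3, 32, 32)) -> torch.Tensor:
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = model(x, torch.full((shape[0],), t))           # epsilon_theta(x_t, t)
        mu = (x - betas[t] / (1.0 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:                                            # the last step x_1 -> x_0 is deterministic
            x = mu + betas_tilde[t].sqrt() * torch.randn_like(x)
        else:
            x = mu
    return x
```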

dx = \left[-\frac{1}{2}\beta(t) x + \frac{\beta(t)}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x, t)\right] dt + \sqrt{\beta(t)} \, d\bar{W}
x_{t-1} = \dfrac{1}{\sqrt{\alpha_t}} \, (x_t - \dfrac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t)) + \bar{\beta}_t \, \epsilon
x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \, (\dfrac{x_t - \sqrt{1- \bar{\alpha}_t} \, \epsilon_\theta(x_t, t)}{ \sqrt{\bar{\alpha}_t}}) + \sqrt{1 - \bar{\alpha}_{t-1}} \, \epsilon_\theta(x_t, t)
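A minimal sketch of this last update rule, which first predicts x_0 from x_t and \epsilon_\theta, then re-noises it to the level t-1; replacing t-1 by an earlier step gives the bigger jumps mentioned above. alphas_bar is the cumulative product defined previously; names are illustrative:

```python
import torch

def deterministic_step(x_t: torch.Tensor, eps_pred: torch.Tensor,
                       alphas_bar: torch.Tensor, t: int, t_prev: int) -> torch.Tensor:
    """One deterministic reverse step: predict x_0, then re-noise it to step t_prev."""
    x0_pred = (x_t - (1.0 - alphas_bar[t]).sqrt() * eps_pred) / alphas_bar[t].sqrt()
    return alphas_bar[t_prev].sqrt() * x0_pred + (1.0 - alphas_bar[t_prev]).sqrt() * eps_pred
```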

Let's now see how to arrive at the same loss function from a variational point of view.
The blog post by Lilian Weng is very helpful: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/, as well as this one by Jonathan Kernes https://towardsdatascience.com/diffusion-models-91b75430ec2.
As we have a generative model (learning the distribution of images x_0), all we want to maximize is the log-likelihood of the model p_\theta(x_0) on the data, which gives us the following NLL loss:
\mathcal{L}(\theta) = \mathbb{E}_{x_0 \sim q(x_0)} [- \log p_\theta(x_0)] = \int - q(x_0) \log p_\theta(x_0) dx_0
The first trick is to rewrite the log-evidence \log p_\theta(x_0) as a function of normalized transition probabilities using the Markov assumption.
Let's first express p_\theta(x_0) by marginalizing the joint probability of the whole sequence over (x_1, \ldots, x_T):
\log p_\theta(x_0) = \log \int p_\theta(x_0, x_1, \ldots, x_T)\, dx_1 \, dx_2 \ldots \, dx_T
p_\theta(x_0, x_1, \ldots, x_T) = p_\theta(x_T) \prod_{t=1}^T p_\theta(x_{t-1} | x_t)
q(x_0, x_1, \ldots, x_T) = q(x_0) \prod_{t=1}^T q(x_{t} | x_{t-1})
q(x_1, \ldots, x_T | x_0) = \prod_{t=1}^T q(x_{t} | x_{t-1})
\log p_\theta(x_0) = \log \int p_\theta(x_T) \prod_{t=1}^T p_\theta(x_{t-1} | x_t) \, dx_{1..T}
\begin{aligned} \log p_\theta(x_0) & = \log \int \dfrac{q(x_1, \ldots, x_T | x_0)}{q(x_1, \ldots, x_T | x_0)} p_\theta(x_T) \prod_{t=1}^T p_\theta(x_{t-1} | x_t) \, dx_{1..T} \\ &\\ &= \log \int q(x_1, \ldots, x_T | x_0) \, p_\theta(x_T) \prod_{t=1}^T \frac{p_\theta(x_{t-1} | x_t)}{q(x_t | x_{t-1})} \, dx_{1..T} \\ &\\ &= \log \mathbb{E}_{x_1, \dots, x_T \sim q(x_1, \ldots, x_T | x_0)} [ p_\theta(x_T) \prod_{t=1}^T \frac{p_\theta(x_{t-1} | x_t)}{q(x_t | x_{t-1})} ] \\ \end{aligned}
The importance sampling trick is a brilliant way to replace pesky integrals with expectations, which can later be sampled. This is exactly the same trick as in RL and policy gradients.
What this expectation means is that we only need to sample the sequence x_1, \ldots, x_T conditioned on x_0 (the input image) following the forward process.
As the logarithm is concave, Jensen's inequality \log \mathbb{E}[X] \geq \mathbb{E}[\log X] gives a lower bound on the log-evidence:
\begin{aligned} \log p_\theta(x_0) &\geq \mathbb{E}_{x_1, \dots, x_T \sim q(x_1, \ldots, x_T | x_0)} [ \log ( p_\theta(x_T) \prod_{t=1}^T \frac{p_\theta(x_{t-1} | x_t)}{q(x_t | x_{t-1})} )] \\ &\\ &= \mathbb{E}_{x_1, \dots, x_T \sim q(x_1, \ldots, x_T | x_0)} [ \log p_\theta(x_T) + \sum_{t=1}^T \log \frac{p_\theta(x_{t-1} | x_t)}{q(x_t | x_{t-1})} ] \\ &\\ &= \text{ELBO}(p_\theta, q, x_0) \end{aligned}
The evidence lower bound (ELBO) or variational lower bound (VLB) is a classical trick in variational inference (including VAE): the goal is to maximize the evidence \log p_\theta(x_0), but it is generally intractable.
We instead maximize its lower bound, as the parameters \theta maximizing the ELBO will also maximize the evidence (see https://yunfanj.com/blog/2021/01/11/ELBO.html for more details).
The ratio of probabilities \log \dfrac{p_\theta(x_{t-1} | x_t)}{q(x_t | x_{t-1})} looks interesting, as it might lead us to a KL divergence, but the conditionals are reversed.
Let's use Bayes' rule to invert q(x_t | x_{t-1}). By the Markov property, additionally conditioning on x_0 changes nothing, and the reversed posterior q(x_{t-1} | x_t, x_0) is tractable (as seen earlier):
q(x_{t}| x_{t-1}) = q(x_{t}| x_{t-1}, x_0) = q(x_{t-1}| x_{t}, x_0) \, \dfrac{q(x_{t} | x_0)}{q(x_{t-1} | x_0)}
\begin{aligned} \log p_\theta(x_0) &\geq \mathbb{E}_{x_1, \dots, x_T \sim q(x_1, \ldots, x_T | x_0)} [ \log p_\theta(x_T) + \sum_{t=2}^T \log \frac{p_\theta(x_{t-1} | x_t)}{q(x_t | x_{t-1})} + \log \dfrac{p_\theta(x_0 | x_1)}{q(x_1 | x_0)}] \\ &\\ &= \mathbb{E}_{x_1, \dots, x_T \sim q(x_1, \ldots, x_T | x_0)} [ \log p_\theta(x_T) + \sum_{t=2}^T \log \frac{p_\theta(x_{t-1} | x_t)}{q(x_{t-1} | x_{t}, x_0)} \, \dfrac{q(x_{t-1} | x_0)}{q(x_t | x_0)}+ \log \dfrac{p_\theta(x_0 | x_1)}{q(x_1 | x_0)}] \\ &\\ &= \mathbb{E}_{x_1, \dots, x_T \sim q(x_1, \ldots, x_T | x_0)} [ \log p_\theta(x_T) + \sum_{t=2}^T \log \frac{p_\theta(x_{t-1} | x_t)}{q(x_{t-1} | x_{t}, x_0)} + \sum_{t=2}^T \log \dfrac{q(x_{t-1} | x_0)}{q(x_t | x_0)} + \log \dfrac{p_\theta(x_0 | x_1)}{q(x_1 | x_0)}] \\ \end{aligned}
\begin{aligned} \log p_\theta(x_0) &\geq \mathbb{E}_{x_1, \dots, x_T \sim q(x_1, \ldots, x_T | x_0)} [ \log p_\theta(x_T) + \sum_{t=2}^T \log \frac{p_\theta(x_{t-1} | x_t)}{q(x_{t-1} | x_{t}, x_0)} -\log q(x_T | x_0) + \log p_\theta(x_0 | x_1) ] \\ &\\ & = \mathbb{E}_{x_1, \dots, x_T \sim q(x_1, \ldots, x_T | x_0)} [ \log \dfrac{p_\theta(x_T)}{q(x_T | x_0)} + \sum_{t=2}^T \log \frac{p_\theta(x_{t-1} | x_t)}{q(x_{t-1} | x_{t}, x_0)} + \log p_\theta(x_0 | x_1) ]\\ &\\ & = \mathbb{E}_{x_1, \dots, x_T \sim q(x_1, \ldots, x_T | x_0)} [ L_T + \sum_{t=2}^T L_{t} + L_0 ]\\ \end{aligned}
\begin{aligned} \log p_\theta(x_0) & \geq \mathbb{E}_{x_1, \dots, x_T \sim q(x_1, \ldots, x_T | x_0)} [ L_T + \sum_{t=2}^T L_{t} + L_0 ]\\ \end{aligned}
\mathcal{L}_\text{VLB}(\theta) = \mathbb{E}_{x_0 \sim q(x_0), t \sim \mathcal{U}(1, T)} [\text{KL} (q(x_{t-1} | x_{t}, x_0) \, || \, p_\theta(x_{t-1} | x_t) ) ]
Remember that L_T does not depend on \theta (both p(x_T) and q(x_T | x_0) are fixed), and that L_0 will be treated as a special case. We call this loss the variational lower bound (VLB).
Moreover, we know that both distributions inside the KL are normal distributions. As shown earlier, the true posterior is known to be:
q(x_{t-1}| x_t, x_0) = \mathcal{N}(x_{t-1}; \hat{\mu}(x_t, t) = \dfrac{1}{\sqrt{\alpha_t}} \, (x_t - \dfrac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_t), \dfrac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \, \beta_t \, I)
p_\theta(x_{t-1}| x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma^2_\theta(x_t, t))
Closed form of the KL divergence between two Gaussians
The closed form of the KL divergence between X = \mathcal{N}(\mu_x, \sigma^2_x) and Y=\mathcal{N}(\mu_y, \sigma^2_y) is simply a function of their parameters:
\text{KL}(X ||Y) = \log \dfrac{\sigma_y}{\sigma_x} + \dfrac{\sigma_x^2 + (\mu_x - \mu_y)^2}{2 \sigma_y^2} - \dfrac{1}{2}
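A quick numerical sanity check of this closed form against PyTorch's built-in KL divergence (the values are illustrative):

```python
import torch
from torch.distributions import Normal, kl_divergence

mu_x, sigma_x, mu_y, sigma_y = 0.3, 0.8, -0.1, 1.2
closed_form = (torch.tensor(sigma_y / sigma_x).log()
               + (sigma_x**2 + (mu_x - mu_y)**2) / (2 * sigma_y**2) - 0.5)
torch_kl = kl_divergence(Normal(mu_x, sigma_x), Normal(mu_y, sigma_y))
print(closed_form.item(), torch_kl.item())   # both are approximately 0.183
```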
Furthermore, DDPM fixes the variance of the reverse process to \bar{\beta}_t \, I instead of learning it, so both Gaussians in the KL have the same fixed variance.
All terms of the KL that only depend on the variances are therefore constants w.r.t. \theta and disappear when computing the gradient.
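Concretely, setting \sigma_x = \sigma_y = \sigma in the closed form above:
\text{KL}(X || Y) = \log 1 + \dfrac{\sigma^2 + (\mu_x - \mu_y)^2}{2 \sigma^2} - \dfrac{1}{2} = \dfrac{(\mu_x - \mu_y)^2}{2 \sigma^2}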
All that remains in the loss function is the squared difference between the means, i.e. the mse!
\mathcal{L}_\text{VLB}(\theta) = \mathbb{E}_{x_0 \sim q(x_0), t \sim \mathcal{U}(1, T)} [ (\hat{\mu}(x_t, t) - \mu_\theta(x_t, t))^2 ]
Using the same parameterization of the mean as before:
\mu_\theta(x_t, t) = \dfrac{1}{\sqrt{\alpha_t}} \, (x_t - \dfrac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t))
we only have to compute the mse between the noise \epsilon_t and the prediction \epsilon_\theta(x_t, t). This is the simple objective we used previously:
\mathcal{L}(\theta) = \mathbb{E}_{x_0 \sim q, \epsilon_t \sim \mathcal{N}(0, I), t \sim \mathcal{U}(1, T)} [(\epsilon_t - \epsilon_\theta(x_t, t))^2]

Ramesh et al. (2022) Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125
Text-to-image generators such as DALL-E, Midjourney, or Stable Diffusion combine LLMs for text embedding with diffusion models for image generation.
CLIP embeddings of texts and images are first learned using contrastive learning.
A conditional diffusion process (GLIDE) then uses the image embeddings to produce images.

Embeddings for text and images are learned using Transformer encoders and contrastive learning.
For each pair (text, image) in the training set, their representation should be made similar, while being different from the others.
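A minimal sketch of this contrastive objective (a CLIP-style symmetric cross-entropy), assuming precomputed image and text embeddings of shape (batch, dim); the temperature value and names are illustrative:

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07):
    """Pull matching (image, text) pairs together, push the others apart."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                      # pairwise cosine similarities
    targets = torch.arange(img_emb.shape[0], device=img_emb.device)   # diagonal = matching pairs
    # symmetric cross-entropy: image-to-text and text-to-image
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```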
DDPMs generate images from raw noise, but there is no control over which image will emerge.
GLIDE (Guided Language to Image Diffusion for Generation and Editing) is a DDPM conditioned on a latent representation of a caption c.
As for cGAN and cVAE, the caption c is provided to the learned model:
\epsilon_\theta(x_t, t, c) \approx \epsilon_t
Nichol et al. (2022) GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv:2112.10741
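A minimal sketch of the conditional objective: the only change w.r.t. the unconditional DDPM loss is that a latent representation c of the caption is passed to the network. model, text_encoder and all other names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def conditional_ddpm_loss(model, text_encoder, x0: torch.Tensor, captions,
                          alphas_bar: torch.Tensor) -> torch.Tensor:
    """Noise-prediction loss conditioned on a caption embedding c."""
    T = alphas_bar.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alphas_bar.to(x0.device)[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    c = text_encoder(captions)                       # latent representation of the caption
    return F.mse_loss(model(x_t, t, c), eps)         # epsilon_theta(x_t, t, c) ~ epsilon_t
```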

After CLIP training, the text and image embeddings of a pair are already close to each other, but the authors find that generation works better when the image embedding is explicitly generated from the text embedding by a prior.
The image embedding is then used as the condition for GLIDE.
Ramesh et al. (2022) Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125