Diffusion Probabilistic Models
Professur für Künstliche Intelligenz - Fakultät für Informatik
X_t = \sqrt{1 - p} \, X_{t-1} + \sqrt{p} \, \epsilon \qquad\qquad \text{where} \qquad\qquad \epsilon \sim \mathcal{N}(0, 1)
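The forward step above can be sketched in a few lines of numpy. This is a minimal illustration, not a full noise schedule: `p` is kept constant here, whereas practical schedules vary it per step.

```python
import numpy as np

def forward_diffusion_step(x_prev, p, rng):
    """One forward diffusion step: mix the previous image with fresh
    Gaussian noise. p controls how much noise is injected; as the steps
    accumulate, the signal is progressively destroyed."""
    eps = rng.standard_normal(x_prev.shape)  # eps ~ N(0, 1)
    return np.sqrt(1.0 - p) * x_prev + np.sqrt(p) * eps

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))  # toy unit-variance "image"
for t in range(10):
    x = forward_diffusion_step(x, p=0.1, rng=rng)
```

Note that the step is variance-preserving: if `x_prev` has unit variance, so does `x_t`, which is why the schedule can be iterated many times without the values blowing up.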
We will not go into the details, but learning the reverse diffusion step involves Bayesian inference, KL divergences, and so on.
Since we have the images at steps t and t+1, it should be possible to learn the reverse mapping, right?
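To make the intuition concrete, here is a toy sketch of the learning problem, under simplifying assumptions: a single step with fixed noise level `p`, flat 16-dimensional "images", and a plain linear least-squares model standing in for the denoising network. Real diffusion models use a deep network and train it to predict the injected noise, but the supervised structure is the same.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, d = 0.1, 512, 16

# Batch of samples at step t and their diffused versions at step t+1.
x_prev = rng.standard_normal((n, d))
eps = rng.standard_normal((n, d))
x_t = np.sqrt(1 - p) * x_prev + np.sqrt(p) * eps

# Toy "network": a linear least-squares predictor of the noise from x_t.
W, *_ = np.linalg.lstsq(x_t, eps, rcond=None)
pred = x_t @ W
mse = np.mean((pred - eps) ** 2)  # below 1.0, the error of predicting zero noise
```

Because we generated the pairs ourselves, the injected noise `eps` is a known regression target, which is exactly what makes the reverse step learnable from data.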
Embeddings for text and images are learned using Transformer encoders and contrastive learning.
For each (text, image) pair in the training set, the two representations are pushed to be similar to each other, while being kept different from those of all other pairs in the batch.
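This objective can be written as a symmetric cross-entropy over a batch of matched pairs (an InfoNCE-style contrastive loss). The sketch below assumes the Transformer encoders have already produced the embedding matrices; the `temperature` value is an illustrative choice, not a prescribed one.

```python
import numpy as np

def contrastive_loss(text_emb, img_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (text, image)
    pairs. Row i of each matrix embeds pair i; the matched pair should
    score higher than every mismatched combination in the batch."""
    # L2-normalize so similarity is the cosine between embeddings.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature   # pairwise similarity matrix
    labels = np.arange(len(logits))  # text i matches image i

    def cross_entropy(lg, y):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(y)), y].mean()

    # Average the text-to-image and image-to-text directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Treating every other pair in the batch as a negative is what makes the representation discriminative without any explicit class labels.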