Blog post written by: Rohan Singh

Based on: J. Ho, A. Jain, P. Abbeel. Denoising Diffusion Probabilistic Models, DOI:10.48550/arXiv.2006.11239, (2020)   https://doi.org/10.48550/arXiv.2006.11239


1. Introduction

Diffusion models are a class of probabilistic generative models that can generate new images similar to the original dataset they are trained on. Denoising diffusion probabilistic models address the problems faced by GANs and VAEs, namely unstable training and not-so-realistic images, and they show great promise in image generation.

Diffusion models consist of two processes:

  1. Forward diffusion process
  2. Reverse diffusion process

We will discuss both processes in depth in the methodology.

2. Methodology

In this section, we will cover in depth how a diffusion model is able to create such realistic images, starting with the forward diffusion process.

2.1. Forward diffusion process

Given an image from a dataset that follows a data distribution x_0 \sim q(x), we define a forward diffusion process in which we repeatedly add a small amount of Gaussian noise over a horizon of T steps, perturbing the image and producing a sequence of noisy samples x_1, \ldots, x_T.

The distribution q for the forward diffusion process can be defined as a Markov chain by:


q(\mathbf{x}_1, \ldots, \mathbf{x}_T \mid \mathbf{x}_0) := \prod_{t=1}^T q(\mathbf{x}_t \mid \mathbf{x}_{t-1})



where each conditional is a Gaussian with mean \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1} and covariance \beta_t \mathbf{I}; that is, we rescale the previous image towards zero and add a small amount of Gaussian noise to perturb it.


q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) := \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I})



where the size of each step, i.e. the amount of noise added at each time step, is controlled by a variance schedule \{\beta_t \in (0, 1)\}_{t=1}^T.
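As an illustration, here is a minimal sketch (not the authors' code) of one forward step q(x_t \mid x_{t-1}) in PyTorch; the horizon T = 1000 and the linear schedule endpoints are illustrative choices, not values prescribed by the text above.

```python
import torch

T = 1000                                  # illustrative horizon
betas = torch.linspace(1e-4, 0.02, T)     # beta_t in (0, 1), illustrative linear schedule

def forward_step(x_prev: torch.Tensor, t: int) -> torch.Tensor:
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    beta_t = betas[t]
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - beta_t) * x_prev + torch.sqrt(beta_t) * noise
```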



A nice property of the above process is that we can sample a noisy image x_t at any arbitrary time step in closed form using the reparameterization trick. Let \alpha_t := 1 - \beta_t and \bar{\alpha}_t := \prod_{i=1}^t \alpha_i, using which we get:


\begin{aligned}
\mathbf{x}_t &= \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{1 - \alpha_t}\,\boldsymbol{\epsilon}_{t-1} & \text{;where } \boldsymbol{\epsilon}_{t-1}, \boldsymbol{\epsilon}_{t-2}, \ldots \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\
&= \sqrt{\alpha_t \alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\,\bar{\boldsymbol{\epsilon}}_{t-2} & \text{;where } \bar{\boldsymbol{\epsilon}}_{t-2} \text{ merges two Gaussians (*)} \\
&= \ldots & \\
&= \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon} & \\
q(\mathbf{x}_t \vert \mathbf{x}_0) &= \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I}) &
\end{aligned}



(*) The step above uses the fact that when we merge two zero-mean Gaussians \mathcal{N}(\mathbf{0}, \sigma_1^2\mathbf{I}) and \mathcal{N}(\mathbf{0}, \sigma_2^2\mathbf{I}), the resulting distribution \mathcal{N}(\mathbf{0}, (\sigma_1^2 + \sigma_2^2)\mathbf{I}) has the merged variance; in our case the merged standard deviation is \sqrt{(1 - \alpha_t) + \alpha_t (1-\alpha_{t-1})} = \sqrt{1 - \alpha_t\alpha_{t-1}}.
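As a small sketch (reusing the illustrative schedule from before), sampling x_t directly from x_0 via the closed form above might look like this; `q_sample` is a hypothetical helper name.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # illustrative schedule
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod_{i<=t} alpha_i

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I) in one shot."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)  # broadcast over (B, C, H, W) image batches
    return torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps

# usage: x_t = q_sample(x0, t, torch.randn_like(x0)) for a batch of images x0 and time steps t
```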

It is important to note that at the end of the horizon the perturbed input image approximately follows a standard normal distribution: q(\mathbf{x}_T) \approx \mathcal{N}(\mathbf{0}, \mathbf{I}).
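A quick numeric check of this claim, under the same illustrative linear schedule:

```python
import torch

betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bar_T = torch.cumprod(1.0 - betas, dim=0)[-1]
print(alpha_bar_T.item())  # roughly 4e-5: almost none of x_0 survives, so x_T is essentially pure noise
```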


2.2. Reverse diffusion process

In the forward process we added noise to the input images over the horizon T. In the reverse diffusion process we would like to invert the forward process and sample from q(\mathbf{x}_{t-1} \vert \mathbf{x}_t), where:

q(\mathbf{x}_{t-1} \vert \mathbf{x}_t) \propto q(\mathbf{x}_{t-1})\, q(\mathbf{x}_t \vert \mathbf{x}_{t-1})


In general q(\mathbf{x}_{t-1} \vert \mathbf{x}_t) is intractable, but it can be shown that if \beta_t is small in the forward process, then this posterior can be approximated by a Gaussian. The joint distribution of the reverse diffusion process is then also a Markov chain:

p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod^T_{t=1} p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) \quad
p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \Sigma_\theta(\mathbf{x}_t, t))


So, in the reverse process we sample noise \mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) and then iterate backwards over the horizon; at each step we try to predict the mean of the Gaussian noise that was added to the image in the forward diffusion process using a neural network (a U-Net), which we will discuss briefly in the training section.




3. Learning process in Denoising Diffusion Probabilistic Model

Since diffusion models are a class of latent variable models, we need to approximate the Evidence Lower Bound (ELBO), the objective used in variational inference. The variational lower bound (VLB and ELBO are interchangeable terms) is:

\mathbb{E}_{q(x_0)} \left[ - \log p_\theta(x_0) \right] \leq \mathbb{E}_{q(x_0)q(x_{1:T} \mid x_0)} \left[ \log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} \right] =: L

This objective can be rewritten as a combination of KL divergence and entropy terms to make it analytically computable.

\begin{aligned} L_\text{VLB} &= \mathbb{E}_{q(\mathbf{x}_{0:T})} \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] \\ &= \mathbb{E}_q \Big[ \log\frac{\prod_{t=1}^T q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{ p_\theta(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t) } \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=1}^T \log \frac{q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \Big( \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)}\cdot \frac{q(\mathbf{x}_t \vert \mathbf{x}_0)}{q(\mathbf{x}_{t-1}\vert\mathbf{x}_0)} \Big) + \log \frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t \vert \mathbf{x}_0)}{q(\mathbf{x}_{t-1} \vert \mathbf{x}_0)} + \log\frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_T \vert \mathbf{x}_0)}{q(\mathbf{x}_1 \vert \mathbf{x}_0)} + \log \frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big]\\ &= \mathbb{E}_q \Big[ \log\frac{q(\mathbf{x}_T \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_T)} + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} - \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1) \Big] \\ &= \mathbb{E}_q [\underbrace{D_\text{KL}(q(\mathbf{x}_T \vert \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_T))}_{L_T} + \sum_{t=2}^T \underbrace{D_\text{KL}(q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t))}_{L_{t-1}} \underbrace{- \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)}_{L_0} ] \end{aligned}


In L_\text{VLB}, the term L_T contains no trainable parameters (the forward process is fixed and \mathbf{x}_T is pure Gaussian noise), so it can be ignored during training, and L_0 is handled with a separate decoder in the original paper; the interesting terms are therefore the L_{t-1}. They compare p_\theta(x_{t-1} \vert x_t) against q(x_{t-1} \vert x_t, x_0), the tractable posterior distribution, which can be written as:

q(x_{t-1} \mid x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t \mathbf{I}),

where \tilde{\mu}_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t (mean) and \tilde{\beta}_t := \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t (variance). Further, since both q(x_{t-1} \vert x_t, x_0) and p_\theta(x_{t-1} \vert x_t) are normal distributions, the KL divergence has a simple closed form:

L_{t-1} = \mathbb{E}_q \left[ \frac{1}{2\sigma_t^2} \left\| \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \right\|^2 \right] + C
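As a quick illustration, here is a minimal sketch (again assuming the illustrative schedule from earlier, with 0-based indexing into the schedule arrays) that computes the posterior parameters \tilde{\mu}_t and \tilde{\beta}_t defined above.

```python
import torch

betas = torch.linspace(1e-4, 0.02, 1000)     # illustrative schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def posterior_params(x0: torch.Tensor, xt: torch.Tensor, t: int):
    """Return the mean and variance of q(x_{t-1} | x_t, x_0) for a single step t >= 1."""
    a_bar_t, a_bar_prev = alpha_bars[t], alpha_bars[t - 1]
    beta_t, alpha_t = betas[t], alphas[t]
    mu = (torch.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)) * x0 \
         + (torch.sqrt(alpha_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)) * xt
    beta_tilde = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t
    return mu, beta_tilde
```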


Further, using x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, the posterior mean can be written as \tilde{\mu}_t(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\epsilon\right), and the output of the neural network (U-Net) is parameterized the same way, \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\epsilon_\theta(x_t, t)\right), so the above equation can be further simplified into


L_{t-1} = \mathbb{E}_{x_0 \sim q(x_0),\, \epsilon \sim \mathcal{N}(0, \mathbf{I})} \left[ \underbrace{\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)}}_{\lambda_t} \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, t\right) \right\|^2 \right]


where x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon. It was also found that simply setting \lambda_t = 1 produces high-quality images, which simplifies the above equation to:


L_{t-1} = \mathbb{E}_{x_0 \sim q(x_0),\, \epsilon \sim \mathcal{N}(0, \mathbf{I})} \left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, t\right) \right\|^2 \right]

So, after performing the forward diffusion process, in the reverse diffusion process the model takes a noisy image and learns to predict the noise that was used to generate that specific noisy image.
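In code, the simplified objective reduces to a few lines. This is a minimal sketch in which `model` stands for any noise-prediction network (such as the U-Net discussed in the next section) and `alpha_bars` is the cumulative product of \alpha_t as before.

```python
import torch
import torch.nn.functional as F

def simple_loss(model, x0: torch.Tensor, t: torch.Tensor, alpha_bars: torch.Tensor) -> torch.Tensor:
    """|| eps - eps_theta(x_t, t) ||^2 averaged over the batch."""
    eps = torch.randn_like(x0)                                     # the noise we add
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps   # closed-form forward sample
    return F.mse_loss(model(x_t, t), eps)                          # predict the added noise
```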

4. Training

4.1. Training



During training, while iterating through the epoch loop and the batch loop of the dataset, we randomly choose a time step t in the horizon T and generate a noisy image using x_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, where \epsilon \sim \mathcal{N}(0, \mathbf{I}) (forward process). We then pass the noisy sample x_t and the time step t through the neural network (U-Net), which predicts the noise that must have been added to generate that noisy sample during the forward diffusion process. Finally, we compute the mean squared error between the noise we added in the forward part of training and the noise predicted by the U-Net, and repeat this process until convergence.
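A minimal sketch of this training loop (essentially Algorithm 1 of the paper); `model`, `dataloader`, the number of epochs, and the learning rate are placeholders, and the schedule is the same illustrative one used earlier.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def train(model, dataloader, epochs=35, lr=2e-4, device="cpu"):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    a_bars = alpha_bars.to(device)
    for _ in range(epochs):
        for x0, _ in dataloader:                                    # class labels are unused
            x0 = x0.to(device)
            t = torch.randint(0, T, (x0.shape[0],), device=device)  # random time step per image
            eps = torch.randn_like(x0)
            a_bar = a_bars[t].view(-1, 1, 1, 1)
            x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps  # forward process
            loss = F.mse_loss(model(x_t, t), eps)                   # predict the added noise
            opt.zero_grad()
            loss.backward()
            opt.step()
```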


The backbone of the model is the U-Net architecture. A few changes are made to the standard U-Net: the model must be conditioned on the current time step, which in practice is done with a sinusoidal time-step embedding followed by a small one-layer MLP, and self-attention blocks are added at the 16 × 16 feature map resolution.
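For example, a sinusoidal time-step embedding (the same construction as the Transformer positional encoding) can be sketched as follows; `dim` is assumed to be even, and the name `timestep_embedding` is just illustrative.

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Map integer time steps t of shape (B,) to embeddings of shape (B, dim); dim must be even."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]                     # (B, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # (B, dim)
```

In practice this embedding is typically passed through a small MLP and injected into the U-Net's intermediate blocks so that every layer knows which time step it is denoising.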

4.2. Sampling



During sampling, we sample noise \mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) and iterate backwards over the horizon T, computing a less noisy image x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\epsilon_\theta(x_t, t)\right) + \sigma_t z, where z \sim \mathcal{N}(0, \mathbf{I}) (and z = 0 at the final step), until we reach x_0, which is the sampled or generated image from the denoising diffusion probabilistic model.
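A minimal sketch of this sampling loop, choosing \sigma_t^2 = \beta_t (one of the options considered in the paper); `model` and the MNIST-sized `shape` are placeholders.

```python
import math
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(model, shape=(16, 1, 28, 28), device="cpu"):
    x = torch.randn(shape, device=device)                  # start from pure noise x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps_theta = model(x, t_batch)                      # predicted noise eps_theta(x_t, t)
        beta_t, alpha_t, a_bar_t = betas[t].item(), alphas[t].item(), alpha_bars[t].item()
        mean = (x - (1.0 - alpha_t) / math.sqrt(1.0 - a_bar_t) * eps_theta) / math.sqrt(alpha_t)
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)  # z = 0 at the final step
        x = mean + math.sqrt(beta_t) * z                   # sigma_t^2 = beta_t
    return x                                               # the generated images x_0
```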


5. Results

After implementing all of the above in PyTorch and training the model for 35 epochs with the Adam optimizer, I was able to achieve the following results on the MNIST dataset:



6. Conclusion 

  1. Neuroscience Data Augmentation: Machine learning models in neuroscience often require large amounts of labeled data, which can be difficult and expensive to obtain. DDPMs can generate synthetic brain imaging or activity data that can be used to augment existing datasets, improving the training of machine learning models for tasks like classification, segmentation, and anomaly detection.

  2. Denoising Brain Scans: Brain imaging techniques like MRI, fMRI, and EEG are often affected by noise due to the complexities of data acquisition and the environment. DDPMs can be used to denoise these brain scans.


7. References

  1. J. Ho, A. Jain, P. Abbeel. Denoising Diffusion Probabilistic Models (2020). https://doi.org/10.48550/arXiv.2006.11239
  2. O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation (2015). https://arxiv.org/abs/1505.04597









