Sampling Methods in Diffusion Models: A Small Survey

Latex version here: https://www.overleaf.com/read/vchfyfjsvwjk

For completeness there is a section called "Generative Models", in case of a strict word limit of 2500 words, please kindly ignore that part, it serves as a complementary section.

Introduction

Generative models have consistently been a significant focus in Machine Learning research. This trend traces back as far as the origins of probability theory itself. Suppose we have a set of observations, denoted by $\begin{array}{l}X = \{x_i\}_{i=1}^N\end{array}$ , where $\begin{array}{l}N\end{array}$ is the count of observations in the set, and each $\begin{array}{l}x_i \in \mathbb{R}^D\end{array}$ is a real-valued observation. A common task in Machine Learning involves learning a parametric probability distribution, denoted by $\begin{array}{l}P(x;\theta)\end{array}$ , where $\begin{array}{l}\theta = argmax_\theta \prod_{i=1}^N P(x_i;\theta)\end{array}$ . In simpler terms, this represents the process of identifying the probability distribution that best fits our set of observations. This inferred probability distribution can then be used to evaluate the likelihood of new observations originating from the same distribution as our original set of observations.

While not always feasible, there exists a dual aspect to this problem. Instead of only evaluating new observations, we might also want to generate new ones. For well-defined probability distributions like Gaussian, Fisher, Poisson, etc., sampling new data points given certain parameters is quite straightforward. However, real-world observations often do not align nicely with these simple families of distributions.

For both likelihood evaluation and sampling, real-world data often necessitate the use of more complex distributions. These are so intricate that we find ourselves requiring millions, and more recently billions, of parameters to accurately represent the distributions, a design commonly referred to as neural networks.

In this study, our attention will be concentrated on efforts to generate data similar to our observation set, with a particular emphasis on a category of models known as Diffusion Models. However, before delving into that, let's briefly mention some of the common designs for generative models.

Generative Models (Optional)

As previously discussed, machine learning typically involves learning distributions for various types of data. We denote sampling as $\begin{array}{l}x \sim P(x;\theta)\end{array}$ , where $\begin{array}{l}x\end{array}$ is a sample drawn from the distribution $\begin{array}{l}P(x;\theta)\end{array}$ . Often, when our dataset is complex - such as a dataset composed of images - we require specialized architectures to generate new data points. We refer to these unique architectures as generative models.

Generative Adversarial Networks

When discussing generative models, we cannot ignore Generative Adversarial Networks (GANs). Introduced by Goodfellow et al.^[1], GANs significantly transformed the landscape of deep learning. The key idea behind GANs is to reformulate the generation task into a minmax game. In this game, a generator network attempts to produce synthetic data while a discriminator network strives to differentiate between artificial and real data. The optimization function can be expressed as follows:

$\begin{array}{l}\displaystyle \min_G \max_D L(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{data}(x)} \log [D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})} \log [1 - D(G(\mathbf{z}))].\end{array}$

Here, $\begin{array}{l}G\end{array}$ represents the generator network, and $\begin{array}{l}D\end{array}$ is the discriminator network. The loss function rewards the generator for successfully deceiving the discriminator, while it also rewards the discriminator for avoiding deception. An important concept introduced in this context is the latent vector, denoted by $\begin{array}{l}\mathbf{z}\end{array}$ . We'll delve into this concept in upcoming sections. For now, it's worth noting that GANs dominated the field of generative tasks, holding the state-of-the-art position until the 2020s. However, around this time, researchers began to recognize their limitations, sparking exploration into potential alternatives.

Variational Autoencoders

The concept of autoencoders was first introduced by Rumelhart et al.^[2] in 1986, with Geoffrey Hinton as a co-author. The primary idea involves using neural networks in an unsupervised setup, aiming to learn a function $\begin{array}{l}f_\theta(x') = x\end{array}$ . Here, $\begin{array}{l}x'\end{array}$ can represent various things, but it originally denoted $\begin{array}{l}x\end{array}$ itself or a noisy version of $\begin{array}{l}x\end{array}$ . The task may appear trivial, with $\begin{array}{l}\theta\end{array}$ defined as:

$\begin{array}{l}\displaystyle \theta = argmin_\theta \sum_{i=1}^N L(f_\theta(x'_i), x_i),\end{array}$

where the objective function is typically selected as $\begin{array}{l}L_2\end{array}$ or $\begin{array}{l}L_1\end{array}$ loss. However, this formulation doesn't fully depict the underlying essence of autoencoders. The key feature making autoencoders highly valuable is the bottleneck layer. As illustrated in Figure 1, information from a high-dimensional input is compressed into merely thousands of dimensions. Given that the decoder must reconstruct the image from this compact bottleneck layer, the encoder is compelled to learn an efficient representation of the input space. It turns out that in an optimal setup, the bottleneck effectively captures the significant properties of the input space. For example, in a dataset of human faces, this could include features such as gender, face orientation, hair length, and so forth.

Figure 1. Basic scheme of an autoencoder.

However, the autoencoder alone isn't sufficient. While autoencoders can technically serve as generative models by employing the decoder with random inputs in the bottleneck, this approach is fundamentally flawed. Mathematically, there is no guarantee that a vanilla autoencoder can function effectively in this manner, often resulting in suboptimal outcomes. Instead, we leverage Variational Autoencoders (VAEs), introduced by Kingma et al.^[3]

With the same underlying neural network architecture, merely altering the formulation can enable autoencoders to operate as generative models. The precise derivation of the new objective is interesting but lengthy, so for brevity, we will only cover the basics here. Instead of using our autoencoder as an end-to-end function, we introduce small modifications. The encoder $\begin{array}{l}f_\theta(\mathbf{x})\end{array}$ , parameterized by $\begin{array}{l}\theta\end{array}$ , takes $\begin{array}{l}\mathbf{x}\end{array}$ as input and outputs two sets of vectors, $\begin{array}{l}\hat{\mu} \in \mathbb{R}^B\end{array}$ and $\begin{array}{l}\hat{\sigma}^2 \in \mathbb{R}^B\end{array}$ , where $\begin{array}{l}B\end{array}$ represents the size of the bottleneck vector.

These outputs, $\begin{array}{l}\hat{\mu}\end{array}$ and $\begin{array}{l}\hat{\sigma}^2\end{array}$ , are then utilized as parameters of a Multivariate Gaussian Distribution with a diagonal covariance matrix. As a result, we generate a distribution of bottleneck vectors, termed $\begin{array}{l}\mathbf{z}\end{array}$ or latent vectors, rather than relying on a static bottleneck vector for an input. The objective of the decoder, $\begin{array}{l}g_\phi(\mathbf{z})\end{array}$ , parameterized by $\begin{array}{l}\phi\end{array}$ , is subsequently adapted in line with a probabilistic approach, aiming to learn the output as a distribution. Consequently, the final objective function is:

$\begin{array}{l}\displaystyle L(x', x) = || \mathbf{x} - g_\phi(\hat{\mu}_x + \hat{\sigma}^2_x \epsilon) ||_2 + D_K( \mathcal{N}(\hat{\mu}_x, \hat{\sigma}_x) || \mathcal{N}(0, \mathbf{I})),\end{array}$

where $\begin{array}{l}f_\theta(x') = (\hat{\mu}_x, \hat{\sigma}^2_x)\end{array}$ , $\begin{array}{l}\epsilon \sim \mathcal{N}(0, \mathbf{I})\end{array}$ and $\begin{array}{l}D_K\end{array}$ represents the KL divergence. This formulation enables us to use a trained VAE model as a generative model by feeding $\begin{array}{l}z \sim \mathcal{N}(0, \mathbf{I})\end{array}$ random latent vectors into the decoder.

Latent Vectors

Having encountered the term 'latent vectors' repeatedly thus far, it's appropriate to provide a more formal definition. A latent vector is a vector drawn from a simple distribution, such as a Multivariate Gaussian distribution, which we use to model more complex distributions, such as a distribution of images of human faces. Modeling $\begin{array}{l}p_\theta(\mathbf{x})\end{array}$ is typically challenging, but it turns out that modeling $\begin{array}{l}p_\theta(\mathbf{x}|\mathbf{z})\end{array}$ is considerably easier, where $\begin{array}{l}z \sim \mathcal{N}(0, \mathbf{I})\end{array}$ , for instance. As you might anticipate, the Diffusion Models we discuss next will also employ this concept of latent vectors to represent a complex data distribution.

Diffusion Models

The concept of diffusion models was first introduced by Sohl-Dickstein et al.^[4] However, they gained popularity after Ho et al.^[5] formally outlined their use as generative models. To provide an informal description of what diffusion models do, imagine we have an image $\begin{array}{l}\mathbf{x}\end{array}$ . Let's then iteratively add Gaussian noise to this image. We denote the image at the $\begin{array}{l}0\end{array}$ th timestamp (the original image where no noise has been added yet) as $\begin{array}{l}\mathbf{x}_0\end{array}$ . With each iteration, we increment the timestamp such that $\begin{array}{l}\mathbf{x}_t = \mathbf{x}_{t-1} + \epsilon\end{array}$ , where $\begin{array}{l}\epsilon \in \mathbb{R}^D\end{array}$ . We'll delve into the parameters for $\begin{array}{l}\epsilon\end{array}$ later. After a sufficient number of iterations, we posit that $\begin{array}{l}\textbf{x}_T \sim \mathcal{N}(0, \mathbf{I})\end{array}$ . In other words, if we add enough noise, the resulting image will be indistinguishable from one drawn from a unit normal distribution. A visual of the process can be seen in Figure 2.

Figure 2. The graphical model of the Diffusion Models introduced by Ho et al.^[5] Figure is taken from the paper mentioned.

Now, let's assume that for each step, we had a function to remove the added noise. For instance, if $\begin{array}{l}f_\theta(\mathbf{x}_t) = \mathbf{x}_{t-1}\end{array}$ , we could have employed this function to start from a completely random image, $\begin{array}{l}\mathbf{x}_T \sim \mathcal{N}(0, \mathbf{I})\end{array}$ , and gradually remove the noise step by step, ultimately yielding a denoised image, $\begin{array}{l}\mathbf{x}_0\end{array}$ . Diffusion models employ this exact strategy: they sample an $\begin{array}{l}\mathbf{x}_T\end{array}$ from a unit normal distribution, which serves as a latent vector, and then denoise it over $\begin{array}{l}T\end{array}$ steps to obtain an image from the original distribution. Let's now delve into the specifics.

Formal Definition

As outlined by Ho et al.^[5], Diffusion Models are latent variable models. The likelihood of $\begin{array}{l}x_0\end{array}$ is formally defined as: $\begin{array}{l}p_\theta(\mathbf{x}_0) = \int p_\theta(\mathbf{x}_{0:T})d\mathbf{x}_{1:T}\end{array}$ , where $\begin{array}{l}\mathbf{x}_1, ..., \mathbf{x}_T\end{array}$ are the latent variables and $\begin{array}{l}\mathbf{x}_0, \mathbf{x}_1, ..., \mathbf{x}_T \in \mathbb{R}^D\end{array}$ . Contrary to our informal description, the latent variables are not simply fixed versions of the noisy $\begin{array}{l}x_0\end{array}$ . We assume that they follow a distribution conditioned on the previous step, denoted as $\begin{array}{l}\mathbf{x}_t \sim q(\mathbf{x}_t | \mathbf{x}_{t-1})\end{array}$ , with the singular exception being that our original data, denoted as $\begin{array}{l}\mathbf{x}_0 \sim q(x_0)\end{array}$ , is not conditioned on anything.

Modelling $\begin{array}{l}q(\mathbf{x}_t)\end{array}$ without requiring learnable parameters is straightforward. We can define it as:

$\begin{array}{l}\displaystyle q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{1 - \beta_t}\mathbf{x}_{t-1}, \beta_t\mathbf{I}),\end{array}$

For clarity, we can also express it as:

$\begin{array}{l}\displaystyle q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \sqrt{1 - \beta_t}\mathbf{x}_{t-1} + \epsilon,\end{array}$

where $\begin{array}{l}\epsilon \sim \mathcal{N}(0, \beta_t\mathbf{I})\end{array}$ represents noise drawn from a multivariate Gaussian distribution with covariance $\begin{array}{l}\beta_t\mathbf{I}\end{array}$ . In other words, we first scale $\begin{array}{l}\mathbf{x}_{t-1}\end{array}$ by $\begin{array}{l}\sqrt{1 - \beta_t}\end{array}$ and then add the noise. Here, we can consider $\begin{array}{l}\beta_t\end{array}$ to be a predetermined, fixed hyperparameter. One can clearly see that this is the noise adding process we described previously, only with the addition of scaling, which helps to "diffuse" the original input step-by-step.

Now, let's approach the more challenging part: how can we recover the de-noised version of the input? Let's remember that the final step $\begin{array}{l}T\end{array}$ of added noise will result in an entirely random image. We can represent this as $\begin{array}{l}p(\mathbf{x}_T) = \mathcal{N}(0, \mathbf{I})\end{array}$ , where $\begin{array}{l}p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)\end{array}$ denotes the distribution of the de-noised $\begin{array}{l}\mathbf{x}_t\end{array}$ . This represents the previous step in the noise addition process, $\begin{array}{l}\mathbf{x}_{t-1}\end{array}$ .

We can further justify why this latent variable should follow a distribution by noting that there could be several previous steps leading to the same noise-added image, given that the addition of noise is a random process. The question now is: how can we model the de-noised image?

As each latent variable follows a Multivariate Gaussian distribution with some unknown parameters, we can represent $\begin{array}{l}p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)\end{array}$ as $\begin{array}{l}\mathcal{N}(\mu_\theta(\mathbf{x}_t, t), \Sigma_\theta(\mathbf{x}_t, t))\end{array}$ . Here, both $\begin{array}{l}\mu_\theta\end{array}$ and $\begin{array}{l}\Sigma_\theta\end{array}$ are functions of $\begin{array}{l}\mathbf{x}_t\end{array}$ and the timestamp $\begin{array}{l}t\end{array}$ , predicting the mean and covariance vector for $\begin{array}{l}\mathbf{x}_{t-1}\end{array}$ , respectively.

The symbol $\begin{array}{l}\theta\end{array}$ represents a set of trainable parameters. With the correct optimization scheme, we can learn a function to de-noise the latent variables.

Modeling the Whole Process

The Markov chain concept plays a significant role in the formulation of diffusion models. A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends solely on the state attained in the previous event. In the context of diffusion models, the sequence of images $\begin{array}{l}\mathbf{x}_{0:T}\end{array}$ forms a Markov chain.

In more formal terms, if we have a sequence of random variables $\begin{array}{l}\mathbf{x}_{0:T}\end{array}$ , the sequence is said to follow a Markov chain if the probability of moving to the next state $\begin{array}{l}\mathbf{x}_t\end{array}$ depends solely on the current state $\begin{array}{l}\mathbf{x}_{t-1}\end{array}$ and not on the sequence of states that preceded it. This can be expressed as:

$\begin{array}{l}\displaystyle p(\mathbf{x}_t|\mathbf{x}_{0:t-1}) = p(\mathbf{x}_t|\mathbf{x}_{t-1}).\end{array}$

In the context of diffusion models, this means that the state at each timestep during the diffusion process depends only on the state at the previous timestep. This property greatly simplifies the process of modeling the evolution of the image over time.

The joint distributions $\begin{array}{l}p_\theta(\mathbf{x}_{0:T})\end{array}$ and $\begin{array}{l}q(\mathbf{x}_{1:T}|\mathbf{x}_0)\end{array}$ can then be defined. Hereafter, we refer to $\begin{array}{l}p_\theta(\mathbf{x}_{0:T})\end{array}$ as the "reverse process" and $\begin{array}{l}q(\mathbf{x}_{1:T}|\mathbf{x}_0)\end{array}$ as the "forward process". Furthermore, we denote $\begin{array}{l}\beta_1, ..., \beta_T\end{array}$ as predetermined variance variables for the forward process, which are parameters that control the "diffusion strength" at each timestep. As a result:

$\begin{array}{l}\displaystyle p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{i=1}^T p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t), \qquad p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}(\mu_\theta(\mathbf{x}_t, t), \Sigma_\theta(\mathbf{x}_t, t)),\end{array}$

and,

$\begin{array}{l}\displaystyle q(\mathbf{x}_{1:T}|\mathbf{x}_0) = \prod_{i=1}^T q(\mathbf{x}_t|\mathbf{x}_{t-1}), \qquad q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{1 - \beta_t}\mathbf{x}_{t-1}, \beta_t\mathbf{I}).\end{array}$

Using the property of Gaussian distributions, we can further simplify the process where instead of going through the noising steps one-by-one, we can collapse them into a single process where $\begin{array}{l}\alpha_t = 1 - \beta_t\end{array}$ and $\begin{array}{l}\overline{\alpha}_t = \prod_{s=1}^t \alpha_s\end{array}$ and,

$\begin{array}{l}\displaystyle q(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}(\sqrt{\overline{\alpha}_t}\mathbf{x}_0, (1 - \overline{\alpha}_t)\mathbf{I}).\end{array}$

As previously stated, the likelihood of the original image $\begin{array}{l}p_\theta(\mathbf{x}_0) = \int p_\theta(\mathbf{x}_{0:T})d\mathbf{x}_{1:T}\end{array}$ . We also defined $\begin{array}{l}p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{i=1}^T p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)\end{array}$ . However, these specifications are not sufficient to resolve this problem. The likelihood still includes an intractable integration and, unfortunately, there is no feasible method to directly optimize it.

This is precisely where latent variables become essential. The Evidence Lower Bound (ELBO) presents a popular alternative for optimizing these challenging formulations. The use of latent variables is also the cornerstone principle behind Variational Autoencoders (VAEs)^[3]. As the name suggests, ELBO provides a lower bound for $\begin{array}{l}p_\theta(\mathbf{x}_0)\end{array}$ . Given our task is to maximize the likelihood, we can also aim to maximize the ELBO. Since the likelihood is strictly greater than the ELBO, by maximizing the ELBO, we indirectly optimize the likelihood as well. This yields the following formulazation:

$\begin{array}{l}\displaystyle \mathbb{E}[-\log p_\theta(\mathbf{x}_0)] \leq \mathbb{E}_q \left[ -\log \dfrac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \right] = \mathbb{E}_q \left[ -\log p(\mathbf{x}_T) - \sum_{t \geq 1} \log \dfrac{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)}{q(\mathbf{x}_t|\mathbf{x}_{t-1})} \right]\end{array}$

The authors^[5] introduce some modifications to the formulation to simplify the overall process. They implement certain tricks: for instance, they fix the variance to 1, so the only remaining learnable parameter is $\begin{array}{l}\mu_\theta(\mathbf{x}_t, t)\end{array}$ . They also replace the $\begin{array}{l}\mu\end{array}$ predictor with $\begin{array}{l}\epsilon_\theta\end{array}$ , where the predictor is trained to predict the reverse noise, in order to denoise the image. Ultimately, we are left with a simpler formulation:

$\begin{array}{l}\displaystyle L_{simple}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \epsilon} \left[ || \epsilon - \epsilon_\theta(\sqrt{\overline{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \overline{\alpha}_t}\epsilon, t)||^2 \right].\end{array}$

Indeed, as demonstrated in the original paper, the aim was to simplify this Markov Chain process and ultimately facilitate the sampling of images without requiring unnecessary computations. Lastly, we are providing the pseudo-code for both the training and sampling routines of the Diffusion Models in Figure 3. We will then proceed with our main topic: discussing potential alternatives for the sampling process within Diffusion Models.

Figure 3. The training and sampling routine provided by Ho et al.^[5] Figure is taken from the paper mentioned.

The typical process for Diffusion Models, as we've previously discussed, involves Markov Chains, which can sometimes require up to a thousand steps to generate an image from a random image $\begin{array}{l}\mathbf{x}_T\end{array}$ . The need to sequentially traverse these steps makes Diffusion Models considerably less straightforward compared to their counterparts, despite the undeniably high quality of the results. This inherent complexity presents opportunities to explore potential enhancements or alternatives to the sampling process. In the following sections, we will delve into some published works aimed at increasing the efficiency of the sampling step by adapting the formulation to our advantage.

Denoising Diffusion Implicit Models

While Denoising Diffusion Probabilistic Models (DDPMs) have shown exceptional image generation capabilities, they suffer from lengthy sampling times due to their reliance on Markov chains. To counter this issue, this paper^[6] introduces Denoising Diffusion Implicit Models (DDIMs), a more efficient model that leverages non-Markovian diffusion processes. These DDIMs not only maintain the same quality of sample generation as DDPMs, but they do so up to 50 times faster, offering a better trade-off between computation and sample quality. Furthermore, DDIMs provide semantically meaningful image interpolation in the latent space and achieve superior sample consistency.

The main discovery here is that when working with Diffusion Models, which are typically set up as Markov Chains, the loss function or "goal" of the model doesn't really rely on the full history of data points. Instead, it only cares about individual points in isolation, called "marginals", which are represented by $\begin{array}{l}q(\mathbf{x}_t|\mathbf{x}_0)\end{array}$ , not the whole joint distribution $\begin{array}{l}q(\mathbf{x}_{1:T}|\mathbf{x}_0)\end{array}$ .

The researchers also suggest that you can get the same "marginals" from a non-Markovian process. In other words, you can achieve the same outcome with a process where a given state isn't just dependent on the previous state. This means that the model can be built just considering $\begin{array}{l}\mathbf{x}_t\end{array}$ and $\begin{array}{l}\mathbf{x}_0\end{array}$ , and we don't really need $\begin{array}{l}\mathbf{x}_{t-1}\end{array}$ . This insight can potentially make Diffusion Models more efficient to use.

Instead of using the original inference distribution $\begin{array}{l}q\end{array}$ , they consider a family of inference distribution $\begin{array}{l}q_\sigma\end{array}$ , parameterized by $\begin{array}{l}\sigma\end{array}$ where $\begin{array}{l}\sigma \in \mathbb{R}_{\geq 0}^T\end{array}$ . Then $\begin{array}{l}q_\sigma(\mathbf{x}_T|\mathbf{x}_0) = \mathcal{N}(\sqrt{\alpha_T}\mathbf{x}_0, (1 - \alpha_T)\mathbf{I})\end{array}$ and for all $\begin{array}{l}t > 1\end{array}$ , $\begin{array}{l}q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\sqrt{\alpha_{t-1}}\mathbf{x}_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \dfrac{\mathbf{x}_t - \sqrt{\alpha_t}\mathbf{x}_0}{\sqrt{1 - \alpha_t}}, \sigma_t^2 \mathbf{I})\end{array}$ , then the forward process $\begin{array}{l}q_\sigma(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{x}_0)\end{array}$ can be derived with Bayes' rule. It should be noted that the resulting process is no longer Markovian since $\begin{array}{l}\mathbf{x}_t\end{array}$ depends on both $\begin{array}{l}\mathbf{x}_{t-1}\end{array}$ and $\begin{array}{l}\mathbf{x}_0\end{array}$ . The effect of the newly introduced parameter $\begin{array}{l}\sigma\end{array}$ will be explained in the following.

After some alterations in the process, the authors finally define a new formulation for the denoising step:

(1)	$\begin{array}{l}\displaystyle \mathbf{x}_{t-1} = \sqrt{\alpha_{t-1}} \left( \dfrac{\mathbf{x}_t - \sqrt{1 - \alpha_t} \epsilon_\theta^{(t)} (\mathbf{x}_t)}{\sqrt{\sigma_t}} + \sqrt{1 - \alpha_{t-1} - \sigma_t^2}.\epsilon_\theta^{(t)}(\mathbf{x}_t) + \sigma_t \epsilon_t \right).\end{array}$

Here $\begin{array}{l}\epsilon_t \sim \mathcal{N}(0, \mathbf{I})\end{array}$ , if we set $\begin{array}{l}\sigma_t = \sqrt{(1 - \alpha_{t-1})/(1 - \alpha_t)}\sqrt{1 - \alpha_t/\alpha_{t-1}}\end{array}$ the forward process becomes Markovian and the generative process becomes equivalent to the Diffusion Models. If we set $\begin{array}{l}\sigma_t = 0\end{array}$ for all t, then the process becomes deterministic, and yields the \textit{denoising diffusion implicit model} (DDIM) that the authors propose.

After going through all these details, we finally reach the authors' key point: they found a way to make image sampling significantly quicker. By using their approach, we can employ a trained Diffusion Model as is, and instead of methodically moving through the steps from $\begin{array}{l}\mathbf{x}_T\end{array}$ all the way to $\begin{array}{l}\mathbf{x}_0\end{array}$ , we can cherry-pick a subset of these steps, like $\begin{array}{l}\tau = [1, 100, 200, ..., 1000]\end{array}$ , and just skip the others at will. This flexibility lets us produce images of comparable quality, but with a considerable reduction in the number of required iterations.

Figure 4. Comparison between $\begin{array}{l}\eta = 1.0\end{array}$ and $\begin{array}{l}\eta = 0.0\end{array}$ and custom processes in-between. The measure given in the table are FID scores for both CIFAR10 and CelebA datasets. S is the number of steps taken in the sampling process. Finally $\begin{array}{l}\hat{\sigma}\end{array}$ denotes the original DDPM methodology. The table is taken from DDIM paper^[6].

We previously discussed how the $\begin{array}{l}\sigma_t\end{array}$ parameter serves as a bridge between the original Diffusion Model and the Denoising Diffusion Implicit Model (DDIM). To standardize this into a scale ranging from zero to one, we introduce a new parameter, $\begin{array}{l}\eta\end{array}$ , and define:

$\begin{array}{l}\sigma_{\tau_i}(\eta) = \eta \sqrt{(1 - \alpha_{\tau_{i-1}})/(1 - \alpha_{\tau_i})}\sqrt{1 - \alpha_{\tau_i}/\alpha_{\tau_{i-1}}}\end{array}$ .

Here, when $\begin{array}{l}\eta = 0\end{array}$ , the generative process is that of DDIM, and when $\begin{array}{l}\eta = 1\end{array}$ , the process resembles the traditional Diffusion Models. Using this adjustment, we can now present an evaluation table for DDIM in Figure 4. The table shows that DDIM considerably outperforms traditional diffusion model for number of steps $\begin{array}{l}T < 1000\end{array}$ .

Pseudo Numerical Methods For Diffusion Models on Manifolds

The paper on Pseudo Numerical Methods For Diffusion Models (PNDMs)^[7] introduces PNDMs as an extended version of DDIMs. The authors posit that DDIMs are merely a specific instance within the broader framework of PNDMs. This perspective not only promises a deeper comprehension of DDIMs but also delivers an enhancement in output quality coupled with a reduction in necessary computation steps.

Recall the parameter $\begin{array}{l}\beta_t\end{array}$ , which governed the 'diffusion strength' or the 'diffusion speed' within the original DDPM procedure. As noted by ^[6], the timestamp $\begin{array}{l}t \in [0, 1, ..., T]\end{array}$ can be transformed into a continuous form where $\begin{array}{l}t \in [0, 1]\end{array}$ . This transformation implies that $\begin{array}{l}\beta(t)\end{array}$ and $\begin{array}{l}x(t)\end{array}$ evolve into functions of $\begin{array}{l}t\end{array}$ instead of remaining discrete entities. Such a conversion paves the way for a more flexible generation process, reducing its reliance on discrete iterations.

The authors change Equation 1 into a continuous form:

(2)

$\begin{array}{l}\displaystyle \mathbf{x}_{t - \delta} - \mathbf{x}_t = (\overline{\alpha}_{t - \delta} - \overline{\alpha}_t) \left( \dfrac{\mathbf{x}_t}{\sqrt{\overline{\alpha}_t}(\sqrt{\overline{\alpha}_{t-\delta}} + \sqrt{\overline{\alpha}}_t )} - \dfrac{\epsilon_\theta(\mathbf{x}_t, t)}{\sqrt{\overline{\alpha_t}}(\sqrt{(1 - \overline{\alpha}_{t - \delta})\overline{\alpha}_t} + \sqrt{(1 - \overline{\alpha}_t)\overline{\alpha}_{t - \delta}})} \right).\end{array}$

In this context, $\begin{array}{l}\mathbf{x}_{t - \delta}\end{array}$ serves as a continuous counterpart to $\begin{array}{l}\mathbf{x}_{t-1}\end{array}$ . This leads to a series of numerical methods, the intricate details of which we will not delve into at this point. Ultimately, they derived a pseudo numerical algorithm for the continuous diffusion process. This key derivation forms the main body of the paper's contribution, the details can be seen in Figure 5.

Figure 5. Main equations and algorithm derived in the paper^[7]. The section directly taken from the paper.

Concluding their methodological discussion, the authors label their methods according to the number of time steps utilized in a prediction. As such, the version employing four time steps, depicted in Figure 5, is designated as F-PNDMs. Correspondingly, the variant which uses two time steps is termed S-PNDMs.

Figure 6. The evaluation table taken from ^[7]. Time column denotes the average time per step. Linear is the main method, cosine is a small alteration on the original method, explained in the Appendix. PF is acquired by making Equation 2 solved by standard "ordinary differential equation" solvers, without any numerical method approach by the authors. FON is the fourth order numerical methods working again on the same formulation.

The authors highlight in Figure 6 that PNDMs can generate image samples using merely 50 time steps, without significant quality loss, compared to results using 1000 time steps. This notable accomplishment enables the authors to execute their method 20 times faster while maintaining equivalent image quality, in comparison to their DDIM counterpart.

Fast Sampling of Diffusion Models with Exponential Integrator

Zhang et al.^[8] introduced the Diffusion Exponential Integrator Sampler (DEIS) to accelerate sampling in Diffusion Models. DEIS leverages the semilinear structure of the learned diffusion process to minimize discretization errors, and thus enhances sample quality generation while requiring fewer steps than traditional DMs.

Figure 7. The table, directly sourced from ^[8], presents each column as a distinct ODE solution method, with AB representing the Adams-Bashforth method for solving ODEs. The tests were performed on a Diffusion Model trained on the 256x256 ImageNet dataset.

Building on a similar principle as PNDMs, they formulated the sampling process for diffusion models as an "ordinary differential equation", but with a novel linear transformation approach:

$\begin{array}{l}\displaystyle \dfrac{d\hat{\mathbf{x}}}{dt} = F_t\hat{\mathbf{x}} + \dfrac{1}{2} G_t G_t^T L_T^{-T} \epsilon_\theta(\hat{\mathbf{x}}, t).\end{array}$

In this equation, $\begin{array}{l}G_t\end{array}$ represents the diffusion coefficient, $\begin{array}{l}F_t\end{array}$ symbolizes the linear shift coefficient, and $\begin{array}{l}L_t\end{array}$ stands for any matrix fulfilling the condition $\begin{array}{l}L_t L_t^T = \sum_t\end{array}$ .

Owing to the intricate nature of the work, we'll forgo detailed explanations of the paper. However, in the end, they employ various methods to solve the resulting ordinary differential equation, taking advantage of their semilinear structure formulation. Their results are depicted in Figure 7.

As we can observe, the authors further improved on the DDIM and PNDM methodologies. While PNDM was generating promising results with 50 timesteps, the $\begin{array}{l}t\end{array}$ AB3 method here can achieve high fidelity results with a 15-timestep process, marking a significant leap in efficiency.

Conclusion

In this article, we delved into the realm of generative models, focusing on a specific subset known as Diffusion Models (DMs). While DMs have superior performance in image generation, their efficiency is hampered by a demanding denoising process requiring hundreds, if not thousands, of steps. Consequently, efforts to enhance the speed of the sampling process in DMs have been intensifying within the scientific community.

Three significant contributions we explored include the DDIM, PNDM, and DEIS methodologies. DDIM decontextualizes DMs from the Markov Chain, recasting the sampling process as non-Markovian, thereby reducing iterations while maintaining comparable quality. This work also highlighted the link between ordinary differential equations (ODEs) and diffusion models. The subsequent PNDM paper elaborated on this ODE approach, treating the generation process as a pseudo-approximate differential problem. They achieved results comparable to traditional DMs but with a drastically reduced requirement of 50 timesteps compared to the original 1000.

Most recently, DEIS, building on the ODE approach, reformulated the DM process leveraging its semilinear structure—linear transformations with a singular exception of prediction. This clever manipulation significantly enhanced stability and efficiency, enabling high-quality results in a mere 15 timesteps.

In conclusion, while these advances are undeniably significant, I am an advocate for simplicity. The best solutions are often the simplest, and I firmly believe that it's just a matter of time before a novel, efficient, and elegant concept emerges. Such a concept, reminiscent of the original DMs in its simplicity, could potentially render these complex approaches obsolete. Until then, the quest for more efficient generative models continues.

References

[1] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.
[2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning Internal Representations by Error Propagation, page 318–362. MIT Press, Cambridge, MA, USA, 1986.
[3] Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2022.
[4] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. CoRR, abs/1503.03585, 2015.
[5] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. CoRR, abs/2006.11239, 2020.
[6] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. CoRR, abs/2010.02502, 2020.
[7] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds, 2022.
[8] Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator, 2023

Seitenhierarchie

Sampling Methods in Diffusion Models