Author: Weixuan Yuan

Supervisor: Mohammad Farid Azampour

Outline:

  1. Introduction
  2. Denoising Diffusion Probabilistic Models
    1. Definition
    2. Inference
    3. Training
    4. Characteristics
  3. Inspiring Diffusion Models
    1. Image Generation Examples
      1. Conditioned Sampling
      2. Style Transfer
      3. Repaint (Inpainting)
    2. Inspiring Diffusion Models: Specialized Tasks Examples
      1. A Physics-informed Diffusion Model for High-fidelity Flow Field Reconstruction
      2. PhysDiff: Physics-Guided Human Motion Diffusion Model
      3. Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos
  4. Potential Medical Applications
  5. Conclusions

1. Introduction

In recent years, diffusion models have risen to prominence as one of the most popular classes of generative models, securing significant achievements in areas such as imaging and sound thanks to their distinctive iterative generation process. In this blog, we explore state-of-the-art uses of diffusion models, particularly highlighting their advantages in leveraging guidance and prior knowledge when contrasted with non-iterative generative models. To aid understanding, we start with a basic introduction to the foundations of diffusion models and present several scenarios of their application in image generation tasks. Subsequently, we discuss the application of diffusion models to specialized tasks in physics through three selected academic papers and consider their prospective medical applications.



2. Denoising Diffusion Probabilistic Models

2.1. Definition

Denoising Diffusion Probabilistic Models [1] are a type of generative model that defines a sequence of latent variables with the same dimensionality as the input data. The formal definition of the denoising process (reverse process) is given by

p_{\theta}(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_{\theta}(x_{t-1}|x_t),

a Markov chain with Gaussian transitions, which starts from pure noise p(x_T) = \mathcal{N}(x_T; 0, \mathbf{I}) and ends at the ground-truth data distribution x_0 \sim q(x_0).

Its inverse, the diffusion process or forward process, is somewhat more straightforward to grasp. Intuitively, it gradually adds noise to the data until only pure noise remains. It is defined as:

q(x_{1:T}|x_0):=\prod_{t=1}^Tq(x_t|x_{t-1}),\quad q(x_t|x_{t-1}):=\mathcal{N}(x_t;\sqrt{1 - \beta_t}x_{t-1},\beta_t\mathbf{I}),

where 0 < \beta_1 < \beta_2 < \cdots < \beta_T < 1 are noise scales. Defining \alpha_t := 1 - \beta_t and \bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s, the forward process can be collapsed into the single step x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon, where \varepsilon \sim \mathcal{N}(0, \mathbf{I}), which accelerates training.
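To make the one-step formula concrete, here is a minimal sketch of the forward process, assuming a linear beta schedule; the names (q_sample, the schedule constants) are illustrative choices, not fixed by [1]:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # noise scales beta_1..beta_T
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t = prod_s alpha_s

def q_sample(x0, t):
    """Jump directly to x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    eps = torch.randn_like(x0)
    xt = alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * eps
    return xt, eps
```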

Note: Please be aware that there are alternative definitions/implementations of diffusion models, such as score-based generative modeling [2] or formulations through ordinary differential equations [3], among others, which we will not explore here.

2.2. Inference

To generate data using diffusion models, we need to model the transition from x_t to x_{t-1}, which is defined as

p_{\theta}(x_{t-1}|x_t) := \mathcal{N}(x_{t-1}; \mu_{\theta}(x_t, t), \Sigma_{\theta}(x_t, t)),

where the variance is manually configured as \Sigma_{\theta}(x_t, t) = \sigma_t^2 \mathbf{I} with \sigma_t^2 = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\beta_t, and \mu_{\theta}(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \varepsilon \right). The only thing left is to train a neural network that predicts \varepsilon = \varepsilon_{\theta}(x_t, t). Then x_{t-1} can be computed through the reparametrization trick: x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \varepsilon_{\theta}(x_t, t) \right) + \sigma_t z, where z \sim \mathcal{N}(0, \mathbf{I}).
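As a minimal sketch of this procedure, reusing the schedule from the forward-process snippet above; `model` stands for any network implementing \varepsilon_{\theta}(x_t, t) and is an assumption, not a fixed API:

```python
@torch.no_grad()
def ddpm_step(model, x, t):
    """One reverse transition x_t -> x_{t-1} using the formulas above."""
    eps = model(x, torch.tensor([t]))            # predicted noise eps_theta(x_t, t)
    coef = betas[t] / (1.0 - alpha_bars[t]).sqrt()
    mean = (x - coef * eps) / alphas[t].sqrt()
    if t == 0:
        return mean                              # no noise added at the last step
    sigma = ((1 - alpha_bars[t - 1]) / (1 - alpha_bars[t]) * betas[t]).sqrt()
    return mean + sigma * torch.randn_like(x)

@torch.no_grad()
def ddpm_sample(model, shape):
    x = torch.randn(shape)                       # start from x_T ~ N(0, I)
    for t in reversed(range(T)):
        x = ddpm_step(model, x, t)
    return x
```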

2.3. Training

As mentioned, a neural network is trained to predict the added noise from the noisy data at time step t. This is straightforward to implement, since one can create infinitely many noisy data instances by adding randomly sampled Gaussian noise to the training data. Intuitively, the model can be optimized using the mean squared error between the predicted noise and the actual noise, which aligns with the formal mathematical derivation [1].
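A hedged sketch of one training step, reusing `q_sample` from above; the model and optimizer are placeholders:

```python
def training_step(model, x0, optimizer):
    t = torch.randint(0, T, (1,)).item()         # random time step
    xt, eps = q_sample(x0, t)                    # noisy data and the true noise
    pred = model(xt, torch.tensor([t]))
    loss = torch.nn.functional.mse_loss(pred, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```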

2.4. Characteristics


Compared to generative models that perform inference in a single forward pass, diffusion models generate through an iterative process and provide a well-defined sequence of hidden states. Due to this characteristic, diffusion models have greater flexibility in utilizing guidance and prior knowledge.


3. Inspiring Diffusion Models

In this section, we first introduce state-of-the-art uses of diffusion models through more accessible image generation examples, followed by a discussion of three recommended papers that apply diffusion models to specialized tasks in physics.

3.1. Image Generation Examples

3.1.1. Conditioned Sampling

3.1.1.1. Conditional Generation


Just as with other generative models, the denoising process of diffusion models can be conditioned by feeding conditional content into the neural network. Formally, the model learns the conditional probability distribution p_{\theta}(x_{0:T}|c) := p(x_T|c) \prod_{t=1}^{T} p_{\theta}(x_{t-1}|x_t,c), where p(x_T|c) = p(x_T) = \mathcal{N}(x_T; 0, \mathbf{I}). In the case demonstrated above, image generation is directed by text prompts, which can be represented by textual or multimodal embeddings. Furthermore, conditioning in diffusion models offers greater flexibility. For instance, one might use prompts such as "moon surface and kangaroo" in the initial steps to establish the fundamental elements of the image, and then apply prompts like "pixel style" in the subsequent steps to refine the visual details.
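In practice, conditioning often amounts to handing the noise predictor an extra embedding. A minimal, purely illustrative sketch (the architecture is my own assumption, not how any particular system is built):

```python
import torch.nn as nn

class CondEpsModel(nn.Module):
    """Noise predictor eps_theta(x_t, t, c) that also sees a condition c."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + cond_dim + 1, 256),
                                 nn.SiLU(),
                                 nn.Linear(256, dim))

    def forward(self, x, t, c):
        # concatenate noisy data, time step, and condition embedding
        h = torch.cat([x, t.float().expand(x.shape[0], 1), c], dim=-1)
        return self.net(h)
```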

3.1.1.2. Classifier-based Guidance 

Unlike models that perform inference in one pass, diffusion models can also utilize the iterative generation process for conditional control. That is, the denoising process is treated as an optimization run, where gradients are computed with respect to intermediate results x_t (noisy images) instead of model parameters. This approach divides into classifier-based [4] and classifier-free [5] conditional guidance.

In the case of classifier-based guidance, a classifier that computes the likelihood of labels given noisy data x_t is trained separately, denoted p_{\phi}(y|x_{t}). During the inference phase, this classifier computes the gradient of the log-likelihood of the desired condition with respect to the current noisy image. Aiming to maximize the log-likelihood, as highlighted in the algorithm, the gradient is weighted and added to the noisy image x_t. Intuitively, the gradient of the log-likelihood measures how the pixel values of the current noisy image can be adjusted to make the image more likely to be classified as the desired label.
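A sketch of one classifier-guided transition in the style of [4], reusing the schedule above and assuming t > 0; `classifier` and `scale` are illustrative placeholders:

```python
def classifier_guided_step(model, classifier, x, t, y, scale=1.0):
    tt = torch.tensor([t])
    x = x.detach().requires_grad_(True)
    # log p_phi(y | x_t) for the desired label y
    log_prob = torch.log_softmax(classifier(x, tt), dim=-1)[:, y].sum()
    grad = torch.autograd.grad(log_prob, x)[0]   # nabla_{x_t} log p_phi(y | x_t)
    with torch.no_grad():
        eps = model(x, tt)
        coef = betas[t] / (1.0 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()
        var = (1 - alpha_bars[t - 1]) / (1 - alpha_bars[t]) * betas[t]
        mean = mean + scale * var * grad          # shift the mean toward label y
        return mean + var.sqrt() * torch.randn_like(x)
```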

3.1.1.3. Classifier-free Guidance

The latter operates similarly to standard conditional generative modeling, except that during training the condition is randomly dropped with a certain probability (e.g., 0.1 or 0.2 [5]). During inference, at each step, the noise is predicted for both the conditional and unconditional scenarios, and a linear combination of the two is used for denoising, as highlighted in the algorithm.

Intuitively, in the classifier-free setting, the classifier is merged into the noise prediction model and trained jointly, where \varepsilon_{\theta}(x_t, c) - \varepsilon_{\theta}(x_t) plays a similar role to \nabla_{x_t} \log p_{\phi}(y|x_t).
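The linear combination itself is a one-liner. A hedged sketch, where `w` is the guidance weight and `null_c` a learned "no condition" embedding (both assumptions):

```python
def cfg_eps(model, x, t, c, null_c, w=2.0):
    eps_cond = model(x, t, c)         # eps_theta(x_t, c)
    eps_uncond = model(x, t, null_c)  # eps_theta(x_t)
    # equals eps_uncond + (1 + w) * (eps_cond - eps_uncond)
    return (1 + w) * eps_cond - w * eps_uncond
```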

3.1.2. Style Transfer


The DDPM is an iterative process that transforms pure Gaussian noise into real data, but it need not always start from scratch. In other words, we can provide a noisy image as an intermediate state and begin the denoising process from there:

p_{\theta}(x_{0:t} | c) := p(x_t) \prod_{i=1}^{t} p_{\theta}(x_{i-1}|x_i,c),

where x_t = \sqrt{\bar{\alpha}_t}\, x^{(g)} + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon, 0 \leq t \leq T, is the noised guidance image. As illustrated in the above figure, noise is initially introduced to the original image, and the denoising process commences from the time step that corresponds to that noise level. Consequently, a new image is obtained that, while bearing resemblance to the original, is subtly altered to align with the provided textual description.
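A minimal sketch of this partial-denoising idea, reusing the earlier helpers; the starting step `t0` trades faithfulness to the guide image against freedom to change it, and its value here is an arbitrary assumption:

```python
@torch.no_grad()
def stylize(model, x_guide, t0=400):
    x, _ = q_sample(x_guide, t0)       # noise the guide image up to step t0
    for t in reversed(range(t0 + 1)):
        x = ddpm_step(model, x, t)     # denoise from there as usual
    return x
```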

3.1.3. Repaint (Inpainting)

RePaint is a denoising diffusion probabilistic model-based approach for free-form image inpainting that handles a wide range of masking scenarios by sampling from the unmasked regions during the generative process, producing diverse, high-quality outputs without fine-tuning the core DDPM network [6]. Intuitively, at every denoising step, RePaint replaces the parts of the current hidden state that must remain consistent with the original image by the original image plus noise corresponding to that step's level. The formal definition is as follows:

x_{t-1}^{\text{known}} \sim \mathcal{N}(\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)\mathbf{I}) ,
x_{t-1}^{\text{unknown}} \sim \mathcal{N}(\mu_{\theta}(x_t, t), \Sigma_{\theta}(x_t, t)),
x_{t-1} = m \odot x_{t-1}^{\text{known}} + (1 - m) \odot x_{t-1}^{\text{unknown}},

where m is the mask. Diffusion-based RePaint excels in achieving state-of-the-art results through simple techniques without the need for any additional training.
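One RePaint step can be sketched directly from these equations, reusing the earlier helpers; `m` is a binary mask that is 1 on known (kept) pixels, and all names are illustrative:

```python
@torch.no_grad()
def repaint_step(model, x, t, x0, m):
    x_unknown = ddpm_step(model, x, t)          # ordinary p_theta(x_{t-1} | x_t)
    x_known, _ = q_sample(x0, max(t - 1, 0))    # original, noised to matching level
    return m * x_known + (1 - m) * x_unknown    # stitch known and generated parts
```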

3.2. Specialized Tasks Examples

Next, we will integrate our previously acquired knowledge to analyze three recommended papers that employ diffusion models to address specialized problems in physics.

3.2.1. A Physics-Informed Diffusion Model for High-Fidelity Flow Field Reconstruction (JCP 2023)

High-fidelity computational fluid dynamics (CFD) data is essential for understanding how engineering systems interact with fluid flows. However, traditional high-fidelity CFD numerical methods are computationally expensive and time-consuming. Therefore, the task of accurately reconstructing high-fidelity data from low-fidelity data holds significant value. The paper "A Physics-Informed Diffusion Model for High-Fidelity Flow Field Reconstruction" addresses this issue through diffusion models and was published in the Journal of Computational Physics in 2023 [7].

3.2.1.1. Problem and Challenges

The objective is to reconstruct high-fidelity data from low-fidelity sources. Flow field data can be analogized to one-channel images, where 'high-fidelity' denotes clarity and precision, and 'low-fidelity' indicates poor resolution or contaminated data, as exemplified in the image.

The main challenge lies in the fact that low-fidelity data may originate from various distributions; that is to say, we encounter different types of low-quality images, for instance low-resolution, blurry, or distorted ones. Training a neural network to predict high-fidelity data in a single attempt is impractical, primarily because the specific nature of the low-fidelity data that will be encountered during inference is unpredictable. This implies that one would need to anticipate all possible forms of perturbation prior to training and prepare data accordingly, which is not feasible.

3.2.1.2. Method

The solution parallels style transfer: all forms of low-fidelity data are essentially images in varying styles, and the goal is to obtain a similar image rendered in a clear and clean style. Thus, the sampling method employed is analogous to that used in style transfer for image generation. Initially, Gaussian noise is introduced to the low-fidelity data to generate a noisy variant. Subsequently, the denoising process begins with this noisy version and ultimately yields the high-fidelity data. It is important to note that, in this scenario, only high-fidelity data are required during the training phase; there is no longer a need to prepare samples of low-fidelity data.

In addition to the sampling technique, this paper introduces another innovative aspect: the use of partial differential equations (PDEs) to guide the model. It's not necessary to comprehend the physical principles underlying these PDEs. We can simply regard the residual of the PDE as a metric that assesses whether flow data adheres to physical laws. For example, if the PDE evaluates to zero for a dataset, it signifies that the data perfectly complies with the physical rules; if not, the data is considered imperfect.

This closely parallels the scenario of classifier-based guidance, where the PDE residual is analogous to the likelihood predicted by a classifier. Since the objective is to minimize the PDE residual as much as possible, one simply needs to perform gradient descent with respect to the noisy flow data after each denoising step.
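A hedged sketch of that gradient step, reusing `ddpm_step` from above; `pde_residual` and `step_size` are placeholders, not the paper's exact operators or schedule:

```python
def physics_guided_step(model, x, t, pde_residual, step_size=0.1):
    x = ddpm_step(model, x, t)               # ordinary denoising step
    x = x.detach().requires_grad_(True)
    loss = pde_residual(x).pow(2).mean()     # zero iff x obeys the physics
    grad = torch.autograd.grad(loss, x)[0]
    return (x - step_size * grad).detach()   # gradient descent on the noisy flow
```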

3.2.1.3. Results

The authors compare their method to traditional super-resolution techniques, such as bicubic interpolation, and achieve superior results. As previously mentioned, since conventional neural network approaches struggle to handle arbitrary types of noise, there is no need to compare with neural network baselines outside of diffusion models.

3.2.2. PhysDiff: Physics-Guided Human Motion Diffusion Model (ICCV 2023)

Generating lifelike human motions holds significant application value in fields such as film production and human-computer interaction. "PhysDiff: Physics-Guided Human Motion Diffusion Model" addresses the issue that past generative results did not adhere to physical laws by combining diffusion models with physics simulation [8]. This work has been accepted by ICCV 2023.

3.2.2.1. Problem and Challenges

The goal is to synthesize human motion from textual descriptions. Essentially, this entails producing a discrete sequence of frames representing the continuous movement, where each frame is characterized by the three-dimensional coordinates and angular orientations of the body's joints.

The challenge lies in creating motions that are physically plausible; that is, avoiding scenarios where figures appear embedded in the ground or suspended in mid-air, as depicted on the right.

3.2.2.2. Method

The NVIDIA team addresses this issue with a projector: a module that takes a sequence of movements as input and outputs a physically plausible version. Essentially, the projector ensures that the feet in the motion make correct contact with the ground. It is applied after each diffusion step during sampling to guarantee that the final motion sequence is physically plausible. One might wonder: why not just apply the projection once after the final step? To address this valid question, recall our earlier discussion of inpainting. NVIDIA's method bears a resemblance to inpainting; after each denoising step, the hidden state is adjusted towards the desired outcome. With inpainting, we replace part of the hidden state with the (noised) original, and here we correct the intermediate motions to ensure they are physically plausible. The research team provides corroborating data for this choice.
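A sketch of projection-in-the-loop sampling in the spirit of [8], reusing `ddpm_step`; `project` stands in for the physics-based motion projection (a black-box placeholder here), and the schedule of when to apply it is my own assumption:

```python
@torch.no_grad()
def physdiff_sample(model, shape, project, project_every=10):
    x = torch.randn(shape)                   # x_T ~ N(0, I)
    for t in reversed(range(T)):
        x = ddpm_step(model, x, t)
        if t % project_every == 0:           # interleave the physical projection
            x = project(x)                   # enforce plausibility mid-sampling
    return x
```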


3.2.2.3. Results

The graphs demonstrate that applying the projector during the sampling process yields superior results compared to using it as post-processing, according to various metrics.

3.2.3. Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos (CVPR 2023)

Impact sounds are crucial for industries such as gaming, and the transition from silent videos to high-quality impact sounds can be advantageous, as it may reduce dependence on sophisticated recording equipment. The paper "Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos" addresses this issue by employing physical parameters as guidance to achieve high-quality audio generation [9]. Their work has been accepted by CVPR 2023.

3.2.3.1. Problem and Challenges

The aim of this work is to enhance the quality of impact sound as much as possible. The primary challenge is learning the multimodal relationship between video and impact sound, which is quite complex. In other words, sound generation involves intricate physical processes, including vibration and aerodynamics, while video provides only limited visual information. Intuitively, offering more explicit and information-rich guidance would be beneficial.

3.2.3.2. Method


The paper proposes that, on top of video embeddings, additional guidance information should be incorporated: physics parameters such as frequencies (f), excitation powers (p), and decay rates (λ), as well as residual parameters that describe background noise and reverberation. This provides the diffusion model with more precise directives on the characteristics of the sound to be generated.

Such parameters are obtained by an autoencoder-like architecture, in which the encoders are neural networks but the decoder is a specially designed synthesizer. This module is trained in a supervised fashion, since ground-truth audio is available during training. The remaining issue is how to obtain physics priors without access to the ground-truth sound during inference.


The authors approach this issue in a straightforward manner. During inference, they first obtain the test video's embedding and then iterate through the entire training set to find the training sample whose video embedding is most similar to the test sample's. They then use the physics embedding of the identified training sample to guide the generation process.
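The retrieval step is easy to sketch; `train_video_emb` (one embedding per training clip) and `train_phys` (the matching physics parameters) are assumed precomputed, and cosine similarity is my own illustrative choice of metric:

```python
def retrieve_physics_prior(query_emb, train_video_emb, train_phys):
    # query_emb: (D,), train_video_emb: (N, D), train_phys: (N, K)
    sims = torch.nn.functional.cosine_similarity(
        query_emb.unsqueeze(0), train_video_emb, dim=-1)
    return train_phys[sims.argmax()]         # physics prior of the nearest clip
```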

3.2.3.3. Results

The table indicates that the diffusion model significantly outperforms the ConvNet and Transformer baselines, as shown in the top section. The lower section of the table suggests that incorporating robust physical priors into training is beneficial. As evidenced by the results, the model's performance improves with the addition of more comprehensive physical priors.

3.2.3.4. Discussions

I am not entirely convinced by their methodology. Why not project both video and audio embeddings into a multimodal space? Given that they already possess video and audio encoders, adding projection heads and employing contrastive learning—a technique that is both state-of-the-art and not overly complex—seems like a logical step. Why have they chosen not to pursue this approach (or at least as a baseline method)?

4. Potential Medical Applications

Diffusion models have the potential to help in almost any generative task within the medical domain. For example:

  • Medical image augmentation: As discussed in Section 3.2.1, diffusion models are capable of generating high-quality medical images regardless of the form of perturbation affecting the original images. Additionally, if diagnostic reports of the images are available, they can be utilized to inform the diffusion model during training or inference.
  • Drug design: While the use of graph neural networks to generate molecular structures is a common choice, diffusion models could potentially offer a more powerful alternative if an appropriate encoding, such as adjacency matrices, is found. Alternatively, exploring how to define a diffusion process on discrete data structures like graphs is a research avenue worth pursuing. Moreover, guiding the generation process with plugins similar to those introduced in Section 3.2.2 could steer the synthesized drugs towards desired characteristics. Or, given an existing molecule, one could generate similar molecules or perform partial modifications, as introduced in Sections 3.1.2 and 3.1.3.
  • Modifying the diffusion process for special interests: for example, generating vessels through which blood flows well via optimization during inference (as introduced in 3.1.1.2), or alternatively by introducing projectors and filters (as introduced in 3.2.2.2).

Challenges:

  • Cost: Diffusion models are iterative and therefore more expensive and not suitable for real-time scenarios.
  • Risk: Using generative models in clinical settings is risky.
  • Data representation: Defining diffusion models on non-trivial data forms (like graphs and TTP, etc.) is challenging (but also interesting).

5. Conclusions

Owing to their iterative generation process, diffusion models exhibit increased flexibility in utilizing prior knowledge and guidance, often providing such functionality without the need for additional training. Generally speaking, the main methods include:

  1. Setting the initial time and state for the denoising process,
  2. Modifying each step of the iteration.

This blog, limited by length, omits many significant aspects, including accelerated samplers (DDIM [10], DPM-Solver [11], etc.), parameter-efficient fine-tuning, neural network architectures, and evaluation metrics for diffusion models.

These various advantages have already yielded notable results in both academic and commercial domains and possess the potential for future applications in medicine.

References

[1] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in neural information processing systems 33 (2020): 6840-6851.

[2] Song, Yang, et al. "Score-based generative modeling through stochastic differential equations." arXiv preprint arXiv:2011.13456 (2020).

[3] Chen, Ricky TQ, et al. "Neural ordinary differential equations." Advances in neural information processing systems 31 (2018).

[4] Dhariwal, Prafulla, and Alexander Nichol. "Diffusion models beat gans on image synthesis." Advances in neural information processing systems 34 (2021): 8780-8794.

[5] Ho, Jonathan, and Tim Salimans. "Classifier-free diffusion guidance." arXiv preprint arXiv:2207.12598 (2022).

[6] Lugmayr, Andreas, et al. "Repaint: Inpainting using denoising diffusion probabilistic models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[7] Shu, Dule, Zijie Li, and Amir Barati Farimani. "A physics-informed diffusion model for high-fidelity flow field reconstruction." Journal of Computational Physics 478 (2023): 111972.

[8] Yuan, Ye, et al. "PhysDiff: Physics-guided human motion diffusion model." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[9] Su, Kun, et al. "Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[10] Song, Jiaming, Chenlin Meng, and Stefano Ermon. "Denoising diffusion implicit models." arXiv preprint arXiv:2010.02502 (2020).

[11] Lu, Cheng, et al. "Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps." Advances in Neural Information Processing Systems 35 (2022): 5775-5787.


