11: SADM: Sequence-Aware Diffusion Model for Longitudinal Medical Image Generation

Blog post written by: Ivan Chzhao
Based on: Jee Seok Yoon, Chenghao Zhang, Heung-Il Suk, Jia Guo, Xiaoxiao Li. SADM: Sequence-Aware Diffusion Model for Longitudinal Medical Image Generation.
Proceedings of Information Processing in Medical Imaging, 2023, pp. 388-400, https://arxiv.org/abs/2212.08228.

Introduction

The paper introduces the Sequence-Aware Diffusion Model (SADM), a novel framework designed to generate high-quality longitudinal medical images. Longitudinal images, which consist of sequences of scans taken over time from the same patient, are crucial for monitoring disease progression and evaluating treatment responses. SADM leverages sophisticated techniques such as attention mechanisms, 3D convolutions, and autoregressive sampling to ensure that the generated images maintain both temporal and spatial coherence.

Method Description

The architecture of SADM integrates several advanced components to enhance its performance. At the core of the model is the attention mechanism, which includes self-attention and multi-head attention. Self-attention allows the model to capture dependencies within the same image sequence, enabling it to focus on the most relevant parts of the input data. Multi-head attention, on the other hand, enables the model to capture different types of dependencies in parallel, enriching the generated images with detailed and nuanced features.

In addition to the attention mechanism, SADM employs 3D convolutions to process volumetric data, which is essential for capturing spatial features within each image as well as temporal changes across the sequence of images. This ensures that the anatomical structures in the generated images are both accurate and detailed, providing high-resolution outputs that are vital for clinical applications.

A key feature of SADM is its use of autoregressive sampling for sequential generation. In this approach, each image in the sequence is generated based on previously generated images, which ensures temporal consistency and a realistic progression of medical conditions. The diffusion process, which involves the controlled addition and removal of noise, further refines the images iteratively, enhancing their quality.

Another innovative aspect of SADM is the classifier-free guidance technique. This approach combines conditional and unconditional predictions, balancing specific guidance from input signals with the flexibility of exploratory generation. By reducing the reliance on explicit classifiers, the model improves its adaptability and robustness across various scenarios.

In this article, we will not delve into all the specifics but will focus on the core component of the Sequence-Aware Diffusion Model (SADM) – the Attention mechanism, and will also discuss the concept of 3D Convolution.

Self-Attention Mechanism

The self-attention mechanism is a powerful method used within the Sequence-Aware Diffusion Model (SADM) to dynamically focus on different parts of the input data. This mechanism is crucial for capturing complex dependencies both within and across sequences of longitudinal medical images. In SADM, self-attention ensures that the generated images maintain temporal and spatial coherence, providing high-quality outputs essential for medical analysis.

Key Components of Self-Attention

The self-attention mechanism operates by transforming the input data into three distinct sets of vectors: queries, keys, and values. Each of these vectors plays a critical role in determining the importance of different parts of the input sequence.

Queries (Q): These vectors represent the elements for which the model seeks to determine relevance to other elements.
Keys (K): These vectors are used to match against the queries to compute relevance scores.
Values (V): These vectors contain the actual information that will be weighted and summed to produce the final output.

The Mechanism of Self-Attention

Input Representation:

The input to the self-attention mechanism is a sequence of tokens, where each token represents a specific part of the input data. In the context of SADM, these tokens could represent different time points in a sequence of medical images or different spatial locations within an image.

Linear Transformations:

The input tokens are linearly transformed into queries, keys, and values using learned weight matrices:

$\begin{array}{l}\displaystyle Q=XW_Q,\; K=XW_K, \; V=XW_V\end{array}$

Here, X is the input sequence, and $\begin{array}{l}W_Q, W_K\end{array}$ and $\begin{array}{l}W_V\end{array}$ are the weight matrices.

Dot-Product Calculation:

For each query, the model calculates the dot product with all keys to determine a similarity score. This score indicates how relevant each key is to the given query:

$\begin{array}{l}\displaystyle score(Q_i,K_j)=Q_i⋅K_j\end{array}$

Scaling:

To stabilize the gradients and improve numerical stability, the dot product scores are scaled by the square root of the dimension of the keys ( $\begin{array}{l}d_k\end{array}$ ):

$\begin{array}{l}\displaystyle {score}(Q_i, K_j) = \frac{Q_i \cdot K_j}{\sqrt{d_k}}\end{array}$

Softmax Normalization:

The raw attention scores are then normalized using the softmax function to convert them into probabilities:

$\begin{array}{l}\displaystyle \alpha_{ij} = \cfrac{exp(\text{score}(Q_i, K_j))}{\sum_{k} exp(\text{score}(Q_i, K_k))}\end{array}$

This normalization ensures that the attention scores for each query sum to 1, making them interpretable as probabilities.

Weighted Sum of Values:

Each value is weighted by the corresponding attention score, and a weighted sum is computed to produce the final output for each query:

$\begin{array}{l}\displaystyle {output}_i = \sum_j \alpha_{ij} V_j\end{array}$

In the context of SADM, the self-attention mechanism is applied in two primary ways: temporal attention and spatial attention. Temporal attention captures dependencies across different time points in longitudinal medical images. Each token, representing a specific time point, attends to all other tokens from different time points. This allows the model to understand how features evolve over time, ensuring that the generated images reflect a realistic temporal progression.

Spatial attention, on the other hand, focuses on dependencies within the same image across different spatial locations. Each token, representing a specific spatial location within an image, attends to all other tokens within the same image. This helps maintain spatial coherence, preserving important anatomical structures and details.

Overall, the self-attention mechanism enables the Sequence-Aware Diffusion Model to dynamically focus on the most relevant parts of the input data, capturing complex dependencies and ensuring that the generated images are both temporally consistent and spatially coherent. This makes the self-attention mechanism an indispensable component of SADM, contributing significantly to its ability to generate high-quality longitudinal medical images.

3D Convolution Mechanism

In the Sequence-Aware Diffusion Model (SADM), 3D convolution is a crucial technique used for processing and analyzing volumetric medical images, such as MRI and CT scans. Unlike traditional 2D convolutions that operate on two-dimensional images, 3D convolutions extend the operation into three dimensions, enabling the model to capture both spatial and temporal dependencies within the data. This is particularly important for longitudinal medical imaging, where sequences of scans over time provide insights into disease progression and treatment response.

3D convolution involves a three-dimensional filter (or kernel) that slides across the input volume in three dimensions: depth, height, and width. At each position, the filter performs element-wise multiplications with the corresponding input values and sums the results to produce a single output value. This operation is repeated across the entire input volume, generating a new volumetric output (or feature map) that captures local patterns in three dimensions.

Key Components of 3D Convolution

3D Kernel (Filter):

The 3D kernel is a small, three-dimensional matrix of weights that is learned during the training process. It has dimensions $\begin{array}{l}k_d \times k_h \times k_w\end{array}$ (depth, height, width).

Stride:

The stride determines how much the filter moves at each step along the depth, height, and width dimensions. Larger strides reduce the spatial dimensions of the output, while smaller strides preserve more detail.

Padding:

Padding involves adding zeros around the borders of the input volume to control the spatial dimensions of the output. This can help preserve the original size of the input volume in the output feature map.

Example

Consider a 3D convolution with a $\begin{array}{l}3 \times 3 \times 3\end{array}$ kernel applied to an input volume of size $\begin{array}{l}10 \times 10 \times 10\end{array}$ with a stride of 1 and no padding. The output volume will have dimensions $\begin{array}{l}8 \times 8 \times 8\end{array}$ , calculated as:

$\begin{array}{l}\displaystyle \text{Output Dimension} = \frac{\text{Input Dimension} - \text{Kernel Dimension}}{\text{Stride}} + 1\end{array}$

This operation captures local 3D patterns within the input volume, providing richer feature representations compared to 2D convolutions.

In the Sequence-Aware Diffusion Model (SADM), 3D convolutions are integral for processing sequences of volumetric medical images, effectively capturing spatial structures within each image and temporal changes across sequences. This dual capability ensures the generation of high-quality, consistent images over time.

By leveraging 3D convolutions, the model can identify detailed spatial features, including anatomical structures and potential anomalies, ensuring that generated images are both realistic and medically relevant. Additionally, 3D convolutions capture the evolution of these spatial features over time, which is crucial for monitoring disease progression and treatment response, providing a dynamic and comprehensive view of the patient’s condition.

This fundamental technique in SADM ensures the generation of temporally consistent and spatially coherent images, making it essential for clinical applications where detailed and accurate anatomical representations are necessary for effective diagnosis and treatment monitoring

Experiments

Datasets

Researchers utilized two primary datasets for evaluating the Sequence-Aware Diffusion Model: a Cardiac MRI dataset and a Brain MRI dataset. The Cardiac MRI dataset, sourced from the Automated Cardiac Diagnosis Challenge (ACDC), included MRI scans from 100 training subjects and 50 testing subjects. The primary task was to synthesize the final frame of a cardiac cycle (end-systolic or ES) given the initial frame (end-diastolic or ED). Preprocessing involved resizing and normalizing the intermediate frames from ED to ES to ensure consistency.

The Brain MRI dataset was an in-house synthesized dataset consisting of 2,851 subject scans across various age groups. The objective was to simulate longitudinal brain images to study aging. The scans were divided into age groups, age-specific templates were generated, and all scans were registered to these templates.

Evaluation Metrics

The model's performance was evaluated using three key metrics:

Structural Similarity Index (SSIM): This metric measures the similarity between the synthesized images and the ground truth images. Higher values, closer to 1, indicate better similarity and image quality.
Peak Signal-to-Noise Ratio (PSNR): This metric assesses the quality of the synthesized images by comparing the maximum possible signal to the noise. Higher values, typically above 30 dB, signify good image quality.
Normalized Root Mean Square Error (NRMSE): This metric measures the deviation between the synthesized images and the ground truth images. Lower values are better, indicating more accurate synthesis.

Quantitative comparisons between the baseline models and the proposed SADM are illustrated in the Figure 1.

Figure 1: Quantitative comparisons between the baseline models and the proposed SADM

Ablation Study

The ablation study conducted in the paper aimed to assess the impact of different components of the Sequence-Aware Diffusion Model. By systematically removing individual components, such as the diffusion process and the attention mechanism, the study evaluated their contributions to the overall model performance.

The model was tested under three different settings for input sequences:

Single Image: Only the ED frame was given as input.
Missing Data: The input sequence had randomly missing frames.
Full Sequence: A fully loaded input sequence with all conditioning images.

The results in Fig.1,2 indicated that the Sequence-Aware Diffusion Model (SADM) outperformed the baseline models in generating high-quality longitudinal medical images. The SADM showed superior performance compared to GAN-based and diffusion-based models. In addition, the diffusion component of SADM contributed significantly more to the model's performance than the attention mechanism. This indicates that while the attention mechanism aids in refining the model's outputs, the diffusion process is crucial for capturing the complex temporal and spatial dependencies inherent in longitudinal medical imaging.

Figure 2: An ablation study of SADM components with single image, missing data, and full sequence settings using the cardiac dataset

Discussion

The Sequence-Aware Diffusion Model (SADM) offers significant advancements in generating high-quality longitudinal medical images by leveraging 3D convolutions and attention mechanisms. These features allow the model to capture both spatial and temporal dependencies, ensuring the generated images are temporally consistent and spatially coherent. This makes SADM particularly valuable for clinical applications such as monitoring disease progression and evaluating treatment responses.

However, the method has certain disadvantages. One of the primary challenges is the high computational cost associated with 3D convolutions and attention mechanisms. These processes require substantial computational resources, which may limit the model's practicality in environments with limited access to high-performance computing infrastructure. Additionally, the model's performance is highly dependent on the quality and quantity of the training data. In scenarios where high-quality longitudinal medical data is scarce, the model may not perform optimally, potentially limiting its generalizability and robustness across diverse medical imaging tasks.

Another limitation is the complexity of the model, which may present challenges in implementation and maintenance. The integration of multiple sophisticated components, such as 3D convolutions and attention mechanisms, necessitates a high level of expertise for proper tuning and operation. This complexity can be a barrier to widespread adoption in clinical settings, where simplicity and ease of use are critical.

Future research might focus on optimizing the efficiency of SADM, exploring methods to reduce computational demands without compromising image quality. Additionally, developing strategies to enhance the model's robustness and adaptability to varying data quality and quantity will be essential for broader application. By addressing these limitations, the SADM can be further refined to provide even more reliable and accessible solutions for medical imaging challenges.

Appendix

List of Abbreviations

SADM: Sequence-Aware Diffusion Model;
MRI: Magnetic Resonance Imaging;
ACDC: Automated Cardiac Diagnosis Challenge;
ES: End-Systolic;
ED: End-Diastolic;
SSIM: Structural Similarity Index Measure;
PSNR: Peak Signal-to-Noise Ratio;
NRMSE: Normalized Root Mean Square Error;
GAN: Generative Adversarial Network.

ChatGPT Prompts

Explain in details about <something>;
How does <temporal encoder> work;
What is the purpose of <MLP> in the model;
What is the relationship between classifier-free guidance and autoregressive sampling;
What does <ϵ> mean in equations of the diffusion model;
What conclusions can you draw from this paper;
What is the difference between Diffusion Model with Deep Registration and Diffusion Model in SADM;
What disadvantages of SADM can you name?

References

1. Arnab, A., et al.: Vivit: A video vision transformer (2021)

2. Ho, J., Salimans, T.: Classifier-free diffusion guidance (2022)

3. Ho, J., et al.: Imagen video: High definition video generation with diffusion models (2022)

4. Kingma, D., et al.: Variational diffusion models. In: Advances in Neural Information Processing Systems. vol. 34, pp. 21696–21707 (2021)

5. Yi, X., Walia, E., Babyn, P.: Generative adversarial network in medical imaging: A review. Medical Image Analysis 58, 101552 (dec 2019)

6. Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. In: Advances in Neural Information Processing Systems. vol. 34, pp. 8780–8794 (2021)

Seitenhierarchie