Blog post written by: Boyang ZHONG

Based on: Zhao, Z., Bai, H., Zhu, Y., Zhang, J., Xu, S., Zhang, Y., Zhang, K., Meng, D., Timofte, R., & Van Gool, L. (2023). DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion (arXiv:2303.06840). arXiv. https://doi.org/10.48550/arXiv.2303.06840


Introduction

In today's digital age, the ability to integrate and analyze data from multiple sources is more important than ever. From enhancing the clarity of satellite images to improving diagnostic accuracy in medical imaging, multi-modality data fusion has become a crucial tool across various fields. However, the challenge lies in effectively combining different types of data to create a cohesive and informative output. This blog post delves into a groundbreaking approach to this challenge—the Denoising Diffusion Model for Multi-Modality Image Fusion (DDFM). This innovative method offers a robust solution to the limitations faced by traditional and existing fusion techniques, promising significant advancements in both technology and application.

The Relevance of Multi-Modality Image Fusion

General Importance Across Fields

The fusion of multiple data modalities is not limited to one field but spans several domains, including remote sensing, medical imaging, and industrial quality control. For instance, in remote sensing, combining radar and optical satellite images can provide comprehensive information about the Earth's surface, aiding in disaster management and environmental monitoring. In industrial settings, fusing data from different sensors can improve defect detection and process monitoring.

Medical Imaging as a Key Example

Medical imaging is a cornerstone of modern diagnostics. Techniques like MRI and PET scans provide complementary information—MRI delivers high-resolution images of the body's internal structures, while PET scans show metabolic activity, which is crucial for identifying abnormalities like tumors or neurodegenerative diseases. By fusing these images, clinicians can obtain a more comprehensive view, combining anatomical and functional data. This is particularly important in early diagnosis and treatment planning for conditions such as Alzheimer's disease, where detecting subtle changes early on can make a significant difference in patient outcomes.

Method: The Heart of DDFM

The DDFM framework, proposed in the paper "DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion," leverages the strengths of denoising diffusion probabilistic models (DDPM)[1] to enhance the process of image fusion. This innovative approach consists of two main components: an Unconditional Generation Module and a Conditional Likelihood Rectification Module. Together, these modules work iteratively to refine and improve the fused image, ensuring that it retains critical information from each source while reducing artifacts and enhancing overall quality. The key to the algorithm is modeling the posterior distribution of the fused image, p(f | i, v) (with f the fused image, i the infrared image, and v the visible image), through the stepwise probabilistic estimation of the DDPM.
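In score terms, Bayes' rule splits this posterior at each diffusion step into exactly the two pieces that the two modules supply (written here in generic notation consistent with the paper's formulation):

% The prior score is supplied by the pre-trained DDPM (Unconditional Generation
% Module); the likelihood score by the Conditional Likelihood Rectification Module.
\nabla_{f_t} \log p_t(f_t \mid i, v) = \nabla_{f_t} \log p_t(f_t) + \nabla_{f_t} \log p_t(i, v \mid f_t)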

Unconditional Generation Module

The cornerstone of DDFM is the Unconditional Generation Module. This module generates initial fused images by learning the data distribution from the source images: it samples from the diffusion model to create plausible fused images that serve as a starting point for further refinement, ensuring that the initial fused image is a good representation of the combined information from the different modalities. The generation module is interpreted in the DDIM fashion[2], where the score function can be viewed as a denoiser that predicts the denoised estimate \tilde{x}_{0|t} from any state x_t at iteration t:

\tilde{x}_{0|t} = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t + (1 - \bar{\alpha}_t)\, s_\theta(x_t, t) \right),

where \tilde{x}_{0|t} denotes the estimate of x_0 given x_t. With this prediction and the current state x_t, the next state x_{t-1} is obtained from

x_{t-1} = \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\, \tilde{x}_{0|t} + \tilde{\sigma}_t z,

where z \sim \mathcal{N}(0, I) and \tilde{\sigma}_t^2 is the variance, usually set to 0. The sampled x_{t-1} is fed into the next sampling iteration until the final image x_0 is generated. Throughout, \alpha_t = 1 - \beta_t and \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s.
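The step above can be sketched in code as follows; this is a minimal illustration assuming PyTorch, with score_fn a hypothetical stand-in for the pre-trained score network s_θ (the names and signatures are ours, not the paper's):

import torch

def unconditional_step(x_t, t, score_fn, alphas, alpha_bars, betas):
    """One sampling step implementing the two equations above.

    score_fn(x_t, t) is a placeholder for the pre-trained score network
    s_theta; sigma_t is set to 0, as noted in the text.
    """
    a_bar_t = alpha_bars[t]
    a_bar_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)

    # Predicted denoised image x~_{0|t} from the current state and the score.
    x0_tilde = (x_t + (1.0 - a_bar_t) * score_fn(x_t, t)) / a_bar_t.sqrt()

    # Deterministic part of the x_{t-1} update (noise term dropped, sigma_t = 0).
    coef_xt = alphas[t].sqrt() * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    coef_x0 = a_bar_prev.sqrt() * betas[t] / (1.0 - a_bar_t)
    return coef_xt * x_t + coef_x0 * x0_tilde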

Conditional Likelihood Rectification Module

A Conditional Likelihood Rectification Module is then introduced to embed the infrared image i and the visible image v into the posterior distribution of the fused image; this probabilistic model lies at the core of sampling fused images. Once the initial fused images are generated, the Conditional Likelihood Rectification Module steps in to refine them: it uses a likelihood model to correct the generated images based on the observed input images, ensuring the final image closely matches the input data.

The derivation of the posterior starts from the objective function of the MMIF task: the optimization problem is converted into a likelihood maximization problem by introducing latent variables m and n, which yields the hierarchical Bayesian model presented in Fig 1.

Fig 1: Hierarchical Bayesian Model in Likelihood Rectification module
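Schematically, the conversion rests on two standard facts, sketched here in generic notation rather than the paper's exact derivation: an ℓ1 data term is equivalent to a Laplace noise model, and a Laplace density can be written as a Gaussian scale mixture over an exponentially distributed latent variance, which is how latent variables such as m and n enter the hierarchy:

% l1 penalty  <->  Laplace likelihood (up to normalizing constants):
\|y - x\|_1 \;\longleftrightarrow\; p(y \mid x) \propto \exp\left( -\|y - x\|_1 \right),
% each Laplace factor as a Gaussian scale mixture over a latent variance m_j:
\mathrm{Laplace}(y_j \mid x_j) = \int_0^\infty \mathcal{N}(y_j \mid x_j, m_j)\, p(m_j)\, \mathrm{d}m_j, \qquad m_j \sim \mathrm{Exponential}.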

Integration with the EM Algorithm

DDFM leverages the Expectation-Maximization (EM) algorithm to estimate the parameters iteratively, solving the likelihood maximization problem with latent variables and performing the model inference. The two EM steps are implemented as follows:

  • E-step: Compute the expected log-likelihood of the data given the current estimate of the parameters.

Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{x_t \mid x_0,\, \theta^{(t)}} \left[ \log p_\theta(x_0, x_t) \right]

  • M-step: Maximize this expected log-likelihood to update the parameters.

\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})

The EM algorithm refines the model parameters by iteratively alternating these two steps, leading to high-quality image fusion.
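To make the alternation concrete, here is a toy EM in Python for a two-component 1-D Gaussian mixture. It is deliberately not DDFM's closed-form fusion update (which the paper derives for its hierarchical model); it only illustrates the generic E-step/M-step pattern:

import numpy as np

def em_two_gaussians(x, iters=50):
    """Toy EM for a two-component 1-D Gaussian mixture (illustration only)."""
    rng = np.random.default_rng(0)
    mu = rng.choice(x, size=2)        # initial component means
    var = np.full(2, x.var())         # initial variances
    pi = np.full(2, 0.5)              # initial mixing weights

    for _ in range(iters):
        # E-step: responsibilities = posterior p(component k | x_i, theta^(t)).
        dens = np.stack([
            pi[k] / np.sqrt(2 * np.pi * var[k])
            * np.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
            for k in range(2)
        ])
        resp = dens / dens.sum(axis=0, keepdims=True)

        # M-step: closed-form maximizers of the expected log-likelihood Q.
        nk = resp.sum(axis=1)
        mu = (resp * x).sum(axis=1) / nk
        var = (resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk
        pi = nk / x.size
    return mu, var, pi

# Example: recover the two modes of a bimodal sample.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])
print(em_two_gaussians(data))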

Summary of DDFM workflow

The integration of these components into an iterative framework is what makes DDFM particularly powerful. By alternating between unconditional generation and conditional likelihood rectification, the DDFM algorithm ensures that the fused image is both realistic and accurately represents the information from the input images. Having briefly introduced each module above, we now depict the overall workflow of DDFM in Fig 2.

Fig 2: Computational graph of DDFM in one iteration[3]


Unconditional Diffusion Sampling (UDS) Module:

To elaborate further, the Unconditional Diffusion Sampling (UDS) module, as outlined in Algorithm 1[3], is divided into two main processes:

  1. Estimation of \tilde{f}_{0|t} from f_t:
    • The first part of the UDS module estimates the initial natural image prior \tilde{f}_{0|t} using the current state f_t.
  2. Estimation of f_{t-1} using both f_t and \hat{f}_{0|t}:
    • The second part of the UDS module updates the state to f_{t-1} using the current state f_t and the rectified estimate \hat{f}_{0|t} produced by the EM module described below.

From the perspective of score-based Denoising Diffusion Probabilistic Models (DDPM), a pre-trained DDPM can provide the natural image priors that improve the fused image's visual plausibility.

EM Module:

The EM module's role is to refine \tilde{f}_{0|t} into \hat{f}_{0|t}. This process is integrated within the UDS module, as illustrated in Algorithm 1 and Fig 2. Specifically, the EM algorithm (highlighted in blue and yellow) is nested within the UDS process (highlighted in grey).

  • Initial Estimation by DDPM Sampling:
    • The initial estimate \tilde{f}_{0|t} is generated by the DDPM sampling process (Algorithm 1, line 5). This estimate serves as the starting point for the EM algorithm.
  • Refinement via EM Algorithm:
    • The EM algorithm then refines \tilde{f}_{0|t} to produce \hat{f}_{0|t} (Algorithm 1, lines 6-13). This step performs the likelihood rectification, ensuring that the final fused image \hat{f}_{0|t} better meets the likelihood criteria.

In essence, the EM module performs the crucial task of rectifying \tilde{f}_{0|t} into \hat{f}_{0|t}, enhancing the accuracy and reliability of the fused image by aligning it more closely with the likelihood model, so that the fused image preserves as much information from the source images as possible.
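Putting the two modules together, one DDFM iteration can be sketched as below. This paraphrases Fig 2 and Algorithm 1 under our own naming: score_fn stands in for the pre-trained score network, and em_rectify for the EM updates of Algorithm 1, lines 6-13, whose internals are not reproduced here:

import torch

def ddfm_iteration(f_t, i, v, t, score_fn, em_rectify, alphas, alpha_bars, betas):
    """One DDFM iteration following Fig 2 / Algorithm 1 (sketch).

    score_fn and em_rectify are placeholders for the pre-trained score
    network and the EM rectification step, respectively.
    """
    a_bar_t = alpha_bars[t]
    a_bar_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)

    # UDS, part 1: estimate the natural-image prior f~_{0|t} from f_t.
    f0_tilde = (f_t + (1.0 - a_bar_t) * score_fn(f_t, t)) / a_bar_t.sqrt()

    # EM module: rectify f~_{0|t} against the infrared/visible likelihood.
    f0_hat = em_rectify(f0_tilde, i, v)

    # UDS, part 2: update to f_{t-1} from f_t and the rectified f^_{0|t}.
    coef_ft = alphas[t].sqrt() * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    coef_f0 = a_bar_prev.sqrt() * betas[t] / (1.0 - a_bar_t)
    return coef_ft * f_t + coef_f0 * f0_hat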

Algorithm 1: DDFM[3]



Experiments and Results: Proving the Power of DDFM

Experimental Setup

To evaluate the effectiveness of DDFM, the authors conducted experiments on two main tasks: IVF (infrared-visible image fusion) and MIF (medical image fusion). They used six metrics in total, including SSIM, VIF, and MI, to compare the performance of DDFM against state-of-the-art models[4]. The IVF task was tested on four datasets: TNO, RoadScene, MSRS, and M^3FD; the MIF experiments used 50 pairs of medical images from the Harvard Medical Image Dataset, including MRI-CT, MRI-PET, and MRI-SPECT image pairs. The selected metrics are described in Table 1.

Entropy (EN): a measure of the amount of information or randomness in an image.

Standard Deviation (SD): measures the amount of variation or dispersion of pixel intensity values in an image.

Mutual Information (MI): quantifies the amount of information obtained about one image through another image.

Visual Information Fidelity (VIF): measures the fidelity of visual information in the fused image relative to the source images, considering human visual perception.

Q^{AB/F}: measures the quality of the fused image based on its ability to retain edge information from the source images.

Structural Similarity Index Measure (SSIM): measures the structural similarity between the fused image and the source images, taking into account luminance, contrast, and structural information.

Table 1: Evaluation Metrics
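As a rough sketch, the histogram-based metrics EN and MI can be computed along the following lines (the binning and normalization conventions here are ours and may differ from the paper's evaluation code):

import numpy as np

def entropy(img, bins=256):
    """EN: Shannon entropy (bits) of the grey-level histogram (img in [0, 255])."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def mutual_information(a, b, bins=256):
    """MI: information shared between two images, via the joint histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return (pxy[nz] * np.log2(pxy[nz] / np.outer(px, py)[nz])).sum()

# SD is simply img.std(); the fusion MI score is typically
# MI(fused, infrared) + MI(fused, visible).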

Comparison with State-of-the-Art Models

In both the IVF and MIF tasks, the DDFM model showed superior performance across multiple metrics. For instance, in the IVF task, DDFM achieved higher metric scores than existing methods, indicating better structural similarity and signal quality. In the MIF task, DDFM significantly improved the VIF and MI scores, demonstrating its ability to preserve visual information and mutual information from the source images. The quantitative results of the IVF and MIF tasks are presented in Fig 3 and Fig 4, respectively, as radar charts generated from normalized evaluation metrics to give an intuitive, broad view of DDFM's superiority.
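A radar chart of this kind can be generated roughly as follows; the metric values below are illustrative placeholders, not numbers from the paper:

import numpy as np
import matplotlib.pyplot as plt

def radar_chart(metrics, scores_by_method):
    """Plot min-max normalized metric scores on a radar chart.

    metrics: list of metric names; scores_by_method: method name -> raw
    scores in the same order. Per-metric normalization lets differently
    scaled metrics share one axis.
    """
    raw = np.array(list(scores_by_method.values()), dtype=float)
    lo, hi = raw.min(axis=0), raw.max(axis=0)
    norm = (raw - lo) / np.where(hi > lo, hi - lo, 1.0)

    angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False)
    angles = np.concatenate([angles, angles[:1]])      # close the polygon

    ax = plt.subplot(polar=True)
    for name, row in zip(scores_by_method, norm):
        row = np.concatenate([row, row[:1]])
        ax.plot(angles, row, label=name)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(metrics)
    ax.legend(loc="lower right")
    plt.show()

radar_chart(["EN", "SD", "MI", "VIF", "Qabf", "SSIM"],
            {"DDFM": [7.4, 43.0, 2.7, 0.8, 0.6, 0.9],       # illustrative values,
             "Baseline": [6.9, 45.0, 2.1, 0.6, 0.5, 0.8]})  # not from the paper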

Fig 3: Quantitative results of the IVF task on the MSRS, M^3FD, RoadScene, and TNO fusion datasets

Fig 4: Quantitative results of the MIF task

To showcase the qualitative performance of DDFM, we present three test examples from the RoadScene Dataset and a pair of MRI and PET-FDG images of an early-stage Alzheimer's patient in Fig 5.

In the IVF task, DDFM excels at combining the thermal radiation information from infrared images with the detailed texture information from visible images. This fusion results in several key improvements:

  • Enhanced Object Detection:  Objects in dimly lit environments are significantly accentuated, making it easier to distinguish foreground objects from the background.
  • Improved Background Features:  Background features that were previously indistinct due to low illumination now have clearly defined edges and rich contour information. This enhancement greatly improves our ability to comprehend the scene as a whole.

Regarding the MIF task, DDFM demonstrates its capability to retain intricate textures while emphasizing structural information. The fusion of MRI and PET-FDG images of an early-stage Alzheimer's patient illustrates this well:

  • Texture and Structure Retention:  DDFM effectively retains the fine textures from the MRI images while highlighting the structural information from the PET-FDG images.
  • Performance Across Metrics:  This balanced approach leads to outstanding performance not only visually but also across almost all numerical metrics, underscoring the model's robustness and effectiveness in multi-modality image fusion.

Fig 5: Visual comparison for the IVF task (infrared image / visible image / fused image) and the MIF task (MR-T1 / PET-FDG / fused image)

Ablation Studies

Ablation studies were conducted to verify the importance of each module in the DDFM framework. The results showed that each component significantly contributes to the overall performance, validating the necessity of the full DDFM model for optimal fusion results. Removing or altering any component led to a noticeable drop in quality, highlighting the critical role each part plays in the fusion process. Several experimental groups were conducted, as summarized in Table 2.

Experiment | Module | Content | Goal
Exp. I | UDS | Eliminate the denoising diffusion generative framework; only the EM algorithm is employed to solve the optimization | Verify the effectiveness of DDPM
Exp. II | EM | Remove the total variation penalty term r(x) in Eq. (13) | Verify the components of the EM module
Exp. III | EM | Remove the Bayesian inference model | Verify the components of the EM module
Exp. IV | EM | Manually set φ to 0.1; manually set φ to 1 | Verify the components of the EM module

Tab 2: Ablation experiment groups

The results presented in Table 3 demonstrate that none of the experimental groups achieves fusion results comparable to the full DDFM, further emphasizing the effectiveness and rationality of the approach.

Table 3: Ablation study results[3]

Conclusion and Discussion: A Leap Forward in Image Fusion

Significant Advancements in Image Fusion

The DDFM model represents a significant advancement in the field of multi-modality image fusion. By leveraging the strengths of denoising diffusion models, DDFM offers a robust and effective solution for combining images from different modalities. This approach enhances the quality and reliability of fused images, making it particularly valuable in applications requiring high precision and detail, such as medical diagnostics and remote sensing.

Strengths of the DDFM Model

  1. Innovative Use of Diffusion Models:

    • The application of denoising diffusion probabilistic models (DDPMs) to image fusion is novel and demonstrates the versatility of diffusion models beyond their traditional use cases. This innovation could pave the way for new methodologies in data integration and analysis.
  2. Enhanced Image Quality:

    • The iterative framework combining unconditional generation and conditional likelihood rectification ensures that the fused images are both realistic and accurately represent the information from the input images. This dual-step approach mitigates common issues such as artifacts and loss of critical details.
  3. Robust Performance Across Metrics:

    • Extensive experiments show that DDFM outperforms state-of-the-art models across multiple metrics, including SSIM, VIF, MI, and Q^{AB/F}. This consistent performance across diverse metrics highlights the robustness and reliability of the DDFM approach.
  4. Versatility, Flexibility and Interpretability:

    • Interpretable generation process
    • Inference-only method, capable of dealing with various tasks without specific fine-tuning
    • While the paper focuses on medical imaging, the DDFM model's underlying principles can be applied to various domains, such as remote sensing, industrial inspection, and even creative fields like digital art and photography.

Areas for Further Improvement

  1. Computational Efficiency:

    • One of the main challenges with advanced deep learning models, including DDFM, is their computational demand. Optimizing the computational efficiency of DDFM could make it more accessible for real-time applications and resource-constrained environments.
  2. Parameter Sensitivity:

    • The performance of DDFM may be sensitive to the choice of hyperparameters. Developing adaptive or automated hyperparameter tuning methods could enhance its usability and robustness, ensuring optimal performance across different datasets and tasks.
  3. Broader Application and Validation:

    • Further studies could explore the application of DDFM in other domains beyond medical imaging. Validating its effectiveness in fields like environmental monitoring, autonomous driving, and security could demonstrate its broader utility and impact.

Future Directions

  1. Integration with Clinical Workflows:

    • For medical imaging applications, integrating DDFM into clinical workflows and evaluating its impact on diagnostic accuracy and decision-making would be valuable. Real-world trials and collaborations with healthcare providers could provide insights into its practical benefits and challenges.
  2. Exploring Multi-Scale Fusion:

    • Investigating multi-scale approaches within the DDFM framework could enhance its ability to capture details at different resolutions, further improving the quality of fused images. Multi-scale fusion could be particularly beneficial in applications requiring high levels of detail, such as pathology and geospatial analysis.
  3. Hybrid Models:

    • Combining DDFM with other advanced techniques, such as generative adversarial networks (GANs) or variational autoencoders (VAEs), could yield hybrid models that leverage the strengths of multiple approaches. These hybrid models could offer even greater improvements in image fusion quality and versatility.

Discussion

The DDFM model stands out as a remarkable innovation. Its ability to integrate and refine multi-modality images within a probabilistic framework is both novel and highly effective. The approach exploits the power of a pre-trained diffusion model, building a probabilistic model of the fused image by integrating likelihood rectification into the original iterative sampling scheme; it can therefore tackle fusion tasks without task-specific fine-tuning, providing a flexible and efficient solution for image fusion. However, the journey does not end here: the areas for improvement identified in this discussion highlight ongoing challenges and opportunities for further research and development.

The DDFM model sets a new standard in image fusion, demonstrating the potential of diffusion models in this context. It opens up exciting possibilities for enhancing various applications, from medical diagnostics to environmental monitoring. As technology advances and computational methods evolve, I look forward to seeing how the DDFM model and its underlying principles continue to shape the future of multi-modality image fusion.

In conclusion, the DDFM model represents a significant leap forward, offering a promising solution to a complex problem. Its innovative approach, robust performance, and potential for broad applicability make it a valuable contribution to the field. As researchers and practitioners explore and refine this model, we can expect to see continued advancements and new breakthroughs in multi-modality image fusion.

List of Abbreviations

Positron Emission Tomography (PET) 

Infrared-Visible image Fusion (IVF)

Medical Image Fusion (MIF)

Multimodality Image Fusion (MMIF)

Denoising Diffusion Model for Multi-Modality image Fusion (DDFM)

Fusion Generative Adversarial Network (FGAN)

Guided Multi-Modal Convolutional Network (GMcC)

Unsupervised Two-Stream Fusion Network (U2F)

Residual Fusion Network (RFN)

Target Detection Assisted Deep Adversarial Learning (TarDAL)

Deep Exposure Fusion (DEF)

Unsupervised Multi-scale Fusion (UMF)

Entropy (EN)

Standard Deviation (SD)

Mutual Information (MI)

Visual Information Fidelity (VIF)

Structural Similarity Index Measure (SSIM)

ChatGPT prompts

In preparing this blog post, ChatGPT was used to assist in brainstorming, writing support, and ensuring grammatical accuracy. Key prompts used include:

  • "Explain the significance of multi-modality image fusion in medical imaging." 
  • "Describe the methodology of the DDFM model as an expert."
  • "Summarize the experimental results of the DDFM paper." <image question with table>
  • "Can you provide me the latex code macro for this equation?" <image question>
  • "As an expert in MMIF, could you try to make the derivation process of the likelihood model more concise and precise to the public?"
  • "You are an expert in data analytics, could you give me advice on using which diagram technique to visualize the quantitative metrics result?"
  • "There exists an imbalance in numerical values across different dimensions of the metric values, should I continue using radar chart and should I conduct normalizations on the data?"

References

[1] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. Retrieved July 5, 2024, from https://readpaper.com/pdf-annotate/note?pdfId=656289675895246848&noteId=2388373010987468032

[2] Song, J., Meng, C., & Ermon, S. (2020). Denoising Diffusion Implicit Models. Retrieved July 5, 2024, from https://readpaper.com/pdf-annotate/note?pdfId=4675937509968986113&noteId=2388344772617668352

[3] Zhao, Z., Bai, H., Zhu, Y., Zhang, J., Xu, S., Zhang, Y., Zhang, K., Meng, D., Timofte, R., & Van Gool, L. (2023). DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion (arXiv:2303.06840). arXiv. https://doi.org/10.48550/arXiv.2303.06840

[4] James, A. P., & Dasarathy, B. V. (2014). Medical image fusion: A survey of the state of the art. Retrieved July 4, 2024, from https://readpaper.com/pdf-annotate/note?pdfId=4498142341101150209&noteId=2359714857374532096


