Abstract:
In this blog post, we provide an overview of recent image-to-image translation diffusion models. We present three distinct approaches, along with their results and limitations.

Introduction and Motivation

1- Definitions

  • Translation: In the field of computer vision, translation refers to the process of converting images from a given source domain to a desired target domain, while ensuring that the semantic content of the images remains intact. By different domains, we refer to distinct visual attributes such as day/night, grayscale/color, and so forth.


                                                                      Figure 1: Examples of image domains [1]

  • Diffusion: A stochastic process that describes the evolution of a random variable over time. In image-to-image translation, diffusion models gradually corrupt an image with noise (the forward process) and learn to reverse this corruption (the reverse process), thereby generating novel images.

                                                                      Figure 2: Overview of the diffusion process

                                                                      Source:  https://medium.com/@steinsfu/diffusion-model-clearly-explained-cd331bd41166
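To make the forward process concrete, below is a minimal sketch of the DDPM-style noising step; the variable names (`alpha_bar`, etc.) are illustrative assumptions, not taken from any of the reviewed papers.

```python
import torch

def forward_diffusion(x0, t, alpha_bar):
    """Minimal sketch of the DDPM-style forward (noising) step.

    x0:        clean image tensor, shape (N, C, H, W)
    t:         integer timestep
    alpha_bar: 1-D tensor of cumulative products of (1 - beta_t)
    """
    eps = torch.randn_like(x0)  # Gaussian noise
    # Interpolate between the clean image and pure noise at timestep t
    x_t = alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * eps
    return x_t, eps  # noisy image and the noise target for training
```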


The diffusion models reviewed here exhibit superior performance compared to previously dominant generative models such as Generative Adversarial Networks (GANs), CycleGANs, and Variational Autoencoders (VAEs).

2- Application in the medical field


This section addresses the potential applications of such translations in the medical domain.

Why? 

To begin, we observe that medical images can be obtained via different modalities such as X-ray, Computed Tomography (CT), Magnetic Resonance Imaging (MRI), and Ultrasound (US). Every imaging modality possesses distinct visual characteristics, and within some modalities there are different sequences (T1, FLAIR, ...) that enable specific tissue contrasts, thereby emphasizing various aspects of the anatomy.

Since the most noticeable characteristic of medical images is their scarce availability, diffusion models can facilitate the translation between an existing modality and a missing one, thereby augmenting the data. Synthetic images can provide supplementary training samples, enlarging the dataset, addressing imbalanced data distributions, and mitigating data scarcity.

Furthermore, clinicians can utilize these simulated images as a reference to aid in the diagnosis and comprehension of various disorders. The created images can provide additional visual information, allowing them to make more informed decisions and deliver better patient care. 

However...Precautions!

Nonetheless, the implementation of translation in the medical domain ought to adhere to specific criteria.

On the one hand, the procedure should keep the overall structure and anatomical features intact, including the sizes and shapes of organs and tissues and the structure of the original image. On the other hand, the process must be robust to the noise and variations commonly found in medical imaging, such as differences in acquisition protocols, patient motion, and scanner and equipment variations. Furthermore, given that medical images contain confidential patient data, it is crucial to guarantee data privacy throughout the translation process.

Methodology


This section examines three distinct papers that introduce three diverse diffusion-based model architectures. The analysis includes a description of each model, a presentation of the obtained results, and personal feedback.

1- Dual Diffusion Implicit Bridges

The first paper introduced Dual Diffusion Implicit Bridges (DDIB) [2]. 

Model

The dual diffusion implicit bridges approach utilizes two separate diffusion models: one transforms source images into a latent space, and the other generates target images from that latent space. The term "dual diffusion" derives from this characteristic; by "implicit bridges", the authors specifically refer to Schrödinger bridges.

The term "Schrödinger bridges" is employed metaphorically in this context. The researchers have successfully illustrated in their study that, under specific conditions, the translation between the source and latent distributions, as well as between the latent and target distributions, can be accurately described by two diffusion models, that have the properties and definition of two Schrödinger bridges.

To solve the ordinary differential equations (ODEs) that define the data distributions, the model relies on the probability flow (PF) ODE framework, which incorporates probabilistic measurements. In the PF ODE, the probability density function (PDF) is commonly discretized and then updated according to the ODE system's differential equations. Thus, it evolves over time in line with the dynamics of the underlying ODE system as this transport equation is solved numerically.

To learn more about the math behind this comparison, please refer to Section 2.2 of [2], under "Schrödinger Bridge Problem".

Upon concluding their mathematical analysis, the authors demonstrated that flowing through the PF ODEs in DDIBs is equivalent to flowing through Schrödinger bridges, a form of entropy-regularized optimal transport, which explains the efficacy of the method.

Consequently, the authors present the high-level structure of the method as follows (a minimal code sketch follows the list):

  1. DDIBs first apply ODESolve in the source domain to obtain the encoding $x^{(s)}$ of the image at the end time t = 1.
  2. We refer to $x^{(s)}$ as the latent code.
  3. The source latent code is fed as the initial condition to ODESolve with the target model $v_\theta^{(t)}$ to obtain the target image $x^{(t)}$.
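The following is a minimal sketch of this two-step procedure in Python; `ode_solve` and the model arguments are assumed placeholders rather than the authors' actual API.

```python
def ddib_translate(x_src, ode_solve, v_src, v_tgt):
    """Sketch of DDIB translation [2]; not the authors' actual code.

    ode_solve(x, model, t0, t1) is an assumed helper that integrates
    the probability flow ODE of `model` from time t0 to t1; v_src and
    v_tgt are pre-trained diffusion models for the two domains.
    """
    # Step 1: encode the source image into the latent space at t = 1
    x_lat = ode_solve(x_src, v_src, t0=0.0, t1=1.0)
    # Step 2: decode the latent code with the target model back to t = 0
    x_tgt = ode_solve(x_lat, v_tgt, t0=1.0, t1=0.0)
    return x_tgt
```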


                                                    Figure 3: Visualisation of the dual models and the diffusion process [2]

Results

Despite their simplicity, DDIBs have shown significant advantages over previous techniques, which I will address below.


  • Cycle Consistency: The cycle consistency property is guaranteed because PF ODEs are employed in the DDIBs. The framework ensures that when a data point is translated from the source domain to the target domain and then back, its original state in the source domain is recovered (see the round-trip sketch after this list). Compared to GAN-based models, which must include additional training terms to optimize for cycle consistency, DDIBs need no extra conditioning or optimization to enforce it.
  • Multi-domain Translation: The utilization of DDIBs extends beyond paired domain translation, as it enables translation between any source-target domain pair without requiring additional fine-tuning or adaptation, relying only on a single pre-trained conditional diffusion model. The figure below shows the transformation of a single image featuring a roaring lion into multiple different ImageNet classes.
  • Data privacy: The training procedure of DDIBs can be executed in a privacy-sensitive manner, as the training of the two diffusion models does not rely on any prior knowledge of the domain pair. In the illustrated example below, Alice and Bob are the owners of the source and target domains. Alice trains a diffusion model on the source data, encodes the data into the latent space, and transmits it to Bob. Bob then applies his trained diffusion model to the latent codes and returns the results to Alice. Neither party discloses its data directly.
Figures: Cycle consistency [2] · Multi-domain translation [2] · Data privacy [2]
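Assuming the `ddib_translate` sketch above (and `torch` as imported earlier), the cycle-consistency property can be illustrated as a round trip whose reconstruction error is bounded only by the ODE solver's discretization error:

```python
# Round trip: source -> target -> source (continuing the sketch above)
x_tgt = ddib_translate(x_src, ode_solve, v_src, v_tgt)
x_rec = ddib_translate(x_tgt, ode_solve, v_tgt, v_src)
assert torch.allclose(x_src, x_rec, atol=1e-2)  # equal up to solver error
```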


Review

Strengths: The authors explain in depth the Schrödinger bridge problem, the probability flow ODEs, and how the ODE solver helps solve the Schrödinger bridge problem. Though their method is simple, it is original and presents promising advantages. For me, the data privacy property comes first, as it is important to preserve patients' personal information when conducting image translation between different clinics or research centers.

Weaknesses: However, one can notice some occasional strange translation behavior: for example, in the multi-domain translation, apart from the tiger and jaguar, all other animals tend to show their tongue instead of roaring. Another limitation of the study is the absence of actual proof to support its discussion of data privacy in medical images. While natural images contain varied foreground and background contexts even within one class, medical image translation focuses on detail preservation. Therefore, it is not clear whether such a separate training method is capable of handling the task.

2- Diffusion-based Translation Using Disentangled Style and Content Representation

The authors of [3] introduce a new unsupervised image translation technique called DiffuseIT, which is based on the diffusion process and utilizes disentangled representations of style and content. Here, disentangled style denotes the model's capacity to separate and control the stylistic characteristics of the image independently, whereas content representation refers to representing high-level features in a manner that distinguishes them from unnecessary details. The proposed model is suitable for both image-guided and text-guided translation tasks.

Model

To build the overall architecture of DiffuseIT, the authors augment the forward and reverse diffusion processes with a pre-trained DINO ViT [4] model to guide the generation process, and they also propose novel loss functions.

Following a systematic addition of Gaussian noise to the training data (forward diffusion), the denoising process learns to restore the data and subsequently transfer it to the target domain. In this context, the authors guide the denoising procedure using the pre-trained DINO ViT model and define the overall loss function, which comprises five distinct components.

As demonstrated in the paper, the pre-trained DINO ViT succeeds in separating the semantic and structural information of the image, where:

  • the keys $k^l$ of the multi-head self-attention layer contain the structure information,
  • and the CLS token of the last layer contains the semantic information of the same image.

To guide the learning, the authors introduce five losses. With the help of Figure 4 below, we describe the role of each loss.

                                                                                      Figure 4: Visualisation of the losses in DiffuseIT [3]

  1. Structure loss: also known as the InfoNCE loss, $l_{cont}$ is derived from the patch contrastive loss [5]. It compares the structure of the input and output images by checking for similarities between their keys. The authors include a regularization term that pulls keys at the same positions closer together while maximizing the distances between keys at different positions (a minimal sketch follows the list).
    📍 Minimizing this loss maintains the structure during reverse diffusion.
  2. Semantic style loss: $l_{sty}$ denotes the loss between the target tokens and the tokens of the denoised image at time step $t$, which dictates the style. A regularized MSE loss is introduced to avoid misaligning the color values as well.
    📍 Minimizing $l_{sty}$ guides the diffusion process to match the semantics of the target image.
  3. CLIP loss: the $l_{clip}$ loss is inspired by CLIP-guided diffusion [6]. When basing the translation on text, however, the authors make minor alterations to compensate for poor image quality. This loss is introduced in the case of text-guided image translation.
  4. Semantic divergence loss: by maximizing $l_{sem}$, the model makes the semantic information of two denoised images as dissimilar as possible.
    📍 This speeds up the diffusion process.
  5. Regularization loss: $l_{reg}$ prevents irregular steps in the reverse diffusion process.
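As an illustration of the structure loss, here is a minimal InfoNCE-style sketch over ViT keys; the tensor shapes and the temperature value are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def structure_loss(keys_src, keys_out, tau=0.07):
    """Minimal patch-contrastive (InfoNCE) sketch over ViT keys.

    keys_src, keys_out: (N, D) keys from one self-attention layer of
    DINO ViT for the source and denoised images; keys at the same
    spatial position are positives, all other positions negatives.
    The temperature tau is an assumed hyperparameter.
    """
    keys_src = F.normalize(keys_src, dim=-1)
    keys_out = F.normalize(keys_out, dim=-1)
    logits = keys_out @ keys_src.T / tau  # (N, N) cosine similarities
    labels = torch.arange(keys_out.size(0), device=keys_out.device)
    return F.cross_entropy(logits, labels)  # positive = same position
```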


Results

Aside from the outstanding numerical results, the paper presents promising graphical visualizations of the various image-guided and text-guided translations.
Compared to the baseline outputs, the acquired outputs show a higher level of perceptual quality, as shown in Figure 5 below. When the Dog-to-Tiger pair is examined, it is evident that the visual traits (the eyes, for example) of the generated species maintain the form of the source (the dog) while inheriting the tiger's style (the color of the eyes).

Unlike the baseline results, the DiffuseIT technique produces a structurally coherent output image with no facial distortion or abnormal behavior. This is because DiffuseIT separates the style and semantics of the image.

Figure 5: Qualitative comparison of image-guided image translation

The outcomes of the text-guided image translation compared to recent methods are shown in Table 1 below. For the quantitative comparison, the authors employed the metrics SFID, LPIPS, and CSFID to compare image quality. The DiffuseIT model showed the best, and sometimes the second-best, performance among all baseline methods.
To further evaluate the perceptual quality of generated samples, they conducted a user study based on an opinion scoring system.
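As an aside, LPIPS is straightforward to compute with the open-source `lpips` package; the snippet below is an illustrative usage example, with random tensors standing in for real image pairs.

```python
import torch
import lpips  # pip install lpips

# LPIPS measures perceptual distance between two images; lower is better.
# Inputs are RGB tensors in [-1, 1] with shape (N, 3, H, W).
loss_fn = lpips.LPIPS(net='alex')          # AlexNet backbone, a common default
img0 = torch.rand(1, 3, 256, 256) * 2 - 1  # placeholder for a generated image
img1 = torch.rand(1, 3, 256, 256) * 2 - 1  # placeholder for a reference image
print(loss_fn(img0, img1).item())
```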


Review

Strengths: The proposed method demonstrates superior performance compared to state-of-the-art models in terms of both score comparison and convergence speed. What I find particularly valuable is that DiffuseIT extensively utilizes recent techniques, pre-trained models, and existing losses, yet the authors take it a step further by providing their own contributions to each loss, addressing previous bugs and limitations in the proposed mathematical formulations. Moreover, the paper presents realistic results and is well written, with clear explanations.

Code available at: https://github.com/cyclomon/DiffuseIT

Weakness: Although the authors acknowledge a limitation of the model's performance when unrealistic text conditions are applied to the source domain, I believe this limitation would have minimal impact on the application of DiffuseIT in the medical field. Generally, the guiding text descriptions in medical scenarios are expected to be realistic. Moreover, the preservation of structural information, which DiffuseIT has shown to excel at, is crucial in medical images. It is noteworthy that the paper, unfortunately, did not conduct any experiments on medical modalities.

3- Adversarial Diffusion Model: SynDiff

In this third study, the authors introduce a new adversarial diffusion model called SynDiff, designed for medical image synthesis. The model achieves both high-fidelity and efficient modality translation by employing conditional diffusion on unpaired images. In contrast to conventional diffusion models, SynDiff uses a fast diffusion mechanism to enhance computational efficiency.

Model

In this analysis, we go over the various parts that constitute the architecture in the training phase, relying on Figure 6 below.

The goal of unsupervised image translation is to learn a mapping between two distinct image domains without the need for paired training images. Consider the following scenario: we have images representing slices from modality A (MRI) and modality B (CT), but there is no intrinsic correspondence between them. By including adversarial training in the diffusion process, the model learns to recreate the associated features in the target modality.

  • SynDiff fuses a non-diffusive module with a diffusive one during the training phase; during inference, the non-diffusive module is discarded.
  • Non-diffusive module: the model treats the image $x_0^A$ from modality A as the target image and uses a generator/discriminator pair to estimate its associated source image $\widetilde{y}^B$ in modality B.
  • Diffusive module: in the reverse diffusion process, the newly synthesized source image $\widetilde{y}^B$ is combined with the diffused, noisy target image $x_t^A$. This combined input is fed into another generator/discriminator pair, yielding a deterministic estimate of the denoised image $\widetilde{x}_0^A$. After sampling from the resulting distribution, the model feeds the denoised target image $x_{t-k}^A$ back into the generator/discriminator pair, continuing the iterative process.
  • Comparing $x_0^A$ with the final denoised target image maintains cycle consistency, which drives target image reconstruction.
  • Combining the adversarial losses from all modules with the cycle loss yields the overall loss (a sketch follows this list).
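A hypothetical sketch of this composition, with an assumed cycle weight `lambda_cyc` (the paper defines its own weighting scheme):

```python
def syndiff_total_loss(l_adv_nondiff, l_adv_diff, l_cyc, lambda_cyc=1.0):
    """Assumed composition of SynDiff's training objective: adversarial
    terms from the non-diffusive and diffusive modules plus a
    cycle-consistency term. The weighting is illustrative, not the paper's."""
    return l_adv_nondiff + l_adv_diff + lambda_cyc * l_cyc
```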
 

Figure 6: Structure of SynDiff [7]





During inference, the following algorithm is executed:
  1. Start from image $x^A$ of modality A.
  2. Initialize the target image $x_T^B$ with Gaussian noise.
  3. Apply the diffusive module iteratively on the pair $(x^A, x_T^B)$ to obtain $x^B$, the corresponding image in modality B (as sketched below).
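Below is a minimal, assumed sketch of that loop; `denoise_step`, the step count `T`, and the jump size `k` are placeholders rather than the authors' actual interface.

```python
import torch

def syndiff_inference(x_a, denoise_step, T=1000, k=250):
    """Sketch of SynDiff inference [7]; not the authors' actual code.

    denoise_step(x_t, x_a, t) is an assumed generator call that jumps
    from diffusion step t to t - k, conditioned on the source image
    x_a; the large step size k reflects the paper's fast diffusion.
    """
    x_b = torch.randn_like(x_a)  # initialize the target image with noise
    for t in range(T, 0, -k):    # large reverse-diffusion jumps
        x_b = denoise_step(x_b, x_a, t)
    return x_b                   # translated image in modality B
```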


Results

  • The paper evaluated SynDiff for unsupervised MRI contrast translation against state-of-the-art non-attentional GAN (cGAN, UNIT, MUNIT), attentional GAN (AttGAN, SAGAN), and regular diffusion (DDPM, UNIT-DDPM) models. The figure below shows a comparison between SynDiff and the competing methods for image translation from a) T1 to T2 and b) T2 to PD. SynDiff yields lower noise and maintains higher anatomical fidelity, whereas the GANs show local inaccuracies in tissue contrast and the regular diffusion models suffer from blurring.

                                                                          Figure 7: Translation between MRI contrasts


  • The authors also demonstrated SynDiff for unsupervised translation between separate modalities. Experiments were performed with the non-attentional GAN, attentional GAN, and regular diffusion models on a pelvic dataset for MRI-CT translation. SynDiff demonstrates superior performance across all tasks.

                                                                                                           Table 2: Translation between MRI and CT modalities

Review

Strengths:

  • Extensive experimentation on medical datasets
  • Examination of different translation scenarios and comparison to baseline models
  • High fidelity and superior quality compared to state-of-the-art GAN and diffusion models
  • Cycle consistency
  • Inference time competitive with GAN models
  • Code available at: https://github.com/icon-lab/SynDiff/tree/main

Weaknesses:

  • No limitations discussed in the paper
  • No medical review (to verify the accuracy of the translated image results)
  • The model architecture is somewhat difficult to understand

Conclusion

This blog post introduced three different diffusion model architectures designed for image-to-image translation. Although DDIBs maintain data privacy, ensure cycle consistency, and demonstrate multi-domain adaptability through their simple architecture, they incur a significant computational cost when solving the ordinary differential equations (ODEs). DiffuseIT, on the other hand, places more importance on content preservation and semantic changes in image-guided and text-guided image translation. Nevertheless, its performance is insufficient when attempting text-guided image translation in the presence of a large domain gap. Neither of these two models was evaluated on medical datasets. SynDiff, however, demonstrated its applicability in the medical domain by preserving anatomical fidelity during translation. A promising direction for future work on SynDiff would be to adapt its framework to 3D datasets.

References

1.  Isola, Phillip, et al. "Image-to-image translation with conditional adversarial networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

2.   Su, Xuan, et al. "Dual diffusion implicit bridges for image-to-image translation." arXiv preprint arXiv:2203.08382 (2022).

3.   Kwon, Gihyun, and Jong Chul Ye. "Diffusion-based image translation using disentangled style and content representation." arXiv preprint arXiv:2209.15264 (2022).

4.  Caron, Mathilde, et al. "Emerging properties in self-supervised vision transformers." Proceedings of the IEEE/CVF international conference on computer vision. 2021.

5.  Park, Taesung, et al. "Contrastive learning for unpaired image-to-image translation." Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16. Springer International Publishing, 2020.

6.  Gal, Rinon, et al. "Stylegan-nada: Clip-guided domain adaptation of image generators." arXiv preprint arXiv:2108.00946 (2021).

7.  Özbey, Muzaffer, et al. "Unsupervised medical image translation with adversarial diffusion models." IEEE Transactions on Medical Imaging (2023).
