Abstract:
In the form of a blog post, we provide an overview of recent image-to-image translation diffusion models. We present three distinct approaches, along with their results and limitations.
Introduction and Motivation
1- Definitions
- Translation: In the field of computer vision, translation refers to the process of converting images from a given source domain to a desired target domain, while ensuring that the semantic content of the images remains intact. By different domains, we refer to distinct visual attributes such as day/night, grayscale/color, and so forth.
Figure 1: Some domains of the image [1]
- Diffusion: It is a stochastic process that describes the evolution of a random variable over time. In image-to-image translation, diffusion models gradually corrupt images with noise and then learn to reverse this corruption, generating novel images in the process (a minimal code sketch follows Figure 2).
Figure 2: Overview of the diffusion process
Source: https://medium.com/@steinsfu/diffusion-model-clearly-explained-cd331bd41166
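To make this concrete, here is a minimal sketch of the forward (noising) step of a DDPM-style diffusion process; the linear schedule and the toy image are illustrative assumptions, not taken from any of the papers discussed below:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # linear variance schedule (illustrative)
alpha_bars = np.cumprod(1.0 - betas)        # cumulative signal-retention factors

def q_sample(x0, t, rng=np.random.default_rng()):
    """Sample x_t ~ q(x_t | x_0): a progressively noisier version of x0."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = np.random.rand(64, 64)                 # toy grayscale "image" in [0, 1]
x_mid = q_sample(x0, t=500)                 # halfway through the forward process
x_end = q_sample(x0, t=T - 1)               # nearly pure Gaussian noise
```

The reverse (denoising) process is what the models below learn; it inverts this corruption step by step.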
The reported diffusion models exhibit superior performance compared to previously utilized models such as Generative Adversarial Networks (GANs), CycleGANs, and Variational Autoencoders (VAEs).
2- Application in the medical field
This section aims to address the question regarding the potential applications of said translations in the medical domain.
Why?
To begin, we observe that medical images can be obtained via different modalities such as X-ray, Computed Tomography (CT), Magnetic Resonance Imaging (MRI), and Ultrasound (US). Every imaging modality possesses distinct visual characteristics, and within some modalities there are different sequences (T1, FLAIR, ...) that enable specific tissue contrasts, thereby emphasizing various aspects of the anatomy.
As the most noticeable characteristic of medical images is their scarce availability, diffusion models can be employed to translate between an existing modality and a missing one, thereby augmenting the data. The resulting synthetic images provide supplementary training samples, increasing the dataset's size, addressing imbalanced data distributions, and mitigating data scarcity.
Furthermore, clinicians can use these synthetic images as a reference to aid in the diagnosis and understanding of various disorders. The generated images provide additional visual information, allowing clinicians to make more informed decisions and deliver better patient care.
However...Precautions!
Nonetheless, the implementation of translation in the medical domain ought to adhere to specific criteria.
On the one hand, the procedure should keep the overall structure and anatomical features intact, including the sizes and shapes of organs and tissues and the structure of the original image. On the other hand, the process must be robust to the noise and variations commonly found in medical imaging, such as differing acquisition protocols, patient motion, and scanner and equipment variations. Furthermore, given that medical images contain confidential patient data, it is crucial to guarantee data privacy throughout the translation process.
Methodology
This section examines three distinct papers that introduce three diverse diffusion-based model architectures. The analysis includes a description of each model, a presentation of the obtained results, and personal feedback.
1- Dual Diffusion Implicit Bridges
The first paper introduced Dual Diffusion Implicit Bridges (DDIB) [2].
Model
The dual diffusion implicit bridges approach utilizes two separate diffusion models: one that transforms source images into a latent space, and another that generates target images from the obtained latent space. The term "dual diffusion" derives from this characteristic, while the implicit bridges the authors refer to are "Schrödinger bridges".
The term "Schrödinger bridges" is employed metaphorically in this context. The researchers illustrate in their study that, under specific conditions, the translation between the source and latent distributions, as well as between the latent and target distributions, can be accurately described by two diffusion models that possess the properties and definition of two Schrödinger bridges.
To translate between the data distributions, the model relies on the probability flow (PF) ODE framework. The PF ODE is a deterministic ordinary differential equation whose solution trajectories share the same marginal probability densities as the underlying stochastic diffusion process; by numerically solving this transport equation, a sample evolves in time according to the dynamics of the underlying system.
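For reference, the PF ODE takes the standard form from the score-based generative modeling literature (the notation below follows that literature rather than being quoted from [2]): for a forward SDE with drift $f(x, t)$ and diffusion coefficient $g(t)$,

$$\frac{dx}{dt} = f(x, t) - \frac{1}{2} g(t)^2 \, \nabla_x \log p_t(x),$$

where $\nabla_x \log p_t(x)$ is the score that the diffusion model learns to approximate.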
To learn more about the math behind this comparison, please refer to Section 2.2 of [2], under "Schrödinger Bridge Problem".
Upon concluding their mathematical analysis, the authors demonstrate that flowing through the PF ODEs of DDIBs is equivalent to flowing through Schrödinger bridges, a form of entropy-regularized optimal transport, which explains the efficacy of the method.
Consequently, the authors present the high-level structure of the method as follows:
- DDIBs first apply ODESolve in the source domain with the source model $v_\theta^{(s)}$ to obtain the encoding of the image at the end time $t = 1$: $x^{(l)} = \text{ODESolve}(x^{(s)}; v_\theta^{(s)}, 0, 1)$.
- We refer to $x^{(l)}$ as the latent code.
- The latent code is fed as the initial condition to ODESolve with the target model $v_\theta^{(t)}$ to obtain the target image: $x^{(t)} = \text{ODESolve}(x^{(l)}; v_\theta^{(t)}, 1, 0)$ (sketched in code below Figure 3).
Figure 3: Visualisation of the dual models and the diffusion process [2]
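In code, the two-step procedure can be sketched as follows; the Euler integrator and the toy velocity fields are stand-ins for the pre-trained source and target models $v_\theta^{(s)}$ and $v_\theta^{(t)}$ of [2], so this is a minimal sketch rather than the authors' implementation:

```python
import numpy as np

def odesolve(x, velocity, t0, t1, n_steps=100):
    """Euler integration of the probability-flow ODE dx/dt = v(x, t)."""
    ts = np.linspace(t0, t1, n_steps + 1)
    for i in range(n_steps):
        dt = ts[i + 1] - ts[i]
        x = x + dt * velocity(x, ts[i])
    return x

# Toy velocity fields standing in for the trained source/target models.
v_source = lambda x, t: -x          # assumption: stand-in for v_theta^{(s)}
v_target = lambda x, t: -0.5 * x    # assumption: stand-in for v_theta^{(t)}

x_source = np.random.randn(3, 64, 64)              # a source-domain image
x_latent = odesolve(x_source, v_source, 0.0, 1.0)  # encode: source -> latent
x_target = odesolve(x_latent, v_target, 1.0, 0.0)  # decode: latent -> target
```

The key design choice is that the two models never need to be trained together: only the latent code crosses between them.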
Results
Despite their simplicity, DDIBs have shown significant advantages over previous techniques, which I will address below.
- Cycle Consistency: The cycle consistency property is guaranteed because PF ODEs are employed in the DDIBs. The framework ensures that, when a data point is translated from the source domain to the target domain and afterward back to the source domain, its original state in the source domain is recovered (see the round-trip sketch after the figures below). Compared to GAN-based models, which must include additional training terms to optimize for cycle consistency, DDIBs need no extra conditioning or optimization to enforce it.
- Multi-domain Translation: The utilization of DDIBs extends beyond paired domain translation: they enable translation between any source-target domain pair without additional finetuning or adaptation, relying only on a single pre-trained conditional diffusion model. The figure below shows the result of transforming a single image of a roaring lion into multiple different ImageNet classes.
- Data privacy: The training procedure of DDIBs can be executed in a privacy-sensitive manner, as the training of the two diffusion models does not rely on any prior knowledge of the domain pair. In the example illustrated below, Alice and Bob own the source and target domains, respectively. Alice trains a diffusion model on the source data, encodes the data into the latent space, and transmits it to Bob. Bob then applies his trained diffusion model to the data and returns the results to Alice. Neither party ever discloses its data directly.
Figures: Cycle consistency, multi-domain translation, and data privacy in DDIBs [2]
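Continuing the DDIB sketch above (reusing its `odesolve` helper and toy source field), the cycle consistency property amounts to a round trip through the latent space recovering the input up to integration error:

```python
# Round trip: encode to the latent space, then decode with the same source model.
x_latent = odesolve(x_source, v_source, 0.0, 1.0)
x_back = odesolve(x_latent, v_source, 1.0, 0.0)
assert np.allclose(x_back, x_source, atol=0.05)  # tolerance covers Euler error
```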
Review
Strengths: The authors explain in depth the Schrödinger bridge problem, the probability flow ODE, and how the ODE solver helps solve the Schrödinger bridge problem. Though their method is simple, it is original and presents promising advantages. For me, the data privacy property comes first, as it is important to preserve the patient's personal information when conducting image translation between different clinics or research centers.
Weakness: However, one can notice occasional strange translation behavior: for example, in the multi-domain translation, apart from the tiger and the jaguar, all other animals tend to show their tongue instead of roaring. Another limitation of the study is the absence of actual evidence to support the discussion of data privacy in medical images. While natural images contain different foreground and background contexts even within one class, medical image translation focuses on detail preservation. It is therefore unclear whether such a separately trained method can handle the task.
2- Diffusion-based Translation Using Disentangled Style and Content Representation
The authors of the paper [3] introduce a new unsupervised image translation technique called DiffuseIT, which is based on the diffusion process and utilizes disentangled representations of style and content. Here, disentangled style refers to the model's capacity to separate and control the stylistic characteristics of the image independently, whereas content representation refers to representing high-level features in a manner that distinguishes them from unnecessary details. The proposed model is suitable for both image-guided and text-guided translation tasks.
Model
To build the overall architecture of DiffuseIT, the authors augment the forward and reverse diffusion with a pre-trained DINO ViT [4] model to facilitate the generation process of the diffusion models, and also propose novel loss functions.
Following the systematic addition of Gaussian noise to the training data (forward diffusion), the denoising process learns to restore the data and subsequently transfer it to the target domain. In this context, the authors guide the denoising procedure using the pre-trained DINO ViT model and establish an overall loss function comprising five distinct components.
As demonstrated in the paper, the pre-trained DINO ViT succeeds in separating the semantic and structural information of an image (see the sketch after this list), where:
- the keys k^l of the multi-head self-attention layer contain the structure information,
- and the CLS token of the last layer contains the semantic information of the same image.
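As a rough illustration of where these two feature types live, the sketch below pulls the last block's attention keys and the [CLS] embedding out of the public DINO ViT-S/16 checkpoint. The hook placement reflects my reading of the DINO code base and is an assumption, not DiffuseIT's implementation:

```python
import torch

# Load the public DINO ViT-S/16 checkpoint from torch hub.
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
model.eval()

feats = {}
def grab_keys(module, inputs, output):
    # The qkv projection emits (batch, tokens, 3 * embed_dim); q, k, v are
    # consecutive chunks, so the keys are the middle third (assumed layout).
    d = output.shape[-1] // 3
    feats['keys'] = output[..., d:2 * d]

# Hook the qkv projection of the last self-attention block.
model.blocks[-1].attn.qkv.register_forward_hook(grab_keys)

img = torch.randn(1, 3, 224, 224)           # stand-in for a preprocessed image
with torch.no_grad():
    cls_token = model(img)                  # [CLS] embedding -> semantic information
keys = feats['keys']                        # attention keys -> structure information
```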
To guide the learning, the authors introduce five losses. With the help of Figure 4 below, we describe the role of each loss, then give the combined objective after the list.
Figure 4: Visualisation of the losses in DiffuseIT [3]
- Structure loss: also known as the InfoNCE loss, $l_{cont}$ is derived from the patch contrastive loss [5]. It compares the structure of the input and output images by checking for similarities between their keys. The authors include a regularization term that pulls keys at the same positions closer together while maximizing the distances between keys at different positions. Minimizing this loss maintains the structure during the reverse diffusion.
- Semantic style loss: $l_{sty}$ denotes the loss between the target tokens and the tokens of the denoised image at time step $t$, which dictates the style. A regularized MSE loss is also introduced to avoid misaligning the color values. Minimizing $l_{sty}$ guides the diffusion process to match the semantics of the target image.
- CLIP loss: $l_{clip}$ is inspired by the work on CLIP-guided diffusion [6], with minor alterations by the authors to compensate for poor image quality when basing the translation on text. This loss is introduced in the case of text-guided image translation.
- Semantic divergence loss: by maximizing this loss (written $l_{sem}$ here to distinguish it from the regularization loss below), the model makes the semantic information of two denoised images as dissimilar as possible, which speeds up the diffusion process.
- Regularization loss: $l_{reg}$ prevents irregular steps in the reverse diffusion process.
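Putting these together, the overall guidance objective can be written as a weighted sum; the weights $\lambda_i$ and this exact grouping are a hedged reconstruction rather than a quotation from [3] (in the text-guided setting, $l_{sty}$ is replaced by $l_{clip}$):

$$l_{total} = \lambda_{cont}\, l_{cont} + \lambda_{sty}\, l_{sty} - \lambda_{sem}\, l_{sem} + \lambda_{reg}\, l_{reg}$$

The minus sign reflects that the semantic divergence term is maximized while the other terms are minimized.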
Results
Aside from the outstanding numerical results, the paper presents promising qualitative visualizations of the various image-guided and text-guided translations.
Compared to the baseline outputs, the obtained outputs show a higher level of perceptual quality, as shown in Figure 5 below. Examining the dog-tiger pair, it is evident that the visual traits (the eyes, for example) of the generated species maintain the form of the source (the dog) while inheriting the tiger's style (the color of the eyes).
Unlike the baseline results, the DiffuseIT technique produces a structurally coherent output image with no facial distortion or abnormal behavior. This is because DiffuseIT separates the style and semantics of the image.
Figure 5: Qualitative comparison of image-guided image translation
The outcomes of text-guided image translation compared to recent methods are shown below in Table 1. For the quantitative comparison, the authors employ the SFID, LPIPS, and CSFID metrics to assess image quality. The DiffuseIT model shows the best, and occasionally the second-best, performance among all baseline methods.
To further evaluate the perceptual quality of generated samples, they conducted a user study that relies on an opinion scoring system.
Review
Strengths: The proposed method demonstrates superior performance compared to state-of-the-art models in terms of both score comparison and convergence speed. What I find particularly valuable is that DiffuseIT extensively utilizes recent techniques, pre-trained models, and existing losses, while the authors go a step further by contributing to each loss, addressing previous bugs and limitations in the proposed mathematical formulations. Moreover, the paper presents realistic results and is well written, with clear explanations.
Code available at: https://github.com/cyclomon/DiffuseIT
Weakness: Although the authors acknowledge a limitation of the model's performance when unrealistic text conditions are applied to the source domain, I believe this limitation would have minimal impact on the application of DiffuseIT in the medical field. Generally, the guiding text descriptions in medical scenarios are expected to be realistic. Moreover, the preservation of structural information, which DiffuseIT has shown to excel at, is crucial in medical images. It is noteworthy that the paper, unfortunately, did not conduct any experiments on medical modalities.
3- Adversarial Diffusion Model: SynDiff
In this third study [7], the authors introduce a new adversarial diffusion model called SynDiff, designed for medical image synthesis. The model achieves both high-fidelity and efficient modality translation by employing conditional diffusion on unpaired images. In contrast to conventional diffusion models, SynDiff uses a rapid diffusion mechanism to enhance computational efficiency.
Model
In this analysis, we go over the different parts that constitute the architecture in the training phase, relying on Figure 6 below.
The goal of unsupervised image translation is to learn a mapping between two distinct image domains without paired training images. Consider the following scenario: we have images representing slices from modality A (MRI) and modality B (CT), but no intrinsic correspondence between them. By including adversarial training in the diffusion process, the model learns to recreate the associated features in the target modality.
- SynDiff fuses a non-diffusive module with a diffusive one during the training phase. During inference, however, it discards the non-diffusive module.
- Non-diffusive module: The model considers the image $x_0^A$ from modality A to be the target image and uses a generator/discriminator competition to estimate its associated source image $\widetilde{y^B}$ from modality B.
- Diffusive module: In the reverse diffusion process, the newly synthesized source image $\widetilde{y^B}$ is combined with the diffused, noisy target image $x_t^A$. This combined input is fed into another generator/discriminator pair, resulting in a deterministic estimate of the denoised image $\widetilde{x_0^A}$. After sampling from the resulting distribution, the model feeds the denoised target image $x_{t-k}^A$ back into the generator/discriminator pair, keeping up the iterative process (see the sampling sketch after this list).
- Comparing $x_0^A$ with the final denoised target image maintains cycle consistency, which drives the target image reconstruction.
- Combining the adversarial losses from all modules with the cycle loss yields the overall loss.
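The sketch below illustrates the accelerated reverse diffusion at inference time. It is a hedged reconstruction, not the authors' code: the conditional generator `G` directly predicting a denoised image, the toy re-noising helper `q_sample`, and the step size `k` are all assumptions standing in for the components described in [7]:

```python
import torch

def q_sample(x0, t, T=1000):
    """Toy forward-noising helper: blend the clean estimate with fresh noise."""
    alpha_bar = 1.0 - t / T                     # illustrative linear schedule
    return alpha_bar**0.5 * x0 + (1.0 - alpha_bar)**0.5 * torch.randn_like(x0)

def syndiff_sample(G, y_source, T=1000, k=250, shape=(1, 1, 256, 256)):
    x_t = torch.randn(shape)                    # start from pure noise
    for t in range(T, 0, -k):                   # large jumps: the "fast" diffusion
        x0_hat = G(x_t, y_source, t)            # deterministic estimate of x_0^A
        x_t = q_sample(x0_hat, t - k) if t - k > 0 else x0_hat
    return x_t

# Toy generator standing in for the trained adversarial denoiser.
toy_G = lambda x_t, y, t: 0.5 * (x_t + y)
y = torch.randn(1, 1, 256, 256)                 # source-modality image (stand-in)
x_translated = syndiff_sample(toy_G, y)
```

Taking large steps of size k instead of single-step denoising is what gives SynDiff its computational advantage over regular diffusion sampling.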
Figure 6: Structure of SynDiff [7]
Results
- The paper evaluates SynDiff on unsupervised MRI contrast translation against state-of-the-art non-attentional GAN (cGAN, UNIT, MUNIT), attentional GAN (AttGAN, SAGAN), and regular diffusion (DDPM, UNIT-DDPM) models. The figure below shows a comparison between SynDiff and the competing methods for image translation from a) T1 to T2 and b) T2 to PD. SynDiff yields lower noise and maintains higher anatomical fidelity, whereas the GANs show local inaccuracies in tissue contrast and the regular diffusion models suffer from blurring.
Figure 7: Translation between MRI contrasts
- The authors also demonstrate SynDiff for unsupervised translation between separate modalities: experiments were performed against non-attentional GAN, attentional GAN, and regular diffusion models on a pelvic dataset for MRI-CT translation. SynDiff demonstrates superior performance across all tasks.
Table 2: Translation between MRI and CT modalities
Review
Strengths:
- Extensive experimentation on medical datasets
- Examination of different translation scenarios and comparison to baseline models
- High fidelity and superior quality compared to state-of-the-art GAN and diffusion models
- Cycle consistency
- Competitive inference time with GAN models
- Code available at: https://github.com/icon-lab/SynDiff/tree/main
Weakness:
- No limitations discussed in the paper
- No medical review (to verify the accuracy of the translated image results)
- Some difficulty in understanding the architecture of the model
Conclusion
This blog post introduced three different diffusion model architectures designed for image-to-image translation. Although DDIBs maintain data privacy, ensure cycle consistency, and demonstrate multi-domain adaptability through their simple architecture, they confront a significant computational cost when solving the ordinary differential equations (ODEs). DiffuseIT, on the other hand, places more importance on content preservation and semantic changes in image-guided and text-guided image translation; nevertheless, its performance is insufficient for text-guided image translation in the presence of a large domain gap. Neither of these two models was evaluated on medical datasets. SynDiff, however, demonstrated its applicability in the medical domain by preserving anatomical fidelity during translation. A potential direction for future development of SynDiff would be adapting its framework to 3D datasets.
References
1. Isola, Phillip, et al. "Image-to-image translation with conditional adversarial networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
2. Su, Xuan, et al. "Dual diffusion implicit bridges for image-to-image translation." arXiv preprint arXiv:2203.08382 (2022).
3. Kwon, Gihyun, and Jong Chul Ye. "Diffusion-based image translation using disentangled style and content representation." arXiv preprint arXiv:2209.15264 (2022).
4. Caron, Mathilde, et al. "Emerging properties in self-supervised vision transformers." Proceedings of the IEEE/CVF international conference on computer vision. 2021.
5. Park, Taesung, et al. "Contrastive learning for unpaired image-to-image translation." Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16. Springer International Publishing, 2020.
6. Gal, Rinon, et al. "Stylegan-nada: Clip-guided domain adaptation of image generators." arXiv preprint arXiv:2108.00946 (2021).
7. Özbey, Muzaffer, et al. "Unsupervised medical image translation with adversarial diffusion models." IEEE Transactions on Medical Imaging (2023).