This is the blog post for the topic "Image-to-image (I2I) translation". It explains some preliminary knowledge of I2I translation and briefly introduces three state-of-the-art methods and their medical applications.

Blog post author: Yihao Wang

1. Introduction

Image-to-image translation has been widely studied in tasks such as translating cats to dogs, horses to zebras, and summer to winter. Recently, it has also been employed in the medical field. Medical images obtained from instruments or simulators often suffer from low quality and stark visual differences from real intra-operative images, and some acquisition procedures expose patients to radiation or trauma. To address these problems, we can use I2I translation to turn raw simulated images into realistic inner-body images, which helps doctors make more accurate diagnoses, or to translate MR images into CT images to avoid harmful radiation.

Figure 1. Example of I2I translation (CycleGAN) [1].

Figure 2. Example of MR-to-CT translation [2].

2. Motivation

Generative Adversarial Networks (GANs) [3]: Most I2I translation models are GAN-based because, in most cases, we need to generate a realistic-looking picture. A plain GAN, however, offers no control over what it generates. To solve this problem, the Conditional GAN (CGAN) [4] adds conditional information to both the generator and the discriminator to guide the model. The well-known pix2pix [5] was the first to employ such conditional networks for I2I translation. Since then, more and more GAN-based translation methods have been proposed, such as CycleGAN [1], BicycleGAN [6], DiscoGAN [7], and DualGAN [8].

Relationship Preservation: When we translate an image, how can we ensure the relationship between input and output stays the same? For example, in horse → zebra translation, only the appearance of the horse should change, while everything else remains the same. One intuitive approach is to measure distances between input and output at the pixel level, but this does not achieve perfect results [1, 5, 6]. At a more abstract level, feature-based losses were proposed, which compare feature maps or spatially-correlative maps [9, 10]. Such loss functions preserve the shared, domain-invariant structure well.

Contrastive Representation Learning is an effective tool for measuring relationship preservation in unsupervised representation learning. Its core idea is to learn robust features by associating "positive" pairs and dissociating "negative" pairs (see Figure 4), pulling positives closer together in the embedding space while pushing negatives apart. A representative work is CPC [11], which maximizes mutual information using noise contrastive estimation (NCE). CUT [9] was the first to introduce InfoNCE to image translation tasks.

Figure 3. The process of GAN [12].

Figure 4. Contrastive Learning.

3. Methodology

3.1 CycleGAN

CycleGAN [1] is one of the representative unpaired image translation methods. The authors employ the idea of a bijection and introduce the cycle-consistency loss. The model has two generators and two discriminators: one pair for the forward mapping and the other for the backward mapping, forming a cycle. Figure 5 shows that after being mapped forward and then backward, the image should be brought back to the original. Following the GAN model of Goodfellow et al., the authors apply an adversarial loss on both sides to ensure the generated images are as close to real images as possible. In addition, they use a cycle-consistency loss, which helps to change objects while preserving their domain-invariant features. The full objective combines both losses as follows:

\mathcal{L}(G,\ F,\ D_{X},\ D_{Y})=\mathcal{L}_{\text{GAN}}(G,\ D_{Y},\ X,\ Y)+\mathcal{L}_{\text{GAN}}(F,\ D_{X},\ Y,\ X)+\lambda \mathcal{L}_{\text{cyc}}(G,\ F)

Figure 5. (a) model of CycleGAN; (b) the forward mapping; (c) the backward mapping.
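
As a rough illustration of how this objective is assembled in practice, here is a minimal PyTorch-style sketch (not the authors' implementation; G, F, D_X, D_Y are assumed to be user-defined generator and discriminator modules, and the least-squares GAN criterion and λ = 10 follow the paper's defaults):

```python
import torch
import torch.nn.functional as nnF  # aliased to avoid clashing with the generator name F

def cyclegan_generator_objective(G, F, D_X, D_Y, real_x, real_y, lam=10.0):
    """Minimal sketch of the CycleGAN full objective (generator side).

    G maps X -> Y, F maps Y -> X; D_X and D_Y are the two discriminators.
    The discriminators are trained with their own (omitted) objective in a separate step.
    """
    fake_y = G(real_x)                       # forward mapping  X -> Y
    fake_x = F(real_y)                       # backward mapping Y -> X

    # Adversarial terms: each generator tries to make its discriminator predict "real" (1).
    pred_fake_y = D_Y(fake_y)
    pred_fake_x = D_X(fake_x)
    loss_gan_G = nnF.mse_loss(pred_fake_y, torch.ones_like(pred_fake_y))
    loss_gan_F = nnF.mse_loss(pred_fake_x, torch.ones_like(pred_fake_x))

    # Cycle-consistency: x -> G(x) -> F(G(x)) should return to x, and vice versa.
    loss_cyc = nnF.l1_loss(F(fake_y), real_x) + nnF.l1_loss(G(fake_x), real_y)

    return loss_gan_G + loss_gan_F + lam * loss_cyc
```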

3.2 Contrastive Learning for Unpaired Image-to-Image Translation (CUT)

The bijection shows good performance, but relying on auxiliary networks and loss functions can be problematic. These issues have been addressed by one-sided unsupervised image translation. As a one-sided method, CUT [9] only needs to learn the mapping in one direction and has just one generator and one discriminator. Specifically, the generator is an autoencoder whose encoder G_{enc} learns domain-invariant concepts (e.g. the background and the pose of the horse body), while the decoder G_{dec} generates domain-specific features such as the stripes of a zebra. The output can be formulated as \hat{\textbf{y}}=G(\textbf{x})=G_{dec}(G_{enc}(\textbf{x}))
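
To make the encoder/decoder split concrete, the following is a minimal sketch of such a one-sided generator (the layer choices are illustrative assumptions, not the architecture used in the paper):

```python
import torch.nn as nn

class OneSidedGenerator(nn.Module):
    """Sketch of the CUT generator: the encoder extracts domain-invariant content,
    the decoder adds domain-specific appearance. Layer choices are illustrative."""

    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(                              # G_enc
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.dec = nn.Sequential(                              # G_dec
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        feats = self.enc(x)     # domain-invariant features, reused later for PatchNCE
        return self.dec(feats)  # y_hat = G_dec(G_enc(x))
```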

Loss function

Adversarial Loss: The adversarial loss \mathcal{L}_{\text{GAN}}(G,\ D,\ X,\ Y) is the same as in CycleGAN and encourages the translated images to look like images from the target domain Y.

InfoNCE Loss [13]: This loss aims to learn an embedding that associates corresponding patches with each other while disassociating them from all others. After sampling a query patch (from the output), one positive patch (the corresponding input patch) and N negative patches (non-corresponding input patches), a cross-entropy loss is computed, representing the probability of the positive example being selected over the negatives. Before the calculation, the vectors are normalized onto a unit sphere. In short, the loss attracts positive samples and repels negative samples. The formulation is as follows:

\ell(v,\ v^{+},\ v^{-})=-\log\left[\frac{\exp(\text{sim}(v,\ v^{+})/\tau)}{\exp(\text{sim}(v,\ v^{+})/\tau)+\sum_{n=1}^{N}\exp(\text{sim}(v,\ v^{-}_{n})/\tau)}\right]

where \text{sim}(v,\ v^{+})=v^{T}v^{+}/\|v\|\|v^{+}\| is the cosine similarity of a vector pair and \tau is a temperature parameter.

Therefore, InfoNCE can serve as a mutual information estimator: after sampling the positive and the negatives, minimizing this loss maximizes the mutual information between the query and the positive sample while reducing it between the query and the negatives.
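
The sketch below computes this loss for a single query with one positive and N negatives (a minimal illustration; the tensor shapes and the temperature value τ = 0.07 are assumptions):

```python
import torch
import torch.nn.functional as nnF

def info_nce(query, positive, negatives, tau=0.07):
    """InfoNCE written as an (N+1)-way classification problem.

    query:     (D,)   embedding of an output patch
    positive:  (D,)   embedding of the corresponding input patch
    negatives: (N, D) embeddings of non-corresponding input patches
    """
    # Normalize onto the unit sphere so that dot products equal cosine similarity.
    q = nnF.normalize(query, dim=0)
    pos = nnF.normalize(positive, dim=0)
    neg = nnF.normalize(negatives, dim=1)

    # Logit 0 is the positive pair; logits 1..N are the negatives.
    logits = torch.cat([(q * pos).sum().view(1), neg @ q]) / tau   # shape (N+1,)
    target = torch.zeros(1, dtype=torch.long)                      # the positive is class 0
    return nnF.cross_entropy(logits.unsqueeze(0), target)
```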

Figure 6. The process to calculate contrastive loss.

PatchNCE Loss: Adopting the idea of SimCLR [14], the authors select L layers of the encoder and pass their feature maps through a small MLP (multi-layer perceptron) H_l. The features captured at the l-th layer can then be written as \{z_l\}_L=\{H_l(G^l_{enc}(\textbf{x}))\}_L; such a feature might represent, for example, an animal leg (regardless of whether it belongs to a horse or a zebra). Similarly, the output image \hat{\textbf{y}} is encoded back through the same layers, giving \{\hat{z}_l\}_L=\{H_l(G^l_{enc}(\hat{\textbf{y}}))\}_L=\{H_l(G^l_{enc}(G(\textbf{x})))\}_L. Note that z_l^s denotes the feature at the corresponding spatial location s, while z_l^{S_l\backslash s} denotes the features at all other locations, where S_l is the number of spatial locations in layer l and s \in \{1,...,S_l\}. With the contrastive loss above, the loss over domain-invariant concepts can then be computed for every input-output patch pair while "neglecting" the domain-specific features. The proposed PatchNCE loss can be computed as follows:

\mathcal{L}_{\text{PatchNCE}}(G,\ H,\ X)=\mathbb{E}_{\textbf{x}\sim X}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell(\hat{z}_{l}^{s},\ z_{l}^{s},\ z_{l}^{S_l\backslash s})
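
A simplified sketch of how the PatchNCE loss could be assembled from the encoder features of \textbf{x} and \hat{\textbf{y}} (the layer selection, the MLP heads H_l, and the number of sampled locations are assumptions for illustration; the official implementation differs in its details):

```python
import torch
import torch.nn.functional as nnF

def patch_nce(feats_x, feats_y_hat, mlps, num_patches=64, tau=0.07):
    """feats_x / feats_y_hat: lists of feature maps (B, C_l, H_l, W_l) taken from the
    selected encoder layers for the input x and the output y_hat.
    mlps: one small MLP H_l per layer, mapping C_l-dimensional features to an embedding."""
    total = 0.0
    for f_x, f_y, h in zip(feats_x, feats_y_hat, mlps):
        b, c, hh, ww = f_x.shape
        s = min(num_patches, hh * ww)
        # Flatten spatial locations and sample the same S positions in x and y_hat.
        f_x = f_x.flatten(2).permute(0, 2, 1)               # (B, H*W, C)
        f_y = f_y.flatten(2).permute(0, 2, 1)
        idx = torch.randperm(hh * ww, device=f_x.device)[:s]
        z = nnF.normalize(h(f_x[:, idx]), dim=-1)           # (B, S, D) input patches
        z_hat = nnF.normalize(h(f_y[:, idx]), dim=-1)       # (B, S, D) output patches

        # For each location, the positive is the same location in x; negatives are the others.
        logits = torch.bmm(z_hat, z.transpose(1, 2)) / tau  # (B, S, S)
        target = torch.arange(s, device=logits.device).expand(b, -1)  # diagonal = positives
        total = total + nnF.cross_entropy(logits.flatten(0, 1), target.flatten())
    return total
```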

Full objective

The final objective combines the GAN loss and the PatchNCE loss. Additionally, an identity term \mathcal{L}_{\text{PatchNCE}}(G,\ H,\ Y), computed on images from the target domain Y, is used to prevent the generator from making unnecessary changes.
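
Written out in the notation above (following the formulation in the CUT paper), the full objective reads:

\mathcal{L}(G,\ H,\ D)=\mathcal{L}_{\text{GAN}}(G,\ D,\ X,\ Y)+\lambda_{X}\mathcal{L}_{\text{PatchNCE}}(G,\ H,\ X)+\lambda_{Y}\mathcal{L}_{\text{PatchNCE}}(G,\ H,\ Y)

where \lambda_{X} and \lambda_{Y} weight the two PatchNCE terms.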

3.3 Fixed/Learned Self Similarity (F/LSeSim)

Chuanxia Zheng et al. introduced a new method that explicitly learns spatially-correlative maps for image translation tasks [10]. Unlike previous pixel-level or feature-level losses, which cannot decouple structure from appearance, they apply the idea of a self-similarity loss: a translation that preserves structure will retain the same patterns of self-similarity in both the source and the translated images, regardless of their shapes or appearances.

Loss function

Fixed Self Similarity (FSeSim) loss: In this variant, the authors directly compare the self-similarity patterns of features extracted from a fixed pre-trained network.

Firstly, they compute the spatially-correlative map of a query point as follows:

S_{x_{i}}=\left(f_{x_{i}}\right)^{T}f_{x_{*}}

Here, the input image is fed into a feature extractor (e.g. VGG-16); f_{x_{i}} is the feature of a query point x_i, f_{x_{*}} contains the features of the N_p points in a patch around it, and S_{x_{i}} captures the spatial correlation between the query point and the other points in that patch.

Then, the authors stack the spatially-correlative maps of the whole image as S_x=[S_{x_{1}};S_{x_{2}};...;S_{x_{N_s}}]\in\mathbb{R}^{N_s\times N_p}, where N_s is the number of sampled patches. The structural similarity between the input and output is then measured by comparing these maps:

\mathcal{L}_{s}=d\left(S_{x},\ S_{\hat{y}}\right)

The distance function d(\cdot) can be the L_1 distance or the cosine distance, based on the requirements. Figure 7 shows the process of computing FSeSim.
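
The sketch below shows one way these maps could be computed and compared (the patch size, query sampling, and the choice of the L_1 distance as d(\cdot) are illustrative assumptions, not the paper's exact implementation):

```python
import torch
import torch.nn.functional as nnF

def spatial_corr_map(feat, idx, patch_size=7):
    """Compute S_x for the sampled query points.

    feat: (B, C, H, W) features from a fixed extractor (e.g. VGG-16).
    For each query point x_i, correlate its feature f_{x_i} with the features f_{x_*}
    of the N_p = patch_size**2 points in the surrounding patch."""
    b, c, h, w = feat.shape
    patches = nnF.unfold(feat, kernel_size=patch_size, padding=patch_size // 2)
    patches = patches.view(b, c, patch_size ** 2, h * w).permute(0, 3, 1, 2)  # (B, H*W, C, N_p)
    queries = feat.flatten(2).permute(0, 2, 1)                                # (B, H*W, C)
    # S_{x_i} = (f_{x_i})^T f_{x_*} for each sampled query point i.
    return torch.einsum('bsc,bscp->bsp', queries[:, idx], patches[:, idx])    # (B, N_s, N_p)

def fsesim_loss(feat_x, feat_y_hat, num_queries=32):
    """Compare the spatially-correlative maps of the input and the translated image."""
    b, c, h, w = feat_x.shape
    idx = torch.randperm(h * w, device=feat_x.device)[:num_queries]  # same queries for both
    s_x = spatial_corr_map(feat_x, idx)
    s_y = spatial_corr_map(feat_y_hat, idx)
    return nnF.l1_loss(s_x, s_y)   # d(S_x, S_y_hat); a cosine distance could be used instead
```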

Learned Self Similarity (LSeSim) loss: The contrastive loss function is the same as in Section 3.2, except that the query vector v=S_{x_{i}}\in\mathbb{R}^{1\times N_p} is the spatially-correlative map of the query patch, while v^{+}=S_{\hat{x}_{i}}\in\mathbb{R}^{1\times N_p} and v^{-}\in\mathbb{R}^{K\times N_p} are, respectively, the map of the one positive sample (here: an augmented version of the same patch) and the maps of K negative samples. In this learned variant, the structure feature extractor is trained with this loss rather than kept fixed.

Full Objective

The final objective is to minimize the following losses:

where \mathcal{L}_{G} and \mathcal{L}_{D} stand for the generator and discriminator GAN losses and \mathcal{L}_{S} is the contrastive loss. It should be noted that d\left(S_{x},\ S_{\hat{y}}\right) is added to measure the structure loss and \lambda is the trade-off hyperparameter.


Figure 7. An Example of computing FSeSim.


Figure 8. An Example of computing LSeSim.

3.4 Analysis and Discussion

These methods are similar in architecture but differ in their loss criteria. In CycleGAN, the authors adopt a cycled GAN structure with two generators and two discriminators and use the cycle-consistency loss to ensure that images mapped forward and then backward are as close to the originals as possible. However, because of the two GANs, it has a heavy structure and requires a lot of memory. CUT was the first to apply contrastive learning to image translation, without introducing inverse mapping networks or additional discriminators; Li et al. have shown that maximizing mutual information is closely related to the cycle-consistency constraint [15]. By using a contrastive loss, the network structure is greatly lightened and simplified. F/LSeSim also uses a contrastive loss, but it compares spatially-correlative maps instead of the features of a certain layer. In this way, it avoids the dependence between certain features and domain-specific attributes.

For simplicity, these points are summarized in the following table:

| Method | CycleGAN | CUT | F/LSeSim |
| --- | --- | --- | --- |
| Dataset type | unpaired data | unpaired data | unpaired data |
| Output modality | single-modal | single-modal | single/multi-modal |
| Architecture | 2 G + 2 D | 1 G + 1 D | 1 G + 1 D |
| Loss function | GAN + cycle-consistency | GAN + PatchNCE | GAN + self-similarity |
| Loss type | pixel-level | feature-level | spatially-correlative map |
| Contribution | apply cGAN and bijection to unpaired image translation | first to utilize InfoNCE for image translation; abandons inverse mapping networks | avoids the dependence between certain features and domain-specific attributes; faster, lighter, and more efficient |
| Disadvantage | heavy structure, requires large memory; cannot decouple structure and appearance; artefacts | still conflates domain-specific structure and appearance attributes; artefacts | artefacts |

Table 1. Comparison, strengths and weaknesses of mentioned methods.

4. Medical Applications and Experiments

4.1 DetCycleGAN: Endoscopic Image Synthesis

Mitral valve repair is a minimally invasive surgery that requires doctors to correctly and efficiently identify the sutures. Simulated endoscopic images usually look saliently different from images of the intra-operative tissue. To solve this problem, Sharan et al. proposed the Detection-integrated CycleGAN (DetCycleGAN) [16], which mutually improves the performance of image-to-image translation and suture detection. They first adopt CycleGAN for image translation, where the pair of generators G_{sim2or} and G_{or2sim} forms a cycle. Then, with the help of a detection loss, they refine the quality of the generated images. As a result, they obtain better suture detection performance than with plain CycleGAN (see Figure 10; green circles mark true positives, red false positives and orange false negatives).

Figure 9. The proposed DetCycleGAN architecture. The image translation part is shown with black lines, and the combination with the detection network with red lines.

Figure 10. Qualitative comparison between the results of CycleGAN and DetCycleGAN.
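
As a rough, hypothetical sketch of this idea (the function names and the weighting factor are assumptions, not the authors' exact formulation), the detection loss on the translated image can simply be added to the generator objective, so that gradients from the suture detector also refine the generator:

```python
def detcyclegan_generator_loss(loss_cyclegan, detector, fake_or_image, suture_labels,
                               detection_criterion, mu=1.0):
    """Hypothetical combination of translation and detection objectives.

    loss_cyclegan:       the usual CycleGAN generator loss (adversarial + cycle-consistency)
    detector:            suture detection network applied to the translated (sim -> OR) image
    detection_criterion: e.g. a heatmap or classification loss on the suture annotations
    mu:                  assumed weighting factor between translation and detection terms
    """
    loss_detection = detection_criterion(detector(fake_or_image), suture_labels)
    return loss_cyclegan + mu * loss_detection
```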

4.2 ConPres: from Simulated to Realistic Ultrasound Images


Ultrasound (US) is a common medical imaging modality that supports real-time and safe clinical diagnosis. It is particularly popular in gynecology and obstetrics. However, because of the low quality of simulated images and the complex operation, doctors usually have to attend professional training. To make simulated US images look more realistic, Tomar et al. introduced a content-preserving image translation method (ConPres) [17] based on the CUT framework. It uses one encoder-decoder pair to extract the positive and negative patch samples for the contrastive loss, combined with a semantic-consistency regularization that encourages the disentanglement of content and style. Figure 12 shows the higher accuracy of ConPres; the generated images have fewer artefacts than the baselines.

Figure 11. The ConPres architecture (left) and the introduced loss function (right). 

Figure 12. Qualitative comparison of translation results.

4.3 APS: MR-to-CT Image Synthesis with a New Hybrid Objective Function

In addition to translating simulated images into realistic ones, another important application is synthesizing images in other modalities. Magnetic resonance (MR) and computed tomography (CT) images are increasingly used jointly for diagnosis [17], but registering MR and CT images can introduce alignment errors, and the CT scan itself exposes the patient to harmful radiation. Inspired by LSeSim, Ang et al. use spatially-correlative maps to enhance the structural consistency between the input MR images and the translated CT images [18]. Figure 13 illustrates the process of extracting the spatially-correlative maps. Unlike LSeSim, APS requires paired images because it also introduces a pixel-level loss. Compared with the unsupervised methods (see Table 2), the supervised methods perform much better. The results in Figure 14 show little difference for head MR image translation (top), but APS saliently outperforms the others on neck images (middle and bottom).

Figure 13. Spatially-correlative map in MR-to-CT translation.

Table 2. Quantitative comparison of different methods. 


Figure 14. Qualitative comparison of supervised translation methods.
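
A simplified sketch of such a hybrid objective (the λ weights, the L_1 pixel criterion, and the reuse of the fsesim_loss sketch from Section 3.3 are assumptions for illustration; the paper's actual objective contains further components):

```python
import torch.nn.functional as nnF

def hybrid_mr_to_ct_loss(fake_ct, real_ct, loss_adv, loss_struct,
                         lambda_pix=10.0, lambda_struct=1.0):
    """Hypothetical hybrid objective for paired MR-to-CT synthesis.

    loss_adv:    adversarial loss on the synthesized CT (realism)
    loss_struct: spatially-correlative structure loss between MR input and synthesized CT
                 (e.g. the fsesim_loss sketch from Section 3.3)
    The pixel term is possible only because the MR/CT training images are paired.
    """
    loss_pix = nnF.l1_loss(fake_ct, real_ct)   # paired pixel-level supervision
    return loss_adv + lambda_pix * loss_pix + lambda_struct * loss_struct
```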


5. Personal Review

Comparison

  • Translation tasks: These unpaired image-to-image translation methods share the same setting but differ in their loss functions. They all use an adversarial loss to ensure the fidelity of the output images. For the translation itself, CycleGAN introduces the cycle-consistency loss to maximize the pixel-level similarity between the reconstructed and original images, while CUT and F/LSeSim adopt contrastive learning, using the InfoNCE loss at the feature level and the self-similarity loss on spatially-correlative maps respectively, to maximize the mutual information between input and output images.
  • Architectures: Compared to CycleGAN, which uses two pairs of generators and discriminators, CUT and F/LSeSim abandon the auxiliary GAN and can therefore serve as lighter and more practical alternatives.
  • Medical applications: DetCycleGAN, ConPres and APS all still show some artefacts, which could be further reduced. Because of dataset limitations, the results of unpaired image translation (the former two) can hardly outperform those of paired translation (APS). However, since paired images are difficult to obtain, the former two methods are more general and practical.

Strengths

  • Architecture: Unlike the unstable image quality of traditional CNN-based image-to-image translation, these methods take advantage of GANs, where the discriminator helps select sharp, realistic images. Besides, they adopt established loss functions such as the cycle-consistency loss and the InfoNCE loss, avoiding the manual effort of designing effective losses from scratch.
  • Completeness: The methods provide strict and detailed mathematical reasoning, such as the connection between CycleGAN and CUT, and they all include ablation studies to prove the necessity of their critical components.
  • Safety: I2I translation can overcome the limitations of medical equipment, generate images with higher quality or in another modality, and avoid potential radiation or trauma for patients.

Weaknesses

  • Because of privacy protection and the small number of samples, medical datasets are hard to obtain. This can be mitigated by using simulated datasets as supervision.
  • Strictly speaking, unpaired image translation is not "pure" unsupervised learning: unpaired data does not mean unlabeled data. Although the input and target domains are not in one-to-one correspondence, the two image domains themselves act as binary labels. Realizing image-to-image translation from truly unlabeled or asymmetric datasets, or under a semantic condition, remains an open problem for the future.


Reference

[1] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle consistent adversarial networks, In Proceedings of International Conference on Computer Vision, 2017.

[2] Zenglin Shi, Pascal Mettes, Guoyan Zheng, Cees Snoek. Frequency-Supervised MR-to-CT Image Synthesis, Deep Generative Models, and Data Augmentation, Labelling, and Imperfections, 2021.

[3] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014. 

[4] Mehdi Mirza, Simon Osindero. Conditional generative adversarial nets, arXiv preprint arXiv:1411.1784, 2014.

[5] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros. Image-to-Image Translation with Conditional Adversarial Networks, Computer Vision and Pattern Recognition, 2017.

[6] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation, In Advances in Neural Information Processing Systems, 2017.

[7] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks, In Proceedings of International Conference on Machine Learning, 2017.

[8] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. DualGAN: Unsupervised dual learning for image-to-image translation, In Proceedings of International Conference on Computer Vision, 2017.

[9] Taesung Park, Alexei A. Efros, Richard Zhang, Jun-Yan Zhu. Contrastive Learning for Unpaired Image-to-Image Translation, arXiv preprint arXiv:2007.15651, 2020.

[10] Chuanxia Zheng, Tat-Jen Cham, Jianfei Cai. The Spatially-Correlative Loss for Various Image Translation Tasks, arXiv preprint arXiv:2104.00854, 2021.

[11] Aaron van den Oord, Yazhe Li, Oriol Vinyals. Representation learning with contrastive predictive coding, arXiv preprint arXiv:1807.03748, 2018.

[12] Goodfellow et al. Generative Adversarial Networks, In Advances in Neural Information Processing Systems, 2014, (slide from McGuinness). 

[13] Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton. A simple framework for contrastive learning of visual representations, In Proceedings of International Conference on Machine Learning, 2020.

[14] Chunyuan Li, Hao Liu, Changyou Chen, Yunchen Pu, Liqun Chen, Ricardo Henao, Lawrence Carin. ALICE: Towards understanding adversarial learning for joint distribution matching, In Advances in Neural Information Processing Systems, 2017.

[15] Lalith Sharan, Gabriele Romano, Sven Koehler, Halvar Kelm, Matthias Karck, Raffaele De Simone, Sandy Engelhardt. Mutually Improved Endoscopic Image Synthesis and Landmark Detection in Unpaired Image-to-Image Translation, in IEEE Journal of Biomedical and Health Informatics, 2022.

[16] Maria A. Schmidt and Geoffrey S. Payne, Radiotherapy planning using MRI, Physics in Medicine & Biology, 2015.

[17] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, Ole Winther. Autoencoding beyond pixels using a learned similarity metric, In Proceedings of International Conference on Machine Learning, 2016.

[18] Sui Paul Ang, Son Lam Phung, Matthew Field, Mark Matthias Schira. An Improved Deep Learning Framework for MR-to-CT Image Synthesis with a New Hybrid Objective Function, International Symposium on Biomedical Imaging, 2022.
