Author: Unknown User (ge32qaj)
Introduction
Domain adaptation is the process of taking a model trained on one domain and adapting it so that it works on a different but related target domain. This is non-trivial because the model has to learn to account for the domain shift. For example, the source and target domains could be images taken during day and night, images of horses and images of zebras, or virtual images and real images. The latter case is particularly interesting because it allows a model to learn from a large amount of virtually generated training data and then be applied to real-world scenarios. The advantage of domain adaptation is that little or even no annotated data is needed in the target domain; if no target labels are used at all, the setting is called unsupervised domain adaptation. In the following picture, an adaptation from the virtual domain into the real world is visualized.
DA for IS from virtual to real domain; source: [1]
In image segmentation, the task is to assign a label to each pixel of an input image. In the medical domain, the labels can be different organs or different parts of them. Image segmentation is an important subtask of many other image understanding tasks, and therefore high reliability is very much desired.
Organ segmented into different parts; source: [2]
Now that both parts – domain adaptation and image segmentation – have been introduced, we can look at why unsupervised domain adaptation for image segmentation is a relevant field in the medical domain.
Classical convolutional neural networks require a large amount of labeled training data in order to produce good results. However, in many domains – and especially in the medical domain – the lack of labeled training data is a major challenge. Labeling training data is time-consuming and requires costly experts. Therefore, it is desirable to leverage the available training data as much as possible, not only in the domain it belongs to. For example, a labeled training set of CT images can be used not only to train a CT segmentation network, but also to obtain a good segmentation network for MRI or ultrasound images.
Although the topic is interesting for all domains of medical imaging and DA is in principle possible and desired for all combinations of source and target domains, the existing papers mostly use MRI and CT images as their example and only mention afterwards that their methods could be applied to other areas as well. Especially for ultrasound imaging, it would have been interesting to find dedicated papers, because this domain is particularly challenging due to the low image quality caused by attenuation, noise and acoustic shadowing. Unfortunately, none of the papers I found specifically addressed these issues of ultrasound imaging.
Related Work
General Approaches to the Problem
Three general approaches exist for unsupervised domain adaptation for image segmentation. Most of the current state-of-the-art methods combine ideas from several of these classes.
The first one is feature alignment. Here, the idea is to learn domain-invariant features of both domains and use them to train a segmentation network. The features can be learned using, for example, a shared encoder or an adversarial loss, and this can be done at different places in the encoding network, e.g. on lower-level or on higher-level features. In adversarial training, an auxiliary discriminator is learned that tries to distinguish between feature vectors from the source and the target domain. The encoding network learns a representation that tries to fool this discriminator and is therefore pushed towards domain-invariant feature vectors.
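To make the adversarial variant more concrete, here is a minimal sketch of adversarial feature alignment in PyTorch; the module names and the tiny discriminator architecture are illustrative assumptions and are not taken from any of the cited papers.

```python
# Minimal sketch of adversarial feature alignment (illustrative, not from a
# specific paper). A discriminator tries to tell source from target feature
# maps, while the encoder is trained to fool it.
import torch
import torch.nn as nn

class FeatureDiscriminator(nn.Module):
    """Small conv net that classifies feature maps as source (1) or target (0)."""
    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 3, stride=2, padding=1),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, f):
        return self.net(f)  # raw logits, shape (batch, 1)

bce = nn.BCEWithLogitsLoss()

def discriminator_loss(disc, f_src, f_tgt):
    # Discriminator step: detach features so only the discriminator is updated.
    pred_src = disc(f_src.detach())
    pred_tgt = disc(f_tgt.detach())
    return bce(pred_src, torch.ones_like(pred_src)) + bce(pred_tgt, torch.zeros_like(pred_tgt))

def encoder_adversarial_loss(disc, f_tgt):
    # Encoder step: make target features indistinguishable from source features.
    pred_tgt = disc(f_tgt)
    return bce(pred_tgt, torch.ones_like(pred_tgt))
```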
The second approach is disentangled representation. This is similar to feature alignment, except that here the images from each domain are embedded into two feature spaces: a shared, domain-invariant content space and a domain-specific style space.
The third approach is image alignment. In contrast to feature alignment, which acts on the feature space, image alignment aims to align the image space, also called pixel space. That means that images from the source domain, for which the ground truth is known, get translated into images in the style of the target domain. The result is a set of target-like images with known ground truth, with which a target segmentation network can be trained. To achieve the translation from source to target-like images, a commonly used method is the cycle consistency loss introduced by [3]. In cycle-consistent adversarial networks, the GAN loss helps to generate realistic images, while the cycle-consistency loss keeps a tight relationship between both domains.
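As a concrete illustration, here is a minimal sketch of the cycle-consistency term of [3]; the generator names G_st and G_ts are chosen for illustration only.

```python
# Minimal sketch of a cycle-consistency loss as in [3] (names illustrative).
# G_st translates source -> target-like images, G_ts the reverse direction.
import torch.nn.functional as F

def cycle_consistency_loss(G_st, G_ts, x_src, x_tgt):
    # source -> target-like -> reconstructed source
    rec_src = G_ts(G_st(x_src))
    # target -> source-like -> reconstructed target
    rec_tgt = G_st(G_ts(x_tgt))
    # The L1 distance encourages the translation to be (approximately) invertible.
    return F.l1_loss(rec_src, x_src) + F.l1_loss(rec_tgt, x_tgt)
```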
With the help of the cycle consistency loss, a typical approach for domain adaptation is to translate the labeled source images into the target domain (cross-modality image translation) and to use the translated images to train a segmentation network in the target domain.
But this basic approach has several drawbacks. First, there is an ambiguity problem, since multiple cross-modality mappings may satisfy the cycle-consistency constraint. The translated images therefore do not necessarily show the same anatomical structures as the input images, so the segmentation may not be correct. The other problem is that the cycle-consistency regularization acts on the pixel space and thus equally on each pixel. As a consequence, it works well when the domains differ in low-level features like textures, but not when the domains differ in higher-level features like anatomical structures. Some of the papers presented later that use the cycle consistency loss introduce different ideas to improve on this basic approach.
Nevertheless, one of the foundational works in the field of unsupervised domain adaptation for image segmentation is CyCADA [4], which uses a cycle consistency loss to train its generator for target-like source images. It performs image alignment and adds several adversarial losses to also perform feature alignment.
Presented Methods
In this section, five recent papers in the medical field are presented. The focus on the medical field is intentional, as this blog post is part of “Deep Learning for Medical Applications”.
Unsupervised Bidirectional Cross-Modality Adaptation via Deeply Synergistic Image and Feature Alignment for Medical Image Segmentation (SIFA) – 2020
Schema of SIFA network; source: [6]
SIFA [6] is basically CycleGAN with additional discriminators that emphasize feature-level adaptation. The source images (s) get translated into target-like images (t) and then fed into the (green) encoder. The encoder is shared between the (purple) segmentation network and the (orange) decoding part. In the (purple) segmentation subnet, the features are used to predict a segmentation. In the (orange) decoding part, the features are used to reconstruct source-like images. For the s→t→s images, the cycle consistency loss can then be applied, and for both the s→t→s and the t→s images an additional adversarial loss can be used to encourage the (green) encoder to produce domain-invariant features. In the (purple) segmentation subnet, an adversarial loss is used to distinguish between the outputs for target images and target-like images, and for the latter the actual task loss with the known source labels is applied. An additional idea of SIFA is to apply the purple part not only to the final high-level features of the encoder but, for training purposes only, also to features from a lower level.
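To make the interplay of these losses more tangible, the following sketch composes a SIFA-style generator objective for one training step. Module names, loss weights and the omission of the lower-level deep-supervision branch are simplifying assumptions, not the authors' implementation.

```python
# Illustrative composition of SIFA-style losses (module names and weights are
# assumptions). The discriminator updates are omitted; only the generator side
# of the adversarial terms is shown.
import torch
import torch.nn.functional as F

def sifa_generator_objective(G_st, encoder, seg_head, decoder,
                             disc_feat, disc_seg,
                             x_src, y_src, x_tgt,
                             w_cycle=10.0, w_adv=1.0):
    # 1) Image alignment: translate source images into target-like images.
    x_src2tgt = G_st(x_src)

    # 2) Shared encoder on translated (s->t) and real target images.
    f_src2tgt = encoder(x_src2tgt)
    f_tgt = encoder(x_tgt)

    # 3) Segmentation task loss on target-like images with the known source labels.
    seg_pred = seg_head(f_src2tgt)
    loss_seg = F.cross_entropy(seg_pred, y_src)

    # 4) Cycle consistency: decode the features back to a source-like image (s->t->s).
    rec_src = decoder(f_src2tgt)
    loss_cycle = F.l1_loss(rec_src, x_src)

    # 5) Adversarial terms that push the encoder / predictions for real target
    #    images to look like those for translated source images.
    logit_feat = disc_feat(f_tgt)
    logit_seg = disc_seg(seg_head(f_tgt))
    loss_adv = (F.binary_cross_entropy_with_logits(logit_feat, torch.ones_like(logit_feat))
                + F.binary_cross_entropy_with_logits(logit_seg, torch.ones_like(logit_seg)))

    return loss_seg + w_cycle * loss_cycle + w_adv * loss_adv
```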
Self-Attentive Spatial Adaptive Normalization For Cross Modality Domain Adaptation (SASAN) – 2021
The new idea in SASAN [7] is to use multiple attention maps generated from the source image as input for the decoder part and for the segmentation network. They help to preserve the high-level geometrical relationships between the anatomical structures. An additional loss, called attention regularization loss, encourages the attention layers to learn different attention maps; the loss is simply a measure of how orthogonal the maps are to each other. This should avoid redundancy between the attention maps.
The training of SASAN is again GAN-based, but uses the least-squares GAN loss, and the discriminators are PatchGAN-based. The discriminator B in the shown schema therefore tries to differentiate between real and fake target images. Like SIFA, the network uses a cycle consistency loss (top red arrow in the schema) to learn to generate fake target-like images from source images, but in the decoding part of the generators each layer gets an attention map as additional input. Only the attention maps plus one additional convolution layer per segmentation class are then used to generate the segmentation prediction, which is compared to the ground truth using cross-entropy loss and Dice loss.
Schema of SASAN network; source: [7]
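The attention regularization idea mentioned above can be illustrated with a small sketch that penalizes pairwise overlap between flattened attention maps; the exact formulation in SASAN may differ, this only captures the orthogonality intuition.

```python
# Sketch of an attention regularization loss: penalize non-orthogonality
# between attention maps so that they do not become redundant.
import torch
import torch.nn.functional as F

def attention_regularization(attn_maps: torch.Tensor) -> torch.Tensor:
    """attn_maps: (batch, num_maps, H, W) attention maps."""
    b, k, h, w = attn_maps.shape
    flat = attn_maps.view(b, k, h * w)
    flat = F.normalize(flat, dim=-1)                    # unit length per map
    gram = torch.bmm(flat, flat.transpose(1, 2))        # (b, k, k) cosine similarities
    off_diag = gram - torch.eye(k, device=gram.device)  # remove self-similarity
    return (off_diag ** 2).mean()                       # small when maps are orthogonal
```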
Anatomy-Regularized Representation Learning for Cross-Modality Medical Image Segmentation (ARL-GAN) – 2021
As already mentioned, pixel-wise cycle-consistency adversarial models have the problem that high-level semantic information is not necessarily preserved during the transformation from the source to the target domain. The goal of ARL-GAN [8] is to explicitly preserve the anatomical structure information during synthesis. They use a least-squares adversarial loss (brown) for training the generators (orange) but do not use a cycle consistency loss. Instead, they use anatomy-regularized representation learning: the generators are encouraged to generate transformations that have equal encodings S_E() in the shared latent space. To ensure that anatomical structures are preserved, the shared encoder S_E() (green) is trained with the idea that for any two images the difference of their encodings should be preserved by their counterparts in the other domain. With respect to the schema, this means that the loss that trains the shared encoder quantifies how well the following holds: S_E(x_i^R) - S_E(x_j^R) = S_E(\hat{x}_i^T) - S_E(\hat{x}_j^T), where \hat{x}^T is the target-like transformation of the source image x^R. The encoder additionally gets a binary mask vector that encodes from which domain the input image comes. The encoding of S_E() is then used as the input for a segmentation network that – like in the previous papers – is trained by leveraging the known ground truth of the transformed source images.
Schema of ARL-GAN training; source: [8]
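A minimal sketch of this difference-preservation constraint follows; the encoder is passed in as a callable, and the additional domain mask input mentioned above is omitted for simplicity.

```python
# Sketch of the anatomy-regularization constraint: the encoding difference of
# two source images should equal the encoding difference of their target-like
# translations (names illustrative, details differ from the paper).
import torch.nn.functional as F

def anatomy_regularization_loss(S_E, x_i, x_j, x_i_fake, x_j_fake):
    # Difference of encodings of two real source-domain images ...
    diff_real = S_E(x_i) - S_E(x_j)
    # ... should be preserved by their translated counterparts.
    diff_fake = S_E(x_i_fake) - S_E(x_j_fake)
    return F.l1_loss(diff_fake, diff_real)
```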
Adapt Everywhere: Unsupervised Adaptation of Point-Clouds and Entropy Minimization for Multi-modal Cardiac Image Segmentation – 2021
Adapt Everywhere [9] adds two novel ideas to improve the feature encoding part. The method adapts network features between the source and target domain not only by adversarial feature alignment but also by adversarial alignment in two other spaces, namely the pixel-wise entropy and point-cloud spaces. These three spaces can be recognized as the three rectangular blocks in the middle of the following image, in blue, purple and green. The red blocks are the feature encoding and segmentation prediction networks; using the ground truth for the source-domain images, these red parts learn to do the segmentation. All three rectangular blocks explained below help the shared encoder in red to learn domain-invariant feature representations for source and target domain images.
- The features (in yellow), which are the output of the encoding part of the generators (in red), are used as input to the point cloud generation network (in blue). This fully connected network generates 300 points. For the source images, of which the ground truth is known, the organ annotations are combined to produce the external heart surface, and the ground-truth point cloud is then computed using the marching cubes algorithm and farthest point sampling. The predicted point cloud is compared with the ground truth using the Earth Mover's Distance \mathcal{L}_{EMD}. Using an adversarial discriminator \mathcal D_3, this knowledge about the point clouds is transferred to the target domain. Through the point clouds, the model should learn to better represent the shapes of the organs.
- Using the encodings, the generators (in red) produce class-wise segmentation predictions S, which are fed into a discriminator \mathcal D_1, which tries to differentiate between predictions from the source and the target domain.
- Additionally, from these class-wise segmentation predictions, the pixel-wise entropy is calculated and the resulting maps E are again fed into a discriminator \mathcal D_2. The hope is that this encourages the model to become more confident at the borders of the segmented classes (a sketch of such an entropy map is shown below the figure).
Schema of the “Adapt Everywhere” network; source: [9]
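To illustrate the entropy part, here is a minimal sketch of how a pixel-wise entropy map can be computed from class-wise segmentation predictions; the normalization by log(C) is an assumption for readability, not necessarily what the authors do.

```python
# Sketch: pixel-wise entropy map from class-wise segmentation logits.
# Low entropy means the model is confident at that pixel; the discriminator
# D2 described above operates on such maps.
import math
import torch
import torch.nn.functional as F

def entropy_map(logits: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """logits: (batch, num_classes, H, W) raw segmentation outputs."""
    probs = F.softmax(logits, dim=1)
    # Shannon entropy per pixel, summed over the class dimension.
    ent = -(probs * torch.log(probs + eps)).sum(dim=1)   # (batch, H, W)
    # Normalize by log(C) so values lie roughly in [0, 1] (an assumption).
    return ent / math.log(logits.shape[1])
```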
Diverse data augmentation for learning image segmentation with cross-modality annotations – 2021
This paper [10] uses the idea of disentangled representation. As shown in the following picture, the source and the target images are both encoded into two feature spaces: the structural representation s and the appearance representation a. The former is related to the segmentation task and is shared between the domains; the latter is neither. The authors assume the learned distribution of the target appearance space to be Gaussian, so it is possible to sample from it randomly (\tilde{a}_2). This way, multiple synthetic target-like images (\hat{x}_{1|2}, \hat{x}_{1|2}^*) can be generated by combining a single structural representation s_1 of a source-domain image with multiple samples \tilde{a}_2, \tilde{a}_2^* from the target appearance space. These synthetic images, which still have a known ground truth, are then used to train a segmentation network.
Schema of the idea behind “Diverse Data Augmentation”; source: [10]
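The sampling idea can be sketched as follows, assuming an encoder/decoder interface with the names below (illustrative, not the authors' API).

```python
# Sketch of the augmentation idea: keep the structural code of a labeled
# source image and pair it with several appearance codes sampled from the
# (assumed Gaussian) target appearance distribution.
import torch

def synthesize_target_like(structure_encoder, decoder, x_src,
                           num_samples=4, appearance_dim=8):
    s1 = structure_encoder(x_src)            # structural code of the source image
    synthetic = []
    for _ in range(num_samples):
        a2 = torch.randn(x_src.shape[0], appearance_dim)  # sampled appearance code
        synthetic.append(decoder(s1, a2))     # target-like image, label of x_src still valid
    return synthetic
```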
Experiments
One of the most commonly used datasets for unsupervised domain adaptation for image segmentation in the medical domain is the MM-WHS dataset [11] of whole-heart CT and MRI scans. In the following image, an example from that dataset is shown. The segmentation classes are different parts of the heart that should be distinguished.
Example from MM-WHS dataset; source: [11]
To evaluate domain adaptation, essentially just the segmentation quality of the learned model on the target domain is evaluated, as in the general segmentation task.
Two major metrics for this are the Dice Similarity Coefficient (DSC), where higher is better, and the Average Symmetric Surface Distance (ASSD), where lower is better:
\mathrm{DSC}=\frac{2|P \cap G|}{|P|+|G|}
\mathrm{ASSD}=\frac{1}{\left|B_{G}\right|+\left|B_{P}\right|}\left(\sum_{a \in B_{G}} \min _{b \in B_{P}}\|a-b\|+\sum_{b \in B_{P}} \min _{a \in B_{G}}\|b-a\|\right)
P represents the prediction and G the ground-truth mask, and P \cap G is their intersection. B_P and B_G denote the boundary points of the prediction and the ground truth, respectively, and \lVert \cdot \rVert is the Euclidean distance.
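For reference, here is a minimal sketch of how both metrics can be computed for binary masks; extracting boundary points via morphological erosion and ignoring voxel spacing are simplifying assumptions.

```python
# Sketch: Dice Similarity Coefficient and Average Symmetric Surface Distance
# for binary masks (boundary via erosion; voxel spacing ignored for brevity).
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial.distance import cdist

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum())

def boundary_points(mask: np.ndarray) -> np.ndarray:
    mask = mask.astype(bool)
    boundary = mask & ~binary_erosion(mask)   # pixels on the object border
    return np.argwhere(boundary)

def assd(pred: np.ndarray, gt: np.ndarray) -> float:
    b_p, b_g = boundary_points(pred), boundary_points(gt)
    dists = cdist(b_g, b_p)                   # pairwise Euclidean distances
    # Average of all nearest-boundary distances in both directions.
    return (dists.min(axis=1).sum() + dists.min(axis=0).sum()) / (len(b_g) + len(b_p))
```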
In the following table, the results that the different papers reported on the MM-WHS dataset are summarized. It would be desirable to have more and better possibilities to compare the different approaches; the focus on the MM-WHS cardiac dataset is due to the fact that it is the only dataset that has been broadly used for evaluation in this specific research field.
Table of stated results by the respective papers; source: [4,6,7,8,9,10]
In the first row of the table, the lower bound for the results is shown: models trained on the source domain are applied to the target domain without any adaptation. The second row acts as an upper bound, because it shows the results for models trained directly on the target domain with ground truth.
Student Review
The presented methods all reach good results on the MM-WHS test set, but it is not clear which one is the best. The numbers for SASAN are not even fully comparable, because the authors used a slightly different test setting, which would also yield higher values if applied to the other methods. All in all, it is obvious that the newer methods successfully counter the problems of a pure cycle consistency loss; see the difference to CycleGAN or CyCADA. Keeping in mind that all these methods could be further improved by using the limited amount of available target data, the capabilities of the new methods could be quite useful.
It is notable that Adapt Everywhere reports a particularly good result for the mode in which it is used without adaptation; this should be investigated further.
The strength of the first four papers is that they all find different ways to improve on the basic capability of the simple cycle consistency loss or adversarial loss, which used to be the door opener for this area. All these different ideas seem to work and could now also be used in other fields or in other combinations. The core ideas are:
- SIFA: additional discriminators to emphasize feature-level adaptation
- SASAN: training attention maps and using them in the segmentation network and in the generator for the domain translation
- ARL-GAN: the difference in the encodings of two images in one domain should be the same as for their counterparts in the other domain
- Adapt Everywhere: using two additional spaces to give the model a better understanding of segmentation borders and organ shapes, based on pixel-wise entropy and algorithmically computed point clouds
In my opinion, the papers generally lack major new ideas, relying mainly on various combinations and minor modifications of existing ones. Also, since the comparison is not as straightforward as desired, it is hard to decide which of the newly proposed methods really introduce particularly good and new ideas, and which just do existing methods a little bit differently and therefore score slightly higher on tests simply by luck. It would be desirable to have larger datasets that are used more commonly. In line with this criticism, more in-depth ablation studies, especially for ARL-GAN, are also missing; they would be needed to verify the usefulness of all the proposed components.
With this criticism in mind, my personal ideas for future work would be the following:
- build larger datasets
- create and agree on better evaluation frameworks for comparison
- research into using simulated data as a source domain, as is commonly done outside the medical field
- explore the potential of the idea behind the Diverse Data Augmentation paper, which generates a kind of virtual training data
- use transformers and leverage their special way of working
- I am unaware of any DA-for-IS method that relies on transformers (except SASAN, which just uses attention as a sub-block without, in my opinion, leveraging its special properties)
References
[1]: Csurka, Gabriela, Riccardo Volpi and Boris Chidlovskii. “Unsupervised Domain Adaptation for Semantic Image Segmentation: a Comprehensive Survey.”
[2]: Tomar, D., M. Lortkipanidze, G. Vray, B. Bozorgtabar and J.-P. Thiran. "Self-Attentive Spatial Adaptive Normalization for Cross-Modality Domain Adaptation."
[3]: Zhu, Jun-Yan, Taesung Park, Phillip Isola and Alexei A. Efros. “Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks.” 2017 IEEE International Conference on Computer Vision (ICCV) (2017): 2242-2251.
[4]: Hoffman, Judy, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A. Efros and Trevor Darrell. “CyCADA: Cycle-Consistent Adversarial Domain Adaptation.” ICML (2018).
[5]: Bermúdez-Chacón, Róger, Pablo Márquez-Neila, Mathieu Salzmann and P. Fua. “A domain-adaptive two-stream U-Net for electron microscopy image segmentation.” 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018) (2018): 400-404.
[6]: Chen, Cheng, Qi Dou, Hao Chen, Jing Qin and Pheng-Ann Heng. "Unsupervised Bidirectional Cross-Modality Adaptation via Deeply Synergistic Image and Feature Alignment for Medical Image Segmentation." IEEE Transactions on Medical Imaging 39 (2020).
[7]: Tomar, Devavrat, Manana Lortkipanidze, Guillaume Vray, Behzad Bozorgtabar and Jean-Philippe Thiran. “Self-Attentive Spatial Adaptive Normalization for Cross-Modality Domain Adaptation.” IEEE Transactions on Medical Imaging 40 (2021): 2926-2938
[8]: Chen, X., Lian, C., Wang, L., Deng, H., Kuang, T., Fung, S., Gateno, J., Yap, P. T., Xia, J. J., & Shen, D. (2021). Anatomy-Regularized Representation Learning for Cross-Modality Medical Image Segmentation. IEEE transactions on medical imaging, 40(1), 274–285. https://doi.org/10.1109/TMI.2020.3025133
[9]: Vesal, Sulaiman, Mingxuan Gu, Ronak Kosti, Andreas K. Maier and Nishant Ravikumar. “Adapt Everywhere: Unsupervised Adaptation of Point-Clouds and Entropy Minimization for Multi-Modal Cardiac Image Segmentation.” IEEE Transactions on Medical Imaging 40 (2021): 1838-1851.
[10]: Chen, X., Lian, C., Wang, L., Deng, H., Kuang, T., Fung, S. H., Gateno, J., Shen, D., Xia, J. J., & Yap, P. T. (2021). Diverse data augmentation for learning image segmentation with cross-modality annotations. Medical image analysis, 71, 102060. https://doi.org/10.1016/j.media.2021.102060
[11]: Zhuang, Xiahai and Juan Shen. "Multi-scale patch and multi-modality atlases for whole heart segmentation of MRI."; Zhuang, Xiahai. "Challenges and Methodologies of Fully Automatic Whole Heart Segmentation: A Review."