This blogpost summarises the paper "Anatomy-Constrained Contrastive Learning for Synthetic Segmentation Without Ground-Truth" [1]
1. Motivation
Image segmentation plays a vital role in many medical imaging applications by supporting or automating the tracing of anatomical structures and regions of interest in medical images. In recent years, more and more deep-learning-based segmentation frameworks have been proposed and adopted to improve segmentation performance.
The success of deep-learning-based approaches often depends on the availability of large datasets to train robust networks. This poses a major challenge for medical use cases in general, and especially for segmentation tasks, since acquiring ground-truth data through manual segmentation by radiologists is very expensive.
Throughout diagnosis and treatment, it is common to acquire several imaging modalities of the same subject. Unfortunately, these images are usually not registered to each other, so they are unpaired, and one would need to manually delineate the regions of interest in each modality separately, which is infeasible.
The presented work proposes to alleviate these challenges by translating unpaired images from a source imaging modality with abundant labeled training data to the specialized target imaging modality, with a focus on preserving anatomical information and without the need for paired images. This way, a robust segmentation network can be trained on synthetic training data without any ground-truth annotations in the target domain.
Figure 1. The proposed architecture, consisting of a domain-translating GAN with an anatomy constraint and a contrastive learning setting, plus a segmenter. Both are trained end-to-end with five losses. (adapted from [1])
2. Methods
The general challenge of previous work was to train a robust domain adaptation network that does not lose the information relevant for segmentation, such as anatomical structures and the contrast between organs.
Motivated by this challenge, the paper combines anatomical losses that preserve anatomical structures during domain translation with a patch-contrastive loss that strengthens the association between an organ in the source image and its equivalent in the synthesized output image while dissociating irrelevant organs.
The network architecture AccSeg-Net proposed in this paper, seen in Figure 1, consists of a generator (ResNet) that adapts images to the target domain and a discriminator (CNN) that tries to distinguish real images of the target domain from synthetic images generated by the generator.
The conventional GAN training procedure was improved by the following modifications to preserve anatomical consistency:
2.1) an anatomy-constrained loss, 2.2) a patch-contrastive loss, 2.3) an identity loss, and 2.4) a Dice loss for segmentation, each of which is briefly explained below.
2.1 Anatomy-constrained loss
The image adaptation process should only change the image's appearance to match the target domain while preserving anatomical features. Therefore, an anatomy-constrained loss function with two equally weighted parts is used. The Pearson correlation coefficient (CC) is used as the first term; it is a translation- and scale-invariant measure that fits the need of maintaining anatomical structures very well, since it does not focus on precise intensity values but on shapes. As the second term, the Modality Independent Neighbourhood Descriptor (MIND) [2], originally developed for image registration, is utilized to calculate an L1 loss between the source and the synthesized image. MIND captures the anatomical structure in the region around each pixel, which should be independent of the imaging modality, as seen in Figure 2. Combined, MIND and CC encourage the generator to preserve the anatomical structure, which is crucial for good segmentation performance.
The MIND descriptor of an image $I$ at voxel $x$ for an offset $r$ in a neighbourhood $\mathcal{R}$ is defined in [2] as

$$\mathrm{MIND}(I, x, r) = \frac{1}{n}\,\exp\!\left(-\frac{D_p(I, x, x + r)}{V(I, x)}\right), \quad r \in \mathcal{R},$$

where $D_p(I, x, x + r)$ is a distance between the patches around voxel $x$ and its neighbour $x + r$, $V(I, x)$ is the local variance at $x$, and $n$ is a normalisation constant.
Figure 2. MIND calculated in CT and MRI highlighting three different types of image features (blue, yellow, green). The corresponding descriptors (Boxes with borders) are modality independent and can be used for loss calculation. (adapted from [2])
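To make the two terms concrete, below is a minimal PyTorch sketch of how the anatomy-constraint loss could look, assuming 2D single-channel tensors of shape (N, 1, H, W); the patch radius, the 4-neighbourhood offsets, and the equal 0.5/0.5 weighting are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F


def pearson_cc_loss(x, y, eps=1e-8):
    """1 - Pearson correlation coefficient, computed per image and averaged."""
    x = x.flatten(1)
    y = y.flatten(1)
    xc = x - x.mean(dim=1, keepdim=True)
    yc = y - y.mean(dim=1, keepdim=True)
    cc = (xc * yc).sum(dim=1) / (xc.norm(dim=1) * yc.norm(dim=1) + eps)
    return 1.0 - cc.mean()


def mind_descriptor(img, patch_radius=2, eps=1e-8):
    """Simplified 2D MIND: exp(-D_p / V) over the 4-neighbourhood offsets,
    where D_p is a box-filtered patch distance and V the mean of those distances."""
    k = 2 * patch_radius + 1
    box = torch.ones(1, 1, k, k, device=img.device) / (k * k)
    offsets = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    dists = []
    for dy, dx in offsets:
        shifted = torch.roll(img, shifts=(dy, dx), dims=(2, 3))
        diff2 = (img - shifted) ** 2
        dists.append(F.conv2d(diff2, box, padding=patch_radius))
    d = torch.cat(dists, dim=1)                    # (N, 4, H, W) patch distances D_p
    v = d.mean(dim=1, keepdim=True) + eps          # local variance estimate V(I, x)
    mind = torch.exp(-d / v)
    return mind / (mind.amax(dim=1, keepdim=True) + eps)  # normalise per location


def anatomy_constraint_loss(source, synthetic):
    """Equally weighted CC term and MIND L1 term between source and synthesized image."""
    cc = pearson_cc_loss(source, synthetic)
    mind = F.l1_loss(mind_descriptor(source), mind_descriptor(synthetic))
    return 0.5 * cc + 0.5 * mind
```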
2.2 Patch contrastive loss
As additional supervision for the synthetic image generation, a patch-contrastive loss is applied. The idea is to associate organs from the source domain with the equivalent organ in the target domain while dissociating other organs. For example, as seen in Figure 3, the kidney marked green in the target domain should be more strongly associated with the kidney marked purple in the source domain than with the negative samples marked orange.
For a patch in our target domain, we find the equivalent patch in our source image as a positive sample and select other patches as negative ones. Based on this, a classification task using softmax cross-entropy (CE) is created to express the probability of selecting the equivalent patch. Finally, this is applied as further supervision of our generator to improve the association between input and output patches.
For the implementation of this concept, the encoder of the generator is utilized since its feature stack already represents patches of the input image, where deeper layers represent larger patches. Thus, each feature stack is passed through a fully-connected layer to project the patch to a latent space [3] where the CE loss can be applied.
Following [3], the patch-contrastive loss is formulated as a softmax cross-entropy over the similarities of the projected patch features,

$$\mathcal{L}_{\mathrm{contrastive}} = \sum_{l \in L} \sum_{s \in S_l} -\log\!\left[\frac{\exp(\hat{z}_l^{s} \cdot z_l^{s} / \tau)}{\exp(\hat{z}_l^{s} \cdot z_l^{s} / \tau) + \sum_{n \in S_l \setminus \{s\}} \exp(\hat{z}_l^{s} \cdot z_l^{n} / \tau)}\right],$$

where $z_l^{s}$ is the input-associated (positive) feature, $z_l^{n}$ with $n \in S_l \setminus \{s\}$ are the input-disassociated (negative) features, and $L$ are the chosen layers of the encoder. A spatial location of layer $l$ is represented by $s \in S_l$, while $\tau$ is an empirically chosen temperature hyper-parameter.

Figure 3. Visualises the patch-contrastive loss. The patches that should be associated (green, purple) and the negative patches (orange) are passed through the encoder of G, followed by a fully connected layer, to obtain the patch feature vectors used by the loss. (adapted from [1])
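The PyTorch sketch below illustrates this PatchNCE-style loss in the spirit of [3]; the number of sampled locations, the per-layer projection heads, and the temperature value tau=0.07 are assumptions for illustration, and the actual layer selection is left to the caller.

```python
import torch
import torch.nn.functional as F


def patch_contrastive_loss(feats_src, feats_syn, projections, num_patches=256, tau=0.07):
    """feats_src / feats_syn: lists of encoder feature maps (N, C_l, H_l, W_l) extracted
    from the same chosen layers for the source image and the synthesized image.
    projections: list of per-layer projection heads (e.g. nn.Linear layers)."""
    total = 0.0
    for f_s, f_t, proj in zip(feats_src, feats_syn, projections):
        n, c, h, w = f_s.shape
        # sample the same spatial locations (patches) in both feature maps
        idx = torch.randperm(h * w, device=f_s.device)[:num_patches]
        z_s = proj(f_s.flatten(2)[:, :, idx].permute(0, 2, 1))   # (N, S, D) source patches
        z_t = proj(f_t.flatten(2)[:, :, idx].permute(0, 2, 1))   # (N, S, D) synthetic patches
        z_s = F.normalize(z_s, dim=-1)
        z_t = F.normalize(z_t, dim=-1)
        # logits[i, j]: similarity of synthetic patch i with source patch j;
        # the diagonal entries are the positives (same anatomical location).
        logits = torch.bmm(z_t, z_s.transpose(1, 2)) / tau        # (N, S, S)
        labels = torch.arange(logits.size(1), device=logits.device)
        labels = labels.unsqueeze(0).expand(logits.size(0), -1)
        total = total + F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    return total / len(projections)
```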
2.3 Identity loss
As demonstrated by previous work on unpaired translation tasks [4,5], the authors added an identity loss for additional regularization, encouraging the generator to act as an identity mapping for input images that already come from the target domain.
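As a small illustration, the identity term can be as simple as an L1 penalty between a target-domain image and its translation; the L1 formulation below is an assumption in the style of CycleGAN-like identity losses [5].

```python
import torch.nn.functional as F


def identity_loss(generator, target_image):
    # a target-domain image passed through the generator should stay unchanged
    return F.l1_loss(generator(target_image), target_image)
```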
2.4 Segmentation loss
The segmentation network uses the region-based Dice loss derived from the Dice coefficient [6], which is widely used in computer vision. The authors stated that a more advanced loss [6] could replace it to improve segmentation performance. Importantly, this loss is also backpropagated into the domain adaptation part, providing additional supervision for the whole process.
It is also worth mentioning that the proposed architecture is independent of the segmentation network used. The authors performed successful ablation studies, replacing the segmentation network with R2UNet [7], Attention-UNet [8], and UNet [9].
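A minimal soft Dice loss sketch in PyTorch is shown below; the smoothing term and the expectation that the network outputs probabilities are assumptions for illustration.

```python
import torch


def dice_loss(pred, target, smooth=1.0):
    """pred: sigmoid/softmax probabilities (N, C, H, W); target: one-hot labels of the same shape."""
    pred = pred.flatten(1)
    target = target.flatten(1)
    intersection = (pred * target).sum(dim=1)
    dice = (2.0 * intersection + smooth) / (pred.sum(dim=1) + target.sum(dim=1) + smooth)
    return 1.0 - dice.mean()
```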
2.5 Adversarial loss
The discriminator, which tries to distinguish between real images I_b and generated images Î_b, is supervised by the usual adversarial loss from GANs [10], which the discriminator tries to maximize while the generator tries to minimize.
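For reference, here is a sketch of this min-max objective using the binary cross-entropy formulation of the original GAN loss [10]; least-squares variants are also common, and the authors' exact choice is not restated here.

```python
import torch
import torch.nn.functional as F


def discriminator_loss(disc, real_b, fake_b):
    real_logits = disc(real_b)
    fake_logits = disc(fake_b.detach())   # detach so no gradient flows into the generator here
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))


def generator_adv_loss(disc, fake_b):
    # the generator is rewarded when the discriminator labels its output as real
    fake_logits = disc(fake_b)
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```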
3. Experiments
Liver segmentation was chosen as the task for all experiments. Since many annotated CT liver segmentation images were available, the authors chose CT as the source domain and CBCT, MRI, and PET as target domains. All images were resampled to an isotropic spatial resolution of 1 mm³ and resized to 256×256 2D slices. In total, 13,241 CT images were used for the source domain and 3,792 CBCT, 1,128 MR, and 6,150 PET images separately as the target domain. For the evaluation, four-fold cross-validation was performed, and the Dice similarity coefficient (DSC) and Average Symmetric Surface Distance (ASD) were used as quantitative performance metrics. Since the focus was not on the segmentation network itself, the authors used the same segmentation network [11] for all compared methods to ensure comparability.

The proposed framework surpassed previous work on synthetic segmentation for CBCT and MRI, as seen in Figure 4 and Table 1, and showed a statistically significant improvement for CBCT. It also performed slightly better than a supervised network trained on limited data. The main reason for the weak performance of TD-GAN [12] is that it translates the CBCT image to CT (the opposite direction of the proposed network) and therefore has to remove all metal artifacts, which is challenging. Moreover, TD-GAN and SynSeg [13] do not use structural constraints during the domain adaptation, leading to non-ideal segmentations. By studying different anatomy-constraint settings, the authors further showed that using both anatomy-constraining losses yields the best segmentation performance. For PET, the experiments showed reasonable segmentation results, although the authors did not report concrete numbers.
Figure 4. Segmentation performance comparison with previous and supervised work, where the liver segmentation is marked red and the ground truth green. One can especially see that TD-GAN struggles to remove the metal artifacts during CBCT-to-CT adaptation. (adapted from [1])
Table 1. Quantitative comparison of CBCT and MRI segmentation results. SegModality* represents the segmenter trained in a supervised fashion. "-PCT" denotes the proposed architecture without the patch-contrastive loss; as one can see, using the patch-contrastive loss clearly improves the results. (adapted from [1])
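For readers who want to reproduce the evaluation, the two metrics can be computed from binary masks as sketched below with NumPy/SciPy; the voxel-spacing handling and empty-mask edge cases are simplified assumptions.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt


def dsc(pred, gt):
    """Dice similarity coefficient between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum())


def asd(pred, gt, spacing=1.0):
    """Average Symmetric Surface Distance between the boundaries of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    surf_pred = pred ^ binary_erosion(pred)        # boundary voxels of the prediction
    surf_gt = gt ^ binary_erosion(gt)              # boundary voxels of the ground truth
    dist_to_gt = distance_transform_edt(~surf_gt, sampling=spacing)
    dist_to_pred = distance_transform_edt(~surf_pred, sampling=spacing)
    d1 = dist_to_gt[surf_pred]                     # pred-surface -> gt-surface distances
    d2 = dist_to_pred[surf_gt]                     # gt-surface -> pred-surface distances
    return (d1.sum() + d2.sum()) / (len(d1) + len(d2))
```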
4. Conclusion
The proposed synthetic segmentation framework AccSeg-Net demonstrated superior segmentation performance over previous work while significantly reducing network complexity compared to SynSeg. In addition, the authors showed that the framework can be successfully applied to CBCT, MRI, and PET, and that the anatomy constraint and patch-contrastive learning improve segmentation performance by preserving the correct anatomy during the unpaired image adaptation from the source imaging modality.
5. Student's Review
The paper is well written, and the motivation and impact of synthetic segmentation as well as the concrete approach are well described. The authors showed convincingly that the concepts of anatomy constraint and contrastive learning are beneficial for synthetic segmentation. They did not try to boost their experimental results with detailed optimizations but focused on the core concept they wanted to demonstrate, so the performance gain over previous work can be attributed to the applied anatomy constraint and contrastive learning.
I like that the approach is independent of the segmentation network used and that the segmenter can be extracted after training for computational efficiency during inference. The authors also used the same segmentation framework for all compared networks, which is fair and scientifically sound. However, it would have been nice to see experiments on a different source modality besides CT.
To improve the proposed work, one could further increase the anatomical correctness of the image adaptation by extending the network to 3D to capture more anatomy-related spatial information. Also, the MIND part of the anatomy-constraint loss could be replaced by the more advanced Gradient MIND [14] with improved edge registration that could be beneficial for organ recognition. Additionally, as suggested by the authors, more advanced segmentation networks and segmentation losses could be applied to improve the segmentation performance.
6. Implementation Guidance
The authors have published their code at https://github.com/bbbbbbzhou/AccSeg-Net.
Generator: An encoder-decoder network with nine residual bottlenecks
Discriminator: 3-layer CNN
Segmenter: DuSEUNet, a 5-level U-Net with concurrent SE (squeeze & excitation) module [11].
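As a rough orientation for how the pieces fit together, the sketch below combines the loss terms from Section 2 into one training step, reusing the hypothetical helper functions from the earlier sketches (anatomy_constraint_loss, identity_loss, dice_loss, discriminator_loss, generator_adv_loss). The unit loss weights, the omission of the patch-contrastive term (which needs access to G's encoder features), and the optimiser handling are simplifications; the authors' repository remains the authoritative reference.

```python
def training_step(G, D, Seg, I_a, seg_a, I_b, opt_G, opt_D):
    """I_a: source image, seg_a: its one-hot ground-truth mask, I_b: unpaired target image.
    opt_G is assumed to optimise the parameters of both G and Seg; Seg outputs probabilities."""
    I_b_hat = G(I_a)  # translate the source image into the target domain

    # --- discriminator update (generator gradients blocked via detach inside the loss) ---
    opt_D.zero_grad()
    d_loss = discriminator_loss(D, I_b, I_b_hat)
    d_loss.backward()
    opt_D.step()

    # --- generator + segmenter update ---
    opt_G.zero_grad()
    g_loss = (generator_adv_loss(D, I_b_hat)
              + anatomy_constraint_loss(I_a, I_b_hat)
              + identity_loss(G, I_b)
              + dice_loss(Seg(I_b_hat), seg_a))
    # the patch-contrastive loss would be added here, computed from G's encoder features
    g_loss.backward()   # any gradients reaching D's parameters are cleared next iteration
    opt_G.step()
    return d_loss.item(), g_loss.item()
```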
7. References
[1] Zhou, Bo & Liu, Chi & Duncan, James. (2021). Anatomy-Constrained Contrastive Learning for Synthetic Segmentation Without Ground-Truth. doi:10.1007/978-3-030-87193-2_5.
[2] Heinrich, Mattias P. & Jenkinson, Mark & Bhushan, Manav & Matin, Tahreema & Gleeson, Fergus V. & Brady, Sir Michael & Schnabel, Julia A. (2012). MIND: Modality independent neighbourhood descriptor for multi-modal deformable registration. Medical Image Analysis, 16(7), 1423-1435. ISSN 1361-8415.
[3] Park, Taesung & Efros, Alexei & Zhang, Richard & Zhu, Jun-Yan. (2020). Contrastive Learning for Unpaired Image-to-Image Translation. doi:10.1007/978-3-030-58545-7_19.
[4] Taigman, Yaniv & Polyak, Adam & Wolf, Lior. (2016). Unsupervised Cross-Domain Image Generation.
[5] Zhu, Jun-Yan & Park, Taesung & Isola, Phillip & Efros, Alexei. (2017). Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. ICCV 2017, 2242-2251. doi:10.1109/ICCV.2017.244.
[6] Jadon, Shruti. (2020). A Survey of Loss Functions for Semantic Segmentation.
[7] Alom, M.Z. & Yakopcic, C. & Hasan, M. & Taha, T.M. & Asari, V.K. (2019). Recurrent residual U-Net for medical image segmentation. Journal of Medical Imaging, 6(1), 014006.
[8] Oktay, O. & Schlemper, J. & Folgoc, L.L. & Lee, M. & Heinrich, M. & Misawa, K. & Mori, K. & McDonagh, S. & Hammerla, N.Y. & Kainz, B. et al. (2018). Attention U-Net: Learning Where to Look for the Pancreas. arXiv preprint arXiv:1804.03999.
[9] Ronneberger, O. & Fischer, P. & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 234-241.
[10] Goodfellow, Ian & Pouget-Abadie, Jean & Mirza, Mehdi & Xu, Bing & Warde-Farley, David & Ozair, Sherjil & Courville, Aaron & Bengio, Yoshua. (2014). Generative Adversarial Networks. Advances in Neural Information Processing Systems. doi:10.1145/3422622.
[11] Guha Roy, Abhijit & Navab, Nassir & Wachinger, Christian. (2018). Concurrent Spatial and Channel Squeeze & Excitation in Fully Convolutional Networks.
[12] Zhang, Yue & Miao, Shun & Mansi, Tommaso & Liao, Rui. (2018). Task Driven Generative Modeling for Unsupervised Domain Adaptation: Application to X-ray Image Segmentation.
[13] Huo, Yuankai & Xu, Zhoubing & Moon, Hyeonsoo & Bao, Shunxing & Assad, Albert & Moyo, Tamara & Savona, Michael & Abramson, Richard & Landman, Bennett. (2018). SynSeg-Net: Synthetic Segmentation Without Target Modality Ground Truth. IEEE Transactions on Medical Imaging. doi:10.1109/TMI.2018.2876633.
[14] Rott, Tamar & Shriki, Dorin & Bendory, Tamir. (2014). Edge Preserving Multi-Modal Registration Based On Gradient Intensity Self-Similarity. 2014 IEEE 28th Convention of Electrical and Electronics Engineers in Israel (IEEEI 2014). doi:10.1109/EEEI.2014.7005886.