This is the blog post for the paper 'Unsupervised domain adaptation for medical imaging segmentation with self-ensembling.'


Introduction

Data Distribution Shift Problem

In the figure above, the left panel shows spinal cord MRI images from four different centers (UCL, Montreal, Zurich, Vanderbilt) of the Spinal Cord Grey Matter Segmentation Challenge [1]; the right panel shows their normalized intensity distributions. Together they illustrate a common problem with data from different sources: a shift in the data distribution, caused in this case by the varying hardware and software settings of the MRI machines. This shift leads to a significant performance drop when a deep learning model trained on images from one domain is deployed on images from a different domain. Domain adaptation addresses exactly this problem.

Unsupervised Domain Adaptation

Domain adaptation means training a deep learning model on data whose distribution is shifted between domains. This paper uses unsupervised domain adaptation, in which there are two datasets: one from the source domain, where labeled examples are provided, and one from the target domain, where only unlabeled data is available. The goal is to leverage knowledge from the target domain through the unlabeled data and thereby increase the model's performance on new data from the target domain.

Main Contributions

The main contributions presented in this paper are the following. First, the authors extend unsupervised domain adaptation to the task of semantic segmentation. Second, they evaluate the performance of their method on a realistically small MRI dataset, namely the above-mentioned Spinal Cord Grey Matter Segmentation Challenge dataset [1]. Third, they perform an ablation experiment to show that it is indeed the unlabeled data that is responsible for the performance improvement and not some other part of their method. And fourth, they visually analyze how domain adaptation affects the prediction space.

Related Work

The authors' work builds upon several works from other authors. One crucial part is the U-Net architecture for image segmentation [2], as this work also deals with the task of segmentation. However, the authors emphasize that their proposed method is independent of the chosen architecture. The other essential contributions come from the field of deep domain adaptation, which the authors separate into four different areas. First, Generative Adversarial Networks (GANs) can be used to build domain-invariant feature spaces [3]. Second, other methods optimize higher-order statistics to change the parameters of neural network layers [4]. Third, some methods explicitly minimize the discrepancy between source and target domains [5]. And fourth, there are so-called self-ensembling methods based on the Mean Teacher network [6]. It is the last of these that the authors build upon by adapting the method to the task of image segmentation.

In the following part, the original Mean Teacher method is explained first, before the proposed changes are covered.

Methodology

Mean Teacher

The original Mean Teacher framework for domain adaptation [6] works as follows. There are two architectures, the student and the teacher model, where one is initialized as a copy of the other. At each training step, the same minibatch is used as input to both models, but noise is added to each input separately so that the models see slightly different versions of the data. Besides the normal classification cost that is calculated between the outputs of the student model and the ground truth, an additional consistency cost between the student and teacher outputs is added to encourage both models to produce consistent outputs. Whereas the optimizer updates the student weights normally, the teacher weights are calculated as the exponential moving average (EMA) of the student weights. In that way, a slightly improved model compared to the model without the EMA is generated at each step. The best results are achieved when the student model is close to convergence, as the teacher then benefits from having a larger memory of the student's past weights. The training step with unlabeled examples is similar, except that no classification cost is applied.
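The two core ingredients of this framework, the EMA weight update and the consistency cost, can be written in a few lines. The following is a minimal NumPy sketch for illustration, not the authors' implementation; the parameter names and the use of mean squared error as the consistency cost are assumptions for this toy example.

```python
import numpy as np

def ema_update(teacher_params, student_params, alpha=0.99):
    """Teacher weights as an exponential moving average of the student
    weights: theta_teacher <- alpha * theta_teacher + (1 - alpha) * theta_student."""
    return {name: alpha * teacher_params[name] + (1 - alpha) * student_params[name]
            for name in teacher_params}

def consistency_cost(student_out, teacher_out):
    """Mean squared error between student and teacher predictions."""
    return np.mean((student_out - teacher_out) ** 2)

# toy example: one weight vector, teacher starts as a copy of the student
student = {"w": np.array([1.0, 2.0])}
teacher = {"w": student["w"].copy()}

# after an optimizer step the student weights have moved
student["w"] = np.array([2.0, 4.0])
teacher = ema_update(teacher, student, alpha=0.99)
# teacher["w"] is now 0.99 * [1, 2] + 0.01 * [2, 4] = [1.01, 2.02]
```

Because `alpha` is close to 1, the teacher changes slowly and averages over many past student states, which is exactly the "larger memory" effect described above.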

Mean Teacher for Segmentation

The authors' main contribution is adapting the above-described method to the task of semantic segmentation, which requires the following changes. First, as the desired output of the network is a segmentation mask, the classification cost is replaced with the Dice loss. Second, instead of adding random noise to the inputs, random spatial transformations are applied before feeding the data into the model. Importantly, these transformations are applied to the inputs of the student model but to the outputs of the teacher model. Otherwise, the outputs of the student and teacher could not be compared, since different spatial transformations would have been applied to their inputs.
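The asymmetry in where the transformation is applied can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions (a horizontal flip as the spatial transformation, identity functions as stand-in models, and mean squared error as the consistency cost), not the authors' implementation.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2*|P intersect T| / (|P| + |T|)."""
    intersection = np.sum(pred * target)
    return 1.0 - (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)

def consistency_for_segmentation(x, g, student, teacher):
    """Transform the *input* of the student but the *output* of the teacher,
    so both predictions live in the same (transformed) coordinate frame."""
    student_pred = student(g(x))   # student sees the transformed image
    teacher_pred = g(teacher(x))   # teacher's prediction is transformed afterwards
    return np.mean((student_pred - teacher_pred) ** 2)

# toy example: the "transformation" is a horizontal flip and both models
# are identity functions, so the two predictions align and the cost is zero
g = lambda img: img[:, ::-1]
identity = lambda img: img
x = np.random.rand(4, 4)
cost = consistency_for_segmentation(x, g, identity, identity)  # -> 0.0
```

If `g` were instead applied to both inputs independently (with different random draws), the two segmentation maps would be spatially misaligned and the pixel-wise comparison would be meaningless.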

Experimental Setup

Dataset

For this work, the authors use the Spinal Cord Gray Matter Challenge dataset [1], which is a multi-center, multi-vendor, and publicly available MRI data collection of 80 healthy subjects, with 20 subjects from each center. Three different MRI systems (Philips Achieva, Siemens Trio, Siemens Skyra) produced the images, making the dataset ideally suited for the problem that the authors want to tackle: as described in the introduction, there is a distribution shift in the data from different centers.

Baseline & Training Setup

For their baseline, the authors create a model based on the U-Net architecture [2] that uses Group Normalization instead of Batch Normalization, trained in a standard supervised fashion with no additional unlabeled data. To make a fair comparison between their proposed method and the baseline model possible, they use the same hyperparameters for both models.
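Group Normalization matters here because its statistics are computed per sample rather than per batch, so it behaves well with the small batch sizes typical for medical segmentation. A minimal NumPy sketch of the operation (without the learnable scale and shift parameters, for brevity) could look like this:

```python
import numpy as np

def group_norm(x, num_groups=8, eps=1e-5):
    """Group Normalization for a feature map of shape (N, C, H, W):
    mean and variance are computed per sample over groups of channels,
    so the result does not depend on the batch size (unlike BatchNorm)."""
    n, c, h, w = x.shape
    x = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = x.mean(axis=(2, 3, 4), keepdims=True)
    var = x.var(axis=(2, 3, 4), keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)
    return x.reshape(n, c, h, w)

feat = np.random.randn(2, 16, 8, 8)
out = group_norm(feat, num_groups=4)
# each group of 4 channels now has approximately zero mean and unit variance
```

In a framework like PyTorch this corresponds to the built-in `GroupNorm` layer; the sketch only shows the normalization itself.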

Their proposed models are trained with data from centers 1 and 2 in a supervised fashion, like the baseline model, and are then adapted to unlabeled data from centers 3 and 4 separately with the Mean Teacher method described above.

The evaluation of their method is covered next in the results section.

Results & Discussion

Domain Adaptation

The authors show the success of their method by answering three different questions regarding the domain adaptation capabilities of their proposed method.

How does additional unsupervised data from domains different than the source domain influence generalization?

In this case, we want to assess how additional unlabeled data from domains other than the training domains (centers 3 and 4) improves generalization on the training domains (centers 1 and 2). The above figure shows the Dice score and mean Intersection over Union (mIoU) for the evaluation of the model on centers 1 and 2, respectively. It is clearly visible that adapting the model to data from either center 3 or center 4 improves both metrics, which means the unlabeled data is indeed leveraged.
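For reference, the two metrics reported throughout the results can be computed for binary masks as follows. This is a minimal NumPy sketch for a single two-class mask; the mIoU reported in the paper is averaged over classes and images.

```python
import numpy as np

def dice_score(pred, target, eps=1e-6):
    """Dice coefficient between two binary masks: 2|P intersect T| / (|P| + |T|)."""
    intersection = np.sum(pred * target)
    return (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)

def iou(pred, target, eps=1e-6):
    """Intersection over Union between two binary masks."""
    intersection = np.sum(pred * target)
    union = np.sum(np.clip(pred + target, 0, 1))
    return (intersection + eps) / (union + eps)

pred = np.array([[1, 1, 0, 0]])
target = np.array([[1, 0, 0, 0]])
# intersection = 1, |pred| + |target| = 3, union = 2
# so dice = 2/3 and iou = 1/2
```

Note that Dice and IoU are monotonically related (Dice = 2*IoU / (1 + IoU)), which is why the two plots in the figure tell a consistent story.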

How does the network change its predictions to the new domain after performing domain adaptation?

In this case, the model's performance is assessed when it is evaluated on the same domain that it is adapted to. So the model adapted to center 3 is evaluated on new data from center 3, and likewise for the model adapted to center 4. In the above figure, we can observe that the metrics reach their highest values when the evaluation and adaptation centers coincide, showing that domain adaptation is working properly.

How well does an adapted network generalize when presented with images that were not used during training?

In the last case, we look at the model's metrics when it is evaluated on an entirely new domain. So the model adapted to center 3 is evaluated on center 4, and the other way round. Here, a gain over the baseline can be observed in both cases, meaning that domain adaptation improves generalization to unseen centers.

Ablation Study

In the last part of the discussion, the authors show that it is not the introduction of the exponential moving average (EMA) that is responsible for the above-described improvements, but the use of unlabeled data. In an ablation experiment, they train a model that uses the EMA but no unlabeled data. The results are depicted in the following figure.

The evaluation results of the EMA-only model are very similar to those of the baseline model, which means that it is the Mean Teacher's use of unlabeled data that introduces the observed improvements.

Conclusion

In this work, we have seen that unsupervised domain adaptation is an effective way to increase the performance of deep learning models across multiple centers. The Mean Teacher method improves generalization on unseen domains by leveraging unlabeled data from various centers, and the improvements stem from the introduction of unlabeled data and not merely from the EMA.

Limitations & Future Work

The authors mention several limitations of their work. The most important ones are that they do not compare their domain adaptation method to adversarial training methods, that they evaluate their method only on a single task, and that only a limited number of centers is present in this evaluation.

Future work includes testing their methods on more datasets beyond the medical domain. Also, the importance of proper multi-domain evaluation in studies and medical imaging challenges needs to be reassessed, as a test set from a different domain is only rarely provided.

Own View

Overall, I really enjoyed reading the paper, as it is understandable and well written. In my opinion, the authors did a great job of ensuring a fair comparison between the baseline method and their proposed methods. They also repeated their experiments over ten runs and open-sourced their code to ensure reproducibility.

What could be improved from my perspective is the explanation of the original Mean Teacher method, which could be more detailed, as it is the basis for their work. Besides that, a more in-depth analysis of their method, for example on different datasets, would help to further understand the impact of their approach.

References

[1] Prados F, Ashburner J, Blaiotta C, Brosch T, Carballido-Gamio J, Cardoso MJ. Spinal cord grey matter segmentation challenge, NeuroImage, 2017

[2] Ronneberger O, Fischer P, Brox T. U-net: Convolutional Networks for Biomedical Image Segmentation, arXiv preprint, 2015

[3] Hoffman J, Tzeng E, Park T, Zhu JY, Isola P, Saenko K. CyCADA: Cycle-Consistent Adversarial Domain Adaptation, arXiv preprint, 2017

[4] Li Y, Wang N, Shi J, Liu J, Hou X. Revisiting Batch Normalization for Practical Domain Adaptation, arXiv preprint, 2016

[5] Tzeng E, Hoffman J, Zhang N, Saenko K, Darrell T. Deep Domain Confusion: Maximizing for Domain Invariance, arXiv preprint, 2014

[6] Tarvainen A, Valpola H. Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results, Advances in Neural Information Processing Systems, 2017





