This blog post summarizes and reviews the paper "Semi-supervised Medical Image Segmentation through Dual-task Consistency" by Xiangde Luo, Jieneng Chen, Tao Song, and Guotai Wang [1].

1. Introduction

1.1 Motivation


Automatically and accurately labeling organs or lesions in medical images can greatly reduce the doctors' workload during clinical diagnosis and treatment [2]. However, annotating medical images can be time-consuming and labor-intensive. Recently, many semi-supervised medical image segmentation methods have achieved impressive results.

This paper proposes an innovative semi-supervised dual-task consistency model for medical image segmentation, which jointly predicts a global-level level set representation and a pixel-wise segmentation. According to the experimental results on two commonly used medical image datasets, the proposed method shows a remarkable improvement in segmentation accuracy compared with other semi-supervised segmentation methods.

1.2 Related Work


Broadly speaking, consistency-based semi-supervised learning strategies can be divided into two types: data-level consistency and model-level consistency. [3] enforces the unsupervised prediction consistency between the original data and perturbed data. In [5], the authors introduce temporal ensembling and use the consistent predictions on unlabeled data as targets/ground truth, i.e., the outputs of the model under different regularization and input data augmentation conditions. Later, [13] proposed the Mean Teacher model and built a consistency loss that measures the distance between the predictions of the student and teacher models so that they alternately improve each other. In [4], based on the Mean Teacher model, the authors additionally estimate the uncertainty of the teacher's predictions and compute the consistency loss between the student and teacher models.

2. Methodology

There are two important observations that should be noted. First, when we tackle the segmentation task at different levels, the information from separate branches can complement each other; meanwhile, there are inherent perturbations between the results of different tasks. Second, when the network takes unlabeled data as input and minimizes a consistency loss between the different tasks, it can learn from the unlabeled data and boost the performance of supervised learning.

The proposed semi-supervised learning framework contains three main parts: a dual-task network for segmentation and regression, a task-transform layer, and a dual-task consistency loss, as shown in Figure 1.


Figure 1: The structure of the proposed dual-task consistency framework [1].

2.1 Dual-task Consistency


Inspired by [6], the authors build a task-level consistency, which is one important contribution of this paper. They split the segmentation task into two branches: a pixel-wise segmentation branch and a global-level level set regression branch. By enforcing that the predictions of these two branches are consistent for the same segmentation mask, a dual-task consistency loss can be built to learn from unlabeled data. The network takes 3D grayscale medical images as input, uses VNet [7] as the backbone, and predicts both tasks simultaneously.

In order to map the outputs of the two different tasks into a common space, the authors build a transform layer between the two tasks, which is the second novel contribution of this paper.

The level set function is defined as:

\mathcal{T}(x) = \begin{cases} -\underset{y\in \partial S}{\inf} ||x - y||_{2}, & x\in\mathcal{S}_{in} \\ 0, & x \in\partial\mathcal{S} \\ +\underset{y\in \partial S}{\inf} ||x - y||_{2}, & x\in\mathcal{S}_{out} \end{cases}           (1)

where x and y are two different pixels/voxels in a segmentation mask, \partial S is the zero level set, which represents the contour of the target object, and \mathcal{S}_{in} and \mathcal{S}_{out} denote the inside and outside regions of the target object [1].
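To make Eq. (1) concrete, here is a minimal sketch of the level set transform for a 2D binary mask using a brute-force nearest-contour search. The paper works on 3D volumes and would in practice use a fast distance transform; the function name, the 2D setting, and the discrete contour definition (foreground pixels touching a background 4-neighbour) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def level_set_transform(mask):
    """Signed distance map T(x) of a 2D binary mask (Eq. 1), brute force.

    Negative inside the object, zero on the contour, positive outside.
    The contour (zero level set) is approximated by foreground pixels
    that touch at least one background 4-neighbour.
    """
    h, w = mask.shape
    padded = np.pad(mask, 1, constant_values=0)
    touches_bg = ((padded[:-2, 1:-1] == 0) | (padded[2:, 1:-1] == 0) |
                  (padded[1:-1, :-2] == 0) | (padded[1:-1, 2:] == 0))
    contour = (mask == 1) & touches_bg
    cy, cx = np.nonzero(contour)
    if cy.size == 0:                       # empty mask: no contour, all zeros
        return np.zeros((h, w))
    yy, xx = np.mgrid[0:h, 0:w]
    # distance from every pixel to its nearest contour pixel
    dist = np.sqrt((yy[..., None] - cy) ** 2 +
                   (xx[..., None] - cx) ** 2).min(axis=-1)
    sign = np.where(mask == 1, -1.0, 1.0)  # S_in negative, S_out positive
    out = sign * dist
    out[contour] = 0.0
    return out
```

Applied to the ground-truth mask, this yields the regression target for the level set branch.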

It is worth pointing out that \mathcal{T}(x) is employed to convert the pixel-wise segmentation ground truth into the level set function ground truth, i.e., from task 1 to task 2.

Furthermore, we can use the inverse of \mathcal{T}(x) to transform the output of the level set function into a pixel-wise map. However, because this inverse is non-differentiable, a smooth approximation is needed to implement the conversion, defined as:

\mathcal{T}^{-1} (z) = \frac{1}{1+e^{-k\cdot z}}             (2)

where z denotes the level set value at pixel/voxel x [1]. Since \mathcal{T}^{-1} has the same simple form as the sigmoid function, \mathcal{T}^{-1} is applied as an activation function in the last layer of the regression branch, mapping its output into the same predefined space as that of the segmentation branch.

With the aforementioned functions, we can build our dual-task consistency loss \mathcal{L}_{DTC} for labeled and unlabeled data: 

\mathcal{L}_{DTC}(x)=\underset{x_{i} \in \mathcal{D}}{\sum} ||f_{1}(x_{i}) - \mathcal{T}^{-1} (f_{2}(x_{i})) ||^{2} = \underset{x_{i} \in \mathcal{D}}{\sum} ||f_{1}(x_{i}) - \sigma (k \cdot f_{2}(x_{i})) ||^{2}           (4)

where f_{1}(x_{i}) represents the prediction of task 1 and \mathcal{T}^{-1}(f_{2}(x_{i})) denotes the transformed prediction of task 2.
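A minimal numpy sketch of the dual-task consistency loss of Eq. (4) could look as follows. Note one hedge on signs: with the convention of Eq. (1) the inside of the object has negative level set values, so for the transformed map to assign probabilities near 1 to the foreground, k must be negative (equivalently, z is negated); the value k = -1500 here is an assumption for illustration, and the mean is used instead of the paper's sum for scale invariance.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dtc_loss(seg_prob, lsf_pred, k=-1500.0):
    """Dual-task consistency loss (Eq. 4), a numpy sketch.

    seg_prob : foreground probabilities from the segmentation branch f1.
    lsf_pred : level set values predicted by the regression branch f2.
    k        : steepness of the smooth inverse transform T^{-1}; negative
               so that negative (inside) values map to probabilities near 1.
               k = -1500 is an assumed value, not taken from the paper.
    """
    transformed = sigmoid(k * lsf_pred)   # T^{-1}(f2(x)), Eq. (2)
    return np.mean((seg_prob - transformed) ** 2)
```

Because this loss needs no ground truth, it can be evaluated on both labeled and unlabeled batches.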

2.2 Semi-supervised training through Dual-Task-Consistency


The dataset \mathcal{D} is composed of labeled data \mathcal{D}_{l} and unlabeled data \mathcal{D}_{u}. A labeled data pair is denoted as (\mathbf{X}, \mathbf{Y}) \in \mathcal{D}_{l}, where \mathbf{Y} denotes the ground-truth segmentation mask, and an unlabeled image is denoted as \mathbf{X} \in \mathcal{D}_{u}. In the purely supervised learning process on the labeled data \mathcal{D}_{l}, the supervised loss \mathcal{L}_{sup} consists of two parts, namely the loss of the pixel-wise segmentation task \mathcal{L}_{Seg} and the loss of the level set function regression task \mathcal{L}_{LSF}:

\mathcal{L}_{Seg}(\mathbf{x}, \mathbf{y}) = \underset{\mathbf{x}_{i}, \mathbf{y}_{i} \in \mathcal{D}_{l}}{\sum}\mathcal{L}_{Dice} (\mathbf{x}_{i}, \mathbf{y}_{i}) = \underset{\mathbf{x}_{i}, \mathbf{y}_{i} \in \mathcal{D}_{l}}{\sum} \left(1 - \frac{2 \sum_{x_{j}\in \mathbf{x}_{i}, y_{j}\in\mathbf{y}_{i}} f_{1}(x_{j})\, y_{j}} {\sum_{x_{j}\in \mathbf{x}_{i}} f_{1}(x_{j}) + \sum_{y_{j} \in \mathbf{y}_{i}} y_{j}}\right)           (5)

\mathcal{L}_{LSF}(\mathbf{x,y})=\underset{\mathbf{x}_{i},\mathbf{y}_{i} \in \mathcal{D}_{l}}{\sum} ||f_{2}(\mathbf{x}_{i}) - \mathcal{T} (\mathbf{y}_{i}) ||^{2}            (6)

where (x, y) is a voxel-level pair, the summation \sum_{x_{j}\in \mathbf{x}_{i}, y_{j}\in\mathbf{y}_{i}} represents a voxel-wise summation over a 3D image, and \sum_{\mathbf{x}_{i}, \mathbf{y}_{i} \in \mathcal{D}_{l}} is an image-level summation over the labeled dataset [1].
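The two supervised terms can be sketched per image as below. The small eps in the Dice denominator is an added numerical safeguard, not part of the paper's formula, and the LSF loss uses a mean rather than the paper's sum; both are common implementation choices.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-8):
    """Soft Dice loss for one volume (the inner term of Eq. 5).

    pred   : predicted foreground probabilities f1(x) in [0, 1].
    target : binary ground-truth mask y.
    """
    inter = np.sum(pred * target)
    return 1.0 - 2.0 * inter / (np.sum(pred) + np.sum(target) + eps)

def lsf_loss(pred_lsf, gt_lsf):
    """Level set regression loss (Eq. 6): squared error between the
    regression branch output f2(x) and the level set transform T(y)
    of the ground-truth mask, averaged over voxels."""
    return np.mean((pred_lsf - gt_lsf) ** 2)
```

Summing these two losses over the labeled batch gives \mathcal{L}_{sup}.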

Therefore, the semi-supervised training loss is defined as:

\mathcal{L}_{total} = \mathcal{L}_{Seg} + \mathcal{L}_{LSF} + \mathcal{\lambda}_{d} \mathcal{L}_{DTC}           (7)


where only labeled data contributes to \mathcal{L}_{Seg} and \mathcal{L}_{LSF}, while \mathcal{L}_{DTC} is computed on both labeled and unlabeled data. \lambda_{d} is a function of the current training step t, \lambda_{d}(t) = e^{-5(1-\frac{t}{t_{max}})^{2}}, where t_{max} is the maximum training step.

During training, the dual-task model is updated by back-propagating the gradient of \mathcal{L}_{total} w.r.t. \theta (shared parameters), \theta_{1} (segmentation head parameters) and \theta_{2} (LSF head parameters).

3. Experiments and Results

3.1 Datasets and Pre-processing


The proposed algorithm is evaluated on two widely used datasets: the left atrial dataset [8], which contains 100 3D gadolinium-enhanced MR images, and the pancreas dataset [9], consisting of 82 abdominal CT images. The images are cropped around the ground truth with enlarged margins [1]. The training set is composed of 80% unlabeled images and 20% labeled images.

3.2 The Effects of Different Tasks in a Fully Supervised Way


To analyse the effects of the different tasks, they trained the network under four strategies, using 12 labeled images and 62 labeled images, respectively. The performance is evaluated by Dice, Jaccard, the average surface distance (ASD) and the 95% Hausdorff Distance (95HD). Results are shown in Table 1. Compared to training only task 1 (Seg), simultaneously training task 1 (Seg) and task 2 (LSF) improves the segmentation accuracy. Moreover, the proposed dual-task consistency framework (Seg + LSF + DTC) outperforms the other strategies on both 12 and 62 labeled scans, as illustrated in Figure 2.

Table 1: Evaluation results of four training strategies on 12 and 62 labeled images.

Figure 2: 3D Visualization of different training strategies under 12 labeled pancreas images. GT: ground truth.

3.3 Comparison with Other Methods in a Semi-supervised Way


The authors also compared the proposed framework with six state-of-the-art semi-supervised segmentation methods, either by reimplementing them or by using the official code: the cross-consistency training method (CCT) [10], the deep adversarial network (DAN) [11], the entropy minimization approach (Entropy-mini) [12], the mean teacher self-ensembling model (MT) [13], the uncertainty-aware mean teacher model (UA-MT) [14] and the shape-aware adversarial network (SASSNet) [15]. Tables 2 and 3 show the quantitative comparison between the proposed method and the other methods. Figure 3 shows some 3D visualized segmentation results on the pancreas and left atrium. According to the comparison, the proposed method achieves better performance on all evaluation metrics while requiring less training time and computational cost.

Table 2: The comparison on the pancreas CT dataset.

Table 3: The comparison on the left atrium MRI dataset.

Figure 3: 3D visualization of different semi-supervised method. The first and second row are pancreas and left atrium segmentation result, respectively.

3.4 The Data Utilization Efficiency


Figure 4 reveals the data utilization efficiency. With the same proportion of labeled training images, the proposed semi-supervised dual-task VNet performs better than the fully supervised VNet (green line) and VNet (red line), which shows that semi-supervised learning can indeed boost fully supervised learning. Moreover, the proposed semi-supervised structure consistently achieves the best Dice score across different labeled-data ratios, illustrating that the dual-task framework outperforms the other two frameworks. In addition, the difference in Dice score between the approaches decreases as the proportion of labeled images increases.

Figure 4: The segmentation performance of the semi-supervised approach with different ratios of labeled CT scans.

4. Conclusion

The results presented in the paper show that:

  • In the fully supervised setting, the proposed dual-task consistency method outperforms separate and joint supervised strategies.
  • The method achieves state-of-the-art performance and shows superiority over other semi-supervised methods.
  • Cross-task consistency can boost semi-supervised learning.

Student's Review

This paper proposes a novel semi-supervised segmentation method that exploits the consistency between different tasks through dual-task consistency regularization. It provides a promising way to leverage unlabeled data and shows that cross-task consistency can improve fully supervised learning performance. It also has the potential to be applied to other areas and combined with other tasks, such as edge extraction and key-point estimation. The innovative idea presented in this paper has also inspired subsequent work [16].

However, there are some parts of the paper that I find ambiguous. 

  • In Figures 2 and 3, comparing the visualized segmentation results of the proposed method to the ground truth, there are some obvious false positive regions, which can be misleading in the segmentation task.
  • According to formula (1), when \mathcal{T}(x)=0, x belongs to the zero level set, i.e., lies on the contour of the target object. When we use formula (2) to obtain the binary prediction for such contour points, i.e. z = \mathcal{T}(x) = 0, we get \mathcal{T}^{-1}(0) = \frac{1}{1+e^{-k\cdot 0}} = \frac{1}{2}, which is not binary no matter how we choose the value of k. The authors do not explain how the transformation handles contour points, which I find confusing.
  • When comparing the proposed method to other methods, the standard deviation of the accuracies, in addition to the aforementioned evaluation metrics, could provide auxiliary information for ranking different image segmentation algorithms. Reporting it would make the claimed superiority of the method more convincing.

References

[1] Luo, Xiangde, Jieneng Chen, Tao Song, and Guotai Wang. "Semi-supervised medical image segmentation through dual-task consistency." arXiv preprint arXiv:2009.04448 (2020).

[2] Masood, Saleha, Muhammad Sharif, Afifa Masood, Mussarat Yasmin, and Mudassar Raza. "A survey on medical image segmentation." Current Medical Imaging 11, no. 1 (2015): 3-14.

[3] Li, Xiaomeng, Lequan Yu, Hao Chen, Chi-Wing Fu, Lei Xing, and Pheng-Ann Heng. "Transformation-consistent self-ensembling model for semisupervised medical image segmentation." IEEE Transactions on Neural Networks and Learning Systems 32, no. 2 (2020): 523-534.

[4] Yu, Lequan, Shujun Wang, Xiaomeng Li, Chi-Wing Fu, and Pheng-Ann Heng. "Uncertainty-aware self-ensembling model for semi-supervised 3D left atrium segmentation." In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 605-613. Springer, Cham, 2019.

[5] Laine, Samuli, and Timo Aila. "Temporal ensembling for semi-supervised learning." arXiv preprint arXiv:1610.02242 (2016).

[6] Zamir, Amir R., Alexander Sax, Nikhil Cheerla, Rohan Suri, Zhangjie Cao, Jitendra Malik, and Leonidas J. Guibas. "Robust learning through cross-task consistency." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11197-11206. 2020.

[7] Milletari, Fausto, Nassir Navab, and Seyed-Ahmad Ahmadi. "V-net: Fully convolutional neural networks for volumetric medical image segmentation." In 2016 fourth international conference on 3D vision (3DV), pp. 565-571. IEEE, 2016.

[8] Xiong, Zhaohan, Qing Xia, Zhiqiang Hu, Ning Huang, Cheng Bian, Yefeng Zheng, Sulaiman Vesal et al. "A Global Benchmark of Algorithms for Segmenting Late Gadolinium-Enhanced Cardiac Magnetic Resonance Imaging." arXiv preprint arXiv:2004.12314 (2020).

[9] Roth, Holger R., Le Lu, Amal Farag, Hoo-Chang Shin, Jiamin Liu, Evrim B. Turkbey, and Ronald M. Summers. "Deeporgan: Multi-level deep convolutional networks for automated pancreas segmentation." In International conference on medical image computing and computer-assisted intervention, pp. 556-564. Springer, Cham, 2015.

[10] Ouali, Yassine, Céline Hudelot, and Myriam Tami. "Semi-supervised semantic segmentation with cross-consistency training." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12674-12684. 2020.

[11] Zhang, Yizhe, Lin Yang, Jianxu Chen, Maridel Fredericksen, David P. Hughes, and Danny Z. Chen. "Deep adversarial networks for biomedical image segmentation utilizing unannotated images." In International conference on medical image computing and computer-assisted intervention, pp. 408-416. Springer, Cham, 2017.

[12] Vu, Tuan-Hung, Himalaya Jain, Maxime Bucher, Matthieu Cord, and Patrick Pérez. "Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2517-2526. 2019.

[13] Tarvainen, Antti, and Harri Valpola. "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results." arXiv preprint arXiv:1703.01780 (2017).

[14] Yu, Lequan, Shujun Wang, Xiaomeng Li, Chi-Wing Fu, and Pheng-Ann Heng. "Uncertainty-aware self-ensembling model for semi-supervised 3D left atrium segmentation." In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 605-613. Springer, Cham, 2019.

[15] Li, Shuailin, Chuyu Zhang, and Xuming He. "Shape-aware semi-supervised 3D semantic segmentation for medical images." In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2020.

[16] Zhang, Yichi, and Jicong Zhang. "Dual-task mutual learning for semi-supervised medical image segmentation." arXiv preprint arXiv:2103.04708 (2021).
