This is a blog post about the paper 'Collaborative Learning of Semi-Supervised Segmentation and Classification for Medical Images'.

Written by Yi Zhou, Xiaodong He, Lei Huang, Li Liu, Fan Zhu, Shanshan Cui, and Ling Shao

Introduction

Computer-aided diagnosis (CAD) has become one of the major research areas in medical imaging and diagnostic radiology [5]. For CAD applications, disease grading (DG) and lesion detection (LD) represent two fundamental problems [1]. The emergence of deep learning (DL) has led to a significant improvement in the capabilities of CAD systems [19]. However, training DL models requires a great amount of labeled data, and obtaining labeled data for medical applications is expensive because it requires domain experts and is very time-consuming. It becomes even more challenging when pixel-level annotations are required for segmentation, as in the case of lesion detection. This makes fully supervised methods impractical [6], especially compared with other fields where data is more readily available [7]. For disease grading, CNNs also show limitations: medical doctors usually assess the severity of a disease based on the detection of specific lesions, but CNNs do not take this into account, resulting in limited accuracy.

Collaborative learning consists of training multiple models together so that their outcomes improve each other's learning and performance. Accurate lesion detection can improve classification, while class-specific information can guide segmentation; nevertheless, only a few state-of-the-art deep learning approaches have aimed to relate the two problems [2, 3, 4].

With this aim, the authors propose a novel collaborative learning method that simultaneously enhances the DG and LD tasks. Their main contributions are:

  1. A multi-lesion mask generator, based on U-net [8] and extended with the Xception [9] module, carefully designed for the LD task given the limited data.
  2. A lesion attentive model that can automatically predict lesion maps from image-level annotated data is proposed.
  3. Both tasks are jointly improved in an end-to-end manner.

The method's performance is tested on diabetic retinopathy (DR), an eye disease that results from diabetes mellitus and can lead to blindness. DR presents four different lesion types (microaneurysms, hemorrhages, hard exudates, and soft exudates), each of which relates to one of the four stages of the disease (mild, moderate, and severe non-proliferative DR, and proliferative DR) [5].

Methodology

Problem Formulation

In order to learn both tasks simultaneously, the method has to jointly optimize a lesion segmentation model G(·) and a disease grading model C(·). For this, two different datasets are available: one containing image-level annotated images (X^I) and one containing pixel-level annotated images (X^P).

On the one hand, to train the segmentation model, the difference between the predicted lesion masks and the ground truth is minimized according to the following objective:

\min\limits_{G}\sum_{l=1}^L\mathcal{L}_{Seg}(G(X^{P}), G(X^{I}), s^{P}_{l}, \tilde s^{I}_{l})

where s^{P}_{l} denotes the lesion-annotated ground truths, \tilde s^{I}_{l} the attention maps generated by the lesion attention model, and L the number of different lesion types.

On the other hand, the grading model is trained through the optimization of:

\min\limits_{C}\sum_{l=1}^L\mathcal{L}_{Cls}(C(X^{I}\cdot att(G(X^{I}))),y^{I})

where att(·) refers to the attention model and y^{I} to the disease grading labels.

Method Overview

The proposed method divides training into a pretraining step and a semi-supervised training step (Fig. 1). First, during pretraining, both the classification and the segmentation models are trained in a fully supervised manner. Second, during the semi-supervised step, images without segmentation labels are fed to the generator, which predicts weak lesion masks. However, these masks are not accurate enough to be used directly for semi-supervision. Instead, they are fed into the attention model for refinement. The refined masks are called attention maps and serve two purposes: they semi-supervise the segmentation model, and they improve grading accuracy by weighting the original X^{I} images into so-called attentive features.
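To make the two-step schedule concrete, here is a minimal PyTorch-style sketch of one semi-supervised epoch. All names (G, C, att, the data loader, and the way attention maps weight the images) are illustrative assumptions, not the authors' released code.

```python
import torch.nn.functional as F

# Illustrative sketch of one semi-supervised epoch (not the authors' code).
# G: mask generator, C: grading classifier, att: lesion attention model.
def semi_supervised_epoch(G, C, att, image_loader, opt):
    for images, grades in image_loader:            # only grade labels here
        weak_masks = G(images)                     # weak multi-lesion masks
        attn_maps = att(images, weak_masks)        # refined attention maps
        # semi-supervise the generator with the detached attention maps
        seg_loss = F.binary_cross_entropy(weak_masks, attn_maps.detach())
        # grade on attentive inputs: original images weighted by attention
        # (collapsing the L lesion channels by max is our own choice)
        attentive = images * attn_maps.max(dim=1, keepdim=True).values
        cls_loss = F.cross_entropy(C(attentive), grades)
        opt.zero_grad()
        (seg_loss + cls_loss).backward()
        opt.step()
```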

Figure 1. Pretraining (left) and semi-supervised (right) training steps. 

Adversarial Multi-Lesion Mask Generator

The multi-lesion mask generator, shown in yellow in Fig. 2, is in charge of predicting lesion segmentation masks from input images. Training a semantic segmentation model usually requires a large amount of pixel-level annotated data, so, given the scarcity of such data, the authors propose a U-shaped network [8] extended with the Xception module [9]. Xception inherits its ideas from the Inception module, the difference being its use of depthwise separable convolutions. A depthwise separable convolution performs a spatial convolution over each input channel independently (one kernel per channel) and then a 1×1 pointwise convolution to combine information across channels; this uses the model's weights more efficiently. In total, the generator is composed of nine convolution tuples. Except for the first one, all are based on the Xception module, and their structure is shown in the lower part of the yellow box in Fig. 2. At the end, L convolution layers with sigmoid activations generate one segmentation mask per lesion type.
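To illustrate the depthwise separable idea, here is a minimal PyTorch block; it is a sketch in the spirit of Xception, and the layer sizes are illustrative rather than the paper's exact tuple configuration.

```python
import torch.nn as nn

# Minimal depthwise separable convolution block in the spirit of Xception [9];
# sizes are illustrative, not the paper's exact configuration.
class SeparableConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        # depthwise: one spatial kernel per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch, bias=False)
        # pointwise: 1x1 convolution mixing information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))
```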

Figure 2. Pipeline of the proposed method. The input data consist of a very small set of pixel-level annotated lesion images X^{P} and a large set of images X^{I} with only image-level labels indicating the disease severity. A multi-lesion mask generator G(·) is proposed for learning the lesion segmentation task in a semi-supervised manner, where X^{P} has real ground-truth masks and X^{I} uses the pseudo masks learned from the lesion attentive disease grading model. An adversarial architecture is also proposed to benefit the training. Moreover, the segmented lesion masks are adopted to generate attentive features that improve the final disease grading performance. The two tasks are jointly optimized in an end-to-end network.

As mentioned before, the segmentation model is optimized in both a supervised and a semi-supervised manner. In both cases, a binary cross-entropy loss minimizes the difference between the predictions and the ground truths (or attention maps, respectively). Finally, a multi-lesion discriminator D(x_{1},x_{2}) is introduced to improve the generator's performance through adversarial training. The total loss for optimizing the LD task can be defined as:

\mathcal{L}_{Seg} = \mathcal{L}_{Adv} + \lambda\mathcal{L}_{CE} = \mathbb{E}[\log(D(X^{P}, G(X^{P})))] + \mathbb{E}[\log(1 - D(X^{I}, G(X^{I})))] + \lambda\mathbb{E}[-s\cdot\log(G(X^{(P,I)})) - (1-s)\cdot\log(1-G(X^{(P,I)}))]

where s is shorthand for s^{P}_{l} and \tilde s^{I}_{l}, the ground truths of the pixel-level and image-level annotated data, respectively, and \lambda weights the two objective terms.
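Reading the loss term by term, a hedged PyTorch rendering could look as follows; D is assumed to output a probability, and all tensor names, the eps guard, and the single-batch simplification are our own.

```python
import torch

# Sketch of L_Seg = L_Adv + lambda * L_CE from the equation above (not the
# authors' code). D outputs a probability; eps guards the logarithms.
def seg_loss(D, G, x_p, x_i, x, s, lam=10.0, eps=1e-7):
    # adversarial term: pixel-annotated pairs vs. image-annotated pairs
    adv = (torch.log(D(x_p, G(x_p)) + eps).mean()
           + torch.log(1.0 - D(x_i, G(x_i)) + eps).mean())
    # cross-entropy term: s is the ground-truth mask when x is drawn from
    # X^P, or the attention-map pseudo-label when x is drawn from X^I
    pred = G(x)
    ce = -(s * torch.log(pred + eps)
           + (1.0 - s) * torch.log(1.0 - pred + eps)).mean()
    return adv + lam * ce
```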

Lesion Attentive Disease Grading

Human doctors usually base their diagnosis on the observation of lesions characteristic of the disease. Visual attention models are techniques that allow networks to assess images in a human-like manner: they let networks focus on image regions that carry task-relevant information and neglect irrelevant regions. Their application to medicine could therefore bring many benefits by allowing DL models to imitate doctors' grading paradigms.

The proposed lesion attentive disease grading model, shown in Fig. 3, is composed of two branches: the main branch C(\cdot), for feature extraction and classification of the input disease images, and the lesion attention model att(\cdot), composed of L branches that encode one attention map per lesion. It is trained using the focal loss [12] to cope with class-imbalanced data.
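As a reference point, a minimal binary focal loss could be implemented like this; alpha and gamma are the common defaults from Lin et al. [12], not values reported in this paper.

```python
import torch

# Minimal binary focal loss [12]; alpha/gamma are the usual defaults from
# Lin et al., not values quoted by the authors here.
def focal_loss(p, target, alpha=0.25, gamma=2.0, eps=1e-7):
    p = p.clamp(eps, 1.0 - eps)
    # p_t is the probability assigned to the true class
    p_t = torch.where(target == 1, p, 1.0 - p)
    alpha_t = torch.where(target == 1, torch.full_like(p, alpha),
                          torch.full_like(p, 1.0 - alpha))
    # the (1 - p_t)^gamma factor down-weights easy, well-classified samples
    return (-alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t)).mean()
```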

Regular attention models only focus on high-level features [10, 11]. But, as mentioned before, medical diagnosis relies on detecting pixel-sized lesions, so low-level features must also be taken into account; this makes regular attention models unsuitable for medical applications. Because of that, the proposed attention mechanism first focuses on low-level information by fusing the encoded low-level embeddings from both the input images and the initially predicted lesion masks. Then, high-level guidance is provided through a global context vector, which contains high-level feature information extracted from the main branch during the pretraining step. In the end, the attention model produces lesion attention maps defined as:

\alpha_{l}=Sigmoid(W_{l}^{high}[f^{low\_att}_{l}\odot f^{high}]+b^{high}_{l})

where \alpha_{l} are the attention maps that give high responses to the different lesion regions that characterize the disease, \odot denotes the element-wise multiplication operation, and

f^{low\_att}_{l}=ReLU(W^{low}_{l}concat(m_{l},f^{low})+b^{low}_{l})

where concat(\cdot) refers to the channel-wise concatenation operation.
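Putting the two equations together, one lesion branch might look as follows in PyTorch; modelling W^{low}_l and W^{high}_l as 1×1 convolutions, and broadcasting the global context vector f^{high}, are our assumptions rather than details stated in the text.

```python
import torch
import torch.nn as nn

# One lesion-attention branch following the two equations above; the 1x1
# convolutional form of W^{low}/W^{high} is our assumption.
class LesionAttention(nn.Module):
    def __init__(self, low_ch, high_ch):
        super().__init__()
        # W^{low}_l, b^{low}_l: fuse the lesion mask m_l with low-level features
        self.w_low = nn.Conv2d(low_ch + 1, high_ch, kernel_size=1)
        # W^{high}_l, b^{high}_l: project the high-level-gated features
        self.w_high = nn.Conv2d(high_ch, 1, kernel_size=1)

    def forward(self, m_l, f_low, f_high):
        # f^{low_att}_l = ReLU(W^{low}_l concat(m_l, f^{low}) + b^{low}_l)
        f_low_att = torch.relu(self.w_low(torch.cat([m_l, f_low], dim=1)))
        # alpha_l = Sigmoid(W^{high}_l [f^{low_att}_l (*) f^{high}] + b^{high}_l);
        # f_high may be a global context vector broadcast over spatial dims
        return torch.sigmoid(self.w_high(f_low_att * f_high))
```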

Figure 3. Detailed schematic of the lesion attentive disease grading model. The blue part is the classification model for disease grading and the orange part is the attention model for learning refined lesion maps.

Results and Discussion

Experimental Setup

During pretraining, the segmentation model was trained over 60 epochs with a batch size of 32. For the grading model, a batch size of 128 was used over 30 epochs. For both models, an Adam optimizer with a learning rate of 0.0002 and a momentum of 0.5 was used.

Then, semi-supervised training of the segmentation model was carried out over 50 epochs with a batch size of 15 and \lambda equal to 10.
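A hedged reproduction of these optimizer settings in PyTorch, reading the reported 0.0002 as the learning rate and 0.5 as Adam's first-moment decay β1 (a common GAN configuration); both readings are our interpretation.

```python
import torch
import torch.nn as nn

# Placeholder modules stand in for the real generator and classifier.
generator, classifier = nn.Conv2d(3, 4, 3, padding=1), nn.Linear(512, 5)
# lr = 0.0002 and beta1 = 0.5 ("momentum"); this reading is our assumption.
opt_seg = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_cls = torch.optim.Adam(classifier.parameters(), lr=2e-4, betas=(0.5, 0.999))
```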

Datasets

In this work, three different datasets were used.

- IDRID [13]: the only dataset containing pixel-level annotated images. However, it is extremely small, with only 54 training and 27 testing images.

- EyePACS [14]: contains a large number of disease-graded images and shares its grading protocol with IDRID.

- Messidor [15]: has a different grading protocol and was only used for validating the results.

Since the images came from different datasets and differed in resolution and illumination, a preprocessing method based on [16] was used to unify image quality and sharpen texture details. In addition, vertical and horizontal flips and rotations were used to augment the data and mitigate class imbalance, as sketched below.
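A possible torchvision sketch of this flip-and-rotation augmentation; the probabilities and rotation range are illustrative assumptions, not the authors' exact values.

```python
from torchvision import transforms

# Flip-and-rotate augmentation as described above; p and the degree range
# are illustrative assumptions.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(degrees=180),  # fundus images tolerate rotation
    transforms.ToTensor(),
])
```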

Ablation Studies

Qualitative Multi-lesion Segmentation Results

First, to evaluate segmentation performance, lesions predicted by a pre-trained and by a semi-supervised generator are compared with the ground truth. In Fig. 4, we can observe that the pre-trained model tends to under-segment big hemorrhages or even miss small ones, whereas the semi-supervised model shows higher sensitivity and robustness. In the case of soft exudates, the pre-trained model also reports several false positives (red squares) while the semi-supervised model does not.

Figure 4. Qualitative multi-lesion segmentation results

Effect of Lesion Attentive Disease Grading

To evaluate the influence of lesion segmentation on DR grading, as well as the influence of semi-supervision by the attention model, the authors propose three baselines and the final method:

  1. Ori: does not use the lesion attention model; only the main branch of the classifier is trained on the pre-processed fundus images.
  2. Lesion (Pretrained): the segmentation model is trained only on the limited pixel-level annotated data, and the predicted masks are used to weight the image feature maps for training the classification model, without the attention model.
  3. Lesion (Semi): the lesion attention model is included and used for semi-supervision of the generator; however, only the cross-entropy loss is used.
  4. Lesion (Semi + adv): the final method, which adds the adversarial training architecture to the objective function.

Table 1. Evaluation of the effectiveness of the lesion attentive disease grading on the IDRID and EyePACS dataset.

Table 1 shows the classification accuracy and kappa score [14] for each baseline. On the IDRID dataset, each baseline outperforms the previous one, which supports the importance of each component of the method. According to the authors, the 5.86% increase in kappa score from Lesion (Pretrained) to Lesion (Semi) is especially significant, because it shows that the proposed lesion attention model can effectively refine the lesion maps and thus improve the grading results.

For the EyePACS dataset, the IDRID pixel-annotated images are used for the fully supervised training of the segmentation model. The results show a similar pattern, supporting the previous conclusions.

Effect of Semi-Supervised Lesion Segmentation

To evaluate the influence of semi-supervised and adversarial learning, three further baselines are compared with the proposed method:

  1. The pre-trained segmentation model using the normal convolution tuple (CE1).
  2. The Xception module-based model (CE2).
  3. The semi-supervised learning component without an adversarial training architecture (CE2 + Semi).
  4. The semi-supervised learning component with an adversarial training architecture (CE2 + Semi + Adv).

Segmentation performance for the different lesions is evaluated on the IDRID dataset. Table 2, containing the AUC values of the ROC and PR curves, compares the segmentation performance of each baseline with respect to the different DR-related lesions. Again, each baseline performs better than the preceding one. The fact that CE2 shows higher AUC values than CE1 validates the choice of Xception for extending the U-net architecture. Semi-supervision further improves performance, since it allows the model to exploit image-level annotated data. Finally, the adversarial training architecture provides a further refinement of the segmentation performance.
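For reference, per-lesion ROC and PR AUCs like those in Table 2 can be computed with scikit-learn over flattened pixel scores; this is a generic sketch, not the authors' evaluation script, and average precision is used as the usual approximation of the PR-curve AUC.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# Generic per-lesion AUC computation (not the authors' evaluation code).
# gt_masks, pred_masks: NumPy arrays of shape (N, L, H, W), L lesion types.
def lesion_aucs(gt_masks, pred_masks):
    aucs = []
    for l in range(gt_masks.shape[1]):
        y_true = gt_masks[:, l].reshape(-1)     # binary pixel labels
        y_score = pred_masks[:, l].reshape(-1)  # predicted probabilities
        aucs.append((roc_auc_score(y_true, y_score),
                     average_precision_score(y_true, y_score)))
    return aucs  # list of (ROC-AUC, PR-AUC) per lesion type
```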

For further validation, the proposed model is compared with state-of-the-art models in the lower part of Table 2. The first three models correspond to the three best performers in the IDRID competition, while AdvSeg [17] and ASDNet [18] are transferred from other vision tasks. Although the proposed model does not beat the best microaneurysm detector, moderate improvements are observed for the other three lesion types.

Table 2. Performance comparisons of multi-lesion segmentation on the IDRID dataset.

Conclusion

In this paper, a novel collaborative learning method is proposed for semi-supervised lesion segmentation and disease grading. Lesion masks are used to direct the attention of the classification model and improve grading accuracy, while attention maps produced by the attention model from class-specific information semi-supervise the segmentation task. Extensive experiments validate the method and show the improvements achieved on the DR problem.

References

[1] E. Miranda, M. Aryuni, and E. Irwansyah. A survey of medical image classification techniques. In Information Management and Technology (ICIMTech), International Conference on, pages 56–61. IEEE, 2016.

[2] Y. Yang, T. Li, W. Li, H. Wu, W. Fan, and W. Zhang. Lesion detection and grading of diabetic retinopathy via two-stages deep convolutional neural networks. In MICCAI, pages 533–540. Springer, 2017.

[3] Z. Wang, Y. Yin, J. Shi, W. Fang, H. Li, and X. Wang. Zoom-in-net: Deep mining lesions for diabetic retinopathy detection. In MICCAI, pages 267–275. Springer, 2017.

[4] B. Antal, A. Hajdu, et al. An ensemble-based system for microaneurysm detection and diabetic retinopathy grading. IEEE transactions on biomedical engineering, 59(6):1720, 2012.

[5] K. Doi. Computer-aided diagnosis in medical imaging: historical review, current status and future potential. Computerized Medical Imaging and Graphics, 31(4-5):198–211, 2007. doi:10.1016/j.compmedimag.2007.02.002

[6] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.

[7] X. Wang, S. You, X. Li, and H. Ma. Weakly-supervised semantic segmentation by iteratively mining common object features. In CVPR, June 2018.

[8] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.

[9] F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2017.

[10] Y. Zhou and L. Shao. Viewpoint-aware attentive multi-view inference for vehicle re-identification. In CVPR, June 2018.

[11] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In CVPR, July 2017.

[12] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. TPAMI, 2018.

[13] IDRID diabetic retinopathy segmentation challenge. http://idrid.grand-challenge.org/

[14] Kaggle diabetic retinopathy detection competition. https://www.kaggle.com/c/diabetic-retinopathy-detection

[15] E. Decencière, X. Zhang, G. Cazuguel, B. Lay, B. Cochener, C. Trone, P. Gain, R. Ordonez, P. Massin, A. Erginay, et al. Feedback on a publicly distributed image database: the Messidor database. Image Analysis & Stereology, 33(3):231–234, 2014.

[16] M. J. van Grinsven, B. van Ginneken, C. B. Hoyng, T. Theelen, and C. I. Sánchez. Fast convolutional neural network training using selective data sampling: application to hemorrhage detection in color fundus images. IEEE Transactions on Medical Imaging, 35(5):1273–1284, 2016.

[17] W.-C. Hung, Y.-H. Tsai, Y.-T. Liou, Y.-Y. Lin, and M.-H. Yang. Adversarial learning for semi-supervised semantic segmentation. arXiv preprint arXiv:1802.07934, 2018.

[18] D. Nie, Y. Gao, L. Wang, and D. Shen. ASDNet: Attention based semi-supervised deep networks for medical image segmentation. In MICCAI, pages 370–378. Springer, 2018.

[19] R. Summers. Deep learning and computer-aided diagnosis for medical image processing: A personal perspective. 2017. doi:10.1007/978-3-319-42999-1_1
