This blog post reviews Big Self-Supervised Models Advance Medical Image Classification, written by the following members of the Google Research and Health team: Shekoofeh Azizi, Basil Mustafa, Fiona Ryan, Zachary Beaver, Jan Freyberg, Jonathan Deaton, Aaron Loh, Alan Karthikesalingam, Simon Kornblith, Ting Chen, Vivek Natarajan, Mohammad Norouzi [1].


Introduction

Machine learning in the medical field in particular requires precise and credible labeled data that usually must be determined and confirmed by multiple physicians, making it extremely costly to build extensive datasets. Supervised pretraining of models on different kinds of data (e.g., natural images from ImageNet) is one well-explored way to improve the performance of medical image classification [11, 12]. In recent years, however, self-supervised learning, and with it contrastive learning, has become another increasingly popular way to tackle this lack of labeled data across the board [2, 3, 9, 10] and even in the medical field [6-8].
This paper combines established knowledge about pretraining and self-supervised contrastive learning and transfers it to the medical domain, leading to enhanced medical image classification based on already existing natural and medical images. In addition, it proposes Multi-Instance Contrastive Learning (MICLe), a new form of contrastive learning that benefits from the fact that a pathology of a single patient is often documented from multiple views.

Methodology

The reviewed paper proposes a novel approach to contrastive learning in medical imaging and integrates it into a semi-supervised pipeline: self-supervised learning on both natural and medical images, followed by supervised fine-tuning on labeled medical images.

Self-supervised learning using natural and medical images

With SimCLR, Google researchers developed one of the first self-supervised learning methods that can outpace supervised learning [2, 3]. This paper uses SimCLR as the foundation for contrastive learning on both natural and medical images. SimCLR generates two augmented versions of one source image using a pool of transformations including "random crop, color distortion and Gaussian blur" [1], as shown in Figure 1. After these steps, the contrastive loss is computed with Equation 1. Instead of relying only on natural images to pretrain a model in a supervised fashion, the authors train big models with self-supervised contrastive learning from scratch on both natural images from ImageNet and medical images.


Figure 1: Procedure of SimCLR contrastive learning on chest x-rays and dermatology images (adapted from [1]).


\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}

Equation 1: Contrastive normalized and temperature-scaled cross-entropy loss for the augmented, encoded, and projected images z_i and z_j of the same positive pair, where sim() is the cosine similarity, \mathbb{1}_{[k \neq i]} is an indicator function evaluating to 1 iff k ≠ i, N is the batch size, and \tau is the temperature hyperparameter [1, 2].
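As a minimal sketch of this loss (not the authors' implementation; NumPy, with positive pairs assumed to sit in adjacent rows of the embedding matrix), the batch loss over 2N projected vectors can be computed as:

```python
import numpy as np

def nt_xent_loss(z, tau=0.1):
    """Contrastive loss of Equation 1 over 2N projected embeddings.

    z: array of shape (2N, d); rows 2k and 2k+1 (0-based) form a positive pair.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # normalize for cosine similarity
    sim = (z @ z.T) / tau                               # sim(z_i, z_k) / tau for all pairs
    np.fill_diagonal(sim, -np.inf)                      # the indicator drops the k == i term
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    idx = np.arange(z.shape[0])
    partner = idx ^ 1                                   # positive partner: 0<->1, 2<->3, ...
    # mean over all 2N terms equals 1/(2N) * sum over the N positive pairs of l(i,j) + l(j,i)
    return -log_prob[idx, partner].mean()
```

With well-aligned positive pairs, each row's partner dominates the softmax and the loss approaches zero; mismatched pairings drive it up.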

Multi-Instance Contrastive Learning (MICLe)

While the previous subsection focuses on the performance of self-supervised learning on natural and medical images, the following investigates the improvement of contrastive learning on medical images.

Contrastive learning uses augmentation to generate different representations of the same original image; the two augmentations of one image are called a positive pair. With these, the network can learn the features of an image without additional information such as manually assigned, descriptive labels.
The authors propose a novel approach to contrastive learning that exploits the fact that medical pathologies are often documented in two or more pictures taken from different views. The idea is to use this naturally available augmentation of the same pathology as a foundation for contrastive learning, as depicted in Figure 2.


Figure 2: Procedure of Multi-Instance Contrastive Learning (MICLe) on two distinct dermatology images of the same patient and skin disease (adapted from [1]).

Instead of selecting a single image from the dataset and augmenting it in two randomized ways, MICLe chooses, if available, two distinct images image1 and image2 of a single patient that are usually taken from different views. These images are then separately transformed with two random augmentations t1() and t2() and fed into the training process. This is implemented with a ResNet encoder f() [13] and a Multi-Layer Perceptron (MLP) projection network g(), which yields a representation z = g(f(t(x))) [2]. Finally, the contrastive loss is calculated with Equation 1 and used to determine the loss for the whole batch. Algorithm 1 describes MICLe in more detail, where each case in the batch consists of one or more images of a single skin disease of one patient.

n = size(batch)
k = 1

for case in batch:
	t1, t2 = generate_random_augmentation()
	if size(case) >= 2:
		image1, image2 = get_two_random_images(case)
	else:
		image1 = image2 = get_image(case)
	endif
	
	augmented[2k-1] = t1(image1)
	augmented[2k] = t2(image2)
  
	z[2k-1] = g(f(augmented[2k-1]))
	z[2k] = g(f(augmented[2k]))
	
	k = k + 1
endfor

for i in [1, 2*n]:
	for j in [1, 2*n]:
		s[i][j] = transpose(z[i]) * z[j] / (norm(z[i]) * norm(z[j]))
	endfor
endfor

for k in [1, n]:
	l[2k-1][2k] = loss from Equation 1 using the similarities s
	l[2k][2k-1] = loss from Equation 1 using the similarities s
endfor

L = 1 / (2*n) * sum(k=1, n, l[2k-1][2k] + l[2k][2k-1])
update encoder f and projection head g to minimize L

return trained encoder network f

Algorithm 1: Pseudocode for Multi-Instance Contrastive Learning (MICLe) applied to a batch (adapted from [1]).
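The case-handling logic of Algorithm 1 can be sketched as follows (a hypothetical helper, not the authors' code; `augment` stands in for the random transformations t1()/t2(), and each case is a plain list of images):

```python
import random
import numpy as np

def build_micle_batch(batch, augment, rng=None):
    """Assemble 2N augmented images from N cases, as in Algorithm 1.

    batch: list of cases, each a list of images (views of one patient's pathology).
    augment: callable image -> image that applies a random transformation.
    Returns [t1(x1), t2(x1'), t1(x2), t2(x2'), ...] with length 2N.
    """
    rng = rng or random.Random()
    augmented = []
    for case in batch:
        if len(case) >= 2:
            image1, image2 = rng.sample(case, 2)  # two distinct views of the same pathology
        else:
            image1 = image2 = case[0]             # fall back to standard SimCLR pairing
        augmented.append(augment(image1))
        augmented.append(augment(image2))
    return augmented
```

The encoder and projection step z = g(f(·)) and the loss of Equation 1 then operate on the returned batch exactly as in plain SimCLR.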

Experimental Setup

The experiments in this paper train two models with completely distinct medical targets to demonstrate the performance of the approach under different conditions. The first experiment focuses on dermatology images, including the application of the novel contrastive learning technique MICLe. The second experiment trains a model with contrastive learning to classify chest x-ray images.
The models are trained on multiple ResNet variants: ResNet-50 (1x, 4x) and ResNet-152 (2x). An attached Multi-Layer Perceptron (MLP) reduces the dimensionality of the ResNet encoder outputs to 128 channels. These outputs form the basis of the contrastive learning approach.
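The projection step can be illustrated with a small sketch (hidden width, activation, and initialization below are assumptions in the spirit of SimCLR's MLP head, not the paper's exact configuration):

```python
import numpy as np

def projection_head(h, w1, b1, w2, b2):
    """MLP projection g(): maps encoder features to a 128-d embedding.

    h: (batch, d_enc) encoder outputs, e.g. d_enc = 2048 for ResNet-50 (1x).
    """
    hidden = np.maximum(h @ w1 + b1, 0.0)  # ReLU hidden layer
    return hidden @ w2 + b2                # 128 output channels

rng = np.random.default_rng(0)
d_enc, d_hidden, d_out = 2048, 2048, 128   # assumed widths
params = (rng.normal(0.0, 0.01, (d_enc, d_hidden)), np.zeros(d_hidden),
          rng.normal(0.0, 0.01, (d_hidden, d_out)), np.zeros(d_out))
z = projection_head(rng.normal(size=(4, d_enc)), *params)  # four example feature vectors
```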
Figure 3 gives an overview of the steps taken to train a model, which are described in the following.


Figure 3: Three steps to improve the performance of medical image classification proposed by the authors (adapted from [1]).

1. Self-supervised pretraining on ImageNet

The authors use SimCLR not only to train on medical images but also on the entire ImageNet dataset in order to compare its performance to that of supervised training.

2.1 Self-supervised pretraining on dermatological images

The dermatology experiment was carried out based on the findings of Liu et al., including the experiment and its dataset [5]. The latter contains one or several cases of 12,306 patients, each case depicted in one to six images taken with a non-professional digital camera. These images are not standardized, meaning they can differ completely in background, lighting, color, focus, view, skin condition, and noise. This gives the network the ability to learn from high, natural variance in the data, reflecting a real-world application. However, images showing more than one skin disease or being of very poor quality were not used. This resulted in 27 target skin disease classes split into three sets of data: Derm(train) = 15,340 cases, Derm(val) = 1,190 cases, and Derm(test) = 4,146 cases. While 26 of these classes each categorize a distinct disease, the 27th class aggregates various rarer skin diseases to keep the number of classes low. SimCLR pretraining was performed on a total of 454,294 images including additional unlabeled images, whereas MICLe was pretrained only on the images of Derm(train).

2.2 Self-supervised pretraining on chest x-ray images

The chest x-ray experiment utilizes two datasets. For training, it uses the CheXpert dataset [4], which contains 224,316 x-rays of 65,240 patients categorized into 14 target classes. However, the model is trained for multi-label classification predicting only five of these targets, as described by the authors of CheXpert [4]. The trained model is evaluated on the NIH chest X-ray dataset [14].

3. Supervised fine-tuning

Besides contrastive learning, the authors of SimCLR also propose a fine-tuning approach, which the present paper adopts [2, 3]. Besides the hyperparameter search, one vital aspect of reaching peak performance during supervised fine-tuning of the pretrained models is applying random data augmentation.
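Such random augmentation might look like the following minimal sketch (crop size and flip probability here are illustrative assumptions, not the paper's exact settings):

```python
import numpy as np

def random_augment(image, crop_size, rng=None):
    """Random square crop plus horizontal flip on an (H, W, C) image."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_size + 1)   # random crop position
    left = rng.integers(0, w - crop_size + 1)
    crop = image[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:                     # flip horizontally with probability 0.5
        crop = crop[:, ::-1]
    return crop
```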

Evaluation baselines

The models of both the dermatology and chest x-ray experiments that perform best on the validation set are evaluated for 5 and 10 repetitions on the generated test set, respectively. For the dermatology model, top-1 accuracy and Area Under the Curve (AUC) are reported [5], while the chest x-ray multi-label classification model focuses on the mean AUC as its performance indicator, as proposed by the authors of the CheXpert dataset [4].
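The mean AUC used for the chest x-ray model can be computed per target column and averaged; the sketch below uses the rank-based Mann-Whitney formulation (ignoring tied scores) rather than any specific library:

```python
import numpy as np

def auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) statistic."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # rank 1 = lowest score
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def mean_auc(label_matrix, score_matrix):
    """Mean AUC over the target columns (e.g. the five CheXpert targets)."""
    return np.mean([auc(label_matrix[:, t], score_matrix[:, t])
                    for t in range(label_matrix.shape[1])])
```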

Results and Discussion

The paper leads to two main contributions:

  1. Self-supervised pretraining can outpace supervised training on the full ImageNet dataset

  2. The novel approach of Multi-Instance Contrastive Learning (MICLe) can improve contrastive learning on medical images

Table 1 presents performance metrics of models that were pretrained on one or more of the training sets shown in Figure 3. The results show that bigger models trained with the self-supervised learning approach achieve increasingly better results than smaller ones; especially the gap between ResNet-50 (1x) with 63.44 % and ResNet-152 (2x) with 68.30 % top-1 accuracy demonstrates this. Moreover, pretraining with MICLe on dermatology images leads to a further top-1 accuracy improvement, reaching 68.43 % with ResNet-152 (2x). The authors thereby also show that their approach works for various types of inputs, since natural images and x-rays are fundamentally different (e.g., colored versus greyscale).

Table 1: Summary of the pretraining performance results for both dermatology and chest x-ray image classification. Dermatology classification also shows the top-1 accuracy for pretraining with MICLe where available (adapted from [1]). The best performance for each architecture is shown in bold.

Besides comparing the different pretraining experiments above, another goal of the authors was to compare their approach with supervised models. Table 2 shows that the previously described experimental setup leads to an improvement of 6.7 % in top-1 accuracy on dermatology images and increases the mean AUC for multi-label classification of chest x-rays by 1.1 % compared to supervised pretraining on ImageNet only. In particular, the authors confirm that contrastive self-supervised learning can outpace supervised learning on the full ImageNet dataset.

Table 2: Comparison of different pretraining strategies (supervised, SimCLR, MICLe) using multiple pretraining dataset compositions of ImageNet, dermatology, and CheXpert (adapted from [1]). The best overall performance for top-1 accuracy and mean AUC is highlighted in bold.

Own Review and Discussion

The contributions of this paper push the application of self-supervised, especially contrastive, learning in the medical field one step in the right direction. They show that the lack of labeled medical images no longer needs to be the limiting factor when improving the performance of machine learning models.
With their Multi-Instance Contrastive Learning (MICLe) approach, the authors propose a method that is not bound to the medical domain. It should be applicable to any field with similarly structured datasets; the automotive industry, for example, uses multiple cameras that may need to recognize an object from different points of view.
Furthermore, they provide extensive information about their results in the appendix, supporting their findings.

Even though this paper is very well written and comprehensible, I would like to point out some weaknesses I noticed. Firstly, the authors do not compare their approach to models that were trained fully supervised on both ImageNet and a medical dataset. Secondly, the authors describe horizontal flipping as one of the augmentations for chest x-ray images. However, such left-right flipping resembles a condition called situs inversus, in which, e.g., the heart is naturally located on the opposite side of the body [15, p. 221]. If such cases are included in a dataset, this would likely lead to classification issues. Thirdly, the paper focuses entirely on SimCLR, probably because several of the SimCLR authors also contributed to this paper. However, it remains unclear whether they use SimCLR or its successor SimCLRv2 for their experiments, since they do not differentiate between the two. Furthermore, there is no information on how MICLe performs in combination with other pretraining strategies, such as plain supervised pretraining, as depicted in Table 2.

As future work, I suggest applying fully self-supervised contrastive learning to further natural datasets. Additionally, the MICLe algorithm should be investigated in other fields such as the automotive industry.

References

  1. Shekoofeh Azizi, Basil Mustafa, Fiona Ryan, Zachary Beaver, Jan Freyberg, Jonathan Deaton, Aaron Loh, Alan Karthikesalingam, Simon Kornblith, Ting Chen, Vivek Natarajan, Mohammad Norouzi. Big Self-Supervised Models Advance Medical Image Classification. arXiv preprint, arXiv:2101.05224v2, 2021.

  2. Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton. A Simple Framework for Contrastive Learning of Visual Representations. ICML, 2020.

  3. Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, Geoffrey Hinton. Big Self-Supervised Models are Strong Semi-Supervised Learners. NeurIPS, 2020.

  4. Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, Andrew Y. Ng. CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. AAAI, 2019.

  5. Yuan Liu, Ayush Jain, Clara Eng, David H. Way, Kang Lee, Peggy Bui, Kimberly Kanada, Guilherme de Oliveira Marinho, Jessica Gallegos, Sara Gabriele, Vishakha Gupta, Nalini Singh, Vivek Natarajan, Rainer Hofmann-Wellenhof, Greg S. Corrado, Lily H. Peng, Dale R. Webster, Dennis Ai, Susan J. Huang, Yun Liu, R. Carter Dunn, David Coz. A deep learning system for differential diagnosis of skin diseases. Nature Medicine volume 26, 900-908, 2020.

  6. Xuehai He, Xingyi Yang, Shanghang Zhang, Jinyu Zhao, Yichen Zhang, Eric Xing, Pengtao Xie. Sample-efficient deep learning for COVID-19 diagnosis based on CT scans. medRxiv, 2020.

  7. Jingyu Liu, Gangming Zhao, Yu Fei, Ming Zhang, Yizhou Wang, Yizhou Yu. Align, attend and locate: Chest x-ray diagnosis via contrast induced attention network with limited supervision. Proceedings of the IEEE International Conference on Computer Vision, p. 10632–10641, 2019.

  8. Hong-Yu Zhou, Shuang Yu, Cheng Bian, Yifan Hu, Kai Ma, Yefeng Zheng. Comparing to learn: Surpassing imagenet pretraining on radiographs by comparing image representations. International Conference on Medical Image Computing and Computer-Assisted Intervention, p. 398–407. Springer, 2020.

  9. Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick. Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, p. 9729–9738, 2020.

  10. Xinlei Chen, Haoqi Fan, Ross Girshick, Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint, arXiv:2003.04297, 2020.

  11. Michal Heker, Hayit Greenspan. Joint liver lesion segmentation and classification via transfer learning. arXiv preprint, arXiv:2004.12352, 2020.

  12. Laith Alzubaidi, Mohammed A Fadhel, Omran Al-Shamma, Jinglan Zhang, J Santamaría, Ye Duan, Sameer R Oleiwi. Towards a better understanding of transfer learning for medical imaging: a case study. Applied Sciences, 10(13):4523, 2020.

  13. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition, p. 770–778, 2016.

  14. Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, Ronald M. Summers. ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. Proceedings of the IEEE conference on computer vision and pattern recognition, p. 2097–2106, 2017.
  15. Elizabeth Carver, Barry Carver, Karen Knapp. Medical Imaging - E-Book: Techniques, Reflection and Evaluation (Third Edition). ISBN: 0702085308, 9780702085307. Elsevier Health Sciences, 2021.