Author: Unknown user (ge56zur)

Abstract

Self-supervised learning (SSL) has advanced computer vision and is of particular importance in the medical field, where obtaining high-quality annotated data requires substantial costs and expertise, e.g., from a trained radiologist. Moreover, models that can learn from large unlabeled data sets can be considered a step towards more generalist models. Joint embedding architectures are a common way to extract information from unlabeled images, which can then be used to improve performance on a subsequent transfer learning task such as object detection. Here, we present a meta-analysis on visual representation learning through joint embedding architectures in medical applications. These architectures have to address the collapse problem, which can result in non-informative output embeddings. Thus, two families of methods are introduced: Information Maximization, which maximizes the information content of the learned embeddings, and contrastive learning, which learns similar and dissimilar data representations after categorizing the data into positive and negative pairs. Furthermore, the challenge of false negatives is also addressed. The methods are then compared on image classification as well as object detection through transfer learning. Finally, an approach to learn self-supervised anatomical embeddings (SAM) from radiological images is presented, as well as MedAug, an example of leveraging medical information to improve contrastive learning, and prior-guided local self-supervised learning (PGL), a refined SSL approach for 3-D medical image segmentation.



Introduction - The Learning Streams


Figure 1: Learning streams illustration [1]
Notes: Black line: decision boundary; grey circles: unlabeled data; colored circles: labeled data; striped circles: partially neglected labels

Supervised learning uses only labeled data to learn an algorithm that maps features to labels for inference or prediction tasks, whereas unsupervised learning uses only unlabeled data to find new insights or patterns in the data. Semi-supervised learning, in contrast, can use both labeled and unlabeled data: the labeled data are used to predict pseudo-labels for the unlabeled data, and a weighted combination of a labeled and an unlabeled loss can then improve upon results obtained from the labeled data alone. Finally, self-supervised learning (SSL) picks up supervisory signals from the data itself to learn data representations [2]. Given a large set of unlabeled data, SSL makes it possible to learn features from these data and to transfer them to a subsequent task, improving performance on this second task. Figure 1 visualizes all of these types of learning.

Figure 2: Two stages of self-supervised learning [3]
Notes: Phase 1: pretext task; phase 2: target task.

Training an SSL model involves two stages [3]: 1. a pretext task or pre-training task to learn intermediate data representations from unlabeled data, and 2. a downstream or target task, in which the knowledge from pre-training is transferred to fine-tune a model that then carries out the target task, such as object detection or image classification. The training pipeline is illustrated in Figure 2 and sketched in code below.
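
To make the two-stage pipeline concrete, the following minimal PyTorch sketch shows how a backbone pre-trained on a pretext task could be frozen and reused to fine-tune a linear head on a small labeled data set (linear evaluation). The toy encoder, class count and data are arbitrary placeholders and not taken from any of the referenced methods.

```python
import torch
import torch.nn as nn

# Stage 1 (pretext task): assume a backbone has already been pre-trained with a
# self-supervised objective on unlabeled images; here it is only mocked by a
# randomly initialized toy encoder (hypothetical, for illustration only).
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

# Stage 2 (target task): freeze the pre-trained encoder and fine-tune only a
# linear head on the small labeled data set.
for p in backbone.parameters():
    p.requires_grad = False

head = nn.Linear(16, 10)                 # 10 target classes, chosen arbitrarily
model = nn.Sequential(backbone, head)

optimizer = torch.optim.SGD(head.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# one dummy fine-tuning step on random "labeled" data
x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print(loss.item())
```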


Figure 3: Three potential effects of pre-training on model accuracy (two stage SSL model compared to a one-stage model) [4]
Notes: Three effects with more labels: (a) improvement over baseline, (b) higher accuracy but same plateau, (c) convergence to baseline performance

In a study [4], the authors show that pre-training improves model performance when applied to larger models and more complex data, and that three potential outcomes are conceivable, as shown in Figure 3.

Focus and Problem Statement

This meta-analysis is focused on the task of self-supervised (visual) representation learning through joint embedding architectures by training two neural networks to produce similar embeddings from two different views obtained from augmenting the same image. A practical challenge of these SSL architectures is the "collapse problem", i.e. the encoders produce identical, constant and non-informative output embedding vectors and ignore the inputs.

This problem can be overcome by two types of methods: 1. Information Maximization methods that maximize the information content of the two learned embeddings. 2. Contrastive methods that rely on a contrastive loss coupled with an instance discrimination task. This task is as follows: positive instance pairs, which refer to augmentations of the same image, should be attracted, while negative instance pairs, i.e. an augmentation of one image compared to an instance from a different image, should be repelled by pushing their embeddings further apart. An illustration can be seen in Figure 4. The blue-framed picture is a crop of the anchor image, so the two form a positive pair, while the red-framed pictures are taken from different images to form negative pairs with the anchor. It can be problematic that visual features and semantic content are ignored in this definition of negative pairs.


Figure 4: Positive and negative sample pairs in contrastive learning [12]
Notes: Positive pair: anchor image and crop of the anchor (blue frame). Negative pair: anchor image and one sample from a different image (red frame).

Contrastive learning can be defined as a framework to learn representations of data from similar (positive) and dissimilar (negative) pairs [13]. Figures 5(a) and 5(b) illustrate how the contrastive loss, InfoNCE (information noise-contrastive estimation), can be thought of as a scaled and normalized cross-entropy loss which applies a similarity measure between the representation components z_i and z_j. It can also be applied to the Momentum Contrast formulation.


Figure 5(a): Contrastive loss used in SimCLR; Figure 5(b): Momentum Contrast loss
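
As a concrete illustration, the following is a minimal PyTorch sketch of the NT-Xent (InfoNCE) loss in the spirit of SimCLR, assuming the two augmented views of a batch have already been passed through the encoder and projector to obtain the embeddings z1 and z2; the temperature value and batch size are arbitrary. The cross-entropy form makes the "scaled and normalized cross-entropy" reading of Figure 5(a) explicit.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent / InfoNCE loss over a batch of paired embeddings.

    z1, z2: (N, D) embeddings of the two augmented views of the same N images.
    For every embedding, the view of the same image is the positive; the
    remaining 2N - 2 embeddings in the batch act as negatives.
    """
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, D)
    sim = (z @ z.t()) / temperature                           # scaled cosine similarities
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float('-inf'))                # exclude self-similarity

    # row i has its positive at i + n, row i + n has it at i
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# usage with random embeddings standing in for projector outputs
z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
print(nt_xent_loss(z1, z2).item())
```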

Next, we contrast breakdowns of the loss terms for the contrastive SimCLR method and for Barlow Twins, a method based on the cross-correlation matrix. The two losses are shown in Figures 6(a) and 6(b). As we can see from this example, both methods can be employed to overcome the collapse problem.

Figure 6(a): SimCLR loss breakdown; Figure 6(b): Barlow Twins loss breakdown
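
For comparison, a minimal sketch of the Barlow Twins loss breakdown from Figure 6(b), again assuming pre-computed embeddings z1 and z2; the weighting coefficient lam is illustrative.

```python
import torch

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Barlow Twins objective: drive the cross-correlation matrix of the two
    batch-normalized embeddings towards the identity matrix."""
    n = z1.size(0)
    # standardize every embedding dimension across the batch
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)

    c = (z1.t() @ z2) / n                                          # (D, D) cross-correlation matrix
    invariance = (torch.diagonal(c) - 1).pow(2).sum()              # diagonal -> 1
    redundancy = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # off-diagonal -> 0
    return invariance + lam * redundancy

# usage with random embeddings standing in for projector outputs
print(barlow_twins_loss(torch.randn(64, 128), torch.randn(64, 128)).item())
```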

Comparison of SSL methods (Information Maximization)

Figure 7 provides an overview of SSL architectures, which all rely on two-branch encoder structures and which feature an architecture head with a loss function and data transformations.


Figure 7: Overview of SSL method architectures and common joint embedding architecture characteristics [7]

Based on this overview, Figures 8(a)-(d) and Figures 9(a)-(c) show how the collapse problem can be overcome by the selected SSL architectures, which are categorized into two method categories.

Figure 8(a): VICReg; Figure 8(b): Barlow Twins; Figure 8(c): BYOL; Figure 8(d): SimSiam

Figures 8(a)-(d) show the Information Maximization methods VICReg (Variance-Invariance-Covariance Regularization for self-supervised learning) [7], Barlow Twins [5], BYOL (Bootstrap Your Own Latent) [11] and SimSiam (Simple Siamese Networks) [9]. VICReg features three loss terms. The variance term maintains the variance of each embedding dimension above a threshold. The covariance term decorrelates pairs of embedding variables to avoid the collapse problem. Lastly, the invariance term minimizes the mean squared distance between the embeddings. A unique feature of VICReg is that it allows for different branch architectures and weights. Barlow Twins is centered around an invariance term, which pushes the diagonal elements of the cross-correlation matrix towards 1, and a redundancy reduction term, which pushes the off-diagonal elements towards 0. The latter prevents the collapse problem. BYOL only relies on positive instance pairs and uses a target and an online network, with the peculiarities that the online network predicts the target network representation and that a momentum encoder is used to update the target network. SimSiam is another method which also uses only positive instance pairs and which employs a predictor branch as well as a branch with a stop-gradient operation in order to resolve the collapse problem.
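
The three VICReg loss terms can be summarized in a short sketch; the threshold gamma and the weighting coefficients below are illustrative choices and not necessarily those of the original work.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0, gamma=1.0):
    """VICReg loss = invariance + variance + covariance terms."""
    n, d = z1.shape

    # invariance: mean squared distance between the two embeddings
    inv = F.mse_loss(z1, z2)

    # variance: hinge loss keeping the std of every dimension above gamma
    std1 = torch.sqrt(z1.var(dim=0) + 1e-4)
    std2 = torch.sqrt(z2.var(dim=0) + 1e-4)
    var = torch.relu(gamma - std1).mean() + torch.relu(gamma - std2).mean()

    # covariance: decorrelate dimensions by penalizing off-diagonal covariance
    z1c, z2c = z1 - z1.mean(0), z2 - z2.mean(0)
    cov1 = (z1c.t() @ z1c) / (n - 1)
    cov2 = (z2c.t() @ z2c) / (n - 1)
    off_diag = lambda m: (m - torch.diag(torch.diagonal(m))).pow(2).sum() / d
    cov = off_diag(cov1) + off_diag(cov2)

    return sim_w * inv + var_w * var + cov_w * cov

# usage with random embeddings standing in for the two branch outputs
print(vicreg_loss(torch.randn(64, 128), torch.randn(64, 128)).item())
```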

Comparison of SSL methods (Contrastive Learning)



Figure 9(a): Momentum Contrast; Figure 9(b): Whitening-MSE; Figure 9(c): SimCLR


Figures 9(a)-(c) illustrate the architecture of three contrastive methods. MoCo (Momentum Contrast) [6] maintains negative instances in a queue and encodes query representations and positive instances in each training batch through a query encoder. In addition, a momentum encoder ensures consistency between current and earlier keys. W-MSE (Whitening Mean Squared Error Loss) is an SSL approach that involves a whitening transformation, which projects sample representations onto a distribution with zero mean and identity covariance and which thereby solves the collapse problem [8]. After a further normalization step, the MSE is computed between positive instance pairs only; negative instances are not needed, which removes the need for large batch sizes. Lastly, SimCLR (a simple framework for contrastive learning of visual representations) [10] uses a contrastive loss in the latent space that attracts positive pairs and repels negative pairs within each batch. This excludes constant outputs from the solution space.
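
To make the momentum encoder and queue bookkeeping of MoCo more tangible, the following sketch shows the two essential update steps: the key encoder is an exponential moving average of the query encoder, and each batch of encoded keys is pushed into a fixed-size queue of negatives. The toy encoders, queue length and momentum value are placeholders, not the original implementation.

```python
import copy
import torch
import torch.nn as nn

encoder_q = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))  # query encoder (toy)
encoder_k = copy.deepcopy(encoder_q)                                  # momentum (key) encoder
for p in encoder_k.parameters():
    p.requires_grad = False

queue = torch.randn(128, 4096)     # feature dim x queue length of stored negatives
momentum = 0.999

@torch.no_grad()
def momentum_update():
    # key encoder = exponential moving average of the query encoder,
    # which keeps the stored keys consistent over time
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.data.mul_(momentum).add_(pq.data, alpha=1 - momentum)

@torch.no_grad()
def enqueue(keys, queue):
    # drop the oldest keys and append the newest batch of keys
    return torch.cat([queue[:, keys.size(0):], keys.t()], dim=1)

x = torch.randn(32, 3, 32, 32)                      # one batch of images
momentum_update()
keys = nn.functional.normalize(encoder_k(x), dim=1)
queue = enqueue(keys, queue)
print(queue.shape)
```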

Drawbacks of Selected SSL methods

Numerically, SimCLR is computationally demanding, as the contrastive loss requires a large number of negative instances and thus large batch sizes [7]. The W-MSE method can also be computationally problematic, as the inverse of the covariance matrix of the embeddings has to be computed [7].

Moreover, all contrastive methods can suffer from the problem of "false negatives", i.e. samples from other images with similar semantic content or visual features. The negative effects are slow convergence and a loss of semantic information. As illustrated in Figure 10(a), support views, which refer to augmentations of the image used for the negative instance, can be used to detect false negatives [12]. The contrastive loss can then be modified such that an anchor image i is not contrasted against the false negatives from the set F_i, as shown in Figure 10(b).
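
The following sketch illustrates the masking idea: once a set of suspected false negatives F_i has been identified for an anchor i (e.g. via support views, which is assumed to have happened elsewhere), those candidates are simply removed from the denominator of the InfoNCE loss. The function and its arguments are hypothetical and only meant to show the mechanism.

```python
import torch
import torch.nn.functional as F

def masked_info_nce(anchor, candidates, pos_idx, false_neg_idx, temperature=0.1):
    """InfoNCE for one anchor where suspected false negatives are excluded.

    anchor:        (D,) embedding of the anchor view
    candidates:    (K, D) embeddings the anchor is contrasted against
    pos_idx:       index of the true positive within `candidates`
    false_neg_idx: indices of suspected false negatives (the set F_i)
    """
    anchor = F.normalize(anchor, dim=0)
    candidates = F.normalize(candidates, dim=1)
    logits = candidates @ anchor / temperature              # (K,) similarities

    mask = torch.zeros_like(logits, dtype=torch.bool)
    mask[false_neg_idx] = True
    logits = logits.masked_fill(mask, float('-inf'))        # remove F_i from the denominator

    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([pos_idx]))

# usage with random embeddings; indices 3 and 7 play the role of F_i
loss = masked_info_nce(torch.randn(128), torch.randn(10, 128),
                       pos_idx=0, false_neg_idx=[3, 7])
print(loss.item())
```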

Selected Results – Classification on ImageNet & Transfer Learning

    Table 1: SSL method evaluation on ImageNet (Top-1/Top-5 accuracy on top of a ResNet-50)

In Table 1, five selected SSL methods are compared in terms of their Top-1 and Top-5 accuracy when a linear image classifier is fitted on top of the frozen representations of a ResNet-50 backbone pre-trained on ImageNet with the respective SSL method. BYOL seems to perform marginally better on this task than the other methods. Moreover, the same accuracy measures are compared across methods when the pre-trained models are fine-tuned in a semi-supervised setting using only 1%/10% of the labeled ImageNet instances. Barlow Twins and VICReg exhibit the best performance on this second task. However, the performance of the compared methods does not differ substantially.

In Table 2, the transfer learning performance on the target task of object detection is compared across five SSL methods that have been pre-trained on standard image classification data sets. Average precision is used as the target metric. All considered SSL methods are closely matched in their transfer learning performance on this target task.

Table 2: Transfer learning evaluation from image classification data (pretext task) to object detection (target task)
Notes: AP: average precision; AP_{50}: average precision at an IoU threshold of 50%


Links to the Medical Field

One application of SSL in the medical field is the learning of anatomical embeddings from unlabeled radiological images (computed tomography (CT) scans or X-rays) in order to locate anatomical structures in other images, as illustrated in Figure 11. The SAM (self-supervised anatomical embeddings) approach [14], which relies on a pixel-wise contrastive learning framework, can be used for this task. SAM encodes global and local anatomical information and is motivated by the costs of obtaining annotated medical image data and the required radiologist expertise. In Figure 12, the learning process of the SAM method is visualized. The steps involve augmenting a patch of an unlabeled image, applying a convolutional neural network (CNN) to extract local and global pixel-wise embeddings, and applying a contrastive loss (InfoNCE) to perform contrastive learning. Lastly, a nearest neighbor search is carried out in order to locate the points of interest (see the sketch after Figure 12). A large and diverse set of negative pairs can increase the performance of contrastive learning.

Figure 11: Training and inference in the SAM method [14]

Figure 12: Steps in the SAM learning process [14]
Notes: Step 1: patch augmentation, step 2: extract embeddings, step 3: apply contrastive loss, step 4: nearest neighbor search
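
The final inference step of SAM boils down to a nearest-neighbor search in embedding space: the embedding of a query point in one image is compared against all pixel-wise embeddings of a target image. A minimal sketch of that lookup is given below; the embeddings and image sizes are random placeholders rather than actual SAM network outputs.

```python
import torch
import torch.nn.functional as F

# hypothetical pixel-wise embeddings: (channels, height, width)
emb_query_img = F.normalize(torch.randn(128, 64, 64), dim=0)
emb_target_img = F.normalize(torch.randn(128, 64, 64), dim=0)

# embedding of a landmark annotated at (row=20, col=30) in the query image
query_vec = emb_query_img[:, 20, 30]                        # (128,)

# cosine similarity of the query against every pixel of the target image
sim = torch.einsum('c,chw->hw', query_vec, emb_target_img)  # (64, 64)

# nearest neighbor = pixel with the highest similarity
row, col = divmod(sim.argmax().item(), sim.size(1))
print(f"matched location in target image: ({row}, {col})")
```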

In Figure 13, a refinement of the BYOL method for 3-D medical image segmentation is illustrated, named prior-guided local self-supervised learning (PGL) [16]. PGL learns local consistency between features of the same region in different augmentations of an image. In the figure, the colored image patches represent the same locations in two different augmentations of the initial image x. In contrast, the BYOL method is only concerned with global consistency. Experiments show that PGL can outperform BYOL on liver and spleen segmentation tasks [16].

Figure 13: Comparison of BYOL and PGL for 3D medical image segmentation [16]
Notes: BYOL: Bootstrap Your Own Latent; PGL: prior-guided local self-supervised learning; \tau_1(x),\tau_2(x): data augmentations of image x
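
As a rough sketch of the local-consistency idea, assume the spatial feature maps of the online and target branches are available together with the known coordinates of a corresponding region; a BYOL-style negative cosine loss can then be applied region by region instead of on a single global vector. Shapes, coordinates and the alignment step are simplified placeholders and do not reproduce the exact PGL implementation.

```python
import torch
import torch.nn.functional as F

def local_consistency_loss(feat_online, feat_target, region):
    """Negative cosine similarity between corresponding local regions.

    feat_online, feat_target: (C, D, H, W) feature maps of the two views
    region: (slice_d, slice_h, slice_w) describing the same anatomy in both
    """
    sd, sh, sw = region
    a = feat_online[:, sd, sh, sw].reshape(feat_online.size(0), -1).mean(dim=1)
    b = feat_target[:, sd, sh, sw].reshape(feat_target.size(0), -1).mean(dim=1)
    return -F.cosine_similarity(a, b, dim=0)

# usage with random 3-D feature maps standing in for the two branch outputs
feat_online = torch.randn(64, 16, 32, 32)
feat_target = torch.randn(64, 16, 32, 32)
region = (slice(4, 8), slice(10, 20), slice(10, 20))
print(local_consistency_loss(feat_online, feat_target, region).item())
```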


Overall, Table 3 shows the most used SSL approaches in the field of medical imaging and the corresponding downstream tasks.

   Table 3: Most commonly used self-supervised learning methods and corresponding medical downstream tasks [15]


In Figure 14, the selection of positive instance pairs using patient metadata (patient number, study number or laterality) is illustrated, which yields a larger set of positive image pairs for contrastive learning. This MedAug method leverages medical information for the target task of disease classification through chest X-ray interpretation and uses Momentum Contrast for the pre-training task; it achieves a 14% higher mean AUC than the ImageNet baseline [17].
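
A small sketch of the metadata-based selection of positive pairs that MedAug builds on: images from the same patient, optionally restricted to the same study or laterality, are grouped as positives. The field names and grouping criteria are illustrative and not the exact MedAug implementation.

```python
from itertools import combinations

# hypothetical metadata records for chest X-ray images
records = [
    {"image_id": "img1", "patient": "P1", "study": 1, "laterality": "frontal"},
    {"image_id": "img2", "patient": "P1", "study": 1, "laterality": "lateral"},
    {"image_id": "img3", "patient": "P1", "study": 2, "laterality": "frontal"},
    {"image_id": "img4", "patient": "P2", "study": 1, "laterality": "frontal"},
]

def positive_pairs(records, same_study=False, same_laterality=False):
    """Return image id pairs treated as positives for contrastive pre-training."""
    pairs = []
    for a, b in combinations(records, 2):
        if a["patient"] != b["patient"]:
            continue                      # different patients are never positives
        if same_study and a["study"] != b["study"]:
            continue
        if same_laterality and a["laterality"] != b["laterality"]:
            continue
        pairs.append((a["image_id"], b["image_id"]))
    return pairs

print(positive_pairs(records))                    # all same-patient pairs
print(positive_pairs(records, same_study=True))   # restricted to the same study
```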


Possible Directions for Future SSL Research

  • A combination of the asymmetric architectures (BYOL or SimSiam) with the whitening operation (W-MSE) could be investigated to improve existing SSL approaches [8]
  • Further incorporation of medical knowledge to design pretext tasks might improve performance on downstream tasks further [15]
  • Collecting standard pools of unlabeled data related to imaging modalities and medical conditions could advance the application of SSL in the medical field [15]

Summary and Lessons Learned

Motivation & Joint Embedding Architecture Challenge:

  • Because of the high costs and the expertise required to annotate medical data, self-supervised and semi-supervised learning make it possible to utilize unlabeled medical data
  • SSL-based representation learning through joint embedding architectures must overcome the collapse problem, which can be solved by both contrastive methods and Information Maximization methods
  • Both SSL method types are equally suitable for transfer learning

Contrastive Learning Challenge:​

  • The false negatives problem for contrastive methods can lead to slow convergence and to a loss of semantic information
  • Solution: False negative elimination

Medical Applications:

  • SAM can be used to learn anatomical embeddings from unlabeled radiological images and to locate the corresponding anatomical structures in other images
  • PGL is a successful refinement of the BYOL method for the specific purpose of 3-D medical image segmentation

Personal Review

A method with future potential: Among the Information Maximization methods, VICReg is the only approach that allows for different architectures and weights for the two embedding branches. This could make it slightly more flexible than the other methods considered, but a particular application in the medical field has yet to be devised.
A method with the highest current potential: The BYOL approach seems to show considerable potential for medical applications since the approach has been modified in order to factor in information about local regions of embedding features in different image augmentations to permit accurate 3-D medical image segmentation.
Methods with potential for large scale unlabeled medical data: While SimCLR and MoCo treat positive and negative instance pairs differently, SimSiam and BYOL don't rely on the negative instances and employ two tricks to overcome the collapse problem instead, namely a stop-gradient operation and a momentum encoder, respectively. Given the higher computational complexity that comes as a drawback of methods that use the dissimilarity between positive and negative sample pairs, SimSiam and BYOL as well as the Information Maximization based methods Barlow Twins and VICReg appear to exhibit more potential to be employed on large pools of unlabeled medical data. In addition, MedAug, SAM and PGL show how SSL approaches can be successfully adapted and refined for medical applications such as disease classification, the location of anatomical structures as well as the segmentation of images of human organs.


References


[1]: Dyakonov, A. (2020, December 12). Self-Supervised Machine Learning: Examples and Tutorials. Dasha AI Blog. https://dasha.ai/en-us/blog/self-supervised-machine-learning.

[2]: LeCun, Y., & Misra, I. (2021, March 04). Self-supervised learning: The dark matter of intelligence. Meta AI Blog. https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence/.

[3]: Shah, D., & Jha, A. (2022, May 13). Self-supervised Learning and its applications. Neptune Blog. https://neptune.ai/blog/self-supervised-learning.

[4]: Newell, A., & Deng, J. (2020). How useful is self-supervised pretraining for visual tasks? Proceedings of the IEEE/CVF Conference on CVPR, 7345–7354.

[5]: Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). Barlow twins: Self-supervised learning via redundancy reduction. In M. Meila & T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning (pp. 12310–12320). PMLR.

[6]: He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE/CVF Conference on CVPR, 9729–9738.

[7]: Bardes, A., Ponce, J., & LeCun, Y. (2022). VICReg: Variance-invariance-covariance regularization for self-supervised learning. ICLR 2022 (10th International Conference on Learning Representations).

[8]: Ermolov, A., Siarohin, A., Sangineto, E., & Sebe, N. (2021). Whitening for self-supervised representation learning. International Conference on Machine Learning, 3015–3024.

[9]: Chen, X., & He, K. (2021). Exploring simple siamese representation learning. Proceedings of the IEEE/CVF Conference on CVPR, 15750–15758.

[10]: Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. International Conference on Machine Learning, 1597–1607.

[11]: Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. (2020). Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33, 21271–21284.

[12]: Huynh, T., Kornblith, S., Walter, M. R., Maire, M., & Khademi, M. (2022). Boosting contrastive self-supervised learning with false negative cancellation. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2785–2795.

[13]: Chen, X., Fan, H., Girshick, R., & He, K. (2020). Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.

[14]: Yan, K., Cai, J., Jin, D., Miao, S., Guo, D., Harrison, A. P., Tang, Y., Xiao, J., Lu, J., & Lu, L. (2022). SAM: Self-supervised learning of pixel-wise anatomical embeddings in radiological images. IEEE Transactions on Medical Imaging.

[15]: Shurrab, S., & Duwairi, R. (2021). Self-supervised learning methods and applications in medical imaging analysis: A survey. arXiv preprint arXiv:2109.08685.

[16]: Xie, Y., Zhang, J., Liao, Z., Xia, Y., & Shen, C. (2020). PGL: Prior-guided local self-supervised learning for 3D medical image segmentation. arXiv preprint arXiv:2011.12640.

[17]: Vu, Y. N. T., Wang, R., Balachandar, N., Liu, C., Ng, A. Y., & Rajpurkar, P. (2021). MedAug: Contrastive learning leveraging patient metadata improves representations for chest X-ray interpretation. In K. Jung, S. Yeung, M. Sendak, M. Sjoding, & R. Ranganath (Eds.), Proceedings of the 6th Machine Learning for Healthcare Conference (pp. 755–769).

