This is the blogpost for the paper "How Well Do Self-Supervised Models Transfer?" [1] by Linus Ericsson, Henry Gouk and Timothy M. Hospedales.
Blog post author: Miruna-Alexandra Gafencu
Introduction
Motivation & Problem Statement
Self-supervised learning (SSL) has the advantage of being able to learn from unlabelled datasets. Having reached the point where it can outcompete supervised approaches [2], SSL reduces the risk of overfitting by learning task-agnostic features, which leads to better generalization and scaling on downstream tasks and datasets. Researchers in multiple fields, such as medical [3] or agricultural [4] image processing, are interested in adapting and applying existing models trained on natural images to a variety of tasks. This interest creates the need to evaluate and compare the transfer capabilities of recent SSL approaches. For this purpose, the authors formulate four questions that outline the motivation behind their work:
- How well does SSL transfer compared to a supervised approach?
- Can we determine the best SSL method for all tasks and datasets?
- Is ImageNet classification performance representative for non-classification tasks on downstream datasets?
- What information is retained in the features of supervised and SSL models?
Related Work & Contribution
Previous studies have analysed all-purpose SSL [5], the transferability of supervised pre-trained models [6], the influence of different pre-training datasets and Convolutional Neural Network (CNN) architectures on downstream performance [7], and the suitability of CNN architectures for different pretext tasks [8]. However, no prior work has focused on the transferability of pre-trained self-supervised models across downstream tasks and datasets. The authors identify this gap and propose a large-scale comparison of diverse SSL approaches with a supervised model. They evaluate the ImageNet pre-trained models on classification, object detection and dense prediction using diverse datasets, covering both natural and domain-specific images.
Methodology & Experiments
Models
Thirteen self-supervised learning methods are compared to a supervised baseline. Figure 1 and Table 1 show the starting point of the comparison: the classification accuracy on ImageNet as well as the differences in pre-training, such as the number of epochs and the batch size.
Figure 1: ImageNet classification performance of compared models. Adapted from [1]
Method | Algorithm Type | Epochs | Batch Size |
---|---|---|---|
InsDis [9] | Contrastive | 200 | 256 |
MoCo-v1 [10] | Contrastive | 200 | 256 |
PCL-v1 [11] | Clustering | 200 | 256 |
PIRL [12] | Contrastive | 200 | 1024 |
PCL-v2 [11] | Clustering | 200 | 256 |
SimCLR-v1 [13] | Contrastive | 1000 | 4096 |
MoCo-v2 [14] | Contrastive | 800 | 256 |
SimCLR-v2 [15] | Contrastive | 800 | 4096 |
SeLa-v2 [16,17] | Clustering | 400 | 4096 |
InfoMin [18] | Contrastive | 800 | 256 |
BYOL [2] | Contrastive | 1000 | 4096 |
DeepCluster-v2 [17,19] | Clustering | 800 | 4096 |
SwAV [17] | Clustering | 800 | 4096 |
Supervised | - | 120 | 256 |
Table 1: Overview of compared self-supervised models and supervised baseline indicating the type of approach as well as pre-training specifications. Adapted from [1]
Network architecture
Supervised and self-supervised networks are composed of a backbone and a head, as shown in Figure 2. The backbone, a ResNet50 architecture [20], acts as the feature extractor for both the self-supervised and the supervised models. Sharing the same backbone architecture enables a fair comparison among models despite the pre-training differences listed above. Attached to the backbone is the head, a task-dependent block of layers. Training the backbone-head network on the target downstream dataset is performed in two ways: either by freezing the backbone and optimising only the head, or by fine-tuning the entire network (a minimal code sketch of both settings follows Figure 2).
Figure 2: Network architecture for both self-supervised and for supervised models. The backbone ResNet50 [20] block is the same for all experiments, whereas the head block is changed depending on the current downstream task.
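To make the two training regimes concrete, below is a minimal PyTorch sketch (my own illustration, not the authors' code) of combining a ResNet50 backbone with a task-specific head and either freezing or fine-tuning it. It assumes a recent torchvision, uses a randomly initialised backbone in place of the pre-trained SSL/supervised weights, and the linear classification head and `num_classes` value are placeholders.

```python
import torch.nn as nn
from torchvision import models

def build_transfer_model(num_classes: int, freeze_backbone: bool = True) -> nn.Module:
    # ResNet-50 backbone; here randomly initialised, whereas in the paper the weights
    # come from SSL or supervised ImageNet pre-training.
    backbone = models.resnet50(weights=None)
    feat_dim = backbone.fc.in_features        # 2048 for ResNet-50
    backbone.fc = nn.Identity()               # strip the original classification layer

    if freeze_backbone:
        # Linear-evaluation setting: only the head receives gradient updates.
        for param in backbone.parameters():
            param.requires_grad = False

    head = nn.Linear(feat_dim, num_classes)   # task-dependent head (here: classification)
    return nn.Sequential(backbone, head)

# Frozen-backbone (linear) model vs. end-to-end fine-tuning model.
linear_model = build_transfer_model(num_classes=101, freeze_backbone=True)
finetune_model = build_transfer_model(num_classes=101, freeze_backbone=False)
```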
Comparison Experiments & Datasets
1. Transferability to Downstream Tasks and Datasets
In this section, the conducted experiments are presented in terms of datasets, training and evaluation strategy as well as metrics used.
1.1 Many-shot Recognition
Datasets: The labelled datasets from [6] form a benchmark for image classification that is diverse in terms of object homogeneity, number of training samples and number of classes per dataset (Table 2). The authors refer to this dataset ensemble as “Kornblith”.
Dataset | Description | #classes | #total samples | #images/category | Classification task | Metric |
---|---|---|---|---|---|---|
FGVC Aircraft [21] | aircraft models | 102 | 10200 | 100 | fine-grained | mean per class accuracy |
Caltech-101 [22] | diverse objects | 101 | 9144 | 40-800 | coarse-grained | mean per class accuracy |
Stanford Cars [23] | cars | 196 | 16185 | diverse | fine-grained | top-1 accuracy |
CIFAR-10 [24] | diverse objects | 10 | 60000 | 6000 | coarse-grained | top-1 accuracy |
CIFAR-100 [24] | diverse objects | 100 | 60000 | 600 | coarse-grained | top-1 accuracy |
DTD [25] | describable textures | 47 | 5640 | 120 | texture | top-1 accuracy |
Oxford 102 Flowers [26] | flowers | 102 | 8189 | 40-258 | fine-grained | mean per class accuracy |
Food-101 [27] | food categories | 101 | 101000 | 1000 | fine-grained | top-1 accuracy |
Oxford-IIIT Pets [28] | cats & dogs | 37 | 7349 | around 200 | coarse-grained | mean per class accuracy |
SUN397 [29] | scenes | 397 | 108753 | at least 100 | scenes | top-1 accuracy |
Pascal VOC2007 [30] | diverse objects | 20 | 2913 (training & validation) | diverse | coarse-grained | 11-point mAP metric [30] |
Table 2: Overview of the "Kornblith" benchmark [6] used for many-shot image classification.
Training and evaluation are performed in two ways:
- Linear: The extracted features are used as input for training a logistic regression model to predict one of the available classes, i.e. multinomial logistic regression (a minimal sketch follows below). For optimization, the quasi-Newton method Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) [48] is used to minimize a softmax cross-entropy objective function.
- Fine-tuning: The whole network is fine-tuned for 5000 iterations with a batch size of 64 using stochastic gradient descent with Nesterov momentum. Note that the number of iterations refers to the number of optimization steps (mini-batches), not the number of epochs.
Metric: see last column of Table 2.
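To illustrate the linear protocol, here is a hedged sketch (not the authors' code) that fits a multinomial logistic regression with scikit-learn's L-BFGS solver on pre-extracted features. The random `train_feats`/`train_labels` and the regularisation strength `C=1.0` are placeholders; in the real pipeline the features would come from the frozen ResNet50 backbone and `C` would be tuned on a validation split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder features: in the real pipeline these come from the frozen ResNet50 backbone.
rng = np.random.default_rng(0)
train_feats, train_labels = rng.normal(size=(1000, 2048)), rng.integers(0, 10, size=1000)
test_feats, test_labels = rng.normal(size=(200, 2048)), rng.integers(0, 10, size=200)

# Multinomial logistic regression optimised with L-BFGS (softmax cross-entropy + L2 penalty).
clf = LogisticRegression(solver="lbfgs", C=1.0, max_iter=1000)
clf.fit(train_feats, train_labels)

print("Top-1 accuracy:", clf.score(test_feats, test_labels))
```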
1.2 Few-shot recognition
Datasets: Kornblith without Pascal VOC is used, together with the benchmark proposed by the Broader Study of Cross-Domain Few-Shot Learning (CD-FSL) [31] (Table 3). The CD-FSL datasets are increasingly dissimilar to natural images.
Dataset | Description | #classes | #total samples | #images/category |
---|---|---|---|---|
CropDiseases[4] | Plant diseases | 38 | 58200 | 500 - 5500 |
EuroSat[32] | Satellite images | 10 | 27000 | 2000-3000 |
ISIC2018[3,33] | Skin lesions | 8 | 2594 | diverse |
ChestX[34] | Chest X-ray images | 8 | 108948 | diverse |
Table 3: Overview of the CD-FSL [31] dataset.
Training and evaluation: A prototypical network [35] is an algorithm that learns to make correct predictions when only a limited number of labelled samples is available. It is based on the idea that there exists an embedding space in which inputs with similar features form clusters, so each class can be represented by its cluster centroid (prototype). The authors use prototypical networks in a 5-way 20-shot setting (5 randomly chosen classes, 20 support samples per class used to compute the prototypes) with 600 episodes (or iterations) and 15 query images (a minimal sketch of one episode follows below).
Metric: Average accuracy across the episodes together with 95% confidence interval.
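For intuition, the following is a minimal sketch of a single prototypical-network episode (my own simplification, not the authors' implementation): class prototypes are the mean support embeddings and each query is assigned to the nearest prototype. The 64-dimensional random features are placeholders for the backbone's embeddings.

```python
import torch

def prototypical_episode(support, support_labels, queries, n_way=5):
    """support: (n_way * k_shot, d) embeddings, queries: (q, d) embeddings."""
    # One prototype per class: the centroid of that class's support embeddings.
    prototypes = torch.stack([support[support_labels == c].mean(dim=0) for c in range(n_way)])
    # Squared Euclidean distance between every query and every prototype.
    dists = torch.cdist(queries, prototypes) ** 2
    # Nearest prototype gives the predicted class (softmax over -dists would give probabilities).
    return dists.argmin(dim=1)

# Toy episode: 5-way 20-shot with placeholder 64-dimensional features.
d, k_shot, n_way, n_query = 64, 20, 5, 15
support = torch.randn(n_way * k_shot, d)
support_labels = torch.arange(n_way).repeat_interleave(k_shot)
queries = torch.randn(n_way * n_query, d)
preds = prototypical_episode(support, support_labels, queries)
print(preds.shape)  # (75,) predicted class indices
```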
1.3 Object detection
Dataset: Pascal VOC; trainval2007 & trainval2012 are used for training and test2007 for testing.
Training and evaluation: Faster R-CNN [36] is an object detection algorithm that improves on the earlier Fast R-CNN [37] by using a network to predict the region proposals in which object instances could be found. To create the input feature maps for this network, the authors use a Feature Pyramid Network [38]. Training is performed either by freezing the backbone except for the last residual layer, or by fine-tuning all layers end-to-end.
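As a rough analogue of this setup (not the authors' exact configuration), torchvision's Faster R-CNN with a ResNet50 + FPN backbone can be instantiated as below, assuming a recent torchvision. The `trainable_backbone_layers` argument controls how much of the backbone is unfrozen: 1 roughly matches the "frozen except the last residual block" setting and 5 matches end-to-end fine-tuning. Note that torchvision initialises the backbone with its own ImageNet weights here, whereas in the paper the SSL or supervised pre-trained weights would be loaded instead.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Faster R-CNN with a ResNet50 + FPN backbone; 21 classes = 20 VOC categories + background.
model = fasterrcnn_resnet50_fpn(weights=None, num_classes=21, trainable_backbone_layers=1)

model.eval()
with torch.no_grad():
    # Dummy forward pass: a list of 3xHxW images with values in [0, 1].
    outputs = model([torch.rand(3, 480, 640)])
print(outputs[0].keys())  # dict with 'boxes', 'labels', 'scores'
```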
1.4 Dense prediction - surface normal estimation
Dataset: NYUv2 [39] contains 1449 labelled indoor scenes in the form of pairs of RGB and depth images.
Training and evaluation: The PSPNet [40] architecture takes the global context of the image into consideration to improve prediction. For example, if the model needs to choose whether an object is a car or a boat, the surroundings should be taken into account: if the object is situated in a lake, it is more likely a boat than a car. The authors use PSPNet in combination with the ResNet50 backbone.
Metric: Mean and median angular error, and the percentage of estimations within 11.25°, 22.5° and 30° of the ground truth.
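These angular-error metrics can be computed as in the small NumPy sketch below (my own illustration), assuming predicted and ground-truth normals are given as unit vectors for each valid pixel; the random normals are placeholders.

```python
import numpy as np

def surface_normal_metrics(pred, gt):
    """pred, gt: (N, 3) arrays of unit surface normals (one row per valid pixel)."""
    # Angle between predicted and ground-truth normal, in degrees.
    cos_sim = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    angles = np.degrees(np.arccos(cos_sim))
    return {
        "mean": angles.mean(),
        "median": np.median(angles),
        "<11.25": np.mean(angles < 11.25) * 100,
        "<22.5": np.mean(angles < 22.5) * 100,
        "<30": np.mean(angles < 30.0) * 100,
    }

# Toy example with random unit normals.
rng = np.random.default_rng(0)
pred = rng.normal(size=(1000, 3)); pred /= np.linalg.norm(pred, axis=1, keepdims=True)
gt = rng.normal(size=(1000, 3)); gt /= np.linalg.norm(gt, axis=1, keepdims=True)
print(surface_normal_metrics(pred, gt))
```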
1.5 Dense prediction - semantic segmentation
Dataset: ADE20K [41] contains 27000 segmented images of diverse scenes.
Training and evaluation: UPerNet [42] aims at understanding, recognizing and correlating as many concepts in an image as possible, similar to how humans understand the visual world at multiple levels. An image is assigned a general scene label (e.g. "living room"), followed by object localization (e.g. ceiling, wall, sofa, TV). Going from general to detailed, the parts of the objects are identified, then the materials and finally even the textures. The authors use the CSAIL Semantic Segmentation framework [42], which implements this architecture.
Metrics:
- Intersection over Union (IoU), for which a minimal computation sketch follows this list
- Accuracy
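A minimal sketch of how per-class IoU and pixel accuracy can be computed from a confusion matrix (my own illustration, not the evaluation code of the framework); the label maps and class count are placeholders.

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """pred, gt: integer label maps of the same shape."""
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    conf = np.bincount(gt.ravel() * num_classes + pred.ravel(),
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    iou = intersection / np.maximum(union, 1)      # per-class IoU
    pixel_acc = intersection.sum() / conf.sum()    # overall pixel accuracy
    return iou.mean(), pixel_acc

# Toy example with 4 classes on a 64x64 label map.
rng = np.random.default_rng(0)
pred = rng.integers(0, 4, size=(64, 64))
gt = rng.integers(0, 4, size=(64, 64))
print(segmentation_metrics(pred, gt, num_classes=4))
```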
2. Feature information analysis
To determine which types of information are captured in the features, the authors reconstruct RGB images from the extracted features and compare the reconstructions with the original images from the recognition datasets. Reconstruction quality is measured with the two metrics below (a small sketch of both follows the list).
Metrics:
- Perceptual distance [43]
- Per pixel mean squared error
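A small sketch of both metrics on a pair of original and reconstructed images, assuming the `lpips` package that implements the perceptual distance of [43] (inputs are expected as tensors scaled to [-1, 1]); the random images are placeholders.

```python
import torch
import lpips  # pip install lpips; implements the perceptual distance of [43]

# Placeholder original and reconstructed images: (1, 3, H, W) tensors scaled to [-1, 1].
original = torch.rand(1, 3, 224, 224) * 2 - 1
reconstruction = torch.rand(1, 3, 224, 224) * 2 - 1

# Perceptual distance: compares deep features of the two images (AlexNet variant).
perceptual = lpips.LPIPS(net="alex")
print("Perceptual distance:", perceptual(original, reconstruction).item())

# Per-pixel mean squared error, the second reconstruction metric.
print("Per-pixel MSE:", torch.mean((original - reconstruction) ** 2).item())
```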
Model calibration [44] indicates how well the prediction confidence matches the actual accuracy. Uncertain predictions should be flagged by the model so that further measures can be taken to avoid unwanted consequences, such as safety hazards [45] or social bias [46].
Metric: Expected calibration error [44]
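The expected calibration error bins predictions by confidence and averages the gap between confidence and accuracy across bins; below is a minimal NumPy sketch following the definition in [44] (the bin count and the random predictions are placeholders).

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """confidences: max softmax probability per sample; predictions/labels: class indices."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            bin_acc = np.mean(predictions[in_bin] == labels[in_bin])
            bin_conf = np.mean(confidences[in_bin])
            # Weight each bin's |accuracy - confidence| gap by the fraction of samples in it.
            ece += np.abs(bin_acc - bin_conf) * in_bin.mean()
    return ece

# Toy example: 1000 predictions over 10 classes.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)
predictions = rng.integers(0, 10, size=1000)
confidences = rng.uniform(0.1, 1.0, size=1000)
print("ECE:", expected_calibration_error(confidences, predictions, labels))
```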
Model attention is measured using an occlusion-based method similar to [47]. The main idea is to occlude patches of an image and to compare the resulting features with those of the initial, unoccluded image. If occluding a certain patch leads to a large distance between the features, the model pays a lot of attention to that region.
Metric: Percentage of attention-map pixels whose distance is above the average of the whole attention map.
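The occlusion procedure can be sketched as follows (a simplified illustration, not the authors' exact protocol): each patch is zeroed out in turn, and the feature distance to the unoccluded image becomes that patch's attention value. The patch size and the random input are placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

def occlusion_attention(model: nn.Module, image: torch.Tensor, patch: int = 32) -> torch.Tensor:
    """image: (1, 3, H, W). Returns a coarse attention map of shape (H//patch, W//patch)."""
    model.eval()
    with torch.no_grad():
        base_feat = model(image)
        _, _, H, W = image.shape
        attention = torch.zeros(H // patch, W // patch)
        for i in range(H // patch):
            for j in range(W // patch):
                occluded = image.clone()
                # Zero out one patch and measure how much the feature vector moves.
                occluded[:, :, i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = 0.0
                attention[i, j] = torch.norm(model(occluded) - base_feat)
    return attention

# Example with a ResNet50 feature extractor (classification layer removed).
backbone = models.resnet50(weights=None)
backbone.fc = nn.Identity()
attn = occlusion_attention(backbone, torch.rand(1, 3, 224, 224))
# Diffusion-style metric: fraction of patches whose distance exceeds the map's mean.
print("Attentive diffusion:", (attn > attn.mean()).float().mean().item())
```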
Results and Conclusions
Transfer performance
The transfer performance results for each experiment (reported in detail in [1]) enable us to answer the following questions:
Q1. How well does SSL transfer compared to a supervised baseline?
A1. In all experiments except for the few-shot on Kornblith, the supervised baseline is on average outperformed by at least one SSL approach.
Q2. Can we determine the best SSL method for all tasks and datasets?
A2. No. Depending on the task and dataset, different methods perform best, so no single method learns features that are optimal for all tasks and datasets.
Q3. Is ImageNet classification performance always representative for non-classification tasks on downstream datasets?
A3. No, it is not. The results show that the correlation between ImageNet and downstream performance weakens as fewer labelled samples are available, as the downstream dataset becomes less similar to ImageNet, and as the prediction task becomes more fine-grained. The correlation is high for recognition tasks, but weak for detection and dense prediction, or even non-existent for the latter. This has two implications. Firstly, it is not always advisable for a deep learning practitioner to choose the current ImageNet benchmark leader; to this end, the authors provide a short guideline, summarized in Table 4. Secondly, the computer vision community needs to adopt a larger, more diverse benchmark.
Figure 3: Correlation between ImageNet and downstream performance where each plot represents one experiment. On the X-axis we can see the ImageNet accuracy and on the Y-axis, the average logit-transformed transfer performance. In the upper left corner, Pearson’s correlation coefficient is also shown. The linear regression line in blue indicates a correlation and is surrounded by a lighter blue confidence interval. Source: [1]
Should you choose the current ImageNet benchmark leader?

Dataset type / Task at hand | Recognition - many-shot | Recognition - few-shot | Object detection / Dense prediction |
---|---|---|---|
Structured, ImageNet-like | Yes | Maybe, supervised might outperform | |
Structured, not ImageNet-like | Yes | Yes | |
Unstructured or textures | Cannot say | Cannot say | |
Table 4: Overview of guideline that the authors propose for choosing the most suitable approach. Adapted from [1]
Feature information analysis
Figures 4, 5, 6 & 7 visualize, both quantitatively and qualitatively, the information that the models retain in their features. These results help us answer the last question.
Q4. What information is retained in the features of supervised and SSL models?
A4.1 The perceptual distance in Figure 4 (left) indicates that the supervised model reconstructs images better. We can deduce the same result qualitatively from Figure 5. However, we notice that almost all networks can reconstruct a recognizable image.
Figure 4: Left: Comparison of image reconstruction ability between SSL and supervised measured by perceptual distance. Middle: Comparison of colour reconstruction ability measured by mean squared error. Right: Comparison of level of attentive diffusion. Source: [1]
Figure 5: Reconstructed images based on extracted features for all models on one image from five datasets. Source [1]
A4.2 The metric in Figure 4 (middle) indicates that the SSL models reconstruct colours with low fidelity, probably due to the colour augmentations used during pre-training. Again, Figure 5 helps visualize and better understand this result.
A4.3 Figure 4 (right) shows that the supervised baseline focuses its attention on certain smaller parts of the image, whereas the SSL models' attention is more diffuse. An example of the attention maps can be seen in Figure 6, which matches the quantitative results.
Figure 6: Focus maps for all models on one image from five datasets. Source [1]
A4.4 Figure 7 shows a correlation between SSL network performance and calibration, indicating that well-performing networks also produce more reliable confidence estimates.
Figure 7: Correlation between pre-training accuracy and calibration on the downstream tasks, measured by the expected calibration error. Each plot shows the many-shot setting, either with temperature scaling (right), as proposed by [44], or without (left). Source [1]
Own Review
In my opinion, this work achieves a comprehensive, well-explained, large-scale transferability comparison of self-supervised models. The authors formulate concrete questions at the beginning of the work which, when answered through experiments, offer the reader a deep understanding of and intuition for the impact of the obtained results. Moreover, the appendix explains the methodology down to the smallest detail, which conveys transparency. Throughout the study, the authors maintain a strong connection to the application domain, enabling the reader to quickly connect the theoretical aspects and results with real-world situations and challenges. To this end, they even propose a short guideline to help practitioners choose the most suitable model depending on the task and dataset at hand. I consider this a very useful and well-thought-out way of transmitting the findings and conclusions to other practitioners and researchers.
I identified the following weaknesses and improvement points:
- The choice of self-supervised methods, as well as most of the comparison modalities used in the different experiments, is not always justified.
- There is no discussion of whether, and to what extent, the differences in pre-training settings could affect the performance ranking of the models.
- The number of domain-specific datasets dissimilar to ImageNet used for comparison could be increased.
- Although not the main topic of the paper, it would be interesting to dive deeper into the analysis of the information retained in the features.
Shifting the focus to the medical domain, we have seen that the larger the dissimilarity between the pre-training and the target dataset, the lower the performance correlation. Therefore, a similar comparison and analysis approach could be adapted to domain-specific tasks, with self-supervised architectures pre-trained on medical datasets. A step in this direction has already been taken by [49]. Additionally, it would be interesting to determine whether, and if so to what extent, self-supervised pre-training on healthy samples could improve the identification or classification of pathological samples. Would the network also gain more confidence in its predictions? This future work direction would bring valuable insights into the untapped potential of SSL for medical image analysis.
References
[1] Ericsson, Linus, Henry Gouk, and Timothy M. Hospedales. "How Well Do Self-Supervised Models Transfer?." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
[2] Grill, Jean-Bastien, et al. "Bootstrap your own latent: A new approach to self-supervised learning." arXiv preprint arXiv:2006.07733 (2020).
[3] Tschandl, Philipp, Cliff Rosendahl, and Harald Kittler. "The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions." Scientific data 5.1 (2018): 1-9.
[4] Mohanty, Sharada P., David P. Hughes, and Marcel Salathé. "Using deep learning for image-based plant disease detection." Frontiers in plant science 7 (2016): 1419.
[5] Zhai, Xiaohua, et al. "A large-scale study of representation learning with the visual task adaptation benchmark." arXiv preprint arXiv:1910.04867 (2019).
[6] Kornblith, Simon, Jonathon Shlens, and Quoc V. Le. "Do better imagenet models transfer better?." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
[7] Goyal, Priya, et al. "Scaling and benchmarking self-supervised visual representation learning." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
[8] Kolesnikov, Alexander, Xiaohua Zhai, and Lucas Beyer. "Revisiting self-supervised visual representation learning." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
[9] Wu, Zhirong, et al. "Unsupervised feature learning via non-parametric instance-level discrimination." arXiv preprint arXiv:1805.01978 (2018).
[10] He, Kaiming, et al. "Momentum contrast for unsupervised visual representation learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
[11] Li, Junnan, et al. "Prototypical contrastive learning of unsupervised representations." arXiv preprint arXiv:2005.04966 (2020).
[12] Misra, Ishan, and Laurens van der Maaten. "Self-supervised learning of pretext-invariant representations." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
[13] Chen, Ting, et al. "A simple framework for contrastive learning of visual representations." International conference on machine learning. PMLR, 2020.
[14] Chen, Xinlei, et al. "Improved baselines with momentum contrastive learning." arXiv preprint arXiv:2003.04297 (2020).
[15] Chen, Ting, et al. "Big self-supervised models are strong semi-supervised learners." arXiv preprint arXiv:2006.10029 (2020).
[16] Asano, Yuki Markus, Christian Rupprecht, and Andrea Vedaldi. "Self-labelling via simultaneous clustering and representation learning." arXiv preprint arXiv:1911.05371 (2019).
[17] Caron, Mathilde, et al. "Unsupervised learning of visual features by contrasting cluster assignments." arXiv preprint arXiv:2006.09882 (2020).
[18] Tian, Yonglong, et al. "What makes for good views for contrastive learning?." arXiv preprint arXiv:2005.10243 (2020).
[19] Caron, Mathilde, et al. "Deep clustering for unsupervised learning of visual features." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
[20] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
[21] Maji, Subhransu, et al. "Fine-grained visual classification of aircraft." arXiv preprint arXiv:1306.5151 (2013).
[22] Fei-Fei, Li, Rob Fergus, and Pietro Perona. "Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories." 2004 conference on computer vision and pattern recognition workshop. IEEE, 2004.
[23] Krause, Jonathan, et al. "Collecting a large-scale dataset of fine-grained cars." (2013).
[24] Krizhevsky, Alex, and Geoffrey Hinton. "Learning multiple layers of features from tiny images." (2009): 7.
[25] Cimpoi, Mircea, et al. "Describing textures in the wild." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
[26] Nilsback, Maria-Elena, and Andrew Zisserman. "Automated flower classification over a large number of classes." 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing. IEEE, 2008.
[27] Bossard, Lukas, Matthieu Guillaumin, and Luc Van Gool. "Food-101–mining discriminative components with random forests." European conference on computer vision. Springer, Cham, 2014.
[28] Parkhi, Omkar M., et al. "Cats and dogs." 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012.
[29] Xiao, Jianxiong, et al. "Sun database: Large-scale scene recognition from abbey to zoo." 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, 2010.
[30] Everingham, Mark, et al. "The pascal visual object classes (voc) challenge." International journal of computer vision 88.2 (2010): 303-338.
[31] Guo, Yunhui, et al. "A broader study of cross-domain few-shot learning." European Conference on Computer Vision. Springer, Cham, 2020.
[32] Helber, Patrick, et al. "Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification." IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12.7 (2019): 2217-2226.
[33] Codella, Noel, et al. "Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic)." arXiv preprint arXiv:1902.03368 (2019).
[34] Wang, Xiaosong, et al. "Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[35] Snell, Jake, Kevin Swersky, and Richard S. Zemel. "Prototypical networks for few-shot learning." arXiv preprint arXiv:1703.05175 (2017).
[36] Ren, Shaoqing, et al. "Faster r-cnn: Towards real-time object detection with region proposal networks." Advances in neural information processing systems 28 (2015): 91-99.
[37] Girshick, Ross. "Fast r-cnn." Proceedings of the IEEE international conference on computer vision. 2015.
[38] Lin, Tsung-Yi, et al. "Feature pyramid networks for object detection." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[39] Silberman, Nathan, et al. "Indoor segmentation and support inference from rgbd images." European conference on computer vision. Springer, Berlin, Heidelberg, 2012.
[40] Zhao, Hengshuang, et al. "Pyramid scene parsing network." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[41] Zhou, Bolei, et al. "Semantic understanding of scenes through the ade20k dataset." International Journal of Computer Vision 127.3 (2019): 302-321.
[42] Xiao, Tete, et al. "Unified perceptual parsing for scene understanding." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
[43] Zhang, Richard, et al. "The unreasonable effectiveness of deep features as a perceptual metric." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
[44] Guo, Chuan, et al. "On calibration of modern neural networks." International Conference on Machine Learning. PMLR, 2017.
[45] Kuper, Lindsey, et al. "Toward scalable verification for safety-critical deep networks." arXiv preprint arXiv:1801.05950 (2018).
[46] Du, Mengnan, et al. "Fairness in deep learning: A computational perspective." IEEE Intelligent Systems (2020).
[47] Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European conference on computer vision. Springer, Cham, 2014.
[48] Liu, Dong C., and Jorge Nocedal. "On the limited memory BFGS method for large scale optimization." Mathematical programming 45.1 (1989): 503-528.
[49] Hosseinzadeh Taher, Mohammad Reza, et al. "A Systematic Benchmarking Analysis of Transfer Learning for Medical Image Analysis." Domain Adaptation and Representation Transfer, and Affordable Healthcare and AI for Resource Diverse Global Health. Springer, Cham, 2021. 3-13.