This is the blogpost for the paper "How Well Do Self-Supervised Models Transfer?" [1] by Linus Ericsson, Henry Gouk and Timothy M. Hospedales.
Blog post author: Miruna-Alexandra Gafencu
Introduction
Motivation & Problem Statement
Self-supervised learning (SSL) has the advantage of being able to learn from unlabelled datasets. Having reached the point where it can outcompete supervised approaches [2], SSL reduces the risk of overfitting by learning task-agnostic features, which leads to better generalization and scaling on downstream tasks and datasets. Researchers in multiple fields, such as medical [3] or agricultural [4] image processing, are interested in adapting and applying existing models trained on natural images to a variety of tasks. This interest creates the need to evaluate and compare the transfer capabilities of recent SSL approaches. For this purpose, the authors formulate four questions that outline the motivation behind their work:
- How well does SSL transfer compared to a supervised approach?
- Can we determine the best SSL method for all tasks and datasets?
- Is ImageNet classification performance representative for non-classification tasks on downstream datasets?
- What information is retained in the features of supervised and SSL models?
Related Work & Contribution
Previous studies have analysed all-purpose SSL [5], the transferability of supervised pre-trained models [6], the influence of different pre-training datasets and Convolutional Neural Network (CNN) architectures on downstream performance [7], and the suitability of CNN architectures for different pretext tasks [8]. However, no prior work has focused on the transferability of pre-trained self-supervised models across downstream tasks and datasets. The authors identify this gap and propose a large-scale comparison of diverse SSL approaches with a supervised model. They evaluate the ImageNet pre-trained models on classification, object detection and dense prediction using diverse datasets, covering both natural and domain-specific images.
Methodology & Experiments
Models
Thirteen self-supervised learning methods are compared to a supervised baseline. Figure 1 and Table 1 show the starting point of the comparison: the classification accuracy on ImageNet as well as the differences in pre-training, such as the number of epochs and the batch size.
Figure 1: ImageNet classification performance of compared models. Adapted from [1]
Method | Algorithm Type | Epochs | Batch Size |
---|---|---|---|
InsDis [9] | Contrastive | 200 | 256 |
MoCo-v1 [10] | Contrastive | 200 | 256 |
PCL-v1 [11] | Clustering | 200 | 256 |
PIRL [12] | Contrastive | 200 | 1024 |
PCL-v2 [11] | Clustering | 200 | 256 |
SimCLR-v1 [13] | Contrastive | 1000 | 4096 |
MoCo-v2 [14] | Contrastive | 800 | 256 |
SimCLR-v2 [15] | Contrastive | 800 | 4096 |
SeLa-v2 [16,17] | Clustering | 400 | 4096 |
InfoMin [18] | Contrastive | 800 | 256 |
BYOL [2] | Contrastive | 1000 | 4096 |
DeepCluster-v2 [17,19] | Clustering | 800 | 4096 |
SwAV [17] | Clustering | 800 | 4096 |
Supervised | - | 120 | 256 |
Table 1: Overview of compared self-supervised models and supervised baseline indicating the type of approach as well as pre-training specifications. Adapted from [1]
Network architecture
Supervised and self-supervised networks are composed of a backbone and a head, as shown in Figure 2. The backbone, a ResNet50 architecture [20], acts as the feature extractor for both the self-supervised and the supervised models. Sharing the same backbone architecture enables a fair comparison among models despite the pre-training differences listed above. Attached to the backbone is the head, a task-dependent block of layers. Training the backbone-head network on the target downstream dataset is performed in two ways: either by freezing the backbone and optimising only the head, or by fine-tuning the entire network (a minimal code sketch of both settings follows Figure 2).
Figure 2: Network architecture for both self-supervised and for supervised models. The backbone ResNet50 [20] block is the same for all experiments, whereas the head block is changed depending on the current downstream task.
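To make the two training regimes concrete, below is a minimal PyTorch sketch (my own illustration, not the authors' code) of combining a ResNet50 backbone with a task-specific head and either freezing or fine-tuning it. It assumes a recent torchvision, uses a randomly initialised backbone in place of the pre-trained SSL/supervised weights, and the linear classification head and `num_classes` value are placeholders.

```python
import torch.nn as nn
from torchvision import models

def build_transfer_model(num_classes: int, freeze_backbone: bool = True) -> nn.Module:
    # ResNet-50 backbone; here randomly initialised, whereas in the paper the weights
    # come from SSL or supervised ImageNet pre-training.
    backbone = models.resnet50(weights=None)
    feat_dim = backbone.fc.in_features        # 2048 for ResNet-50
    backbone.fc = nn.Identity()               # strip the original classification layer

    if freeze_backbone:
        # Linear-evaluation setting: only the head receives gradient updates.
        for param in backbone.parameters():
            param.requires_grad = False

    head = nn.Linear(feat_dim, num_classes)   # task-dependent head (here: classification)
    return nn.Sequential(backbone, head)

# Frozen-backbone (linear) model vs. end-to-end fine-tuning model.
linear_model = build_transfer_model(num_classes=101, freeze_backbone=True)
finetune_model = build_transfer_model(num_classes=101, freeze_backbone=False)
```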
Comparison Experiments & Datasets
1. Transferability to Downstream Tasks and Datasets
In this section, the conducted experiments are presented in terms of datasets, training and evaluation strategy as well as metrics used.
1.1 Many-shot Recognition
Datasets: The labelled datasets from [6] form a benchmark for image classification that is diverse in terms of object homogeneity, number of training samples and number of classes per dataset (Table 2). The authors refer to this dataset ensemble as “Kornblith”.
Dataset | Description | #classes | #total samples | #images/category | Classification task | Metric |
---|---|---|---|---|---|---|
FGVC Aircraft [21] | aircraft models | 102 | 10200 | 100 | fine-grained | mean per class accuracy |
Caltech-101 [22] | diverse objects | 101 | 9144 | 40-800 | coarse-grained | mean per class accuracy |
Stanford Cars [23] | cars | 196 | 16185 | diverse | fine-grained | top-1 accuracy |
CIFAR-10 [24] | diverse objects | 10 | 60000 | 6000 | coarse-grained | top-1 accuracy |
CIFAR-100 [24] | diverse objects | 100 | 60000 | 600 | coarse-grained | top-1 accuracy |
DTD [25] | describable textures | 47 | 5640 | 120 | texture | top-1 accuracy |
Oxford 102 Flowers [26] | flowers | 102 | 8189 | 40-258 | fine-grained | mean per class accuracy |
Food-101 [27] | food categories | 101 | 101000 | 1000 | fine-grained | top-1 accuracy |
Oxford-IIIT Pets [28] | cats & dogs | 37 | 7349 | around 200 | coarse-grained | mean per class accuracy |
SUN397 [29] | scenes | 397 | 108753 | at least 100 | scenes | top-1 accuracy |
Pascal VOC2007 [30] | diverse objects | 20 | 2913 (training & validation) | diverse | coarse-grained | 11-point mAP metric [30] |
Table 2: Overview of the "Kornblith" benchmark [6] used for many-shot image classification.
Training and evaluation are performed in two ways:
- Linear: The extracted features are used as input for training a logistic regression model to predict one of the available classes, i.e. multinomial logistic regression (a minimal sketch follows below). For optimization, the quasi-Newton method Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) [48] is used to minimize a softmax cross-entropy objective function.
- Fine-tuning: The whole network is fine-tuned for 5000 iterations with a batch size of 64 using stochastic gradient descent with Nesterov momentum. Note that the number of iterations refers to the number of optimization steps (mini-batches), not the number of epochs.
Metric: see last column of Table 2.
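To illustrate the linear protocol, here is a hedged sketch (not the authors' code) that fits a multinomial logistic regression with scikit-learn's L-BFGS solver on pre-extracted features. The random `train_feats`/`train_labels` and the regularisation strength `C=1.0` are placeholders; in the real pipeline the features would come from the frozen ResNet50 backbone and `C` would be tuned on a validation split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder features: in the real pipeline these come from the frozen ResNet50 backbone.
rng = np.random.default_rng(0)
train_feats, train_labels = rng.normal(size=(1000, 2048)), rng.integers(0, 10, size=1000)
test_feats, test_labels = rng.normal(size=(200, 2048)), rng.integers(0, 10, size=200)

# Multinomial logistic regression optimised with L-BFGS (softmax cross-entropy + L2 penalty).
clf = LogisticRegression(solver="lbfgs", C=1.0, max_iter=1000)
clf.fit(train_feats, train_labels)

print("Top-1 accuracy:", clf.score(test_feats, test_labels))
```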
1.2 Few-shot recognition
Datasets: Kornblith without Pascal VOC is used, together with the benchmark proposed by the Broader Study of Cross-Domain Few-Shot Learning (CD-FSL) [31] (Table 3). The CD-FSL datasets are increasingly dissimilar to natural images.
Dataset | Description | #classes | #total samples | #images/category |
---|---|---|---|---|
CropDiseases[4] | Plant diseases | 38 | 58200 | 500 - 5500 |
EuroSat[32] | Satellite images | 10 | 27000 | 2000-3000 |
ISIC2018[3,33] | Skin lesions | 8 | 2594 | diverse |
ChestX[34] | Chest X-ray images | 8 | 108948 | diverse |
Table 3: Overview of the CD-FSL [31] dataset.
Training and evaluation: A prototypical network [35] is an algorithm that learns to make correct predictions when only a limited number of labelled samples is available. It is based on the idea that there exists an embedding space in which inputs with similar features form clusters, so each class can be represented by its cluster centroid (prototype). The authors use prototypical networks in a 5-way 20-shot setting (5 randomly chosen classes, 20 support samples per class used to compute the prototypes) with 600 episodes (or iterations) and 15 query images (a minimal sketch of one episode follows below).
Metric: Average accuracy across the episodes together with 95% confidence interval.
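For intuition, the following is a minimal sketch of a single prototypical-network episode (my own simplification, not the authors' implementation): class prototypes are the mean support embeddings and each query is assigned to the nearest prototype. The 64-dimensional random features are placeholders for the backbone's embeddings.

```python
import torch

def prototypical_episode(support, support_labels, queries, n_way=5):
    """support: (n_way * k_shot, d) embeddings, queries: (q, d) embeddings."""
    # One prototype per class: the centroid of that class's support embeddings.
    prototypes = torch.stack([support[support_labels == c].mean(dim=0) for c in range(n_way)])
    # Squared Euclidean distance between every query and every prototype.
    dists = torch.cdist(queries, prototypes) ** 2
    # Nearest prototype gives the predicted class (softmax over -dists would give probabilities).
    return dists.argmin(dim=1)

# Toy episode: 5-way 20-shot with placeholder 64-dimensional features.
d, k_shot, n_way, n_query = 64, 20, 5, 15
support = torch.randn(n_way * k_shot, d)
support_labels = torch.arange(n_way).repeat_interleave(k_shot)
queries = torch.randn(n_way * n_query, d)
preds = prototypical_episode(support, support_labels, queries)
print(preds.shape)  # (75,) predicted class indices
```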
1.3 Object detection
Dataset: Pascal VOC; trainval2007 & trainval2012 are used for training and test2007 for testing.
Training and evaluation: Faster R-CNN [36] is an object detection algorithm that improves on the earlier Fast R-CNN [37] by using a network to predict the region proposals in which object instances could be found. To create the input feature maps for this network, the authors use a Feature Pyramid Network [38]. Training is performed either by freezing the backbone except for the last residual layer, or by fine-tuning all layers end-to-end.
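As a rough analogue of this setup (not the authors' exact configuration), torchvision's Faster R-CNN with a ResNet50 + FPN backbone can be instantiated as below, assuming a recent torchvision. The `trainable_backbone_layers` argument controls how much of the backbone is unfrozen: 1 roughly matches the "frozen except the last residual block" setting and 5 matches end-to-end fine-tuning. Note that torchvision initialises the backbone with its own ImageNet weights here, whereas in the paper the SSL or supervised pre-trained weights would be loaded instead.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Faster R-CNN with a ResNet50 + FPN backbone; 21 classes = 20 VOC categories + background.
model = fasterrcnn_resnet50_fpn(weights=None, num_classes=21, trainable_backbone_layers=1)

model.eval()
with torch.no_grad():
    # Dummy forward pass: a list of 3xHxW images with values in [0, 1].
    outputs = model([torch.rand(3, 480, 640)])
print(outputs[0].keys())  # dict with 'boxes', 'labels', 'scores'
```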
1.4 Dense prediction - surface normal estimation
Dataset: NYUv2 [39] contains 1449 labelled indoor scenes in the form of pairs of RGB and depth images.
Training and evaluation: The PSPNet [40] architecture takes the global context of the image into consideration to improve prediction. For example, if the model needs to choose whether an object is a car or a boat, the surroundings should be taken into account: if the object is situated in a lake, it is more likely a boat than a car. The authors use PSPNet in combination with the ResNet50 backbone.
Metric: Mean and median angular error, and the percentage of estimations within 11.25°, 22.5° and 30° of the ground truth.
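These angular-error metrics can be computed as in the small NumPy sketch below (my own illustration), assuming predicted and ground-truth normals are given as unit vectors for each valid pixel; the random normals are placeholders.

```python
import numpy as np

def surface_normal_metrics(pred, gt):
    """pred, gt: (N, 3) arrays of unit surface normals (one row per valid pixel)."""
    # Angle between predicted and ground-truth normal, in degrees.
    cos_sim = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    angles = np.degrees(np.arccos(cos_sim))
    return {
        "mean": angles.mean(),
        "median": np.median(angles),
        "<11.25": np.mean(angles < 11.25) * 100,
        "<22.5": np.mean(angles < 22.5) * 100,
        "<30": np.mean(angles < 30.0) * 100,
    }

# Toy example with random unit normals.
rng = np.random.default_rng(0)
pred = rng.normal(size=(1000, 3)); pred /= np.linalg.norm(pred, axis=1, keepdims=True)
gt = rng.normal(size=(1000, 3)); gt /= np.linalg.norm(gt, axis=1, keepdims=True)
print(surface_normal_metrics(pred, gt))
```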
1.5 Dense prediction - semantic segmentation
Dataset: ADE20K [41] contains 27000 segmented images of diverse scenes.
Training and evaluation: UPerNet [42] aims at understanding, recognizing and correlating as many concepts in an image as possible, similar to how humans understand the visual world at multiple levels. An image is assigned a general scene label (e.g. "living room"), followed by object localization (e.g. ceiling, wall, sofa, TV). Going from general to detailed, the parts of the objects are identified, then the materials and finally even the textures. The authors use the CSAIL Semantic Segmentation framework [42], which implements this architecture.
Metrics:
- Intersection over Union (IoU), for which a minimal computation sketch follows this list
- Accuracy
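A minimal sketch of how per-class IoU and pixel accuracy can be computed from a confusion matrix (my own illustration, not the evaluation code of the framework); the label maps and class count are placeholders.

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """pred, gt: integer label maps of the same shape."""
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    conf = np.bincount(gt.ravel() * num_classes + pred.ravel(),
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    iou = intersection / np.maximum(union, 1)      # per-class IoU
    pixel_acc = intersection.sum() / conf.sum()    # overall pixel accuracy
    return iou.mean(), pixel_acc

# Toy example with 4 classes on a 64x64 label map.
rng = np.random.default_rng(0)
pred = rng.integers(0, 4, size=(64, 64))
gt = rng.integers(0, 4, size=(64, 64))
print(segmentation_metrics(pred, gt, num_classes=4))
```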
2. Feature information analysis
To determine which types of information are captured in the features, the authors reconstruct RGB images from the extracted features and compare the reconstructions with the original images from the recognition datasets. Reconstruction quality is measured with the two metrics below (a small sketch of both follows the list).
Metrics:
- Perceptual distance [43]
- Per pixel mean squared error
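A small sketch of both metrics on a pair of original and reconstructed images, assuming the `lpips` package that implements the perceptual distance of [43] (inputs are expected as tensors scaled to [-1, 1]); the random images are placeholders.

```python
import torch
import lpips  # pip install lpips; implements the perceptual distance of [43]

# Placeholder original and reconstructed images: (1, 3, H, W) tensors scaled to [-1, 1].
original = torch.rand(1, 3, 224, 224) * 2 - 1
reconstruction = torch.rand(1, 3, 224, 224) * 2 - 1

# Perceptual distance: compares deep features of the two images (AlexNet variant).
perceptual = lpips.LPIPS(net="alex")
print("Perceptual distance:", perceptual(original, reconstruction).item())

# Per-pixel mean squared error, the second reconstruction metric.
print("Per-pixel MSE:", torch.mean((original - reconstruction) ** 2).item())
```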
Model calibration [44] indicates how well the prediction confidence matches the actual accuracy. Uncertain predictions should be flagged by the model so that further measures can be taken to avoid unwanted consequences, such as safety hazards [45] or social bias [46].
Metric: Expected calibration error [44]
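The expected calibration error bins predictions by confidence and averages the gap between confidence and accuracy across bins; below is a minimal NumPy sketch following the definition in [44] (the bin count and the random predictions are placeholders).

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """confidences: max softmax probability per sample; predictions/labels: class indices."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            bin_acc = np.mean(predictions[in_bin] == labels[in_bin])
            bin_conf = np.mean(confidences[in_bin])
            # Weight each bin's |accuracy - confidence| gap by the fraction of samples in it.
            ece += np.abs(bin_acc - bin_conf) * in_bin.mean()
    return ece

# Toy example: 1000 predictions over 10 classes.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)
predictions = rng.integers(0, 10, size=1000)
confidences = rng.uniform(0.1, 1.0, size=1000)
print("ECE:", expected_calibration_error(confidences, predictions, labels))
```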
Model attention is measured using an occlusion-based method similar to [47]. The main idea is to occlude patches of an image and to compare the resulting features with those of the initial, unoccluded image. If occluding a certain patch leads to a large distance between the features, the model pays a lot of attention to that region.
Metric: Percentage of attention-map pixels whose distance is above the average of the whole attention map.
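The occlusion procedure can be sketched as follows (a simplified illustration, not the authors' exact protocol): each patch is zeroed out in turn, and the feature distance to the unoccluded image becomes that patch's attention value. The patch size and the random input are placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

def occlusion_attention(model: nn.Module, image: torch.Tensor, patch: int = 32) -> torch.Tensor:
    """image: (1, 3, H, W). Returns a coarse attention map of shape (H//patch, W//patch)."""
    model.eval()
    with torch.no_grad():
        base_feat = model(image)
        _, _, H, W = image.shape
        attention = torch.zeros(H // patch, W // patch)
        for i in range(H // patch):
            for j in range(W // patch):
                occluded = image.clone()
                # Zero out one patch and measure how much the feature vector moves.
                occluded[:, :, i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = 0.0
                attention[i, j] = torch.norm(model(occluded) - base_feat)
    return attention

# Example with a ResNet50 feature extractor (classification layer removed).
backbone = models.resnet50(weights=None)
backbone.fc = nn.Identity()
attn = occlusion_attention(backbone, torch.rand(1, 3, 224, 224))
# Diffusion-style metric: fraction of patches whose distance exceeds the map's mean.
print("Attentive diffusion:", (attn > attn.mean()).float().mean().item())
```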
Results and Conclusions
Transfer performance
The transfer performance results for each experiment (reported in detail in [1]) enable us to answer the following questions:
Q1. How well does SSL transfer compared to a supervised baseline?
A1. In all experiments except for the few-shot on Kornblith, the supervised baseline is on average outperformed by at least one SSL approach.
Q2. Can we determine the best SSL method for all tasks and datasets?
A2. No. Depending on the task and dataset, different methods perform best, so no single method learns features that are optimal for all tasks and datasets.
Q3. Is ImageNet classification performance always representative for non-classification tasks on downstream datasets?
A3. No, it is not. The results show that the correlation between ImageNet and downstream performance weakens as fewer labelled samples are available, as the downstream dataset becomes less similar to ImageNet, and as the prediction task becomes more fine-grained. The correlation is high for recognition tasks, but weak for detection and dense prediction, or even non-existent for the latter. This has two implications. Firstly, it is not always advisable for a deep learning practitioner to choose the current ImageNet benchmark leader; to this end, the authors provide a short guideline, summarized in Table 4. Secondly, the computer vision community needs to adopt a larger, more diverse benchmark.
Figure 3: Correlation between ImageNet and downstream performance where each plot represents one experiment. On the X-axis we can see the ImageNet accuracy and on the Y-axis, the average logit-transformed transfer performance. In the upper left corner, Pearson’s correlation coefficient is also shown. The linear regression line in blue indicates a correlation and is surrounded by a lighter blue confidence interval. Source: [1]
Should you choose the current ImageNet benchmark leader?

Dataset type / Task at hand | Recognition - many-shot | Recognition - few-shot | Object detection / Dense prediction |
---|---|---|---|
Structured, ImageNet-like | Yes | Maybe, supervised might outperform | |
Structured, not ImageNet-like | Yes | Yes | |
Unstructured or textures | Cannot say | Cannot say | |
Table 4: Overview of guideline that the authors propose for choosing the most suitable approach. Adapted from [1]
Feature information analysis
Figures 4, 5, 6 & 7 visualize, both quantitatively and qualitatively, the information that the models retain in their features. These results help us answer the last question.
Q4. What information is retained in the features of supervised and SSL models?
A4.1 The perceptual distance in Figure 4 (left) indicates that the supervised model reconstructs images better. We can deduce the same result qualitatively from Figure 5. However, we notice that almost all networks can reconstruct a recognizable image.
Figure 4: Left: Comparison of image reconstruction ability between SSL and supervised measured by perceptual distance. Middle: Comparison of colour reconstruction ability measured by mean squared error. Right: Comparison of level of attentive diffusion. Source: [1]
Figure 5: Reconstructed images based on extracted features for all models on one image from five datasets. Source [1]
A4.2 The metric in Figure 4 (middle) indicates that the SSL models reconstruct colours with low fidelity, probably due to the colour augmentations used during pre-training. Again, Figure 5 helps visualize and better understand this result.
A4.3 Figure 4 (right) shows that the supervised baseline focuses its attention on certain smaller parts of the image, whereas the SSL models' attention is more diffuse. An example of the attention maps can be seen in Figure 6, which matches the quantitative results.
Figure 6: Focus maps for all models on one image from five datasets. Source [1]
A4.4 Figure 7 shows a correlation between SSL network performance and calibration, indicating that well-performing networks also produce more reliable confidence estimates.
Figure 7: Correlation between pre-training accuracy and calibration on the downstream tasks, measured by the expected calibration error. Each plot shows the many-shot setting, either with temperature scaling (right), as proposed by [44], or without (left). Source [1]
Own Review
In my opinion, this work achieves a comprehensive, well-explained, large-scale transferability comparison of self-supervised models. The authors formulate concrete questions at the beginning of the work which, when answered through experiments, offer the reader a deep understanding of and intuition for the impact of the obtained results. Moreover, the appendix explains the methodology down to the smallest detail, which conveys transparency. Throughout the study, the authors maintain a strong connection to the application domain, enabling the reader to quickly connect the theoretical aspects and results with real-world situations and challenges. To this end, they even propose a short guideline to help practitioners choose the most suitable model depending on the task and dataset at hand. I consider this a very useful and well-thought-out way of transmitting the findings and conclusions to other practitioners and researchers.
I identified the following weaknesses and improvement points:
- The choice of self-supervised methods, as well as most of the comparison modalities used in the different experiments, is not always justified.
- There is no discussion of whether, and to what extent, the differences in pre-training settings could affect the performance ranking of the models.
- The number of domain-specific datasets dissimilar to ImageNet used for comparison could be increased.
- Although not the main topic of the paper, it would be interesting to dive deeper into the analysis of the information retained in the features.
Shifting the focus to the medical domain, we have seen that the larger the dissimilarity between the pre-training and the target dataset, the lower the performance correlation. Therefore, a similar comparison and analysis approach could be adapted to domain-specific tasks, with self-supervised architectures pre-trained on medical datasets. A step in this direction has already been taken by [49]. Additionally, it would be interesting to determine whether, and if so to what extent, self-supervised pre-training on healthy samples could improve the identification or classification of pathological samples. Would the network also gain more confidence in its predictions? This future work direction would bring valuable insights into the untapped potential of SSL for medical image analysis.
References
[1] Ericsson, Linus, Henry Gouk, and Timothy M. Hospedales. "How Well Do Self-Supervised Models Transfer?." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
[2] Grill, Jean-Bastien, et al. "Bootstrap your own latent: A new approach to self-supervised learning." arXiv preprint arXiv:2006.07733 (2020).
[3] Tschandl, Philipp, Cliff Rosendahl, and Harald Kittler. "The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions." Scientific data 5.1 (2018): 1-9.
[4] Mohanty, Sharada P., David P. Hughes, and Marcel Salathé. "Using deep learning for image-based plant disease detection." Frontiers in plant science 7 (2016): 1419.
[5] Zhai, Xiaohua, et al. "A large-scale study of representation learning with the visual task adaptation benchmark." arXiv preprint arXiv:1910.04867 (2019).
[6] Kornblith, Simon, Jonathon Shlens, and Quoc V. Le. "Do better imagenet models transfer better?." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
[7] Goyal, Priya, et al. "Scaling and benchmarking self-supervised visual representation learning." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
[8] Kolesnikov, Alexander, Xiaohua Zhai, and Lucas Beyer. "Revisiting self-supervised visual representation learning." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
[9] Wu, Zhirong, et al. "Unsupervised feature learning via non-parametric instance-level discrimination." arXiv preprint arXiv:1805.01978 (2018).
[10] He, Kaiming, et al. "Momentum contrast for unsupervised visual representation learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
[11] Li, Junnan, et al. "Prototypical contrastive learning of unsupervised representations." arXiv preprint arXiv:2005.04966 (2020).
[12] Misra, Ishan, and Laurens van der Maaten. "Self-supervised learning of pretext-invariant representations." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
[13] Chen, Ting, et al. "A simple framework for contrastive learning of visual representations." International conference on machine learning. PMLR, 2020.
[14] Chen, Xinlei, et al. "Improved baselines with momentum contrastive learning." arXiv preprint arXiv:2003.04297 (2020).
[15] Chen, Ting, et al. "Big self-supervised models are strong semi-supervised learners." arXiv preprint arXiv:2006.10029 (2020).
[16] Asano, Yuki Markus, Christian Rupprecht, and Andrea Vedaldi. "Self-labelling via simultaneous clustering and representation learning." arXiv preprint arXiv:1911.05371 (2019).
[17] Caron, Mathilde, et al. "Unsupervised learning of visual features by contrasting cluster assignments." arXiv preprint arXiv:2006.09882 (2020).
[18] Tian, Yonglong, et al. "What makes for good views for contrastive learning?." arXiv preprint arXiv:2005.10243 (2020).
[19] Caron, Mathilde, et al. "Deep clustering for unsupervised learning of visual features." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
[20] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
[21] Maji, Subhransu, et al. "Fine-grained visual classification of aircraft." arXiv preprint arXiv:1306.5151 (2013).
[22] Fei-Fei, Li, Rob Fergus, and Pietro Perona. "Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories." 2004 conference on computer vision and pattern recognition workshop. IEEE, 2004.
[23] Krause, Jonathan, et al. "Collecting a large-scale dataset of fine-grained cars." (2013).
[24] Krizhevsky, Alex, and Geoffrey Hinton. "Learning multiple layers of features from tiny images." (2009): 7.
[25] Cimpoi, Mircea, et al. "Describing textures in the wild." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
[26] Nilsback, Maria-Elena, and Andrew Zisserman. "Automated flower classification over a large number of classes." 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing. IEEE, 2008.
[27] Bossard, Lukas, Matthieu Guillaumin, and Luc Van Gool. "Food-101–mining discriminative components with random forests." European conference on computer vision. Springer, Cham, 2014.
[28] Parkhi, Omkar M., et al. "Cats and dogs." 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012.
[29] Xiao, Jianxiong, et al. "Sun database: Large-scale scene recognition from abbey to zoo." 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, 2010.
[30] Everingham, Mark, et al. "The pascal visual object classes (voc) challenge." International journal of computer vision 88.2 (2010): 303-338.
[31] Guo, Yunhui, et al. "A broader study of cross-domain few-shot learning." European Conference on Computer Vision. Springer, Cham, 2020.
[32] Helber, Patrick, et al. "Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification." IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12.7 (2019): 2217-2226.
[33] Codella, Noel, et al. "Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic)." arXiv preprint arXiv:1902.03368 (2019).
[34] Wang, Xiaosong, et al. "Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[35] Snell, Jake, Kevin Swersky, and Richard S. Zemel. "Prototypical networks for few-shot learning." arXiv preprint arXiv:1703.05175 (2017).
[36] Ren, Shaoqing, et al. "Faster r-cnn: Towards real-time object detection with region proposal networks." Advances in neural information processing systems 28 (2015): 91-99.
[37] Girshick, Ross. "Fast r-cnn." Proceedings of the IEEE international conference on computer vision. 2015.
[38] Lin, Tsung-Yi, et al. "Feature pyramid networks for object detection." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[39] Silberman, Nathan, et al. "Indoor segmentation and support inference from rgbd images." European conference on computer vision. Springer, Berlin, Heidelberg, 2012.
[40] Zhao, Hengshuang, et al. "Pyramid scene parsing network." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[41] Zhou, Bolei, et al. "Semantic understanding of scenes through the ade20k dataset." International Journal of Computer Vision 127.3 (2019): 302-321.
[42] Xiao, Tete, et al. "Unified perceptual parsing for scene understanding." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
[43] Zhang, Richard, et al. "The unreasonable effectiveness of deep features as a perceptual metric." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
[44] Guo, Chuan, et al. "On calibration of modern neural networks." International Conference on Machine Learning. PMLR, 2017.
[45] Kuper, Lindsey, et al. "Toward scalable verification for safety-critical deep networks." arXiv preprint arXiv:1801.05950 (2018).
[46] Du, Mengnan, et al. "Fairness in deep learning: A computational perspective." IEEE Intelligent Systems (2020).
[47] Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European conference on computer vision. Springer, Cham, 2014.
[48] Liu, Dong C., and Jorge Nocedal. "On the limited memory BFGS method for large scale optimization." Mathematical programming 45.1 (1989): 503-528.
[49] Hosseinzadeh Taher, Mohammad Reza, et al. "A Systematic Benchmarking Analysis of Transfer Learning for Medical Image Analysis." Domain Adaptation and Representation Transfer, and Affordable Healthcare and AI for Resource Diverse Global Health. Springer, Cham, 2021. 3-13.