Introduction
Training without supervision in the field of computer vision has become increasingly well-researched in recent years due to its distinct advantages. To begin with, labeled data is not required, which simplifies data acquisition and enables the creation of large datasets. This creates opportunities for foundation models capable of generating high-quality and versatile visual features that work in a zero-shot manner across various image distributions and tasks. Moreover, self-supervised training yields a richer learning signal than supervised training, as the rich visual information present in images is not reduced to a single label from a predefined set.
Within the medical context, the segmentation task is of particular interest. Medical imaging technologies such as computed tomography (CT), magnetic resonance imaging (MRI), and ultrasound are commonly used in all hospitals. Segmentation of these images is a fundamental step in various medical tasks such as disease diagnosis, surgical planning and image-guided surgery, and monitoring of disease progression.
DINO
DINO (self-distillation with no labels) [1] is a self-supervised training framework for computer vision tasks capable of producing high-quality features that can be utilized for various downstream tasks, including segmentation. This is accomplished by training the architecture to map different views of one image to the same feature vector, while also ensuring that distinct images are mapped to different feature vectors.
DINOv1
Overview
The DINO training framework is depicted in Figure 1 and consists of a student and a teacher network. Both share the same architecture but have different parameters. During training, two views of an input image are created by applying random transformations; one is passed to the student and the other to the teacher. The output feature vectors of the two networks are compared using a cross-entropy loss, and only the student network is optimized. This drives both image views to be mapped to the same feature vector, with the student output matching that of the teacher. Finally, the newly optimized student parameters are used to update the teacher parameters via an exponential moving average. In practice, this results in the teacher performing better than the student. For downstream tasks, only parts of the student or teacher are used. The details of this architecture and the training process are explained in the following sections.
Figure 1: Overview of DINO training framework [1].
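To make this training loop concrete, below is a minimal PyTorch-style sketch of a single step with two views; the models, optimizer, and temperature values are illustrative placeholders rather than the original implementation, and the centering of the teacher output is omitted here (it is discussed in the section on avoiding collapse).

```python
import torch
import torch.nn.functional as F

def dino_step(student, teacher, view1, view2, optimizer,
              student_temp=0.1, teacher_temp=0.04, momentum=0.996):
    # Teacher targets: sharpened softmax outputs, no gradients flow into the teacher.
    with torch.no_grad():
        t1 = F.softmax(teacher(view1) / teacher_temp, dim=-1)
        t2 = F.softmax(teacher(view2) / teacher_temp, dim=-1)

    # Student predictions as log-probabilities.
    s1 = F.log_softmax(student(view1) / student_temp, dim=-1)
    s2 = F.log_softmax(student(view2) / student_temp, dim=-1)

    # Cross-entropy between crossed view pairs; only the student is optimized.
    loss = -(t1 * s2).sum(dim=-1).mean() - (t2 * s1).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The teacher follows the student via an exponential moving average of the weights.
    with torch.no_grad():
        for p_s, p_t in zip(student.parameters(), teacher.parameters()):
            p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)
    return loss.item()
```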
Creating and Distributing Views and Loss Function
Various views of a single image are created by cropping and applying Gaussian blur, color jittering, and solarization. In particular, two global crops covering more than 50% of the original image, as well as several smaller local crops covering less than 50%, are generated (Figure 2). All views are passed through the student, while only the global views are passed to the teacher (Figure 3). This encourages the model to learn “local-to-global” correspondences. The cross-entropy loss is computed between all pairs of student and teacher outputs, except where both outputs correspond to the same view (Figure 4). The sum of these losses is used to optimize the student only.
Figure 2: Creating views to train DINO according to [1]. From the original image, two global views (covering more than 50% of the original image) and multiple local views (covering less than 50% of the original image) are created.
Figure 3: Distributing views between student and teacher according to [1].
Figure 4: Loss function of DINO [1].
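As a hedged sketch of this multi-crop loss: teacher probabilities are computed only for the global views, student log-probabilities for all views, and pairs where both outputs correspond to the same view are skipped. Variable names and temperature values below are illustrative.

```python
import torch.nn.functional as F

def multicrop_loss(student_logits, teacher_logits, student_temp=0.1, teacher_temp=0.04):
    """student_logits: list of tensors, one per view (global crops first).
    teacher_logits: list of tensors for the two global crops only."""
    teacher_probs = [F.softmax(t / teacher_temp, dim=-1).detach() for t in teacher_logits]
    student_logp = [F.log_softmax(s / student_temp, dim=-1) for s in student_logits]

    loss, n_terms = 0.0, 0
    for t_idx, t in enumerate(teacher_probs):      # teacher sees global views only
        for s_idx, s in enumerate(student_logp):   # student sees all views
            if s_idx == t_idx:                     # skip pairs built from the same view
                continue
            loss = loss + (-(t * s).sum(dim=-1)).mean()
            n_terms += 1
    return loss / n_terms
```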
Update Rule of the Teacher
The teacher is updated via an exponential moving average [2–4] of the student weights:
θ_t ← λ θ_t + (1 - λ) θ_s, with λ following a cosine schedule from 0.996 to 1.
This is a model ensembling technique that has been shown to enhance performance [5]. During training, the teacher consistently outperforms the student (Figure 5) and guides the student by providing higher quality target features. As a result, this process creates a cycle where the student and teacher improve each other continually.
Figure 5: Top-1 validation accuracy of student and teacher during training [1].
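The update rule above can be sketched as follows; the exact form of the cosine schedule and the parameter names are assumptions for illustration, not taken from the official code.

```python
import math
import torch

def momentum_schedule(step, total_steps, base=0.996, final=1.0):
    # Cosine increase of the EMA coefficient lambda from `base` to `final`.
    return final - (final - base) * (math.cos(math.pi * step / total_steps) + 1) / 2

@torch.no_grad()
def update_teacher(student, teacher, lam):
    # theta_t <- lam * theta_t + (1 - lam) * theta_s
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(lam).add_(p_s, alpha=1 - lam)
```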
Avoiding Collapse
The loss is exclusively calculated from views that originate from the same image. For this reason, both student and teacher could learn the mathematically perfect solution of mapping every input to the same constant feature vector. Avoiding this is crucial, as a constant output is not useful for any downstream tasks.
Two forms of collapse can occur in practice: First, regardless of the input, the feature output could be mapped to a uniform distribution along all dimensions. Second, the feature output could be dominated by one dimension.
To avoid collapse, centering and sharpening are applied to the teacher output, and their effects are balanced against each other. Centering adds a bias term, maintained as an exponential moving average of the teacher output, which prevents any single dimension from dominating but encourages collapse towards the uniform distribution. Sharpening is achieved by using a low temperature in the teacher's softmax normalization, which has the exact opposite effect of centering.
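The interplay of centering and sharpening can be illustrated with the following sketch, in which the center is maintained as an exponential moving average of the teacher's batch outputs and a low temperature sharpens the softmax; the momentum and temperature values are illustrative.

```python
import torch
import torch.nn.functional as F

class TeacherPostprocess:
    """Sketch of centering and sharpening applied to the teacher output."""
    def __init__(self, out_dim, center_momentum=0.9, teacher_temp=0.04):
        self.center = torch.zeros(out_dim)
        self.center_momentum = center_momentum
        self.teacher_temp = teacher_temp

    @torch.no_grad()
    def __call__(self, teacher_output):
        # Sharpening: a low temperature pushes the distribution towards a single peak.
        probs = F.softmax((teacher_output - self.center) / self.teacher_temp, dim=-1)
        # Centering: update the bias as an EMA of the batch mean of the teacher output.
        batch_center = teacher_output.mean(dim=0)
        self.center = (self.center_momentum * self.center
                       + (1 - self.center_momentum) * batch_center)
        return probs
```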
Network Architecture
The exact same network architecture is used for both student and teacher. It consists of a backbone followed by a projection head. As mentioned previously, the loss function of this architecture compares different views of the same image. The backbone acts as a feature extractor, producing view-specific features from the input. The projection head maps these view-specific features to the same output vector for all views of an image, effectively learning that different views originate from the same image. Intuitively, this means the projection head solves something akin to a classification task on its own.
When DINO is applied to downstream tasks, the projection head is discarded and just the backbone is used, since only the training procedure requires a comparison of different views of the same image.
While the backbone can take any form, this implementation utilizes either a Residual Network (ResNet) [6] or a Vision Transformer (ViT) [7]. If a ViT is used, the backbone's output is only the class token [CLS]. The projection head is a 3-layer multi-layer perceptron (MLP).
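A minimal sketch of such a network, using a ViT-style backbone whose [CLS] feature feeds a 3-layer MLP projection head, is shown below; the layer widths and output dimension are placeholders, and normalization details of the original head are omitted.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Simplified 3-layer MLP projection head (normalization details omitted)."""
    def __init__(self, in_dim=768, hidden_dim=2048, out_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.mlp(x)

class DINONetwork(nn.Module):
    """Backbone followed by projection head; downstream tasks keep only the backbone."""
    def __init__(self, backbone, embed_dim=768):
        super().__init__()
        self.backbone = backbone
        self.head = ProjectionHead(in_dim=embed_dim)

    def forward(self, x):
        cls_feature = self.backbone(x)   # assumed to return the [CLS] token feature
        return self.head(cls_feature)
```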
DINOv2
Overview
DINOv2 [8] scales DINOv1 in terms of model size and amount of training data. Training is stabilized by adapting the architecture and accelerated through hardware-aware implementations and model distillation, among other techniques. As a result, training is faster and requires less memory.
To enable the model to learn high-quality features, diverse and curated datasets are required for training. However, curated datasets for computer vision are typically limited in size and cannot provide sufficient data for large-scale self-supervised training. Hence, an automatic pipeline capable of generating a diverse dataset from large amounts of uncurated data is introduced. Using this pipeline, a dataset containing 142 million images was created and used for training.
In DINOv2, the backbone always consists of a ViT since it performed best in DINOv1.
Image-level and Patch-level Objectives
DINOv2 introduces several changes to the DINOv1 architecture, most notably a patch-level [9] objective in addition to the image-level objective. The image-level objective (Figure 6) is carried over from DINOv1 and uses only the features from the class token [CLS] to compute the loss. The patch-level objective (Figure 7) utilizes the patch feature outputs of the ViT. To be specific, some patches of the student input are masked out before being fed into the Transformer, while all patches of the teacher remain unchanged. Afterward, a cross-entropy loss between the corresponding patch features of student and teacher is computed and added to the image-level loss.
Figure 6: Image-level objective of DINOv2 according to [1, 8].
Figure 7: Patch-level objective of DINOv2 according to [8, 9].
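A hedged sketch of the patch-level term: the student's logits come from a masked input, the teacher's from the unmasked one, and the cross-entropy is averaged only over the masked patch positions; the projection heads and masking mechanics are simplified away.

```python
import torch
import torch.nn.functional as F

def patch_level_loss(student_patch_logits, teacher_patch_logits, mask,
                     student_temp=0.1, teacher_temp=0.04):
    """student/teacher_patch_logits: (batch, n_patches, out_dim), computed from the
    masked student input and the unmasked teacher input.
    mask: (batch, n_patches) bool, True where the student's patch was masked out."""
    teacher_probs = F.softmax(teacher_patch_logits / teacher_temp, dim=-1).detach()
    student_logp = F.log_softmax(student_patch_logits / student_temp, dim=-1)

    per_patch_ce = -(teacher_probs * student_logp).sum(dim=-1)   # (batch, n_patches)
    # Average the cross-entropy over the masked positions only.
    return (per_patch_ce * mask).sum() / mask.sum().clamp(min=1)
```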
Additional Modifications
Three additional modifications are applied to further enhance performance. First, the softmax-centering of the teacher is replaced by Sinkhorn-Knopp centering [10]. Second, the KoLeo regularizer [11] is applied to encourage a uniform feature distribution within a batch. Third, the input image resolution is increased during the final stage of training for a brief duration [12].
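As an illustration of the KoLeo regularizer, the sketch below penalizes features whose nearest neighbor within the batch is very close, thereby spreading the l2-normalized features apart; it follows the published description, but numerical details are simplified.

```python
import torch
import torch.nn.functional as F

def koleo_loss(features, eps=1e-8):
    """features: (batch, dim). Encourages a uniform spread within the batch by
    maximizing the log-distance of each sample to its nearest neighbor."""
    x = F.normalize(features, p=2, dim=-1)
    dists = torch.cdist(x, x)                      # pairwise Euclidean distances
    dists.fill_diagonal_(float("inf"))             # ignore each sample's self-distance
    nn_dist = dists.min(dim=-1).values             # nearest-neighbor distance per sample
    return -torch.log(nn_dist + eps).mean()
```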
Evaluation
To perform semantic segmentation, a supplemental model needs to be added after the backbone. This model is trained in a supervised manner to predict the task-specific pixel-level labels.
The results are shown in Figure 8a.
Linear (lin.): A low-resolution logit map is predicted using a linear layer trained on patch tokens, while the pretrained backbone stays frozen. This method demonstrates satisfactory results considering its simplicity (a minimal sketch of this setup follows the figures below).
Multiscale (+ms): This is an enhanced version of the linear method. It uses patch tokens from the final four layers, multiscale test-time augmentations, and a higher input image resolution. The pretrained backbone is frozen during finetuning. This slightly enhances performance when compared to the linear method, but still significantly lags behind the absolute state-of-the-art.
ViT-Adapter + Mask2Former Head: This method adds the backbone to a ViT-Adapter [13] architecture with a Mask2Former head [14]. The backbone is frozen while the adapter and head are trained on the given task. On the ADE20k dataset [15, 16], a mIoU of 60.2 is achieved, which is close to the state-of-the-art of 63.0 [17] as of July 2023 (Figure 8b).
Figure 8a: Semantic segmentation results on various datasets [8].
Figure 8b: Semantic segmentation on ADE20k leaderboard [17].
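As a rough sketch of the linear protocol described above, the snippet below trains only a linear projection of the frozen backbone's patch tokens into class logits and upsamples the resulting low-resolution map to pixel resolution; the backbone interface, shapes, and patch size are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearSegmentationProbe(nn.Module):
    """Frozen backbone + linear layer on patch tokens -> low-resolution logit map."""
    def __init__(self, backbone, embed_dim, num_classes, patch_size=14):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False                    # the backbone stays frozen
        self.classifier = nn.Linear(embed_dim, num_classes)
        self.patch_size = patch_size

    def forward(self, images):
        b, _, h, w = images.shape
        gh, gw = h // self.patch_size, w // self.patch_size
        with torch.no_grad():
            tokens = self.backbone(images)             # assumed: (b, gh * gw, embed_dim)
        logits = self.classifier(tokens)               # (b, gh * gw, num_classes)
        logits = logits.transpose(1, 2).reshape(b, -1, gh, gw)
        # Upsample the coarse logit map to pixel resolution for the segmentation loss.
        return F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)
```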
Segment Anything
Segment Anything [18] is a promptable foundation model for image segmentation capable of zero-shot transfer to unseen image distributions and segmentation tasks. Although trained using supervised learning, the Segment Anything model still fits the theme of this blog post due to its zero-shot transfer abilities.
To make training possible, a combination of manual annotation and automatic model-in-the-loop data generation was used to create the largest segmentation dataset as of July 2023. It consists of 1 billion high-quality, diverse masks on 11 million images.
Promptable Segmentation Task
Segment Anything introduces a novel task which aims to produce a valid segmentation mask in response to any prompt. Four different prompt types are considered: a set of foreground and/or background points, a rough box, a rough mask, and free-form text (Figure 9). However, this task faces the problem that prompts can be ambiguous. For instance, a single-point prompt may refer to multiple items, such as the entire body of an animal or merely its head (Figure 10). Therefore, a mask is considered valid if it matches one of those items. Additionally, the model can predict multiple segmentation masks with associated confidence scores per prompt.
Figure 9: One example for every prompt type [18].
Figure 10: Ambiguity of prompts. Three valid masks for a single point prompt [18].
Architecture
The Segment Anything model architecture consists of three components: an image encoder, a prompt encoder tailored to the prompt type, and a mask decoder (Figure 11).
The image encoder is a Masked Autoencoder (MAE) [19] pretrained ViT modified to process high resolution inputs [20]. It outputs an image embedding that is applicable to all prompt types. To expedite processing, the obtained image embedding can be reused if the image remains the same.
The prompt encoder is specific to the prompt type. Masks are encoded using convolutions and then summed element-wise with the image embedding. For points and boxes, a positional encoding [21] is created and then summed with learnable embeddings for each prompt type. Free-form text is encoded via an off-the-shelf encoder from CLIP [22].
The mask decoder maps the image embedding, the prompt embeddings, and an output token (similar to a [CLS] token) to a mask. It utilizes a modified Transformer decoder block [23] and a dynamic mask prediction head.
To supervise the predicted masks, a linear combination of focal [24] and dice loss [25] is used. During training, input prompts are simulated by randomly generating them from the ground truth masks.
Figure 11: Overview of Segment Anything architecture [18].
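The mask supervision can be sketched as a weighted sum of a binary focal loss and a dice loss over the predicted mask logits; the weights, focal parameters, and tensor shapes below are illustrative rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss on mask logits; targets are float 0/1 masks of the same shape."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, targets, eps=1.0):
    p = torch.sigmoid(logits).flatten(1)
    t = targets.flatten(1)
    intersection = (p * t).sum(dim=1)
    return (1 - (2 * intersection + eps) / (p.sum(dim=1) + t.sum(dim=1) + eps)).mean()

def mask_loss(logits, targets, focal_weight=20.0, dice_weight=1.0):
    # Linear combination of focal and dice terms supervising the predicted masks.
    return focal_weight * focal_loss(logits, targets) + dice_weight * dice_loss(logits, targets)
```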
Evaluation
Zero-shot Single Point Valid Mask
The goal is to generate a segmentation mask based on a single foreground point prompt located at the center of the ground truth. As mentioned before, this prompt can be ambiguous. Therefore, human annotators are tasked to evaluate the mask quality. The results are illustrated in Figure 12. Segment Anything outperforms the strong RITM [26] baseline on all datasets and is often close to the ground truth, especially if it is allowed to output multiple masks.
Figure 12: Results of the zero-shot single point valid mask experiment. Mask quality ratings from 1 (worst) to 10 (best) on various datasets, created by human annotators [18].
Zero-shot Multi-point Valid Mask
This task is similar to the previous one with the difference being that multiple foreground points are prompted. Instead of ratings by humans, the mIoU score is used. Figure 13 shows the results. It is evident that with an increasing number of points, the performance also increases for all architectures. Although Segment Anything performs best, its advantage over other methods diminishes with an increasing number of sample points.
Replacing center sampling with random point sampling (Figure 14) does not significantly affect the performance of Segment Anything, whereas it negatively impacts the performance of all other architectures.
Figure 13: Results of the zero-shot multi-point valid mask experiment. Prompted input points are at the center of the ground truth [18].
Figure 14: Results of the zero-shot multi-point valid mask experiment. Prompted input points are sampled randomly [18].
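For context, the mIoU here averages the intersection-over-union between each predicted mask and its ground truth across the dataset; a minimal sketch of that metric (boolean masks assumed):

```python
def binary_iou(pred_mask, gt_mask, eps=1e-6):
    """pred_mask, gt_mask: boolean tensors of the same shape."""
    intersection = (pred_mask & gt_mask).sum().item()
    union = (pred_mask | gt_mask).sum().item()
    return (intersection + eps) / (union + eps)

def mean_iou(pred_masks, gt_masks):
    # Average IoU over a list of (prediction, ground truth) mask pairs.
    return sum(binary_iou(p, g) for p, g in zip(pred_masks, gt_masks)) / len(pred_masks)
```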
Segment Anything in the Medical Domain
Unfortunately, Segment Anything is not effective in medical image segmentation, because it is trained on natural images, which differ substantially from medical ones. More specifically, performance is poor on lesion segmentation tasks, and the model struggles to produce consistent boundaries for all 3D tasks [27].
Segment Anything in Medical Images
This paper [27] finetunes the mask decoder on medical tasks while keeping both encoders frozen. The box prompt is used exclusively. During finetuning, it is created from the ground truth mask with a small, random perturbation.
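A rough sketch of how such a box prompt could be derived from a ground truth mask during finetuning: take the mask's bounding box and jitter its corners by a small random offset (the perturbation range below is an assumption, not the paper's exact value).

```python
import torch

def box_from_mask(mask, max_shift=10):
    """mask: (H, W) boolean tensor. Returns a jittered [x_min, y_min, x_max, y_max] box."""
    ys, xs = torch.nonzero(mask, as_tuple=True)
    x_min, x_max = xs.min().item(), xs.max().item()
    y_min, y_max = ys.min().item(), ys.max().item()

    h, w = mask.shape
    # Perturb each coordinate by a small random offset and clamp to the image bounds.
    jitter = torch.randint(-max_shift, max_shift + 1, (4,))
    x_min = max(0, x_min + jitter[0].item())
    y_min = max(0, y_min + jitter[1].item())
    x_max = min(w - 1, x_max + jitter[2].item())
    y_max = min(h - 1, y_max + jitter[3].item())
    return [x_min, y_min, x_max, y_max]
```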
This simple adaptation improves the mean dice score of 21 3D medical segmentation tasks from 58.52% to 81.04%. The mean dice score of 9 2D tasks is improved from 59.62% to 77.22%. Qualitative results are shown in Figure 15.
Figure 15: Qualitative segmentation results in the medical domain of Segment Anything (SAM) [18] and Segment Anything in Medical Images (MedSAM) [27].
Medical SAM Adapter
This paper [28] inserts an Adapter module [29] at various positions of Segment Anything's architecture. This also allows the new architecture to perform native 3D image segmentation rather than segmenting each slice independently. In this case, the point prompt is used, with a sampling strategy that mimics a real user [30]. Alternatively, a text prompt can be used.
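Adapter modules of this kind are typically small bottleneck MLPs with a residual connection inserted into otherwise frozen Transformer blocks; the sketch below shows this generic form, not the exact module or placement used in [28].

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Generic bottleneck adapter: down-project, non-linearity, up-project, residual."""
    def __init__(self, dim, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, dim)

    def forward(self, x):
        # Only the adapter parameters are trained; the surrounding block stays frozen.
        return x + self.up(self.act(self.down(x)))
```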
This approach is evaluated on two datasets for abdominal multi-organ segmentation. On the AMOS2022 dataset [31], state-of-the-art performance is achieved on 12 out of 15 organs. On BTCV [32], it achieves state-of-the-art performance on 11 out of 12 organs. In both cases, the best overall performance is achieved. Moreover, it shows a remarkable generalization ability by performing well on other medical tasks like optic cup, brain tumor, and thyroid nodule segmentation.
Conclusion
This blog post described a self-supervised training framework called DINO for computer vision tasks, capable of producing high-quality features that can be utilized for various downstream tasks. In the case of segmentation, an additional model needs to be added after DINO's feature extraction and trained in a supervised manner to predict the dataset-specific pixel-level labels. Its results on various semantic segmentation datasets are close to the state-of-the-art.
Additionally, a promptable foundation model for image segmentation called Segment Anything capable of zero-shot transfer to unseen image distributions and segmentation tasks was examined. Although this model was trained with supervised learning, it is still mentioned in this post due to its zero-shot transfer abilities. Its results are very promising and sparked further research in promptable foundation models.
Personal Review
The paper by Caron et al. [1] on DINOv1 is well-written. The newly introduced DINO training framework is explained in great detail, and numerous ablations are performed. As DINO can generate high-quality features that can be utilized for many downstream tasks, a large variety of experiments can be conducted. Most notably, DINO’s classification performance on ImageNet is compared to many other self-supervised learning frameworks, including the state-of-the-art at the time, under two simple evaluation protocols. One of these protocols aims to minimize the impact of hyperparameters to improve comparability. Regrettably, DINO has not undergone a comprehensive evaluation for any other downstream task besides classification. Notably, semantic segmentation is not explored at all. Thankfully, the feature quality of DINO is compared to a similar supervised approach. However, this section is concise and it would be advisable to conduct more quantitative experiments. Both code and pretrained models are publicly accessible.
The DINOv2 framework by Oquab et al. [8] incorporates several enhancements from previous studies in the field. Unfortunately, the changes made to the original DINOv1 architecture are only briefly explained, and the reader is referred to other papers for more details. In particular, it is hard to understand how certain new components are incorporated into the original framework. Thankfully, comprehensive ablation studies are performed for every enhancement. Unlike the DINOv1 paper, this work evaluates DINOv2 in-depth on various downstream tasks including classification, semantic segmentation, and depth estimation, comparing it to state-of-the-art self-supervised and weakly-supervised models. However, it is not explicitly stated that only DINOv2 is trained on the very large and curated LVD-142M dataset, potentially making this comparison a bit unfair in my opinion. Additionally, a qualitative analysis is conducted on selected tasks. The architecture code and the pretrained models are both open source. Regrettably, the LVD-142M dataset used for training will not be released. It is still uncertain whether the automatic data generation pipeline will be released.
Kirillov et al. [18] introduce an impressive, promptable segmentation model called Segment Anything. The authors provide a detailed description of the novel promptable segmentation task, the architecture, and the data engine used to create training data. However, it should be noted that certain details can only be found in the appendix. The method is evaluated on various zero-shot tasks and prompt types including semantic segmentation given foreground points or a box output from an object detector. However, generating segmentations from text prompts was only studied very briefly and in a qualitative manner. It would be interesting to study quantitative results. Generally, in my opinion, one big problem is that Segment Anything is very hard to compare to established non-promptable models. There are two reasons for this. First, a prompt of some kind is required as an additional input. Second, prompts can be ambiguous, leading to multiple possible segmentation masks. The code, the pretrained models, and the huge segmentation dataset used for training are publicly available. Additionally, the authors have published a website with a demo for anyone to try.
References
[1] M. Caron et al., “Emerging Properties in Self-Supervised Vision Transformers,” Apr. 2021. [Online]. Available: http://arxiv.org/pdf/2104.14294v2
[2] D. Ruppert, “Efficient estimations from a slowly convergent Robbins-Monro process,” Cornell University Operations Research and Industrial Engineering, 1988. [Online]. Available: https://ecommons.cornell.edu/bitstream/handle/1813/8664/TR000781.pdf
[3] B. T. Polyak and A. B. Juditsky, “Acceleration of Stochastic Approximation by Averaging,” SIAM Journal on Control and Optimization, vol. 30, no. 4, pp. 838–855, 1992, doi: 10.1137/0330046.
[4] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum Contrast for Unsupervised Visual Representation Learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. [Online]. Available: https://openaccess.thecvf.com/content_CVPR_2020/html/He_Momentum_Contrast_for_Unsupervised_Visual_Representation_Learning_CVPR_2020_paper.html
[5] S. Jean, K. Cho, R. Memisevic, and Y. Bengio, “On Using Very Large Target Vocabulary for Neural Machine Translation,” Dec. 2014. [Online]. Available: http://arxiv.org/pdf/1412.2007v2
[6] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html
[7] A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” Oct. 2020. [Online]. Available: http://arxiv.org/pdf/2010.11929v2
[8] M. Oquab et al., “DINOv2: Learning Robust Visual Features without Supervision,” Apr. 2023. [Online]. Available: http://arxiv.org/pdf/2304.07193v1
[9] J. Zhou et al., “iBOT: Image BERT Pre-Training with Online Tokenizer,” Nov. 2021. [Online]. Available: http://arxiv.org/pdf/2111.07832v3
[10] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised Learning of Visual Features by Contrasting Cluster Assignments,” in Advances in Neural Information Processing Systems, 2020, pp. 9912–9924. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2020/file/70feb62b69f16e0238f741fab228fec2-Paper.pdf
[11] A. Sablayrolles, M. Douze, C. Schmid, and H. Jégou, “Spreading vectors for similarity search,” Jun. 2018. [Online]. Available: http://arxiv.org/pdf/1806.03198v3
[12] H. Touvron, A. Vedaldi, M. Douze, and H. Jegou, “Fixing the train-test resolution discrepancy,” in Advances in Neural Information Processing Systems, 2019. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2019/file/d03a857a23b5285736c4d55e0bb067c8-Paper.pdf
[13] Z. Chen et al., “Vision Transformer Adapter for Dense Predictions,” May 2022. [Online]. Available: http://arxiv.org/pdf/2205.08534v4
[14] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-Attention Mask Transformer for Universal Image Segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 1290–1299. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2022/html/Cheng_Masked-Attention_Mask_Transformer_for_Universal_Image_Segmentation_CVPR_2022_paper.html
[15] B. Zhou et al., “Semantic Understanding of Scenes Through the ADE20K Dataset,” International Journal of Computer Vision, vol. 127, pp. 302–321, 2019. [Online]. Available: https://link.springer.com/article/10.1007/s11263-018-1140-0
[16] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene Parsing Through ADE20K Dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2017/html/Zhou_Scene_Parsing_Through_CVPR_2017_paper.html
[17] Papers with Code, Semantic Segmentation on ADE20K. [Online]. Available: https://paperswithcode.com/sota/semantic-segmentation-on-ade20k
[18] A. Kirillov et al., “Segment Anything,” Apr. 2023. [Online]. Available: http://arxiv.org/pdf/2304.02643v1
[19] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked Autoencoders Are Scalable Vision Learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 16000–16009. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2022/html/He_Masked_Autoencoders_Are_Scalable_Vision_Learners_CVPR_2022_paper
[20] Y. Li, H. Mao, R. Girshick, and K. He, “Exploring Plain Vision Transformer Backbones for Object Detection,” in Computer Vision ‐ ECCV 2022, 2022, pp. 280–296. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-031-20077-9_17
[21] M. Tancik et al., “Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains,” in Advances in Neural Information Processing Systems, 2020, pp. 7537–7547. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2020/file/55053683268957697aa39fba6f231c68-Paper.pdf
[22] A. Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,” in Proceedings of the 38th International Conference on Machine Learning, 2021, pp. 8748–8763. [Online]. Available: https://proceedings.mlr.press/v139/radford21a.html
[23] A. Vaswani et al., “Attention is All you Need,” in Advances in Neural Information Processing Systems, 2017. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[24] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal Loss for Dense Object Detection,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017. [Online]. Available: https://openaccess.thecvf.com/content_iccv_2017/html/Lin_Focal_Loss_for_ICCV_2017_paper.html
[25] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation,” in 2016 Fourth International Conference on 3D Vision (3DV), 2016, pp. 565–571. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/7785132
[26] K. Sofiiuk, I. A. Petrov, and A. Konushin, “Reviving iterative training with mask guidance for interactive segmentation,” in 2022 IEEE International Conference on Image Processing (ICIP), 2022, pp. 3141–3145.
[27] J. Ma, Y. He, F. Li, L. Han, C. You, and B. Wang, “Segment Anything in Medical Images,” Apr. 2023. [Online]. Available: http://arxiv.org/pdf/2304.12306v2
[28] J. Wu et al., “Medical SAM Adapter: Adapting Segment Anything Model for Medical Image Segmentation,” Apr. 2023. [Online]. Available: http://arxiv.org/pdf/2304.12620v6
[29] E. J. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” Jun. 2021. [Online]. Available: http://arxiv.org/pdf/2106.09685v2
[30] S. Mahadevan, P. Voigtlaender, and B. Leibe, “Iteratively Trained Interactive Segmentation,” May 2018. [Online]. Available: http://arxiv.org/pdf/1805.04398v1
[31] Y. Ji et al., “AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation,” Jun. 2022. [Online]. Available: http://arxiv.org/pdf/2206.08023v3
[32] X. Fang and P. Yan, “Multi-Organ Segmentation Over Partially Labeled Datasets With Multi-Scale Feature Abstraction,” IEEE Transactions on Medical Imaging, vol. 39, no. 11, pp. 3619–3629, 2020, doi: 10.1109/TMI.2020.3001036.