1 Introduction
In medical imaging, self-supervised volume segmentation has become a crucial research field with important implications for healthcare. The precise and efficient delineation of anatomical structures or pathological regions in medical images is essential for diagnosis, treatment planning, and robot-assisted interventions. Conventional supervised segmentation techniques, however, have a limited ability to scale and to generalize across different settings, since they rely heavily on expensive and time-consuming human annotations. Self-supervision, in contrast, can leverage huge amounts of unlabeled data.
Self-supervision exploits the intrinsic information in unlabeled medical data to get around these problems. By learning the spatial and contextual structure contained in the images, such methods enable automated feature learning without explicit human annotations. This not only addresses the lack of annotated data but also makes it easier to analyze large amounts of data. Self-supervised volume segmentation also benefits generalization: because of variations in patients, scanners, and acquisition protocols, medical images show high variance, and self-supervised approaches that learn from unlabeled data capture a wider range of anatomical variation. The learned knowledge can then be transferred to downstream tasks, leading to improved segmentation performance when fine-tuning on smaller labeled datasets.
Fig 1: Different tissue segmentation on X-ray, MRI and US
2 Self-Supervised Volume Segmentation
Due to its capacity to learn from large amounts of unlabeled data, self-supervised learning has become an important research topic in deep learning. A further benefit is that the self-supervised part needs to be trained only once; the same model can then be fine-tuned repeatedly for other tasks, which considerably reduces the amount of labeled data required.
But why do we need 3D segmentation instead of 2D? Volume segmentation provides richer spatial context and therefore better segmentation results. By segmenting the whole volume instead of individual slices, it also promotes smooth and continuous segmentations and better preserves the distinction between overlapping regions in medical images. According to a study [1], 3D segmentation also converges faster than 2D; however, it requires about 20 times more memory. In addition, obtaining high-quality annotations is harder in 3D, which results in scarce datasets.
In self-supervised learning there are typically two sub-tasks: a pretext task and a downstream task. In the pretext task, a model is trained to solve a problem that requires no human-labeled annotations and is designed so that the model learns useful representations or features. For example, in image-based tasks the model might be trained to predict image rotations, colorize images, or inpaint missing regions. The learned representations can then be fine-tuned with some labeled data for downstream tasks such as classification, regression, or segmentation. A minimal code sketch of this two-stage structure is given after Fig 2.
Fig 2: Simple pipeline of SSL
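The following Python sketch is purely illustrative and is not taken from any of the papers below; it shows the two-stage structure of SSL with a shared encoder that is first trained on a label-free pretext task (here a 4-way rotation prediction) and then reused with a segmentation head on a small labeled set. All network sizes and names are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared encoder: stand-in for a real backbone (ViT, U-Net encoder, ...).
encoder = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
)
rotation_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 4))
segmentation_head = nn.Conv2d(16, 2, kernel_size=1)      # per-pixel foreground/background

# Stage 1: pretext task on unlabeled slices; labels are generated from the data itself.
x = torch.randn(8, 1, 64, 64)                             # unlabeled batch (placeholder)
k = torch.randint(0, 4, (8,))                             # rotation index per image
x_rot = torch.stack([torch.rot90(img, int(r), dims=(1, 2)) for img, r in zip(x, k)])
pretext_loss = F.cross_entropy(rotation_head(encoder(x_rot)), k)

# Stage 2: downstream task; keep the pre-trained encoder weights and
# fine-tune encoder + segmentation_head on the small labeled set.
y = torch.randint(0, 2, (8, 64, 64))                      # ground-truth masks (placeholder)
downstream_loss = F.cross_entropy(segmentation_head(encoder(x)), y)
```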
2.1 Metrics
Different metrics are frequently employed in volume segmentation to assess the effectiveness of segmentation algorithms. They quantify the agreement between the ground-truth annotations and the predicted segmentation, either as region overlap or as boundary distance. The most frequently used metrics are the Dice Similarity Score, Intersection over Union, and the Hausdorff Distance; their formulas are given in Fig 3, and a small code sketch follows after the figure.
Fig 3: Common metrics used in Semantic segmentation
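To make the definitions concrete, the minimal NumPy sketch below (not part of any of the reviewed papers) computes the three metrics for binary masks; the small epsilon terms and the use of SciPy's directed Hausdorff distance are implementation choices of this sketch.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice = 2|P ∩ G| / (|P| + |G|) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum() + 1e-8)

def iou_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU = |P ∩ G| / |P ∪ G| for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return intersection / (union + 1e-8)

def hausdorff_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Hausdorff distance between voxel coordinates of two non-empty masks."""
    p, g = np.argwhere(pred), np.argwhere(gt)
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])
```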
2.2 Benchmarks
Some of the most widely used benchmarks and challenges in medical image segmentation, which also appear in the papers discussed in Section 3, are:
BraTS (Brain Tumor Segmentation) [9]: The Multimodal Brain Tumor Segmentation Challenge dataset focuses on the segmentation of brain tumors from multimodal MRI scans. It includes high-grade and low-grade gliomas, providing a challenging dataset for brain tumor segmentation algorithms. It consists of multi-parametric MRI, namely T1, contrast-enhanced T1, T2, and FLAIR. The 2020 challenge includes 2000 patients with segmentation labels.
PROMISE12 (Prostate Segmentation) [19]: The challenge focuses on the segmentation of the prostate gland from magnetic resonance images (MRI). It contains 50 patients with segmentation masks and only the T2 modality.
KiTS (Kidney and Kidney Tumor Segmentation) [20]: It consists of CT scans with corresponding manual annotations of kidney and tumor regions. It contains 300 patients, of which 210 were used for training and 90 for evaluation.
ACDC (Automated Cardiac Diagnosis Challenge) [11]: The challenge provides cardiac MRI scans from different patients, along with ground-truth annotations for cardiac structures such as the left ventricle, right ventricle, and myocardium. It contains 150 patients.
BTCV (Beyond the Cranial Vault Challenge) [8]: This challenge contains only CT images, covering 13 organs from the abdomen region and 4 organs from the cervix region, with 50 patients in each category.
3 Methods
3.1 Self Pre-training with Masked Autoencoders for Medical Image Classification and Segmentation[7]
Main Idea
Recent studies have demonstrated the effectiveness of the Masked Autoencoder (MAE) [2] for pre-training Vision Transformers (ViT) [3] on natural images. The MAE reconstructs complete images from partially masked inputs; through this process, the ViT encoder learns to integrate contextual information to infer the masked regions. However, no large-scale medical image dataset comparable to ImageNet [4] is available for pre-training. The authors therefore propose a self pre-training paradigm in which the ViT is pre-trained on the training set of the target medical image data itself, instead of relying on an external dataset [7]. This self-supervised pre-training is particularly beneficial in scenarios where acquiring a large pre-training dataset is challenging.
Methodology
Pre-training is done with an MAE [2], and fine-tuning uses a ViT [3] encoder with a U-Net [5] style decoder. The goal of the masked autoencoder is to regenerate the input image without any labels [2]. In this pipeline, the autoencoder consists of a ViT encoder and a lightweight transformer decoder: the encoder takes an input image with some patches masked out, while the decoder is responsible for regenerating the full image. The MAE is trained with a mean squared error computed only on the masked patches. This pre-training yields learned features and weights that are later used for fine-tuning. Reconstruction results of the masked autoencoder are shown in Fig 5: the first row shows the original images, the second row the masked images, and the third row the images reconstructed by the autoencoder.
After the weights are transferred to the ViT [3] encoder, the input is the whole image together with the corresponding ground-truth segmentation masks, as shown in Fig 4. The encoder is again a ViT encoder, but the decoder comes from the UNETR (U-Net transformer) [6], since a segmentation map is required at the end. Using the weights from the masked autoencoder, the model can segment the target regions with only a small amount of labeled data. A minimal sketch of the masked-reconstruction objective is given after Fig 5.
Fig 4: Segmentation pipeline with MAE self pre-training and fine-tuning
Fig 5: Reconstruction results from MAE
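The Python sketch below illustrates the masked-patch objective described above. The patch size, mask ratio, and the tiny encoder and reconstruction head are placeholders rather than the actual ViT/UNETR configuration of the paper [7], and masked tokens are simply zeroed instead of being dropped and replaced by learnable mask tokens as in the real MAE [2].

```python
import torch
import torch.nn as nn

# Patchify the image, hide a large fraction of patches, reconstruct all
# patches, and compute the MSE only on the masked patches.
patch, mask_ratio, dim = 16, 0.75, 128
x = torch.randn(4, 1, 224, 224)                                  # batch of slices (placeholder)

patches = x.unfold(2, patch, patch).unfold(3, patch, patch)       # (B, 1, 14, 14, 16, 16)
patches = patches.contiguous().view(4, -1, patch * patch)         # (B, 196, 256)
B, N, D = patches.shape

mask = torch.rand(B, N) < mask_ratio                              # True = hidden patch
embed, head = nn.Linear(D, dim), nn.Linear(dim, D)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)

tokens = embed(patches) * (~mask).unsqueeze(-1).float()           # zero out masked tokens
recon = head(encoder(tokens))                                     # predicted patch pixels

loss = ((recon - patches) ** 2)[mask].mean()                      # MSE on masked patches only
```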
Dataset & Results
This paper [7] uses two datasets for the segmentation task, BTCV [8] and BraTS [9], and the same dataset is used for both pre-training and fine-tuning. The method is therefore useful when there is not enough external data to pre-train on. The proposed approach outperforms UNETR [6] pre-trained on ImageNet [4], reaching an average Dice score of 83.52 on the BTCV [8] dataset. As shown in Fig 6, the MAE approach similarly yields better results on BraTS. In Fig 7 the proposed method [7], indicated with a red box, is compared with the ground truth and the UNETR baseline. In the first row, the orange arrow marks a false-positive region produced by the baseline UNETR that is eliminated by the MAE method. In the second row, the stomach segmentation (red star) in the first column is incomplete with the plain UNETR [6] compared to the MAE [2] pre-trained UNETR [6].
Fig 6: Results for BTCV and BraTS
Fig 7: Qualitative segmentation results
3.2 Positional Contrastive Learning for Volumetric Medical Image Segmentation[10]
Main Idea
Contrastive learning is an unsupervised technique and a highly powerful way to learn representations from unlabeled data, which can later be fine-tuned for downstream tasks. One of the most critical steps in contrastive learning is constructing the contrastive data pairs. This is straightforward for natural images using simple data augmentations, but harder in the medical setting, where similar tissues and organs appear across the whole dataset; as a result, most state-of-the-art methods produce false negatives. To address this problem, the paper [10] introduces positional contrastive learning (PCL), which generates contrastive pairs based on the position of a slice inside the volume: slices that are close to each other are considered positive pairs, since they usually share similar anatomical structures, while distant slices are considered negative pairs.
Methodology
In the pre-training stage, a set of 2D slices is randomly sampled from the 3D volumes. The slices are passed through a 2D U-Net [5] encoder, and contrastive learning is applied to the resulting embeddings. For fine-tuning, the learned representations initialize a standard U-Net [5] architecture, which is then trained with limited labeled data. Each slice is assigned a position value between 0 and 1, its normalized position along the z-axis of the volume. If the position difference between two slices is smaller than a threshold, they are treated as a positive pair; if it is larger, they are a negative pair. This rule also applies to slices from different volumes/patients. The loss is a contrastive loss based on the cosine similarity between two embedding vectors. Pair examples can be seen in Fig 8; a sketch of the pairing rule and loss follows after the figure.
Fig 8: Pipeline of PCL
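The sketch below illustrates the positional pairing rule together with an InfoNCE-style contrastive loss built on cosine similarity, in the spirit of PCL [10]; the threshold, temperature, and embedding sizes are placeholders, and the exact loss formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F

threshold = 0.1                               # placeholder position threshold
z = torch.rand(16, 128)                       # embeddings of 16 sampled 2D slices
pos = torch.rand(16)                          # normalized slice positions in [0, 1]

# Two slices form a positive pair if their normalized positions differ by
# less than the threshold, regardless of which patient/volume they come from.
is_positive = (pos[:, None] - pos[None, :]).abs() < threshold
is_positive.fill_diagonal_(False)

sim = F.cosine_similarity(z[:, None], z[None, :], dim=-1) / 0.1   # temperature 0.1
sim.fill_diagonal_(float('-inf'))                                 # exclude self-pairs

# For each anchor, pull its positives together and push the remaining
# (negative) slices apart; anchors without positives contribute zero.
log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
loss = -(log_prob * is_positive).sum(1) / is_positive.sum(1).clamp(min=1)
loss = loss.mean()
```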
Dataset & Results
The paper [10] uses four different datasets. Training is done on two cardiac datasets: ACDC [11] (MRI scans) and CHD [12] (CT scans). For the transfer-learning setting, the method is also tested on two further cardiac datasets, HVSMR [13] and MMWHS [14], as shown in Fig 10. The metric used is the Dice Similarity Score. PCL outperforms GCL [15], the most recent contrastive-learning-based medical image segmentation method, in both the semi-supervised and the transfer-learning setting. In the tables below (Fig 9 and 10), M denotes the number of patients used for fine-tuning; as M increases, the gain in Dice score decreases. In Fig 10, the improvement on the HVSMR dataset is smaller than on MMWHS, since the CHD and MMWHS datasets are quite similar to each other, which makes the learned features more helpful.
Fig 9: Results for Semi-supervised Learning
Fig 10: Results for Transfer Learning
3.3 Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis[16]
Main Idea
This paper [16] reaches state-of-the-art results on the Medical Segmentation Decathlon [9] and the BTCV [8] challenge. It effectively combines the ideas of the first two papers, masked inpainting and contrastive learning, and adds a rotation prediction task. The authors also propose a novel Swin UNET transformer (Swin UNETR) [16] architecture.
Methodology
The input 3D CT scans are randomly cropped into sub-volumes, augmented with random cutout and rotation, and then fed into the Swin UNET Transformer (Swin UNETR) [16] encoder. The pipeline is shown in Fig 11.
The Swin Transformer [17] differs slightly from a standard ViT [3]. In a standard vision transformer, an image is divided into fixed-size patches that are passed through all layers. The self-attention mechanism captures long-range dependencies and enables global context, but it requires computing attention scores for every pair of tokens, resulting in quadratic complexity. Swin transformers [17] avoid this by decreasing the resolution at each stage and are therefore much faster than ViT. They are inspired by CNNs, where the spatial size is reduced and the channel size increased after every convolution stage; the Swin transformer follows the same scheme and increases the channel size via patch merging. In addition, standard vision transformers lack local attention, since they always attend over all image patches, whereas the Swin transformer divides the input into smaller windows and processes them hierarchically. In short, ViT [3] focuses on global attention, whereas the Swin transformer [17] captures both local and global context. A rough comparison of the attention cost is sketched below.
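As a back-of-the-envelope illustration of this complexity argument (the patch and window sizes are chosen only for illustration and do not match any particular Swin configuration), the snippet below counts the attention pairs for global versus windowed attention.

```python
# Global self-attention scales with the square of the token count, while
# window attention only scales with the square of the (fixed) window size.
tokens = (224 // 16) ** 2          # 14 x 14 = 196 patch tokens for a 224x224 image
window = 7 * 7                     # tokens per local window in a Swin-style layer

global_pairs = tokens ** 2                          # 196^2 = 38,416 token pairs
windowed_pairs = (tokens // window) * window ** 2   # 4 windows * 49^2 = 9,604 pairs
print(global_pairs, windowed_pairs)
```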
In the pre-training stage, the Swin UNETR [16] encoder is trained on several proxy tasks. As explained above, the inputs are the randomly cropped sub-volumes, augmented with random rotations and cutouts. The first pretext task is masked inpainting: a transpose convolution layer attached to the encoder serves as reconstruction head and predicts the cut-out patches, trained with an L1 loss. The second task is rotation prediction, formulated as a classification over rotations of 90, 180, and 270 degrees; an MLP classification head predicts the softmax probabilities and is trained with a cross-entropy loss. Finally, contrastive learning is applied: as usual, the goal is to minimize the distance between positive pairs, the augmented samples from the same sub-volume, while maximizing it between negative pairs, the samples from different sub-volumes, with cosine similarity as the distance measure. The overall loss is the sum of the three losses weighted by lambda coefficients; a condensed sketch of this combined objective is given after Fig 11.
Fig 11: Pre-training pipeline of Swin Transformer
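The sketch below shows how the three proxy-task losses could be combined; the heads, the number of rotation classes, the temperature, and the lambda weights are placeholders and do not correspond to the actual Swin UNETR [16] implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

lambda_inpaint, lambda_rot, lambda_contrast = 1.0, 1.0, 1.0      # placeholder weights

feat = torch.randn(8, 256)                 # pooled encoder features of 8 augmented sub-volumes
recon = torch.randn(8, 1, 96, 96, 96)      # output of the transpose-conv reconstruction head
target = torch.randn(8, 1, 96, 96, 96)     # original (un-cutout) sub-volumes

rot_logits = nn.Linear(256, 4)(feat)       # MLP head over rotation-angle classes (placeholder)
rot_labels = torch.randint(0, 4, (8,))

# Contrastive term: two augmented views of the same sub-volume form the
# positive pair; cosine similarity (via normalized dot products) is the
# distance measure, samples from other sub-volumes act as negatives.
z1, z2 = F.normalize(feat[:4], dim=1), F.normalize(feat[4:], dim=1)
sim = z1 @ z2.t() / 0.1                    # temperature 0.1
contrast_loss = F.cross_entropy(sim, torch.arange(4))

loss = (lambda_inpaint * F.l1_loss(recon, target)
        + lambda_rot * F.cross_entropy(rot_logits, rot_labels)
        + lambda_contrast * contrast_loss)
```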
Dataset & Results
The model is pre-trained on 5000 images from 5 datasets and tested on BTCV [8] and the Medical Segmentation Decathlon (MSD) [9], which comprises 10 sub-tasks such as brain, heart, and liver. The metrics used are the Dice Similarity Score and the Hausdorff distance. The table in Fig 12 shows the results on the BTCV [8] dataset; the average column on the right shows that the method outperforms all previous models. On the MSD [9] dataset it reaches an average Dice of 78.68% over all 10 tasks and achieves the top ranking on the MSD leaderboard. Fig 13 shows a qualitative segmentation comparison with DiNTS [21], which was the top scorer of the MSD challenge before Swin UNETR [16].
Fig 12: Results for BTCV[8] dataset
Fig 13: Qualitative Results comparison with DiNTS[21]
3.4 Sli2Vol: Annotate a 3D Volume from a Single Slice with Self-Supervised Learning[18]
Main Idea
The goal of this paper [18] is to segment any arbitrary structure of interest (SOI) in a 3D volume from the annotation of a single slice. It reaches over 80% Dice score on 8 CT and MRI datasets spanning 9 anatomical structures, and it proposes a method that generalizes to different SOIs while outperforming other supervised and unsupervised approaches in cross-domain experiments. The core idea is to propagate the 2D slice segmentation through the volume using an affinity matrix between consecutive slices. The method combines a weighting-and-copying scheme based on this affinity matrix, a ConvNet, and an edge profile generator to generate each slice from the previous one.
Methodology
During training, pairs of adjacent slices are randomly selected from a volume. Each slice is first passed through an edge profile generator, which explicitly represents the edge distribution centered at each pixel and thus forces the model to pay attention to edges during reconstruction. The slices are then fed into a ConvNet to obtain feature representations, called key and query. These representations are reshaped, and their pairwise feature similarities, computed with a dot product, form an affinity matrix. The affinity matrix encodes the similarity between slice positions and is used to reconstruct a slice from the previous one: to reconstruct Slice 2 from Slice 1, the affinity matrix weights and copies pixels from Slice 1, and a mean squared error is computed between the actual and the reconstructed Slice 2.
At inference time, the user annotates a single slice of the volume, and this segmentation is transferred to the whole volume iteratively, as shown in Fig 14. This can cause error accumulation if a wrong prediction at one step is propagated to the next. To overcome this problem, the authors propose a simple verification module that corrects the mask after every iteration. They specify two regions, positive and negative, where positive refers to the masked region and negative to its surroundings; the pixels of the predicted slice are then compared to the intensity values of the positive and negative regions and re-classified accordingly. A sketch of the affinity-based propagation is given after Fig 14.
Fig 14: Train & Test pipeline of Sli2Vol
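The sketch below illustrates the affinity-based weight-and-copy step on a single pair of slices; the convolutional feature extractor is a stand-in for the ConvNet plus edge profile generator of Sli2Vol [18], and the verification module is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

H = W = 64
slice1, slice2 = torch.randn(1, 1, H, W), torch.randn(1, 1, H, W)   # adjacent slices

conv = nn.Conv2d(1, 16, 3, padding=1)                # placeholder feature extractor
key = conv(slice1).flatten(2)                        # (1, 16, H*W) features of slice 1
query = conv(slice2).flatten(2)                      # (1, 16, H*W) features of slice 2

# Row-normalized affinity between every pixel of slice 2 and every pixel of slice 1.
affinity = torch.softmax(query.transpose(1, 2) @ key, dim=-1)       # (1, H*W, H*W)

# Training: reconstruct slice 2 by weighting-and-copying pixels of slice 1;
# the MSE between the reconstruction and the real slice 2 is the loss.
recon2 = (affinity @ slice1.flatten(2).transpose(1, 2)).transpose(1, 2).reshape(1, 1, H, W)
loss = F.mse_loss(recon2, slice2)

# Inference: the same affinity propagates the user-annotated mask of slice 1
# to slice 2, and iteratively through the whole volume.
mask1 = torch.zeros(1, 1, H, W); mask1[:, :, 20:40, 20:40] = 1.0
mask2 = (affinity @ mask1.flatten(2).transpose(1, 2)).transpose(1, 2).reshape(1, 1, H, W)
```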
Dataset & Results
The method is trained on four datasets, three of CT scans and one of MRI scans, and tested on eight different datasets, with performance reported as Dice score. In Fig 15, rows (a) and (b) show fully supervised baselines: when trained and tested on the same domain they reach very high Dice scores, but under a domain shift the Dice score drops by almost 20%. Compared with semi-automatic approaches similar to Sli2Vol [18], the method outperforms all of them in terms of mean Dice score. In Fig 16, the middle image shows the annotated slice; red denotes the ground truth and blue the predictions. The results are close to the ground truth in all cases, except for the pancreas, where some false-positive regions are introduced.
Fig 15: Results table of Sli2Vol[18]
Fig 16: Segmentation results of Sli2Vol[18]
4 Discussion/Conclusion
Paper | Transferability | Complexity
Masked Autoencoder [7] | Pre-training and fine-tuning are done on the same data | 2 different transformers
Contrastive learning [10] | Only on cardiac data | Easy to implement
SOTA [16] | Tested on MSD and BTCV | Solves many tasks; Swin transformer is complex
Sli2Vol [18] | Tested on various data | Easy to implement
Table 1: Comparison of methods
Table 1 compares the four papers presented above in terms of transferability and complexity. The masked autoencoder paper [7] uses the same datasets for pre-training and fine-tuning and reports results on these datasets as well, so it is not known how the method would perform on a completely different dataset. The same applies to positional contrastive learning [10]: the paper uses four datasets that all consist of cardiac MRI or CT scans, so again its behavior on other data is unknown. In contrast, the SOTA paper [16] uses a large amount of test data, with MSD [9] covering 10 tasks and BTCV [8] 13 organs. The same holds for Sli2Vol [18], which is trained on multiple datasets (e.g. kidney, pancreas) and tested on others (e.g. brain, spleen).
As for complexity, the autoencoder paper [7] uses two different transformers, ViT [3] and UNETR [6], and vision transformers tend to be computationally demanding. Similarly, the SOTA paper [16] uses a Swin transformer [17], which is faster than ViT [3], but relies on three different proxy tasks, which increases training time. The other two methods are more straightforward and easy to implement: they rely on similarity measures and use simple U-Net [5] and convolutional networks.
As for the metrics, all papers use the Dice score; only the SOTA [16] and the autoencoder [7] papers additionally report the Hausdorff distance.
5 References
- Avesta, A., Hossain, S., Lin, M., Aboian, M., Krumholz, H. M., & Aneja, S. (2022). Comparing 3D, 2.5 D, and 2D Approaches to Brain Image Segmentation. medRxiv, 2022-11.
- He, Kaiming, et al. "Masked autoencoders are scalable vision learners." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.
- Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
- Russakovsky, Olga, et al. "Imagenet large scale visual recognition challenge." International Journal of Computer Vision 115 (2015): 211-252.
- Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18 (pp. 234-241). Springer International Publishing.
- Hatamizadeh, Ali, et al. "Unetr: Transformers for 3d medical image segmentation." Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2022.
- Zhou, L., Liu, H., Bae, J., He, J., Samaras, D., & Prasanna, P. (2022). Self pre-training with masked autoencoders for medical image analysis. arXiv preprint arXiv:2203.05573.
- B. Landman, Z. Xu, J. E. Iglesias, M. Styner, T. R. Langerak, A. Klein. MICCAI multi-atlas labeling beyond the cranial vault - workshop and challenge. 2015, accessed August 2020.
- Antonelli, Michela, et al. "The medical segmentation decathlon." arXiv preprint arXiv:2106.05735 (2021).
- Zeng, D., Wu, Y., Hu, X., Xu, X., Yuan, H., Huang, M., ... & Shi, Y. (2021). Positional contrastive learning for volumetric medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part II 24 (pp. 221-230). Springer International Publishing.
- Bernard, O., Lalande, A., Zotti, C., Cervenansky, F., Yang, X., Heng, P.A., Cetin, I., Lekadir, K., Camara, O., Ballester, M.A.G., et al.: Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE Transactions on Medical Imaging 37(11), 2514–2525 (2018).
- Xu, X., Wang, T., Shi, Y., Yuan, H., Jia, Q., Huang, M., Zhuang, J.: Whole heart and great vessel segmentation in congenital heart disease using deep neural networks and graph matching. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 477–485. Springer (2019).
- Pace, D.F., Dalca, A.V., Geva, T., Powell, A.J., Moghari, M.H., Golland, P.: Interactive whole-heart segmentation in congenital heart disease. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 80–88. Springer (2015).
- Zhuang, X.: Challenges and methodologies of fully automatic whole heart segmentation: a review. Journal of Healthcare Engineering 4(3), 371–407 (2013).
- Chaitanya, K., Erdil, E., Karani, N., Konukoglu, E.: Contrastive learning of global and local features for medical image segmentation with limited annotations. Advances in Neural Information Processing Systems 33 (2020).
- Tang, Y., Yang, D., Li, W., Roth, H. R., Landman, B., Xu, D., ... & Hatamizadeh, A. (2022). Self-supervised pre-training of swin transformers for 3d medical image analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 20730-20740).
- Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10012-10022).
- Yeung, P. H., Namburete, A. I., & Xie, W. (2021). Sli2vol: Annotate a 3d volume from a single slice with self-supervised learning. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part II 24 (pp. 69-79). Springer International Publishing.
- Geert Litjens, Robert Toth, Wendy van de Ven, Caroline Hoeks, Sjoerd Kerkstra, Bram van Ginneken, Graham Vincent, Gwenael Guillard, Neil Birbeck, Jindang Zhang, et al. Evaluation of prostate segmentation algorithms for MRI: the PROMISE12 challenge. Medical Image Analysis, 18(2):359–373, 2014.
- Heller, Nicholas, et al. "The KiTS19 challenge data: 300 kidney tumor cases with clinical context, CT semantic segmentations, and surgical outcomes." arXiv preprint arXiv:1904.00445 (2019).
- Yufan He, Dong Yang, Holger Roth, Can Zhao, and Daguang Xu. DiNTS: Differentiable neural network topology search for 3D medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.