Author: Unknown user (ge95qaw)
Tutor: Yeganeh, Y. M.
Introduction
Medical image segmentation is an essential task in medical diagnosis and treatment planning that involves dividing medical images into regions of relevant information. It is useful because it enables the extraction of information such as size, shape, and texture, which can be used for diagnosis, treatment planning and monitoring disease progression. Additionally, image segmentation can help detect anomalies such as tumours (Fig. 1) and can be used to enhance anatomic visualisations, making image interpretation easier. Deep learning algorithms are capable of learning patterns and representations from vast amounts of data, making them well suited for image segmentation, and in recent years the accuracy of medical image segmentation models has improved significantly. However, one of the challenges of deep learning for medical image segmentation is the lack of labelled data: manual annotation of medical images is time-consuming and expensive, so large annotated datasets are often hard to obtain. This lack of labelled data affects the performance of deep learning algorithms, leading to suboptimal results. In this blog post, the use of self-supervised learning frameworks will be explored as a way to overcome this problem.
Figure 1: Brain tumour MRI segmentation [1]
Self-supervised learning
Self-supervised learning is a machine learning paradigm based on learning data structures and features from unlabelled data, in order to then employ those features to solve downstream tasks. The training of a self-supervised model is usually divided into two parts, as one can see in Fig. 2. The first part is based on solving a pretext task, i.e. a task that allows the model to extract relevant features from raw data and for which the labels needed for training are automatically generated from that raw data. The second part consists of fine-tuning the pre-trained self-supervised model by training it in a supervised setting with the data available from a small labelled dataset.
Figure 2: Self-supervised learning main workflow [2]
Self-supervised learning has become an active area of research in the field of deep learning, due to its ability to learn from large amounts of unlabelled data. This is particularly useful in situations where labelled data is scarce, such as the aforementioned medical imaging data. Another advantage of this kind of model is the possibility to train the self-supervised part just once and fine-tune the same model multiple times for different kinds of tasks, significantly decreasing the resources needed.
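To make the two-stage workflow concrete, here is a minimal sketch in PyTorch: an encoder is first pre-trained on a pretext task whose labels are generated automatically (rotation prediction is used here purely as a simple example), and is then fine-tuned on a small labelled set. All module names, shapes and the choice of pretext task are illustrative assumptions, not taken from the papers discussed below.

```python
import torch
import torch.nn as nn

# Hypothetical encoder shared between the pretext and downstream stages.
encoder = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

# Stage 1: pretext task on unlabelled data (labels are generated for free).
pretext_head = nn.Linear(16, 4)  # predicts one of 4 rotations
opt = torch.optim.Adam(list(encoder.parameters()) + list(pretext_head.parameters()))
for _ in range(10):  # toy loop over "unlabelled" batches
    x = torch.randn(8, 1, 64, 64)
    k = torch.randint(0, 4, (8,))  # automatically generated labels
    x = torch.stack([torch.rot90(img, int(r), dims=(1, 2)) for img, r in zip(x, k)])
    loss = nn.functional.cross_entropy(pretext_head(encoder(x)), k)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: fine-tune the pre-trained encoder with the small labelled set.
downstream_head = nn.Linear(16, 2)
opt = torch.optim.Adam(list(encoder.parameters()) + list(downstream_head.parameters()))
x_lab, y_lab = torch.randn(4, 1, 64, 64), torch.randint(0, 2, (4,))
opt.zero_grad()
loss = nn.functional.cross_entropy(downstream_head(encoder(x_lab)), y_lab)
loss.backward(); opt.step()
```

Note that only the pretext head is discarded after stage 1; the encoder weights carry the learned features into the downstream task.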
A crucial aspect in the development of a self-supervised model is the choice of the pretext task. The pretext task should be neither too easy nor too hard: if it is too easy the model will not learn enough useful features; if it is too hard the model will not be able to learn it well and the learned features will not be useful. It is also important that the selected pretext task is relevant to the final task, so that the learned features are useful for the final step of the training process. For example, for image classification tasks, a good pretext task would be one that requires the model to understand the structure and content of images. Different computer vision models presented in research papers make use of different pretext tasks, such as relative position prediction [3], Jigsaw puzzle solving [4] and Rubik's cube recovery [5]. In this blog post two different approaches based on the pretext task of Jigsaw puzzle solving will be presented.
Pseudo-labelling
Pseudo-labelling is a technique that consists of automatically generating labels from unlabelled data in order to then train a model in a supervised setting. It is useful in contexts where only little labelled data is available, to increase the size of the training corpus. The most prominent pseudo-labelling methods for segmentation are:
- [6] trains the network on labelled data and then uses the softmax probability maps predicted by the neural network as pseudo-labels.
- [7] generates pseudo-labels using Monte Carlo dropout, which is regarded as an approximation of Bayesian uncertainty.
- [8] also approximates Bayesian uncertainty, through a model ensemble: k networks are trained separately and the softmax probability maps of the networks are averaged to obtain the ensemble uncertainty.
An analysis of these approaches conducted in [9] yielded two relevant observations (a minimal sketch of the two uncertainty estimates follows the list below):
- Approaches based on Monte Carlo dropout yield results that depend too strongly on the choice of dropout.
- Ensemble methods produce the most reliable results, but are computationally expensive.
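The sketch below illustrates the two uncertainty-based pseudo-label generators, assuming a segmentation network that returns per-pixel logits of shape (batch, classes, H, W); the function names and the number of forward passes are illustrative assumptions.

```python
import torch
import torch.nn as nn

def mc_dropout_pseudo_label(model: nn.Module, x: torch.Tensor,
                            passes: int = 8) -> torch.Tensor:
    """Pseudo-labels in the spirit of [7]: keep dropout active at inference
    and average the softmax maps over several stochastic forward passes."""
    model.train()  # keeps nn.Dropout layers stochastic
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(dim=1) for _ in range(passes)])
    return probs.mean(dim=0)

def ensemble_pseudo_label(models: list, x: torch.Tensor) -> torch.Tensor:
    """Pseudo-labels in the spirit of [8]: average the softmax maps of k
    independently trained networks (reliable, but k times the cost)."""
    with torch.no_grad():
        probs = torch.stack([m.eval()(x).softmax(dim=1) for m in models])
    return probs.mean(dim=0)
```

The contrast between the two functions makes the trade-off from [9] visible: the first needs a single model but its output depends on the dropout configuration, while the second needs k fully trained networks.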
Jigsaw puzzle solving
The authors of [4] introduce the Jigsaw puzzle reassembly task, arguing that solving Jigsaw puzzles is a good way to teach a model that an image is made of different parts and what these parts represent. They develop a Context Free Network (CFN), detailed in Fig. 3, which is a Siamese convolutional network in which each branch employs the AlexNet [10] architecture with shared weights. The image is divided into 9 tiles that form a 3x3 grid. These tiles are then rearranged according to a random permutation and given as input to the CFN. The different fc6 outputs are then concatenated into one fully connected layer, and the output of the network is the predicted permutation with index i, represented as a one-hot vector with 1 in the i-th position and 0 in all other positions. In this way each tile is processed individually, and the context is handled only in the last fully connected layers.
Figure 3: Context Free Network
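A minimal sketch of the CFN idea could look as follows; a small CNN stands in for the AlexNet branches, and the permutation set size of 100 is an illustrative choice (the paper instead pre-selects a subset of the 9! permutations, as discussed below).

```python
import itertools
import random
import torch
import torch.nn as nn

# Illustrative permutation set; [4] instead pre-selects permutations with a
# large average Hamming distance (see the shortcut discussion below).
N_PERMS = 100
PERMS = random.sample(list(itertools.permutations(range(9))), N_PERMS)

class ContextFreeNetwork(nn.Module):
    """Siamese network: a shared backbone per tile, context only in the head."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # A small CNN stands in for the shared AlexNet branch of [4].
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        # Plays the role of the layers after the concatenated fc6 outputs.
        self.head = nn.Sequential(
            nn.Linear(9 * feat_dim, 256), nn.ReLU(), nn.Linear(256, N_PERMS),
        )

    def forward(self, tiles: torch.Tensor) -> torch.Tensor:
        # tiles: (batch, 9, 3, H, W); each tile passes through the backbone alone
        feats = [self.backbone(tiles[:, i]) for i in range(9)]
        return self.head(torch.cat(feats, dim=1))  # logits over the permutation set

# Toy training step: shuffle tiles with a known permutation, predict its index.
net = ContextFreeNetwork()
tiles = torch.randn(2, 9, 3, 32, 32)
idx = torch.randint(0, N_PERMS, (2,))
shuffled = torch.stack([t[list(PERMS[i])] for t, i in zip(tiles, idx.tolist())])
loss = nn.functional.cross_entropy(net(shuffled), idx)
loss.backward()
```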
When developing a pretext task it is important to avoid so-called shortcuts. A shortcut is a solution found by the model that exploits features useful for the pretext task but not suitable for the subsequent target tasks. In the paper different shortcuts are analysed and corrected, and an ablation study shows how crucial avoiding these shortcuts is for better performance on a subsequent detection task. The techniques employed to avoid these shortcuts are: using multiple permutations of the same image, increasing the average Hamming distance between permutations, creating gaps between tiles by removing border pixels, normalising each tile independently, and colour jittering.
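Three of these counter-measures (random gaps, independent normalisation, colour jittering) can be sketched directly in the tile extraction step. Tile size, gap size and jitter strength below are arbitrary assumptions, and the image is assumed large enough that a tile plus the maximum gap fits inside each grid cell.

```python
import torch

def extract_jigsaw_tiles(img: torch.Tensor, grid: int = 3, tile: int = 64,
                         max_gap: int = 8, jitter: float = 0.05) -> torch.Tensor:
    """img: (3, H, W) -> (grid*grid, 3, tile, tile) tiles with the
    shortcut counter-measures applied."""
    _, H, W = img.shape
    cell_h, cell_w = H // grid, W // grid
    tiles = []
    for r in range(grid):
        for c in range(grid):
            # Random gap: crop away from the cell border so the model cannot
            # match tiles by the pixels of their shared edges.
            dy = int(torch.randint(0, max_gap + 1, (1,)))
            dx = int(torch.randint(0, max_gap + 1, (1,)))
            t = img[:, r * cell_h + dy : r * cell_h + dy + tile,
                       c * cell_w + dx : c * cell_w + dx + tile]
            # Colour jitter: perturb channels so chromatic aberration is no cue.
            t = t + jitter * torch.randn(3, 1, 1)
            # Independent normalisation: remove absolute intensity cues.
            t = (t - t.mean()) / (t.std() + 1e-6)
            tiles.append(t)
    return torch.stack(tiles)

tiles = extract_jigsaw_tiles(torch.rand(3, 224, 224))  # (9, 3, 64, 64)
```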
Table 1
Table 2
The results of the ablation study are shown in Tab. 1 and 2. As one can see, configurations that do not correct for shortcuts tend to have higher accuracy on the Jigsaw solving task, but lower accuracy on the detection task.
Self-loop uncertainty approach
Figure 4: Pipeline of the FCN
The first approach discussed in this blog post is the one presented in [11], in which the authors introduce a new kind of pseudo-label called self-loop uncertainty. These labels are generated through the optimisation of the network for the pretext task of Jigsaw puzzle solving, and are subsequently used as ground-truth labels for the unlabelled data. The authors aim to outperform the pseudo-labelling methods already present in the literature, by using the self-supervised learning task to make the self-loop uncertainty an approximation of the ensemble uncertainty presented in [8], but at a lower computational cost.
Fig. 4 shows an overview of the developed model: a Fully Convolutional Network (FCN) whose encoder is recurrently optimised by solving the pretext task. First, Q permutations $\{P'_1, \dots, P'_Q\} \subset P'$ are randomly selected, where $P'$ is the set of all possible permutations of the tiles generated by splitting the image into a grid. Unlike the framework in [4], the tiles are randomly rotated by 0, 90, 180 or 270 degrees and assembled back into an image of the same size as the original one before being fed to the network. Then, for each permutation, the encoder weights are updated according to the loss $L_{SS}$ generated by the pretext task. Subsequently the pseudo-labels $y_{sl}$ are calculated as

$$y_{sl} = \sum_{i=1}^{Q} w_i \, T^{-1}(S_i), \qquad w_i = \frac{e^{-l_i}}{\sum_{j=1}^{Q} e^{-l_j}},$$

where $S_i$ is the segmentation prediction of the FCN for the i-th permutation, $l_i$ is the self-supervised loss calculated for the i-th permutation and $T^{-1}$ is the inverse permutation operation. This way the pseudo-label is a weighted average of the segmentation predictions, resembling the ensemble method. The uncertainty-guided loss $L_{UG}$ for the network optimisation with unlabelled data and pseudo-labels $y_{sl}$ is a mean square error computed only over the reliable predictions,

$$L_{UG} = \frac{\sum_{v} \mathbb{1}\left[ y_{sl}(v) > th \right] \big(S(v) - y_{sl}(v)\big)^2}{\sum_{v} \mathbb{1}\left[ y_{sl}(v) > th \right]},$$

where $th$ is the threshold used to select only the reliable predictions as labels. The network is therefore trained by combining the uncertainty-guided loss for unlabelled data with the segmentation loss for labelled data.
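A minimal sketch of these two steps follows. The softmax-over-negative-losses weighting is an assumption consistent with the weighted-average description above (the paper defines the exact weights), and the threshold value is arbitrary.

```python
import torch

def self_loop_pseudo_label(preds: torch.Tensor, losses: torch.Tensor) -> torch.Tensor:
    """preds: (Q, C, H, W) segmentation predictions, already mapped back to the
    original arrangement via the inverse permutation T^-1; losses: (Q,) values
    of the self-supervised loss l_i. The softmax(-l_i) weighting is an assumed
    form of the weighted average described in the text."""
    w = torch.softmax(-losses, dim=0)            # lower pretext loss -> higher weight
    return (w.view(-1, 1, 1, 1) * preds).sum(0)  # (C, H, W) pseudo-label

def uncertainty_guided_loss(pred: torch.Tensor, y_sl: torch.Tensor,
                            th: float = 0.8) -> torch.Tensor:
    """Mean square error restricted to pixels whose pseudo-label exceeds th."""
    mask = (y_sl > th).float()
    return ((mask * (pred - y_sl)) ** 2).sum() / mask.sum().clamp(min=1.0)
```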
Multimodal Jigsaw puzzles approach
Figure 5: Pipeline of the network, using 4 modalities
Because modern diagnostics depend greatly on the analysis of multiple imaging modalities, the authors of [12] developed a self-supervised framework that, instead of considering each modality independently, adapts the pretext task of Jigsaw puzzle solving to deal with multiple modalities at the same time. This approach is called the multimodal Jigsaw puzzle task, and an overview of it is given in Fig. 5. First, the image is divided into N patches in the same way as in [4], with the difference that the patches are taken from different modalities. Each patch is then processed individually, producing as output a feature vector of length N; these vectors are concatenated to obtain an N x N matrix. Unlike the previous approach, the authors decided to make the network solve a permutation task instead of a classification task over permutations, but because a permutation matrix has a single 1 in each row and column and 0 elsewhere, it is not differentiable. However, according to [13], a non-differentiable permutation matrix can be approximated by a doubly stochastic soft permutation matrix with the use of the Sinkhorn operator, which iteratively normalises the rows and columns of the matrix. This soft permutation matrix is then applied to the input of the network to reconstruct the image, which is then compared to the ground truth using the mean square error as loss function. The loss function therefore takes the form

$$\mathcal{L} = \frac{1}{M} \sum_{m=1}^{M} \big\| P_{soft}^{(m)} \, x_{perm}^{(m)} - x^{(m)} \big\|_2^2,$$

where $x_{perm}^{(m)}$ is the m-th permuted input, $P_{soft}^{(m)}$ the soft permutation matrix predicted for it and $x^{(m)}$ the corresponding ground truth.
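The Sinkhorn operator itself is easy to sketch: treat the raw network scores as log-probabilities and alternately normalise rows and columns until the matrix is approximately doubly stochastic. The temperature and iteration count below are assumed hyper-parameters.

```python
import torch

def sinkhorn(scores: torch.Tensor, n_iters: int = 20, tau: float = 1.0) -> torch.Tensor:
    """Turn an (N, N) score matrix into an approximately doubly stochastic soft
    permutation matrix [13] by alternating row and column normalisation
    (done here in log space for numerical stability)."""
    log_p = scores / tau
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # rows sum to 1
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # columns sum to 1
    return log_p.exp()  # differentiable with respect to the input scores

# Toy check: both row and column sums are close to 1.
P = sinkhorn(torch.randn(4, 4))
print(P.sum(dim=0), P.sum(dim=1))
```

Because every step of the iteration is differentiable, the reconstruction error can be backpropagated through the soft permutation matrix into the network.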
Once the model is trained it can be fine-tuned to solve multiple kinds of downstream tasks. In this paper the authors employ this framework for the tasks of brain tumour segmentation, prostate segmentation, liver segmentation, and survival days prediction.
Datasets
The two papers analysed contain experiments based on multiple kinds of downstream tasks that employ different datasets:
- Nuclei segmentation: MoNuSeg Dataset, contains histopathological images with annotated nuclei. The training set size is 30 images and the test set size is 10 images. The resolution of the images is 1000 x 1000.
- Skin lesion segmentation: ISIC Dataset, contains 2594 dermoscopic images of sizes ranging from 1000 x 1000 to 4000 x 3000. Those images are resized to 512 x 512.
- Brain tumour segmentation: BraTS Dataset, contains MRI brain scans, each in 4 different modalities. 285 scans are used for training and 66 for validation.
- Prostate segmentation: Prostate Dataset, contains MRI prostate scans, each in 2 modalities. The training set has 32 images and the testing set has 16 images.
- Liver segmentation: CHAOS Dataset, contains MRI and CT liver scans. 20 images are used for training and 20 for testing.
- Survival Days Prediction (regression task): BraTS Dataset
Results
Table 3
The results of the experiments from the first paper are shown in Tab. 3. The metric used is the F1 score.
On the MoNuSeg dataset, all techniques show a noticeable decrease in performance when the model is trained with less labelled data, but this decrease is smaller for the model proposed in the paper. One can also see that the self-loop model obtains better results than all the other pseudo-labelling methods, and it achieves almost the same performance as the fully supervised one (0.20% difference) using half the labelled data, suggesting that this method could greatly reduce the need for manual annotations.
The results on the ISIC dataset follow the same pattern as the previous ones. Moreover, it is important to notice that even though the results are comparable to those of the ensemble method, the latter takes about 10 times longer for inference.
The results of the second paper are reported in the following tables. The evaluation metrics used are the Dice score, the normalised surface distance (NSD) and the mean square error (MSE).
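For reference, a minimal implementation of the Dice score for binary masks might look like this (the smoothing constant eps is a common implementation convention, not taken from the paper):

```python
import torch

def dice_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks of the same shape."""
    pred, target = pred.bool(), target.bool()
    inter = (pred & target).sum().item()
    return (2 * inter + eps) / (pred.sum().item() + target.sum().item() + eps)
```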
Here one can see how the fine-tuned model outperforms all other current methods on all the downstream tasks it has been tested on, whether they are supervised methods that use only the labelled data (from scratch), the same network developed by the authors but used with one modality only (single-modal), or other multi-modal approaches. The results show that multimodal frameworks are better suited than single-modal methods for medical imaging tasks, and they confirm the effectiveness of the model developed in this paper.
Personal review
In this blog post different self-supervised learning techniques employed in the field of medical imaging segmentation have been examined. This paradigm certainly has some weaknesses, such as easier overfitting to the pretext task, which would lead to poor generalisation. This can be mitigated with proper ablation studies and by choosing the right pretext task. Another limit noticeable in the experiments of the first paper is that the lack of labelled data can make it harder for the model to learn complex and relevant representations of the data compared to supervised models. This limit is however compensated by the fact that, as seen in the results of the first paper, if a self-supervised model is properly developed and trained, it can obtain results similar to those of supervised models while needing far less labelled data, which is more consistent with real-world scenarios. The results of the second paper indicate that taking multiple modalities into account at the same time in the pretext task greatly improves performance. Moreover, it is also shown that the same model structure can work well for different kinds of segmentation tasks as well as for regression, proving its good adaptability to different scenarios.