Kofler, F., et al. Frontiers in Neuroscience 15:752780 [1]

written by: Pablo Crespo

Elevating Segmentation Reliability in the Clinical Routine


Segmentation is one of the most studied problems in medical imaging, with a history dating back to the first X-ray devices. The goal of segmentation is to isolate specific regions or structures of the anatomy, which makes it valuable at many stages of patient care, from diagnosis to treatment planning for radiotherapy or surgery. It can be applied to a wide range of imaging modalities and is now integrated into almost every imaging device in hospitals. Advances in technology have steadily improved the accuracy and speed of segmentation, making it an essential tool for medical professionals.

With the advent of machine learning, techniques such as active contours and support vector machines were proposed to further improve segmentation accuracy. In recent years, deep learning has revolutionized the field of image segmentation, leading to powerful convolutional neural networks (CNNs) that have proven highly effective at segmentation tasks. These deep learning models are trained on a dataset of images paired with their respective segmentation ground truth.

While these models achieve high accuracy in segmenting anatomical structures and anomalies, their reliability in real-world scenarios remains uncertain, as they are not explicitly guided by image properties or by knowledge of the underlying diseases. This opacity of the model's decision-making process raises concerns about the safety and reliability of such methods, especially in biomedical scenarios. In clinical practice, it is crucial to understand the reasoning behind every decision, since decisions directly impact the patient's health. Reliability is therefore a significant obstacle that must be addressed before these methods can be safely adopted in the clinical routine.


Contributions

Numerous algorithms have been proposed for evaluating the quality of segmentations, but they are inherently limited in scope. They often rely on detecting anomalies within CNNs, which requires knowledge of the model architecture and of the specific segmentation task at hand. As a result, they are restricted to particular models and clinical applications, which motivated the development of Robust, Primitive, and Unsupervised Quality Estimation for Segmentation Ensembles [1].

The simplicity of this algorithm makes it applicable to any binary segmentation method that employs model ensembling, a technique in which the final segmentation is constructed by combining the outputs of multiple models. Ensembling is known to enhance segmentation performance and is therefore widely used in state-of-the-art segmentation applications. The algorithm exploits the discordance among the results of the individual models to estimate the segmentation quality in an unsupervised fashion.


Methodology

The method follows these steps to estimate the segmentation quality:

1. Segmentation Fusion

Segmentation Fusion of a 3-model Ensemble

We assume a segmentation solution formed by an ensemble of three models. Each model generates a segmentation proposal, and these proposals are fused into a single segmentation. Two fusion algorithms are proposed:

  • Equally weighted majority voting

For each pixel, the class that appears most frequently among the segmentation candidates is assigned as the final value of that pixel in the fused result (a code sketch follows this list).

  • Selective and Iterative Method for Performance Level Estimation (SIMPLE) [2]:

An iterative fusion method in which poorly performing segmentations are discarded in each iteration, while the remaining ones are fused using majority voting, until convergence.
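To make the first fusion strategy concrete, here is a minimal sketch of equally weighted majority voting for a binary ensemble, assuming the candidate masks come stacked in a NumPy array; the function name and array layout are my own, not the authors' code.

```python
import numpy as np

def majority_voting_fusion(candidates: np.ndarray) -> np.ndarray:
    """Fuse binary segmentation candidates by equally weighted majority voting.

    candidates: array of shape (n_models, ...) with values in {0, 1}.
    Returns the fused binary segmentation of shape (...).
    """
    n_models = candidates.shape[0]
    votes = candidates.sum(axis=0)  # per-pixel count of foreground votes
    # A pixel is foreground in the fusion if a strict majority voted for it.
    return (2 * votes > n_models).astype(np.uint8)
```

For a 3-model ensemble, a pixel is kept exactly when at least two of the three candidates mark it as foreground.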

2. Compute Similarity

Similarity computed between a segmentation candidate and the fused segmentation

A similarity metric is computed to quantify how similar each model output is to the fused segmentation. Two similarity metrics are proposed:

  • Dice Similarity Coefficient (DSC)

DSC sketch [3]

Also known as the Dice score, it is computed as twice the area of the intersection of the two sets divided by the sum of their individual areas. The formula is given by:

DSC(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}

where X is the predicted segmentation, Y is the ground truth, and the vertical bars denote cardinality (i.e., the number of pixels). The result is a number between 0 and 1, representing how similar the areas of the two segmentations are.


  • Hausdorff Distance

Hausdorff Distance sketch [4]

It is the maximum distance from any point in one set to the closest point in the other set. The formula is given by:

HD(X, Y) = \max \left\{ \sup_{x \in X} \inf_{y \in Y} d(x, y), \; \sup_{y \in Y} \inf_{x \in X} d(x, y) \right\}

where d(x, y) is the distance between points x and y.

This distance can be used to compare the similarity of two segmentations, with smaller values indicating higher similarity. It is sensitive to outliers: a small number of pixels that differ strongly between the two segmentations can dominate the distance even when the rest of the pixels agree closely. A code sketch of both metrics follows.
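This is a minimal sketch for binary masks, assuming SciPy is available; the function names and the empty-mask convention for the DSC are my own choices, not the paper's code.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_coefficient(x: np.ndarray, y: np.ndarray) -> float:
    """DSC = 2 * |X intersect Y| / (|X| + |Y|) for two binary masks."""
    intersection = np.logical_and(x, y).sum()
    denominator = x.sum() + y.sum()
    # Convention: two empty masks are considered identical (DSC = 1).
    return 2.0 * intersection / denominator if denominator > 0 else 1.0

def hausdorff_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Symmetric Hausdorff distance between the foreground points of two masks."""
    points_x = np.argwhere(x)  # coordinates of foreground pixels/voxels
    points_y = np.argwhere(y)
    # Maximum of the two directed Hausdorff distances, as in the formula above.
    return max(directed_hausdorff(points_x, points_y)[0],
               directed_hausdorff(points_y, points_x)[0])
```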


3. Define the Alarm Threshold

Let S be the distribution of similarities between each candidate of the ensemble and its respective fused segmentation for our test dataset. The median of S, denoted as median(S), is computed as the middle value of the distribution. The median absolute deviation (mad) of S, denoted as mad(S), is computed as the median of the absolute deviations from the median. An alpha scaling factor is introduced to adjust the magnitude of the mad. The Alarm Threshold is computed as follows:

T = \mathrm{median}(S) - \alpha \cdot \mathrm{mad}(S)

The calculated similarities are compared against this threshold: if a candidate's similarity falls below T, an alarm is raised, indicating that the candidate does not conform to the expected quality standards.
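The threshold translates almost line for line into NumPy. The helper below uses my own naming; SciPy also ships scipy.stats.median_abs_deviation, but spelling out the two medians keeps the definition explicit.

```python
import numpy as np

def alarm_threshold(similarities: np.ndarray, alpha: float = 0.1) -> float:
    """T = median(S) - alpha * mad(S), with mad the median absolute deviation."""
    median = np.median(similarities)
    mad = np.median(np.abs(similarities - median))
    return median - alpha * mad
```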

4. Raise and Accumulate Alarms

An alarm is raised when the similarity metric between a segmentation candidate and its fused segmentation falls below the calculated threshold T. This indicates a high level of discordance between the candidate and the fused segmentation. A high number of alarms for a single case may indicate a higher probability of model failure, and thus human supervision may be necessary to ensure the robustness of the segmentation.
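As a hypothetical end-to-end sketch, the snippet below ties the four steps together for a single case, reusing the helper functions from the previous snippets; the random masks are only placeholders standing in for real ensemble outputs.

```python
import numpy as np

rng = np.random.default_rng(42)
# Three synthetic binary masks standing in for the outputs of a 3-model ensemble.
candidates = rng.integers(0, 2, size=(3, 8, 8, 8)).astype(np.uint8)

fused = majority_voting_fusion(candidates)                                  # step 1
similarities = np.array([dice_coefficient(c, fused) for c in candidates])   # step 2
# In practice S is collected over the whole test set; here we reuse one case.
threshold = alarm_threshold(similarities, alpha=0.1)                        # step 3
alarms = int((similarities < threshold).sum())                              # step 4

if alarms > 0:
    print(f"{alarms} alarm(s) raised -> flag this case for human review")
```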


Experiments

The authors of the article implemented and evaluated the algorithm in two different medical imaging experiments:

MRI Brain Tumor Segmentation

Five different segmentation models are ensembled and tested. All models were developed and trained independently for the BraTS challenge [5]. The algorithm was tested on 68 cases extracted from the Rembrandt dataset [6] and from Klinikum Rechts der Isar patients. Alarm counts were computed using both the Hausdorff distance and the Dice coefficient, while equally weighted majority voting was used as the fusion technique.

Results:

The algorithm was evaluated for different values of the alpha scaling factor; setting alpha to 0.1 resulted in an even distribution of alarm counts across the test set. When using the Dice similarity coefficient, a strong negative correlation between segmentation performance and alarm count was observed, with a Pearson correlation coefficient of -0.77. When using the Hausdorff distance, the Pearson coefficient was -0.46, indicating a lower sensitivity in detecting discordance within the ensemble.

CT Lung Lesion Segmentation

In this case, a 3-model ensemble is constructed based on the MONAI baseline of the COVID-19 Lung CT Lesion Segmentation Challenge [7]. The three models are trained with a slightly different set-up each. The SIMPLE method is now used for fusing the segmentation candidates, and the results are evaluated only with the Dice similarity coefficient.

Results:

Again, the alpha value is set to 0.1, and the experiment shows similar results: a negative correlation between the number of alarms and the Dice score, with a Pearson coefficient of -0.7.

Since both experiments suggest that the alarm count is a good indicator of segmentation quality, I decided to implement the algorithm in my own experiment and run different tests to determine whether it represents a significant improvement in segmentation reliability.

My Implementation

The experiment aims to segment the White Matter from T1-weighted MRI brain volumes using an ensemble of 3 models based on the 3D-UNet architecture, a CNN specifically designed for three-dimensional segmentation. The models were trained on a dataset of 652 cases, with 522, 65, and 65 volumes designated for training, validation, and testing respectively.

The three models in the ensemble differ in their number of trainable parameters and their learning rate. The first and second models have 3 UNet blocks, while the third model has an additional block, increasing the complexity of the network. The first and third models use a learning rate of 1e-4, while the second uses 1e-6. All models were trained for 10 epochs using the voxel-wise binary cross-entropy loss function, and segmentation performance was evaluated using the volumetric Dice similarity coefficient. The quality estimation algorithm was applied to the whole test set for 10 different alpha values.

All the code was implemented in Python, using the PyTorch Lightning framework to design and train the models.
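For illustration, here is a heavily simplified sketch of one ensemble member, using MONAI's 3D UNet as a stand-in for my network definition; the class name, channel configuration, and batch layout are illustrative rather than the exact training code.

```python
import torch
import pytorch_lightning as pl
from monai.networks.nets import UNet  # stand-in 3D U-Net backbone (assumption)

class WhiteMatterSegmenter(pl.LightningModule):
    """One ensemble member; block count and learning rate differ between members."""

    def __init__(self, channels=(16, 32, 64), lr=1e-4):
        super().__init__()
        self.net = UNet(
            spatial_dims=3, in_channels=1, out_channels=1,
            channels=channels, strides=(2,) * (len(channels) - 1),
        )
        self.loss = torch.nn.BCEWithLogitsLoss()  # voxel-wise binary cross-entropy
        self.lr = lr

    def training_step(self, batch, batch_idx):
        image, mask = batch  # (B, 1, D, H, W) T1 volume and white-matter mask
        loss = self.loss(self.net(image), mask.float())
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)
```

The third ensemble member would pass one more entry in channels (an extra UNet block), and the second member a learning rate of 1e-6.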

Results:

After training, the ensemble achieved an average DSC of 0.93 on the validation set. The algorithm was then applied to the test set; its computation time was under 30 seconds per alpha value for the entire test set. The most uniform distribution of alarm counts was observed with alpha = 0.1. As expected, a negative correlation between alarm count and Dice score was observed, with a Pearson coefficient of -0.65, confirming the results of the previous experiments.

Segmentation performances vs. alarm counts.

These boxplots show the alarm counts against segmentation performance (DSC). High-performing segmentations raise few alarms, while cases with a lower Dice score tend to generate higher alarm counts, which is consistent with the negative Pearson coefficient.
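For completeness, such a correlation can be checked in a couple of lines with SciPy; the arrays below are made-up numbers for illustration, not my actual results.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-case values: alarm count and ensemble DSC for eight test cases.
alarm_counts = np.array([0, 0, 1, 1, 2, 3, 4, 5])
dice_scores = np.array([0.96, 0.94, 0.92, 0.90, 0.86, 0.80, 0.74, 0.65])

r, p = pearsonr(alarm_counts, dice_scores)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")  # a clearly negative r is expected
```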

Discussion

In this post, we reviewed a primitive and universal method for estimating the quality of segmentation ensembles in the clinical routine. This is a significant challenge in the field, as medical practitioners currently lack a reliable and standardized method to evaluate the accuracy of automatic segmentations in clinical practice.

The developers of this algorithm demonstrated its efficacy in two different segmentation applications. Furthermore, I implemented the method in another experiment, obtaining results that lead to the same conclusion. The algorithm's low computational requirements make it easy to integrate into state-of-the-art pipelines. Since the alarm count is based solely on the discordance among the models in the ensemble, the method is not limited to specific architectures and requires no prior task-specific knowledge. Additionally, the sensitivity of the alarm threshold can be fine-tuned for specific problems at test time, allowing for greater generalization.

However, it is important to note that the implementation of the experiments could be improved. A more robust approach would be to tune the alpha value on the validation set and only then evaluate the method's performance at test time. This would guarantee scientifically valid results and prevent choosing the alpha value that happens to produce the best results.

The proposed solution provides prioritization for physicians, indicating which cases are most likely to involve model failure, thereby making better use of human resources by focusing them on the toughest cases. A limitation of the method is its exclusive focus on discordance within the ensemble: if all models converge on the same error, the case cannot be detected as problematic. However, no such case was found in any of the three experiments, and the probability of encountering one decreases as larger ensembles are used, which is what most real-world scenarios employ.

I believe the algorithm should be investigated further and applied to a wider spectrum of applications to decide whether its quality estimation is accurate enough to be deployed in the clinical routine. Still, it is remarkable that such a simple algorithm achieved good performance in all three experiments.


References

[1] Kofler, F., Ezhov, I., Fidon, L., Pirkl, C. M., Paetzold, J. C., Burian, E., Pati, S., El Husseini, M., Navarro, F., Shit, S., Kirschke, J., Bakas, S., Zimmer, C., Wiestler, B., & Menze, B. H. (2021). Robust, Primitive, and Unsupervised Quality Estimation for Segmentation Ensembles. Frontiers in Neuroscience, 15. https://doi.org/10.3389/fnins.2021.752780

[2] Langerak, T. R., van der Heide, U. A., Kotte, A. N. T. J., Viergever, M. A., van Vulpen, M., & Pluim, J. P. W. (2010). Label Fusion in Atlas-Based Segmentation Using a Selective and Iterative Method for Performance Level Estimation (SIMPLE). IEEE Transactions on Medical Imaging, 29(12), 2000–2008. https://doi.org/10.1109/TMI.2010.2057442

[3] Understanding Dice Coefficient. (2020, December 26). Kaggle. https://www.kaggle.com/code/yerramvarun/understanding-dice-coefficient

[4] Pellerin, J. (2014). Accounting for the geometrical complexity of geological structural models in Voronoi-based meshing methods. https://doi.org/10.13140/RG.2.1.2719.2169

[5] Menze, B. H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., et al. (2015). The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Transactions on Medical Imaging, 34, 1993–2024. https://doi.org/10.1109/TMI.2014.2377694

[6] Gusev, Y., Bhuvaneshwar, K., Song, L., Zenklusen, J.-C., Fine, H., & Madhavan, S. (2018). The REMBRANDT study, a large collection of genomic data from brain cancer patients. Scientific Data, 5, 180158. https://doi.org/10.1038/sdata.2018.158

[7] Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., et al. (2013). The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. Journal of Digital Imaging, 26, 1045–1057. https://doi.org/10.1007/s10278-013-9622-7
