This blog post corresponds to the paper "Multi-task learning for the segmentation of organs at risk with label dependence" by Tao He, Junjie Hu, Ying Song, Jixiang Guo and Zhang Yi.



Introduction

Nowadays, the delineation of organs at risk for radiation therapy is critical in order to obtain good results. As the image shows, a precise delineation, in this case of a brain tumor, is critical in order to resect as much tumoral material as possible without damaging the surrounding areas. The problem is that the vast majority of this delineation is done manually by experts; it is a very repetitive and time-consuming task that requires the expertise of a radiologist, making it a very expensive step in radiation therapy. Also, as experience differs from one expert to another, the results vary and lack reproducibility, since the process is completely subjective. [1] [2] [3]


Figure 1. Proton beam therapy [4]


To tackle these issues, automatic segmentation has gained ground in recent years, achieving promising results. In this field, Fully Convolutional Networks (FCNs) yield the best results, and their development for the segmentation task is continuously improving.


So, what is the proposal of this paper?

This paper's goal is to obtain results similar to state-of-the-art approaches for the segmentation of organs at risk while reducing the time complexity that these previous approaches incur as they increase network capacity by training multiple models. The specifics of the proposal are the following:

  • An MTL architecture plus a combined loss function that allows learning the classification and the segmentation of the organs at the same time
  • A False Positive Filtering (FPF) algorithm with dynamic threshold selection (DTS)
  • A new dataset of real clinical CT scans: TAOWCH


Methodology

The image below shows the particular configuration of the proposed encoder-decoder network, which outputs both the segmentation and the multi-label classification result of the MTL architecture.


Figure 2. Encoder-decoder structure

Encoder

The encoder is particularly flexible: combining ResNet and DenseNet blocks in different ensembles enables multiple configurations that can be adapted to a given problem. The only modification is in the DenseNet blocks, where the last transition block is omitted. These blocks are initialized with transfer learning (TL), using weights pre-trained on ImageNet.

Decoder

In the decoder, an asymmetric architecture is deployed. The asymmetry comes from using simpler blocks with fewer parameters than in the encoder. By experimental observation, the authors determined that a shallower decoder does not affect the segmentation accuracy, while it drastically reduces the number of parameters and speeds up training. A minimal sketch of such a two-headed encoder-decoder follows.
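Since the paper describes the architecture only at block level, here is a minimal PyTorch sketch of a shared encoder with a segmentation decoder and a classification head. The DenseNet121 backbone matches the paper, but the decoder depth, channel widths and head design are illustrative guesses, and the omission of the last transition block is not reproduced.

```python
import torch
import torch.nn as nn
from torchvision import models

class MTLSegNet(nn.Module):
    """Sketch of the MTL encoder-decoder: one shared encoder, two heads."""

    def __init__(self, num_organs: int = 4):
        super().__init__()
        # ImageNet-pretrained encoder (transfer learning); torchvision >= 0.13.
        # The paper also omits DenseNet's last transition block, a detail
        # skipped here for brevity.
        self.encoder = models.densenet121(weights="IMAGENET1K_V1").features

        # Asymmetric decoder: far fewer parameters than the encoder, since a
        # shallower decoder was observed not to hurt segmentation accuracy.
        def up(cin: int, cout: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )

        self.decoder = nn.Sequential(
            up(1024, 256), up(256, 128), up(128, 64), up(64, 32), up(32, 16),
            nn.Conv2d(16, num_organs, kernel_size=1),  # per-organ segmentation logits
        )

        # Multi-label classification head: is each organ present in the slice?
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(1024, num_organs),
        )

    def forward(self, x: torch.Tensor):
        feats = self.encoder(x)  # shared features, (B, 1024, H/32, W/32)
        return self.decoder(feats), self.classifier(feats)
```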

False Positive Filtering algorithm

In this type of task, the true positive rate (TPR) of the classification is usually greater than the Dice score of the segmentation. The False Positive Filtering (FPF) algorithm uses this observation to apply a single, very simple rule: always trust the classification over the segmentation. This results in two possible scenarios (a sketch follows the list):

  • An organ is segmented and classified as present in the slice: FPF forwards the segmented image.
  • An organ is segmented but classified as not present: FPF removes the mask for that specific organ (Figure 3).
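The rule is simple enough to sketch in a few lines. In this hypothetical NumPy version, seg_masks holds per-organ binary masks and cls_probs the per-organ classification probabilities; the array layout and the default threshold are assumptions, not the paper's code.

```python
import numpy as np

def false_positive_filtering(seg_masks: np.ndarray,
                             cls_probs: np.ndarray,
                             threshold: float = 0.5) -> np.ndarray:
    """FPF sketch: seg_masks has shape (num_organs, H, W),
    cls_probs has shape (num_organs,)."""
    filtered = seg_masks.copy()
    for organ in range(seg_masks.shape[0]):
        # Trust the classification over the segmentation: if the classifier
        # says the organ is absent from the slice, remove its mask entirely.
        if cls_probs[organ] < threshold:
            filtered[organ] = 0
    return filtered
```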


Figure 3. FPF algorithm in case of label not present


Dynamic Threshold Selection

Although the FPF algorithm is easy to implement and improves the results, it presents some problems:

  • True positives may be filtered out if the classification TPR is not much higher than the segmentation accuracy.
  • The classification will tend to overfit if the model checkpoint is selected based on its segmentation accuracy alone.

To overcome these issues, Dynamic Threshold Selection (DTS) is introduced. The schematic of this process is the following (a sketch of the selection loop comes after the list):

  1. After each epoch, keep all the original classification values after the softmax is applied, but without any threshold
  2. Iterate over a range of possible thresholds and plot the true positive rate against the false positive rate, i.e. the ROC curve
  3. Choose the threshold that obtains the highest true positive rate, subject to a minimum value tmin that is received as an input of the algorithm
  4. Keep that threshold value for the following epoch
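A hypothetical NumPy sketch of steps 2-4 follows. The candidate grid, the fallback value and the tie-breaking rule (prefer the largest qualifying threshold, which filters more false positives) are assumptions; the paper only specifies tmin, set to 0.995 in the experiments.

```python
import numpy as np

def dynamic_threshold_selection(probs: np.ndarray,
                                labels: np.ndarray,
                                t_min: float = 0.995,
                                num_candidates: int = 100) -> float:
    """DTS sketch over flattened per-organ classification probabilities
    and binary ground-truth presence labels."""
    best_t, best_tpr = 0.5, 0.0          # fallback if no threshold reaches t_min
    positives = labels == 1
    for t in np.linspace(0.0, 1.0, num_candidates):
        preds = probs >= t
        tpr = np.sum(preds & positives) / max(np.sum(positives), 1)
        # Keep the largest threshold whose TPR is maximal and above t_min.
        if tpr >= t_min and tpr >= best_tpr:
            best_tpr, best_t = tpr, t    # reused for the following epoch
    return best_t
```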

Figure 4. Dynamic threshold selection schematic

Classification loss: Weighted mean cross entropy

The main proposal of this paper is the weighted mean cross entropy (WMCE) loss function. The idea behind it is to exploit the spatial relationship between organs in order to boost performance.

The raw output of the network is updated with the average of the probabilities of pairs of organs that appear together in the same slice, weighted by how often this relationship shows up in the dataset itself. But how is this conditional probability measured? It is calculated by dividing the number of slices that show both organs together (Nij) by the number of slices in the dataset that contain a specific organ (Ni). The final result (yj) gives the updated classification value for an organ. Feeding this output into a standard binary cross entropy loss defines the final loss function (F) for the system.


Figure 5. Equations for WMCE. xi is the network output, with wik the weights of the last layer, sk the output of the previous layer and bi the bias; p(j|i) is the conditional probability, pij the joint probability of organs i and j, and pi the frequency of organ i in the dataset.
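As the equations live only in the figure, here is a hedged PyTorch sketch of the idea: estimate p(j|i) = Nij / Ni from the training labels once, then mix each organ's predicted probability with the conditionally weighted mean of the others before a standard BCE. The exact mixing rule in the paper may differ from this reading.

```python
import numpy as np
import torch
import torch.nn.functional as F

def conditional_matrix(labels: np.ndarray) -> np.ndarray:
    """Estimate p(j|i) = N_ij / N_i from a (num_slices, num_organs)
    binary label matrix."""
    n_ij = labels.T @ labels                 # co-occurrence counts N_ij
    n_i = np.clip(np.diag(n_ij), 1, None)    # per-organ slice counts N_i
    return n_ij / n_i[:, None]               # row i holds p(j|i)

def wmce_loss(logits: torch.Tensor,
              targets: torch.Tensor,
              cond: torch.Tensor) -> torch.Tensor:
    """WMCE sketch for (batch, num_organs) classification logits."""
    probs = torch.sigmoid(logits)                   # per-organ presence probabilities
    # Weighted mean of the organs' probabilities, using p(j|i) as weights.
    weighted = (probs @ cond) / cond.sum(dim=0)
    updated = 0.5 * (probs + weighted)              # updated output y_j
    return F.binary_cross_entropy(updated.clamp(1e-6, 1 - 1e-6), targets)

# Usage: build the conditionals once from the (hypothetical) training labels.
# cond = torch.tensor(conditional_matrix(train_labels), dtype=torch.float32)
```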

Segmentation loss: Soft Dice

For the segmentation, the loss function used is the Soft Dice (SD) loss, which penalizes the portion of the prediction that does not overlap with the ground truth.


Figure 6. Soft dice loss for segmentation
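Since only the equation is given in the figure, the following is a common Soft Dice implementation for per-organ probability maps; the smoothing constant eps is an implementation detail, not taken from the paper.

```python
import torch

def soft_dice_loss(probs: torch.Tensor,
                   target: torch.Tensor,
                   eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss for (batch, num_organs, H, W) probability maps."""
    dims = (2, 3)                                    # sum over spatial dimensions
    intersection = (probs * target).sum(dims)
    union = probs.sum(dims) + target.sum(dims)
    dice = (2 * intersection + eps) / (union + eps)  # per-organ soft Dice score
    return 1 - dice.mean()                           # minimize the non-overlap
```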


Experiments

Two datasets are used in this paper: segTHOR [5] and the newly introduced TAOWCH. The most important characteristic of the TAOWCH dataset is that the images are captured after cancer surgery. The segTHOR dataset consists of 40 labeled thoracic CT scans (esophagus, heart, trachea, aorta), while TAOWCH contains 49 labeled thoracic and abdominal CT scans (heart, liver, lung, kidney).


Figure 7. Image example from the segTHOR dataset


The configuration for the experiments is the following (a sketch of the optimizer setup comes after the list):

  • Image configuration: 2.5D CT images (3 adjacent axial slices)
  • Optimization method: SGD + momentum 0.9
  • Learning rate: 0.01 + decay 0.95 / epoch
  • Loss function segmentation: Soft dice loss function
  • Loss function classification
    • Binary cross entropy (BCE)
    • Weighted mean cross entropy (WMCE)
  • FPF + DTS: tmin = 0.995
  • Data augmentation
    • Horizontal/vertical random flipping
    • Cropping factor: 0.6 - 1
  • Metrics:
    • Classification: macroTPR (average of TPR of all classes)
    • Segmentation: Jaccard, Dice, Hausdorff and FPR
  • Training time: 8h (segTHOR) / 7h (TAOWCH) on 4x Nvidia Titan XP GPUs
  • Validation, testing: Ensemble voting for final segmentation
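For reference, here is a minimal PyTorch rendering of the published optimizer settings. The stand-in model, the epoch count and the choice of ExponentialLR (to match the 0.95 per-epoch decay) are assumptions.

```python
import torch

model = torch.nn.Conv2d(3, 4, kernel_size=1)  # placeholder for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # SGD + momentum 0.9
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

num_epochs = 50  # placeholder; the paper reports wall-clock time, not epochs
for epoch in range(num_epochs):
    ...               # forward pass, Soft Dice + WMCE losses, optimizer.step()
    scheduler.step()  # decay the learning rate by 0.95 each epoch
```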


First, the authors checked whether transfer learning from networks pre-trained on ImageNet for the DenseNet and ResNet blocks was successful in this particular setup. As the curve in the following figure shows, the use of TL boosts the performance and accelerates the training process.


Figure 8. Performance of using TL

To validate the new WMCE loss function, it was compared with a regular BCE loss in order to check whether the conditional dependence makes a difference for the segmentation of organs at risk. The results show that the WMCE loss outperforms the regular BCE, so the conditional probability does not harm but rather improves the confidence of the network.


Table 1. TPR on encoder-decoder network with ResNet101 and DenseNet121 encoders (segTHOR).


After validating the potential of the WMCE loss, the following experiment aims to determine which configuration is the most suitable, considering all the proposals introduced and comparing them to a normal single model. The following table shows the results of the metrics on the segTHOR dataset.



Table 2. Experimental results on the segTHOR dataset. Different configurations using multi-task learning (MTL), false positive filtering (FPF), dynamic threshold selection (DTS) and the weighted mean cross entropy (WMCE) loss.


Comparing the results for the ResNet and DenseNet blocks shows that DenseNet121 obtains better results for most of the metrics, but as they are really similar, ResNet could still be used in the ensemble if the developer considers it more appropriate for the specifics of the task. The Jaccard and Dice coefficients reach their best values when all the methods are combined. Finally, in configuration number 4 the FPR is halved compared to the single model.

The results of applying the different configurations to the segTHOR dataset can be visualized in the following picture, where SSM corresponds to the single model.


Figure 9. Segmentation results on segTHOR


A study similar to the previous one is executed, this time on the TAOWCH dataset. The network model to which the different configurations are applied also differs, in order to demonstrate that they improve the results even with encoder-decoder structures other than the proposed one. In the case of DeepLabV3, all the metrics improve when using all the strategies offered in this paper.


Table 3. Experimental results on the TAOWCH dataset


The results of applying the different configurations to the TAOWCH dataset can be visualized in the following picture.


Figure 10. Segmentation results on TAOWCH


To validate the effect of the label dependence individually for every organ, the Dice and Jaccard coefficients together with the FPR are shown in the following figure. As expected, all values for organs that show a strong dependence on others improve compared with the regular BCE loss. Only the kidney does not show a significant improvement, which is consistent with the previous results and confirms that the label dependence works as expected.


Figure 11. Improvements of WMCE over BCE


Lastly, as the objective of this proposal is mainly to reduce the training time without affecting the performance, the results are compared with a single-model proposal [6] that achieved state-of-the-art results. The following table shows that the performance in most of the metrics is very similar to the state-of-the-art proposal, while the mean training time per epoch is reduced by a factor of four.


Table 4. Comparison with single model approach


Discussion and conclusion

On the authors' side, they conclude that the combined use of the proposed strategies improves the overall performance. The segmentation time is reduced compared to other widely used methods such as two-step or ensemble voting approaches, with an execution time comparable to a single model. In particular, the WMCE loss improves the performance of the multi-label classification by assigning a dependence rule to all the organs. It can be generalized to any problem where the classes have a strongly dependent relationship and where the training and testing sets have a similar conditional probability. As future work, they propose to generalize the conditional probability to all the organs, obtaining a joint distribution over conditionally dependent variables. They would also like to apply these results to other machine learning tasks in fields such as text, image or acoustic analysis.

On my side, I was glad to have the opportunity to read and analyze a proposal that seems rather intuitive yet effective, with common sense applied at every step in order to obtain the expected results. I was also impressed by how completely the experimental environment and hyperparameters are specified, leaving nothing behind that would hinder a good replication of the results. The validation process also shows that they invested a lot of time and effort in checking that every decision had a positive impact on the final result.

The cons that I have found are few but, in my opinion, important. In some cases they do not explain why they chose a specific strategy: for example, they do not show a preliminary study of why the TPR of the classification is higher than the Dice score of the segmentation in the FPF algorithm, nor do they justify the critical decision of using a shallower decoder, relying only on experimental observation without actual values to understand why it does not affect the segmentation performance.

Less important are the facts that TAOWCH is not publicly available and that I expected the conclusions to connect the final results back to radiation therapy; instead, they are mostly a summary of the previous steps.


Bibliography

[1] Baskar, R., Lee, K. A., Yeo, R., & Yeoh, K. (2012). Cancer and Radiation Therapy: Current Advances and Future Directions. International Journal of Medical Sciences, 9(3), 193-199. doi:10.7150/ijms.3635

[2] Gianfaldoni, S., Gianfaldoni, R., Wollina, U., Lotti, J., Tchernev, G., & Lotti, T. (2017). An Overview on Radiotherapy: From Its History to Its Current Applications in Dermatology. Open Access Macedonian Journal of Medical Sciences, 5(4), 521-525. doi:10.3889/oamjms.2017.122

[3] Ramkumar, A., Dolz, J., Kirisli, H. A., Adebahr, S., Schimek-Jasch, T., Nestle, U., . . . Song, Y. (2015). User Interaction in Semi-Automatic Segmentation of Organs at Risk: A Case Study in Radiotherapy. Journal of Digital Imaging, 29(2), 264-277. doi:10.1007/s10278-015-9839-8

[4] Unknown Photographer. (2001). Proton Beam Therapy [A proton beam from the brain during CT scans. Using dozens of CT slices, a computer produced this three-dimensional representation of the eyes and optic nerves (blue and green), the brain stem (green), and the tumor (red). The yellow line shows the proton beam field-shaping aperture.]. Retrieved 2020, from https://visualsonline.cancer.gov/details.cfm?imageid=2421

[5] Lambert, Z., Petitjean, C., Dubray, B., & Ruan, S. (2019). SegTHOR: Segmentation of Thoracic Organs at Risk in CT images. LMI, INSA Rouen, France.

[6] Han, M., Yao, G., Zhang, W., Mu, G., Zhan, Y., Zhou, X., & Gao, Y. (2019). Segmentation of CT Thoracic Organs by Multi-resolution VB-nets. In Proceedings of the Challenge on Segmentation of THoracic Organs at Risk in CT Images.
