This blog post summarises and discusses the paper "Image to Images Translation for Multi-Task Organ Segmentation and Bone Suppression in Chest X-Ray Radiography" by M. Eslami, S. Tabarestani, S. Albarqouni, E. Adeli, N. Navab, M. Adjouadi (IEEE TMI 2020). 




1. Introduction

Chest X-ray (CXR) imaging is among the most important imaging technologies for diagnosing lung- and heart-related diseases. To improve the accuracy of CXR-based diagnosis, computer-based methods are used increasingly often. This paper aims at improving two tasks in the field of CXR analysis: a) organ segmentation, i.e., segmenting the left lung, heart, and right lung in the CXR image, and b) bone suppression, the process of removing bone shadows from CXR images (see Figure 1), which makes it easier to examine the image for signs of disease and often serves as a pre-processing step for further computer-based analysis.

Figure 1. Examples of a Chest-X-Ray-Image (top), and the corresponding Organ-Segmented image (bottom left) as well as bone-suppressed image (bottom right).

For both of these tasks, a range of methods already exists, many of them based on deep neural networks. However, these existing models have been trained narrowly on only one of the two tasks, and therefore do not exploit the fact that the two tasks are actually very similar. This is where the presented paper comes in: the authors make use of a concept called "Multi-Task Learning". While in everyday life, multi-tasking is often perceived as a bad thing because humans find it hard to do several things at once, in machine learning it often brings benefits: it can improve the performance of all learned tasks and may also be more efficient (in terms of the number of parameters or training time) than training multiple separate single-task models. [2,8]

The goal of the authors is therefore to train a single model that performs both of the above tasks, organ segmentation and bone suppression, jointly, in order to improve model performance and efficiency.

Figure 2. Left: Existing methods are separate single-task models. Right: Proposed "Multi-Task" model that performs both tasks simultaneously.

In order to train such a model, the authors extend Pix2Pix, a popular image-to-image translation model, to perform image-to-images translation, i.e., to translate an input image x (the CXR image) into two output images y_1 and y_2 (the organ-segmented and the bone-suppressed image).


2. Related Work

The authors base their architecture heavily on Pix2Pix [5], a popular image-to-image translation model. Pix2Pix is a conditional GAN (cGAN) and therefore consists of a generator, which creates the "translated" output image, and a discriminator, which classifies whether the generated image looks realistic.

More specifically, the Pix2Pix generator is an encoder-decoder network of convolutional layers connected via skip connections, similar to the famous U-Net architecture.

The discriminator of the Pix2Pix model is often referred to as a "PatchGAN" CNN, because it does not classify the whole image as "real" or "fake"; instead, it classifies different patches of the image. Therefore, instead of a single value, we obtain a matrix of predictions, where each value corresponds to how realistic the corresponding patch looks.

Figure 3: The Pix2Pix Discriminator "PatchGAN"
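To make the patch-wise output more concrete, below is a minimal PyTorch sketch of a PatchGAN-style discriminator. It is my own simplified illustration (layer count, channel widths, and input size are placeholders), not the exact configuration used in Pix2Pix or in the paper.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Outputs a grid of real/fake scores, one per image patch, instead of a single scalar."""
    def __init__(self, in_channels=6, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, base, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
            nn.BatchNorm2d(base * 2),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 2, 1, 4, stride=1, padding=1),  # 1 score per patch
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x))  # (N, 1, H', W') matrix of patch predictions

# A 6-channel input (input image and output image stacked) yields a score map, not a single value:
scores = PatchDiscriminator()(torch.randn(1, 6, 256, 256))
print(scores.shape)  # torch.Size([1, 1, 63, 63])
```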

3. Methodology

Figure 4: The proposed architecture

In order to train a multi-task model that jointly performs the organ-segmentation and bone-suppression tasks, the authors extend the existing Pix2Pix architecture to an image-to-images-translation model which generates two outputs, one for each task. The complete architecture is depicted in Figure 4 and is explained in the following:


Generator

The chest X-ray image serves as the input X to the generator network. Even though CXR images are usually grayscale and therefore consist of only a single intensity channel, the authors replicate that channel in order to match the 3-channel input of the Pix2Pix network. While the authors do not elaborate further on this choice, a probable explanation is that they wanted to stay as close as possible to the original Pix2Pix architecture, which has proved to perform very well for 3-channel (RGB) images.

For the most part, the generator network is identical to the original Pix2Pix model. However, one important change the authors made is the introduction of "dilated" convolutions [3] with a dilation factor of l=2 in the generator. The authors motivate this modification with the enlarged receptive field, which proved beneficial in an ablation study they performed.
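As a quick illustration of what this change means in code (not the authors' actual layer definitions), a dilated convolution simply spreads the kernel taps apart, so a 3×3 kernel with dilation 2 covers a 5×5 area without adding parameters:

```python
import torch.nn as nn

# Standard 3x3 convolution: receptive field of 3x3 per layer.
standard_conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Dilated 3x3 convolution with dilation factor l=2: effective receptive field of 5x5,
# same number of weights; padding=2 keeps the spatial output size unchanged.
dilated_conv = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)
```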

In order to generate two images instead of one, the authors also increase the output of the network from 3 to 6 channels, such that the output of the generator is a concatenation Y = Y_1 || Y_2, where Y_1 is the organ-segmented image and Y_2 is the bone-suppressed image.
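Putting the input and output handling together, a sketch of the data flow around the generator could look as follows (variable names are mine, and the plain convolution is only a stand-in for the actual modified Pix2Pix generator):

```python
import torch
import torch.nn as nn

# Stand-in for the real generator: maps a 3-channel input to a 6-channel output.
generator = nn.Conv2d(3, 6, kernel_size=3, padding=1)

x_gray = torch.randn(1, 1, 512, 512)            # single-channel CXR image
x = x_gray.repeat(1, 3, 1, 1)                   # replicate the intensity channel -> 3 channels

y_hat = generator(x)                            # 6 channels: Y_1 || Y_2 stacked along dim 1
y1_hat, y2_hat = torch.split(y_hat, 3, dim=1)   # organ-segmented and bone-suppressed image
```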


Discriminator

The discriminator is based on the previously described PatchGAN discriminator from Pix2Pix. The only adjustment the authors made here was to increase the input size from 6 to 9 channels. More specifically, the input tensors to the discriminator are:

  • a concatenation of the CXR image and both generated images, F = X || \hat{Y}_1 || \hat{Y}_2, for the fake pairs

  • or a concatenation of the CXR image and both target images, R = X || Y_1 || Y_2, for the real pairs

As in Pix2Pix, the outputs of the discriminator are matrices D_F and D_R, corresponding to the realism of the respective input tensors.
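Assembling the 9-channel inputs amounts to a simple channel-wise concatenation; a small sketch under the notation above (the tensor contents are random placeholders):

```python
import torch

# Placeholders standing in for the CXR image, the two targets, and the two generator outputs.
x, y1, y2 = (torch.randn(1, 3, 512, 512) for _ in range(3))
y1_hat, y2_hat = (torch.randn(1, 3, 512, 512) for _ in range(2))

real_pair = torch.cat([x, y1, y2], dim=1)           # R = X || Y_1 || Y_2             -> (N, 9, H, W)
fake_pair = torch.cat([x, y1_hat, y2_hat], dim=1)   # F = X || Y_hat_1 || Y_hat_2     -> (N, 9, H, W)

# d_real = discriminator(real_pair)   # D_R: matrix of patch scores for the real pair
# d_fake = discriminator(fake_pair)   # D_F: matrix of patch scores for the fake pair
```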


Training

In every training iteration, the generator creates the output images for the given CXR input; then the discriminator classifies the real and fake pairs regarding how realistic they look, and the weights of both networks are updated according to the following losses:

The generator loss aims at maximizing the discriminator output for the generated images, because a high value here means that the generator is successfully "fooling" the discriminator. Furthermore, the loss contains the L1 distance between the generated images and their corresponding target images, weighted by a factor λ:

L_G = \mathop{\mathbb{E}} [- \log (D_F + \epsilon)] + \lambda \mathop{\mathbb{E}}[|Y-\hat{Y}|_1]


The discriminator loss is a typical GAN loss that aims at maximizing the discriminator output for the real pairs and minimizing it for the fake pairs:

L_D = \mathop{\mathbb{E}} [-( \log (D_R + \epsilon) + \log(1 - D_F + \epsilon))]
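Written out in code, the two losses could be implemented roughly as follows (a sketch; the values of λ and ε are placeholders and not taken from the paper, and D_R, D_F are assumed to lie in (0, 1), e.g. after a sigmoid):

```python
import torch

eps = 1e-8     # numerical stability term (placeholder value)
lam = 100.0    # L1 weight lambda (placeholder value)

def generator_loss(d_fake, y_hat, y):
    adv = -torch.log(d_fake + eps).mean()      # maximize D's output on generated images
    l1 = torch.abs(y - y_hat).mean()           # L1 distance to the concatenated targets Y_1 || Y_2
    return adv + lam * l1

def discriminator_loss(d_real, d_fake):
    # push D_R towards 1 for real pairs and D_F towards 0 for fake pairs
    return -(torch.log(d_real + eps) + torch.log(1.0 - d_fake + eps)).mean()
```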

4. Experiments & Results

Dataset

Obtaining a suitable dataset for the proposed method is quite challenging, because the dataset must contain target images for both tasks for every sample.

For this reason, the JSRT dataset [9] is used, which consists of 247 CXR images and is the only public dataset that contains targets for both tasks:

  • Segmentation masks were created later by Van Ginneken et al. [10]

  • For the bone suppression targets, no real ground truth is available; however, the results of another well-performing method [6] are available for the JSRT dataset.

The images are downscaled to a resolution of 512×512 pixels, which was shown to maintain the quality of the results, and data augmentation via rotation and translation is applied to increase the number of samples by a factor of 5.
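A possible way to realize such an augmentation (the exact rotation and translation ranges here are my assumption, not values reported in the paper):

```python
import torchvision.transforms as T

# Small random rotations and translations; in practice the same transform parameters
# must be applied to the CXR image and to both target images (e.g. via torchvision's
# functional API), so that inputs and targets stay aligned.
augment = T.Compose([
    T.RandomRotation(degrees=5),
    T.RandomAffine(degrees=0, translate=(0.05, 0.05)),
])
```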

Task 1: Organ Segmentation

For the evaluation of the first task, organ segmentation, the authors use standard overlap-based metrics (Jaccard index, Dice coefficient, false-positive rate, and false-negative rate) and compare their results to the performance of the well-known U-Net model [7].
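For reference, the two main overlap metrics for binary masks in their standard form (not the authors' evaluation code):

```python
import numpy as np

def dice(pred, target):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|) for binary masks."""
    intersection = np.logical_and(pred, target).sum()
    return 2.0 * intersection / (pred.sum() + target.sum())

def jaccard(pred, target):
    """Jaccard index (IoU): |A ∩ B| / |A ∪ B| for binary masks."""
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return intersection / union
```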

Figure 5. Organ Segmentation Results.
Left: Quantitative results. (the proposed architecture is depicted as "pix2pix MTdG", which stands for "Multi-Task + dilated Generator")
Right: Qualitative results for the best and worst sample in the dataset.


The results are depicted in Figure 5. It is clearly noticeable that the proposed architecture outperforms the baseline in every metric. To give an impression of the results, the qualitative results for the best and the worst sample in the dataset are also depicted in Figure 5 (right).

Task 2: Bone Suppression

Figure 6. Bone Suppression Results. Left: Quantitative results. Right: Qualitative results for the best and worst sample in the dataset.

For the evaluation of the bone suppression task, the authors use two metrics: the root-mean-square error (RMSE) between the generated bone-suppressed images and their targets, and the Structural Similarity Index (SSIM). The latter indicates how similar two images are, with a maximum value of 1 ("images are structurally identical") and a minimum value of 0 ("no structural similarity").
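In their standard form, the two metrics can be computed as follows (a sketch using scikit-image for SSIM; this is not the authors' evaluation code):

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def rmse(pred, target):
    return np.sqrt(np.mean((pred - target) ** 2))

def ssim_score(pred, target):
    # An SSIM close to 1 means the two images are structurally almost identical.
    return ssim(pred, target, data_range=target.max() - target.min())
```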

As a baseline, the authors compare their results to an autoencoder network, which they state is the only other publicly available open-source implementation of a bone suppression model. [4]

The results are depicted in Figure 6. Again, it is clearly noticeable that the proposed method outperforms the baseline in both metrics.


5. Analysis & Conclusion

As shown above, the proposed multi-task model indeed outperforms the baselines for both tasks. In order to evaluate the effectiveness of the proposed architectural changes, namely multi-task learning and dilated convolutions, the authors performed an ablation study in which they compared the performance and efficiency of the multi-task model with two separate single-task models (plain Pix2Pix models).

Figure 7. Ablation Study Results

The results are shown in Figure 7. The authors conclude that the proposed method, i.e., the multi-task model with dilated convolutions, performs best in both tasks.

Apart from this gain in performance, it is also very interesting to analyze the impact on model efficiency. While training two separate Pix2Pix models requires 2 × 57 million parameters, the multi-task model only needs half this amount and still performs better. The U-Net model has fewer parameters, but it only performs the segmentation task. Also, the training time for the multi-task model was significantly shorter than the time needed to train two single-task Pix2Pix models.


Generalization to other Applications

To demonstrate that the above findings also generalize to other applications, the authors show results from two further experiments.


Application #1: Abdominal CT Images

In the first application, the input data consists of low-dose CT images. Again, the model is trained to perform two tasks simultaneously:

a) Image Enhancement

b) Kidney Segmentation.

Some results are shown in Figure 8 below. It is noticeable that the proposed method performs better than the single-task model for both tasks.

Figure 8. Generalization Experiment - Abdominal CT Images


Application #2: Brain MRI Modality Conversion

In the second generalization experiment, the authors perform a modality conversion of brain MRI images. Put simply, modalities are different types of MRI images that highlight different tissues. In this experiment, too, the proposed multi-task model performs better than the single-task model.

Figure 9. Generalization Experiment - Brain MRI Modality Conversion


With that, the authors conclude that their proposed "Multi-Task Image-to-Images Translation" method is able to improve model performance as well as efficiency, and that it is also applicable to other domains.

6. Personal Opinion

In this last section, I would like to discuss my personal opinion of the presented paper.

Strengths

The paper is very well structured and easy to follow, even without the corresponding medical background knowledge. I also appreciate that the source code is publicly shared on GitHub, which was helpful to get an idea of the exact implementation details.

I think the application of multi-task learning to image-translation tasks is an extremely interesting idea with a lot of potential in the domain of medical imaging, as there is a wide range of useful applications, which the paper has demonstrated very well. Furthermore, the proposed method is very general and should also be applicable to entirely different domains outside the medical field.


Weaknesses

However, the actual contribution of this paper, in terms of additions to existing methods, is rather small, as only minimal changes to the existing Pix2Pix architecture are applied. That being said, I would have wished to see more extensive research and experiments to evaluate the potential of image-to-images translation.

While the described method is actually very general, the authors only focused on very specific experiments within the medical domain, using very small datasets. In order to get a better impression of the generalization ability of this method, it would be desirable to see generalization experiments that differ further from the original tasks. Also, for the two generalization tests that were performed, only the results for a single sample are reported, which is not very meaningful.

Furthermore, the evaluation of the segmentation task only uses overlap-based metrics; it would be more meaningful if distance-based metrics, such as the Average Contour Distance, had also been used.

Also, the authors concluded that training a multi-task model is faster than training two separate single-task models. I think this comparison is not entirely fair, because training single-task models does not necessarily mean that both models need to be trained from scratch. Especially if the tasks are very similar, it can make sense to use transfer learning, e.g., to use the top layers of one model as initialization for the other model, which would reduce the training time.

Furthermore, I think it is a weakness that the results of another method were used as targets for the bone-suppression task. Obviously, the results in this case cannot be better than that existing method, so the question arises whether it would not have been a better choice to use a different task instead of bone suppression.

Lastly, an interesting observation is that the ablation study seems to show that the dilated convolutions actually had much more impact on the model's performance than the application of multi-task learning, which the authors could have elaborated on further.


Possible Future Work

While I criticize the lack of more sophisticated experiments and evaluations, I think this leaves a lot of potential for future work to further investigate the concept of multi-task learning in image-to-images translation.

While in this work only grayscale images were used as network input, it would be interesting to extend the approach to 3-channel RGB images. An extension to resolutions larger than 512×512 might also be interesting, as this would open up many new opportunities.

While the described organ segmentation task is comparably simple, interesting future work would also be an application with more complex tasks, such as lung nodule detection or disease classification.

Lastly, it may be particularly interesting to investigate how well the proposed method works when learning more than two tasks jointly.



References

[1] Eslami, M., Tabarestani, S., Albarqouni, S., Adeli, E., Navab, N., & Adjouadi, M. (2020). Image-to-Images Translation for Multi-Task Organ Segmentation and Bone Suppression in Chest X-Ray Radiography. IEEE Transactions on Medical Imaging (pdf)
[2] Caruana, R. (1997). Multitask learning. Machine learning, 28(1), 41-75. (pdf)
[3] Yu, F., & Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. (pdf)
[4] Gusarev, M., Kuleev, R., Khan, A., Rivera, A. R., & Khattak, A. M. (2017, August). Deep learning models for bone suppression in chest radiographs. In 2017 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (pp. 1-7). IEEE. (pdf)
[5] Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1125-1134). (pdf)
[6] Juhász, S., Horváth, Á., Nikházy, L., & Horváth, G. (2010). Segmentation of anatomical structures on chest radiographs. In XII Mediterranean Conference on Medical and Biological Engineering and Computing 2010 (pp. 359-362). Springer, Berlin, Heidelberg. (pdf)
[7] Ronneberger, O., Fischer, P., & Brox, T. (2015, October). U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention (pp. 234-241). Springer, Cham. (pdf)
[8] Ruder, S. (2017). An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098. (pdf)
[9] Shiraishi, J., Katsuragawa, S., Ikezoe, J., Matsumoto, T., Kobayashi, T., Komatsu, K. I., ... & Doi, K. (2000). Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists' detection of pulmonary nodules. American Journal of Roentgenology, 174(1), 71-74. (pdf)
[10] Van Ginneken, B., Stegmann, M. B., & Loog, M. (2006). Segmentation of anatomical structures in chest radiographs using supervised methods: a comparative study on a public database. Medical image analysis, 10(1), 19-40.  (pdf)
