This is a blog post for the DLMA seminar at TU Munich about the paper ‘Unsupervised X-ray image segmentation with task-driven generative adversarial networks’, published in 2020 in ‘Medical Image Analysis’ and written by Yue Zhang, Shun Miao, Tommaso Mansi and Rui Liao.


Introduction and Motivation

This paper deals with the unsupervised segmentation of medical images, specifically with the pixel-wise annotation of organs. Such segmentation is a crucial task for many clinical applications, from pathological diagnosis to surgical planning, and a large field of machine learning (ML) applications in medicine, as the review paper by Tongxue Zhou [1] shows. Medical imaging spans many domains, such as classic X-rays, topograms, CT scans, and MRIs. Some domains, however, such as X-rays, have features (e.g. overlapping organs and ill-defined texture patterns) that make the annotation process very hard. Other domains, CT scans for example, are much easier to annotate. But a segmentation model trained on CTs will hardly be able to assess an image from a different domain such as an MRI. This discrepancy is called a domain gap. In this paper, the authors propose to use more easily accessible annotated CTs to build a segmentation model for unlabeled X-rays. Since the model is trained on one domain and then applied to another, unlabeled domain, this is called unsupervised domain adaptation.

In order to bridge the domain gap between CTs and X-rays, Digitally Reconstructed Radiographs (DRRs) are used. DRRs are simulations of radiographic images produced through a perspective projection of a 3D image onto a 2D image plane. During this reconstruction, the positions and labels of the organs are preserved, so the DRRs are pixel-wise annotated for the heart, liver, lung, and bones. Figure 1 shows the three types of domains this paper deals with. A network trained on synthetic DRRs (a) will not perform well when applied directly to (b) or (c); this is the domain gap.


Figure 1: (a) a synthetic DRR rendered from a 3D CT scan; (b) a topogram image; (c) a standard chest X-ray image from the NIH public dataset.

Related Work

Using convolutional networks for image segmentation has become an important field of research since the pioneering work of Long et al. (2015) [2]. Inspired by this work, many other encoder-decoder structured networks have been proposed. Two important benchmarks used in this paper are SegNet (Badrinarayanan et al., 2017) [3] and UNet (Ronneberger et al., 2015) [4]. Most of these networks are so-called image-to-image (I2I) networks that generate pixel-wise labels. In this paper, a dense adaptation of the UNet is used to predict organ labels.

The early work of Pan et al. (2011) and Long et al. (2013) [5,6] in domain adaptation has come a long way and has evolved within medical imaging to adapt between DRRs and X-rays, as in Zheng et al. (2017) [7]. Further, successful segmentation on labeled CT scans has been shown by Yang et al. (2017), and training on DRRs has also been used by Albarqouni et al. (2017) [8] for X-ray decomposition. However, since the acquisition of paired images is quite challenging, Tzeng et al. (2017) [9] have proposed an adversarial structure that overcomes this restriction. Nonetheless, these models focus on style transfer and less on pixel-level information, which is why this paper proposes a task-driven approach that preserves the information needed for pixel-wise segmentation.

It is important to note that the authors published a precursor paper on arXiv in 2018. The paper discussed in this blog shows an improvement in training data and methods over that original paper, Zhang et al. (2018) [10].

Data

DRRs

The DRRs are generated from 3D labeled CT images through a virtual imaging system that simulates the angle and 2D projection of an X-ray. More precisely, the DRRs are determined by the line integrals of the Hounsfield values along the virtual ray trajectories, as shown by Ruijters et al. (2008) [11]. Since the CTs have pixel-wise annotations, these can be projected onto the DRRs. The authors leveraged a wide range of sources as well as augmentation techniques such as random translation and scaling to compile a large dataset of 815 CT scans which cover a broad field of view from the neck to the kidneys. With this dataset, the authors try to segment the heart, lung, liver, and bones. All DRRs and the following X-rays were cropped to have the same field of view and a size of 256 x 256 pixels. The dataset was split 60%-20%-20%, i.e. 489 images for training, 163 for validation, and 163 for testing the segmentation model.
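To make the projection idea concrete, here is a minimal sketch of how a DRR and its labels could be derived from a CT volume. Note that this is a simplification: the paper uses a perspective projection along virtual ray trajectories, while the axis-aligned ray sum below is a parallel projection, and all names and label ids are illustrative.

```python
import numpy as np

def simple_drr(ct_volume, organ_labels, axis=1):
    """Approximate a DRR by integrating Hounsfield values along one axis.

    ct_volume:    3D array of Hounsfield units, shape (D, H, W)
    organ_labels: 3D integer mask with one id per organ, same shape
    axis:         projection direction (a stand-in for the virtual ray)
    """
    # Line integral of the attenuation values along the ray direction.
    drr = ct_volume.sum(axis=axis)
    # Normalize to [0, 1] for display / training.
    drr = (drr - drr.min()) / (drr.max() - drr.min() + 1e-8)
    # Project the 3D annotation: a pixel is labeled if any voxel along
    # its ray belongs to the organ (one binary channel per organ).
    masks = np.stack([(organ_labels == k).any(axis=axis)
                      for k in (1, 2, 3, 4)])  # e.g. heart, liver, lung, bones
    return drr, masks
```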

Topograms

A topogram is a special type of X-ray image. In medical practice, topograms are often used for planning purposes before CT scans, so one can infer the labels from the co-registered CT scan onto the topogram. These co-registered CTs were not used for the segmentation training, and the topogram labels are only used for testing purposes. As with the DRRs, the topogram field of view, contrast, and pixel size are adjusted to achieve the greatest visual similarity between the domains.

X-rays

The Japanese Society of Radiological Technology (JSRT) provides an open-access chest X-ray dataset (Shiraishi et al., 2000) [12] which contains 247 images. A segmentation including heart, lung, and clavicle was created by Van Ginneken et al. (2006) [13]. Unfortunately, the field of view does not cover the full lung, so only the heart is used for the pixel-wise comparison of the segmentation model. A 50%-50% data split and the preprocessing described above are employed.

The National Institutes of Health (NIH) released a large dataset of over 110,000 chest X-rays in 2017. Since its annotations are disease-diagnosis-related and not precise pixel-wise labels, 500 random samples are used for qualitative analysis only. Furthermore, the field of view is limited in these images, such that only the lung, heart, and bones can be segmented.

Methods

Overview

The goal is to train an unsupervised multi-organ segmentation model on X-rays by using unpaired, pixel-wise annotated DRR data. This is done by first training a deep image-to-image network to label each pixel of the DRR images with the four organs. This network, from now on called DI2I, is then frozen and incorporated into a task-driven cyclic generative adversarial network (TD-GAN) to provide deep supervision for the parsing between DRRs and topograms / X-ray images while preserving pixel-wise information. The trained TD-GAN can then convert X-rays and topograms to (fake) DRRs, which are then segmented by the DI2I, as Figure 2 shows.

Fig. 2: Overview of the full model including the DI2I and the task-driven generative model framework. After generating the DRRs, the deep image-to-image network is trained to segment the four organs.
Then, the network weights are frozen and employed in the TD cycle-GAN to provide deep supervision for the parsing between DRRs and topograms / X-ray images.
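Freezing the DI2I before plugging it into the TD-GAN simply means disabling its weight updates; the paper only states that the network is frozen, so the small PyTorch helper below is my own illustration of that step.

```python
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    """Disable weight updates; gradients can still flow *through* the
    module to the generators, but its own parameters never change."""
    for param in module.parameters():
        param.requires_grad = False
    return module.eval()

# di2i = freeze(di2i)  # applied once, before TD-GAN training starts
```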

DI2I 

As already mentioned, the authors employ a DenseUNet structure. The DenseUNet is a fully convolutional network with an encoder-decoder UNet structure built from densely connected convolutional blocks. It follows the ideas of Jégou et al. (2017) [14] and builds on the UNet work of Ronneberger et al. (2015) [4]. The dense blocks concatenate all feature outputs and forward them to later layers, which helps alleviate the vanishing gradient problem. Furthermore, the skip connections between layers encourage feature reuse and strengthen feature propagation. The results chapter will show that such networks can achieve superior performance over widely used I2I networks such as UNet and SegNet. The loss function (Eq. 1) is a weighted combination of binary cross-entropies between the organ channels and a background channel,


(1) \mathcal{L}_{seg} = -\sum_{i=1}^{4} w_i \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right),

where y_i is the ground-truth label and p_i the softmax probability score

p_i = \frac{\exp(x_i)}{\exp(x_0) + \exp(x_i)}

for i = 1, 2, 3, 4, with x_0 denoting the background channel.
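A minimal PyTorch sketch of Eq. (1) under these definitions; I assume the network outputs five channels (background x_0 plus one per organ), the channel order is illustrative, and the weights w_i are unspecified hyperparameters.

```python
import torch

def seg_loss(logits, targets, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted binary cross-entropy of Eq. (1).

    logits:  (B, 5, H, W) raw scores; channel 0 is background x_0,
             channels 1..4 are e.g. heart, liver, lung, bones.
    targets: (B, 4, H, W) binary ground-truth masks y_i.
    """
    x0 = logits[:, 0:1]   # background channel
    xi = logits[:, 1:]    # organ channels
    # p_i = exp(x_i) / (exp(x_0) + exp(x_i)) = sigmoid(x_i - x_0)
    p = torch.sigmoid(xi - x0).clamp(1e-7, 1 - 1e-7)
    w = torch.tensor(weights, device=logits.device).view(1, 4, 1, 1)
    bce = targets * torch.log(p) + (1 - targets) * torch.log(1 - p)
    return -(w * bce).mean()
```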


Fig. 3. DenseUNet structure with forward skip connections and dense blocks.

The model is trained on a 12 GB NVIDIA TITAN X GPU. The DenseUNet contains 0.745 million trainable parameters, which is fewer than UNet and SegNet with approximately 1 million each.
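To illustrate the dense connectivity, here is a minimal sketch of a single dense block in PyTorch, loosely following Jégou et al. [14]; the growth rate and layer count are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer sees the concatenation of all previous feature maps."""
    def __init__(self, in_channels, growth_rate=12, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate,
                          growth_rate, kernel_size=3, padding=1),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Concatenate all earlier outputs before each convolution,
            # which shortens gradient paths and encourages feature reuse.
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```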

TD-GAN

As mentioned, the cyclic structure of the generative adversarial network (GAN) allows training on unpaired data between two domains, first DRRs and topograms and later DRRs and X-rays. By training the GAN on a set of DRRs and topograms with cycle-consistency on the generators, the general structures and features are preserved and a strong pixel-wise constraint is imposed. Figure 4 shows the basic structure of a cyclic GAN with visual examples.

Fig. 4. left: schematic of a simple cyclic GAN with two generators and two discriminators.
right: visualization of a real DRR from which G1 generates a "fake" X-ray for the discriminator to assess and a real X-ray made to a "fake" DRR.


However, this standard GAN only performs a transfer of general appearance and does not focus on the segmentation features that are clearly needed for organ segmentation. Thus, additional constraints are added as loss functions to this GAN, which makes it a conditional, or task-driven, GAN. In the following, the loss functions of the four paths are explored, i.e. G1, G2, G1→G2, and G2→G1.


Fig. 5. The four paths along which data is passed and loss functions are applied. Note the grey DI2I box, which is the pre-trained, frozen segmentation network.


Path 1) Real DRR → Fake X-ray 

A real DRR is passed through G_1, and the resulting fake X-ray is assessed by D_1, which tries to distinguish it from a randomly selected real X-ray from the training dataset.

A standard GAN loss with data distributions d \sim p_d and x \sim p_x for DRRs and X-ray images is applied in this path,

(2) \mathcal{L}_{DX} := \mathbb{E}_{x \sim p_x} \{ \log [ D_1(x) ] \} + \mathbb{E}_{d \sim p_d} \{ \log [ 1 - D_1(G_1(d)) ] \}.
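As a sketch, Eq. (2) could be computed as follows in PyTorch; I assume D_1 ends in a sigmoid so that it outputs probabilities, and all module names are illustrative. The discriminator ascends on this value while the generator descends on it.

```python
import torch

def loss_dx(d1, g1, real_drr, real_xray, eps=1e-7):
    """Standard GAN loss of Eq. (2) for the DRR -> X-ray path.
    D_1 maximizes this quantity, G_1 minimizes it."""
    fake_xray = g1(real_drr)
    # Score real X-rays high and generated ones low.
    real_term = torch.log(d1(real_xray).clamp(eps, 1.0))
    fake_term = torch.log((1 - d1(fake_xray)).clamp(eps, 1.0))
    return (real_term + fake_term).mean()
```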


Path 2) Real X-ray → Fake DRR 

Similar to path 1), fake DRRs are generated to challenge D_2. In order to obtain a generator focused on the segmentation task, the model now applies the fake DRR to the pre-trained DI2I and concatenates the predicted labels. D_2 must distinguish these image-label pairs, which forces G_2 to put particular emphasis on the four organs of interest. In order to keep the loss function differentiable, the predicted probability map from the DI2I is not binarized, resulting in the loss function

(3) \mathcal{L}_{XD} := \mathbb{E}_{d \sim p_d} \{ \log [ D_2(d \mid U(d)) ] \} + \mathbb{E}_{x \sim p_x} \{ \log [ 1 - D_2(G_2(x) \mid U(G_2(x))) ] \},

where U(\cdot) denotes the application of the DI2I network and (d \mid U(d)) is a random DRR sample with its predicted probability map concatenated.
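The task-driven twist is that D_2 sees image-label pairs rather than images alone; below is a minimal sketch of Eq. (3), reusing the frozen DI2I (assumed here to output the five-channel logits from the segmentation sketch above; names are illustrative).

```python
import torch

def organ_probs(logits):
    # p_i = sigmoid(x_i - x_0): the per-organ softmax against the
    # background channel, kept non-binarized for differentiability.
    return torch.sigmoid(logits[:, 1:] - logits[:, 0:1])

def loss_xd(d2, g2, di2i, real_drr, real_xray, eps=1e-7):
    """Task-driven GAN loss of Eq. (3) for the X-ray -> DRR path."""
    fake_drr = g2(real_xray)
    # Concatenate each image with its predicted probability map along
    # the channel dimension, forming the (d | U(d)) pairs of Eq. (3).
    real_pair = torch.cat([real_drr, organ_probs(di2i(real_drr))], dim=1)
    fake_pair = torch.cat([fake_drr, organ_probs(di2i(fake_drr))], dim=1)
    real_term = torch.log(d2(real_pair).clamp(eps, 1.0))
    fake_term = torch.log((1 - d2(fake_pair)).clamp(eps, 1.0))
    return (real_term + fake_term).mean()
```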


Path 3) Real X-ray → Reconstructed X-ray

The reconstructed X-ray image should be identical to the original after passing through both generators. This cycle-consistency is measured by the l_1 distance,

(4) \mathcal{L}_{XX} := \mathbb{E}_{x \sim p_x} \{ \| G_1(G_2(x)) - x \|_1 \}.


Path 4) Real DRR → Reconstructed DRR

Finally, the fourth path also imposes cycle-consistency; in addition, the segmentation loss \mathcal{L}_{seg} from Eq. (1) is applied to the reconstructed DRR, whose ground-truth labels are known:


(5) \mathcal{L}_{DD} := \mathbb{E}_{d \sim p_d} \{ \| G_2(G_1(d)) - d \|_1 \}.


Overall the TD-GAN has a loss function of

(6) \mathcal{L}_{all} := \lambda_1 \mathcal{L}_{DX} + \lambda_2 \mathcal{L}_{XD} + \lambda_3 \mathcal{L}_{XX} + \lambda_4 \mathcal{L}_{DD} + \lambda_5 \mathcal{L}_{seg}

with \lambda_i > 0 being constant weights. Compared to a standard GAN, the minimization of \mathcal{L}_{XD} and \mathcal{L}_{seg} enforces additional constraints which force the model to preserve segmentation-specific information and focus on the four organs.
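As a sketch of how the four paths could come together in one training step, here is the combined objective of Eq. (6), reusing the loss_dx, loss_xd, and seg_loss sketches from above; the \lambda values below are placeholders, not values reported in the paper.

```python
import torch.nn.functional as F

def td_gan_loss(g1, g2, d1, d2, di2i, drr, drr_masks, xray,
                lams=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of Eq. (6); all module names are illustrative."""
    l_dx = loss_dx(d1, g1, drr, xray)              # path 1: Eq. (2)
    l_xd = loss_xd(d2, g2, di2i, drr, xray)        # path 2: Eq. (3)
    l_xx = F.l1_loss(g1(g2(xray)), xray)           # path 3: Eq. (4)
    recon_drr = g2(g1(drr))
    l_dd = F.l1_loss(recon_drr, drr)               # path 4: Eq. (5)
    l_seg = seg_loss(di2i(recon_drr), drr_masks)   # Eq. (1) on the cycle
    terms = (l_dx, l_xd, l_xx, l_dd, l_seg)
    return sum(lam * t for lam, t in zip(lams, terms))
```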

Results

DI2I

The authors trained their DenseUNet as well as two comparison models, UNet and SegNet, all of which displayed a rather strong ability to segment the organs. Table 1 shows the dice score results.

Table 1) shows the dice results (mean ± standard deviation) on the DRRs for the different networks.

TD-GAN on topograms

As stated in the Data section, all DRRs and 150 topograms are used for training the GAN, while the remaining 100 topograms are used for testing. Note that the dice results are obtained by passing a test sample x through the generator G_2 and then evaluating the fake DRR with the DI2I, i.e. computing U(G_2(x)); a minimal sketch of this evaluation pipeline follows the list below. In Table 2, several "modes" are compared to each other:

  • Vanilla: pre-trained DI2I is applied to topograms without any use of the GAN
  • Cycle-GAN: use a standard (non-TD) cycle-GAN 
  • TD-GAN-A/S: applying only one of the conditional loss functions \mathcal{L}_{XD} and \mathcal{L}_{seg} to the GAN during training
  • TD-GAN-Segnet/UNet: use the alternative segmentation networks instead of DenseUNet
  • TD-GAN-DenseUNet: Full framework as discussed up to now
  • Supervised: Train and test the DI2I on labeled topograms
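As promised above, here is a minimal sketch of the evaluation pipeline U(G_2(x)) with a per-organ dice score; the binarization threshold and all names are illustrative.

```python
import torch

@torch.no_grad()
def evaluate_dice(g2, di2i, xray, true_masks, thresh=0.5, eps=1e-7):
    """Translate the X-ray to a fake DRR with G_2, segment it with the
    frozen DI2I, binarize, and compute one dice score per organ."""
    logits = di2i(g2(xray))
    probs = torch.sigmoid(logits[:, 1:] - logits[:, 0:1])
    pred = (probs > thresh).float()
    inter = (pred * true_masks).sum(dim=(0, 2, 3))
    denom = pred.sum(dim=(0, 2, 3)) + true_masks.sum(dim=(0, 2, 3))
    return 2 * inter / (denom + eps)  # tensor of 4 dice values
```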

Table 2) shows the dice results (mean ± standard deviation) of multi-organ segmentation on topograms for different versions of the model

Figure 6) visualizes the results of the vanilla (top row) and TD-GAN-DenseUNet (bottom row) methods. The red line displays the ground truth while the colored areas represent the predictions.

Both the quantitative and the qualitative analysis show a strong improvement from the direct application of the network (vanilla) to the conditional GANs, which demonstrates that the TD-GAN is effective at bridging domain gaps and preserving segmentation information.

TD-GAN on chest X-rays

The same methodology used for the topograms is applied to the two X-ray datasets. Table 3 shows quantitative results for the heart on the JSRT data. Again, the strength of the TD-GAN can be seen, and an apparently "wider" gap between DRRs and X-rays can be inferred, since the difference between Vanilla and Supervised is even larger than for topograms.


Table 3) shows the dice results (mean ± standard deviation) for three different methods on the JSRT dataset. 


Finally, a qualitative analysis of the NIH data can be seen in Figure 7. Note that there are no ground truths for this dataset, but it is clear that the vanilla method is not able to detect the heart, while this improves considerably with the TD-GAN.

Figure 7) shows a qualitative analysis of the NIH X-rays with the segmentation results of the Vanilla method (top) and the TD-GAN method (bottom)


Discussion

Authors' comments

The authors include a brief discussion in their paper in which they point out that this approach should be seen as a more general framework. By replacing the DI2I with any other classifying or evaluation function, the framework could be adapted to preserve other features of the images when moving from one domain to the other.

Furthermore, they note that the method of this paper can be directly applied in real-world settings: clinicians can annotate image types in which organs (or other features) are easy to delineate and then use that data and the model to annotate images that require more effort, as is done in this paper.

Personal comments

As often seen in Medical Image Analysis, this paper is very well structured and does a good job of making the method accessible to non-specialists. The method, data, and results are detailed, and especially the in-depth comparison of different versions of the model, as well as of the "vanilla" and supervised methods, gives a fair representation of the strengths of this approach. Keeping this in mind, the results are quite impressive considering that this is a fully unsupervised method, that the amount of unpaired data is rather small, and that the domain gap between DRRs and X-rays is large.

However, the analysis of the X-ray results, with only one organ to evaluate quantitatively, is a bit disappointing, and the discussion and conclusion sections of the paper are very limited. The authors do not attempt to explain why the model performs much better on topograms or how the results for the chest X-rays could be improved.

Nonetheless, I find this paper well structured and the methodology novel and worth further investigation.

Future work

The authors mention two main directions in which further investigation seems promising.

Firstly, it is important to note that the TD-GAN method is not limited to segmentation problems but can be adapted to many other machine learning diagnosis scenarios, such as lesion classification, anatomical landmark localization, or abnormal motion detection. In these cases, the DI2I would be replaced with the respective supervision network, again forcing the cycle-GAN to preserve the task-relevant information. Thus far I have found no papers following this path, which is likely because this paper is less than four months old.

Furthermore, the method could be extended to a semi-supervised setting. The motivation is that often some labeled data is available, but not enough to train a large network such as the DenseUNet; in this case, the available X-ray labels could be incorporated into the TD-GAN method in addition to the labeled DRRs.

Conclusion

In order to bridge the domain gap between different imaging domains and become less dependent on detailed, pixel-wise annotation of data, this paper successfully employs a task-driven GAN to accomplish unsupervised domain adaptation. This is done by training a deep image-to-image segmentation network on labeled Digitally Reconstructed Radiographs. Then, the network weights are frozen and employed in a conditional cycle-GAN which provides the necessary parsing between DRRs and topogram images while preserving the organ information. The authors show how this method greatly improves the accuracy on the unsupervised target domain.

To summarize, this paper solves an unsupervised multi-organ segmentation problem on X-ray images with a novel task-driven GAN model, taking labeled DRR images as input and, through domain adaptation, producing meaningful segmentations on real X-rays.



Sources

All images were taken from the original paper "Unsupervised X-ray image segmentation with task-driven generative adversarial networks".


Sources mentioned in this blog:

[1] Zhou, T., Ruan, S., Canu, S., 2019. A review: deep learning for medical image segmentation using multi-modality fusion. Array 3–4, 100004.

[2] Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.

[3] Badrinarayanan, V., Kendall, A., Cipolla, R., 2017. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39 (12), 2481–2495.

[4] Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 234–241.

[5] Pan, S.J., Tsang, I.W., Kwok, J.T., Yang, Q., 2011. Domain adaptation via transfer component analysis. IEEE Trans. Neural Netw. 22 (2), 199–210.

[6] Long, M., Wang, J., Ding, G., Sun, J., Yu, P.S., 2013. Transfer feature learning with joint distribution adaptation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2200–2207.

[7] Zheng, J., Miao, S., Liao, R., 2017. Learning CNNs with pairwise domain adaption for real-time 6DoF ultrasound transducer detection and tracking from X-ray images. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 646–654.

[8] Albarqouni, S., Fotouhi, J., Navab, N., 2017. X-ray in-depth decomposition: revealing the latent structures. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 444–452.

[9] Tzeng, E., Hoffman, J., Saenko, K., Darrell, T., 2017. Adversarial discriminative domain adaptation. In: Computer Vision and Pattern Recognition (CVPR).

[10] Zhang, Y., Miao, S., Mansi, T., Liao, R., 2018. Task driven generative modeling for unsupervised domain adaptation: application to X-ray image segmentation. arXiv preprint.

[11] Ruijters, D., ter Haar Romeny, B.M., Suetens, P., 2008. GPU-accelerated digitally reconstructed radiographs. BioMED 8, 431–435.

[12] Shiraishi, J., Katsuragawa, S., Ikezoe, J., Matsumoto, T., Kobayashi, T., Komatsu, K., Matsui, M., Fujita, H., Kodera, Y., Doi, K., 2000. Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists' detection of pulmonary nodules. Am. J. Roentgenol. 174 (1), 71–74.

[13] Van Ginneken, B., Stegmann, M.B., Loog, M., 2006. Segmentation of anatomical structures in chest radiographs using supervised methods: a comparative study on a public database. Med. Image Anal. 10 (1), 19–40.

[14] Jégou, S., Drozdzal, M., Vazquez, D., Romero, A., Bengio, Y., 2017. The one hundred layers Tiramisu: fully convolutional DenseNets for semantic segmentation. In: Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, pp. 1175–1183.
