This is the blog post for the paper 'Fully Convolutional Architectures for Multiclass Segmentation in Chest Radiographs'.

1. Introduction

Radiography is a highly useful diagnostic tool. Combined with ultrasonography, conventional X-ray imaging covers up to 80% of all diagnostic imaging cases today [1]. Fast and cheap, chest radiographs (CXR) are among the most common medical images taken. The main points of interest in CXR imaging are the lung fields, the clavicles [2] and the heart [3].

High-precision segmentation of CXR images can help with early diagnosis of serious diseases like cardiomegaly and emphysema.

Medical analysis of CXR images is a routine task, but a genuinely complex one even for an experienced radiologist: there are high inter-personal variations in the shape and size of the central organs (due to age, body size and gender), ambiguous organ boundaries (due to organ overlap and movement), and artifacts caused by patient movement and the intrinsics of the imaging modality (e.g. inconsistent contrast of soft tissues) [4].

When it comes to automating segmentation, algorithmic approaches face an additional challenge: the unequal distribution of pixels between classes. In the Segmentation in Chest Radiographs (SCR) database, which provides manual segmentations of lung fields, heart and clavicles for the Japanese Society of Radiological Technology (JSRT) image set:

  • 73.53% of pixels belong to lungs;
  • 4.62% to clavicles;
  • 21.85% to the heart.

This demonstrates a severe between-class imbalance in the data.

Algorithmic approaches (rule-, shape- and graph-based methods, pixel classification and statistical approaches, fully convolutional neural networks) tackle this problem with varying success [5][6][7]. Among them, the fully convolutional network (FCN) approach U-Net shows promising results [8].

Thus, there is a demand for a tool that can segment the organs of interest in a CXR with high precision and outperform a human reader.

This paper introduces a new architecture, InvertedNet: a fully convolutional modification of U-Net for multi-class segmentation of CXR images.

2. Methodology

2.1. Formal description of multi-class approach

Data: n images of size m_1 \times m_2, \textbf{I} = \{I_1, \ldots, I_n\}; pixels \textbf{x} = (x_1, x_2) with intensities I_i(\textbf{x}) taking values in a subset of \textbf{R}.

Ground truths: for each image I \in \textbf{I} there exists a sequence of masks M_I = (M_{I,l})_{l=1}^{m} \in \textbf{M} = M^{m_1 \times m_2}(\{0, 1\}) for the m semantic labels \textbf{L} = \{l_1, \ldots, l_m\}.

Multi-class masks: \textbf{M}' = M^{m_1 \times m_2}(\{0, \ldots, |\textbf{L}|\}), obtained by the mapping g: \textbf{M} \rightarrow \textbf{M}', g(M_I) = \sum_{l=1}^{|\textbf{L}|} l\, M_{I,l}. For I \in \textbf{I}, let G_I \in \textbf{M}' be the ground-truth matrix and \pi_l: \textbf{M}' \rightarrow \textbf{M}, \pi_l(G_I) = M_{I,l} the projection of G_I onto \textbf{M} for the semantic class l \in \textbf{L}.

For training and evaluation purposes, the dataset \textbf{I} is split into three non-overlapping sets: \textbf{I}_{TRAIN}, \textbf{I}_{VALID} and \textbf{I}_{TEST}.

During training, the network is fed with minibatches \textbf{K} \in \textbf{N}, where \textbf{N} is a complete partition of \textbf{I}_{TRAIN}. The network is a function \textbf{F}: \textbf{I} \rightarrow \textbf{M}' that, in a single step, assigns to each pixel \textbf{x} of I its semantic class l \in \textbf{L} with some probability. A loss function \Lambda: \textbf{I} \times \textbf{M}' \rightarrow \textbf{R} estimates the error of the network's output and drives the parameter updates. During the test phase, accuracy is estimated over the unseen images in \textbf{I}_{TEST}.
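As a minimal sketch of such a three-way split (the split ratios and random seed are illustrative assumptions, not values from the paper):

```python
import numpy as np

def split_dataset(n_images, train=0.7, valid=0.1, seed=0):
    """Split image indices into non-overlapping TRAIN/VALID/TEST sets.
    The ratios are illustrative assumptions, not the paper's values."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_images)
    n_train = int(train * n_images)
    n_valid = int(valid * n_images)
    return (idx[:n_train],
            idx[n_train:n_train + n_valid],
            idx[n_train + n_valid:])

i_train, i_valid, i_test = split_dataset(247)  # JSRT has 247 images
```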

Output: three channels, one each for the clavicles, the heart and the lung fields.

2.2. Proposed architectures

Keeping the balance between the size of the network and the amount of available training data is key to avoiding overfitting, a prominent problem in every domain where training data is scarce and hard to come by.

While FCNs can usually be trained well on small datasets [9], in the case of U-Net this property appears to be highly domain-specific.

Figure 1. U-Net's feature maps grouped by their activation level significance.

To produce Fig. 1, U-Net's convolution kernels before the third downscaling step were grouped by their influence on the activations for a random test image. The overlap scores of the unaltered U-Net and of a U-Net with a randomly selected 25% of the lower-contributing kernels deactivated were then compared, and no significant difference was found. Fig. 1 thus illustrates that in the original U-Net (depicted schematically in Fig. 2a) the network's dependence on its feature maps decreases with depth, i.e. many deeper feature maps contribute little.


This inspection shows that U-Net has potential for further tuning and more effective training, and that it can be adapted to domain-specific needs. Thus, fully convolutional modifications of U-Net are proposed. Fig. 2 illustrates the original U-Net and the proposed modifications.

Figure 2. Overview of the proposed architectures: a) Original U-Net, b) All-Dropout, c) All-Convolutional, d) InvertedNet.

2.2.1. All-Dropout

The All-Dropout network is a U-Net modification with restrictive regularization, as depicted schematically in Fig. 2b.

In this setup, regularization is used as a way to decrease the high generalization (test) error, since the availability of data, as mentioned before, is limited while the network architecture is deep. In general, adding dropout layers has become common practice in modern deep network architectures [10].

The use of Gaussian dropout is proposed, which is equivalent to adding a zero-mean Gaussian-distributed random variable with standard deviation defined as follows:

\sigma = \sqrt{\frac{d}{1 - d}}

where d \in [0, 1] is the drop probability.
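A minimal NumPy sketch of such a layer in its common multiplicative formulation (each activation is scaled by 1 + \epsilon with \epsilon zero-mean Gaussian), similar in spirit to the GaussianDropout layers found in deep learning frameworks:

```python
import numpy as np

def gaussian_dropout(x, d, training=True):
    """Multiplicative Gaussian noise: each activation is scaled by
    1 + eps, eps ~ N(0, sigma), with sigma = sqrt(d / (1 - d)).
    At test time the layer is the identity (the noise has unit mean)."""
    if not training or d <= 0.0:
        return x
    sigma = np.sqrt(d / (1.0 - d))
    return x * np.random.normal(loc=1.0, scale=sigma, size=x.shape)
```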

2.2.2. All-Convolutional

The All-Convolutional network is an All-Dropout modification with learned pooling, as depicted schematically in Fig. 2c.

Each pooling layer of U-Net is replaced by a convolutional layer whose filter size equals the pooling size of the replaced layer. The idea is to simplify the overall architecture. This modification introduces new parameters into the network, and it has been reported that such modifications can improve final results [11].
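As a minimal PyTorch sketch of the replacement (the channel count of 64 is an assumption for illustration):

```python
import torch.nn as nn

# Standard 2x2 max pooling, as used in U-Net's contracting path.
pooling = nn.MaxPool2d(kernel_size=2, stride=2)

# All-Convolutional variant: a learned downscaling with the same filter
# size and stride as the pooling it replaces.
learned_pooling = nn.Conv2d(in_channels=64, out_channels=64,
                            kernel_size=2, stride=2)
```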

2.2.3. InvertedNet

InvertedNet is a modification of All-Dropout with a reordered (inverted) number of feature maps in the convolutional layers and special stride parameter values, as depicted schematically in Fig. 2d.

Another way of dealing with overfitting is to reduce the expressivity of the function [12]. This motivates shrinking the solution space by reordering the number of feature maps in the convolutional layers: the network starts with a large number of feature maps and halves it after every pooling layer in the contracting part, then doubles it after every upsampling layer in the expanding part. The feature-map counts per convolutional layer are 64-128-256-512-1024-512-256-128-64 for U-Net and 256-128-64-32-16-32-64-128-256 for InvertedNet; note that InvertedNet's schedule is essentially U-Net's inverted (compare Fig. 2a and Fig. 2d). A sketch of both schedules follows below.
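A small Python sketch of the two feature-map schedules (the depth of five resolution stages follows from the counts above):

```python
def unet_channels(base=64, depth=5):
    """U-Net: feature maps double after every downscaling step."""
    down = [base * 2 ** i for i in range(depth)]
    return down + down[-2::-1]  # mirror back up the expansion path

def invertednet_channels(base=256, depth=5):
    """InvertedNet: starts wide and halves after every downscaling step."""
    down = [base // 2 ** i for i in range(depth)]
    return down + down[-2::-1]

print(unet_channels())         # [64, 128, 256, 512, 1024, 512, 256, 128, 64]
print(invertednet_channels())  # [256, 128, 64, 32, 16, 32, 64, 128, 256]
```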

2.3. Training

Large differences in size between organs may bias the network towards the dominating class [13].

To counter this, class weights

r_{\textbf{K},l} = \frac{c_{\textbf{K},l}}{c_{\textbf{K}}}

(where c_{\textbf{K},l} is the number of pixels of semantic class l \in \textbf{L} in batch \textbf{K} and c_{\textbf{K}} is the batch's total pixel count) are introduced into the loss function

\Lambda(I, G_I) = - \sum_{l \in \textbf{L}} r_{\textbf{K},l}^{-1} d_l(I, G_I)

which is minimized over the batch \textbf{K} and over the complete partition.
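As a rough NumPy sketch (assuming the batch's ground truth is stored as an integer array of class indices), the ratios could be computed per batch as:

```python
import numpy as np

def class_weight_ratios(batch_masks, num_classes):
    """r_{K,l}: the share of the batch's pixels that carry class l.
    `batch_masks` is assumed to be an integer array of class indices."""
    total = batch_masks.size
    return np.array([(batch_masks == l).sum() / total
                     for l in range(num_classes)])
```

The inverse ratios r_{\textbf{K},l}^{-1} then scale the per-class distances, boosting rare classes such as the clavicles.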

Distance functions d_l: cross-entropy (a typical choice for neural networks) and Dice (a natural choice for segmentation problems).

Output functions p_{l}: since the two distance functions differ in their domains (cross-entropy operates on probability distributions while Dice works with binary masks), the final output layer of the network must differ as well:

  • Softmax for cross-entropy;
p_{l}^{softmax}(\textbf{x}) = \frac{e^{\textbf{a}_{l}(\textbf{x})}}{\sum_{k \in \textbf{L}} e^{\textbf{a}_{k}(\textbf{x})}}
  • Sigmoid for Dice.
p_{l}^{sigmoid}(\textbf{x}) = \frac{1}{1 + e^{-\textbf{a}_{l}(\textbf{x})}}

where \textbf{a}_{l}(\textbf{x}) is the activation at feature channel l and pixel \textbf{x}; the resulting outputs satisfy p_{l}(\textbf{x}) \in [0, 1].

The final network outputs may be interpreted as approximate probabilities of the pixel \textbf{x} not belonging to the background.

The two distances are defined as follows:

  • Cross-entropy
d_{l}^{cross\text{-}entropy}(I, G_I) = \frac{1}{c_{\textbf{K}}} \sum_{\textbf{x} \in I} X_{\pi_{l}(G_I)}(\textbf{x}) \log p_{l}(\textbf{x})
  • Dice
d_{l}^{dice}(I, G_I) = 2\, \frac{\sum_{\textbf{x} \in I} X_{\pi_{l}(G_I)}(\textbf{x})\, p_{l}(\textbf{x})}{\sum_{\textbf{x} \in I} \left( X_{\pi_{l}(G_I)}(\textbf{x}) + p_{l}(\textbf{x}) \right)}

where X_{\pi_{l}(G_I)}(\textbf{x}) is the characteristic function of the ground-truth mask \pi_{l}(G_I).
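A minimal NumPy sketch of the Dice distance and the weighted loss, assuming per-class probability maps `probs[l]` and binary target masks `targets[l]` (the small epsilon guard against empty masks is an added assumption):

```python
import numpy as np

def dice_distance(prob, target):
    """d_l^{dice}: soft Dice between a predicted probability map and the
    binary ground-truth mask of one class."""
    num = 2.0 * (target * prob).sum()
    den = (target + prob).sum() + 1e-8  # guard against empty masks
    return num / den

def weighted_loss(probs, targets, ratios):
    """Lambda(I, G_I): negative sum of per-class distances, weighted by
    the inverse class ratios r_{K,l}^{-1} to counter class imbalance."""
    return -sum(dice_distance(probs[l], targets[l]) / ratios[l]
                for l in range(len(ratios)))
```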

3. Experimental Setup

Data: the JSRT dataset is used both for training and testing: 247 posterior-anterior (PA) chest radiographs with a resolution of 2048×2048, a pixel size of 0.175 mm and 12-bit depth [14]. The networks are trained on images downscaled to 128×128 and 256×256.

Two sets of ground-truth masks:

  • \textbf{G}_{dice}: left and right lung fields including clavicles; left and right clavicles; heart. Used for all training runs with the Dice-based loss function.
  • \textbf{G}_{entropy}: background, left and right lung fields without clavicles, left and right clavicles, heart. Used for all training runs with the cross-entropy-based loss function.

All images are zero-centered and normalized; all architectures use zero-padding in their convolutional layers.

Optimization: Adam with a fixed learning rate of 10^{-5}, \beta_{1} = 0.9, \beta_{2} = 0.999.

Activation function: ELU.
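A sketch of this setup in PyTorch (the placeholder model stands in for any of the architectures; its channel counts are assumptions):

```python
import numpy as np
import torch

def preprocess(image):
    """Zero-center and normalize a single CXR image."""
    image = image.astype(np.float32)
    return (image - image.mean()) / (image.std() + 1e-8)

# Placeholder block: all architectures use zero-padded convolutions
# followed by ELU activations.
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 16, kernel_size=3, padding=1),
    torch.nn.ELU(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5,
                             betas=(0.9, 0.999))
```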

Performance metrics: the Dice (D) and Jaccard (J) similarity coefficients are used.

  • Dice
D(I, G_I) = 2\, \frac{|P_l(I) \cap \pi_l(G_I)|}{|P_l(I)| + |\pi_l(G_I)|}
  • Jaccard
J(I, G_I) = \frac{D(I, G_I)}{2 - D(I, G_I)}

where P_l(I) = \{\textbf{x} : \textbf{x} \in I \wedge |p_l(\textbf{x}) - 1| < \epsilon\} is the set of pixels for which the model is confident that they do not belong to the background, \epsilon = 0.25 is a threshold parameter, and \pi_l(G_I) is the binary \{0, 1\} mask for label l.
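A small NumPy sketch of this evaluation, assuming a per-class probability map `prob` and a binary ground-truth mask `gt_mask`:

```python
import numpy as np

def dice_jaccard(prob, gt_mask, eps=0.25):
    """Dice and Jaccard between P_l(I) (pixels with |p_l(x) - 1| < eps)
    and the binary ground-truth mask pi_l(G_I)."""
    pred = np.abs(prob - 1.0) < eps
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)
    return dice, dice / (2.0 - dice)  # J = D / (2 - D)
```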

The symmetric mean absolute surface distance S_d is also computed [15].

4. Results and Discussions

4.1. Cross-entropy

4.1.1. 256 × 256 imaging resolution

Table 1: Evaluation results of the four compared architectures at 256 × 256 imaging resolution.

Table 1 presents the evaluation results of the four compared architectures. Scores are computed on the testing set with three-fold cross-validation, using networks trained with the cross-entropy-based loss function at 256 × 256 imaging resolution. All-Dropout and InvertedNet outperform U-Net. It is also visible that clavicle segmentation is a challenging task, as clavicles are underrepresented in the dataset and show high shape variation.

4.1.2. 128 × 128 imaging resolution

Table 2: Evaluation results of the four compared architectures at 128 × 128 imaging resolution.

Table 2 presents the same evaluation, the only difference being the 128 × 128 imaging resolution. InvertedNet displays the best performance on clavicle segmentation.

4.1.3. Ensemble

Figure 3. Examples of features extracted after the penultimate upsampling step for All-Dropout (left), All-Convolutional (center) and InvertedNet (right). The same test image was used in all three cases. Higher colour intensities correspond to higher activation values.

Inspection of the features derived by the proposed networks (Fig. 3) indicates that the networks specialize to some degree. Motivated by successful applications of network ensembles [16], [17], ensembles of pairs and of the triplet of proposed architectures were therefore evaluated. The results are shown in Table 3.

Table 3: Evaluation results of ensembles of networks on the combination of the three proposed architectures at 256 × 256 imaging resolution.

Output masks are computed by averaging the outputs of the networks and then thresholding via majority voting. The pair InvertedNet + All-Dropout slightly improves on the previous scores for all organs, with the largest gain achieved on the challenging clavicle segmentation task.
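The description leaves some room for interpretation; one plausible reading, sketched below in NumPy, binarizes each network's per-class probability map and keeps a pixel when a majority of the ensemble is confident about it (reusing the evaluation threshold \epsilon = 0.25 here is an assumption):

```python
import numpy as np

def ensemble_mask(prob_maps, eps=0.25):
    """Majority vote over the networks' thresholded per-class outputs."""
    binary = np.stack([np.abs(p - 1.0) < eps for p in prob_maps])
    return binary.mean(axis=0) >= 0.5
```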

4.2. Dice coefficient

Table 4: Evaluation comparison of the InvertedNet architecture for different training and validation splits, for weighted and non-weighted loss functions based on the Dice coefficient.

The performance of the overall best architecture, InvertedNet, was evaluated on several splits of the input data into training and validation sets, and with two Dice-based loss functions (with and without class weights). As Table 4 shows, in the presence of severe between-class imbalance it remains important to use class weighting.

4.3. Comparison with the state-of-the-art methods

Table 5: Comparison of the InvertedNet architecture with state-of-the-art methods; (*) single-class algorithms trained and evaluated for different organs separately; "-" indicates the score was not reported.

As shown in Table 5, while InvertedNet could not surpass the best approaches presented earlier [5][18], it outperformed the human observer on the lung segmentation task. On heart segmentation, InvertedNet outperformed both the best method and the human observer [6]. On clavicle segmentation it was unable to outperform the human observer, but it surpassed the best automated methods.

4.4. Timing and size

Table 6: Overview of the proposed architectures at 256 × 256 imaging resolution.

As shown in Table 6, InvertedNet is the fastest multi-class segmentation approach for CXR images to date. Its small size and fast inference can be useful in large clinical environments.

5. Conclusions

This paper proposed an end-to-end approach for multi-class segmentation of anatomical organs in X-ray images and introduced and evaluated three fully convolutional architectures (All-Dropout, All-Convolutional and InvertedNet), which matched or even outperformed state-of-the-art methods on all considered organs. The best architecture, InvertedNet, outperformed the human observer on lungs and heart, and compares favourably to the state-of-the-art methods on the challenging clavicle segmentation task.

6. References

[1] S. Sandström. The WHO manual of diagnostic imaging, radiographic technique and projections. [Online]. Available: http://www.who.int/diagnosticimaging/publications/dim/radiotech/en/
[2] B. van Ginneken, S. Katsuragawa, B. M. ter Haar Romeny, K. Doi, and M. A. Viergever, “Automatic detection of abnormalities in chest radiographs using local texture analysis,” IEEE Transactions on Medical Imaging, vol. 21, no. 2, pp. 139–149, 2002.
[3] N. Nakamori, V. Sabeti, H. MacMahon et al., "Image feature analysis and computer-aided diagnosis in digital radiography: Automated analysis of sizes of heart and lung in chest images," Medical Physics, vol. 17, no. 3, pp. 342–350, 1990.
[4] L. G. Quekel, A. G. Kessels, R. Goei, and J. M. van Engelshoven, "Miss rate of lung cancer on the chest radiograph in clinical practice," CHEST Journal, vol. 115, no. 3, pp. 720–724, 1999.
[5] D. Seghers, D. Loeckx, F. Maes, D. Vandermeulen, and P. Suetens, “Minimal shape and intensity cost path segmentation,” IEEE Transactions on Medical Imaging, vol. 26, no. 8, pp. 1115–1129, 2007.
[6] B. van Ginneken, M. B. Stegmann, and M. Loog, “Segmentation of anatomical structures in chest radiographs using supervised methods: a comparative study on a public database,” Medical Image Analysis, vol. 10, pp. 19–40, 2006.
[7] H. Boussaid, I. Kokkinos, and N. Paragios, “Discriminative learning of deformable contour models,” in International Symposium on Biomedical Imaging (ISBI 2014), April 2014, pp. 624– 628.
[8] C. Wang, “Segmentation of multiple structures in chest radiographs using multi-task fully convolutional networks,” in Image Analysis, P. Sharma and F. M. Bianchi, Eds. Cham: Springer International Publishing, 2017, pp. 282–289.
[9] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional Networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer- Assisted Intervention (MICCAI 2015). Springer, 2015, pp. 234–241.
[10] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting.” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[11] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, "Striving for simplicity: The all convolutional net," International Conference on Learning Representations (ICLR 2015), 2014.
[12] K. He and J. Sun, “Convolutional neural networks at constrained time cost,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5353–5360.
[13] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
[14] J. Shiraishi, S. Katsuragawa, J. Ikezoe, T. Matsumoto, T. Kobayashi, K.-i. Komatsu, M. Matsui, H. Fujita, Y. Kodera, and K. Doi, “Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists’ detection of pulmonary nodules,” American Journal of Roentgenology, vol. 174, no. 1, pp. 71–74, 2000.
[15] K. O. Babalola, B. Patenaude, P. Aljabar, J. Schnabel, D. Kennedy, W. Crum, S. Smith, T. Cootes, M. Jenkinson, and D. Rueckert, “An evaluation of four automatic methods of segmenting the subcortical structures in the brain,” Neuroimage, vol. 47, no. 4, pp. 1435–1447, 2009.
[16] C. Ju, A. Bibaut, and M. J. van der Laan, “The Relative Performance of Ensemble Methods with Deep Convolutional Neural Networks for Image Classification,” ArXiv e-prints, Apr. 2017.
[17] J. Hu, L. Shen, and G. Sun, “Squeeze-and-Excitation Networks,” ArXiv e-prints, Sep. 2017.
[18] B. Ibragimov, B. Likar, F. Pernu, and T. Vrtovec, "Accurate landmark-based segmentation by incorporating landmark misdetections," in International Symposium on Biomedical Imaging (ISBI 2016), April 2016, pp. 1072–1075.
[19] O. Oktay, E. Ferrante, K. Kamnitsas, M. Heinrich, W. Bai, J. Caballero, S. Cook, A. de Marvao, T. Dawes, D. O'Regan, B. Kainz, B. Glocker, and D. Rueckert, "Anatomically constrained neural networks (ACNN): Application to cardiac image enhancement and segmentation," IEEE Transactions on Medical Imaging, vol. PP, no. 99, pp. 1–1, 2017.
[20] C. F. Baumgartner, L. M. Koch, M. Pollefeys, and E. Konukoglu, “An exploration of 2d and 3d deep learning techniques for cardiac MR image segmentation,” CoRR, 2017. [Online]. Available: http://arxiv.org/abs/1709.04496



