This is a summary blog post for the paper Automatic 3D Bi-Ventricular Segmentation of Cardiac Images by a Shape-Refined Multi-Task Deep Learning Approach by Jinming Duan, Ghalib Bello, Jo Schlemper, Wenjia Bai, Timothy J.W. Dawes, Carlo Biffi, Antonio de Marvao, Georgia Doumou, Declan P. O'Regan, and Daniel Rueckert.

The following is a list of the abbreviations used in order of appearance:

DL = Deep Learning
HR = High-Resolution
LR = Low-Resolution
CMR = Cardiac Magnetic Resonance
FCN = Fully Convolutional Network
SSLLN = Simultaneous Segmentation and Landmark Localization Network
LL = Landmark Localization
AP = Atlas Propagation
SR = Shape Refinement
LVW = Left Ventricular Wall
LVC = Left Ventricular Cavity
RVW = Right Ventricular Wall
RVC = Right Ventricular Cavity
LVV = Left Ventricular Volume
LVM = Left Ventricular Mass
RVV = Right Ventricular Volume
RVM = Right Ventricular Mass
PH = Pulmonary Hypertension


Overview

          Applications of DL in medical image segmentation are increasing. The most common approaches for CMR volume segmentation use either a 3D FCN, like the one proposed in [1], or a 2D FCN, like the one proposed in [2]. Most models are trained on healthy patients' data and do not account for image artifacts during segmentation. A 3D FCN guarantees spatial consistency along the long axis and therefore achieves more accurate results, but it is more computationally expensive than a 2D FCN; conversely, a 2D FCN is cheaper to train but ignores spatial context, which reduces its accuracy.

          Duan et al. [3] address the limitations of these previous works by proposing a two-stage automatic pipeline (see Fig. 1) that produces an artifact-free, smooth 3D bi-ventricular segmentation model applicable to both HR and LR CMR volumes.


Figure 1. SSLLN Pipeline (image modified from [3])

          The network, SSLLN, consists of three primary components: Segmentation, LL, and AP. The first two form the first stage and are what make the approach multi-task: a 2.5D FCN predicts the segmentation masks and the localized landmarks simultaneously.

          Two versions of the network are used. The first, SSLLN-HR, is trained with HR CMR volumes. Since these are artifact-free, the output model is smooth and accurate. The second, SSLLN-LR, is trained with LR volumes. These are normally acquired from non-healthy patients over several breath-holds and are characterized by image artifacts due to increased slice thickness and patient motion, which propagate into the model's segmentation (see Fig. 1). To compensate for this, the second stage of SSLLN, SR through AP, is proposed.

          For SR, the 3D models produced by SSLLN-HR are treated as atlases. These are propagated onto SSLLN-LR's initial segmentation using the localized landmarks, yielding an artifact-free model.

Methodology

Methods: Segmentation and LL

          The 2.5D FCN (see Fig. 2) is a supervised learning approach (pre-labeled CMR volumes were used for training). It is also a multi-class classification network: four cardiac regions are segmented (LVW, LVC, RVW, RVC) and six landmarks localized (the LV lateral wall mid-point, the two RV insertion points, the RV lateral wall turning point, the apex, and the center of the mitral valve).


Figure 2. 2.5D Fully Convolutional Network Architecture (image from [3])


          The inputs are 3D volumetric images treated as multi-channel vectors, in which each 2D slice is considered a channel, similar to the structure of RGB images. This guarantees spatial consistency along the long axis. However, since most operations (after the first layer and before the last) are 2D, training requires less computational power than a 3D FCN and allows deeper network architectures; in this case, fifteen layers were used.
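
          To make the slices-as-channels idea concrete, here is a minimal sketch of how such an input could be fed to a 2.5D network in PyTorch (the layer sizes are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

# A short-axis CMR volume with D slices of size H x W is treated like a
# D-channel 2D image, analogous to the three channels of an RGB image.
D, H, W = 12, 192, 192                       # illustrative dimensions, not the paper's
volume = torch.randn(1, D, H, W)             # (batch, channels = slices, height, width)

# Only the first layer mixes information across slices (channels); every
# later layer is an ordinary 2D convolution, which keeps training far
# cheaper than a full 3D FCN while preserving long-axis context.
first = nn.Conv2d(in_channels=D, out_channels=64, kernel_size=3, padding=1)
hidden = nn.Conv2d(64, 64, kernel_size=3, padding=1)    # purely 2D from here on

features = hidden(torch.relu(first(volume)))
print(features.shape)                        # torch.Size([1, 64, 192, 192])
```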

          A differentiable Dice loss, L_D(W), was used as the loss function for segmentation (Eq. 1):

L_D(W) = -\frac{2\sum_{j}\sum_{k}\mathbb{1}(y_j = k)\,p_{jk}(W)}{\sum_{j}\sum_{k}\left[\mathbb{1}(y_j = k) + p_{jk}(W)\right]} \qquad (1)

          Here, \mathbb{1}(y_j = k) is the ground-truth indicator (probability) that voxel j has label k, and p_{jk}(W) is the corresponding estimated Softmax probability. The fraction measures the similarity between the two, so minimizing its negative maximizes the similarity.
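
          As a hedged sketch (my own reconstruction, not the authors' code), Eq. 1 could be implemented in PyTorch roughly as follows:

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(logits, target, num_classes=5, eps=1e-6):
    """Negative multi-class soft Dice (cf. Eq. 1).

    logits: (B, K, H, W) raw network outputs; target: (B, H, W) integer
    labels. num_classes=5 assumes the four cardiac regions plus background.
    """
    probs = F.softmax(logits, dim=1)                                     # p_jk(W)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()  # 1(y_j = k)
    intersection = (probs * onehot).sum()
    denominator = (probs + onehot).sum()
    return -(2.0 * intersection + eps) / (denominator + eps)
```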

          Class-Balanced Weighted Categorical Cross-Entropy, L_L(W), was used as the loss function for LL (Eq. 2):

L_L(W) = -\sum_{j}\sum_{k} w_k\,\mathbb{1}(y_j = k)\,\log p_{jk}(W) \qquad (2)

          The per-class weight w_k compensates for the highly imbalanced landmark-to-background voxel ratio, ensuring that the network learns the landmark features rather than just the background.
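
          A corresponding sketch for Eq. 2, under the assumption of simple inverse-frequency weights (the paper's exact weighting scheme may differ):

```python
import torch
import torch.nn.functional as F

def class_balanced_ce_loss(logits, target, num_classes=7):
    """Weighted categorical cross-entropy (cf. Eq. 2).

    num_classes=7 assumes six landmark classes plus background. The
    inverse-frequency weighting is an assumed, common balancing choice.
    """
    counts = torch.bincount(target.flatten(), minlength=num_classes).float()
    weights = 1.0 / (counts + 1.0)       # rare landmark voxels get large weights
    weights = weights / weights.sum()
    return F.cross_entropy(logits, target, weight=weights)
```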

          To achieve multi-task learning, these two loss functions were combined into one (Eq. 3):

L(W) = \alpha\,L_D(W) + \beta\,L_L(W) \qquad (3)

where α = 0.8 and β = 0.2.
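
          Putting the two together, the multi-task objective of Eq. 3 is then just a weighted sum, reusing the two sketches above:

```python
# Combined multi-task objective (Eq. 3).
def multi_task_loss(seg_logits, seg_target, lm_logits, lm_target,
                    alpha=0.8, beta=0.2):
    return (alpha * soft_dice_loss(seg_logits, seg_target)
            + beta * class_balanced_ce_loss(lm_logits, lm_target))
```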

Methods: Shape Refinement

Figure 3. Landmark Localization Process (image modified from [3])

          SR introduces anatomical shape prior knowledge into SSLLN-LR's initial segmentation prediction in order to remove artifacts. The process is illustrated in Fig. 3. During experimentation, 231 SSLLN-HR output models were used as atlases, of which the five most similar to the target were used for non-rigid registration. For the non-rigid registration, only the segmentations (not the intensity images) were used, with label consistency as the loss guiding the transformation.
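
          As a rough illustration of the atlas-selection step (the paper's exact similarity criterion and registration toolchain are not reproduced here), one could rank landmark-aligned atlases by label overlap with the target segmentation and keep the five best:

```python
import numpy as np

def label_dice(a, b):
    """Mean Dice over the four foreground labels of two aligned label maps."""
    scores = []
    for k in (1, 2, 3, 4):                          # LVW, LVC, RVW, RVC
        inter = np.logical_and(a == k, b == k).sum()
        total = (a == k).sum() + (b == k).sum()
        scores.append(2.0 * inter / max(total, 1))
    return float(np.mean(scores))

def select_atlases(target_seg, aligned_atlas_segs, n_best=5):
    """Rank landmark-aligned atlas segmentations by label overlap with the
    target and keep the n_best most similar ones for non-rigid registration."""
    scores = [label_dice(target_seg, atlas) for atlas in aligned_atlas_segs]
    order = np.argsort(scores)[::-1]                # descending similarity
    return [aligned_atlas_segs[i] for i in order[:n_best]]
```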


Experiments                           

          Two datasets were used. Dataset 1 included 1831 HR CMR volumes from healthy patients. Dataset 2 included 649 LR CMR volumes from PH patients; for 20 of these patients, who were able to hold their breath long enough, paired HR volumes were also acquired.

          Data pre-processing involved image resizing and intensity re-scaling. For training, the authors used 50 epochs, a batch size of 8 volumes, a learning rate of 0.0001, and the Adam optimizer. Data augmentation was applied to ensure more diverse inputs.
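
          These hyperparameters translate into a very conventional training loop; a minimal sketch, where build_sslln_model() and train_dataset are hypothetical placeholders rather than the authors' code:

```python
import torch

# Hypothetical placeholders: build_sslln_model() and train_dataset stand in
# for the authors' actual 2.5D FCN and pre-processed, augmented data.
model = build_sslln_model()
loader = torch.utils.data.DataLoader(train_dataset, batch_size=8, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(50):
    for volumes, seg_target, lm_target in loader:
        seg_logits, lm_logits = model(volumes)      # two task-specific heads
        loss = multi_task_loss(seg_logits, seg_target, lm_logits, lm_target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```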

SSLLN-HR Segmentation & Dataset 1

          Here, SSLLN-HR's segmentation results were compared to those of a 3D and a 2D FCN. The Dice Index and Hausdorff Distance were used as evaluation metrics, as in most experiments in the paper. The comparison can be seen in Fig. 4 and Table I.

Figure 4. Segmentation comparison among SSLLN-HR, 3D FCN, and 2D FCN (image modified from [3])


Table I. (Taken from [3])

          It is clear that SSLLN-HR's segmentation outperforms the 2D FCN, since the latter fails to consider spatial context. While its results are similar to those of the 3D FCN, the latter is more computationally expensive during training.
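
          For reference, both evaluation metrics are straightforward to compute with standard tooling; a sketch for binary masks, using SciPy's directed Hausdorff on voxel coordinates:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_index(pred, gt):
    """Dice overlap of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def hausdorff_distance(pred, gt):
    """Symmetric Hausdorff distance between the voxel coordinates of two
    binary masks (in voxel units; multiply by spacing for millimetres)."""
    p = np.argwhere(pred)
    g = np.argwhere(gt)
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])
```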

SSLLN-HR LL & Dataset 1

          Here, the landmarks of the clinical expert who labeled the training dataset (inter-user 1) are taken as ground-truth. SSLLN-HR's LL prediction is compared to that of inter-user 1 and to a second clinical expert, inter-user 2, using the point-to-point Euclidean distance as evaluation metric (see Fig. 5 and Table II).

Figure 5. Comparison between SSLLN-HR LL results and ground-truth landmarks (image modified from [3])


Table II. (Taken from [3] - Title modified)

Although I think the results would have more impact if an external clinical expert had been taken as ground-truth, the difficulty of finding a third clinical expert for data labeling is understandable, and the authors show that the difference between SSLLN-HR's LL predictions and inter-user 1 is significantly smaller than that between the two clinical experts.

SSLLN-LR+SR & Dataset 1

          This experiment, in which LR artifacts were simulated on HR volumes, was performed in order to obtain quantitative evaluation results, given the lack of smooth ground-truth models for real LR data. The results of the comparison between SSLLN-HR and SSLLN-LR+SR are shown in Tables III-V.
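
          One simple way such degradation could be simulated (an assumption on my part; the paper's exact protocol is not reproduced here) is to subsample slices along the long axis and jitter them in-plane:

```python
import numpy as np

def simulate_lr_artifacts(hr_volume, factor=5, max_shift=3, seed=0):
    """Crude LR simulation: keep every `factor`-th short-axis slice to mimic
    increased slice thickness, then randomly shift each remaining slice
    in-plane to mimic inter-breath-hold motion."""
    rng = np.random.default_rng(seed)
    lr = hr_volume[::factor].copy()                 # (D, H, W) -> (D//factor, H, W)
    for i in range(lr.shape[0]):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        lr[i] = np.roll(lr[i], shift=(int(dy), int(dx)), axis=(0, 1))
    return lr
```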

Table III. (Taken from [3])


Table IV. (Taken from [3])


Table V. (Taken from [3])

          Tables III and IV show that SSLLN-HR outperforms SSLLN-LR+SR, and although the paper describes the results as comparable, Table III shows that training on HR versus LR volumes has a significant impact on the network's accuracy. However, Table V shows that SSLLN-LR+SR outperforms other state-of-the-art cardiac segmentation approaches.

SSLLN-LR+SR & Dataset 2

          Due to the lack of smooth LR ground-truth models, a clinical expert visually evaluated the segmentations, confirming their accuracy. The 20 pairs of HR and LR volumes were then used to quantitatively evaluate the results, shown in Table VI.

Table VI. (Taken from [3])

According to the authors, the high p-values are due to the low number of volumes used and the accuracy of the results.

Discussion

          Several reasons are given for SSLLN-LR+SR outperforming other approaches: AP imposes the shape prior explicitly, whereas other 3D-network approaches impose shape constraints implicitly; the 2.5D FCN allows deeper network architectures than a 3D one; and label-based registration is more accurate than intensity-based registration.

          SSLLN-LR+SR gives accurate results for PH patients' volumes because the landmark-based affine transformation captures global and local deformations between subjects, the use of multiple atlases for registration prevents diseased cases from producing healthy-looking results, and label consistency is used as the loss function for non-rigid registration.

          The use of LL is justified by an ablation in which the segmentation's tissue classes, rather than the landmarks, were used to initialize the spatial alignment, producing low-quality results. The explanations given are that landmarks effectively reflect the pose, size, and shape of the heart, and that computing an affine transformation from landmark correspondences is a convex problem.
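
          The convexity claim is easy to make concrete: given corresponding landmark pairs, fitting the affine transformation is a linear least-squares problem with a closed-form solution, e.g.:

```python
import numpy as np

def fit_affine(src_pts, dst_pts):
    """Least-squares 3D affine transform mapping src_pts onto dst_pts.

    src_pts, dst_pts: (N, 3) arrays of corresponding landmarks (N >= 4).
    Returns a (3, 4) matrix [A | t] such that dst ≈ A @ src + t.
    """
    n = src_pts.shape[0]
    src_h = np.hstack([src_pts, np.ones((n, 1))])   # homogeneous coordinates
    # Convex least-squares problem with a closed-form solution.
    X, *_ = np.linalg.lstsq(src_h, dst_pts, rcond=None)
    return X.T
```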

Conclusion

          SSLLN combines the computational advantage of a 2D FCN with the spatial consistency of a 3D FCN. Introducing shape prior information into the SSLLN-LR segmentation results in an artifact-free, smooth 3D bi-ventricular model.

          Limitations include it being a two-stage approach, meaning the results of the second stage depend strongly on those of the first. Also, when deployed, SR takes 15-20 minutes per subject.

          Future work involves training a single network to compute smooth shapes directly from artifact-corrupted LR CMR volumes, improving the computational speed of SR, and applying the network to the classification of healthy vs. non-healthy patients.

Student’s Reflection

          The authors present an original and elegant solution to several previously unsolved problems in cardiac segmentation, with high-quality results whose value is recognized by other authors [4]. Likewise, the public availability of the network's code facilitates understanding of the proposed method. However, a few discussion points caught my attention.

          For instance, the computational expense of SR is mentioned as a limitation, but no point of comparison is given. While the 3D ACNN's Dice Index is 0.004 lower than that of SSLLN-LR+SR, its inference time is 0.06 s per subject [1] versus 15-20 min per subject. This might not matter for a few subjects, but it certainly can for large datasets, or if the model were used to feed a classification network. In those cases, whether the difference in time is worth the difference in quality is questionable.

          Also, when discussing the results in Table VI, the authors only mention the high p-values, which indicate no significant difference between the clinical measurements produced by the network and those of a clinical expert. However, the small p-value for the RVV indicates a significant difference in that measurement. I think this is especially important because the main difference between a PH patient and a healthy patient is an enlarged RV. It suggests that using 5 atlases for registration might not always be sufficient to prevent diseased cases from producing healthy-looking results.

          Overall, I think the proposed method is innovative, efficient, and creative. I truly enjoyed preparing the presentation on this paper, and when a few typos made some of the equations hard to understand, the author was kind enough to explain the corrections.

Bibliography

[1]     Oktay, O., Ferrante, E., Kamnitsas, K., Heinrich, M., Bai, W., Caballero, J., ... & Kainz, B. (2017). Anatomically constrained neural networks (ACNNs): application to cardiac image enhancement and segmentation. IEEE Transactions on Medical Imaging, 37(2), 384-395.

[2]     Bai, W., Sinclair, M., Tarroni, G., Oktay, O., Rajchl, M., Vaillant, G., ... & Zemrak, F. (2018). Automated cardiovascular magnetic resonance image analysis with fully convolutional networks. Journal of Cardiovascular Magnetic Resonance, 20(1), 65.

[3]     Duan, J., Bello, G., Schlemper, J., Bai, W., Dawes, T. J., Biffi, C., ... & Rueckert, D. (2019). Automatic 3D bi-ventricular segmentation of cardiac images by a shape-refined multi-task deep learning approach. IEEE Transactions on Medical Imaging, 38(9), 2151-2164.

[4]     Narang, A., & Freed, B. H. (2019). The Future of Imaging in Pulmonary Hypertension: Better Assessment of Structure, Function, and Flow. Advances in Pulmonary Hypertension, 18(4), 126-133.
