Agisilaos Chartsias, Thomas Joyce, Giorgos Papanastasiou, Michelle Williams, David Newby, Rohan Dharmakumar, and Sotirios A. Tsaftaris


Patients who suffer from a brain tumour may receive both CT and MRI scans during diagnosis. Looking at both images, you will find the same anatomical structures, rendered in the style of the respective imaging modality. In this work, we use deep learning to factorise a medical image into these two distinct factors: the anatomical structure and the imaging-specific characteristics. This factorisation is called a disentangled representation.

A disentangled representation can be defined as a representation of information in terms of meaningful, independent factors. Based on prior knowledge about medical images, these images can be decomposed into:

  • anatomical factor: a spatial representation of the anatomy, which is useful for medical image processing tasks such as segmentation and registration.
  • modality factor: the imaging-specific characteristics used to render this anatomy into a specific medical image.


Previous related work can be divided into:

  • Factorised representation learning: [Fidon et al.][1] factorise information into factors unique to each modality and factors shared between modalities, but the shared representation is not a semantic one.
  • Style and content disentanglement: this representation is analogous to the style and content approach. [Esser et al.] expressed content as a shape estimation (using an edge extractor and a pose estimator) and combined it with style obtained from a VAE.
  • Semi-supervised segmentation: [Zhao et al.] address a multi-instance segmentation task in which bounding boxes are available for all instances, but pixel-level segmentation masks only for some.
  • Cardiac segmentation: [Oktay et al.][2] and others investigate 2D, 3D and temporal segmentation, but the majority of recent methods use convolutional networks with full supervision for multi-class cardiac segmentation.
  • Interpretable semantics: [Chartsias et al.][3] enforce the spatial factor to be a binary myocardial segmentation, but the remaining anatomy must then be encoded in the non-spatial factor, which violates the concept of an explicit factorisation into anatomical and modality factors.

In this section, we build up the architecture used (the Spatial Decomposition Network), as shown in the figure below. It is based on convolutional neural networks and generative models:

Figure: The Spatial Decomposition Network.


1) Anatomical Representation

To enforce that the network f_anatomy learns the anatomical factor, we assume that each pixel belongs to exactly one channel. The network used for encoding the anatomy is a U-Net, which takes a 2D grayscale image as input and maps it to C anatomical channels of the same width and height as the input image, where C is a hyperparameter. To enforce a factorisation of the spatial factor into distinct channels and make each pixel belong to one channel, a softmax is applied across channels. Since the softmax yields a continuous distribution, a thresholding step binarises the channels during the forward pass. This reduces the capacity of the spatial factor and prevents modality information from being encoded in it. During backpropagation, the model is updated using the continuous distribution, skipping the thresholding step. The trained network is found to capture individual anatomical structures (e.g. MYO, LV, RV) in some channels, while the remaining channels contain surrounding image structures which are not anatomically distinct.
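A minimal PyTorch sketch of this softmax-plus-thresholding step with a straight-through gradient; the 0.5 threshold and the function name are illustrative assumptions, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def binarise_anatomy(logits: torch.Tensor) -> torch.Tensor:
    """Map U-Net outputs (B, C, H, W) to binary anatomical channels.

    Softmax makes the channels compete for each pixel; thresholding
    binarises them in the forward pass, while the straight-through
    trick lets gradients flow through the continuous softmax.
    """
    soft = F.softmax(logits, dim=1)   # continuous, sums to 1 per pixel
    hard = (soft > 0.5).float()       # binary channels (forward pass only)
    # Straight-through estimator: forward value is `hard`,
    # gradients are taken with respect to `soft`.
    return (hard - soft).detach() + soft
```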


2) Modality Representation

The modality style is represented by an N-dimensional vector. The network f_modality maps the input image, concatenated with the anatomical channels, to the modality vector. A variational autoencoder (VAE) is used so that samples can be generated from the modality encoder, based on the learnt posterior distribution of the modality vector. A divergence loss L_KL is applied between the estimated posterior distribution and a unit Gaussian prior.
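A hedged sketch of the L_KL term in PyTorch, using the closed-form KL divergence between a diagonal Gaussian posterior and a unit Gaussian prior; the variable names (mu, logvar) are conventions, not from the paper's implementation.

```python
import torch

def kl_loss(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL divergence between N(mu, sigma^2) and the unit Gaussian prior,
    in closed form for diagonal Gaussians, averaged over the batch."""
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)).mean()

# The modality vector itself is sampled with the reparameterisation trick:
# z = mu + exp(0.5 * logvar) * torch.randn_like(mu)
```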


3) Image Reconstruction



The estimated anatomical and modality factors should together be able to reproduce the input image. To enforce this, we use a network g_decoder which maps the anatomical channels, modulated by the modality factor, back to a reconstruction of the input image. A reconstruction loss L_rec is applied between the reconstructed image and the original image. A feature-wise linear modulation (FiLM) [4] layer is used to modulate the anatomy factor, a 3D volume, by the N-dimensional modality vector. FiLM is implemented as an affine transformation applied per channel, with learnable scale and offset parameters for each feature-map channel.
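Below is a minimal FiLM-style layer in PyTorch, following the idea in Perez et al. [4]; predicting (gamma, beta) from z with a single linear layer is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, z_dim: int, num_channels: int):
        super().__init__()
        # Predict per-channel scale (gamma) and offset (beta) from z.
        self.mlp = nn.Linear(z_dim, 2 * num_channels)

    def forward(self, s: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # s: anatomy features (B, C, H, W); z: modality vector (B, z_dim)
        gamma, beta = self.mlp(z).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * s + beta  # per-channel affine modulation
```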

Another challenge that arises during training is VAE posterior collapse. This is a degenerate condition in which the decoder ignores some factors of z: even though the reconstruction is accurate, not all of the data variation is captured in the underlying factors. Therefore, we need a loss L_z-rec to penalise the VAE for ignoring dimensions of the latent distribution and to encourage each encoded image to produce a low-variance Gaussian.
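One plausible instantiation of L_z-rec, sketched in PyTorch: decode with a modality vector, re-encode the decoded image, and penalise the discrepancy. The L1 form is an assumption; the paper's exact formulation may differ.

```python
import torch

def z_rec_loss(z: torch.Tensor, z_rec: torch.Tensor) -> torch.Tensor:
    """L1 penalty between the modality vector fed to the decoder and the
    vector re-estimated by re-encoding the decoded image; this forces
    the decoder to actually use every dimension of z."""
    return (z - z_rec).abs().mean()
```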

4) Segmentation

We can make use of the anatomical channels for anatomy-based tasks such as segmentation. A further small network h is added to map the anatomical channels to the segmented structures. A Dice loss L_segm is applied between the estimated segmentation and the ground truth.
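A standard soft Dice loss sketch in PyTorch; the smoothing constant is a common default, not necessarily the paper's choice.

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """pred, target: (B, K, H, W) with pred in [0, 1] and target one-hot."""
    dims = (2, 3)
    intersection = (pred * target).sum(dims)
    union = pred.sum(dims) + target.sum(dims)
    dice = (2 * intersection + eps) / (union + eps)  # per image and class
    return 1 - dice.mean()
```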


5) Semi-supervised segmentation

In medical applications, annotating images is labour-intensive, and the number of unannotated images is far larger than the number of annotated ones. In order to use images without corresponding segmentations in the training phase, an adversarial loss is defined using a discriminator over masks, D_M, based on LeastSquares-GAN [5]. The networks f_anatomy and h are trained to maximise the adversarial objective against D_M, which is trained to minimise it.
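A hedged sketch of the LSGAN objective [5] for the mask discriminator; the network definitions are omitted, and the label convention (1 for real masks, 0 for predicted ones) is an assumption.

```python
import torch

def d_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Least-squares discriminator loss: d_real are D_M's outputs on real
    masks, d_fake its outputs on masks predicted for unlabelled images."""
    return 0.5 * ((d_real - 1).pow(2).mean() + d_fake.pow(2).mean())

def g_loss(d_fake: torch.Tensor) -> torch.Tensor:
    """Generator-side loss: push predicted masks to look real to D_M."""
    return 0.5 * (d_fake - 1).pow(2).mean()
```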

6) Other tasks

We can use the learnt anatomical channels to train a regressor that estimates the volume of an anatomical structure (e.g. the left ventricle). Training the whole network with such auxiliary tasks is found to increase segmentation accuracy.
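One plausible form of such an auxiliary regressor, sketched in PyTorch: a small head that pools the anatomical channels into a scalar estimate. The architecture is purely illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class AreaRegressor(nn.Module):
    def __init__(self, in_channels: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool spatial dims to 1x1
            nn.Flatten(),
            nn.Linear(16, 1),         # scalar area/volume estimate per slice
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: anatomical channels (B, C, H, W)
        return self.head(s)
```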


In this section, we will show experimental results for the proposed approach.

Four datasets are used:

  1. MM-WHS: 3,626 MR and 2,580 CT segmented images covering the myocardium (MYO), left atrium (LA), left ventricle (LV), right atrium (RA), right ventricle (RV), ascending aorta (AO), and pulmonary artery (PA).

  2. ACDC: 1,920 segmented and 23,530 unsegmented MR images covering the myocardium (MYO), left ventricle (LV), and right ventricle (RV).

  3. QMRI: 241 segmented and 8,353 unsegmented MR images covering the myocardium (MYO) and left ventricle (LV).

  4. Espree: 129 cine-MR and 264 CP-BOLD segmented images of the myocardium (MYO).


The following subsections present different analyses of the proposed network:

1) Semi-supervised segmentation

We use ACDC and QMRI for these experiments. In both experiments, we use a fixed set of 1,200 unlabelled images:

a) ACDC test for LV, MYO and RV: the results are shown compared to other state-of-the-art segmentation networks. As the figures show, even as the amount of annotated data decreases, the proposed network still achieves a good Dice score for the segmented labels.



b) QMRI test for LV and MYO: the proposed network shows the same behaviour.


Segmentation examples for different numbers of labelled images from the ACDC dataset. Blue, green and red show the model's predictions for MYO, LV and RV respectively. This illustrates why networks with a GAN are preferable for training a semi-supervised segmentation model.


2) Left ventricular volume estimation 

Using the QMRI dataset, we first calculate the ground-truth left ventricular volume (LVV) for each patient. We fine-tune the model trained on 6% labelled QMRI data from the previous experiment, while training the area regressor using the ground-truth values. The multi-task objective used to fine-tune the whole model also benefits test segmentation accuracy: for the individual labels, MYO accuracy rises from 63.3% to 70.6% and LV accuracy rises from 81.9% to 89.9%.


3) Multimodal learning
 a) This experiment shows how training the segmentation task on multimodal data can compensate for a lack of labelled data in one modality. The following shows the Dice score (%) on MM-WHS (LV, RV, MYO, LA, RA, PA, AO) for different percentages of labelled MR and CT data.
 


 b) In the next figure, we show how images of a different modality can be synthesised by swapping in the other modality's vector in place of the original one. 100% of the MR and CT data in the MM-WHS dataset are used.


4) Modality type estimation

a) This experiment shows how the type of modality can be estimated from the z vector alone. The model trained on 100% of the MR and 100% of the CT images from the multimodal learning experiment is used. We trained a logistic regression model on the 8-dimensional modality vector; the accuracy was 92% on test data.



b) To find out which dimension has the most influence on the modality rendering, we train 8 single-input logistic regressors, one for each dimension of z. z5 obtains an accuracy of 82%, whereas the remaining dimensions vary from 42% to 66% accuracy.
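A sketch of this probing setup using scikit-learn; the random placeholder data stands in for the encoded modality vectors and MR/CT labels, which are assumptions for the sake of a runnable example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for encoded modality vectors and MR/CT labels.
rng = np.random.default_rng(0)
z_train, y_train = rng.normal(size=(500, 8)), rng.integers(0, 2, 500)
z_test, y_test = rng.normal(size=(100, 8)), rng.integers(0, 2, 100)

# Full-vector probe: classify modality type from all of z.
clf = LogisticRegression().fit(z_train, y_train)
print("accuracy:", clf.score(z_test, y_test))

# Per-dimension probes: one single-input regressor per z-dimension.
for i in range(z_train.shape[1]):
    probe = LogisticRegression().fit(z_train[:, i:i + 1], y_train)
    print(f"z{i + 1}: {probe.score(z_test[:, i:i + 1], y_test):.2f}")
```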





5) Latent space arithmetic 

This experiment shows how the spatial and modality factors interact to reproduce the output. We use the ACDC (MYO, LV, RV) dataset with 100% of the labelled training images.

a) Arithmetic on the spatial factor s: the anatomy channels are changed while the modality vector is fixed (see the sketch after this list).

    • Moving MYO into LV and nulling MYO: the intensity of the myocardium becomes the same as the intensity of the left ventricle.
    • Swapping MYO and LV: the intensities of the two substructures are reversed.
    • Randomly shuffling the spatial channels.
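A hedged PyTorch sketch of these channel manipulations on the spatial factor s of shape (B, C, H, W); the channel indices for MYO and LV are assumptions for the sake of the example.

```python
import torch

def swap_channels(s: torch.Tensor, i: int, j: int) -> torch.Tensor:
    """Swap two anatomical channels, e.g. MYO and LV, before decoding."""
    s = s.clone()
    s[:, [i, j]] = s[:, [j, i]]
    return s

def move_and_null(s: torch.Tensor, src: int, dst: int) -> torch.Tensor:
    """Merge channel `src` into `dst` and null `src` (e.g. MYO into LV)."""
    s = s.clone()
    s[:, dst] = (s[:, dst] + s[:, src]).clamp(max=1)  # keep channels binary
    s[:, src] = 0
    return s
```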


 b) Arithmetic on the modality factor z: the modality vector dimensions are changed individually, while the anatomy channels are fixed. An image x is encoded to factors s and z. The prior over z is an 8-dimensional unit Normal distribution, so 99.7% of its probability mass lies within three standard deviations of the mean, and the probability space is almost fully covered by values in the range [-3, 3]. We therefore interpolate each z-dimension between -3 and 3, while keeping the values of the remaining dimensions fixed.
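A sketch of this per-dimension interpolation in PyTorch; the decoder signature decoder(s, z) and the number of steps are assumptions for illustration.

```python
import torch

def interpolate_dim(decoder, s, z, dim: int, steps: int = 7):
    """Decode images while sweeping one z-dimension over [-3, 3],
    keeping the spatial factor s and all other z-dimensions fixed."""
    images = []
    for value in torch.linspace(-3, 3, steps):
        z_mod = z.clone()
        z_mod[:, dim] = value
        images.append(decoder(s, z_mod))
    return torch.stack(images)
```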




We can then decode synthetic images that show the variability induced by each z-dimension. The correlation maps show a large positive or negative correlation between each z-dimension and most pixels of the input image. This demonstrates that:

  • z mostly captures global image characteristics.
  • Local correlations are also evident, for example between:
    • z1 and all pixels of the heart
    • z4 and the right ventricle
    • z5 and the myocardium

The difference image for each row is calculated by subtracting the image in the first grid position (j = -3) from the image in the last grid position (j = 3):

  • z1 and z4 produce the largest-magnitude changes and appear to significantly alter the local contrast.



References

[1] Lucas Fidon, Wenqi Li, Luis C. Garcia-Peraza-Herrera, Jinendra Ekanayake, Neil Kitchen, Sebastien Ourselin, and Tom Vercauteren. Scalable multimodal convolutional networks for brain tumour segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI).

[2] O. Oktay, E. Ferrante, K. Kamnitsas, M. Heinrich, W. Bai, J. Caballero, S. A. Cook, A. de Marvao, T. Dawes, D. P. O'Regan, B. Kainz, B. Glocker, and D. Rueckert. Anatomically constrained neural networks (ACNNs): Application to cardiac image enhancement and segmentation. IEEE Transactions on Medical Imaging.

[3] Agisilaos Chartsias, Thomas Joyce, Giorgos Papanastasiou, Scott Semple, Michelle Williams, David Newby, Rohan Dharmakumar, and Sotirios A. Tsaftaris. Factorised spatial representation learning: Application in semi-supervised myocardial segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI).

[4] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI, 2018.

[5] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. On the effectiveness of least squares generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence.













