This is a blog post for the paper ‘Data Augmentation Using Learned Transformations for One-shot Medical Image Segmentation’.
Written by Amy Zhao, Guha Balakrishnan, Fredo Durand, John V. Guttag, Adrian V. Dalca.

Introduction and Problem Statement

Semantic segmentation is an important task in medical image analysis, with applications ranging from disease diagnosis to treatment planning [6]. A wide range of methods has been used to improve segmentation accuracy, including supervised [1], semi-supervised [2], and unsupervised [3] learning approaches. However, the current state-of-the-art results are achieved with supervised learning models that require large amounts of manually annotated data. Collecting and labeling medical images is not only laborious, but is also complicated by discrepancies in image intensity, resolution, and subject positioning caused by variation in individual subjects and in the equipment used [4].

Data Augmentation

Data augmentation techniques can be used to increase the size of the training dataset. It is common practice to apply simple transformations to the available data, such as rotations, translations, or random crops [9]. However, these augmentation schemes rely on hand-tuned parameters, which makes data augmentation an implicit form of parameter tuning. Moreover, since biomedical images differ significantly in anatomy, contrast, and texture, such simple transformations are insufficient to model these complex variations.

Transformation Models

Recently, it has become popular to learn transformations from the available labeled data and use them to synthesize new examples. In [10], the authors learn the spatial transformations that align pairs of images from the MNIST dataset and use them for data augmentation [Figure 1]. In addition to spatial transformations, many works also use appearance transformations that account for variations in intensity [7].

Figure 1: Example of learned spatial transformations. The top row shows the original data; the second row shows the images after applying a random transformation sampled from the model [10].

In this paper, the authors combine spatial and appearance transformation models to propose a novel data augmentation technique, which works even in the more challenging scenario of few-shot semantic segmentation, where only a few annotated instances are available. The high-level idea of the proposed method is to learn transformations from the data and use them to generate a natural-looking dataset that captures the real, complex variation of biomedical images.

Methodology

In order to synthesize realistic-looking labeled scans, the proposed approach requires only one labeled image and a number of unlabeled instances. Two separate transformation models are used to learn and compute a set of spatial and appearance transformations between each pair of labeled and unlabeled samples. These transformations can later be randomly picked from the set and applied to the labeled instance in order to produce a new synthetic example [Figure 2].

Figure 2: The overview of the proposed method [11].

Spatial Transformation Model

The spatial transformation model is used to align two given images and find anatomical correspondences from scan to scan. This paper builds on VoxelMorph [5], an unsupervised learning model that performs spatial transformation for atlas-based segmentation. Given a labeled reference volume $(x,l_x)$, the atlas, and an unlabeled set of biomedical image volumes $\{y^{(i)}\}$, the transformation that aligns $x$ to each $y^{(i)}$ is modeled as a voxel-wise displacement field $u$ [Figure 4]. The deformation function is defined as $\phi = id + u$, where $id$ is the identity function. To account for the natural variability in the data, the deformation that warps the atlas $x$ to each volume $y^{(i)}$ is modeled as $\phi^{(i)}=g_{\theta_s}(x,y^{(i)})$, where $g_{\theta_s}(x,y^{(i)})$ is a parametric function learned by a convolutional neural network based on the U-Net [13] architecture [Figure 3].

Figure 3: Overview of the spatial transformation model [5].

Figure 4: An example of a displacement field obtained by aligning the moving image to the fixed one. The displacement field indicates with vectors where each voxel should be moved [12].
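
To make the deformation concrete, here is a minimal sketch (not the authors' code) of how a voxel-wise displacement field can be applied to warp an image, i.e. computing $x\circ\phi$ with $\phi = id + u$. It uses scipy's map_coordinates for interpolation, works on a 2D image for brevity, and the function name warp is purely illustrative.

    import numpy as np
    from scipy.ndimage import map_coordinates

    def warp(image, displacement, order=1):
        # image: (H, W) array; displacement: (2, H, W) array, where
        # displacement[k] holds the per-pixel offset along axis k.
        grid = np.meshgrid(np.arange(image.shape[0]),
                           np.arange(image.shape[1]), indexing="ij")
        # phi = id + u: sample the image at the displaced grid locations
        coords = np.array([g + d for g, d in zip(grid, displacement)])
        return map_coordinates(image, coords, order=order, mode="nearest")

    # toy example: shift an image by one pixel along the first axis
    x = np.random.rand(64, 64)
    u = np.zeros((2, 64, 64))
    u[0] = 1.0
    warped = warp(x, u)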

The loss function for this network consists of two parts:

$\mathcal{L}(x,y^{(i)},\phi^{(i)})=\mathcal{L}_{sim}(y^{(i)}, x\circ\phi^{(i)})+\lambda \mathcal{L}_{smooth}(\phi^{(i)})$

  1. Similarity loss $\mathcal{L}_{sim}(y^{(i)}, x\circ\phi^{(i)})$ measures how similar the warped atlas is to the target volume. It is based on local cross-correlation, which makes it robust to intensity variations [8]:
    $\mathcal{L}_{sim}(y^{(i)}, x\circ\phi^{(i)}) = -CC(y^{(i)}, x\circ\phi^{(i)})$
    $CC(y, x\circ\phi)=\sum_{p \in \Omega}{\frac{\Big(\sum_{p_i}\big(y(p_i)-\hat{y}(p)\big)\big([x\circ\phi](p_i)-[\hat{x}\circ\phi](p)\big)\Big)^2}{\Big(\sum_{p_i}\big(y(p_i)-\hat{y}(p)\big)^2\Big)\Big(\sum_{p_i}\big([x\circ\phi](p_i)-[\hat{x}\circ\phi](p)\big)^2\Big)}}$, where $\hat{f}(p) = \frac{1}{n^3}\sum_{p_i}f(p_i)$ is the local mean intensity over an $n^3$ window around $p$.
  2. Smoothness loss $\mathcal{L}_{smooth}(\phi^{(i)})$ encourages the deformation field to be smooth and is defined via the spatial gradient of the displacement field (a minimal numpy sketch of both loss terms is given after this list):
    $\mathcal{L}_{smooth}(\phi)=\sum_{p \in \Omega}||\nabla u(p)||^2$
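
The sketch below illustrates both terms with numpy, assuming 2D arrays for brevity; the windowed cross-correlation is computed with the standard sum-of-products identity, and all function names are illustrative rather than taken from the paper's code.

    import numpy as np
    from scipy.ndimage import uniform_filter

    def smoothness_loss(u):
        # u: (2, H, W) displacement field; returns sum_p ||grad u(p)||^2
        return sum(np.sum(g ** 2)
                   for comp in u                 # each displacement component
                   for g in np.gradient(comp))   # finite-difference gradients

    def local_cc(a, b, n=9, eps=1e-5):
        # Windowed (local) normalized cross-correlation, summed over the image.
        # S(f) approximates the sum of f over an n x n window around each pixel.
        win = n ** a.ndim
        S = lambda f: uniform_filter(f, size=n) * win
        sa, sb = S(a), S(b)
        cross = S(a * b) - sa * sb / win
        var_a = S(a * a) - sa ** 2 / win
        var_b = S(b * b) - sb ** 2 / win
        return np.sum(cross ** 2 / (var_a * var_b + eps))

    def registration_loss(y, warped_atlas, u, lam=1.0):
        # L = L_sim + lambda * L_smooth, with L_sim = -CC
        return -local_cc(y, warped_atlas) + lam * smoothness_loss(u)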


Appearance Transformation Model

The appearance transformation model is responsible for variations in intensity. It is modeled as a per-voxel additive change in intensity:

$\tau_a^{(i)}(x)=x+\psi^{(i)}$

It should be noted that we are trying to find a transformation that can be applied directly to our atlas in order to synthesize new images. Because of that, we first have to register the unlabeled instance to the atlas space using an inverse spatial transformation $\phi^{-1^{(i)}}$. The inverse spatial transformation is implemented as a separate convolutional neural network, identical to the one described in the previous section.


A naive way to estimate $\psi$ would be to simply subtract the intensities of the atlas from those of the unlabeled instance in the atlas space. However, this would also capture all of the errors associated with imperfect image registration. To avoid that, similarly to the spatial transformation, $\psi$ is modeled as a parametric function $h_{\theta_a}(x, y^{(i)}\circ\phi^{-1^{(i)}})$ that is learned by a convolutional neural network based on the U-Net [13] architecture [Figure 5].

Figure 5: Overview of the appearance transformation model [6].

Similarly to the spatial transformation model, the loss function consists of a similarity and a smoothness term:

$\mathcal{L}(x,y^{(i)},\phi^{(i)},\phi^{-1^{(i)}},\psi^{(i)},c_x)=\mathcal{L}_{sim}((x+\psi^{(i)})\circ\phi^{(i)},y^{(i)})+\lambda_a \mathcal{L}_{smooth}(c_x,\psi^{(i)})$

  1. Similarity loss $\mathcal{L}_{sim}((x+\psi^{(i)})\circ\phi^{(i)},y^{(i)})$ is computed as a mean squared error, $\mathcal{L}_{sim}(\hat{y},y)=||\hat{y}-y||^2$.
  2. Smoothness term $\mathcal{L}_{smooth}(c_x, \psi)=(1-c_x)\nabla\psi$ uses the segmentation labels $l_x$ to compute a binary image of anatomical boundaries $c_x$, so that the term penalizes large intensity changes within an anatomical structure while allowing them across boundaries (a small sketch of this term follows the list).
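
The sketch below illustrates this boundary-aware term in 2D. Since the paper writes it compactly as $(1-c_x)\nabla\psi$, interpreting it as the gradient magnitude of $\psi$ weighted by the inverted boundary mask is an assumption made for illustration.

    import numpy as np

    def appearance_smoothness(psi, c_x):
        # psi: (H, W) per-voxel intensity offset; c_x: (H, W) binary boundary mask
        # (assumption: (1 - c_x) * grad psi is read as a weighted gradient magnitude)
        grad_mag = np.sqrt(sum(g ** 2 for g in np.gradient(psi)))
        # penalize intensity changes only away from anatomical boundaries
        return np.sum((1.0 - c_x) * grad_mag)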

Synthesizing new examples

To generate a new labeled instance, the system samples two unlabeled images and retrieves their previously computed spatial and appearance transformations. The sampled appearance transformation is first applied to the atlas, and the spatial transformation is then applied on top of it so that the voxels structurally match an unlabeled scan. Finally, the atlas labels are mapped onto the new structure by applying the same spatial transformation to them.

Figure 6: Overview of the process for synthesizing new training samples [6].
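
Putting the pieces together, here is a minimal sketch of one synthesis step. It reuses the illustrative warp() helper from the spatial-model sketch above, and assumes that the appearance offset psi_j and the displacement field u_i have already been computed by the two trained networks for two sampled unlabeled scans.

    def synthesize(atlas, atlas_labels, psi_j, u_i):
        # tau_a(x) = x + psi^(j): apply the sampled appearance transformation
        appearance_changed = atlas + psi_j
        # (x + psi^(j)) o phi^(i): warp the result with the sampled deformation
        new_image = warp(appearance_changed, u_i)
        # warp the labels with the same deformation; nearest-neighbour
        # interpolation (order=0) keeps the label values discrete
        new_labels = warp(atlas_labels, u_i, order=0)
        return new_image, new_labels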

Results and Discussion

Experimental Setup

For the experiments, the task of brain MRI segmentation with 30 anatomical labels was chosen. The same dataset of T1-weighted MRI brain scans as in VoxelMorph [5] was used, and the ground-truth segmentation labels were generated with FreeSurfer [14]. The authors selected a state-of-the-art segmentation method [15], whose training dataset is augmented using the proposed technique. They used 101, 50, and 100 brain volumes for the training, validation, and test sets, respectively. For training, all of the methods are allowed to use the labels of only one sample and to leverage the other 100 instances without labels.

Segmentation Baselines

  1. Single-atlas segmentation (SAS) - uses VoxelMorph atlas-based segmentation [5] to warp the atlas labels with the same spatial transformation as in this work.
  2. Data augmentation using single-atlas segmentation (SAS-aug) - uses the SAS segmentations as ground-truth labels for the 100 unlabeled training samples and includes them as training examples for the supervised segmentation model.
  3. Hand-tuned random data augmentation (rand-aug) - synthesizes a new random brain in each training iteration using random smooth deformation fields [13] and a global multiplicative intensity factor [16].
  4. Supervised segmentation - a segmentation network that uses all 101 ground-truth labels in its training dataset; it serves as an upper bound for all other methods.

Method Variants

  1. Independent sampling (ours-indep) - the spatial and appearance transformations are sampled independently, resulting in 10,000 possible new training instances.
  2. Coupled sampling (ours-coupled) - the spatial and appearance transformations are sampled from the same unlabeled image, resulting in only 100 new training instances (see the sketch after this list).
  3. Ours-indep + rand-aug - the ours-indep and rand-aug methods are used in alternation to synthesize new examples.
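
The sketch below contrasts the two sampling strategies, assuming the per-scan transformations are stored in two lists indexed by the unlabeled scan; the function names are illustrative.

    import random

    def sample_indep(spatial, appearance):
        # ours-indep: any of the 100 x 100 combinations may be drawn
        return random.choice(spatial), random.choice(appearance)

    def sample_coupled(spatial, appearance):
        # ours-coupled: both transformations come from the same unlabeled scan
        i = random.randrange(len(spatial))
        return spatial[i], appearance[i]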

Segmentation results

Mean segmentation performance

To evaluate segmentation accuracy, the Dice score, which measures the overlap between two segmentations, was selected. All versions of the proposed method outperformed the other methods both in mean Dice score and in mean pairwise Dice improvement [Figure 7], which is the improvement over the SAS baseline averaged over the test samples.

Figure 7: Segmentation performance in terms of Dice score [6].
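
For reference, a minimal sketch of the Dice score for a single label, computed over boolean masks:

    import numpy as np

    def dice(pred, truth):
        # pred, truth: boolean masks of the same shape
        intersection = np.logical_and(pred, truth).sum()
        return 2.0 * intersection / (pred.sum() + truth.sum())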

The pairwise improvement for the proposed method is also more consistent than for random augmentation [Figure 8] and is higher than that of the other methods for almost every sample in the test dataset [Figure 9].

Figure 8:  Pairwise Dice improvement over the SAS baseline [6].

Figure 9: Pairwise Dice improvement over the SAS baseline for each test sample [6].

Segmentation accuracy for different structures

The authors also obtained interesting results when comparing the segmentation accuracy across different segmented structures [Figure 10].

Figure 10: Segmentation accuracy across different brain structures with the following abbreviations of the labels: white matter (WM), cortex (CX), ventricle (vent), and cerebrospinal fluid (CSF). The volume occupied by each structure is shown in parentheses [6].

While all of the methods were comparably accurate on the large brain structures, the proposed approach outperformed the other approaches on smaller structures, such as the hippocampus [Figure 11].

Figure 11: Hippocampus segmentations for two test instances (rows) [6].

Conclusions

In this paper, a novel data augmentation approach for the one-shot segmentation scenario was introduced. The method requires only a single labeled instance and a set of unlabeled samples, from which it learns to emulate the real variation of medical images. The proposed approach outperformed all other segmentation baselines and produced more accurate segmentation labels overall. Since the approach is very general, the natural next step is to try it on other anatomies and imaging modalities. Another possible improvement is to make the space of transformations continuous by using interpolation or composition of transformations.

Own Review

Overall, I think the paper is well structured and has very nice accompanying tables and figures that really enhance the understanding of this work. Also, the source code for all models has been published, so other people can use it to test the method on different problems. I liked that the approach presented here is quite broad, and one can see many other problems that could benefit from this technique.

However, the experimental setup and the results presented by the authors are, in my opinion, insufficient to claim the superiority of this approach. A big disadvantage of this work is that no manual labels were used for the segmentation; all of the labels were automatically generated by third-party software. Also, all of the baseline methods were either created or modified by the same research group. The authors do not mention any other state-of-the-art augmentation techniques to which they could compare their work.

Another point is that not all of the experimental setup has been explained in detail. For example, the authors only mention the data for training the supervised segmentation model, but they omit whether the transformation models were trained on the same small subsample of the data or on the full dataset. According to my research on VoxelMorph [5], from which the spatial transformation model was taken, it was trained on a far bigger dataset than 100 images. Also, the paper does not discuss the time required for training or inference.

To sum up, I think this work presents a good idea for data augmentation, but it would be important to see how well it works in other experiments first. In particular, it would be interesting to check whether natural-looking images with unhealthy anatomies (e.g. tumors) can be synthesized this way.

References

[1] P. Moeskops, M. A. Viergever, A. M. Mendrik, L. S. de Vries, M. J. Benders, and I. Isgum. Automatic segmentation of mr brain images with a convolutional neural network. IEEE transactions on medical imaging, 35(5):1252– 1261, 2016

[2] P.-A. Ganaye, M. Sdika, and H. Benoit-Cattin. Semisupervised learning for segmentation under semantic constraint. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 595–602. Springer, 2018

[3] A. V. Dalca, J. Guttag, and M. R. Sabuncu. Anatomical priors in convolutional networks for unsupervised biomedical segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9290– 9299, 2018

[4] K. K. Leung, M. J. Clarkson, J. W. Bartlett, S. Clegg, C. R. Jack Jr, M. W. Weiner, N. C. Fox, S. Ourselin, A. D. N. Initiative, et al. Robust atrophy rate measurement in alzheimer’s disease using multi-site serial mri: tissue-specific intensity normalization and parameter selection. Neuroimage, 50(2):516–523, 2010

[5] G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca. Voxelmorph: a learning framework for deformable medical image registration. IEEE transactions on medical imaging, 2019.

[6] Zhao, A., Balakrishnan, G., Durand, F., Guttag, J.V., Dalca, A.V., 2019. Data augmentation using learned transformations for one-shot medical image segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8543–8553

[7] G. Vincent, G. Guillard, and M. Bowes. Fully automatic segmentation of the prostate using active appearance models. MICCAI Grand Challenge: Prostate MR Image Segmentation, 2012.

[8] B. B. Avants, C. L. Epstein, M. Grossman, and J. C. Gee, “Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain,” Medical image analysis, vol. 12, no. 1, pp. 26–41, 2008

[9] Z. Akkus, A. Galimzianova, A. Hoogi, D. L. Rubin, and B. J. Erickson. Deep learning for brain mri segmentation: state of the art and future directions. Journal of digital imaging, 30(4):449–459, 2017.

[10] S. Hauberg, O. Freifeld, A. B. L. Larsen, J. Fisher, and L. Hansen. Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In Artificial Intelligence and Statistics, pages 342–350, 2016

[11] https://www.rsipvision.com/CVPR2019-Thursday/4/

[12] https://www.kaggle.com/adalca/learn2reg

[13] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

[14] B. Fischl. Freesurfer. Neuroimage, 62(2):774–781, 2012

[15] A. G. Roy, S. Conjeti, D. Sheet, A. Katouzian, N. Navab, and C. Wachinger. Error corrective boosting for learning fully convolutional networks with limited data. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 231–239. Springer, 2017

[16] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker. Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation. Medical image analysis, 36:61– 78, 2017
