Abstract: This blog post is about deep learning-based medical image registration. It first gives a brief introduction to this academic field. It then surveys related work from recent years, grouping medical image registration methods into four categories: deep similarity metric-based iterative registration, reinforcement learning-based registration, supervised learning-based registration, and unsupervised learning-based registration. Next, the blog post goes deeper into three selected papers, introducing their motivations and methodologies and discussing their results. Finally, it closes with a brief review of these papers and of the area.
Blog post author: Unknown user (ge26hab)
1. Introduction
Medical image registration is a challenging area within medical image analysis. It aims at finding an optimal spatial transform between images so that they are aligned in a common coordinate system, which is crucial for clinical applications such as image fusion. The images to be registered can be multi-modal and multi-dimensional, and the registration can be either rigid or deformable. Traditionally, medical image registration has been performed manually by clinicians. With the growth of the deep learning (DL) community in recent years, however, many deep learning methods have been proposed to tackle the image registration task. This blog post serves as an introduction to this area.
Figure 1. Image Registration Example
2. Related Works
Deep learning-based registration approaches can be roughly divided into four categories: deep learning similarity metric-based iterative registration, reinforcement learning-based registration, supervised learning-based registration, and unsupervised learning-based registration.
2.1 Deep Learning Similarity Metric-Based Iterative Registration
This category includes methods that use deep learning to learn a similarity metric for registration, as shown in Figure 2. Before DL became widely used, various hand-crafted metrics had already been designed, including the sum of squared differences (SSD), cross-correlation (CC), mutual information (MI), normalized cross-correlation (NCC), and normalized mutual information (NMI). These metrics are manually crafted and static, whereas metrics learned by deep networks adapt to the data at hand.
Figure 2. Typical Architecture of Deep Similarity-Based Iterative Registration
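To make these hand-crafted metrics concrete, the following is a minimal NumPy sketch of SSD and NCC (the other metrics follow similar patterns); the function names and test data are illustrative, not taken from any cited paper.

```python
import numpy as np

def ssd(fixed: np.ndarray, moving: np.ndarray) -> float:
    """Sum of squared differences: lower means more similar."""
    return float(np.sum((fixed - moving) ** 2))

def ncc(fixed: np.ndarray, moving: np.ndarray) -> float:
    """Normalized cross-correlation in [-1, 1]: higher means more similar."""
    f = fixed - fixed.mean()
    m = moving - moving.mean()
    return float((f * m).sum() / (np.linalg.norm(f) * np.linalg.norm(m) + 1e-8))

# Compare a fixed image against a slightly perturbed moving image.
fixed = np.random.rand(64, 64)
moving = fixed + 0.05 * np.random.randn(64, 64)
print(ssd(fixed, moving), ncc(fixed, moving))
```

An iterative registration loop evaluates such a metric (hand-crafted or learned) at every optimization step, which is part of why this family of methods is slow.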
Wu et al. [2, 3] used deep learning to learn a similarity metric for 3D brain MR volumes with a convolutional stacked autoencoder, as shown in Figure 3. Blendowski et al. [4] combined CNN-based and MRF-based metrics for lung CT registration. The results showed that for uni-modal cases, DL-based metrics could hardly outperform manually crafted ones; nevertheless, DL can still serve as a complementary source of feature information.
Figure 3. Convolutional Stacked Autoencoder Architecture
As for multi-modal cases [5, 6, 7], deep learning has greater advantages. Simonovsky et al. [5] used a CNN to handle aligned 3D T1- and T2-weighted brain MR volumes, outperforming traditional MI metrics. Sedghi et al. [6] used a 5-layer neural network to learn a metric for 3D US/MR rigid registration, which also beat MI-based registration; its framework is shown in Figure 4.
Figure 4. Deep Metric Registration (DMR)
Recent works have shown that DL methods are able to learn a reasonable similarity metric for multi-modal medical image registration. Their performance in uni-modal cases, however, remains unsatisfactory, so this category of methods is mostly used as a complementary tool for uni-modal registration. Another problem is that these algorithms can hardly achieve real-time registration, since they still rely on an iterative optimizer.
2.2 Reinforcement Learning-Based Registration
The classical architecture of this category is shown in Figure 5. An agent is trained to perform the registration instead of a pre-defined optimizer. Usually this paradigm is applied to rigid registration, but it can also be used for deformable registration.
Figure 5. Typical Architecture of Reinforcement Learning-Based Registration
Liao et al. [8] used reinforcement learning for the rigid registration of cardiac and abdominal 3D CT and cone-beam CT images. They proposed a greedy supervised approach for end-to-end training and used an attention-driven hierarchical strategy, which outperformed MI-based registration. Miao et al. [9] used a multi-agent framework instead of a single agent for the rigid registration of X-ray and CT images of the spine, as shown in Figure 6. Their approach performs better than previous state-of-the-art similarity metrics [10].
Figure 6. Dilated-FCN Framework for Multi-Agent Registration
2.3 Supervised Learning-Based Registration
This category covers supervised learning-based medical image registration methods, which can be divided into fully-supervised and dual/weakly-supervised approaches. These methods use supervised learning to estimate the optimal spatial transform directly, which greatly improves the speed of the registration pipeline. The typical framework is shown in Figure 7.
Figure 7. Typical Architecture of Supervised Learning-Based Registration
2.3.1 Fully-Supervised Methods
Miao et al. [11, 12] used a CNN to predict the spatial transform for the 2D/3D rigid registration of X-ray attenuation maps and 2D X-ray images, which outperformed most traditional MI- and CC-based approaches in both accuracy and efficiency. Chee et al. [13] used a CNN to obtain the transform parameters for the rigid registration of 3D brain MR volumes. For deformable registration [14, 16, 17], Yang et al. [14] used an FCN, whose architecture is similar to U-Net [15], to learn the deformable registration of 2D/3D brain MR volumes. Cao et al. [16] used a CNN to map an input pair of 3D brain MRI volumes to the corresponding displacement vectors, which were then used to construct the deformation field of the registration; the architecture is shown in Figure 8.
Figure 8. Similarity-Steered CNN Regression Architecture
Some methods use both real and synthesized datasets. Uzunova et al. [18] used statistical appearance models (SAMs) to generate ground truth data and a CNN to estimate the deformation field for 2D brain MRI and 2D cardiac MRI registration. They trained FlowNet [19] on ground truth data generated by SAMs, achieving better performance than models trained with randomly generated ground truth data; the approach is shown in Figure 9. Ito et al. [20] used a CNN to learn plausible deformations for ground truth generation instead of relying on random transformations or manually crafted approaches. The paper experimented on the 3D brain MR volumes of the ADNI dataset and showed better performance than the MI-based approach of [21].
Figure 9. Model-Based Data Augmentation Approach
Fully-supervised methods achieve real-time registration, but they also have drawbacks. The performance of the models largely depends on the quality of the ground truth data and, in turn, on the clinicians who label it. Such ground truth data is difficult and expensive to acquire, and the resulting datasets are usually not large enough for the training of fully-supervised methods.
2.3.2 Dual/Weakly-Supervised Methods
In dual-supervised methods, the model is trained using a combination of ground truth data and metrics that measure image similarity, as shown in Figure 10.
Figure 10. Typical Architecture of Dual-Supervised Methods
Fan et al. [22] used hierarchical dual-supervised learning to predict the deformation for 3D brain MR registration. The paper applied a "gap-filling" technique and coarse-to-fine guidance to the original U-Net, which produced better results than the plain U-Net and thereby showed that the dual-supervision strategy is helpful for medical image registration. The approach is shown in Figure 11.
Figure 11. The Method Proposed by Fan et al.
Yan et al. [23] built a neural network based on GAN [24] to tackle the rigid registration of 3D MR and TRUS volumes. The generator is trained to estimate the rigid transformation, whereas the discriminator learns to distinguish between images aligned with the real spatial transform and images aligned with the predicted one. A Euclidean distance term and an adversarial loss together form the loss function. The typical architecture is shown in Figure 12.
Figure 12. Typical Architecture of GAN-Inspired Methods
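To illustrate the combination of losses described above, here is a hedged PyTorch sketch of a generator loss with a Euclidean term on the transform parameters plus an adversarial term; the discriminator output convention and the weight lambda_adv are assumptions for illustration, not the exact formulation of [23].

```python
import torch

def generator_loss(pred_params, gt_params, d_out_pred, lambda_adv=0.1):
    """pred_params, gt_params: (B, 6) rigid transform parameters.
    d_out_pred: assumed discriminator probability that the alignment produced
    by the predicted transform is a real (ground-truth) alignment."""
    euclidean = torch.norm(pred_params - gt_params, dim=1).mean()  # distance term
    adversarial = -torch.log(d_out_pred + 1e-8).mean()             # fool the discriminator
    return euclidean + lambda_adv * adversarial
```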
Weakly-supervised methods refer to approaches that use the segmentation of corresponding anatomical structures to construct the loss function, typically as shown in Figure 13.
Figure 13. Typical Architecture of Weakly-Supervised Methods
Regarding weakly-supervised methods [25, 26, 27], Hu et al. [25, 26] used label similarity to train a network for MR-TRUS registration. They proposed two networks, a global-net and a local-net, to estimate the 12-degree-of-freedom global affine transformation and the local dense deformation field, respectively. In subsequent work [26], the authors merged the two into a single end-to-end network, which outperformed NMI-based registration. The architecture proposed by [26] is shown in Figure 14.
Figure 14. Architecture combining Local-Net and Global-Net
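A common instantiation of such a label-similarity loss is a soft Dice loss between the warped moving labels and the fixed labels. Below is a minimal PyTorch sketch under that assumption; it is not necessarily the exact loss used in [25, 26].

```python
import torch

def soft_dice_loss(warped_label: torch.Tensor, fixed_label: torch.Tensor,
                   eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice between a warped moving label map and the fixed label map.
    Both tensors have shape (B, C, D, H, W) with values in [0, 1]."""
    dims = (2, 3, 4)
    intersection = (warped_label * fixed_label).sum(dims)
    union = warped_label.sum(dims) + fixed_label.sum(dims)
    dice = (2 * intersection + eps) / (union + eps)
    return 1 - dice.mean()  # minimize 1 - Dice to maximize label overlap
```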
Dual/weakly-supervised methods alleviate the constraints that limit fully-supervised methods, and they also work for multi-modal cases. As a result, more and more research is being conducted on this paradigm.
2.4 Unsupervised Learning-Based Registration
Although the above-mentioned categories of methods are quite successful, the difficulty of acquiring high-quality labeled data remains. This has motivated the development of unsupervised learning-based approaches, as shown in Figure 15.
Figure 15. Typical Architecture of Unsupervised Learning-Based Registration
Li et al. [28, 29] used a self-supervised FCN for the deformable registration of 3D brain MR volumes, with a loss function built from NCC and several regularizers. Balakrishnan et al. [30, 31] designed a framework for unsupervised learning-based registration based on hand-crafted similarity metrics: the now-classical VoxelMorph model. Kuang et al. [32] built a framework using a CNN and a spatial transformer network (STN) [33] for the deformable registration of T1-weighted brain MR volumes, with a loss function consisting of NCC and a regularizer. Their experiments showed that the architecture outperforms VoxelMorph. The framework is shown in Figure 16.
Figure 16. FAIM Architecture
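The essence of this paradigm can be sketched in a few lines: warp the moving image with the predicted displacement field via an STN-style grid sampler, then combine a similarity term (here a global NCC for brevity; VoxelMorph uses a local windowed variant) with a smoothness regularizer on the field. The tensor shapes and the weight lam are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def warp(moving, flow, base_grid):
    """moving: (B, 1, D, H, W); flow: (B, 3, D, H, W) displacements in
    normalized coordinates; base_grid: (B, D, H, W, 3) identity grid."""
    grid = base_grid + flow.permute(0, 2, 3, 4, 1)
    return F.grid_sample(moving, grid, align_corners=True)

def unsupervised_loss(fixed, warped, flow, lam=0.01):
    f = fixed - fixed.mean()
    w = warped - warped.mean()
    ncc = (f * w).sum() / (f.norm() * w.norm() + 1e-8)   # similarity term
    # Smoothness regularizer: penalize spatial gradients of the field.
    grad = (flow[..., 1:, :, :] - flow[..., :-1, :, :]).abs().mean() \
         + (flow[..., :, 1:, :] - flow[..., :, :-1, :]).abs().mean() \
         + (flow[..., :, :, 1:] - flow[..., :, :, :-1]).abs().mean()
    return -ncc + lam * grad   # maximize similarity, keep the field smooth
```

No label enters this loss: the supervision signal comes entirely from the images themselves, which is what makes the paradigm unsupervised.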
Unsupervised learning-based registration makes it possible to tackle the registration problem with deep learning frameworks without any labeled data, which is especially meaningful for medical applications since labeled medical data is very difficult and expensive to obtain. Unsupervised methods that learn feature representations to obtain the optimal transformation are drawing more and more attention from the research community.
3. Selected Papers
The following chapters focus on three selected papers [34, 35, 36] and examine them in depth. For each, the blog post first gives a brief introduction to the motivation and the problem the paper tries to solve. The key methodologies are then covered, including the network architecture, loss function, and dataset. Finally, the experimental results and conclusions are discussed.
3.1 Orientation Estimation of Abdominal Ultrasound Images with Multi-Hypotheses Networks
3.1.1 Problem Statement and Motivation
Multi-modal registration involving ultrasound is very challenging for several reasons. First, ultrasound images are noisy and contain various artifacts. Second, the tissue can deform during image acquisition, because a certain pressure must be applied to the tissue to obtain good images. Most importantly, a single frame of an ultrasound recording shows only a restricted field of view that is not sufficient to give a complete picture of the whole tissue. As a result, algorithms for US registration need to be initialized within a close range of the ground truth.
This paper proposes a method that regresses the global orientation of 2D US images to improve a subsequent multi-modal intensity-based registration algorithm. The method also provides a confidence estimate for each prediction.
3.1.2 Methodology
The paper uses unit quaternions instead of rotation matrices to represent the orientation of the ultrasound probe, since quaternions need no additional constraints apart from normalization. The angular distance, i.e. the rotation angle required to get from one quaternion to the other, is given by
d(q_1, q_2) = \cos^{-1}\left(2\langle q_1, q_2\rangle^2 - 1\right)
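This distance is straightforward to implement; the following NumPy sketch is a direct transcription of the formula (the clipping only guards against numerical drift outside [-1, 1]).

```python
import numpy as np

def quat_angular_distance(q1: np.ndarray, q2: np.ndarray) -> float:
    """Rotation angle (radians) needed to get from q1 to q2 (unit quaternions)."""
    dot = float(np.dot(q1, q2))
    return float(np.arccos(np.clip(2 * dot**2 - 1, -1.0, 1.0)))

q1 = np.array([1.0, 0.0, 0.0, 0.0])                 # identity rotation
q2 = np.array([np.cos(0.25), np.sin(0.25), 0, 0])   # 0.5 rad about the x-axis
print(quat_angular_distance(q1, q2))                # ~0.5
```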
Architecture: The paper uses ResNet-18 [37] as the backbone because it is a powerful and mature architecture for computer vision tasks. The authors apply the multi-hypotheses network idea [38] to ResNet-18, replacing its last layer with several parallel fully-connected layers, each of which outputs four real numbers representing one quaternion. The architecture is shown in Figure 17.
Figure 17. CNN-Based Multi-Hypotheses Network
Loss Function: The following meta-loss is used. With it, the network learns to spread out its predictions for ambiguous images and to concentrate its predictions for easy-to-predict images.
L(f(I), q) = \left(1 - \varepsilon\frac{M}{M-1}\right)\min_{i=1,\dots,M} d(f_i(I), q) + \frac{\varepsilon}{M-1}\sum_{i=1}^{M} d(f_i(I), q)
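A hedged PyTorch sketch of this meta-loss follows; `dists` is assumed to hold the angular distances d(f_i(I), q) for all M hypotheses of a batch.

```python
import torch

def multi_hypothesis_meta_loss(dists: torch.Tensor, eps: float = 0.05):
    """dists: (B, M) angular distances between each of the M hypotheses
    and the ground-truth quaternion."""
    M = dists.shape[1]
    best = dists.min(dim=1).values            # winner-takes-all term
    spread = dists.sum(dim=1) / (M - 1)       # keeps the other hypotheses alive
    return ((1 - eps * M / (M - 1)) * best + eps * spread).mean()
```

For eps = 0 this reduces to pure winner-takes-all training; small positive values distribute a little gradient to the non-winning hypotheses, which is what lets the network spread its predictions on ambiguous images.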
Dataset: Three datasets are used. Dataset A is synthetic and serves to initially validate the proposed approach; it is selected from the LITS dataset [39] and modified with random orientations and offsets. Dataset B was acquired with a research ultrasound system (Cephasonics Cicada) on volunteers. Dataset C is a realistic medical imaging dataset acquired from 16 patients before surgery with a different ultrasound system. It is the most complicated of the three: it contains abdominal structures, its image quality is not optimal due to patient morphology and other constraints, and it is the smallest.
Because of the small size of the real datasets, the paper uses 4-fold cross-validation. As data augmentation, every sweep is mirrored along the horizontal axis.
The paper aims at registering all US frames to a pre-operative volume. When a 3D tracked acquisition is available, the relative 3D positions and orientations among the frames are known. Every US frame is fed into the network, and the entire US sweep can then be aligned using the predicted orientations together with the known relative orientations among the frames. After obtaining the rotation, two trained U-Nets (one for ultrasound, one for the pre-operative modality) segment the relevant anatomical structures visible in both images, from which the optimal translation is obtained. In this way, a single rigid transform matrix can be computed and used as an initialization for the registration; a sketch of this composition step follows below.
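As a sketch of that final step (with illustrative names, not the paper's code), the predicted unit quaternion and the recovered translation can be assembled into a 4x4 rigid matrix as follows:

```python
import numpy as np

def quat_to_rotmat(q: np.ndarray) -> np.ndarray:
    """Standard conversion of a unit quaternion (w, x, y, z) to a 3x3 rotation."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def rigid_matrix(q: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Compose rotation and translation into one homogeneous transform."""
    T = np.eye(4)
    T[:3, :3] = quat_to_rotmat(q)
    T[:3, 3] = t
    return T   # used to initialize the subsequent registration
```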
3.1.3 Result and Conclusion
The paper compares four kinds of models: plain ResNet-18, Monte Carlo dropout, an ensemble, and the ResNet-18-based multi-hypotheses network. For orientation estimation, the global errors are considerably higher than the most-confident-frame errors, which indicates that the confidence estimation approaches are successful. In general, the ResNet-18-based multi-hypotheses network and the ensemble work better than MC dropout and plain ResNet-18. The correlation between uncertainty and prediction error is much lower for MC dropout than for the other models. Considering training and test time, the ResNet-18-based multi-hypotheses network achieves the best trade-off.
Table 1. The Experiment Results
For uncertainty estimation, the paper finds that frames with high uncertainty have much higher errors, as Figure 18 (b) also shows. This indicates that when facing different datasets and distributions, the model struggles to estimate the orientation well, and its uncertainty output reflects this.
Figure 18. (a) Average Errors of Dataset B with Uncertainty (b) Uncertainty and Angular Error
The algorithm is applied within a complete registration pipeline in the ImFusion Suite software, which registers US and MR volumes with a deformation model [40]. The approach of [40] needs a proper initialization to converge. When the proposed framework is used to initialize the registration, the results show that the orientation estimation significantly reduces the registration error.
3.2 Cross-Modal Attention for MRI and Ultrasound Volume Registration
3.2.1 Problem Statement and Motivation
Multi-modal medical image registration lays the foundation for image-guided interventional procedures. However, as discussed above, it is much more challenging than uni-modal registration. Traditionally, MI is used as the metric guiding multi-modal registration, but it performs much worse on images with complex textures, such as in US-to-MRI registration. In recent years, deep learning-based methods have become widely used across computer vision and have proved to be a promising choice for multi-modal medical image registration, since a suitable deep learning framework can learn feature representations that help find the optimal spatial transformation between images.
Instead of relying on a pure CNN architecture, this paper proposes a novel cross-modal attention mechanism that exploits spatial correspondence to improve the performance of the framework. The general framework is shown in Figure 19.
Figure 19. The General Framework
3.2.2 Methodology
The cross-modal attention mechanism is based on the non-local attention mechanism [41]; its structure is shown in Figure 20.
Figure 20. The Cross-Modal Attention Mechanism
This cross-modal block is designed for multi-modal registration tasks: it captures local features and their global correspondences. Inserting the block into the deep learning framework yields better multi-modal registration performance.
Architecture: The TRUS volume is treated as the moving image and the MRI volume as the fixed image. The entire architecture proposed in this paper is shown in Figure 21. In the feature extraction part, CNN-based networks extract local features from the volumes. The cross-modal attention block then captures the local features and their global correspondences. Its output is fed into the deep registration network, which fuses information between the modalities to obtain the transform parameters. The registration network is quite lightweight, consisting of only three convolutional layers.
Figure 21. The Entire Architecture
The two input feature maps of the cross-modal attention block are denoted as the primary input P and the cross-modal input C, with P \in \mathbb{R}^{LWH \times 32} and C \in \mathbb{R}^{LWH \times 32}, where LWH denotes the size of each flattened 3D feature channel. The cross-modal feature attention is computed as follows:
y_i = \frac{\sum_{\forall j} f(\theta(c_i)^{T}\phi(p_j))\, g(p_j)}{\sum_{\forall j} f(\theta(c_i)^{T}\phi(p_j))}
where c_i and p_j are features from C and P at locations i and j, \theta(\cdot), \phi(\cdot), and g(\cdot) are linear embeddings, and f(\cdot) is \exp(\cdot), which outputs a scalar representing the correlation between the features. The output y_i is a normalized sum of the features at all locations of P, weighted by their correlations with C at location i. These y_i form a matrix Y that contains the spatial correspondence information between P and every position of C. The final output Z is produced by adding Y and P; as a result, the entry of Z at location k stores the spatial correlation between the primary and cross-modal input feature maps at location k.
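A hedged PyTorch sketch of this block is given below: theta, phi, and g are implemented as 1x1x1 convolutions (linear embeddings), f = exp corresponds to a softmax over all positions j of P, and the residual addition Z = Y + P matches the description above. The channel size of 32 comes from the text; everything else is an assumption.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.theta = nn.Conv3d(channels, channels, 1)  # embeds cross-modal input C
        self.phi   = nn.Conv3d(channels, channels, 1)  # embeds primary input P
        self.g     = nn.Conv3d(channels, channels, 1)  # value embedding of P

    def forward(self, P: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
        """P, C: (B, 32, L, W, H) feature maps from the two modalities."""
        B, ch, L, W, H = P.shape
        q = self.theta(C).flatten(2).transpose(1, 2)   # (B, LWH, ch)
        k = self.phi(P).flatten(2)                     # (B, ch, LWH)
        v = self.g(P).flatten(2).transpose(1, 2)       # (B, LWH, ch)
        attn = torch.softmax(q @ k, dim=-1)            # f = exp, normalized over j
        Y = (attn @ v).transpose(1, 2).reshape(B, ch, L, W, H)
        return Y + P                                   # Z = Y + P
```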
The input MRI volumes can also be replaced by their corresponding segmentation labels [42, 43, 44], which reduces the negative effect of low-quality images. The two resulting networks are named Attention-Reg (image) and Attention-Reg (label).
Loss Function: The output of the framework is the predicted rotation and translation, parameterized by six values. The mean squared error (MSE) with respect to the ground truth parameters is used as the loss function.
Dataset: The paper uses 528 MRI-TRUS volume pairs for training, 66 for validation, and 68 for testing. Each pair contains a T2-weighted MRI volume and a 3D US volume. Every MRI volume has a size of 512\times512\times26 with 0.3 mm resolution, while the US volumes were obtained from electromagnetically tracked freehand 2D sweeps. Registration performance is measured by the surface registration error (SRE).
3.2.3 Result and Conclusion
Table 2 compares the proposed approach with the traditional iterative methods mutual information [45] and SSD MIND [46], with the initial SRE set to 8 mm and 16 mm respectively. The approach proposed in this paper outperforms the traditional methods, and the network that uses segmentation labels as MRI input achieves slightly better results.
Table 2. Performance of Attention-Reg and Traditional Methods
Table 3 compares Attention-Reg with other end-to-end deep learning-based methods for rigid registration, MSReg [47] and DVNet [48]. Since MSReg is a two-stage network, the paper trains Attention-Reg twice on training sets with different distributions: the first-stage model is trained and tested on a generated dataset with initial SRE uniformly distributed within [0, 20 mm], the second-stage dataset uses a uniform distribution within [0, 8 mm], and the two networks are concatenated into a two-stage pipeline. As shown in Table 3, Attention-Reg outperforms MSReg while requiring far fewer parameters and far less runtime, which demonstrates its capability to efficiently extract and relate the global correspondences of local features from multi-modal images. The paper also trains the network without the attention block, named Feature-Reg; the comparison shows that the cross-modal attention block significantly improves performance.
Table 3. Performance of Attention-Reg and Other DL Methods
The author utilizes Grad-CAM [49] to visualize the final outputs of the two cross-modal attention blocks, shown in Figure 22. There are four pairs of feature maps: the top two come from Attention-Reg (image) and the bottom two from Attention-Reg (label).
Figure 22. The Visualization
3.3 Global Multi-Modal 2D/3D Registration via Local Descriptors Learning
3.3.1 Problem Statement and Motivation
Multi-modal medical image registration is very important for many medical applications, such as procedure planning and the subsequent interventions for diagnosis and treatment. It usually needs a proper initialization, for which clinicians select landmarks in the images and complete the registration manually. This process requires clinical expertise and a certain amount of time.
This paper focuses mainly on the registration between US and MR images. US images can be noisy and exhibit various imperfections, and the imaged tissue deforms during acquisition. To address these problems and improve multi-modal registration, the authors propose a novel approach to extract and match keypoints between US and MR images. The approach can furthermore be applied to multi-modal registration with cross-dimensional inputs.
3.3.2 Methodology
The registration approach proposed in this paper is shown in Figure 23; its basis is the detection and matching of local features across modalities. The framework uses the pose as the only supervision for end-to-end training, and RANSAC [50] is applied after the matches are obtained.
Figure 23. The End-to-End Architecture
For feature extraction, the paper develops an adapted version of LoFTR [51]. To deal with the multi-modal data, the authors jointly train two networks to produce cross-modality descriptors. Since the ground truth may not be sufficiently accurate due to imperfections, the paper deploys a detector-free architecture that distributes the keypoints uniformly on a grid at 1/8 resolution.
Architecture: The detector-free feature extraction networks are similar to a U-Net without the last upsampling layers: the network for US images resembles a 2D U-Net, while the network for MR volumes resembles a 3D U-Net. The networks use leaky ReLUs [52] as activation functions and instance normalization [53]. The output is a uniform grid of 32-dimensional descriptors at 1/8 of the input resolution, denoted f^{US} and f^{MR} respectively, with indices i \in \Omega_{US} and j \in \Omega_{MR} identifying positions on the grids.
To train the deep learning framework, the involved operations must be differentiable. The similarity between descriptors is therefore defined as below; the similarities are stored in a matrix S of dimension (|\Omega_{US}|+1) \times (|\Omega_{MR}|+1), whose extra row and column act as a sink.
S_{ij} = \begin{cases} \langle f_i^{US}, f_j^{MR} \rangle & \text{if } i \le |\Omega_{US}| \text{ and } j \le |\Omega_{MR}| \\ \alpha & \text{otherwise} \end{cases}
The last row and column of S are filled with a learned value \alpha that serves as a sink for keypoints without a match. The dual-softmax operator [54] then transforms S into a soft assignment matrix A:
A_{i,j} = \mathrm{Softmax}_j(S_{i,\cdot}) \cdot \mathrm{Softmax}_i(S_{\cdot,j})
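The following PyTorch sketch shows the construction of S with the learned sink value and the dual-softmax; the descriptor grids are assumed to be flattened into (N, 32) matrices, and alpha would be a learnable parameter in the real network.

```python
import torch

def soft_assignment(f_us: torch.Tensor, f_mr: torch.Tensor,
                    alpha: torch.Tensor) -> torch.Tensor:
    """f_us: (N_us, 32), f_mr: (N_mr, 32); alpha: scalar tensor (learned sink)."""
    S = f_us @ f_mr.T                                        # inner-product similarities
    S = torch.cat([S, alpha.expand(S.shape[0], 1)], dim=1)   # sink column
    S = torch.cat([S, alpha.expand(1, S.shape[1])], dim=0)   # sink row
    # Dual-softmax: normalize over rows and over columns, multiply elementwise.
    return torch.softmax(S, dim=1) * torch.softmax(S, dim=0)
```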
Loss Function: For every cell i of the US grid, the authors compute the position of its center p_i and apply the ground truth deformable registration to obtain the corresponding position q_i on the MR grid, from which the corresponding MR cell m(i) is determined. To deal with the inaccuracy of the ground truth, a softened loss function is proposed that does not over-penalize matches that are slightly misaligned with the ground truth:
L = -\frac{\sum_{i,j} w(i,j)\,\log(A_{i,j})}{\sum_{i,j} w(i,j)}
w(i,j) = \begin{cases} \exp(-\beta\,\|j - m(i)\|) & \text{if } i \le |\Omega_{US}| \text{ and } j \le |\Omega_{MR}| \\ 1 & \text{otherwise} \end{cases}
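A hedged sketch of this loss under the same conventions as above (A includes the sink row and column, m maps each US cell to its ground-truth MR cell, and mr_coords holds the grid coordinate of every MR cell):

```python
import torch

def matching_loss(A, m, mr_coords, beta=1.0, eps=1e-8):
    """A: (N_us+1, N_mr+1) soft assignment; m: (N_us,) LongTensor of
    ground-truth matches; mr_coords: (N_mr, 3) grid coordinates."""
    n_us, n_mr = A.shape[0] - 1, A.shape[1] - 1
    # w(i, j) = exp(-beta * ||j - m(i)||) for real cells, 1 for the sink.
    dist = torch.cdist(mr_coords[m], mr_coords)        # (N_us, N_mr)
    w = torch.ones_like(A)
    w[:n_us, :n_mr] = torch.exp(-beta * dist)
    return -(w * torch.log(A + eps)).sum() / w.sum()
```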
Dataset: The dataset contains T1-weighted MR images and pre-surgery US sweeps of 16 patients. The paper uses four-fold cross-validation to make the best use of the data, and both MR and US images are resampled to a uniform spacing. Because there are many more transversal sweeps than intercostal sweeps, the intercostal sweeps are sampled more frequently to balance the dataset. For data augmentation, the paper applies Gaussian noise and random cropping. Notably, the authors observe that the networks are quite sensitive to the US field of view; to counter this, they mask the outer part of the US images with a random convex polygon (PolyCrop), which forces the networks to learn feature representations from the image content rather than from the US frame geometry.
3.3.3 Result and Conclusion
The experiment uses the proposed approach as the initialization of the registration algorithm in [55], with the approach of [56] as the baseline. Table 4 shows the comparison. The method is evaluated using both a single US frame and an entire sweep, and it outperforms the baseline in both setups; its superiority is confirmed by statistical analysis.
Table 4. The Comparison Results
The visualized results are shown in Figure 24, including the mean, standard deviation, and median of the pose error for rotation and translation. Statistical significance is also depicted by the width of each shape; black rectangles denote boxplots and white dots denote medians.
Figure 24. The Visualization of Results
The authors further run an ablation on the PolyCrop technique and the network size; the results are shown in Table 5. PolyCrop data augmentation does improve registration performance. The smaller network achieves better results on all metrics except the median error, which is likely because it overfits less; its weaker capacity for learning feature representations, however, makes the median error slightly higher.
Table 5. Comparison on PolyCrop and Different Network Sizes
4. Review
• First Paper Strength
The first paper's most impressive contribution is that it uses a fairly simple architecture, a ResNet-18-based multi-hypotheses network, to properly initialize the registration and thereby improve its performance. The use of quaternions to simplify the network output is also elegant. In addition, the architecture outputs an uncertainty estimate for each prediction, which the experiments show to be meaningful. The framework generally performs better than the alternatives, such as the ensemble and MC dropout, and achieves the best trade-off at a lower training and testing cost.
• First Paper Weakness
A US sweep consists of many frames, much like a video, so a network could exploit the temporal information between frames; the architecture proposed in this paper does not. Moreover, since several representations for spatial transformations exist, experiments comparing them could determine which representation yields the highest accuracy.
• Second Paper Strength
The second paper proposes a novel design for multi-modal medical image registration. The cross-modal attention block originates from the attention mechanism and outperforms a CNN ten times its size. The authors also try replacing the input MRI volumes with their segmentation masks and experiment with both model variants as well as several classical methods, giving a thorough comparison of the various approaches. The Grad-CAM visualization enhances the interpretability of the approach, showing that the network does learn the spatial correspondences of the features.
• Second Paper Weakness
More data augmentation techniques could be applied, since the training set is not ideally large, and k-fold cross-validation could also be considered. The loss function is an MSE over the six degrees of freedom between prediction and ground truth; improving the spatial transformation representation and the loss function might further boost performance.
• Third Paper Strength
In the third paper, the authors propose a novel architecture for multi-modal medical image registration that behaves more like a traditional 2D computer vision pipeline, following the keypoint detection-and-matching paradigm. The approach is fully automatic and handles multi-modal and multi-dimensional input data. The experiments show that even when registering a single US frame, approximately 50% of the results lie within an acceptable error range, which can bring significant improvements to medical procedures. Given the method's relatively high accuracy and low computational cost, this blog post believes that image-guided surgery systems can be further improved by it.
• Third Paper Weakness
As in the first paper, the temporal information of the US sweeps is not used. The description of the data augmentation is also brief, so there may be room for further improvement there. Since the dataset is small, a dual-supervision paradigm might be considered: introducing another source of information, such as segmentation masks, into the loss function could improve performance.
• Comparison
All three papers tackle medical image registration in a fully-supervised manner. The second and third papers specifically address multi-modal registration, the main focus of the current registration research community. The architectures in the first and third papers are mainly CNN-based, whereas the second builds on the attention mechanism; recent works have also begun to use vision transformers. The limited size of training data is a common problem when applying deep learning to medical applications, and data augmentation is widely used to mitigate it, as in all three papers. The three frameworks stem from different ideas and pursue different goals: the first paper proposes a simple network to estimate the orientation and its uncertainty and uses it to initialize a subsequent registration pipeline; the second builds an end-to-end architecture that directly outputs the spatial transformation parameters of a multi-modal registration; the third designs a network that learns to extract and match features in order to register the volumes.
References
[1] Grant Haskins, Uwe Kruger, and Pingkun Yan. Deep learning in medical image registration: a survey. Machine Vision and Applications, 31(1):1–18, 2020.
[2] Wu, G., Kim, M., Wang, Q., Gao, Y., Liao, S., and Shen, D. (2013). Unsupervised deep feature learning for deformable registration of mr brain images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 649–656. Springer.
[3] Wu, G., Kim, M., Wang, Q., Munsell, B. C., and Shen, D. (2016). Scalable high-performance image registration framework by unsupervised deep feature representations learning. IEEE Transactions on Biomedical Engineering, 63(7):1505–1516.
[4] Blendowski, M. and Heinrich, M. P. (2018). Combining mrf-based deformable registration and deep binary 3d-cnn descriptors for large lung motion estimation in copd patients. International journal of computer assisted radiology and surgery, pages 1–10.
[5] Simonovsky, M., Gutiérrez-Becker, B., Mateus, D., Navab, N., and Komodakis, N. (2016). A deep metric for multimodal registration. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 10–18. Springer.
[6] Sedghi, A., Luo, J., Mehrtash, A., Pieper, S., Tempany, C. M., Kapur, T., Mousavi, P., and Wells III, W. M. (2018). Semi-supervised deep metrics for image registration. arXiv preprint arXiv:1804.01565.
[7] Haskins, G., Kruecker, J., Kruger, U., Xu, S., Pinto, P. A., Wood, B. J., and Yan, P. (2019). Learning deep similarity metric for 3d mr-trus image registration. International Journal of Computer Assisted Radiology and Surgery, 14:417–425.
[8] Liao, R., Miao, S., de Tournemire, P., Grbic, S., Kamen, A., Mansi, T., and Comaniciu, D. (2017). An artificial agent for robust image registration. In AAAI, pages 4168–4175.
[9] Miao, S., Piat, S., Fischer, P., Tuysuzoglu, A., Mewes, P., Mansi, T., and Liao, R. (2017). Dilated fcn for multi-agent 2d/3d medical image registration. arXiv preprint arXiv:1712.01651.
[10] De Silva, T., Uneri, A., Ketcha, M., Reaungamornrat, S., Kleinszig, G., Vogt, S., Aygun, N., Lo, S., Wolinsky, J., and Siewerdsen, J. (2016). 3d–2d image registration for target localization in spine surgery: investigation of similarity metrics providing robustness to content mismatch. Physics in Medicine & Biology, 61(8):3009.
[11] Miao, S., Wang, Z. J., and Liao, R. (2016a). A cnn regression approach for real-time 2d/3d registration. IEEE transactions on medical imaging, 35(5):1352– 1363.
[12] Miao, S., Wang, Z. J., Zheng, Y., and Liao, R. (2016b). Real-time 2d/3d registration via cnn regression. In Biomedical Imaging (ISBI), 2016 IEEE 13th International Symposium on, pages 1430–1434. IEEE.
[13] Chee, E. and Wu, J. (2018). Airnet: Self-supervised affine registration for 3d medical images using neural networks. arXiv preprint arXiv:1810.02583.
[14] Yang, X., Kwitt, R., and Niethammer, M. (2016). Fast predictive image registration. In Deep Learning and Data Labeling for Medical Applications, pages 48–57. Springer.
[15] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241. Springer (2015)
[16] Cao, X., Yang, J., Zhang, J., Nie, D., Kim, M., Wang, Q., and Shen, D. (2017). Deformable image registration based on similarity-steered cnn regression. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 300–308. Springer.
[17] Lv, J., Yang, M., Zhang, J., and Wang, X. (2018). Respiratory motion correction for free-breathing 3d abdominal mri using cnn-based image registration: a feasibility study. The British journal of radiology, 91(xxxx):20170788.
[18] Uzunova, H., Wilms, M., Handels, H., and Ehrhardt, J. (2017). Training cnns for image registration from few samples with model-based data augmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 223–231. Springer.
[19] Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., and Brox, T. (2015). Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766.
[20] Ito, M. and Ino, F. (2018). An automated method for generating training sets for deep learning based image registration. In The 11th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 2: BIOIMAGING, pages 140–147. INSTICC, SciTePress.
[21] Ikeda, K., Ino, F., and Hagihara, K. (2014). Efficient acceleration of mutual information computation for nonrigid registration using cuda. IEEE J. Biomedical and Health Informatics, 18(3):956–968.
[22] Fan, J., Cao, X., Yap, P.-T., and Shen, D. (2018b). Birnet: Brain image registration using dual-supervised fully convolutional networks. arXiv preprint arXiv:1802.04692.
[23] Yan, P., Xu, S., Rastinehad, A. R., and Wood, B. J. (2018). Adversarial image registration with application for mr and trus image fusion. arXiv preprint arXiv:1804.11024.
[24] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.
[25] Hu, Y., Modat, M., Gibson, E., Ghavami, N., Bonmati, E., Moore, C. M., Emberton, M., Noble, J. A., Barratt, D. C., and Vercauteren, T. (2018b). Label-driven weakly-supervised learning for multimodal deformable image registration. In Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on, pages 1070–1074. IEEE.
[26] Hu, Y., Modat, M., Gibson, E., Li, W., Ghavami, N., Bonmati, E., Wang, G., Bandula, S., Moore, C. M., Emberton, M., et al. (2018c). Weakly-supervised convolutional neural networks for multimodal image registration. Medical image analysis, 49:1–13.
[27] Hu, Y., Gibson, E., Ghavami, N., Bonmati, E., Moore, C. M., Emberton, M., Vercauteren, T., Noble, J. A., and Barratt, D. C. (2018a). Adversarial deformation regularization for training image registration neural networks. arXiv preprint arXiv:1805.10665.
[28] Li, H. and Fan, Y. (2017). Non-rigid image registration using fully convolutional networks with deep self-supervision. arXiv preprint arXiv:1709.00799.
[29] Li, H. and Fan, Y. (2018). Non-rigid image registration using self-supervised fully convolutional networks without training data. In Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on, pages 1075–1078. IEEE.
[30] Balakrishnan, G., Zhao, A., Sabuncu, M. R., Guttag, J., and Dalca, A. V. (2018a). An unsupervised learning model for deformable medical image registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9252–9260.
[31] Balakrishnan, G., Zhao, A., Sabuncu, M. R., Guttag, J., and Dalca, A. V. (2018b). Voxelmorph: A learning framework for deformable medical image registration. arXiv preprint arXiv:1809.05231
[32] Kuang, D. and Schmah, T. (2018). Faim–a convnet method for unsupervised 3d medical image registration. arXiv preprint arXiv:1811.09243.
[33] Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025.
[34] Timo Horstmann, Oliver Zettinig, Wolfgang Wein, and Raphael Prevost. Orientation estimation of abdominal ultrasound images with multi-hypotheses networks. In Medical Imaging with Deep Learning, 2021.
[35] Xinrui Song, Hengtao Guo, Xuanang Xu, Hanqing Chao, Sheng Xu, Baris Turkbey, Bradford J. Wood, Ge Wang, and Pingkun Yan. Cross-modal attention for mri and ultrasound volume registration. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 66–75. Springer, 2021.
[36] Viktoria Markova, Matteo Ronchetti, Wolfgang Wein, Oliver Zettinig, and Raphael Prevost. Global multi-modal 2d/3d registration via local descriptors learning. arXiv preprint arXiv:2205.03439, 2022.
[37] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[38] Fabian Manhardt, Diego Martin Arroyo, Christian Rupprecht, Benjamin Busam, Tolga Birdal, Nassir Navab, and Federico Tombari. Explaining the ambiguity of object detection and 6d pose from visual data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6841–6850, 2019.
[39] Patrick Bilic, Patrick Ferdinand Christ, Eugene Vorontsov, Grzegorz Chlebus, Hao Chen, Qi Dou, Chi-Wing Fu, Xiao Han, Pheng-Ann Heng, Jürgen Hesser, et al. The liver tumor segmentation benchmark (LITS). arXiv preprint arXiv:1901.04056, 2019.
[40] W. Wein, A. Ladikos, B. Fuerst, A. Shah, K. Sharma, and N. Navab. Global registration of ultrasound to MRI using the LC2 metric for enabling neurosurgical guidance. September 2013.
[41] Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7794–7803 (2018)
[42] Bashkanov, O., Meyer, A., Schindele, D., Schostak, M., Tönnies, K., Hansen, C., Rak, M.: Learning multi-modal volumetric prostate registration with weak intersubject spatial correspondence (2021)
[43] Thomson, B.R., Smit, J.N., Ivashchenko, O.V., Kok, N.F., Kuhlmann, K.F., Ruers, T.J., Fusaglia, M.: MR-to-US registration using multiclass segmentation of hepatic vasculature with a reduced 3d u-net. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 275–284. Springer (2020)
[44] Zhang, Y., Bi, J., Zhang, W., Du, H., Xu, Y.: Recent advances in registration methods for mri-trus fusion image-guided interventions of prostate. Recent Patents on Engineering 11(2), 115–124 (2017)
[45] Maes, F., Collignon, A., Vandermeulen, D., Marchal, G., Suetens, P.: Multimodality image registration by maximization of mutual information. IEEE TMI 16(2), 187–198 (1997)
[46] Heinrich, M.P., Jenkinson, M., Bhushan, M., Matin, T., Gleeson, F.V., Brady, S.M., Schnabel, J.A.: MIND: Modality independent neighbourhood descriptor for multi-modal deformable registration. Medical Image Analysis 16(7), 1423 – 1435 (2012)
[47] Guo, H., Kruger, M., Xu, S., Wood, B.J., Yan, P.: Deep adaptive registration of multi-modal prostate images. Computerized Medical Imaging and Graphics 84, 101769 (2020)
[48] Sun, Y., Moelker, A., Niessen, W.J., van Walsum, T.: Towards robust ct-ultrasound registration using deep learning methods. In: Understanding and Interpreting Machine Learning in Medical Image Computing Applications, pp. 43–51. Springer (2018)
[49] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: GradCAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision. pp. 618–626 (2017)
[50] Fischler, M., Bolles, R.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6), 381–395 (1981)
[51] Sun, J., Shen, Z., Wang, Y., Bao, H., Zhou, X.: LoFTR: Detector-free local feature matching with transformers. CVPR (2021)
[52] Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proc. icml. vol. 30, p. 3. Citeseer (2013)
[53] Ulyanov, D., Vedaldi, A., Lempitsky, V.: Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022 (2016)
[54] Rocco, I., Cimpoi, M., Arandjelović, R., Torii, A., Pajdla, T., Sivic, J.: Neighbourhood consensus networks. Advances in neural information processing systems 31 (2018)
[55] Wein, W., Ladikos, A., Fuerst, B., Shah, A., Sharma, K., Navab, N.: Global registration of ultrasound to MRI using the LC2 metric for enabling neurosurgical guidance. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013. pp. 34–41. Lecture Notes in Computer Science, Springer (2013)
[56] Müller, M., Helljesen, L., Prevost, R., Viola, I., Nylund, K., Gilja, O., Navab, N., Wein, W.: Deriving anatomical context from 4D ultrasound. 4th bi-annual Eurographics Workshop on Visual Computing for Biology and Medicine (2014)