This blog post was written by Shucheng Yang under the guidance of Baochang Zhang.
Table of Contents
- Introduction
- Previous Research
- Representative Papers of Deep-learning-based Methods
- Summary
- Discussions
- References
Introduction
Minimally invasive surgery has become increasingly common as an alternative to traditional open surgical techniques. Although it is widely recognized as a safe and effective approach, the smaller operating space places higher demands on the precise manipulation of surgical instruments. One major challenge faced by surgeons is the inability to directly visualize the surgical site, which increases the complexity of spatial cognition. To overcome this challenge, one solution is to use additional tracking hardware to register real-time intraoperative images with finely detailed preoperative anatomical models. However, the images acquired in real time are usually limited to two dimensions (2D), leading to information loss such as the overlapping of tissues and organs. To address this issue, non-rigid 2D-3D registration techniques are frequently employed. These techniques serve multiple purposes. On the one hand, they fuse the preoperatively acquired 3D volumetric information into the 2D images, providing enhanced images with additional information. On the other hand, they enable the projection of 2D positions onto the 3D images for surgical navigation. Furthermore, non-rigid registration can model the deformation of organs and tissues, thereby improving the accuracy of the registration results.
The non-rigid 2D-3D registration problem can be described in general terms by Eq. 1 [R-1]. Here, \mathrm{I_{3D}} denotes the preoperative 3D volume and \mathrm{I_{2D}} the real-time image acquired during the procedure. \mathrm F denotes the preprocessing applied to both images. The processed 3D image \mathrm{F(I_{3D})} undergoes an elastic deformation \mathrm D and a rigid projection \mathrm P to generate a digitally reconstructed radiograph (DRR). A successful registration minimizes the distance between this DRR and \mathrm{F(I_{2D})}, where \mathrm S measures that distance.
(1) | \theta^{*}=\mathop{\mathrm{argmin}}_{\theta}\mathrm{S(P\circ D(F(I_{3D})),F(I_{2D}))} |
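To make Eq. 1 concrete, here is a minimal Python sketch that expresses it as a generic objective. The preprocessing F, deformation D, projection P, and similarity S are placeholders to be supplied by a concrete method; the MSE shown below is only one possible choice for S.

```python
import numpy as np

def registration_objective(theta, I3d, I2d, F, D, P, S):
    """Eq. 1 as a generic objective: S(P(D(F(I3d), theta)), F(I2d)).

    F: preprocessing (e.g. vessel segmentation), D: parametric elastic
    deformation, P: rigid projection producing a DRR, S: dissimilarity.
    All of these are placeholders supplied by the concrete method.
    """
    drr = P(D(F(I3d), theta))          # deform the 3D volume, then project it to 2D
    return S(drr, F(I2d))              # lower is better

# Example dissimilarity: mean squared error between two 2D images.
def mse(a, b):
    return float(np.mean((a - b) ** 2))

# theta_star = argmin_theta registration_objective(theta, ...), solved either by
# iterative optimization (classical methods) or predicted directly by a network.
```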
Previous Research
In prior research, optimization-based methods were used for 2D-3D registration tasks [V-2 to V-15] [S-2 to S-5], but they require substantial computation and many iterations and often get stuck in local optima. Feature-based methods [V-10 to V-15] depend heavily on feature design for their registration accuracy. These limitations make it difficult for such methods to meet the need for fast registration in clinical applications.
To address this, researchers turned to machine learning techniques [S-6, S-7], such as linear regression [V-16] and multilayer perceptron (MLP) models [V-17]. However, these methods have drawbacks such as requiring comprehensive 3D image datasets or lacking the precision of optimization-based methods.
Deep learning, particularly convolutional neural networks (CNNs) [S-8 to S-10], emerged as an alternative [V-18 to V-23]. Deep learning methods can discover patterns from large volumes of training data, which makes data quantity and quality their greatest challenge. This challenge is not limited to the training phase but also extends to the inference phase. There are three typical examples. First, the input data used during inference may be suboptimal, leading to significant errors in the inference results [V-1]. Second, obtaining labeled samples from real data is difficult, and in some extreme cases even acquiring unlabeled training data is challenging, which poses problems for models trained under supervised learning. One approach is to generate realistic synthetic data [L-1]; however, models trained on such data often generalize poorly to real data [S-11, S-12]. Alternatively, weakly supervised [V-24 to V-27] or unsupervised training can be attempted, but this introduces the third challenge: such unsupervised models typically require specialized equipment to collect multi-view data as model inputs [V-27]. To address these three challenges, three representative research papers that employ deep learning for the 2D-3D registration problem are selected.
A Weakly Supervised Framework for 2D-3D Vascular Registration Oriented to Incomplete 2D Blood Vessels [V-1]
Problem Statement
This paper explores the use of a ResNet-based model architecture for registration of CT and X-ray or DSA images in vascular intervention procedures. Digital Subtraction Angiography (DSA) provides clear visualization of vascular structures when contrast media is flowing through the vessels. However, the vessels appear incomplete in 2D images when the contrast media flows out of the field of view. Due to the potential harm caused by the excessive use of contrast agents, this study addresses the registration problem of incomplete vessel images as input.
Method
As shown in Fig. v-2, a two-step training approach is employed to train a CNN-based regression network. As shown in Fig. v-1, the regression network takes the initial image and the target image as input. The initial image corresponds to the preoperative 3D volume rendered as a DRR, while the target image is the real-time 2D X-ray. The network's backbone consists of multiple residual blocks [V-28], and its output is the six degrees of freedom of a rigid transform.
Fig. v-1. The architecture of the regression network.
Fig. v-2. Overview of the framework. Yellow lines denote preprocessing and cyan lines denote the forward propagation.
In the first stage, the model is pretrained on synthetic data. To ensure algorithm robustness and patient safety, the initial DRR and target DRR are preprocessed by segmentation and binarization, eliminating the influence of different modalities [V-29]. The processed images are then used as inputs to the network. Since both the initial and target images are synthetic, the rigid transformation parameters between them are known and serve as ground truth. The mean squared error (MSE) between the predicted and ground-truth transformation parameters is used as the loss, so that the CNN regressor learns to predict transformations whose projected vessels match the ideal complete vessel image.
To further enhance the registration performance of the model on real DSA images, unsupervised training is employed for fine-tuning the network. Following the pretrained regressor, a 3D-2D mask projection module is added. This module takes the 3D vessel volume extracted from preoperative CT slices as input and combines it with real-time predictions from the regressor to generate the Modified DRR (MDRR), which is a projected image. As shown in Fig. v-3, the workflow of the projection module consists of three steps: first, the module uniformly and randomly samples multiple feature points on the initial DRR and determines their corresponding positions in the 3D volume. Second, based on the predictions of the regressor, the feature points collected in the first step undergo rigid transformations. Third, to generate the transformed MDRR image, a series of information such as color, texture, and pixel intensity are collected from the corresponding positions of these feature points in the 3D volume.
Fig. v-3. The process of the 3D-2D mask projection module.
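A hypothetical NumPy sketch of these three steps is given below. The Euler-angle convention, the point-sampling strategy, and the nearest-neighbor intensity lookup are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def mask_projection_mdrr(volume, init_points_3d, init_points_2d, rigid_6dof, drr_shape):
    """Hypothetical sketch of the 3D-2D mask projection module (Fig. v-3).

    init_points_2d: (N, 2) pixel positions sampled uniformly at random on the
    initial DRR; init_points_3d: (N, 3) their corresponding voxel positions.
    rigid_6dof: regressor output (rx, ry, rz, tx, ty, tz), angles in radians.
    """
    rx, ry, rz, tx, ty, tz = rigid_6dof
    # Build a rotation matrix from the three Euler angles (assumed Z-Y-X convention).
    cx, sx, cy, sy, cz, sz = np.cos(rx), np.sin(rx), np.cos(ry), np.sin(ry), np.cos(rz), np.sin(rz)
    R = (np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]]) @
         np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]]) @
         np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]]))
    moved = init_points_3d @ R.T + np.array([tx, ty, tz])   # rigidly transform the sampled points

    mdrr = np.zeros(drr_shape, dtype=np.float32)
    for (u, v), p in zip(init_points_2d.astype(int), moved):
        # Gather the intensity at the transformed 3D position (nearest voxel).
        i, j, k = np.clip(np.round(p).astype(int), 0, np.array(volume.shape) - 1)
        mdrr[v, u] = volume[i, j, k]
    return mdrr
```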
The incomplete vessels result in content differences between the DRR and the DSA, making traditional image similarity measures inadequate: even when the registration is accurate, the projected DRR fills in the vessel parts that are missing from the DSA, so the measured similarity between the projected vessels and the DSA vessels remains low. This can drive the network's training in the wrong direction. Therefore, the model is trained with an image patch similarity loss between the MDRR and the target image (see Eq. 2). As shown in Fig. v-4, the MDRR and the DSA are first divided into multiple equally sized patches. For each patch, the MSE between the two images is calculated. These losses are then summed with patch-specific weights: a DSA patch containing abundant vessel information receives a higher weight, while patches with less vessel information receive lower weights.
(2) | \mathrm{L_{patch}=\sum_{i=1}^{N_{w}}\sum_{j=1}^{N_{h}}(W_{ij}MSE_{ij})} |
Fig. v-4. The calculation method of the patch-based content loss function. The transformed DRR is on the left, and the target DSA is on the right.
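The following sketch illustrates Eq. 2 on binarized vessel images; the patch size and the use of the vessel-pixel ratio as the weight W_ij are assumptions made for illustration.

```python
import numpy as np

def patch_content_loss(mdrr, dsa_seg, patch=32):
    """Sketch of the patch-based content loss (Eq. 2) on binarized vessel images.

    Both images are split into equally sized patches; each patch contributes an
    MSE term weighted by how much vessel content the DSA patch contains (the
    exact weighting scheme is an assumption here), so patches with little
    vessel information are down-weighted.
    """
    h, w = dsa_seg.shape
    loss = 0.0
    for top in range(0, h - h % patch, patch):
        for left in range(0, w - w % patch, patch):
            a = mdrr[top:top + patch, left:left + patch]
            b = dsa_seg[top:top + patch, left:left + patch]
            w_ij = b.mean()                          # vessel-pixel ratio as the patch weight
            loss += w_ij * np.mean((a - b) ** 2)     # W_ij * MSE_ij
    return float(loss)
```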
Evaluations
This paper employs four metrics to evaluate the performance of the model. The mean target registration error (mTRE) (Eq. 3) [V-30] quantifies the error between landmark points on the 3D volumes. The mean absolute error (MAE) (Eq. 4) measures the error between the transformation parameters. The Dice coefficient (Eq. 5) assesses the vascular overlap in the 2D image plane. However, because the true vessels may be incomplete, the model can obtain a low Dice value even when it produces an accurate registration. Hence, it is also essential to compute the recall of the vascular pixels (Eq. 6).
(3) | \mathrm{mTRE}=\frac{1}{N}\sum_{i=1}^{N}\|\mathrm{T}(p_{i})-\mathrm{T_{gt}}(p_{i})\| |
(4) | \mathrm{MAE}=\frac{1}{K}\sum_{i=1}^{K}|\theta_{i}-\theta_{\mathrm{gt},i}| |
(5) | \mathrm{Dice}=\frac{2|A\cap B|}{|A|+|B|} |
(6) | \mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{P}} |
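For reference, the four metrics can be computed roughly as in the NumPy sketch below, assuming T_pred and T_gt are callables that apply the predicted and ground-truth transforms to an array of landmark points.

```python
import numpy as np

def mtre(T_pred, T_gt, points):
    """Mean target registration error (Eq. 3) over 3D landmark points."""
    return float(np.mean(np.linalg.norm(T_pred(points) - T_gt(points), axis=1)))

def mae(theta_pred, theta_gt):
    """Mean absolute error between transformation parameters (Eq. 4)."""
    return float(np.mean(np.abs(np.asarray(theta_pred) - np.asarray(theta_gt))))

def dice(a, b):
    """Dice overlap between two binary vessel masks (Eq. 5)."""
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def recall(pred, target):
    """Vessel-pixel recall TP / P (Eq. 6): fraction of target vessel pixels covered."""
    pred, target = pred.astype(bool), target.astype(bool)
    return np.logical_and(pred, target).sum() / target.sum()
```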
In the conducted ablation study on simulated data with annotations, four different settings were compared to assess their performance:
Method | Pretraining | Fine-tuning | Fine-tuning loss |
---|---|---|---|
ResRegNet | Synthetic dataset | None | None |
ResRegNet-F | Synthetic dataset | Synthetic dataset | Global MSE loss |
RaycastNet-MSE | Synthetic dataset | Synthetic and real dataset | Global MSE loss |
RaycastNet-MSE-patch | Synthetic dataset | Synthetic and real dataset | Patch-based loss |
Results
According to Fig. v-5, among these settings, RaycastNet-MSE-patch demonstrated superior performance across multiple metrics. However, it slightly underperformed ResRegNet-F in terms of MAE for specific parameters. The other three methods exhibited improvements over ResRegNet in multiple metrics, indicating the effectiveness of fine-tuning. Training with a global MSE loss between 2D images showed slight performance enhancement compared to ResRegNet and ResRegNet-F, suggesting that the inclusion of unlabeled real DSA images can enhance registration performance. The last two rows of comparison demonstrated the effectiveness of patch-based similarity loss. The aforementioned result is visualized in Fig. v-6.
Fig. v-5. Comparison of methods with different setups.
Fig. v-6. Registration result comparison. Each line is an example, where the top left column is the target segmentation image, and the other columns are the registration results (red - target, green - projection).
In the clinical evaluation using real DSA images, a comparison was made between complete and incomplete vessels. According to Fig. v-7, for complete vessels, the proposed method significantly outperformed other methods. However, for incomplete vessels, all methods yielded lower Dice scores. Nevertheless, the proposed method exhibited significantly higher recall rates, surpassing a 20% improvement over other methods. In summary, the proposed method achieves accurate registration of complete vessels and possesses high pixel recall to cover incomplete vessels. The visualization of the result can be seen in Fig. v-8.
Fig. v-7. Quantitative registration result comparison of NMI, LMI, WLMI and RaycastNet-MSE-Patch on real DSAs.
Fig. v-8. The visualized registration result on real DSA images. The top left 2 columns show the target DSA and its vessel segmentation, and the other 4 columns show the results of different methods. In result images, green parts denote the segmentation and red parts denote the projected vessels.
CNN-based Real-time 2D-3D Deformable Registration from a Single X-ray Projection [L-1]
Problem Statement
Due to various factors, the tumor position in preoperative images becomes uncertain during surgery. These factors include patient positioning, respiratory motion, instrument compression, tumor elasticity, intraoperative bleeding, and tumor displacement. To address this, a non-rigid 2D-3D registration technique is used to align real-time intraoperative 2D images with preoperative 3D images, ensuring accurate tumor localization.
Previous studies on 2D-3D registration had limitations. They did not consider imaging device positioning errors and changes in patient posture during surgery [L-2 to 5]. Some methods required preoperative 4D-CT scans for training data [L-2, 4, 5]. Additionally, multiple perspective images were often needed as input [L-3].
Method
This research introduces techniques to overcome these limitations. It improves robustness by incorporating small pose variations of the C-arm to account for differences between planned and actual poses. It avoids the need for temporal preoperative images using domain randomization [L-6] and the LDDMM framework [L-7]. The approach ensures smooth and reversible displacement fields. Notably, it simplifies data collection and reduces computational burden by using a single perspective image as input.
In the preoperative stage shown in Fig. l-1, the patient's 3D CT scan is obtained, and relevant structures are segmented. C-arm positioning and patient posture are planned. During treatment, as shown in Fig. l-2, the patient and C-arm are positioned accordingly [L-8, 9]. The non-rigid registration model calculates a 3D displacement field to account for deformations and maps the segmented structures onto the perspective image.
Fig. l-1. Using a single 3D CT scan of the patient, the anatomical structures of interest are segmented and the optimal C-arm pose is determined. A neural network is trained using randomized deformations.
Fig. l-2. At the time of the intervention, the C-arm is adjusted to the planned pose, an X-ray image is acquired, and the network predicts in real-time the 3D deformation field from which the fluoroscopic image is augmented.
The proposed method employs a fully convolutional network architecture for 2D-3D registration [L-10]. It takes intraoperative perspective images as input and outputs a 3D displacement field representing non-rigid + rigid transformations from preoperative CT to intraoperative images.
As shown in Fig. l-3, the network consists of an encoder-decoder structure and a transformation module. The transformation module converts 2D feature maps to 3D and passes them to the decoder. The network is trained using MSE loss.
Fig. l-3. The network is based on an encoder-decoder architecture. The abbreviations k and s stand for kernel size and stride, respectively. The network takes as input one fluoroscopic image and outputs a sub-sampled 3D vector field on the CT volume.
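A minimal PyTorch sketch of this idea is shown below. The layer counts, channel sizes, input resolution, and the exact reshaping performed by the transformation module are assumptions, not the configuration reported in Fig. l-3.

```python
import torch
import torch.nn as nn

class Reg2D3DNet(nn.Module):
    """Minimal sketch of the encoder-decoder idea of Fig. l-3 (layer sizes
    and channel counts are assumptions, not the authors' exact configuration)."""

    def __init__(self):
        super().__init__()
        # 2D encoder: downsample the fluoroscopic image to a compact feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),
        )
        # 3D decoder: upsample towards a sub-sampled displacement field (3 channels).
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(8, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(16, 8, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(8, 3, kernel_size=3, padding=1),   # (ux, uy, uz) per voxel
        )

    def forward(self, x):                       # x: (B, 1, 128, 128) fluoroscopic image
        f = self.encoder(x)                     # (B, 128, 16, 16)
        # Transformation module: reinterpret the 2D channel dimension as a depth axis.
        f3d = f.view(f.size(0), 8, 16, 16, 16)  # (B, 8, 16, 16, 16)
        return self.decoder(f3d)                # (B, 3, 64, 64, 64) sub-sampled DVF

# Training uses a simple MSE loss between predicted and ground-truth DVFs:
# loss = torch.nn.functional.mse_loss(model(drr), u_total)
```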
The dataset used in this study consists of pairs \mathrm{\{(DRR_{i}, U_{total,i})\}}, where \mathrm{DRR_{i}} represents 2D Digitally Reconstructed Radiographs (DRRs), and \mathrm{U_{total,i}} represents 3D displacement vector fields (DVFs) that represent the combined effect of rigid projection \mathrm{p_{i}} and non-rigid deformation \mathrm{u_{i}}.
To generate the training data (see Fig. l-4), we first employ domain randomization to generate a series of deformation fields (DFs), denoted as \mathrm{\{u_{i}\}}. These DFs represent the deformation applied to the preoperative \mathrm{CT_{0}} image, resulting in corresponding CT images \mathrm{CT_{i}}. Subsequently, using the DeepDRR framework, we perform rigid projection \mathrm{p_{i}} on the deformed CT images \mathrm{CT_{i}} to obtain the corresponding 2D DRR images \mathrm{DRR_{i}}.
Fig. l-4. Overview of the training data generation process.
As the elastic deformations occurring in organ tissues during surgery are smooth, reversible, and free of tearing or breaking, the deformation field samples need to satisfy two important constraints: reversibility and diffeomorphism. Therefore, it is not feasible to uniformly sample \mathrm{u_{i}} (deformation vectors) from the entire domain of DFs. Instead, \mathrm{u_{i}} should be sampled uniformly from a linear subspace.
Considering the transformation \mathrm{\Phi(x)=x+u(x)} that registers the initial image \mathrm{I_{0}} to the target image \mathrm{I_{1}}, the corresponding inverse transformation is \mathrm{\Phi^{-1}(x)=x-u(x)}. Satisfying the reversibility condition is equivalent to minimizing \mathrm{||I_{0}\circ\Phi^{-1}(x)-I_{1}||^{2}}. Additionally, the deformation field \mathrm{u(x)} can be expressed as the integral of \mathrm{v(x, t)} with respect to \mathrm t. To ensure the reversibility condition, we introduce a series of control points that allow \mathrm{v(x, t)} to be further represented in the form of a reproducing kernel (see Eq. 7) [L-11].
(7) | \mathrm{v(x,t)~=~\sum_{k=1}^{N_{cp}}K(x,c_{k}(t))a_{k}(t)} |
To avoid the computation of integrals, we iteratively generate the deformation field \mathrm{u(x)} following these steps:
- For each time point t, sample \mathrm{a_{k}(t)} and \mathrm{c_{k}(t)} from a uniform distribution.
- Compute \mathrm{v_{i} = v(x_{i}, t_{i})}, where \mathrm{x_{i}} is the current position.
- Check whether the diffeomorphism condition [L-12] is satisfied, i.e., \mathrm{\|v_{i}\|_{W^{1,\infty}}<1}.
- Update \mathrm{u_{i+1}(x_0) = v_{i}(x_{0}) + u_{i}(x_{0})}, where \mathrm{x_{0}} corresponds to our \mathrm{CT_{0}} image, and the initial deformation field \mathrm{u_{0}(x_{0})} is 0.
- Update the position \mathrm{x_{i+1} = x_{i} + v_{i}}.
It should be noted that \mathrm{x_{0}} corresponds to our \mathrm{CT_{0}} image, and the corresponding deformation field \mathrm{u_{max}(x_{0})} is the label \mathrm{u_{i}} for the deformed CT image \mathrm{CT_{i}}.
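A rough NumPy sketch of this sampling loop is shown below. It assumes a Gaussian reproducing kernel and uses the maximum velocity magnitude as a crude stand-in for the W^{1,∞} check; these choices are illustrative rather than taken from the paper.

```python
import numpy as np

def gaussian_kernel(x, c, sigma=30.0):
    """Reproducing kernel K(x, c) from Eq. 7 (a Gaussian kernel is assumed here)."""
    d2 = np.sum((x[:, None, :] - c[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))                    # (n_points, n_cp)

def sample_deformation(points, n_steps=10, n_cp=16, amp=2.0, bounds=(0, 128)):
    """Sketch of the iterative DF generation above; kernel, bounds and the
    surrogate diffeomorphism check are assumptions, not the authors' code."""
    x = points.astype(float).copy()                            # current positions x_i
    u = np.zeros_like(x)                                       # accumulated field u_i(x_0)
    for _ in range(n_steps):
        c = np.random.uniform(*bounds, size=(n_cp, 3))         # control points c_k(t)
        a = np.random.uniform(-amp, amp, size=(n_cp, 3))       # momenta a_k(t)
        v = gaussian_kernel(x, c) @ a                          # v(x_i, t_i), Eq. 7
        if np.max(np.linalg.norm(v, axis=1)) >= 1.0:           # stand-in for ||v||_{W^{1,inf}} < 1
            continue                                           # reject this step and resample
        u += v                                                 # u_{i+1}(x_0) = u_i(x_0) + v_i
        x += v                                                 # x_{i+1} = x_i + v_i
    return u                                                   # label u(x_0) for the deformed CT
```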
The DeepDRR framework [L-13] simulates the C-arm as a pinhole camera and utilizes the extrinsic matrix \mathrm E to generate DRR. The rotation and translation values of matrix \mathrm E are determined during the surgical planning stage.
To account for errors between the actual pose and the preoperatively planned pose of the C-arm, the training data should include variations in pose. The pose variation matrix \mathrm P follows the same parameterization as \mathrm E, consisting of a translation \mathrm{T_{P}} and a rotation \mathrm{R_{P}}. First, the translation vector \mathrm{T_{P}} is uniformly sampled, parallel to the image plane, with an amplitude between 0 and 1. \mathrm{T_{P}} is then scaled by \mathrm{a\sim N(0,\sqrt{T_{max}/2})}, where \mathrm{T_{max}} is the maximum translation amplitude in the dataset. The rotation \mathrm{R_{P}} is uniformly sampled from the Haar distribution and converted to a rotation vector, which is similarly scaled by \mathrm{b\sim N(0,\sqrt{R_{max}/2})}, with \mathrm{R_{max}} the maximum rotation amplitude in the dataset. Lastly, to remove outliers, scaling values exceeding the threshold are truncated. The pose sampling follows a normal distribution to prioritize poses close to the planned pose, since the C-arm's most likely pose during treatment resembles the planned pose more than poses far from it.
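Sampling a single pose perturbation could then look roughly as follows. This sketch uses SciPy's rotation utilities; the choice of in-plane axes and the truncation rule are assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def sample_pose_variation(t_max=10.0, r_max=0.1):
    """Sketch of the C-arm pose perturbation sampling described above
    (the in-plane axes and the truncation rule are assumptions)."""
    # Translation: random unit direction parallel to the image plane (assumed x-y),
    # scaled by a ~ N(0, sqrt(T_max / 2)) and truncated at T_max.
    direction = np.random.uniform(-1.0, 1.0, size=2)
    direction /= np.linalg.norm(direction)
    a = np.clip(np.random.normal(0.0, np.sqrt(t_max / 2.0)), -t_max, t_max)
    t_p = np.array([a * direction[0], a * direction[1], 0.0])

    # Rotation: uniform (Haar) random rotation expressed as a rotation vector,
    # rescaled by b ~ N(0, sqrt(R_max / 2)) and truncated at R_max.
    axis = Rotation.random().as_rotvec()
    axis /= np.linalg.norm(axis)
    b = np.clip(np.random.normal(0.0, np.sqrt(r_max / 2.0)), -r_max, r_max)
    r_p = b * axis

    return r_p, t_p   # perturbation applied on top of the planned extrinsics E
```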
Evaluations and Results
The metrics used in the experiment were the 3D target registration error (TRE) and the projection distance (PD) between the predicted and true positions [L-14]. The model was validated on a dataset containing pose variations and domain-randomized deformations, achieving small errors. This demonstrates that the proposed method can compete with existing techniques even for larger displacements and consider pose errors.
Fig. l-5. The summary of the experimental results.
To further validate the method, a series of deformations of 10 respiratory-correlated lung 4D CT scans from the 4D-Lung dataset [L-15] was tested. The first test dataset did not introduce pose variations, so the network only recovered non-rigid displacements. As shown in Fig. l-5, Table 2, using domain randomization to generate the displacement fields is reasonable, since the network performs well on the real-data test set even though it was trained on synthetic data. The second test dataset incorporated pose variations, generated by altering the pose. The results presented in Fig. l-5, Table 3 show that the network has learned to handle such pose variations and that its accuracy remains high. The visualization of the result on real data is shown in Fig. l-6.
Fig. l-6. The visualization of the result on real data.
Non-rigid 2D-3D Registration using Convolutional Autoencoder [S-1]
Problem Statement
Long-term orthodontic treatment for the skull requires tracking the 3D morphological variations of the skull over time and conducting a comprehensive analysis of treatments and growth patterns. This process involves acquiring a reference CT scan during the patient's initial visit to obtain a high-resolution 3D representation of the skull. Subsequently, during regular follow-up visits, the doctor takes additional CT scans of the patient's skull and compares them to the initial scan to evaluate the effectiveness of the treatment. However, frequent CT scans within a short period may expose the patient to a significant amount of radiation.
To address this concern, one solution is to capture X-ray images of the patient during follow-up visits (see Fig. s-1), which results in significantly lower radiation doses compared to CT scans [S-2]. These X-ray images can then be registered to the reference CT scan. By employing this approach, doctors can effectively monitor the treatment's progress without the need for frequent CT scans, thereby reducing the overall radiation exposure to the patient.
Fig. s-1. Cranial orthodontic treatment process. Patients undergo CT scans and obtain initial models at the first visit to the doctor, and in subsequent follow-up visits, X-rays are taken and the obtained pictures are registered to observe the progress of treatment.
Method
The study proposes an unsupervised encoder-decoder architecture, which allows for the reconstruction of a 3D skull model from a single lateral cephalogram. The method formulates the task by embedding X-ray images as a code vector. As shown in Fig. s-2, an average skull model is first computed from a 3D cranial CT dataset. Then, the deformation field \mathrm{u_{i}} between the average skull model and each sample in the dataset is calculated, yielding a dataset of deformation fields. PCA (principal component analysis) is employed to project the elements of this dataset onto a lower-dimensional linear space, where each deformation field \mathrm{u_{i}} corresponds to a coordinate \mathrm{c_{i}}. Through this subspace projection, the volumetric deformation field is reduced to a low-dimensional code vector, enabling an efficient representation.
Fig. s-2. Obtaining low-dimensional embeddings of deformation fields using PCA.
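A small scikit-learn sketch of this embedding step is shown below with illustrative sizes; the random array stands in for the real deformation-field dataset, and the grid resolution and number of principal components are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

# Sketch of the PCA embedding in Fig. s-2. Each training deformation field u_i
# (between the average skull and sample i) is flattened into one row; PCA then
# gives a low-dimensional code vector c_i per field. Sizes are illustrative.
n_samples, field_size, n_components = 50, 32 * 32 * 32 * 3, 20

U = np.random.randn(n_samples, field_size)        # stand-in for the real DF dataset
pca = PCA(n_components=n_components)
codes = pca.fit_transform(U)                      # c_i: low-dimensional embeddings

# Decoding: a predicted code vector is mapped back to a volumetric deformation
# field, which warps the average skull model into the patient-specific model.
u_reconstructed = pca.inverse_transform(codes[:1]).reshape(32, 32, 32, 3)
```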
Additionally, the approach learns in an unsupervised manner directly from real unlabelled data, eliminating the need for labeled data or adaptation modules. Moreover, the method requires only a single X-ray image as input, leading to a significant reduction in radiation exposure. Unlike previous methods that rely on inputs from multiple angles to provide geometric constraints for registration, this approach incorporates a multi-level pyramid feature loss to provide additional feature constraints (see Fig. s-3), addressing the limitation of relying on a single input image for geometric constraints.
Fig. s-3. A pyramidal feature extraction network is used to obtain multi-level features, and constraints are provided for registration by controlling the distance between features at each level.
In the process of model training shown in Fig. s-5, the X-ray imaging system and patient's position are adjusted to the desired configuration, and X-ray images are acquired. These images are then input into a convolutional encoder to obtain a code vector, representing the deformation field's coordinates in a low-dimensional linear space. Using the code vector and the average skull model, the current 3D skull model of the patient is reconstructed (see Fig. s-4). Next, a synthetic Digital Reconstructed Radiograph (DRR) is computed using a ray-casting projection model of the X-ray source, with the projection parameters known in advance, such as the focal length, position, and orientation. A specially designed loss function is employed to measure the similarity between the rendered DRR and the 2D X-ray images of the patient.
Fig. s-4. An overview of the process of reconstructing the current skull 3D model. First, input the obtained X-ray image into the encoder to obtain the code vector. Use the code vector to calculate the deformation field and deform the population average model to obtain the current skull 3D model.
Fig. s-5. An overview of the training process of the convolutional autoencoder-based non-rigid 2D-3D registration framework.
The loss function (Eq. 8) consists of multiple terms. The first term calculates the difference between the generated DRR and the input X-ray image using the normalized cross-correlation (NCC) error. A function g is utilized to extract multi-scale features from X-ray images, and an unsupervised convolutional autoencoder is pre-trained to facilitate this feature extraction. The second term minimizes the dissimilarity between the stacked feature maps of the input image and the reconstructed image. The third term incorporates Tikhonov regularization, which penalizes the encoder when the predicted code vector deviates significantly from the samples in the dataset, indicating outliers. By including this regularization term, the encoder is encouraged to produce more plausible code vectors.
(8) | \mathrm{L}=\sum_{\mathrm{I}\in \mathrm J}\|\mathrm I-\mathrm{I'}\|_{\mathrm{NCC}}^{2}+\lambda_{\mathrm{p}}\sum_{\mathrm{I}\in \mathrm J}\sum_{\mathrm{k}=1}^{\mathrm{L}}\|\mathrm g_{\mathrm{k}}(\mathrm{I})-\mathrm g_{\mathrm{k}}(\mathrm{I'})\|_{2}^{2}+\lambda_{\mathrm{r}}\sum_{\mathrm{i}=1}^{\mathrm{K}}(\frac{\mathrm c_{\mathrm{i}}}{\sigma_{\mathrm{i}}})^{2} |
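A PyTorch-style sketch of Eq. 8 for a single image pair might look as follows. The exact NCC formulation, the loss weights, and the interface to the pyramid feature extractor are assumptions.

```python
import torch
import torch.nn.functional as F

def ncc_loss(a, b, eps=1e-8):
    """Squared normalized cross-correlation dissimilarity between two images."""
    a = (a - a.mean()) / (a.std() + eps)
    b = (b - b.mean()) / (b.std() + eps)
    return (1.0 - (a * b).mean()) ** 2

def registration_loss(xray, drr, feats_xray, feats_drr, code, sigma,
                      lambda_p=0.1, lambda_r=0.01):
    """Sketch of Eq. 8 for one image: NCC term + pyramid feature term +
    Tikhonov term on the code vector (weights and the NCC form are assumptions).

    feats_xray / feats_drr: lists of multi-scale feature maps g_k(I) and g_k(I')
    from the pre-trained pyramid extractor; sigma: per-component std of the PCA codes.
    """
    loss = ncc_loss(drr, xray)
    for fx, fd in zip(feats_xray, feats_drr):                   # sum over pyramid levels k
        loss = loss + lambda_p * F.mse_loss(fd, fx, reduction='sum')
    loss = loss + lambda_r * torch.sum((code / sigma) ** 2)     # penalize implausible codes
    return loss
```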
Evaluations
The study evaluated the performance using both 3D and 2D measures. The 3D metrics included the target registration error (TRE) on the 3D volume and the gross failure rate (GFR). The 2D metric was the landmark registration error (LRE) on the X-ray image.
In the experiments, the following methods were compared:
Method | Approach | Training |
---|---|---|
OPM, OPC [S-2] | Optimization | No training needed |
RBF [S-13], PLSR [S-6], ResNet [S-14] | Regression | Supervised |
EN | Renderer-based decoder | Unsupervised |
ENd | NN-based decoder | Unsupervised |
ENp | Renderer-based decoder + pyramid feature loss | Unsupervised |
Results
Results on synthetic data (Fig. s-6, visualized in Fig. s-7) showed that the EN method outperformed the optimization-based methods (OPM and OPC), indicating its effectiveness in capturing 3D craniofacial shape distributions. EN also outperformed the supervised regression-based models (RBF and PLSR) thanks to its larger parameter space, and showed performance comparable to the supervised ResNet model.
Fig. s-6. The TRE, the LRE, and the GFR of the proposed method (EN), the traditional optimization-based methods OPM and OPC, the RBF, the PLSR, and the ResNet-based methods on synthetic X-ray images.
Fig. s-7. Registration results of synthetic X-ray images. From left to right: the semi-transparent axial, sagittal, and coronal cross-sectional overlapping before and after the non-rigid 2D-3D registration (gray-target, red-reference before and after the registration), and the 2D landmark overlapping on the DRR of the reference volume (top) and the target X-ray image (bottom). (green-landmarks on synthetic DRR, red-landmarks on target X-ray image).
As shown in Fig. s-8 and visualization in Fig. s-9, when evaluating real data, EN demonstrated the best performance, surpassing even the supervised ResNet model. This highlighted the challenge of generalizing the ResNet model trained solely on synthetic data. ENd was more efficient to train as it didn't require volumetric deformations and DRR evaluations. However, EN outperformed ENd due to its accurate inference of X-ray images. The feature pyramid was found effective in reducing registration errors by enriching the intensity X-ray images with multi-scale feature maps, enhancing the similarity analysis of craniofacial structures.
Fig. s-8. The LRE (mm) and the LRE percentile (mm) on clinically obtained X-ray images using the proposed method (EN), the OPM and the OPC, the RBF, the PLSR, and the ResNet.
Fig. s-9. Registration results of clinically obtained X-ray images. From left to right: input X-ray image, DRR of the output volume, skull rendering of the output volume, and the overlapping of landmarks on the target X-ray and synthetic DRR images.
Summary
The first paper applies a weakly supervised learning method to achieve 2D to 3D registration on incomplete vascular images. By introducing a projection module, the model is fine-tuned on real unlabeled data. This paper innovatively designs a patch-based content loss function to address the challenges posed by the traditional mean squared error loss function.
The second paper introduces a supervised learning-based registration method for lung tumor localization. The authors predict a 3D deformation field by inputting 2D X-ray images, enabling accurate non-rigid registration. To enhance the model's generalization capability, this article utilizes domain randomization techniques to generate the training dataset. Additionally, the model takes into account the positioning errors between patient poses and intraoperative X-ray imaging devices, making the registration more robust.
The third paper adopts an unsupervised learning approach to 3D modeling of cranial bone deformations based on 2D X-ray images, supporting long-term cranial orthodontic treatments. Compared to previous unsupervised learning methods, this approach only requires a single 2D X-ray image as input. To address the insufficient geometric constraints from a single-angle image input, the paper introduces multi-layer feature losses. Furthermore, to better represent the deformation field, the paper incorporates representation learning, using a low-dimensional code vector as the embedding representation of the deformation field.
Discussions
Despite their valuable contributions, both the first and third papers have notable limitations. One major drawback is the lack of consideration for potential errors in patient positioning and imaging device alignment, which could have a significant impact on the accuracy of registration results. The omission of this crucial factor undermines the papers' ability to accurately reflect real-world scenarios.
In the case of the first paper, it fails to address the modeling of elastic deformations that commonly occur during surgical procedures. Factors such as patient respiration or the influence of surgical tools, which can cause organ compression and deformation, are not accounted for. The absence of these critical considerations limits the robustness and stability of the proposed registration method.
Interestingly, this issue is discussed in the preprint version of the paper but is omitted in the final published version. In the preprint, the authors utilize a U-Net architecture (see Fig. d-1) to predict the 2D deformation field between the projection module-derived MDRR and the DSA images. To train the U-Net network, the paper employs a data generation method as shown in Fig. d-2: First, a DRR image is selected, and three points are randomly chosen to form a triangle, denoted as Triangle A. Then, within a neighborhood of these three points, another set of three points is randomly selected to form a new triangle, denoted as Triangle B. An affine transformation from A to B is calculated. This affine transformation is then combined with a random deformation field with Gaussian blur to create a non-rigid 2D deformation field, serving as the ground truth label.
Fig. d-1. The structure of the elastic registration model.
Fig. d-2. The training data generation process for the elastic model.
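A possible sketch of this label-generation procedure is shown below; the jitter range, noise amplitude, and smoothing width are illustrative values, and degenerate (near-collinear) triangles are not handled for brevity.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def synth_2d_deformation(h, w, jitter=8.0, noise_amp=2.0, sigma=16.0):
    """Sketch of the preprint's label generation: a triangle-to-triangle affine
    plus a Gaussian-blurred random field (parameter values are assumptions)."""
    # Triangle A: three random points; Triangle B: the same points jittered locally.
    A = np.random.uniform([0, 0], [w, h], size=(3, 2))
    B = A + np.random.uniform(-jitter, jitter, size=(3, 2))

    # Solve the affine map M (2x3) such that [A, 1] @ M.T = B.
    A_h = np.hstack([A, np.ones((3, 1))])
    M = np.linalg.solve(A_h, B).T

    # Apply the affine to every pixel to get the affine part of the field.
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3)
    warped = grid @ M.T
    field = (warped - grid[:, :2].astype(float)).reshape(h, w, 2)

    # Add a smooth random component to make the field non-rigid.
    noise = np.stack([gaussian_filter(np.random.randn(h, w), sigma) for _ in range(2)], axis=-1)
    return field + noise_amp * noise                   # ground-truth 2D DF label
```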
The second paper exhibits shortcomings in its methodology. The complexity of the employed data generation method raises concerns regarding potential biases or confounding factors that might affect the reliability and generalization of the proposed model. Furthermore, the experimental evaluation of the paper solely focuses on testing the proposed model without conducting a comparative analysis against established methods or alternative approaches. This lack of comparative assessment diminishes the persuasiveness and overall credibility of the presented findings.
References
[R-1] Unberath, Mathias, et al. "The impact of machine learning on 2d/3d registration for image-guided interventions: A systematic review and perspective." Frontiers in Robotics and AI 8 (2021): 716007.
[V-1] Meng, Cai, et al. "A weakly supervised framework for 2D/3D vascular registration oriented to incomplete 2D blood vessels." IEEE Transactions on Medical Robotics and Bionics 4.2 (2022): 381-390.
[V-2] J. H. Hipwell et al., “Intensity-based 2-D-3-D registration of cerebral angiograms,” IEEE Trans. Med. Imag., vol. 22, no. 11, pp. 1417–1426, Nov. 2003.
[V-3] S. Demirci, O. Kutter, F. Manstad-Hulaas, R. Bauernschmitt, and N. Navab, “Advanced 2D-3D registration for endovascular aortic interventions: Addressing dissimilarity in images,” in Proc. Med. Imag. Visualizat. Image-Guided Procedur. Model., vol. 6918, 2008, Art. no. 69182S.
[V-4] S. Demirci, M. Baust, O. Kutter, F. Manstad-Hulaas, H.-H. Eckstein, and N. Navab, “Disocclusion-based 2D–3D registration for aortic interventions,” Comput. Biol. Med., vol. 43, no. 4, pp. 312–322, 2013.
[V-5] S. Miao, R. Liao, and Y. Zheng, “A hybrid method for 2-D/3-D registration between 3-D volumes and 2-D angiography for trans-catheter aortic valve implantation (TAVI),” in Proc. IEEE Int. Symp. Biomed. Imag. Nano Macro, 2011, pp. 1215–1218.
[V-6] A. Raheem, T. Carrell, B. Modarai, and G. Penney, “Non-rigid 2D-3D image registration for use in endovascular repair of abdominal aortic aneurysms,” in Proc. Med. Image Understand. Anal., 2010, pp. 153–157.
[V-7] S. Hunsche et al., “Intensity-based 2D 3D registration for lead localization in robot guided deep brain stimulation,” Phys. Med. Biol., vol. 62, no. 6, p. 2417, 2017.
[V-8] C. Meng, Q. Wang, S. Guan, K. Sun, and B. Liu, “2D-3D registration with weighted local mutual information in vascular interventions,” IEEE Access, vol. 7, pp. 162629–162638, 2019.
[V-9] K. Yang et al., “A novel 2D/3D hierarchical registration framework via principal-directional fourier transform operator,” Phys. Med. Biol., vol. 66, no. 6, 2021, Art. no. 65030.
[V-10] M. Groher, D. Zikic, and N. Navab, “Deformable 2D-3D registration of vascular structures in a one view scenario,” IEEE Trans. Med. Imag., vol. 28, no. 6, pp. 847–860, Jun. 2009.
[V-11] D. Rivest-Henault, H. Sundar, and M. Cheriet, “Nonrigid 2D/3D registration of coronary artery models with live fluoroscopy for guidance of cardiac interventions,” IEEE Trans. Med. Imag., vol. 31, no. 8, pp. 1557–1572, Aug. 2012.
[V-12] D. Xu et al., “Single-view 2D/3D registration for X-ray guided bronchoscopy,” in Proc. IEEE Int. Symp. Biomed. Imag. Nano Macro, 2010, pp. 233–236.
[V-13] S. Ghafurian, I. Hacihaliloglu, D. N. Metaxas, V. Tan, and K. Li, “3D/2D image registration using weighted histogram of gradient directions,” in Med. Imag. Image-Guided Procedur. Robot. Intervent. Model., vol. 9415, 2015, Art. no. 94151Z.
[V-14] J. Zhu et al., “Heuristic tree searching for pose-independent 3D/2D rigid registration of vessel structures,” Phys. Med. Biol., vol. 65, no. 5, 2020, Art. no. 55010.
[V-15] S. Yoon, C. H. Yoon, and D. Lee, “Topological recovery for non-rigid 2D/3D registration of coronary artery models,” Comput. Methods Programs Biomed., vol. 200, Mar. 2021, Art. no. 105922.
[V-16] C.-R. Chou, B. Frederick, G. Mageras, S. Chang, and S. Pizer, “2D/3D image registration using regression learning,” Comput. Vis. Image Understand., vol. 117, no. 9, pp. 1095–1106, 2013.
[V-17] A. R. Gouveia, C. Metz, L. Freire, P. Almeida, and S. Klein, “Registration-by-regression of coronary CTA and X-ray angiography,” Comput. Methods Biomech. Biomed. Eng. Imag. Visualizat., vol. 5, no. 3, pp. 208–220, 2017.
[V-18] S. Miao, Z. J. Wang, and R. Liao, “A CNN regression approach for real-time 2D/3D registration,” IEEE Trans. Med. Imag., vol. 35, no. 5, pp. 1352–1363, May 2016.
[V-19] S. Miao, Z. J. Wang, Y. Zheng, and R. Liao, “Real-time 2D/3D registration via CNN regression,” in Proc. IEEE 13th Int. Symp. Biomed. Imag. (ISBI), 2016, pp. 1430–1434.
[V-20] J. Zheng, S. Miao, Z. J. Wang, and R. Liao, “Pairwise domain adaptation module for CNN-based 2-D/3-D registration,” J. Med. Imag., vol. 5, no. 2, 2018, Art. no. 21204.
[V-21] S. Guan, C. Meng, Y. Xie, Q. Wang, K. Sun, and T. Wang, “Deformable cardiovascular image registration via multi-channel convolutional neural network,” IEEE Access, vol. 7, pp. 17524–17534, 2019.
[V-22] S. Miao et al., “Dilated FCN for multi-agent 2D/3D medical image registration,” in Proc. AAAI Conf. Artif. Intell., vol. 32, 2018, pp. 4694–4701.
[V-23] D. Toth et al., “3D/2D model-to-image registration by imitation learning for cardiac procedures,” Int. J. Comput. Assist. Radiol. Surg., vol. 13, no. 8, pp. 1141–1149, 2018.
[V-24] Y. Hu et al., “Label-driven weakly-supervised learning for multimodal deformable image registration,” in Proc. IEEE 15th Int. Symp. Biomed. Imag. (ISBI), 2018, pp. 1070–1074.
[V-25] Q. Zeng et al., “Label-driven magnetic resonance imaging (MRI)- transrectal ultrasound (TRUS) registration using weakly supervised learning for MRI-guided prostate radiotherapy,” Phys. Med. Biol., vol. 65, no. 13, 2020, Art. no. 135002.
[V-26] P. Li, Y. Pei, Y. Guo, G. Ma, et al., “Non-rigid 2D-3D registration using convolutional autoencoders,” in Proc. IEEE 17th Int. Symp. Biomed. Imag. (ISBI), 2020, pp. 700–704.
[V-27] Y. Zhang, “An unsupervised 2D–3D deformable registration network (2D3D-RegNet) for cone-beam CT estimation,” Phys. Med. Biol., vol. 66, no. 7, 2021, Art. no. 74001.
[V-28] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[V-29] X. Zhou, C. Yang, and W. Yu, “Moving object detection by detecting contiguous outliers in the low-rank representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 3, pp. 597–610, Mar. 2013.
[V-30] E. B. Van de Kraats, G. P. Penney, D. Tomazevic, T. V. Walsum, and W. J. Niessen, “Standardized evaluation methodology for 2-D-3-D registration,” IEEE Trans. Med. Imag., vol. 24, no. 9, pp. 1177–1189, Sep. 2005.
[L-1] Lecomte, François, Jean-Louis Dillenseger, and Stéphane Cotin. "CNN-based real-time 2D-3D deformable registration from a single X-ray projection." arXiv preprint arXiv:2212.07692 (2022).
[L-2] Shieh, C.C., Caillet, V., Dunbar, M., Keall, P.J., Booth, J.T., Hardcastle, N., Haddad, C., Eade, T., Feain, I.: A Bayesian approach for three-dimensional markerless tumor tracking using kV imaging during lung radiotherapy. Physics in Medicine and Biology 62(8), 3065–3080 (2017)
[L-3] Zhang, Y., Huang, X., Wang, J., Sebastian, N., Robb, R., Webb, A., Shilo, K., Denicola, G.M., Williams, T.M.: Automatic Cone Beam Projection-based Liver Tumor Localization by Deep Learning and Biomechanical Modeling. Int. Journal of Radiation Oncology, Biology, Physics 108(3), 171 (2020)
[L-4] Hirai, R., Sakata, Y., Tanizawa, A., Mori, S.: Real-time tumor tracking using fluoroscopic imaging with deep neural network analysis. Physica Medica 59, 22–29 (2019)
[L-5] Foote, M.D., Zimmerman, B.E., Sawant, A., Joshi, S.C.: Real-Time 2D-3D Deformable Registration with Deep Learning and Application to Lung Radiotherapy Targeting. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 11492 LNCS, 265–276 (2019)
[L-6] Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., Abbeel, P.: Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 22–30 (2017)
[L-7] Trouve, A., Faisal Beg, M., Miller, M.I., Younes, L.: Computing Large Deformation Metric Mappings via Geodesic Flows of Diffeomorphisms. International Journal of Computer Vision 61(2), 139–157 (2005)
[L-8] Rouze, S., de Latour, B., Flecher, E., Guihaire, J., Castro, M., Corre, R., Haigron, P., Verhoye, J.-P.: Small pulmonary nodule localization with cone beam computed tomography during video-assisted thoracic surgery: a feasibility study. Interactive CardioVascular and Thoracic Surgery 22(6), 705–711 (2016)
[L-9] Lee, B.C., Sinha, A., Varble, N., Pritchard, W.F., Karanian, J.W., Wood, B.J., Bydlon, T.: Breathing-compensated neural networks for real time c-arm pose estimation in lung ct-fluoroscopy registration. In: IEEE International Symposium on Biomedical Imaging (ISBI), pp. 1–5 (2022)
[L-10] Shen, L., Zhao, W., Xing, L.: Patient-specific reconstruction of volumetric computed tomography images from a single projection view via deep learning. Nature Biomedical Engineering 3(11), 880–888 (2019)
[L-11] Durrleman, S., Prastawa, M., Charon, N., Korenberg, J.R., Joshi, S., Gerig, G., Trouvé, A.: Morphometry of anatomical shape complexes with dense deformations and sparse parameters. NeuroImage 101, 35–49 (2014)
[L-12] Banyaga, A.: The Structure of Classical Diffeomorphism Groups. Mathematics and its Applications, vol. 400. Kluwer Academic (1997)
[L-13] Unberath, M., Zaech, J.N., Lee, S.C., Bier, B., Fotouhi, J., Armand, M., Navab, N.: DeepDRR - A Catalyst for Machine Learning in Fluoroscopy-Guided Procedures. Lecture Notes in Computer Science 11073 LNCS, 98–106 (2018)
[L-14] van de Kraats, E.B., Penney, G.P., Tomazevic, D., van Walsum, T., Niessen, W.J.: Standardized evaluation methodology for 2-d-3-d registration. IEEE Transactions on Medical Imaging 24(9), 1177–1189 (2005)
[L-15] Hugo, G.D., Weiss, E., Sleeman, W.C., Balik, S., Keall, P.J., Lu, J., Williamson, J.F.: A longitudinal four-dimensional computed tomography and cone beam computed tomography dataset for image-guided radiation therapy research in lung cancer. Medical physics 44(2), 762 (2017)
[S-1] Li, Peixin, et al. "Non-rigid 2D-3D registration using convolutional autoencoders." 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). IEEE, 2020.
[S-2] P. Markelj, D. Tomaževič, B. Likar, and F. Pernuš, “A review of 3D/2D registration methods for image-guided interventions,” Medical Image Analysis, vol. 16, no. 3, pp. 642–661, 2012.
[S-3] Christelle Gendrin et al., “Monitoring tumor motion by real time 2D/3D registration during radiotherapy,” Radiotherapy and Oncology, vol. 102, no. 2, pp. 274–280, 2012.
[S-4] Weimin Yu, Moritz Tannast, and Guoyan Zheng, “Non-rigid free-form 2D-3D registration using a B-spline-based statistical deformation model,” Pattern Recognition, 2016.
[S-5] G. Zheng, “Statistically deformable 2D/3D registration for accurate determination of post-operative cup orientation from single standard X-ray radiograph,” Annals of Biomedical Engineering, vol. 38, no. 9, pp. 2910–2927, 2010.
[S-6] Guoyan Zheng, “3D volumetric intensity reconstruction from 2D X-ray images using partial least squares regression,” in IEEE ISBI, 2013, pp. 1268–1271.
[S-7] Yuru Pei, Fanfan Dai, Tianmin Xu, Hongbin Zha, and Gengyu Ma, “Volumetric reconstruction of craniofacial structures from 2d lateral cephalograms by regression forest,” in IEEE ICIP, 2016, pp. 4052–4056.
[S-8] Shun Miao, Sebastien Piat, Peter Fischer, Ahmet Tuysuzoglu, Philip Mewes, Tommaso Mansi, and Rui Liao, “Dilated FCN for multi-agent 2D/3D medical image registration,” in AAAI, 2018.
[S-9] Shun Miao, Z Jane Wang, and Rui Liao, “A CNN regression approach for real-time 2D/3D registration,” IEEE Trans. MI, vol. 35, no. 5, pp. 1352–1363, 2016.
[S-10] Yuru Pei et al., “Non-rigid craniofacial 2D-3D registration using CNN-based regression,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 117–125. Springer, 2017.
[S-11] Zhang Yue et al., “Task driven generative modeling for unsupervised domain adaptation: Application to x-ray image segmentation,” in MICCAI, 2018.
[S-12] Jiannan Zheng, Shun Miao, and Liao Rui, “Learning cnns with pairwise domain adaption for real-time 6dof ultrasound transducer detection and tracking from x-ray images,” in MICCAI, 2017.
[S-13] B. Walczak and D. L. Massart, “The radial basis functions–partial least squares approach as a flexible non-linear regression technique,” Analytica Chimica Acta, vol. 331, no. 3, pp. 177–185, 1996.
[S-14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.