In this blog post, we aim to explore and share insights on three methods used for reconstructing 3D CT models from single or biplanar images. Our goal is to highlight the key contributions of each method, while also acknowledging their potential limitations and the challenges associated with the metrics used for comparison.

Author: Jingtian Zhao 

Supervisor: Unknown User (ga87jay) 

  1. Introduction
  2. Literature Background
  3. Methods and Experiments
    1. X2CT-GAN (CVPR 2019)
    2. Image-to-Graph Convolutional Network for Deformable Shape Reconstruction from a Single Projection Image (MICCAI 2021)
    3. X2Vision (MICCAI 2023)
  4. Discussions
    1. Comparison
    2. Discussion about Metrics
    3. Discussion about Experiments and Reality
  5. References

1. Introduction

1.1.  The motivation for 3D reconstruction in Medical Science

3D reconstruction in medical science is a cutting-edge process in which two-dimensional images, such as X-rays or MRI scans, are converted into three-dimensional models. This technology allows doctors to gain a more comprehensive and tangible understanding of a patient's anatomy and thus to make more precise decisions for patients. For instance, in complex surgical procedures, highly precise 3D reconstructions can provide doctors with essential information that might not be visible from their perspective, significantly benefiting patient care. Moreover, the integration of 3D reconstruction into medical education offers immense learning opportunities for doctors and other healthcare professionals, enhancing their experience and expertise in the field.


1.2. Motivation and Challenges in 3D reconstruction from single or biplanar images

The motivation for generating 3D reconstructions from single or biplanar images in medical science is driven by significant concerns associated with CT scans. A primary issue, raised by Ying et al.[1], is the high radiation dose incurred during a CT scan, which scales with the number of X-ray projections acquired for the reconstruction. This poses a considerable health risk to patients, especially when frequent imaging is required. Additionally, CT scanners represent a substantial financial investment, often being prohibitively expensive and less accessible, particularly in developing countries where resources are limited. In contrast, X-ray machines are more affordable and widely available, making the development of 3D reconstruction techniques from X-ray images a valuable and necessary advancement in medical imaging.

However, this endeavor faces many challenges. Cafaro et al. identified the primary challenge as the nature of the task: it is an ill-posed inverse problem[2]. X-ray measurements, which result from attenuation integrated across the body, carry a high degree of ambiguity. Traditional reconstruction methods typically require hundreds of projections to sufficiently constrain the internal structures for accurate modeling. With only a few projections, as is the case with single or biplanar X-rays, disentangling these structures for even a coarse 3D estimate becomes exceedingly difficult. This complexity highlights the need for innovative approaches and advanced computational techniques to tackle 3D reconstruction from limited X-ray data.

Figure1: Illustration of 3D lung reconstruction from two biplanar images

2. Literature Background

In the previous section, we discussed the inherent challenges in 3D reconstruction from single or biplanar images, primarily due to the lack of necessary information. Traditional reconstruction methods, which are not based on neural networks, often fall short in this context. These conventional methods typically require hundreds of 2D images to generate a reliable 3D model, a requirement that is not feasible with limited image data. Consequently, this necessitates the exploration of neural network-based models to bridge the information gap. Recent trends in this field have demonstrated that neural networks can indeed deliver satisfactory performance in these complex tasks.

A notable example of this approach is the work by Henzler et al.[3]. Their research focuses on creating a 3D volume from a single 2D X-ray image of a mammal skull using a Convolutional Neural Network (CNN). The architecture of their model is an encoder-decoder framework augmented with skip connections and residual learning, a structure that has proven effective in various image processing tasks. What sets their model apart is its significant improvement in performance, as evidenced by enhanced metrics in both 3D Volume (L2) and 2D Image (DSSIM), compared to earlier methods. This advancement underscores the potential of NN-based models in revolutionizing the field of medical imaging and 3D reconstruction.

Figure2: Overview of the network presented by Henzler et al.

In recent years, the field of 3D reconstruction using deep learning has witnessed a rapid evolution. Starting in 2019, Ying et al. introduced the X2CT-GAN, a pioneering approach utilizing a Conditional Generative Adversarial Network for this task. Advancing further, in 2021, Nakao et al. made a significant contribution by incorporating organ deformation during breathing into their model, addressing a crucial aspect of dynamic anatomical changes.

In 2022, Shen et al. explored the use of Implicit Neural Representations for reconstruction. Although their work was based on multiple inputs, it introduced an innovative idea that garnered attention and comparison in subsequent studies. Most recently, in 2023, Cafaro et al. made a notable advancement with X2Vision. Their approach leveraged manifold learning to capture structural priors, significantly enhancing the model's precision. Each of these developments marks a step forward in the ongoing effort to refine and improve 3D reconstruction methodologies using deep learning techniques.

3. Methods and Experiments

In this section, we will mainly focus on three papers for 3D reconstruction from single or biplanar images.

3.1. X2CT-GAN (CVPR 2019)

X2CT-GAN generates a 3D CT volume from biplanar X-ray images. The main contribution of this work is the introduction of several connections in the generator that bridge the 2D-to-3D gap and mitigate the information loss in this transition.

3.1.1. Loss functions

X2CT-GAN uses a conditional GAN, an extension of the original GAN, for this task. Since the loss functions are key components of the model, we first discuss the loss functions used in X2CT-GAN.

Figure3: Network architecture of the X2CT-GAN generator

3.1.1.1.  Adversarial Loss

In a standard GAN, the generator G and discriminator D are trained in competition, with the goal that after training the generator's distribution p_G(x) is as close as possible to the real-data distribution p_{data}(x). For this task, the authors adopt LSGAN, defined as:

$\begin{aligned} \mathcal{L}_{LSGAN}(D) &= \frac{1}{2}\left[\mathbb{E}_{y \sim p(CT)}\big(D(y \mid x)-1\big)^2 + \mathbb{E}_{x \sim p(X_{ray})}\big(D(G(x) \mid x)-0\big)^2\right], \\ \mathcal{L}_{LSGAN}(G) &= \frac{1}{2}\,\mathbb{E}_{x \sim p(X_{ray})}\big(D(G(x) \mid x)-1\big)^2, \end{aligned}$

LSGAN uses a least-squares loss, which helps stabilize the training process and produce more realistic details.
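To make the least-squares objectives concrete, the following PyTorch-style sketch shows how the two LSGAN terms could be computed from discriminator scores. The `generator` and `discriminator` calls in the usage comment are placeholders for illustration, not the authors' implementation.

```python
import torch

def lsgan_d_loss(d_real, d_fake):
    # Discriminator: push scores for real CT volumes toward 1
    # and scores for generated volumes toward 0.
    return 0.5 * (((d_real - 1.0) ** 2).mean() + (d_fake ** 2).mean())

def lsgan_g_loss(d_fake):
    # Generator: push discriminator scores for generated volumes toward 1.
    return 0.5 * ((d_fake - 1.0) ** 2).mean()

# Hypothetical usage with placeholder networks:
# d_real = discriminator(real_ct, xrays)            # conditioned on the input X-rays
# d_fake = discriminator(generator(xrays), xrays)
# loss_d = lsgan_d_loss(d_real, d_fake)
# loss_g = lsgan_g_loss(d_fake)
```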

3.1.1.2.  Reconstruction Loss

The task targets medical use, which differs from generating everyday real-world objects: CT scans show less diversity in color and shape but demand higher precision in the internal 3D structures. To obtain more precise results, this work, like some earlier work, also uses a reconstruction loss:

$\mathcal{L}_{r e}=\mathbb{E}_{x, y}\|y-G(x)\|_2^2$

3.1.1.3. Projection Loss

As a supplement to the reconstruction loss, which operates on the 3D volume, the authors also want the generated 3D model to produce realistic 2D projections from three perspectives. To improve the model in this respect, the projection loss is introduced:

$\mathcal{L}_{pl}=\frac{1}{3}\left[\mathbb{E}_{x, y}\left\|P_{ax}(y)-P_{ax}(G(x))\right\|_1+\mathbb{E}_{x, y}\left\|P_{co}(y)-P_{co}(G(x))\right\|_1+\mathbb{E}_{x, y}\left\|P_{sa}(y)-P_{sa}(G(x))\right\|_1\right]$
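One simple way to realize such a projection term, assuming the three projections are approximated by mean intensity along the axial, coronal and sagittal axes of the volume (an assumption for illustration, not a detail taken from the paper), is:

```python
import torch
import torch.nn.functional as F

def projection_loss(fake_ct, real_ct):
    """L1 distance between mean-intensity projections of volumes shaped (B, D, H, W),
    averaged over the three anatomical axes (assumed axis order)."""
    loss = 0.0
    for axis in (1, 2, 3):  # axial, coronal, sagittal
        loss = loss + F.l1_loss(fake_ct.mean(dim=axis), real_ct.mean(dim=axis))
    return loss / 3.0
```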

3.1.1.4. Total Objective

The total objective combines the three losses above and is formulated as:

$\begin{aligned} D^* & =\arg \min _D \lambda_1 \mathcal{L}_{L S G A N}(D), \\ G^* & =\arg \min _G\left[\lambda_1 \mathcal{L}_{L S G A N}(G)+\lambda_2 \mathcal{L}_{r e}+\lambda_3 \mathcal{L}_{p l}\right],\end{aligned}$

in which λ1=0.1, λ2=λ3=10.
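Reusing the helper functions sketched above, the generator objective with the reported weights could be assembled roughly as follows; voxel-wise MSE stands in for the reconstruction loss.

```python
def generator_objective(d_fake, fake_ct, real_ct,
                        lambda1=0.1, lambda2=10.0, lambda3=10.0):
    l_adv = lsgan_g_loss(d_fake)               # semantic-level realism
    l_re = ((real_ct - fake_ct) ** 2).mean()   # voxel-wise reconstruction (MSE)
    l_pl = projection_loss(fake_ct, real_ct)   # consistency of 2D projections
    return lambda1 * l_adv + lambda2 * l_re + lambda3 * l_pl
```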

3.1.2.  Network architectures

Like the original GAN, X2CT-GAN contains a generator and a discriminator. For the discriminator, X2CT-GAN uses an architecture similar to the 3DPatchDiscriminator[5] from the PatchGAN framework, adapted to distinguish 3D volumes.

For the generator, a key feature of the X2CT-GAN architecture is the introduction of three types of connections, which play important roles in the generator.

Figure4: Different types of connections in X2CT-GAN

Connection-A: This connection is used to link the last encoder layer and the first decoder layer. Its primary purpose is to increase the dimensionality of feature maps. It involves flattening and elongating the encoder layer’s output to a one-dimensional vector, which is then reshaped into a three-dimensional format. However, it's important to note that most of the 2D spatial information is lost during this conversion process.

Connection-B: Like Connection-A, Connection-B also aims to increase the dimensionality of feature maps. This connection is utilized for the rest of the skip connections in the network, ensuring that the channel number of the encoder matches that of the decoder. This connection helps in effectively transferring low-level features from the encoder to the decoder, maintaining the consistency of information across the 2D to 3D transition.

Connection-C: This connection is distinct from the first two as its primary purpose is to fuse information from two different views. In the context of the X2CT-GAN model, which deals with biplanar X-rays, this connection is vital for combining data from two orthogonal views to construct a more comprehensive and accurate 3D representation.
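As a rough illustration of how Connection-A and Connection-B change tensor shapes, here is a simplified sketch with made-up channel and spatial sizes; it is not the authors' implementation, and Connection-C (fusing the two views) is only hinted at in a comment.

```python
import torch
import torch.nn as nn

class ConnectionA(nn.Module):
    """Flatten a 2D feature map into a vector, then reshape it into a 3D volume.
    Most of the 2D spatial layout is lost in the fully connected step."""
    def __init__(self, in_ch=256, spatial=4, out_ch=64, depth=4):
        super().__init__()
        self.out_ch, self.depth, self.spatial = out_ch, depth, spatial
        self.fc = nn.Linear(in_ch * spatial * spatial,
                            out_ch * depth * spatial * spatial)

    def forward(self, x):                      # x: (B, in_ch, spatial, spatial)
        b = x.size(0)
        v = self.fc(x.flatten(1))
        return v.view(b, self.out_ch, self.depth, self.spatial, self.spatial)

def connection_b(feat2d, depth):
    """Expand a 2D skip feature (B, C, H, W) along a new depth axis so it can be
    combined with a 3D decoder feature of shape (B, C, depth, H, W)."""
    return feat2d.unsqueeze(2).expand(-1, -1, depth, -1, -1)

# Connection-C would fuse two such 3D feature maps, one per X-ray view,
# e.g. by averaging them after aligning the views in a common 3D frame.
```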

Beyond the connections, X2CT-GAN also employs synthesized X-rays to train the network in understanding the mapping from 2D to 3D. Additionally, it uses CycleGAN to adapt real X-ray images to the synthesized style, ensuring that the network, although trained on synthesized images, can still effectively reconstruct CT images from actual X-ray data.

3.1.3. Experiments and Results

The dataset used for the experiments was derived from the publicly available LIDC-IDRI dataset, which contains 1,018 chest CT scans. Due to the lack of a readily available dataset with paired X-rays and corresponding CT reconstructions, the researchers utilized DRR technology to synthesize corresponding X-rays. This approach was more cost-effective and feasible for training their networks. The synthesized X-rays were then used to train the X2CT-GAN model.
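For intuition, a digitally reconstructed radiograph (DRR) can be approximated in its crudest parallel-beam form by integrating attenuation along one axis of the CT volume; the pipeline actually used in the paper (and real X-ray geometry) is more involved, so treat this only as a toy illustration.

```python
import numpy as np

def toy_parallel_drr(attenuation_volume, axis=1):
    """Very simplified DRR: exponentiate the negative line integral of the
    attenuation coefficients along the chosen axis (parallel-beam assumption)."""
    line_integral = attenuation_volume.sum(axis=axis)
    return np.exp(-line_integral)

# Example: a random 128^3 "volume" projected along one axis.
drr = toy_parallel_drr(np.random.rand(128, 128, 128) * 0.01, axis=0)
```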

Figure5: Reconstructed CT scans from different methods

Quantitative evaluations showed that using biplanar inputs led to a significant improvement in reconstruction accuracy, with an approximate 4 dB gain for both X2CT-CNN and X2CT-GAN models compared to using a single X-ray input. It's noted that GAN models often compromise MSE-based metrics to achieve visually better results, a trend also observed in this study. However, by adjusting the relative weights of voxel-level MSE loss and semantic-level adversarial loss, the X2CT-GAN model, optimized for the total objective, demonstrated improved performance. This balance enabled the model to capture finer details in the reconstructed CT images, thereby enhancing the overall quality and accuracy of the results.

Figure6: Quantitative results

3.2.  Image-to-Graph Convolutional Network for Deformable Shape Reconstruction from a Single Projection Image (MICCAI 2021)

In IGCN, the authors propose a method to generate 3D surface models of an organ from a single X-ray image. One main contribution of the work is that the model also takes into account the deformation of the organ during breathing.

3.2.1.  Network Structure

As shown in Figure 7, the IGCN network mainly consists of two parts, a CNN part and a GCN part. The CNN is utilized for extracting perceptual features from the input image (a single 2D X-ray). These features are then associated with the corresponding vertices of the initial model. The extracted features, along with vertex coordinates, are concatenated and fed into the GCN. The GCN is responsible for learning the deformation of the mesh according to these image features. This setup allows the initial model to be projected onto the input DRR image, facilitating the mesh deformation learning process. Both networks are optimized together using a specially designed loss function to ensure accurate 3D reconstruction. Notably, an extended VGG-16 model is employed in the CNN part of the framework, which has been adjusted for this specific application without the need for pretraining.
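The key coupling between the two sub-networks is that each mesh vertex is projected onto the image plane and the CNN feature at that location is attached to the vertex before entering the GCN. Below is a minimal sketch of that sampling step, assuming the projected vertex coordinates are already normalized to [-1, 1] and using PyTorch's grid_sample; both are simplifications on our part.

```python
import torch
import torch.nn.functional as F

def sample_vertex_features(feature_map, vertices_2d):
    """feature_map: (B, C, H, W) CNN features of the input X-ray / DRR.
    vertices_2d:  (B, N, 2) projected vertex positions in [-1, 1].
    Returns per-vertex features of shape (B, N, C)."""
    grid = vertices_2d.unsqueeze(2)                               # (B, N, 1, 2)
    feats = F.grid_sample(feature_map, grid, align_corners=True)  # (B, C, N, 1)
    return feats.squeeze(-1).permute(0, 2, 1)                     # (B, N, C)

# The GCN input per vertex would then be torch.cat([vertex_xyz, feats], dim=-1).
```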

Figure7: Network Structure of IGCN

3.2.2.  Loss functions

3.2.2.1.  Position Loss

The first loss that comes to mind is the distance between the generated model and the ground truth. In this work, the loss L_{pos} is defined as the mean squared distance over all vertices between the estimated shape and the ground truth:

$\mathcal{L}_{\text {pos }}=\frac{1}{n} \sum_{i=1}^n\left\|v_i-\hat{v}_i\right\|_2^2$

3.2.2.2.  Mapping Loss

To help the model learn the deformation mapping, the mapping loss L_{map} is introduced:

$\mathcal{L}_{\text {map }}=\frac{1}{n} \sum_{i=1}^n\left\|q_i-M\left(p_i\right)\right\|_2^2$

in which M is the mapping function, p_i is the projected point of the initial shape, and q_i is the projected point that corresponds to the target vertex v_i .

3.2.2.3.  Laplacian Loss

To maintain the original surface's curvature and smoothness during reconstruction, the Laplacian loss L_{laplacian} is used. It relies on the Laplace-Beltrami operator to constrain deviations from the initial mesh, preventing unwanted surface noise and ensuring high-quality mesh outputs. The formula is shown below:

$\mathcal{L}_{\text {laplacian }}=\frac{1}{n} \sum_{i=0}^n\left\|L\left(v_i\right)-L\left(\hat{v}_i\right)\right\|_2^2$
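A common discrete realization of this regularizer uses a uniform graph Laplacian, where L(v_i) is the difference between a vertex and the mean of its neighbors; the sketch below assumes that form (the paper's Laplace-Beltrami operator may use different weights).

```python
import torch

def uniform_laplacian(vertices, neighbor_idx):
    """vertices: (N, 3) mesh vertices; neighbor_idx: list of LongTensors,
    one per vertex, holding the indices of its neighbors.
    Returns L(v_i) = v_i - mean of its neighbors, shape (N, 3)."""
    lap = torch.zeros_like(vertices)
    for i, nbrs in enumerate(neighbor_idx):
        lap[i] = vertices[i] - vertices[nbrs].mean(dim=0)
    return lap

def laplacian_loss(pred_vertices, gt_vertices, neighbor_idx):
    # Penalize changes in local curvature relative to the reference mesh.
    diff = uniform_laplacian(pred_vertices, neighbor_idx) - \
           uniform_laplacian(gt_vertices, neighbor_idx)
    return (diff ** 2).sum(dim=1).mean()
```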

3.2.2.4. Total Loss

The total loss function combines the three losses above:

$\mathcal{L}_{\text {total }}=\mathcal{L}_{\text {pos }}+\lambda_{\text {map }} \mathcal{L}_{\text {map }}+\lambda_{\text {laplacian }} \mathcal{L}_{\text {laplacian }}$

with the weights $\lambda_{\text{map}}=10.0$ and $\lambda_{\text{laplacian}}=1.0$ in this model.
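With the reported weights, and reusing the laplacian_loss sketch above, the training objective could be assembled roughly as follows; the position and mapping terms are plain mean squared distances over 3D vertices and 2D projected points, respectively.

```python
def igcn_total_loss(pred_v, gt_v, mapped_p, target_q, neighbor_idx,
                    lambda_map=10.0, lambda_laplacian=1.0):
    l_pos = ((pred_v - gt_v) ** 2).sum(dim=1).mean()        # 3D vertex distances
    l_map = ((target_q - mapped_p) ** 2).sum(dim=1).mean()  # 2D projected points
    l_lap = laplacian_loss(pred_v, gt_v, neighbor_idx)      # surface smoothness
    return l_pos + lambda_map * l_map + lambda_laplacian * l_lap
```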

3.2.3. Results

In the experiments, the authors test IGCN on reconstructing the human liver and compare their results with those of other methods such as P2M (Pixel2Mesh).

Figure8: Visual results of IGCN


As shown in the figure below, IGCN performs best on all three metrics used for comparison (MD, RMSE, and DSC), meaning its reconstructions are more accurate and more similar to the ground truth. The results are also statistically significant, with p-values below 0.05.

Figure9: Quantitative results of IGCN

3.3.  X2Vision (MICCAI 2023)

In X2Vision, the authors address the task of reconstructing a 3D CT volume from biplanar X-rays by introducing prior knowledge of anatomical structures through a generative model trained on 3D CT scans of the head and neck.

3.3.1.  Network Structure

Figure10: Network structure of X2Vision

As shown above, the X2Vision model mainly consists of two parts. The first is manifold learning, which captures deep structural priors of the head and neck. The second optimizes the latent vector, searching the latent space for the representation that best explains the given biplanar X-rays.

3.3.1.1. Manifold Learning

  1. Objective: The goal is to train StyleGAN to generate 3D CT models of the head and neck region from any latent vector z . The key step is transforming a random vector drawn from a simple distribution (e.g. Gaussian) into a latent vector w through a mapping network; this latent vector w represents the distribution of head and neck CT scans in the latent space.

  2. Mapping Network and StyleGAN: The mapping network, typically structured as a Multilayer Perceptron (MLP), is concurrently trained with StyleGAN. Its role is to transform the input vector z into w , which is more suitable for generating complex 3D structures.

  3. Perceptual Loss: A perceptual loss is used during training. It measures the difference between the projection of the reconstructed volume and the projection of the ground-truth volume, helping the reconstruction align more closely with actual 3D CT images.

  4. Freezing StyleGAN Parameters: After manifold learning, the StyleGAN model, capable of generating high-quality 3D CT models, has its parameters frozen. The distribution of the latent vector w (mean and variance) is also recorded at this stage.

3.3.1.2.  Reconstruction from Biplanar Projections

This stage optimizes the latent vector w so that it best represents the two input projections. The process minimizes a loss consisting of a standard data term and a regularization term of the form:

$R(\mathbf{w}, \mathbf{n})=\lambda_w \mathcal{L}_w(\mathbf{w})+\lambda_c \mathcal{L}_c(\mathbf{w})+\lambda_n \mathcal{L}_n(\mathbf{n})$

The first term $\mathcal{L}_w(\mathbf{w})=-\sum_k \log \mathcal{N}\left(\mathbf{w}_k \mid \mu, \sigma\right)$ forces the latent vector w to stay on the distribution seen during training by penalizing deviations from it. The second term $\mathcal{L}_c(\mathbf{w})=-\sum_{i, j} \log \mathcal{M}\left(\theta_{i, j} \mid 0, \kappa\right)$ encourages the per-layer latent vectors to be collinear, keeping the coarse-to-fine generation coherent. The third term $\mathcal{L}_n(\mathbf{n})=-\sum_j \log \mathcal{N}\left(\mathbf{n}_j \mid \mathbf{0}, I\right)$ ensures the noise vector stays on the same distribution as in training. For the weights, the authors chose $\lambda_w=0.1$, $\lambda_c=0.05$, and $\lambda_n=10$.
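Conceptually, this stage is a test-time optimization: the generator stays frozen and only the latent codes (and noise) are updated so that projections of the generated volume match the two input X-rays. A heavily simplified sketch of such a loop is given below; the `generator` and `project` callables, the noise shape, and the exact form of the regularization terms are assumptions for illustration rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def reconstruct_from_biplanar(generator, project, xray_ap, xray_lat,
                              w_init, w_mu, w_sigma, steps=500, lr=0.05,
                              lambda_w=0.1, lambda_c=0.05, lambda_n=10.0):
    """generator: frozen StyleGAN-like network, (w, noise) -> 3D volume.
    project:   differentiable projector, (volume, view) -> 2D image.
    w_init:    (K, D) per-layer latent codes; w_mu, w_sigma: their training statistics."""
    w = w_init.clone().requires_grad_(True)
    noise = torch.zeros_like(w, requires_grad=True)   # simplified noise shape
    opt = torch.optim.Adam([w, noise], lr=lr)
    for _ in range(steps):
        volume = generator(w, noise)
        data_term = F.l1_loss(project(volume, "ap"), xray_ap) + \
                    F.l1_loss(project(volume, "lateral"), xray_lat)
        l_w = (((w - w_mu) / w_sigma) ** 2).mean()                     # stay near the learned manifold
        l_c = (1 - F.cosine_similarity(w[:-1], w[1:], dim=-1)).mean()  # keep per-layer codes coherent
        l_n = (noise ** 2).mean()                                      # keep noise close to N(0, I)
        loss = data_term + lambda_w * l_w + lambda_c * l_c + lambda_n * l_n
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator(w, noise).detach()
```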

3.3.2. Results

For training, the model used a large dataset of 3500 CT scans of head-and-neck cancer patients, including 2297 cases from The Cancer Imaging Archive (TCIA) and 1203 from private internal data. The data were divided into 3000 cases for training, 250 for validation, and 250 for testing. The quantitative results in the figure below show that, compared to other methods such as X2CT-GAN and NeRP, X2Vision achieves significant improvements in both PSNR and SSIM.

Figure11: Quantitative results of X2Vision

The visual results in the figure below show that X2Vision clearly outperforms the other methods in reconstruction. For instance, distinct spinal structures are visible in the X2Vision reconstructions, whereas the corresponding areas in the other methods' results are dominated by noise, underscoring the significant improvement achieved by X2Vision.

Figure12: Visual results of X2Vision and other methods

4. Discussions

In this section, we will compare the three papers mentioned in the previous section and also point out some potential disadvantages of the three methods.

4.1. Comparison


| | X2CT-GAN (CVPR 2019) | IGCN (MICCAI 2021) | X2Vision (MICCAI 2023) |
|---|---|---|---|
| Network | GAN (Conditional GAN) | CNN + GCN | GAN (StyleGAN 2) |
| Input | biplanar X-ray images | single X-ray image | biplanar X-ray images (more views possible) |
| Output | 3D CT Volume | 3D Surface Mesh | 3D CT Volume |
| Deformation | No | Yes | No |
| Loss | Adversarial Loss (LS), Reconstruction Loss (MSE), Projection Loss (L1) | Position Loss (MSE), Mapping Loss (MSE), Laplacian Loss (MSE) | Adversarial Loss (LS), Perceptual Loss |
| Metrics | PSNR, SSIM | MD, RMSE, DSC | PSNR, SSIM |

From the comparison we can conclude that all three methods use neural-network-based approaches for reconstruction from single or biplanar images, but their network structures differ considerably. X2CT-GAN and X2Vision choose GAN-based architectures to generate the 3D CT volume, currently the popular choice for this task. The advantage of a GAN is that it pushes the generator to produce the interior structures of the CT, such as the lung vessels in X2CT-GAN and the spinal structures in X2Vision. However, the quality of the reconstructed interior structures is tightly coupled to the capability of the model. In X2CT-GAN, for example, although the generator tends to produce lung vessels, the reconstruction quality is not fully satisfying: from the visual results it appears that the generator recovers only the main vessels, while for smaller vessels it seems to reconstruct noise rather than signal.

4.2. Discussion about Metrics


| | GT | X2CT-CNN | X2CT-GAN |
|---|---|---|---|
| Reconstructed CT model | (slice images in the original post) | (slice images in the original post) | (slice images in the original post) |
| SSIM (higher is better) | N/A | 0.721 (±0.001) | 0.656 (±0.008) |
| PSNR (higher is better) | N/A | 27.29 (±0.04) | 26.19 (±0.13) |

As mentioned above (and illustrated in the table above), both X2CT-GAN and X2Vision use PSNR and SSIM to compare results. Both metrics are well known for evaluating reconstruction quality and other computer-vision tasks, but they can diverge from human perception. For example, in the X2CT-GAN paper, X2CT-CNN scores better than X2CT-GAN on both metrics, yet when comparing the reconstructed scans side by side we might reach the opposite conclusion: X2CT-GAN recovers more interior structure and therefore looks more similar to the ground truth to a human observer. The authors themselves note that "GAN models often sacrifice MSE-based metrics to achieve visually better results". Looking further into this phenomenon, Zhang et al.[7] showed in 2018 that traditional metrics such as PSNR and SSIM sometimes disagree with human judgments on the task "which patch is closer to the target patch". The reason is that these metrics focus on pixel-wise differences and ignore the deeper image features that human perception relies on. Zhang et al.[7] therefore proposed using perceptual metrics for such comparisons, which align more closely with human perception. However, none of the papers discussed here use perceptual metrics in their evaluations.

Figure13: Some examples that traditional metrics differ from human perception
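For reference, perceptual metrics such as LPIPS[7] are easy to compute alongside PSNR and SSIM. The snippet below sketches this for a single 2D slice, assuming the `lpips` and `scikit-image` packages are installed and using random arrays as stand-ins for real slices.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Stand-ins for a ground-truth and a reconstructed CT slice, values in [0, 1].
gt_slice = np.random.rand(256, 256).astype(np.float32)
pred_slice = np.random.rand(256, 256).astype(np.float32)

psnr = peak_signal_noise_ratio(gt_slice, pred_slice, data_range=1.0)
ssim = structural_similarity(gt_slice, pred_slice, data_range=1.0)

# LPIPS expects 3-channel tensors in [-1, 1] with shape (N, 3, H, W).
to_tensor = lambda a: torch.from_numpy(a).repeat(3, 1, 1).unsqueeze(0) * 2 - 1
lpips_dist = lpips.LPIPS(net='alex')(to_tensor(gt_slice), to_tensor(pred_slice)).item()

print(f"PSNR={psnr:.2f} dB  SSIM={ssim:.3f}  LPIPS={lpips_dist:.3f} (lower is better)")
```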

4.3. Discussion about Experiments and Reality

In the three papers discussed, X-ray images serve as input; however, these images are not obtained directly from scans but are synthesized from 3D CT volumes contained within the datasets. This raises the question of whether these models would also perform effectively on real-world X-ray images acquired directly from medical facilities. The authors of X2CT-GAN have indicated that they assessed their model using actual X-ray images and achieved satisfactory results. However, the other two papers lack such experimental validation. It is reasonable to question whether their performance would hold up with real-world X-ray images. It's also worth noting that for biplanar images, the quality of real-world X-rays may significantly differ from those generated from a 3D CT model. While a 3D CT model can produce two perfectly orthogonal X-rays, achieving such orthogonality with real-world X-rays is unlikely due to potential complications during the imaging process. Although issues of alignment can be mitigated in traditional reconstructions that utilize hundreds of X-rays through abundant information, for more complex tasks like reconstructions from single or biplanar images, it is crucial for the model to be sufficiently robust to handle the noise inherent in real-world X-rays. In my opinion, the inclusion of experiments with real X-ray images to demonstrate model stability would greatly enhance the paper's credibility and its applicability in future medical science endeavors.


5. References

  • [1]: Ying, X., Guo, H., Ma, K., Wu, J., Weng, Z., & Zheng, Y. (2019). X2CT-GAN: reconstructing CT from biplanar X-rays with generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10619-10628).
  • [2]: Cafaro, A., Spinat, Q., Leroy, A., Maury, P., Munoz, A., Beldjoudi, G., ... & Paragios, N. (2023, October). X2Vision: 3D CT Reconstruction from Biplanar X-Rays with Deep Structure Prior. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 699-709). Cham: Springer Nature Switzerland.
  • [3]: Henzler, P., Rasche, V., Ropinski, T., & Ritschel, T. (2018, May). Single‐image tomography: 3D volumes from 2D cranial x‐rays. In Computer Graphics Forum (Vol. 37, No. 2, pp. 377-388).
  • [4]: Nakao, M., Tong, F., Nakamura, M., & Matsuda, T. (2021). Image-to-graph convolutional network for deformable shape reconstruction from a single projection image. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24 (pp. 259-268). Springer International Publishing.
  • [5]: Shen, L., Pauly, J., & Xing, L. (2022). NeRP: Implicit neural representation learning with prior embedding for sparsely sampled image reconstruction. IEEE Transactions on Neural Networks and Learning Systems.
  • [6]: Tong, F., Nakao, M., Wu, S., Nakamura, M., & Matsuda, T. (2020, July). X-ray2Shape: reconstruction of 3D liver shape from a single 2D projection image. In 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC) (pp. 1608-1611). IEEE.
  • [7]: Zhang, R., Isola, P., Efros, A. A., Shechtman, E., & Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 586-595).
  • [8]: Miura, R., Nakao, M., Nakamura, M., & Matsuda, T. (2022). 2D/3D Deep Image Registration by Learning 3D Displacement Fields for Abdominal Organs. arXiv preprint arXiv:2212.05445.
  • [9]: Wu, S., Nakao, M., Tokuno, J., Chen-Yoshikawa, T., & Matsuda, T. (2019, May). Reconstructing 3D lung shape from a single 2D image during the deaeration deformation process using model-based data augmentation. In 2019 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI) (pp. 1-4). IEEE.

