Neural radiance fields (NeRF) is a neural rendering technique that combines geometric reasoning and machine learning to render novel views of a scene from input images. However, NeRF is limited to rigid, non-deformable scenes, so methods that can reconstruct dynamic scenes are needed, and several have been devised. This post mainly discusses two of them, Nerfies and D-NeRF, which augment NeRF for that purpose, and gives an example, EndoNeRF, of using deformable neural radiance fields in healthcare.
Introduction
NeRF can produce complex 3D scenes from input 2D images. It is based on a neural network (NN) that encodes colour and density as a function of location and viewing angle, from which novel views are rendered [1]. The method struggles with non-rigidly deforming scenes [2], as in Fig. 1 [3]. To reconstruct deformable scenes, methods that extend NeRF have been developed.
Figure 1. Dynamic Scenes [3].
Specialised labs can capture high-quality 3D scans [4], but can they be replaced with phone cameras, as shown in Fig. 2? This is challenging because of our inability to stay still (non-rigidity) and demanding materials such as hair [5].
Figure 2. Novel views from phone camera captures [5].
One technique, “Nerfies”, tackles this by casually waving a phone camera to capture photos or a video, from which a photorealistic model is built. To handle dynamic scenes, the technique uses a deformation field, rigidity priors and a coarse-to-fine deformation regularisation. The model is evaluated using a rig of two phone cameras [5].
Another method, D-NeRF, requires only one camera and uses a continuous 6D function of location, viewing direction and time. It consists of a canonical network and a deformation network, and is evaluated on scenes with various deformations [2].
Focus and Problem Statement
Using camera images of a dynamic scene, the goal is to develop a deep learning model that extends NeRF to implicitly encode the dynamic scene and then synthesise novel views [2], thus mapping location and viewing direction to colour and volume density, M: (x, d) → (c, σ) [5].
Figure 3. Rendering novel views of dynamic scenes [2].
Comparison of the Main Methods
Nerfies
The method combines NeRF, which acts as a canonical template, with a per-observation deformation field, both represented as multi-layer perceptrons (MLPs). To avoid an under-constrained optimisation when optimising both NeRF and the deformation field, elastic, background and coarse-to-fine deformation regularisations are employed, as in Fig. 4 [5].
Figure 4. Nerfies Method [5].
NeRF
NeRF acts as a template volume representing the scene's relative appearance and structure.
The mapping uses a positional encoding composed of sin and cos functions of increasing frequencies, which lets the MLP model high-frequency signals in low-dimensional domains.
\[ \gamma(x) = \bigl(x, \ldots, \sin(2^{k}\pi x), \cos(2^{k}\pi x), \ldots\bigr), \qquad k \in \{0, \ldots, m-1\} \quad (1) \]
where m is the hyperparameter adjusting the number of frequency bands.
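To make equation (1) concrete, here is a minimal NumPy sketch of the positional encoding; the function name and the example value of m are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def positional_encoding(x, m):
    """Equation (1): map x to (x, sin(2^k pi x), cos(2^k pi x)) for k = 0..m-1."""
    x = np.atleast_1d(x)
    features = [x]
    for k in range(m):
        features.append(np.sin(2.0 ** k * np.pi * x))
        features.append(np.cos(2.0 ** k * np.pi * x))
    return np.concatenate(features, axis=-1)

# A 3D point with m = 6 frequency bands yields 3 + 2 * 6 * 3 = 39 features.
print(positional_encoding(np.array([0.1, -0.4, 0.7]), m=6).shape)  # (39,)
```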
An appearance latent code per frame is used to regulate the output colour, accounting for appearance variations between frames [5].
Neural Deformation Fields
Neural deformation fields extend NeRF to reconstruct dynamic scenes. An observation-to-canonical deformation is used for each frame: points along rays cast in the observation frame are mapped by the deformation field to the canonical template. The deformation fields for all time steps are represented as T: (x, w_i) → x′, mapping observation-space coordinates x to canonical-space coordinates x′, where w_i is the per-frame latent deformation code encoding the scene's state in that frame.
Composing the mapping T with the canonical radiance field F, the observation-space radiance field is evaluated as G(x, d, ψ_i, w_i) = F(T(x, w_i), d, ψ_i) [5].
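The sketch below illustrates this composition, G(x, d, ψ_i, w_i) = F(T(x, w_i), d, ψ_i); the MLP sizes, the offset parameterisation and the omission of positional encoding are simplifying assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    """T: observation-space point x + per-frame deformation code w -> canonical x'."""
    def __init__(self, code_dim=8, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 + code_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3))

    def forward(self, x, w):
        return x + self.mlp(torch.cat([x, w], dim=-1))  # predict an offset to x

class CanonicalNeRF(nn.Module):
    """F: canonical point x', view direction d, appearance code psi -> (colour, density)."""
    def __init__(self, code_dim=8, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 + 3 + code_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 4))

    def forward(self, x_canonical, d, psi):
        out = self.mlp(torch.cat([x_canonical, d, psi], dim=-1))
        return torch.sigmoid(out[..., :3]), torch.relu(out[..., 3:])

def observation_radiance(F, T, x, d, psi_i, w_i):
    """G(x, d, psi_i, w_i) = F(T(x, w_i), d, psi_i)."""
    return F(T(x, w_i), d, psi_i)
```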
Elastic regularisation
Deformation fields introduce ambiguities: for example, an object moving backwards can appear to be shrinking, and the resulting under-constrained optimisation leads to distorted results. One solution is to use priors; here, elastic regularisation is used, as shown in Fig. 5 [5].
Figure 5. Elastic regularisation of an under-constrained scene [5].
Elastic energies measure how far local deformations deviate from a rigid motion. For a latent code w_i, the deformation field is a nonlinear mapping T: x → x′. To control the deformation's local behaviour, the Jacobian J_T is used, since it gives the best linear approximation of the transformation at a given point. The Jacobian's deviation from a rigid transformation is penalised through the deviation of its log singular values from zero: applying the SVD J_T(x) = UΣVᵀ, the deviation from the identity is \( L_{elastic}(x) = \lVert \log\Sigma - \log I \rVert_F^2 = \lVert \log\Sigma \rVert_F^2 \).
The captured object may also undergo non-rigid movements such as a facial expression, so a robust loss is used to remap the elastic energy, giving:
\[ L_{elastic}(x) = \rho\bigl(\lVert \log\Sigma \rVert_F, c\bigr) \quad (2) \]
\[ \rho(x, c) = \frac{2\,(x/c)^2}{(x/c)^2 + 4} \quad (3) \]
where ρ is the Geman-McClure robust error function and c = 0.03.
The elastic penalty at every sample along a ray is weighted according to its contribution to the rendered view [5].
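A hedged PyTorch sketch of equations (2)–(3), assuming the per-sample Jacobians J_T (e.g. obtained by automatic differentiation of T) are already available; this is an illustration, not the authors' code.

```python
import torch

def geman_mcclure(x, c=0.03):
    """Robust error rho(x, c) = 2 (x/c)^2 / ((x/c)^2 + 4), equation (3)."""
    sq = (x / c) ** 2
    return 2.0 * sq / (sq + 4.0)

def elastic_loss(jacobians, c=0.03):
    """Elastic penalty of equation (2) from Jacobians of shape [..., 3, 3].

    Penalises the deviation of the log singular values of J_T from zero,
    remapped by the Geman-McClure robust function.
    """
    sigma = torch.linalg.svdvals(jacobians)                       # singular values of J_T
    log_sigma_norm = torch.linalg.norm(torch.log(sigma), dim=-1)  # ||log Sigma||_F
    return geman_mcclure(log_sigma_norm, c)
```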
Background Regularisation
Since the deformation field is free to behave arbitrarily in empty space, background regularisation is added to stop the background from moving and to align the observation coordinate frame with the canonical one. It penalises deformations at a set of points known to be static: \( L_{bg} = \frac{1}{K} \sum_{k=1}^{K} \lVert T(x_k) - x_k \rVert_2 \) [5].
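A minimal sketch of this penalty, assuming a set of 3D points known (e.g. from SfM) to be static and a callable deformation field; the names are illustrative.

```python
import torch

def background_loss(deform_fn, static_points, w):
    """L_bg = (1/K) * sum_k || T(x_k) - x_k ||_2 over K known-static points x_k."""
    deformed = deform_fn(static_points, w)                           # T(x_k)
    return torch.linalg.norm(deformed - static_points, dim=-1).mean()
```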
Coarse-to-Fine Deformation Regularisation
A known issue in registration is the trade-off between modelling small and large motions, which can lead to overly smooth results or incorrect registration. Fig. 6 shows that a small m (equation 1) biases the model towards low frequencies and a large m towards high frequencies.
Coarse-to-fine regularisation controls the deformation field's ability to model high frequencies by first fitting coarse, low-frequency motion and then progressively introducing finer detail.
Positional encoding can be interpreted through the Neural Tangent Kernel (NTK) of NeRF's MLP, where m controls the kernel's bandwidth [6]. Based on this, a parameter α ∈ [0, m] that windows the positional encoding's frequency bands is used to smoothly increase the NTK's bandwidth. Each frequency band's weight and the resulting positional encoding are defined in equations 4 and 5:
\[ w_k(\alpha) = \frac{1 - \cos\bigl(\pi\,\mathrm{clamp}(\alpha - k, 0, 1)\bigr)}{2} \quad (4) \]
\[ \gamma(x) = \bigl(x, \ldots, w_k(\alpha)\sin(2^{k}\pi x), w_k(\alpha)\cos(2^{k}\pi x), \ldots\bigr) \quad (5) \]
Figure 6. Results of varying m [5].
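A small NumPy sketch of equations (4)–(5); in practice α would be annealed from 0 to m during training, and the function names here are illustrative.

```python
import numpy as np

def frequency_window(alpha, m):
    """Equation (4): w_k(alpha) = (1 - cos(pi * clamp(alpha - k, 0, 1))) / 2 per band k."""
    k = np.arange(m)
    return (1.0 - np.cos(np.pi * np.clip(alpha - k, 0.0, 1.0))) / 2.0

def windowed_positional_encoding(x, m, alpha):
    """Equation (5): the identity term plus window-weighted sin/cos frequency bands."""
    x = np.atleast_1d(x)
    w = frequency_window(alpha, m)
    features = [x]
    for k in range(m):
        features.append(w[k] * np.sin(2.0 ** k * np.pi * x))
        features.append(w[k] * np.cos(2.0 ** k * np.pi * x))
    return np.concatenate(features, axis=-1)

# alpha close to 0 keeps only low frequencies; alpha = m enables all bands.
```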
D-NeRF
D-NeRF trains on monocular data from a single camera, without 3D ground-truth supervision, taking time as an additional input. The learning process is split into encoding the scene into a canonical space and mapping the canonical representation into the deformed scene at a specific time. This allows the method to control object movement through the camera view and the time variable when rendering new images [2]. Unlike structure-from-motion (SfM) based methods [7], as used in Nerfies, D-NeRF does not require a known reference template to acquire the surface geometry [2].
D-NeRF Method
Unlike NeRF, D-NeRF can learn a scene's volumetric density representation with a single view per time instance. It is formed of the Canonical and Deformation Networks, which parametrise two mappings, Ψ_x and Ψ_t , as shown in Fig. 7 [2].
Architecture
Similar to “Nerfies”, the Canonical Network in D-NeRF is an MLP, which encodes the scene in the canonical space. It takes a 3D point and a viewing direction and gives colour and density, Ψ_x:(x,d)→(c,σ). The canonical configuration acts as an anchor interconnecting all images to form a scene that stores the information of all points in the images. This allows needed information to be retrieved when synthesising scenes [2].
The Deformation Network is an MLP that predicts a deformation field defining the transformation between the scene at a time t and its canonical form. Given a 3D point x at time t, the network returns the displacement ∆x that transforms x to its position in the canonical space [2]:
\[ \Psi_t(x, t) = \begin{cases} \Delta x, & t \neq 0 \\ 0, & t = 0 \end{cases} \quad (6) \]
Before being fed to the NN, the point x, viewing direction d and time t are first encoded into a higher-dimensional space by the positional encoder \( \gamma(p) = \bigl(\ldots, \sin(2^{l}\pi p), \cos(2^{l}\pi p), \ldots\bigr),\ l \in \{0, \ldots, L-1\} \) [1], where L = 10 for x and L = 4 for d and t [2].
Figure 7. D-NeRF architecture [2].
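The sketch below shows this two-network structure (canonical network Ψ_x and deformation network Ψ_t) in simplified form; the layer widths, the plain concatenation of inputs and the omission of positional encoding are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CanonicalNetwork(nn.Module):
    """Psi_x: (x, d) -> (colour, density) in the canonical configuration (t = 0)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(6, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 4))

    def forward(self, x, d):
        out = self.mlp(torch.cat([x, d], dim=-1))
        return torch.sigmoid(out[..., :3]), torch.relu(out[..., 3:])

class DeformationNetwork(nn.Module):
    """Psi_t: (x, t) -> delta_x, forced to zero at t = 0 as in equation (6)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3))

    def forward(self, x, t):
        delta = self.mlp(torch.cat([x, t], dim=-1))
        return torch.where(t == 0, torch.zeros_like(delta), delta)

# Rendering a point observed at time t queries the canonical network at x + delta_x.
```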
Volume Rendering
Below are the equations used to take care of non-rigid deformations in the 6D neural radiance field.
Taking a point x(h) = o + hd along the ray from the camera centre o to pixel p, the pixel's expected colour C′ at time t is approximated below by numerical quadrature [2].
\[ C'(p, t) = \sum_{n=1}^{N} T'(h_n, t)\,\alpha(h_n, t, \delta_n)\,\mathbf{c}(p(h_n, t), d) \quad (7) \]
\[ \alpha(h, t, \delta) = 1 - \exp\bigl(-\sigma(p(h, t))\,\delta\bigr) \quad (8) \]
\[ T'(h_n, t) = \exp\Bigl(-\sum_{m=1}^{n-1} \sigma(p(h_m, t))\,\delta_m\Bigr) \quad (9) \]
Here p(h, t) is the point on ray x(h) transformed by Ψ_t into canonical space, T′(h_n, t) is the accumulated transmittance, i.e. the probability that the ray does not hit any other particle before reaching h_n, and the distance between two quadrature points is δ_n = h_{n+1} − h_n [2].
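A minimal NumPy sketch of equations (7)–(9) along a single ray, assuming the colours c and densities σ have already been evaluated at the warped sample points p(h_n, t); it is illustrative rather than the paper's renderer.

```python
import numpy as np

def render_ray(rgb, sigma, h):
    """Numerical quadrature of equations (7)-(9) for one ray.

    rgb:   [N, 3] colours at the canonical-warped sample points
    sigma: [N]    densities at those points
    h:     [N+1]  sample depths, so that delta_n = h[n+1] - h[n]
    """
    delta = np.diff(h)                                    # delta_n
    alpha = 1.0 - np.exp(-sigma * delta)                  # equation (8)
    # T'(h_n, t): probability that the ray reaches h_n without hitting a particle
    transmittance = np.exp(-np.concatenate(([0.0], np.cumsum(sigma * delta)[:-1])))
    weights = transmittance * alpha
    return (weights[:, None] * rgb).sum(axis=0)           # equation (7)
```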
Learning the Model
Both networks, Ψ_t and Ψ_x, are trained simultaneously by minimising the mean squared error with respect to the scene's RGB images \( \{ I_t \}_{t=1}^{T} \) and their associated camera pose matrices \( \{ T_t \}_{t=1}^{T} \). The training loss is \( L = \frac{1}{N_s} \sum_{i=1}^{N_s} \lVert \hat{C}_i(p, t) - C'_i(p, t) \rVert_2^2 \), where Ĉ is the ground-truth colour and C′ the colour rendered with equation (7) [2].
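As a short illustration of this objective, the snippet below computes the photometric loss over a batch of sampled rays; the shapes and names are assumptions.

```python
import torch

def photometric_loss(rendered_rgb, gt_rgb):
    """L = (1 / N_s) * sum_i || C_hat_i - C'_i ||_2^2 over N_s sampled rays.

    rendered_rgb, gt_rgb: [N_s, 3] rendered and ground-truth ray colours.
    """
    return ((rendered_rgb - gt_rgb) ** 2).sum(dim=-1).mean()
```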
Related Work
NSFF uses a monocular video and models dynamic scenes as a time variant continuous function of geometry, 3D scene motion and appearance [3].
Occupancy flow gives every point a 3D vector, thus learning a continuous vector field, but requires 3D ground-truth supervision [8].
Neural volumes is based on encoder-decoder voxel representation and a voxel warp field, but requires multi-view image-capture [9].
NeRFlow uses a deformation MLP similar to Nerfies and incorporates scene flow across time [10].
Experiments and Results
Nerfies
The experiment involved taking videos or selfies of the user standing in front of a static background to guarantee consistent geometric registration of the cameras. COLMAP [11], an SfM pipeline, was used to compute the camera intrinsics and image poses, whereas feature filtering on the subject was done using a foreground segmentation network [5].
Implementation [5]
- NeRF template [1].
- Deformation Network.
- 8 dimensions for latent deformation and appearance codes.
- 6 frequency bands.
- α annealing over 80,000 iterations.
- MSE photometric loss [1].
- Weighted total loss L_total = L_rgb + λ L_elastic-r + µ L_bg, with λ = µ = 10⁻³.
Evaluation
A rig with two phones was used to evaluate generated novel views of quasi-static and dynamic scenes, whereby the former involves still subjects and the latter moving ones [5].
To accommodate the cameras' photometric differences, a per-camera appearance code, {ψ_L, ψ_R} ∈ R^2, was used instead of a per-frame code ψ_i. The reconstruction's quality was checked using depth renderings of the density field [5].
Results
From Table 1, the following can be deduced [5]:
- Learned perceptual image patch similarity (LPIPS) metric does better than Peak Signal-to-Noise Ratio (PSNR) in dynamic scene reconstructions, as also shown in Fig. 8.
- “Nerfies” performs the best in terms of LPIPS.
- Stronger elastic regularisation (λ) enhances dynamic scene results.
- Elastic loss mostly affects quasi-static scenes, probably because its impact is minimal relative to the other losses in dynamic scenes.
- Elastic regularisation improves results when the scene is under-constrained.
- Coarse-to-fine regularisation greatly enhances dynamic-scene results.
- Background regularisation improves PSNR by decreasing shifts in static areas.
Table 1. Nerfies Quantitative evaluation [5].
Figure 8. Qualitative evaluation of Nerfies with PSNR/LPIPS displayed (the better performing in red) [5].
D-NeRF
The model was tested against NeRF and T-NeRF (D-NeRF without canonical mapping) [2].
Implementation
- Both networks as 8-layer MLPs.
- Canonical configuration always set as the scene state at time 0.
- Training over 800,000 iterations [2].
Results
Fig. 9 shows [2]:
- The canonical network's ability to represent the scene.
- The network's ability to approximate deformation fields capable of mapping the canonical scene to a shape at every input image.
- The colour consistency across different time instances signals that the displacement field is properly estimated.
Figure 9. D-NeRF results at different times t [2].
As displayed in Fig. 10, the method can encode changes in the appearance of a point over time, thus properly modelling shadow effects [2].
Figure 10. Shadow effect and (time and view) conditioning [2].
Table 2 and Fig. 11 display the quantitative and qualitative evaluations. The results show that NeRF is unable to model dynamic scenes, whereas T-NeRF struggles with retrieving high frequency details in dynamic scenes and D-NeRF synthesises high details in novel views [2].
Table 2. D-NeRF quantitative results [2].
Figure 11. D-NeRF qualitative results [2].
Healthcare Use-Case (EndoNeRF) - An Overview
Introduction
For soft-tissue reconstruction from endoscopic stereo videos, SLAM-based approaches struggle with complex scenes. EndoNeRF adopts deformable neural radiance fields to reconstruct dynamic surgical scenes from binocular captures taken from a single viewpoint. The method uses MLPs to optimise shapes and deformations in a learning-based manner. To overcome poor 3D clues and tool occlusion, it uses tool mask-guided ray casting, stereo depth-cueing ray marching and stereo depth-supervised optimisation. Experiments show that EndoNeRF can outperform other methods [14].
Method
The aim is to reconstruct 3D structures in surgeries without occlusion using a single viewpoint. Video frames are taken as input, denoted by [14]:
\[ \{ (I_i^l, I_i^r) \}_{i=1}^{T} \quad (10) \]
where T is the total number of frames and (I_i^l, I_i^r) are the left and right images at frame i.
To delimit the surgical instruments' region, binary tool masks \( \{ M_i \}_{i=1}^{T} \) are extracted for the left views. Using the binocular captures, coarse depth maps \( \{ D_i \}_{i=1}^{T} \) for the left views are estimated to exploit stereo cues. The model is based on D-NeRF: it represents the deformable scene as a canonical field plus a time-dependent deformation field [14].
The pipeline's training, shown in Fig. 12, involves [14]:
- Randomly picking a training frame.
- Using tool guided ray casting to direct the rays.
- Using depth cueing ray marching to sample ray points.
- Using the network to get the colour and density at each sampled point.
- Rendering the results via volume rendering.
- Optimising the rendering and depth losses to reconstruct the colours, deformations and shapes.
Figure 12. Illustration of EndoNeRF [14].
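As a rough illustration of the tool mask-guided ray casting step above, the sketch below samples ray origins only at tissue pixels, skipping pixels covered by the tool mask; the uniform weighting over tissue pixels is a simplification of the paper's sampling strategy.

```python
import numpy as np

def sample_rays_with_tool_mask(tool_mask, n_rays, rng=None):
    """Sample pixel coordinates for ray casting, avoiding tool-occluded pixels.

    tool_mask: [H, W] boolean array, True where a surgical tool occludes the tissue.
    Returns [n_rays, 2] integer (row, col) pixel coordinates.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = tool_mask.shape
    prob = (~tool_mask).astype(np.float64).ravel()   # probability mass only on tissue pixels
    prob /= prob.sum()
    flat_idx = rng.choice(h * w, size=n_rays, replace=False, p=prob)
    return np.stack(np.unravel_index(flat_idx, (h, w)), axis=-1)
```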
Experiments and Results
Evaluation
The method is evaluated on stereo videos from 6 cases of DaVinci robotic prostatectomy surgery. PSNR, SSIM and LPIPS are used as evaluation metrics [14].
Implementation
The endoscope is calibrated. STTR-light [15] generates coarse stereo depth maps, and tool masks are acquired through manual labelling. To retrieve explicit geometry, the optimised radiance fields are rendered to RGBD maps, the rendered depth maps are smoothed with bilateral filtering, and the RGBD maps are back-projected to point clouds [14].
Results
When testing EndoNeRF's performance on the cutting of soft tissues, the reconstruction results show that it can track the procedure's details, thanks to the NN's strong representation of displacement fields. Bypassing tool occlusion appears possible due to the neural implicit field's interpolation property and the mask-guided ray casting. In contrast, the compared method does not seem capable of tracking detailed changes or of patching all the occluded areas, as shown in Fig. 13 [14].
Figure 13. EndoNeRF Qualitative analysis [14].
Figure 14. Ablation study on depth-related modules [14].
As shown in Table 3, EndoNeRF outperforms E-DSSR [16]. When the neural displacement field is removed (w/o D), performance drops considerably but still exceeds that of E-DSSR [14].
An ablation study on the depth-related modules, Fig. 14, shows that [14]:
- Without depth-supervision loss, the pipeline fails to correctly learn the geometry.
- Disabling depth refinement results in corrupted stereo depth estimates and thus artifacts in the reconstruction.
- Artifacts can be reduced using depth-cueing ray marching.
Table 3. EndoNeRF Quantitative analysis [14].
Limitations
Nerfies
- Struggling with topological changes [5].
- Failing to reconstruct rapid motions [5].
- Static regions might shift since deformations are unconstrained [5].
- Possible hollow-face-illusion effect [5].
- Possible orientation flips [5].
D-NeRF
- Deformations are limited to translations [2].
- Difficulty reconstructing fast motion [2].
NSFF
- Difficulty extrapolating content unseen in training views [2].
- Difficulty retaining high-frequency details [3].
EndoNeRF
- Possible artifacts [14].
Conclusions
Neural rendering techniques like NeRF can reconstruct static scenes, and methods that augment NeRF to reconstruct dynamic scenes now exist. D-NeRF and Nerfies are similar, as both use a canonical configuration and a deformation configuration, but Nerfies applies per-frame deformations, whilst D-NeRF uses a continuous function with time as an additional input. Both methods offer good results with some limitations, such as dealing with rapid motion. EndoNeRF, a method based on D-NeRF for use in surgical procedures, shows promising results.
Personal Review
The Neural Radiance Field (NeRF) method is an exciting approach to producing geometrical reconstructions of scenes, and it does so with memory-efficient machine learning techniques. However, it only works on rigid scenes. Nerfies, D-NeRF and NSFF, along with other methods, were devised to augment NeRF and allow the reconstruction of dynamic scenes. Though all three methods improve on the existing NeRF model, Nerfies and D-NeRF seem to outperform NSFF, as shown in the quantitative analysis of the Nerfies paper. Nerfies and D-NeRF offer a similar solution architecture based on a generalised canonical template and a deformation configuration. However, unlike Nerfies, D-NeRF incorporates both time and view conditioning. Both models display an ability to accommodate moving objects and to reconstruct highly complex materials, but they still face limitations, such as dealing with rapid motions. Whilst D-NeRF claims to handle most rapid motion, Nerfies appears less capable of doing so. Nerfies also faces a problem in that it uses SfM and requires a rigid background for proper reconstruction; D-NeRF does not, but seems limited to deformations expressed as translations.

The growth of deformable neural radiance field models widens the field of application and the accessibility of 3D modelling technology. It offers big opportunities in fields like 3D media and healthcare; EndoNeRF is a good example of how neural rendering techniques could outperform existing methods in reconstructing deformable scenes. However, this is still a young technology, facing many limitations and in need of further investigation. Can these models deal with full-body motion? Can scenes under lighting variations be properly reconstructed? What happens when the background is moving? How much data is really needed for training? Can the extensive training time (1 week for Nerfies) and large number of iterations (80K of annealing for Nerfies, 800K for D-NeRF) be reduced? Can manual labelling, as in EndoNeRF's pipeline, be replaced by more efficient methods?
References
[1] Mildenhall B., Srinivasan P., Tancik M., Barron J., Ramamoorthi R. and Ng R. NeRF: Representing scenes as neural radiance fields for view synthesis. ECCV, 2020.
[2] Pumarola A., Corona E., Pons-Moll G. and Moreno-Noguer F. D-NeRF: Neural Radiance Fields for Dynamic Scenes. Max Planck Institute of Informatics, 2020.
[3] Li Z., Niklaus S., Snavely N. and Wang O. Neural scene flow fields for space-time view synthesis of dynamic scenes. arXiv preprint arXiv:2011.13084, 2020.
[4] Dou M., Khamis S., Degtyarev Y., Davidson P., Fanello S., Kowdle A., Escolano S., Rhemann C., Kim D., Taylor J., et al. Fusion4D: Real-time performance capture of challenging scenes. ACM ToG, 2016.
[5] Park K., Sinha U., Barron J., Bouaziz S., Goldman D., Seitz S. and Martin-Brualla R. Nerfies: Deformable Neural Radiance Fields. ICCV, 2021.
[6] Tancik M., Srinivasan P., Mildenhall B., Fridovich-Keil S., Raghavan N., Singhal U., Ramamoorthi R., Barron J and Ng R. Fourier features let networks learn high frequency functions in low dimensional domains. NeurIPS, 2020.
[7] Chhatkuli A., Pizarro D. and Bartoli A. Stable template-based isometric 3d reconstruction in all imaging conditions by linear least-squares. In CVPR, 2014.
[8] Niemeyer M., Mescheder L., Oechsle M. and Geiger A. Occupancy flow: 4d reconstruction by learning particle dynamics. In ICCV, 2019.
[9] Lombardi S., Simon T., Saragih J., Schwartz G., Lehrmann A. and Sheikh Y. Neural volumes: learning dynamic renderable volumes from images. TOG, 38(4), 2019.
[10] Du Y., Zhang Y., Yu H., Tenenbaum J., Wu J. Neural Radiance Flow for 4D View Synthesis and Video Processing. 2021.
[11] Schönberger J. and Frahm J. Structure-from-motion revisited. CVPR, 2016.
[12] Tucker R. and Snavely N. Single-view view synthesis with multiplane images. In Proc. Computer Vision and Pattern Recognition (CVPR), June 2020.
[13] Shih M., Su S., Kopf J. and Huang J. 3d photography using context-aware layered depth inpainting. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 8028–8038, 2020.
[14] Wang Y., Long Y., Fan S. and Dou Q. Neural Rendering for Stereo 3D Reconstruction of Deformable Tissues in Robotic Surgery. The Chinese University of Hong Kong, 2022.
[15] Li Z., Liu X., Drenkow N., Ding A., Creighton F.X., Taylor R.H. and Unberath M. Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers. ICCV, pages 6197–6206, 2021.
[16] Long Y., Li Z., Yee C.H., Ng C.F., Taylor R.H., Unberath M. and Dou Q. E-DSSR: Efficient dynamic surgical scene reconstruction with transformer-based stereoscopic depth perception. MICCAI, pages 415–425, Springer, 2021.