This blog post discusses the paper "pi-GAN: Periodic Implicit Generative Adversarial Networks for 3D-Aware Image Synthesis" [1] by Eric R. Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein from Stanford University.

Problem Definition and Introduction

GANs have been used to generate high-resolution, photorealistic images [9, 10, 11]. GANs create new data instances that resemble the training data and are generally used to synthesize 2D images; for example, a picture of a non-existent human face can be synthesized from a training set of human faces [9].

Figure: Images generated using the CELEBA-HQ dataset [9].

Due to the lack of photorealistic 3D training data, GANs are generally limited to generating 2D images [1]. This motivates another use of GANs: 3D-aware image synthesis. 3D-aware GANs aim to synthesize multiple views of a single object by learning neural scene representations unsupervised from 2D images, which can then be used to render view-consistent images from different camera poses [1].

Figure: Multiple-view examples synthesized by using pi-GAN [1].

Current approaches to 3D-aware image synthesis are good at decoupling identity from structure, allowing a single instance to be rendered from multiple poses. However, they lack multi-view consistency or fine detail [7, 16, 17, 21]. pi-GAN [1] improves both image quality and view consistency compared to previous approaches. SIREN [22] is more capable than ReLU-based implicit representations at representing fine details.

Animation: Comparison of three approaches [1] 

This paper [1] achieves its improvements by combining the following techniques:

  • SIREN-based implicit GANs as an alternative to convolution GAN architectures
  • A mapping network with FiLM [18] (feature-wise linear modulation) conditioning
  • Progressive growing from ProgressiveGAN [9]
  • Neural radiance fields from NeRF [15]

Animation: Example of a radiance field [15]

The authors [1] report that their method achieves state-of-the-art results on the CelebA [12], Cats [26], and CARLA [4, 21] datasets.

Animation: Results from pi-GAN [1]

Related Works

The related work consists of two parts: neural representations and rendering, and generative 3D-aware image synthesis.

Neural representations and rendering

Neural implicit 3D representations can be used to represent parts [30, 31], objects [32, 33, 34, 35, 36, 37, 38], or scenes [39, 40, 41, 42, 22]. Combined with neural rendering [43], such representations can be trained from multi-view 2D images [51, 40, 45, 46, 15, 36, 47, 48, 49].

In this part, the closest works to this paper are SIREN [22] and NeRF [15]. However, those works overfit individual objects or scenes, whereas pi-GAN [1] combines neural implicit representations with 3D-aware GANs.

Animation: ReLU activation or SIREN (periodic sine) activation for shape modelling [22]

Generative 3D-aware image synthesis

GANs [6] are generally used for image generation [19, 9, 10, 11], image-to-image translation [27], image editing [29], and learning from partial and noisy observations [3]. Most of them are limited to 2D images, although some implementations exploit the 3D shape of objects to generate 2D images from that shape [28, 5, 25, 16, 17, 13, 23, 24].

The work most similar to this paper is GRAF [21], a generative model for implicit radiance fields. The differences between pi-GAN [1] and GRAF [21] can be listed as follows:

  • pi-GAN uses SIREN [22] (periodic sine activation function) while GRAF [21] uses ReLU MLP.
  • GRAF's MLP generator is conditioned on both a shape noise code and an appearance noise code, while pi-GAN uses a StyleGAN-inspired [10] mapping network that conditions the MLP on a single input noise vector through FiLM conditioning.
  • pi-GAN uses a progressive growing [9] strategy during training.
  • GRAF [21] uses a patch-based discriminator, which pi-GAN [1] avoids, as SIREN [22] is prone to locally overfitting to the last batch if sufficient coverage of the space is not maintained.

Methodology

pi-GAN tries to learn radiance field representations from unlabeled 2D images, with the goal of synthesizing high-quality view consistent images [1].

SIREN-Based Implicit Radiance Field

3D objects are represented as a radiance field, parameterized as an MLP that takes a 3D coordinate \mathbf{x} = (x,y,z) and a viewing direction \mathbf{d} as input. The neural radiance field outputs a spatially varying density \sigma(\mathbf{x}) and a view-dependent color \mathbf{c}(\mathbf{x},\mathbf{d}). In addition, a StyleGAN-inspired [10] mapping network is used to condition the SIREN on a noise vector \mathbf{z} through FiLM conditioning. The model is shown in the figure below [1].

The pi-GAN generator architecture [1]


\begin{gather} \Phi(\mathbf{x})=\phi_{n-1} \circ \phi_{n-2} \circ ... \circ\phi_0(\mathbf{x}) \\ \phi_i(\mathbf{x}_i) = \sin(\gamma_i\cdot(\mathbf{W}_i\mathbf{x}_i+\mathbf{b}_i)+\beta_i) \end{gather}



The backbone \Phi is represented by the equations above: each layer \phi_i applies an affine transform \mathbf{W}_i\mathbf{x}_i+\mathbf{b}_i, scales the result by frequencies \gamma_i, shifts it by phases \beta_i, and applies a sine nonlinearity as the activation function. The frequencies \gamma_i and phase shifts \beta_i are produced by the mapping network, a simple ReLU MLP that takes the input noise vector \mathbf{z} and conditions each layer via FiLM [18]. On the output layer, density and color are defined as:


\begin{gather} \sigma(\mathbf{x}) = \mathbf{W}_\sigma \Phi(\mathbf{x}) + \mathbf{b}_\sigma \\ \mathbf{c}(\mathbf{x}, \mathbf{d}) = \mathbf{W}_c\phi_c([\Phi(\mathbf{x}),\mathbf{d}]^T)+\mathbf{b}_c \end{gather}
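
The PyTorch sketch below makes the FiLM-conditioned SIREN layers, the mapping network, and the two output heads concrete. It is a simplified illustration, not the authors' released code: layer widths are illustrative assumptions, and SIREN's frequency scaling and specialized initialization are omitted.

```python
import torch
import torch.nn as nn

class FiLMSirenLayer(nn.Module):
    """One layer: phi_i(x) = sin(gamma * (W x + b) + beta)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, gamma, beta):
        # gamma and beta are produced by the mapping network (FiLM conditioning).
        return torch.sin(gamma * self.linear(x) + beta)


class MappingNetwork(nn.Module):
    """Simple ReLU MLP mapping noise z to per-layer frequencies and phase shifts."""
    def __init__(self, z_dim, hidden_dim, n_layers, film_dim):
        super().__init__()
        self.n_layers, self.film_dim = n_layers, film_dim
        self.net = nn.Sequential(
            nn.Linear(z_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * n_layers * film_dim),
        )

    def forward(self, z):
        out = self.net(z).view(-1, self.n_layers, 2, self.film_dim)
        return out[:, :, 0], out[:, :, 1]          # gammas, betas


class SirenRadianceField(nn.Module):
    """Backbone Phi(x) plus density and view-dependent color heads."""
    def __init__(self, hidden_dim=256, n_layers=8, z_dim=256):
        super().__init__()
        self.mapping = MappingNetwork(z_dim, 256, n_layers + 1, hidden_dim)
        self.layers = nn.ModuleList(
            FiLMSirenLayer(3 if i == 0 else hidden_dim, hidden_dim)
            for i in range(n_layers)
        )
        self.sigma_head = nn.Linear(hidden_dim, 1)                     # sigma(x) = W_sigma Phi(x) + b_sigma
        self.color_layer = FiLMSirenLayer(hidden_dim + 3, hidden_dim)  # phi_c([Phi(x), d])
        self.color_head = nn.Linear(hidden_dim, 3)                     # c(x, d)

    def forward(self, x, d, z):
        # x, d: [batch, n_points, 3]; z: [batch, z_dim]
        gammas, betas = self.mapping(z)
        h = x
        for i, layer in enumerate(self.layers):
            h = layer(h, gammas[:, i:i + 1], betas[:, i:i + 1])
        sigma = self.sigma_head(h)
        h_c = self.color_layer(torch.cat([h, d], dim=-1), gammas[:, -1:], betas[:, -1:])
        color = self.color_head(h_c)
        return sigma, color
```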

Neural Rendering

Neural volume rendering is used to render the neural radiance field from arbitrary camera poses \xi, using a pinhole camera model with rays cast from the camera origin \mathbf{o}, as shown in the figure [1].

Figure: Visualization of neural volume rendering procedure [1]
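
Before turning to the rendering integral, the sketch below shows how pinhole-camera rays \mathbf{r}(t) = \mathbf{o}+t\mathbf{d} are typically generated from a camera-to-world pose. This is a generic NeRF-style sketch, not the authors' code; the focal length and image size are illustrative parameters.

```python
import torch

def get_rays(pose, height, width, focal):
    """Generate pinhole-camera rays r(t) = o + t d for every pixel.

    pose:  [4, 4] camera-to-world matrix
    focal: focal length in pixels (illustrative value)
    """
    i, j = torch.meshgrid(
        torch.arange(width, dtype=torch.float32),
        torch.arange(height, dtype=torch.float32),
        indexing="xy",
    )
    # Directions in camera space (camera looks along -z).
    dirs = torch.stack([
        (i - width * 0.5) / focal,
        -(j - height * 0.5) / focal,
        -torch.ones_like(i),
    ], dim=-1)
    # Rotate directions into world space and attach the shared origin o.
    rays_d = (dirs[..., None, :] * pose[:3, :3]).sum(dim=-1)
    rays_o = pose[:3, 3].expand(rays_d.shape)
    return rays_o, rays_d
```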

At every sample, the generator predicts a density \sigma and a color \mathbf{c}. The pixel color \mathbf{C} for a camera ray \mathbf{r}(t) = \mathbf{o}+t\mathbf{d} is then computed with the volume rendering equation [14], integrated from a near bound to a far bound. pi-GAN [1] implements a discretized form introduced by NeRF [15] and also used by GRAF [21]:


\begin{gather} \mathbf{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\sigma(\mathbf{r}(t))\mathbf{c}(\mathbf{r}(t),\mathbf{d})dt \\ \text{where } T(t)=\exp(-\int_{t_n}^t \sigma(\mathbf{r}(s))ds). \end{gather}
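
A minimal sketch of the discretized quadrature (NeRF-style alpha compositing) is shown below. It is a generic implementation, not the authors' exact code; the function name and tensor layout are assumptions, and the densities/colors are the generator's per-sample predictions.

```python
import torch

def render_rays(sigma, color, t_vals, ray_dirs):
    """Alpha-composite per-sample density/color along each ray.

    sigma:    [N_rays, N_samples, 1]  predicted densities
    color:    [N_rays, N_samples, 3]  predicted colors
    t_vals:   [N_rays, N_samples]     sample depths between t_n and t_f
    ray_dirs: [N_rays, 3]             ray directions d
    """
    # Distances between adjacent samples; the last interval is effectively unbounded.
    deltas = t_vals[:, 1:] - t_vals[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)
    deltas = deltas * ray_dirs.norm(dim=-1, keepdim=True)

    # alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - torch.exp(-torch.relu(sigma.squeeze(-1)) * deltas)

    # Transmittance T_i = prod_{j < i} (1 - alpha_j)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:, :1]), 1.0 - alphas + 1e-10], dim=-1), dim=-1
    )[:, :-1]

    weights = alphas * trans                            # contribution of each sample
    rgb = (weights.unsqueeze(-1) * color).sum(dim=1)    # C(r)
    depth = (weights * t_vals).sum(dim=1)               # expected depth (useful for depth maps)
    return rgb, depth, weights
```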


Discriminator

Similar to ProgressiveGAN [9], pi-GAN [1] uses a convolutional discriminator whose parameters grow progressively. Training starts with large batches of low-resolution images to learn coarse shapes; as training progresses, new layers are added to the discriminator to process higher-resolution images and discriminate finer details.
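
A simplified sketch of a progressively growing discriminator with ProgressiveGAN-style fade-in blending is given below. It is only an illustration of the growing mechanism: the block design, channel counts, and `stage`/`alpha` interface are assumptions, and the actual pi-GAN discriminator differs in its details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveDiscriminator(nn.Module):
    """Conv discriminator that grows as the image resolution doubles (sketch)."""
    def __init__(self, max_stages=4, base_channels=64):
        super().__init__()
        # One downsampling block per resolution stage, plus a from-RGB adapter per stage.
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(base_channels, base_channels, 3, padding=1),
                nn.LeakyReLU(0.2),
                nn.AvgPool2d(2),
            ) for _ in range(max_stages)
        ])
        self.from_rgb = nn.ModuleList([nn.Conv2d(3, base_channels, 1) for _ in range(max_stages)])
        self.head = nn.Linear(base_channels, 1)

    def forward(self, img, stage, alpha):
        """stage: number of active blocks; alpha in [0, 1] fades in the newest block."""
        x = self.blocks[stage - 1](self.from_rgb[stage - 1](img))
        if stage > 1 and alpha < 1.0:
            # Blend with the lower-resolution pathway while the new block fades in.
            low = self.from_rgb[stage - 2](F.avg_pool2d(img, 2))
            x = alpha * x + (1.0 - alpha) * low
        for idx in reversed(range(stage - 1)):
            x = self.blocks[idx](x)
        x = x.mean(dim=[2, 3])          # global average pool
        return self.head(x)
```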

Training Details

During training, camera poses \xi are sampled from a pose distribution p_\xi, which is Gaussian for CelebA and Cats and uniform for CARLA. Camera positions are constrained to the surface of the unit sphere with the camera pointed towards the origin, and pitch and yaw rotations are sampled along the sphere. Real images I are sampled from the training dataset with distribution p_\mathcal{D}. A non-saturating GAN loss with R1 regularization is used, which the generator tries to minimize and the discriminator tries to maximize:


\mathcal{L}(\theta_G, \theta_D) = \mathbf{E}_{\mathbf{z}\sim p_z,\, \xi \sim p_\xi} [f(D_{\theta_D}(G_{\theta_G}(\mathbf{z},\xi)))]+\mathbf{E}_{I\sim p_\mathcal{D}}[f(-D_{\theta_D}(I))+\lambda \|\nabla D_{\theta_D}(I)\|^2],\\ \text{ where } f(u)= -\log(1+\exp(-u))
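
As a minimal sketch of these loss terms, here is the softplus form common to StyleGAN-style implementations, written as two losses to minimize in practice. The function names are hypothetical, and \lambda = 10 is an illustrative default rather than necessarily the paper's value.

```python
import torch
import torch.nn.functional as F

# Note: f(u) = -log(1 + exp(-u)) = -softplus(-u).

def generator_loss(d_fake_logits):
    """Non-saturating generator loss on D(G(z, xi))."""
    return F.softplus(-d_fake_logits).mean()

def discriminator_loss(d_fake_logits, d_real_logits, real_images, r1_lambda=10.0):
    """Non-saturating discriminator loss with an R1 gradient penalty on real images.

    real_images must have requires_grad_(True) before the discriminator forward
    pass so the R1 gradient can be computed.
    """
    loss = F.softplus(d_fake_logits).mean() + F.softplus(-d_real_logits).mean()
    grad = torch.autograd.grad(
        outputs=d_real_logits.sum(), inputs=real_images, create_graph=True
    )[0]
    r1 = grad.pow(2).reshape(grad.shape[0], -1).sum(dim=1).mean()
    return loss + r1_lambda * r1
```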


For training, the Adam optimizer is used with \beta_1=0 and \beta_2=0.9; the learning rates are 5\times10^{-5} for the generator and 4\times10^{-4} for the discriminator, decayed to 1\times10^{-5} and 1\times10^{-4} respectively.
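
Putting these training details together, the sketch below shows one way to sample camera poses on the unit sphere looking at the origin, plus the optimizer configuration quoted above. The yaw/pitch spreads are illustrative assumptions rather than the paper's exact distribution parameters, and the networks are stand-in modules.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_camera_poses(batch_size, yaw_std=0.3, pitch_std=0.15, radius=1.0, uniform=False):
    """Sample camera-to-world poses on a sphere, looking at the scene origin.

    yaw_std / pitch_std are illustrative; uniform=True mimics a CARLA-like
    uniform pose distribution over the upper hemisphere.
    """
    if uniform:
        yaw = 2 * math.pi * torch.rand(batch_size)
        pitch = (math.pi / 2) * torch.rand(batch_size).clamp(min=1e-2)
    else:
        yaw = math.pi / 2 + yaw_std * torch.randn(batch_size)     # Gaussian around frontal
        pitch = math.pi / 2 + pitch_std * torch.randn(batch_size)

    # Camera origin o on the sphere (y axis is "up").
    origins = radius * torch.stack([
        torch.sin(pitch) * torch.cos(yaw),
        torch.cos(pitch),
        torch.sin(pitch) * torch.sin(yaw),
    ], dim=-1)

    # Look-at rotation: the camera's -z axis points towards the origin.
    forward = F.normalize(-origins, dim=-1)
    world_up = torch.tensor([0.0, 1.0, 0.0]).expand_as(forward)
    right = F.normalize(torch.cross(forward, world_up, dim=-1), dim=-1)
    up = torch.cross(right, forward, dim=-1)

    poses = torch.eye(4).repeat(batch_size, 1, 1)
    poses[:, :3, :3] = torch.stack([right, up, -forward], dim=-1)  # camera-to-world rotation
    poses[:, :3, 3] = origins
    return poses                                                   # [batch_size, 4, 4]

# Optimizers with the hyperparameters quoted above (the networks are stand-ins).
generator, discriminator = nn.Linear(256, 3), nn.Linear(3, 1)
opt_g = torch.optim.Adam(generator.parameters(), lr=5e-5, betas=(0.0, 0.9))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.0, 0.9))
```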

Experiments & Results

Evaluating Image Quality

pi-GAN is compared against two previous approaches to 3D-aware image synthesis, HoloGAN [16] and GRAF [21], on three datasets: CelebA, Cats, and CARLA [12, 26, 4, 21].

Animation: Comparison of three approaches [1] 

According to the authors [1], HoloGAN achieves good image quality with sharper details but lacks multi-view consistency: the identity of the object changes with rotation, especially on the CARLA dataset. GRAF is more capable of recovering wide viewing angles thanks to its 3D representation, but falls short on rendering finer details such as hair and teeth, so the generated images look cartoon-like. pi-GAN renders finer images across wide viewing angles and does better than both approaches. All three approaches were benchmarked in terms of Frechet Inception Distance (FID) [8], Kernel Inception Distance (KID) [2], and Inception Score [20] on all three datasets [12, 26, 4, 21]. pi-GAN achieved better scores than HoloGAN [16] and GRAF [21].

Table: Quantitative results for three datasets and three approaches [1]

Generating Approximate 3D Representations

Using a 3D-aware representation grants the advantage of rendering and interpreting poses that were unseen or uncommon at training time. pi-GAN generalizes to wider angles and can still produce results, although outliers show visual artifacts. These artifacts are caused by the imbalance of datasets that focus on front-facing images; a uniformly distributed dataset such as CARLA does not suffer from them.

Figure: Generating wider-angle representations [1]

Also, because radiance fields are used, the camera can be zoomed out even when training used cropped images, with the radiance field extrapolating beyond the training crops.

Figure: Zooming out with the extrapolation of the radiance fields [1]

The semantic meaningfulness of the latent space learned by pi-GAN is demonstrated by interpolating between latent codes, as shown.

Figure: Latent space interpolation [1]
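
Given a trained generator, such interpolations are straightforward to produce. In the hypothetical sketch below, generator(z, pose) stands in for a pretrained pi-GAN model; it is not the authors' interface.

```python
import torch

@torch.no_grad()
def interpolate_latents(generator, z_a, z_b, pose, n_steps=8):
    """Render frames while linearly interpolating between two latent codes."""
    frames = []
    for t in torch.linspace(0.0, 1.0, n_steps):
        z = (1 - t) * z_a + t * z_b        # linear interpolation in latent space
        frames.append(generator(z, pose))
    return torch.stack(frames)
```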

The 3D structure generated by pi-GAN can be extracted either as a mesh via marching cubes, as shown for CARLA, or as a projected depth map, as shown for Cats and CelebA.

Figure: Extracted 3D structures [1]
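
A hedged sketch of the marching-cubes extraction is given below, assuming a radiance-field callable like the one sketched earlier that returns (sigma, color). The grid bound and density threshold are illustrative values, not the paper's settings.

```python
import torch
from skimage import measure  # marching cubes implementation

@torch.no_grad()
def extract_mesh(radiance_field, z, resolution=128, bound=0.3, threshold=10.0):
    """Sample density on a dense grid and run marching cubes."""
    coords = torch.linspace(-bound, bound, resolution)
    grid = torch.stack(torch.meshgrid(coords, coords, coords, indexing="ij"), dim=-1)
    pts = grid.reshape(1, -1, 3)
    dirs = torch.zeros_like(pts)                  # viewing direction is irrelevant for density
    sigma, _ = radiance_field(pts, dirs, z)
    sigma = sigma.reshape(resolution, resolution, resolution).cpu().numpy()

    verts, faces, normals, _ = measure.marching_cubes(sigma, level=threshold)
    return verts, faces, normals
```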

An ablation of the radiance field's conditioning is shown in the table: the best FID score is achieved with sine activations and FiLM conditioning, compared to radiance fields conditioned by concatenation and to ReLU networks with positional encoding.

Table: FID scores on CelebA @ 64 x 64 when comparing different activation functions and conditioning methods [1]

Progressive growing allows larger batch sizes at lower resolutions early in training, which helped achieve better FID results.

Figure: FID scores comparing the progressive growing strategy with fixed-size images [1]

Discussion, Future Work & Conclusion

Using an input image and a pretrained pi-GAN generator, it is possible to perform single-view 3D reconstruction, as shown in the figure.

Figure: Single-view 3D reconstruction from an input image and pretrained generator [1]
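
A minimal latent-optimization sketch of this idea is shown below. Here generator(z, pose) is a placeholder for a frozen pretrained pi-GAN generator, and the paper's exact inversion procedure may differ (for instance, in what is optimized and which losses are used).

```python
import torch
import torch.nn.functional as F

def fit_single_view(generator, target_image, pose, z_dim=256, steps=500, lr=1e-2):
    """Recover a latent code that reproduces a single input image (hedged sketch)."""
    z = torch.randn(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        rendered = generator(z, pose)
        loss = F.mse_loss(rendered, target_image)   # photometric reconstruction loss
        loss.backward()
        opt.step()
    return z.detach()
```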

In addition, while unsupervised learning of 3D shapes was not the focus of pi-GAN, it can still produce interpretable, view-consistent 3D representations. One direction for future work is refining the quality of the extracted meshes.

In some cases pi-GAN produced radiance fields that yielded viable images but failed at 3D shape extraction, for example representing the shape of a face as concave. Also, while pi-GAN increased the image quality of 3D-aware GANs, much work remains to match state-of-the-art convolutional 2D GANs in image quality. In addition, pi-GAN is computationally expensive compared to 2D GANs, since the generator's cost scales not only with the image size but also with the number of samples along each ray. Future work could investigate such failure cases, improve image quality, and reduce the computational cost of pi-GAN.

Figure: A failure case where pi-GAN generates a concave 3D representation of a face [1]

On the ethical side, pi-GAN could be used to extend DeepFakes, which generate fake photos and videos of real people and pose a societal threat. Also, the CelebA dataset lacks diversity, which may bias the face results shown in the paper.

To conclude, pi-GAN improved the results in the area of photorealistic 3D-aware image synthesis.

Review

I found the paper interesting, easily readable, and enjoyable. I especially liked these points:

  • Code is publicly available.
  • Experiments are conducted in a way that is directly comparable with previous approaches.
  • Several techniques, such as FiLM, SIREN, progressive growing, and radiance fields, are introduced and combined.

As for weaknesses and suggestions for improving this paper:

  • The paper assumes some background: I had to research what a radiance field is and read about NeRF, and to look up FiLM and SIREN and why sine activations outperform ReLU here. It also left me wondering why periodic sine activations are not used more widely in standard deep learning models instead of ReLU.
  • I saw a recently released paper named "ADOP: Approximate Differentiable One-Pixel Point Rendering" [49], and I think some of its ideas could be adopted for the neural representation/rendering part of pi-GAN as future work (see the ADOP figure below).
  • Another recently released paper (December 9, 2021), "Plenoxels: Radiance Fields without Neural Networks" [50], could be an alternative to the radiance field component.

Figure: ADOP [49] used for neural rendering

Figure: Plenoxels [50] used as an alternative to NeRF [15]

References

[1] Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. arXiv preprint arXiv:2012.00926, 2020.

[2] Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. In Proc. ICLR, 2018.

[3] Ashish Bora, Eric Price, and Alexandros G. Dimakis. AmbientGAN: Generative models from lossy measurements. In Proc. ICLR, 2018.

[4] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. In Proc. CoRL, 2017.

[5] Matheus Gadelha, Subhransu Maji, and Rui Wang. 3d shape induction from 2d views of multiple objects. In Proc. 3DV, 2017.

[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proc. NeurIPS, 2014.

[7] Philipp Henzler, Niloy J Mitra, and Tobias Ritschel. Escaping plato’s cave: 3d shape from adversarial rendering. In Proc. ICCV, 2019.

[8] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a nash equilibrium. In Proc. NeurIPS, 2017.

[9] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In Proc. ICLR, 2018.

[10] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proc. CVPR, 2019.

[11] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proc. CVPR, 2020.

[12] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proc. ICCV, 2015.

[13] Sebastian Lunz, Yingzhen Li, Andrew Fitzgibbon, and Nate Kushman. Inverse graphics gan: Learning to generate 3d shapes from unstructured 2d data. arXiv preprint arXiv:2002.12674, 2020.

[14] N. Max. Optical models for direct volume rendering. IEEE TVCG, 1(2):99–108, 1995.

[15] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In Proc. ECCV, 2020.

[16] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. Hologan: Unsupervised learning of 3d representations from natural images. In Proc. ICCV, 2019.

[17] Thu Nguyen-Phuoc, Christian Richardt, Long Mai, Yong-Liang Yang, and Niloy Mitra. Blockgan: Learning 3d objectaware scene representations from unlabelled images. In Proc. NeurIPS, 2020.

[18] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proc. AAAI, 2018

[19] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In Proc. ICLR, 2016.

[20] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Proc. NeurIPS, 2016.

[21] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. Graf: Generative radiance fields for 3d-aware image synthesis. In Proc. NeurIPS, 2020.

[22] Vincent Sitzmann, Julien N.P. Martel, Alexander W. Bergman, David B. Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. In Proc. NeurIPS, 2020

[23] Attila Szabó, Givi Meishvili, and Paolo Favaro. Unsupervised generative 3d shape learning from natural images. arXiv preprint arXiv:1910.00287, 2019.

[24] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Multi-view 3D models from single images with a convolutional network. In Proc. ECCV, 2016.

[25] Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T. Freeman, and Joshua B. Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Proc. NeurIPS, 2016.

[26] Weiwei Zhang, Jian Sun, and Xiaoou Tang. Cat head detection - how to effectively exploit shape and texture features. In Proc. ECCV, 2008.

[27] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proc. ICCV, 2017.

[28] Jun-Yan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun Wu, Antonio Torralba, Joshua B. Tenenbaum, and William T. Freeman. Visual object networks: Image generation with disentangled 3D representations. In Proc. NeurIPS, 2018.

[29] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proc. CVPR, 2018.

[30] Kyle Genova, Forrester Cole, Daniel Vlasic, Aaron Sarna, William T Freeman, and Thomas Funkhouser. Learning shape templates with structured implicit functions. In Proc. ICCV, 2019.

[31] Kyle Genova, Forrester Cole, Avneesh Sud, Aaron Sarna, and Thomas Funkhouser. Local deep implicit functions for 3d shape. In Proc. CVPR, 2020.

[32] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. In Proc. CVPR, 2019.

[33] Mateusz Michalkiewicz, Jhony K Pontes, Dominic Jack, Mahsa Baktashmotlagh, and Anders Eriksson. Implicit surface representations as layers in neural networks. In Proc. ICCV, 2019.

[34] Matan Atzmon and Yaron Lipman. SAL: Sign agnostic learning of shapes from raw data. In Proc. CVPR, 2020.

[35] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. In Proc. ICML, 2020.

[36] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Ronen Basri, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. In Proc. NeurIPS, 2020.

[37] Thomas Davies, Derek Nowrouzezahrai, and Alec Jacobson. Overfit neural networks as a compact shape representation. arXiv preprint arXiv:2009.09808, 2020.

[38] Rohan Chabra, Jan Eric Lenssen, Eddy Ilg, Tanner Schmidt, Julian Straub, Steven Lovegrove, and Richard Newcombe. Deep local shapes: Learning local sdf priors for detailed 3d reconstruction. In Proc. ECCV, 2020.

[39] SM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S Morcos, Marta Garnelo, Avraham Ruderman, Andrei A Rusu, Ivo Danihelka, Karol Gregor, et al. Neural scene representation and rendering. Science, 360(6394):1204–1210, 2018.

[40] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In Proc. NeurIPS 2019, 2019.

[41] Chiyu Jiang, Avneesh Sud, Ameesh Makadia, Jingwei Huang, Matthias Nießner, and Thomas Funkhouser. Local implicit grid representations for 3d scenes. In Proc. CVPR, 2020.

[42] Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. Convolutional occupancy networks. In Proc. ECCV, 2020.

[43] Ayush Tewari, Ohad Fried, Justus Thies, Vincent Sitzmann, Stephen Lombardi, Kalyan Sunkavalli, Ricardo Martin-Brualla, Tomas Simon, Jason Saragih, Matthias Nießner, et al. State of the art on neural rendering. Proc. Eurographics, 2020.

[44] Michael Oechsle, Lars Mescheder, Michael Niemeyer, Thilo Strauss, and Andreas Geiger. Texture fields: Learning texture representations in function space. In Proc. ICCV, 2019.

[45] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In Proc. CVPR, 2020.

[46] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. In Proc. NeurIPS, 2020.

[47] Yue Jiang, Dantong Ji, Zhizhong Han, and Matthias Zwicker. Sdfdiff: Differentiable rendering of signed distance fields for 3d shape optimization. In Proc. CVPR, 2020.

[48] Shaohui Liu, Yinda Zhang, Songyou Peng, Boxin Shi, Marc Pollefeys, and Zhaopeng Cui. Dist: Rendering deep implicit signed distance function with differentiable sphere tracing. In Proc. CVPR, 2020.

[49] Darius Rückert, Linus Franke, and Marc Stamminger. ADOP: Approximate differentiable one-pixel point rendering. arXiv preprint arXiv:2110.06635, 2021.

[50] Alex Yu, Sara Fridovich-Keil, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. arXiv preprint arXiv:2112.05131, 2021.

[51] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proc. ICCV, 2019.
