In this blog post, the definition and development of image synthesis are outlined, popular GAN-based methods are briefly classified and presented, and their performance is compared under different evaluation criteria. The relevant papers are listed as references at the end.


Author: Juan Carlos Climent Pardo 


1. Introduction to Image Synthesis

Image synthesis is the process of artificially generating images that contain some particular and desired content. Plainly said: it is the challenge of making "fake" images look more real. A clear example is the generation of new images of human faces, as seen in figure 1. Here, the images in the first row and the first column serve as the sources of styles for style mixing. The faces in the second, third, and fourth rows are generated using the coarse styles from the faces in the top row, and the middle and fine styles from the leftmost face in each row (i.e. the first column).

But there are also other ways of performing image synthesis, for example adding noise to images in order to generate new ones. This technique is usually employed in Generative Adversarial Networks (GANs), which will be introduced in the next subsections. There are other interesting application domains too, such as medical image synthesis, which has grown hugely in popularity over the last years due to the data scarcity inherent to some medical imaging modalities like computed tomography or magnetic resonance scans.


Challenges

As briefly hinted, the biggest challenges in image synthesis correlate with data quantity or, rather, data scarcity. In low-data regimes, GAN-based models tend to overfit during training, "learning by heart" the features of the few existing images instead of generalizing them. The generating part of the system then never learns meaningful features, which ultimately leads to diverging performance. On the other hand, the system can also suffer from mode collapse, which essentially means that the generating part of the model can only produce a single type, or a small set, of outputs. As seen in figure 2, the system ends up generating only homogeneous images, visualized by the black squares. Therefore, few-shot settings, where only a few images are available, have become an interesting research field.



Furthermore, when complicated imaging techniques such as the medical ones are used, data collection itself can become a problem, leading to mismatches between training and target distributions, as training pairs are hardly available and (pixel-wise) annotations are really costly. Figure 3 visualizes the desired settings for image synthesis training models in green, the needed, optimal requirements in yellow, and the mentioned risks and challenges in red.



Generative Adversarial Networks (GANs)


GANs are popular methods for image synthesis, as the nature of the needed data makes them suitable for handling the mentioned challenges. They have a simple structure, consisting of two networks, the so-called generator and discriminator, which are comically described in figure 4 and depicted as a block diagram in figure 5. The generator usually receives a vector of random values as input and tries to generate data with the same structure as the training data. The discriminator then receives batches containing observations from both the training data and the generator's output, and attempts to classify each observation as "real" or "generated". In plain words, one could say that the generator tries to "fool" the discriminator by learning to generate realistic images, while the discriminator tries to tell whether its input is real or not.
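To make this adversarial interplay concrete, the following minimal PyTorch sketch shows one training step of a vanilla GAN. The network sizes, latent dimension, and optimizer settings are illustrative assumptions, not taken from any of the referenced papers:

```python
import torch
import torch.nn as nn

latent_dim = 100  # size of the random input vector (illustrative choice)

# Generator: maps a latent vector to a flattened 28x28 image in [-1, 1].
G = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Tanh(),
)
# Discriminator: maps a flattened image to a probability of being "real".
D = nn.Sequential(
    nn.Linear(784, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_images):  # real_images: (batch, 784) tensor
    batch = real_images.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator update: push real images toward 1, generated toward 0.
    fake_images = G(torch.randn(batch, latent_dim)).detach()  # freeze G here
    loss_D = bce(D(real_images), ones) + bce(D(fake_images), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator update: try to "fool" D into labeling fakes as real.
    loss_G = bce(D(G(torch.randn(batch, latent_dim))), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```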


2. Few-Shot Image Synthesis Methods

Image synthesis methods are first classified into three groups according to the amount of data they use, and then according to the type of variations they perform on a pre-processing (data selection), processing (GAN optimization), or structural (knowledge sharing) level. For each type of variation, two methods are then explained in more detail.

Classification through data quantity

First, different image synthesis methods will be considered and classified with regard to the quantity of data used. There are three classes:

  1. One-shot methods, which use only a single image for training. Here the algorithms cannot learn much diversity content-wise; rather, the components in the generated images follow different distributions, which essentially makes them new images.
  2. Few-shot methods, which normally use between 10 and 100 training images. These methods are explained further below, focusing on the changes performed on the generator, the discriminator, or the data augmentation.
  3. Limited-data methods, which usually take between 1,000 and 7,000 input images. These approaches tend to outperform the two above simply because of the data quantity, which allows for better generalization and feature learning.

Classification through approach class

A different way of classifying few-shot methods is by which relevant data distributions they address (see figure 7). The initial data distribution consists of the real images fed into the GAN system at the beginning of training. The generated distribution is made up of the "fake" images produced by the generator, and the empirical distribution is the limited data distribution used for training, consisting of both "real" and "fake" images. The expected distribution, finally, equals the target distribution.

These distributions help distinguish the methods into three approach classes:

  1. Data Selection: This addresses how training data is selected in the first place. With dense sampling, the difficulties of GAN training can be reduced, yet a reduction in data diversity is inevitable. Conversely, sparse sampling increases the diversity but also the training instability. Finding a compromise between these two factors is usually a big challenge in GAN training.
  2. GAN Optimization: This approach focuses on how to augment the training data without leading the discriminator to overfit, which regularization techniques can be applied and how to choose their strength, and how the model architecture may be changed so that optimized structures interact with each other.
  3. Knowledge Sharing: This last approach is based on pre-training a GAN on a source dataset different from the expected data, and then fine-tuning it on a target dataset. This way, data demand is reduced, and training speeds up effectively.



Discriminator

When solely considering the discriminator, the most common problem that arises is the already named overfitting. [3] proposes the FastGAN structure, which provides a strong regularization in the loss function and implements the discriminator in a self-supervised fashion. The discriminator is treated as an encoder that can consequently be trained with small decoders. This principle of auto-encoding training forces the discriminator (denoted by D in figure 8) to extract image feature maps (denoted by F) on which the decoders can produce good reconstructions. The decoders can then be optimized together with the discriminator on one reconstruction loss. This method works well in few-shot settings, as it extracts a more comprehensive representation from the inputs, and auto-encoding is generally used to improve the model's robustness and generalization ability.
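The following simplified PyTorch sketch illustrates the auto-encoding idea: the discriminator doubles as an encoder whose feature map F a small decoder must reconstruct, and the reconstruction loss is added to the adversarial loss. The layer sizes, the hinge loss, and the single decoder are illustrative assumptions, not the exact FastGAN architecture from [3]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Discriminator that also exposes an intermediate feature map F."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(  # the "encoder" view of D
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        self.head = nn.Conv2d(128, 1, 4)  # real/fake score head

    def forward(self, x):
        feat = self.encoder(x)                        # feature map F
        return self.head(feat).mean([1, 2, 3]), feat  # score + features

D = Discriminator()
# Small decoder that reconstructs the input image from F (upsamples 4x,
# matching the two stride-2 convolutions of the encoder).
decoder = nn.Sequential(
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
)

def discriminator_loss(real_images, fake_images):
    score_real, feat_real = D(real_images)
    score_fake, _ = D(fake_images)
    # Hinge adversarial loss (one common choice).
    adv = F.relu(1.0 - score_real).mean() + F.relu(1.0 + score_fake).mean()
    # Auto-encoding term: the decoder must reconstruct real images from F,
    # which regularizes D; decoder and D are optimized on it jointly.
    recon = F.mse_loss(decoder(feat_real), real_images)
    return adv + recon
```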



[8] presents a generative co-training approach for GANs, which implements weight- and data-discrepancy co-training of discriminators. Co-training per se aims to learn multiple complementary classifiers from different views in order to train more generalizable models. Weight-discrepancy co-training (WeCo) therefore co-trains multiple distinctive discriminators by diversifying their parameters. This is seen in figure 9 below, where for example discriminator 1 (D1) and discriminator 2 (D2) are differentiated by a weight-discrepancy loss in the upper, pink part. In the data-discrepancy co-training (DaCo) setting, the co-training is implemented by feeding the discriminators with different views of the input images (e.g., different frequency components of the input images). In this case, D1 and D3 receive different views of input x.
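A minimal sketch of the weight-discrepancy idea, assuming two discriminators with identical architectures: penalizing the similarity of their flattened parameter vectors pushes them to learn complementary views. The cosine formulation here is illustrative, not the exact WeCo loss of [8]:

```python
import torch
import torch.nn.functional as F

def weight_discrepancy(d1, d2):
    """Cosine similarity between the flattened parameter vectors of two
    discriminators; adding it to the training loss keeps their weights
    apart so that they learn complementary views of the data.
    (Illustrative formulation, not the exact WeCo loss of [8].)"""
    w1 = torch.cat([p.flatten() for p in d1.parameters()])
    w2 = torch.cat([p.flatten() for p in d2.parameters()])
    return F.cosine_similarity(w1, w2, dim=0)

# Usage (lambda_weco is a hypothetical weighting factor):
# loss = adv_loss_d1 + adv_loss_d2 + lambda_weco * weight_discrepancy(D1, D2)
```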


Generator

When solely considering the generator, the most common problem that arises is the already named mode collapse. The two papers presented in this section tackle that problem by focusing on changes to the generator. [5] presents a method that makes use of a semantic prior: a state-of-the-art StyleGAN generator is used to extract a feature map for a few semantically labeled images, which is then combined with the semantic labels. With dense semantic labeling, masked average pooling is performed to calculate representative latent vectors, which are then fed into a cosine-similarity function to recreate pseudo-labeled images. For sparse labels, pixel-wise feature extraction is used instead, and the latent vectors are fed into a "top-k matching" function. Both variants can be found in figure 10 below.
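A minimal sketch of masked average pooling for the dense-label variant: the feature map extracted by the pre-trained generator is averaged over all pixels belonging to each semantic class, yielding one representative latent vector per class. The function name and shapes here are assumptions for illustration:

```python
import torch

def masked_average_pooling(features, mask, num_classes):
    """One prototype vector per semantic class.

    features: (C, H, W) feature map from the pre-trained StyleGAN generator.
    mask:     (H, W) integer semantic label map aligned with the features.
    Returns:  (num_classes, C) tensor of per-class latent vectors.
    """
    C = features.size(0)
    prototypes = torch.zeros(num_classes, C)
    for c in range(num_classes):
        region = mask == c                 # boolean pixel mask for class c
        if region.any():
            # Average the feature vectors of all pixels labeled with class c.
            prototypes[c] = features[:, region].mean(dim=1)
    return prototypes
```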



In [6] a cross-domain correspondence approach is taken, which belongs to the knowledge-sharing class mentioned above. The main idea is to perform GAN adaptation via transfer learning: a large source domain is used for pre-training, and diversity information is transferred from source to target by preserving the relative similarities and differences between instances. This transfer is evaluated by a novel cross-domain distance consistency loss, accompanied by a cross-domain spatial structural consistency loss, which helps to align the spatial structural information between the synthesized image pairs of the source and target domains. Furthermore, [6] also proposes cross-domain alignment relaxation options, compressing the original latent space of the generative model to a subspace. This allows the model to capture the essential features of the initial image distribution and carry them over into the target domain, which is usually in a few-shot setting, as depicted in figure 11 below.
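A hedged sketch of what a cross-domain distance consistency term can look like: for the same batch of latent codes, each image's similarity distribution over the other images should match between the source and target generators. This illustrative formulation does not reproduce the exact losses of [6]:

```python
import torch
import torch.nn.functional as F

def distance_consistency(feats_source, feats_target):
    """Preserve relative pairwise similarities from source to target.

    feats_source / feats_target: (B, D) features of images generated by the
    frozen source generator and the adapted target generator from the SAME
    batch of latent codes. (Illustrative sketch, not the exact loss of [6].)
    """
    def similarity_rows(f):
        sim = F.cosine_similarity(f.unsqueeze(1), f.unsqueeze(0), dim=2)
        B = sim.size(0)
        off_diag = ~torch.eye(B, dtype=torch.bool)  # drop self-similarity
        return sim[off_diag].view(B, B - 1)

    p_src = F.softmax(similarity_rows(feats_source), dim=1)
    log_p_tgt = F.log_softmax(similarity_rows(feats_target), dim=1)
    # Match each image's similarity distribution across the two domains.
    return F.kl_div(log_p_tgt, p_src, reduction='batchmean')
```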


Data Augmentation

In few-shot settings, data augmentation plays a really important role. For conventional classification problems, data augmentation is performed dynamically, e.g. the distribution of the images is adjusted by transformations such as cropping, flipping, scaling, or color jittering. Training a classifier under such conditions leads to increasing invariance to these semantics-preserving distortions, which is a highly desirable quality in a classifier. GANs, in contrast, if trained under similar dataset augmentations, learn to generate the augmented distribution, since the transformations are only applied to the real images and the generator is encouraged to match the distribution of the augmented images. If both real and generated images are augmented only when training the discriminator, the subtle balance between generator and discriminator is broken, leading to poor convergence, as they optimize different objectives. These cases are the so-called "leaking" augmentations, the main problem in low-data GAN settings. The two approaches in this subsection aim at efficiently improving data augmentation.

[4] presents a differentiable data augmentation approach to prevent the discriminator from overfitting while ensuring that none of the augmentations leak into the generated images. An overview of the "DiffAugment" technique is found in figure 12 below, which depicts the updating procedures of the discriminator (D, on the left) and the generator (G, on the right). DiffAugment applies the augmentation T to both the real samples x and the generated output G(z). When the generator is updated, gradients need to be back-propagated through T, which requires T to be differentiable with respect to its input; hence the name differentiable augmentation.
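A minimal sketch of this principle, with a deliberately simple brightness jitter standing in for the color/translation/cutout transforms actually used in [4]; the essential point is that T consists of pure tensor operations applied to both real and generated images, so gradients can flow through it during the generator update:

```python
import torch
import torch.nn.functional as F

def T(x):
    """Differentiable augmentation: a random per-image brightness jitter
    (illustrative stand-in for the transforms of [4])."""
    shift = torch.rand(x.size(0), 1, 1, 1, device=x.device) - 0.5
    return x + shift  # pure tensor ops, so gradients flow through T

def d_step(D, G, real_images, z):
    """Discriminator update: T is applied to BOTH real and fake images."""
    loss_real = F.softplus(-D(T(real_images))).mean()
    loss_fake = F.softplus(D(T(G(z).detach()))).mean()  # detach: G is frozen
    return loss_real + loss_fake

def g_step(D, G, z):
    """Generator update: the gradient back-propagates through T into G,
    which is why T must be differentiable with respect to its input."""
    return F.softplus(-D(T(G(z)))).mean()  # non-saturating GAN loss
```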


[7] proposes an adaptive discriminator augmentation (ADA) based on consistency regularization terms for the discriminator loss, in order to enforce discriminator consistency for both real and generated images. It works stochastically, evaluating the discriminator using only augmented images, and has the benefit that it can be used regardless of the amount of training data, the properties of the dataset, or the exact training setup. Figure 13 below shows a brief schematic of the approach: real images and "latents" (features before being processed by the generator) are fed into the augmentation block (marked in blue), which is controlled by a probability p that is adjusted adaptively during training. The higher p, the more augmentation is applied, as is clearly seen in the individual images.
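A minimal sketch of the adaptive part, assuming the common heuristic of tracking the sign of the discriminator's outputs on real images; the target value and step size are illustrative, not the exact hyperparameters of [7]:

```python
import torch

def update_p(p, logits_real, target=0.6, step=0.01):
    """Adjust the augmentation probability p from discriminator overfitting.

    r_t = E[sign(D(real))] drifts toward 1 when the discriminator starts to
    overfit; p is raised when r_t exceeds the target and lowered otherwise.
    (Target and step size here are illustrative assumptions.)
    """
    r_t = torch.sign(logits_real).mean().item()
    p = p + step if r_t > target else p - step
    return min(max(p, 0.0), 1.0)  # keep p a valid probability

# During training, each augmentation is then applied with probability p.
```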


Medical Application

This last subsection is only meant for completeness, as the research topic at hand is whether image synthesis can be applied to few-shot medical settings. In the current status quo, top-performing GANs are capable of generating medical images that look realistic by FID standards (explained below in section 3), can fool trained experts in a visual Turing test, and comply with some metrics. However, segmentation results suggest that no GAN is yet capable of reproducing the full richness of medical datasets. The two figures below show some medical use cases.

3. Results & Comparison

Evaluation of metrics & challenges

In order to compare the different presented approaches, some standardized metrics have to be considered. The Fréchet Inception Distance (FID) and the Learned Perceptual Image Patch Similarity (LPIPS) are the most common ones. The FID compares the distribution of generated images with the distribution of the real images used to train the generator, whereas LPIPS essentially computes the similarity between the activations of two image patches for some pre-defined network. [10] mainly uses the FID metric to compare how different amounts of data influence the performance of individual techniques. Figure 16 shows some exemplary FID results.
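For reference, once the mean and covariance of Inception-network activations have been computed for the real and the generated images, the FID reduces to the Fréchet distance between two Gaussians; a minimal NumPy/SciPy sketch:

```python
import numpy as np
from scipy import linalg

def fid(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians fitted to Inception
    activations of real and generated images:
        FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * (S1 @ S2)^(1/2))
    mu*: (D,) activation means; sigma*: (D, D) activation covariances.
    """
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)  # matrix square root
    if np.iscomplexobj(covmean):             # drop tiny imaginary parts
        covmean = covmean.real
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

# The statistics come from a pre-trained Inception network, e.g.:
# mu = activations.mean(axis=0); sigma = np.cov(activations, rowvar=False)
```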


Another approach to the evaluation would be to consider some other type of comparison feature, e.g. the computational efficiency of the presented methods. What also becomes interesting when analyzing performances is how well the initial challenges are resolved, for example how realistic the synthetic images are, or how well mode collapse and discriminator overfitting are prevented.


Further Example Images


4. Conclusion & Discussion

Recap

First, the challenges of image synthesis in low-data (few-shot) regimes were presented, and a brief review of GANs was given. Two approaches for classifying GAN methods and variations were presented, one focusing on the quantity of data used (one-shot vs. few-shot vs. limited data), and the other on the type of methods and changes performed on the GANs (data selection vs. GAN optimization vs. knowledge sharing). Finally, a brief evaluation based on FID was shown, along with some result figures. The question remains whether this is the best metric or not.

Personal Review

GANs are becoming popular and research in the field is growing considerably, with faster and lighter training expected in the future. Furthermore, the medical field is gaining a lot of traction, which leaves room for optimization and improvement, as a lot of practical application is still missing and the works presented so far are in their early stages. One technique that could play an important role in this domain is transfer learning, as many features can easily be applied to the medical field, even though they were not specifically developed for it.

References

[1] Shaham, T. R., Dekel, T., & Michaeli, T. (2019). SinGAN: Learning a Generative Model from a Single Natural Image. CoRR, abs/1905.01164. http://arxiv.org/abs/1905.01164.

[2] Sushko, V., Gall, J., & Khoreva, A. (2021). One-Shot GAN: Learning to Generate Samples from Single Images and Videos. CoRR, abs/2103.13389. https://arxiv.org/abs/2103.13389.

[3] Liu, B., Zhu, Y., Song, K., & Elgammal, A. (2021). Towards Faster and Stabilized GAN Training for High-fidelity Few-shot Image Synthesis. CoRR, abs/2101.04775. https://arxiv.org/abs/2101.04775.

[4] Zhao, S., Liu, Z., Lin, J., Zhu, J.-Y., & Han, S. (2020). Differentiable Augmentation for Data-Efficient GAN Training. CoRR, abs/2006.10738. https://arxiv.org/abs/2006.10738.

[5] Endo, Y., & Kanamori, Y. (2021). Few-shot Semantic Image Synthesis Using StyleGAN Prior. CoRR, abs/2103.14877. https://arxiv.org/abs/2103.14877.

[6] Xiao, J., Li, L., Wang, C., Zha, Z.-J., & Huang, Q. (2022). Few Shot Generative Model Adaption via Relaxed Spatial Structural Alignment. arXiv. https://doi.org/10.48550/ARXIV.2203.04121.

[7] Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., & Aila, T. (2020). Training Generative Adversarial Networks with Limited Data. CoRR, abs/2006.06676. https://arxiv.org/abs/2006.06676.

[8] Cui, K., Huang, J., Luo, Z., Zhang, G., Zhan, F., & Lu, S. (2021). GenCo: Generative Co-training for Generative Adversarial Networks with Limited Data. arXiv. https://doi.org/10.48550/ARXIV.2110.01254.

[9] Zhao, Y., Ding, H., Huang, H., & Cheung, N.-M. (2022). A Closer Look at Few-shot Image Generation. arXiv. https://doi.org/10.48550/ARXIV.2205.03805.

[10] Li, Z., Wu, X., Xia, B., Zhang, J., Wang, C., & Li, B. (2022). A Comprehensive Survey on Data-Efficient GANs in Image Generation. ArXiv, abs/2204.08329.

[11] Abdolahnejad, M., & Liu, P. X. (2020). Deep learning for face image synthesis and semantic manipulations: a review and future perspectives. https://link.springer.com/article/10.1007/s10462-020-09835-4.

[12] Chang, C.-C., Lin, C. H., Lee, C.-R., Juan, D.-C., Wei, W., & Chen, H.-T. (2018). Escaping from Collapsing Modes in a Constrained Space. CoRR, abs/1808.07258. http://arxiv.org/abs/1808.07258.

[13] Thomas Simonini. How AI can learn to generate pictures of cats. (2018). https://www.freecodecamp.org/news/how-ai-can-learn-to-generate-pictures-of-cats-ba692cb6eae4/

[14] MathWorks. Train Conditional Generative Adversarial Network (CGAN). (2022). https://www.mathworks.com/help/deeplearning/ug/train-conditional-generative-adversarial-network.html

[15] Nie, D., Trullo, R., Lian, J., Wang, L., Petitjean, C., Ruan, S., Wang, Q., & Shen, D. (2018). Medical Image Synthesis with Deep Convolutional Adversarial Networks. IEEE Transactions on Bio-medical Engineering, 65(12), 2720–2730. https://doi.org/10.1109/TBME.2018.2814538.

[16] Xu, J., Li, M., & Zhu, Z. (2020). Adaptive Data Augmentation for 3D Medical Image Segmentation. arXiv. https://doi.org/10.48550/ARXIV.2010.11695


