
Abstract

This blog post gives an overview of the most important multimodal image synthesis techniques, grouped into three categories. These techniques are compared, and the risks and limitations of such models are discussed at the end.

Table of Contents

  1. Introduction and Motivation
  2. CLIP: Contrastive Language-Image Pre-training
  3. GAN-based models
  4. Auto-regressive methods
  5. Diffusion-based models
  6. Review: Limitations and Risks
  7. Conclusion

Introduction

Motivation

In order to mimic human imagination and creativity in the real world, Multimodal Image Synthesis (MIS) aims at generating realistic images by leveraging multimodal data.

Multimodal data can be a) visual (e.g., images, color, video), b) aural (e.g., music, audio clips), c) gestural (e.g., body language, facial expressions), d) spatial (e.g., spatial information, depth values), or e) linguistic (e.g., text), as shown in Figure 1. Multimodality matters for machine-learning tasks for several reasons. First, real-world data is usually multimodal. Second, different types of data can provide information that complements the other modalities; this complementary information can improve the robustness of machine-learning models and decrease their sensitivity to noise. Third, multimodal learning facilitates human-AI communication and interaction, e.g., it is easier for a human to express their needs in words than with drawings.

Figure 1: Different types of modalities. Source: https://rampages.us/cnjacksonuniv200spring2017/multimodality/

Problem:

Although multimodal learning is important and helpful, handling multimodality is not straightforward for neural networks because of the inherent modality gap: data from different modalities have different distributions and formats. Zhan et al. [1] have categorized MIS methods into four categories: GAN-based models, autoregressive methods, diffusion-based methods, and NeRF-based methods. The first three categories are presented below, each with a prominent example.

CLIP: Contrastive Language Image Pre-training

A very important challenge in all multimodal tasks is the modality gap between the inputs; mapping text, images, and audio into the same latent space is therefore a crucial step.

CLIP (Contrastive Language-Image Pre-training)[4] is a model developed by OpenAI for visual-linguistic understanding. It is a pre-training method that learns to predict the relationships between images and text by training on a large dataset of images and their associated text. 

The model is trained to predict whether a given image-text pair matches. It does this with a contrastive loss function: the similarity between the image and text representations is maximized for matching pairs and minimized for non-matching pairs, as shown in Figure 4. Once pre-trained on a large dataset of images and their associated text, the model can be fine-tuned on a smaller dataset for a specific task, such as image captioning or image-text retrieval. The pre-training gives the model a general understanding of the relationship between images and text, which can then be adapted to specific tasks.

The main advantage of CLIP over other image-text pre-training methods is its contrastive loss function, which allows it to learn more robust and generalizable representations of images and text (Radford et al. [4]). Additionally, it can generalize to unseen text and images, which is a desirable property for a model trained for visual-linguistic understanding.

Figure 4: CLIP framework [4].
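To make the contrastive objective concrete, here is a minimal sketch of a symmetric InfoNCE-style loss of the kind described above. The encoders themselves are omitted, and the function name, temperature value, and toy inputs are illustrative assumptions rather than CLIP's actual implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # L2-normalize both embeddings so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarities of every image with every text in the batch.
    logits = image_features @ text_features.t() / temperature

    # The i-th image belongs to the i-th text: matching pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: classify the right text for each image and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random "embeddings": batch size 8, embedding dimension 512.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```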

CLIP is used in many tasks and models: 

  1. DALL-E [6]: this model is presented below in this blog post
  2. DALL-E-2 [8]: this model is presented below in this blog post
  3. Image-to-text and text-to-image retrieval

GAN-based Methods

StackGAN:

A pioneering text-to-image GAN-based method is StackGAN [2]. StackGAN is a generative model that produces high-resolution images from text descriptions. It is made up of two stages, a Stage-I generator and a Stage-II generator. The Stage-I generator produces a low-resolution image from the text description that captures a rough shape and basic colors. To do so, the text description is first encoded into a text embedding. The latent space of these text embeddings is usually high-dimensional, which typically results in discontinuity in the embedding space; this is undesirable, so conditioning augmentation is applied: instead of the fixed embedding, latent conditioning variables are sampled from a Gaussian distribution derived from the text embedding, which smooths the conditioning manifold. The Stage-II generator then produces a high-resolution image, conditioned on both the low-resolution image and the text description. The Stage-II generator can only create a high-resolution image once the Stage-I generator has been trained to provide an image with enough information. Because the two generators are trained together, the model learns to produce high-resolution images that are consistent with the text description.


Figure 2: The overall architecture of StackGAN [2].
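The conditioning augmentation mentioned above can be sketched as follows: a small network predicts a mean and a log-variance from the text embedding, and the conditioning vector is sampled from that Gaussian via the reparameterization trick. Layer sizes, names, and the toy input are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    def __init__(self, embed_dim=1024, cond_dim=128):
        super().__init__()
        # A single linear layer predicts both the mean and the log-variance.
        self.fc = nn.Linear(embed_dim, cond_dim * 2)

    def forward(self, text_embedding):
        mu, logvar = self.fc(text_embedding).chunk(2, dim=-1)
        std = torch.exp(0.5 * logvar)
        # Reparameterization trick: sample c ~ N(mu, sigma^2) around the text embedding.
        c = mu + std * torch.randn_like(std)
        return c, mu, logvar  # mu/logvar also feed a KL regularizer during training

cond, mu, logvar = ConditioningAugmentation()(torch.randn(4, 1024))  # toy batch of embeddings
```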


AttnGAN:

AttnGAN (Attentional Generative Adversarial Network) [3] is a generative model for generating images from text descriptions. It uses attention mechanisms to selectively focus on different regions of the image while generating it. The model consists of three main components: a text encoder, a generator, and a discriminator. The text encoder encodes a text description into a feature vector, the generator produces an image from this encoded feature, and the discriminator receives both the generated image and the encoded feature and decides whether the image is real or fake. Generator and discriminator are trained together in an adversarial manner: the generator tries to produce images that are consistent with the text description and indistinguishable from real images, while the discriminator tries to correctly identify fake images. The attention mechanism allows the model to focus on specific image regions during generation, which yields images that are more consistent with the text description and gives more fine-grained control over the generated output.

Figure 3: The overall architecture of AttnGAN [3].
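The core attention idea can be illustrated with a short sketch in which each image region attends over the word embeddings of the caption and receives a word-context vector. Dimensions and names are illustrative assumptions; AttnGAN additionally uses learned projections and a DAMSM loss that are omitted here.

```python
import torch
import torch.nn.functional as F

def word_attention(region_features, word_features):
    # region_features: (batch, num_regions, dim), e.g. one vector per spatial position
    # word_features:   (batch, num_words, dim), one embedding per caption word
    scores = region_features @ word_features.transpose(1, 2)   # (batch, regions, words)
    attn = F.softmax(scores, dim=-1)                           # attention weights over the words
    context = attn @ word_features                             # word-context vector per region
    return context, attn

# Toy usage: 2 images, 289 (17x17) regions, 12 caption words, 256-dim features.
context, attn = word_attention(torch.randn(2, 289, 256), torch.randn(2, 12, 256))
```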


Sound-Guided Semantic Image Manipulation

The sound-guided semantic image manipulation approach [5] is made up of two key components: a CLIP-based contrastive latent representation learning phase and a sound-guided image manipulation step.
To create a multimodal latent representation in (a), a group of encoders is trained on data from three different modalities (audio, text, and image). In the (CLIP-based) embedding space, the latent representations of a positive triplet pair are mapped close together, whereas those of a negative triplet pair are pushed farther apart.
In (b), a direct code optimization method modifies a source latent code in response to user-supplied audio, yielding the sound-guided image manipulation. The overall framework is shown in Figure 5.

Figure 5: The overall Sound-Guided Semantic Image Manipulation pipeline [5].
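The direct code optimization step in (b) can be sketched roughly as follows: a source latent code is optimized so that the generated image moves toward the audio embedding in the shared (CLIP-based) space while staying close to the original code. `generator`, `image_encoder`, and `audio_embedding` are placeholders for the pretrained components, and the loss weighting and step counts are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def edit_latent(source_code, generator, image_encoder, audio_embedding,
                steps=100, lr=0.05, lambda_reg=0.5):
    code = source_code.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([code], lr=lr)
    for _ in range(steps):
        image = generator(code)                                  # decode the current latent code
        img_emb = F.normalize(image_encoder(image), dim=-1)
        aud_emb = F.normalize(audio_embedding, dim=-1)
        # Pull the generated image toward the audio semantics in the shared space ...
        semantic_loss = 1 - (img_emb * aud_emb).sum(dim=-1).mean()
        # ... while keeping the edited code close to the source content.
        reg_loss = F.mse_loss(code, source_code)
        loss = semantic_loss + lambda_reg * reg_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return code.detach()
```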


Autoregressive Methods

Autoregressive models create an image one pixel or patch at a time, using the previously generated pixels or patches as input. Because the model is conditioned on its own previous outputs, this strategy is called autoregressive.
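A generic sampling loop makes this concrete: each new token (a pixel value or an image patch index) is sampled from a model conditioned on all previously generated tokens. The `model` interface (logits over the next token) and the start token are assumptions for illustration.

```python
import torch

@torch.no_grad()
def autoregressive_sample(model, num_tokens, bos_token=0, device="cpu"):
    # Start from a begin-of-sequence token; every step appends one new token.
    tokens = torch.full((1, 1), bos_token, dtype=torch.long, device=device)
    for _ in range(num_tokens):
        logits = model(tokens)                    # logits over the next pixel/patch token
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, 1)  # sample, conditioned on everything so far
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens[:, 1:]                          # drop the start token
```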

VQ-VAE-2 (Vector Quantized Variational Autoencoder 2)

VQ-VAE-2 (Vector Quantized Variational Autoencoder 2) [7] is a model developed by DeepMind that combines the concepts of autoencoders and vector quantization to generate high-quality images. It is a variant of VQ-VAE (Vector Quantized Variational Autoencoder) that improves the quality of the generated images.

The model consists of two main components: an encoder and a decoder. The encoder takes in an image and maps it to a lower-dimensional feature space, which is then fed into a vector quantizer. The vector quantizer divides the feature space into a set of discrete "codebook" vectors and assigns each input feature to its nearest codebook vector. The decoder takes the quantized features and maps them back to the original image space; for generation, an autoregressive prior is additionally learned over the discrete codes.

The model is trained to reproduce the original image by minimizing a reconstruction loss, which measures the difference between the original image and the image generated by the decoder, and a commitment loss, which measures the difference between the input features and their nearest codebook vectors.

After training, the model can generate new images by sampling codebook indices from the learned prior and feeding the corresponding vectors into the decoder. VQ-VAE-2 can generate high-quality images with a high level of detail, because the vector quantizer encourages the use of a limited set of codebook vectors, which results in a more compact and expressive representation of the image. Beyond image generation, the learned discrete representations can also be reused for other tasks such as image classification.

Figure 6: Vector Quantized Variational Autoencoder 2 architecture [7].
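The vector quantization step described above can be sketched as a nearest-neighbour lookup into the codebook, with a straight-through estimator so that gradients still reach the encoder. Sizes and names are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def quantize(features, codebook):
    # features: (num_vectors, dim), codebook: (codebook_size, dim)
    distances = torch.cdist(features, codebook)   # pairwise L2 distances to every codebook entry
    indices = distances.argmin(dim=-1)            # index of the nearest codebook vector
    quantized = codebook[indices]                 # replace each feature by its codebook vector
    # Straight-through estimator: gradients bypass the non-differentiable lookup.
    quantized = features + (quantized - features).detach()
    return quantized, indices

features = torch.randn(64, 32, requires_grad=True)   # toy encoder output
codebook = torch.randn(512, 32)                      # toy codebook
quantized, codes = quantize(features, codebook)
# The commitment loss compares `features` with `quantized.detach()` (mean squared error).
```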

DALL-E

A famous text-to-image synthesis method that uses an autoregressive decoder is DALL-E [6], a generative model developed by OpenAI. DALL-E builds on VQ-VAE-2, next-token prediction, and the CLIP model for re-ranking. It is able to generate a wide variety of images from natural language descriptions. The model consists of a transformer-based architecture with a large number of parameters and is trained on a large dataset of images and their associated text captions. During training, the model learns to generate images that are consistent with the captions; once trained, DALL-E can generate new images from natural language prompts.

DALL-E 1 training:

  1. Train an image encoder and decoder: learn a visual codebook.
  2. Concatenate text tokens with image tokens into a single array
  3. Train to predict the next image token from the preceding tokens.

DALL-E 1 Prediction:

  1. Tokenize input text to text tokens
  2. Predict the next image token from the learned codebook
  3. Decode the image tokens using the VQ-VAE-2 decoder
  4. Select the best image using the CLIP model ranker
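A hedged sketch of these prediction steps is given below. All components (`tokenizer`, `transformer`, `vqvae_decoder`, `clip_score`) are placeholders for the pretrained models; their names, signatures, and the candidate count are assumptions for illustration only.

```python
import torch

@torch.no_grad()
def generate_images(prompt, tokenizer, transformer, vqvae_decoder, clip_score,
                    num_candidates=8, num_image_tokens=1024):
    text_tokens = tokenizer(prompt)                              # step 1: text -> token tensor
    candidates = []
    for _ in range(num_candidates):
        tokens = text_tokens.clone()
        for _ in range(num_image_tokens):                        # step 2: sample image tokens
            logits = transformer(tokens)                         # logits over the next token
            next_token = torch.multinomial(torch.softmax(logits, dim=-1), 1)
            tokens = torch.cat([tokens, next_token], dim=-1)
        image_tokens = tokens[..., text_tokens.shape[-1]:]
        candidates.append(vqvae_decoder(image_tokens))           # step 3: tokens -> image
    # Step 4: re-rank the candidates with CLIP and keep the best match.
    scores = torch.stack([clip_score(image, prompt) for image in candidates])
    return candidates[scores.argmax().item()]
```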

Diffusion-based Models

Diffusion models learn to recover the data by reversing a noising process that systematically corrupts the training data with Gaussian noise.

Figure 7: Examples of the denoising process. 
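The forward (noising) process can be sketched in a few lines: Gaussian noise is mixed into the data according to a schedule, and a denoiser is later trained to reverse this corruption. The schedule values below are common defaults rather than the settings of any particular paper.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule (common default)
alphas_cumprod = torch.cumprod(1.0 - betas, 0)   # cumulative product, often written abar_t

def add_noise(x0, t):
    # Sample x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) in closed form.
    noise = torch.randn_like(x0)
    abar = alphas_cumprod[t]
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * noise
    return x_t, noise   # the denoiser is trained to predict `noise` from x_t and t

x0 = torch.rand(1, 3, 64, 64) * 2 - 1            # a toy "image" scaled to [-1, 1]
x_t, target_noise = add_noise(x0, t=500)
```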

DALL-E-2

DALL-E-2 is a text-to-image synthesis model introduced by OpenAI in 2022. DALL-E-2 is composed of three important parts:

  • CLIP pre-training: CLIP text and image encoders are trained to create multimodal embeddings in a shared latent space.
  • Prior model: takes a text description and its CLIP text embedding and outputs a CLIP image embedding.
  • Decoder diffusion model (unCLIP): generates an image from the CLIP image embedding using diffusion.

The overall architecture is shown in Figure 8.

Figure 8: The overall architecture of DALL-E-2 [8].
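The three parts compose into a simple two-stage generation path, sketched below. All callables are placeholders for the pretrained CLIP text encoder, prior, and diffusion decoder; the names and signatures are illustrative assumptions.

```python
import torch

@torch.no_grad()
def dalle2_generate(prompt, clip_text_encoder, prior, decoder):
    text_embedding = clip_text_encoder(prompt)         # CLIP text embedding of the prompt
    image_embedding = prior(text_embedding)            # prior: text embedding -> image embedding
    image = decoder(image_embedding, text_embedding)   # unCLIP diffusion decoder -> image
    return image
```

In the actual system the prior and decoder are themselves diffusion (or autoregressive) models and the decoder is followed by upsampling stages; the sketch only shows how the pieces connect.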


Results and Comparison

In order to compare the previously presented models, the COCO Captions dataset [10] is used. COCO Captions is composed of ~1.5M sentences describing over 330,000 images.
The metrics used for this comparison are reported in Table 1:

Metric | Goal | Values
Fréchet Inception Distance (FID) | Image quality | the lower, the better
Inception Score (IS) | Image quality | the higher, the better

Table 1: Overview of the evaluation metrics.
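For reference, the standard definitions of these two metrics are shown below, where mu and Sigma denote the mean and covariance of Inception features computed on real (r) and generated (g) images, and p(y|x) is the Inception label distribution of a generated image x:

```latex
% Frechet Inception Distance between real (r) and generated (g) feature statistics
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
             + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)

% Inception Score over generated images x, with p(y|x) the Inception label distribution
\mathrm{IS} = \exp\!\Big( \mathbb{E}_{x} \big[ D_{\mathrm{KL}}\!\big( p(y \mid x) \,\|\, p(y) \big) \big] \Big)
```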

Table 2 shows the overall comparison results. From this table, we can see that DALL-E and DALL-E-2 are the best performing models among the ones presented, with DALL-E-2 performing best overall. Both models are therefore reviewed in more detail in the next section.

Model | IS | FID
StackGAN | 8.45 | 74.05
AttnGAN | 25.89 | 35.20
DALL-E | 17.90 | 27.50
DALL-E-2 | - | 10.39

Table 2: Comparison table reporting Inception Score (IS) and Fréchet Inception Distance (FID).

Review

DALL-E and DALL-E-2 are two powerful models that can generate good quality images. DALL-E generates realistic images from text descriptions, but it uses CLIP only at the very end, to select the best-matching image among all generated candidates. DALL-E-2, in contrast, is built on CLIP representations throughout and can therefore capture fine-grained interconnections between the two modalities. DALL-E-2 also produces results faster than DALL-E because it uses a diffusion decoder rather than an autoregressive one, and besides producing higher-resolution images faster, it can generate several variations of an image within a few seconds.

Limitations of DALL-E-2:

Biases and stereotypes:

When the prompt is vague, DALL-E-2 has a tendency to portray individuals and settings as White or Western. It also reproduces gender stereotypes (for instance, that a builder is a man and a flight attendant is a woman). This is what the model produces when given these occupations:

This phenomenon, known as representational bias, happens when models like DALL-E-2 or GPT-3 reinforce prejudices found in their training data that classify individuals in one way or another depending on their identity (e.g., race or gender).

Harassment and Bullying:

DALL-E-2 could be used to add objects to an image, and this could be used as a harassment tool by:

  • Modifying clothing: removing or adding religious clothing (yarmulke, hijab)
  • Adding people to an image: e.g., adding a person holding hands with the original person in the image (someone who is not their partner)

Misinformation:

While deepfakes might be more effective for faces, DALL-E-2 could produce plausible scenarios of many kinds. For example, it might be instructed to produce photos of burning buildings, or of people walking or talking happily with a famous building in the background. This could be employed to deceive and misinform people about the actual events taking place there.

Spelling:

DALL-E-2 excels at drawing but fails miserably at spelling. A possible cause is that the text appearing in the dataset photos provides no explicit spelling information, and DALL-E-2 cannot accurately draw something that is not represented in the CLIP embeddings. When given "a sign that says deep learning", DALL-E-2 produces the following:

Incoherence:

Most of the time, DALL-E-2 images look good, but occasionally a kind of consistency is missing that human works would never lack. This demonstrates that DALL-E-2 is quite good at acting as though it understands how the world works, when it actually does not. Most people could never paint like DALL-E-2, but they would certainly not make these blunders accidentally.

In the next example, the hands are well drawn. The lines of the skin and its varying tones of light and dark are clear, and the fingers even look as if they have just been digging in the earth.
However, the palms of both hands are merged where the plant is growing, and one of the fingers belongs to neither hand. Even though DALL-E-2 captured two hands in great detail, it overlooked the fact that hands are usually separate from one another.


Conclusion

Multimodal image synthesis is an important task nowadays because it can facilitate the interaction between humans and AI. Several types of models have been proposed. The survey by Zhan et al. [1] proposes a classification of these models into four types: a) GAN-based methods (e.g., StackGAN, AttnGAN), b) autoregressive methods (e.g., DALL-E), c) diffusion-based models (e.g., DALL-E-2), and d) NeRF-based methods. Since NeRF-based models are quite different and aim at outputting a 3D scene representation, they are left for a future blog post.

One major problem in dealing with multimodality is the modality gap between, e.g., text and image inputs. To deal with this gap, CLIP [4] is an effective method that maps different modalities into the same latent space. CLIP has been widely used in state-of-the-art text-to-image synthesis methods, which are mainly autoregressive or diffusion-based, such as DALL-E-2.

Although multimodal image synthesis methods can be great artistic tools that create artificial creative images and help humans, they also carry risks and could be dangerous in some cases. These powerful methods could be used for bullying and harassment, or to misinform people.

References:

  1. Zhan, F., Yu, Y., Wu, R., Zhang, J., & Lu, S. (2021). Multimodal image synthesis and editing: A survey. arXiv preprint arXiv:2112.13592.
  2. Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., & Metaxas, D. N. (2017). StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 5907-5915).
  3. Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., & He, X. (2018). Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1316-1324).
  4. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748-8763). PMLR.
  5. Lee, S. H., Roh, W., Byeon, W., Yoon, S. H., Kim, C., Kim, J., & Kim, S. (2022). Sound-guided semantic image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3377-3386).
  6. Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., ... & Sutskever, I. (2021, July). Zero-shot text-to-image generation. In International Conference on Machine Learning (pp. 8821-8831). PMLR.
  7. Van Den Oord, A., & Vinyals, O. (2017). Neural discrete representation learning. Advances in neural information processing systems, 30.
  8. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.
  9. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with CLIP latents. URL: https://doi.org/10.48550/arXiv.2204.06125
  10. Chen, X., Fang, H., Lin, T. Y., Vedantam, R., Gupta, S., Dollár, P., & Zitnick, C. L. (2015). Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.

