Abstract

This blog post gives an overview of the most important multimodal image synthesis techniques, grouped into three categories. These techniques are then compared, and finally the risks and limitations of such models are discussed.

Table of Contents

  1. Introduction and Motivation
  2. CLIP: Contrastive Language-Image Pre-training
  3. GAN-based models
  4. Auto-regressive methods
  5. Diffusion-based models
  6. Review: Limitations and Risks
  7. Conclusion

Introduction

Motivation

In order to mimic human imagination and creativity in the real world, the task of multimodal image synthesis aims at generating realistic images by leveraging multimodal data.

Multimodal data can be a) visual (e.g., images, color, video), b) aural (e.g., music, audio clips), c) gestural (e.g., body language, facial expressions), d) spatial (e.g., spatial information, depth values), or e) linguistic (e.g., text), as shown in Figure 1. Multimodality is important in many machine-learning tasks. First, real-world data is usually multimodal. Second, different types of data can provide information that complements the other modalities; this complementary information can help improve the robustness of machine-learning models and decrease their sensitivity to noise. Third, multimodal learning facilitates human-AI communication and interaction, e.g., it is easier for a person to express a need in words than with drawings.

Figure 1: Different types of modalities.
Source: https://rampages.us/cnjacksonuniv200spring2017/multimodality/

Problem:

Although multimodal learning is important and helpful, it is not straightforward for neural networks to handle multimodality because of the inherent modality gap: data from different modalities have different distributions and formats. Zhan et al. [1] have categorized multimodal image synthesis (MIS) methods into four categories: GAN-based models, autoregressive methods, diffusion-based methods, and NeRF-based methods. The first three categories will be presented here, each with a well-known example.

CLIP: Contrastive Language Image Pre-training

A major challenge in all multimodal tasks is the modality gap between the inputs; mapping text, images, and audio into the same latent space is therefore a crucial step.

CLIP (Contrastive Language-Image Pre-training) [4] is a model developed by OpenAI for vision-language understanding. Using a large dataset of images and their corresponding text, this pre-training method learns to predict the associations between images and text.
The model is trained to estimate the probability that a given image-text pair is a true pair. This is accomplished with a contrastive loss function: the model maximizes the similarity between the image and text representations of true pairs and minimizes it for false pairs, as illustrated in Figure 4. After pre-training on a large dataset of images and their accompanying text, the model can be fine-tuned on a smaller dataset for a particular task such as image captioning or image-text retrieval. Pre-training allows the model to learn a general understanding of the interconnection between image and text inputs, which can then be adapted to specific tasks.

The main advantage of CLIP over other image-text pre-training methods is that it utilizes a contrastive loss function, which generates more robust and generalizable representations of both modalities [4]. It can also generalize to unseen data pairs, which is a desirable property for a model that is trained for visual-linguistic understanding or reasoning.
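To make the contrastive objective more concrete, below is a minimal PyTorch sketch of a CLIP-style symmetric contrastive loss. It is not OpenAI's implementation: the encoder outputs, the fixed temperature value, and the assumption that the features are already L2-normalized are illustrative choices.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss (illustrative,
# not OpenAI's code). Assumes `image_features` and `text_features` come from
# separate encoders and are already L2-normalized.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # Cosine-similarity logits between every image and every text in the batch.
    logits = image_features @ text_features.t() / temperature  # (N, N)

    # The matching pairs lie on the diagonal: image i belongs to text i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: pick the right text for each image
    # and the right image for each text.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```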

Figure 4: CLIP framework [4].

CLIP is used in many tasks and models. It has been utilized mainly in: 

  1. DALL-E [6]: this model is presented later in this blog post
  2. DALL-E-2 [8]: this model is presented later in this blog post
  3. Image-to-text and text-to-image retrieval

GAN-based Methods

StackGAN:

A pioneering text-to-image GAN-based method is StackGAN [2]. StackGAN is a generative model that produces high-resolution images from text descriptions. It consists of two stages, a Stage-I generator and a Stage-II generator. The Stage-I generator produces a low-resolution image from the text description that captures a rough shape and basic colors. To do so, the text description is first encoded into a text embedding. The latent space of the text embedding is usually high-dimensional, which typically results in discontinuity in the embedding space, an undesirable property; hence, conditioning augmentation is applied. The Stage-II generator then produces a high-resolution image. Its output is conditioned on both the low-resolution image and the text description. The Stage-II generator can only create a high-resolution image once the Stage-I generator has been trained to provide an image with enough information. Because the two generators are trained together, the model learns to produce high-resolution images that are consistent with the text description.
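As a rough illustration of the conditioning-augmentation idea, the following sketch samples the conditioning vector from a Gaussian whose mean and variance are predicted from the text embedding; the layer sizes and module name are assumptions, not the paper's exact code.

```python
# Rough sketch of StackGAN-style conditioning augmentation (illustrative sizes).
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    def __init__(self, text_dim=1024, cond_dim=128):
        super().__init__()
        # One linear layer predicts both the mean and the log-variance
        # of a Gaussian over the conditioning vector.
        self.fc = nn.Linear(text_dim, cond_dim * 2)

    def forward(self, text_embedding):
        mu, logvar = self.fc(text_embedding).chunk(2, dim=-1)
        # Reparameterization trick: sample a smooth conditioning vector.
        c = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # KL term keeps the conditioning distribution close to N(0, I),
        # which smooths the text-conditioning manifold.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return c, kl
```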


Figure 2: The overall architecture of StackGAN [2].


AttnGAN:

AttnGAN (Attentional Generative Adversarial Network) [3] is a generative model for generating images from text descriptions. It uses attention mechanisms to selectively focus on different regions of the image while generating it. The model consists of three main components: a text encoder, a generator, and a discriminator. The text encoder encodes a text description into a feature vector. The generator takes the encoded embedding vector and generates an image. The discriminator takes both the generated image and the encoded embedding vector and decides whether the image is real or fake. The generator is trained to generate images that are consistent with the text description, while the discriminator is trained to distinguish real images from fake ones. The attention mechanism allows the model to focus on specific regions of the image while generating it, which yields images that are more consistent with the text description and allows finer-grained control over the generated images.
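The word-level attention at the heart of AttnGAN can be sketched roughly as follows: each image sub-region attends over the word features and receives a word-context vector. Dimensions and names are illustrative and simplified compared to the paper's model.

```python
# Simplified sketch of AttnGAN-style word attention over image regions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordRegionAttention(nn.Module):
    def __init__(self, word_dim=256, region_dim=128):
        super().__init__()
        # Project word features into the image-region feature space.
        self.proj = nn.Linear(word_dim, region_dim)

    def forward(self, region_feats, word_feats):
        # region_feats: (B, R, region_dim)  image sub-region features
        # word_feats:   (B, T, word_dim)    per-word text features
        words = self.proj(word_feats)                            # (B, T, region_dim)
        scores = torch.bmm(region_feats, words.transpose(1, 2))  # (B, R, T)
        attn = F.softmax(scores, dim=-1)                         # attention over words
        # Word-context vector for every region: which words matter where.
        context = torch.bmm(attn, words)                         # (B, R, region_dim)
        return context, attn
```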

Figure 3: The overall architecture of AttnGAN [3].


Sound-Guided Semantic Image Manipulation

The sound-guided semantic image manipulation approach [5] consists of two key components: a CLIP-based contrastive latent representation learning phase and a sound-guided image manipulation step.
To create a multimodal latent representation in (a), the authors train a set of encoders on data from three different modalities (audio, text, and image). In the CLIP-based embedding space, the latent representations of a positive triplet pair are mapped close together, whereas those of a negative triplet pair are pushed farther apart.
In (b), they employ a direct code optimization method that modifies a source latent code in response to user-supplied audio, yielding the sound-guided image manipulation result. The overall framework is shown in Figure 5.
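A hedged sketch of the direct code optimization step in (b): the source latent code is optimized so that the generated image moves toward the audio in the shared embedding space while staying close to its starting point. `generator`, `image_encoder`, and `audio_encoder` are hypothetical stand-ins for the pretrained generator and the paper's CLIP-based encoders, and the loss weights are illustrative.

```python
# Sketch of direct latent-code optimization for sound-guided editing
# (hypothetical module names, illustrative hyperparameters).
import torch
import torch.nn.functional as F

def edit_with_sound(w_source, audio, generator, image_encoder, audio_encoder,
                    steps=200, lr=0.05, lam=0.8):
    w = w_source.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    a = F.normalize(audio_encoder(audio), dim=-1)
    for _ in range(steps):
        img = generator(w)
        v = F.normalize(image_encoder(img), dim=-1)
        # Pull the edited image toward the audio in the shared embedding space,
        # while keeping the latent code near its starting point.
        loss = (1 - (v * a).sum(dim=-1).mean()) + lam * (w - w_source).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```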

Figure 5: The overall Sound-Guided Semantic Image Manipulation pipeline [5].


Autoregressive Methods

Autoregressive models generate images one pixel or patch at a time, using the previously generated pixels or patches as input. Because the model is conditioned on its own previous outputs, this strategy is called autoregressive.
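A generic sketch of this sampling loop is shown below; `model` stands for any next-token predictor (e.g., a transformer), and the sampling details are illustrative.

```python
# Generic autoregressive sampling loop: each new token is sampled conditioned
# on everything generated so far.
import torch

@torch.no_grad()
def sample_tokens(model, prefix, num_tokens, temperature=1.0):
    tokens = prefix  # (B, T0), e.g. text tokens or a start token
    for _ in range(num_tokens):
        logits = model(tokens)[:, -1, :] / temperature   # predict the next token
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)  # condition on own output
    return tokens[:, prefix.size(1):]  # return only the newly generated tokens
```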

VQ-VAE-2 (Vector Quantized Variational Autoencoder 2)

VQ-VAE-2 (Vector Quantized Variational Autoencoder 2) [7] is a model developed by DeepMind that combines the concepts of autoencoders and vector quantization to generate high-quality images. It is a variant of VQ-VAE (Vector Quantized Variational Autoencoder) and improves the quality of the generated images.

The model consists of two main components: an encoder and a decoder. The encoder takes in an image and maps it to a lower-dimensional latent space, which is then fed into a vector quantizer. The vector quantizer divides the feature space into a set of discrete "codebook" vectors and assigns the input features to the nearest codebook vector. The decoder takes the quantized features and maps them back to the original image space; image generation is made possible by an autoregressive prior learned over the discrete latent codes.

The model is trained to reproduce the original image by minimizing the reconstruction loss, which measures the difference between the original image and the decoder's output, and the commitment loss, which measures the difference between the encoder's output features and the nearest codebook vector.
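The quantization step and the two losses can be sketched as follows; this is a single-level illustration (VQ-VAE-2 uses a hierarchy of such quantizers), with the standard straight-through gradient trick included so the snippet is trainable.

```python
# Minimal sketch of the vector-quantization step and its training losses.
import torch
import torch.nn.functional as F

def vector_quantize(z_e, codebook, beta=0.25):
    # z_e: (N, D) encoder outputs, codebook: (K, D) learnable code vectors.
    distances = torch.cdist(z_e, codebook)          # (N, K)
    indices = distances.argmin(dim=-1)              # nearest codebook entry
    z_q = codebook[indices]                         # quantized features

    # Codebook loss pulls codes toward encoder outputs; commitment loss (beta)
    # keeps encoder outputs close to their assigned codes.
    codebook_loss = F.mse_loss(z_q, z_e.detach())
    commitment_loss = beta * F.mse_loss(z_e, z_q.detach())

    # Straight-through estimator: gradients flow to the encoder as if
    # quantization were the identity.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices, codebook_loss + commitment_loss
```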

After training, the model can generate new images by sampling latent codes from the learned prior and feeding them into the decoder. VQ-VAE-2 can generate high-quality images with a high level of detail because the vector quantizer encourages the use of a limited set of codebook vectors, resulting in a compact and expressive representation of the image. Additionally, the learned representations can be used for image generation, image classification, and other downstream tasks.

Figure 6: Vector Quantized Variational Autoencoder 2 architecture [7].

DALL-E

A famous text-to-image synthesis method that uses an autoregressive decoder is DALL-E, a generative model developed by OpenAI [6]. DALL-E is based on a VQ-VAE-2-style discrete image tokenizer, next-token prediction, and the CLIP model for re-ranking. It can generate a wide variety of images from natural language descriptions. The model is a transformer-based architecture with a large number of parameters, trained on a large dataset of images and their associated text captions. During training, the model learns to generate images that are consistent with the text captions; once trained, DALL-E can generate new images from natural language prompts.

DALL-E 1 training:

  1. Train an image encoder and decoder: learn a visual codebook.
  2. Concatenate text tokens with image tokens into a single array
  3. Train to predict the next image token from the preceding tokens.

DALL-E 1 Prediction:

  1. Tokenize the input text into text tokens
  2. Autoregressively predict image tokens over the learned codebook
  3. Decode the image tokens using the VQ-VAE-2 decoder
  4. Select the best image using the CLIP ranker (see the sketch after this list)
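Assuming hypothetical, already-trained components (`tokenizer`, `transformer`, `vqvae_decoder`, `clip_score`), the prediction steps above could be chained roughly like this; the snippet is a sketch of the described pipeline, not OpenAI's code.

```python
# High-level sketch of the DALL-E inference pipeline (hypothetical components).
import torch

@torch.no_grad()
def generate_images(prompt, tokenizer, transformer, vqvae_decoder, clip_score,
                    num_candidates=8, num_image_tokens=1024):
    text_tokens = tokenizer(prompt).repeat(num_candidates, 1)  # 1. tokenize text
    tokens = text_tokens
    for _ in range(num_image_tokens):                          # 2. sample image tokens
        logits = transformer(tokens)[:, -1, :]
        next_tok = torch.multinomial(torch.softmax(logits, -1), 1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    image_tokens = tokens[:, text_tokens.size(1):]
    images = vqvae_decoder(image_tokens)                       # 3. decode to pixels
    scores = clip_score(images, prompt)                        # 4. CLIP re-ranking
    return images[scores.argmax()]
```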

Diffusion-based Models

Diffusion models learn to recover the data by reversing a noising process that systematically adds Gaussian noise to corrupt the training data.
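A small sketch of the forward (noising) part of this process in the DDPM style: noise of increasing strength is mixed into a clean image, and the model is trained to predict that noise. The linear schedule below is an illustrative assumption.

```python
# Sketch of the DDPM-style forward (noising) process that diffusion models
# learn to reverse; schedule values are illustrative.
import torch

def forward_noise(x0, t, alphas_cumprod):
    # q(x_t | x_0): add Gaussian noise whose strength grows with timestep t.
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    return x_t, noise  # the model is trained to predict `noise` from (x_t, t)

# Example linear schedule over 1000 steps (an assumption, not a fixed choice).
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
```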

Figure 7: Examples of the denoising process.
Source:https://vaclavkosar.com/ml/openai-dall-e-2-and-dall-e-1 

DALL-E-2

DALL-E-2 is a text-to-image synthesis model introduced by OpenAI in 2022. DALL-E-2 is composed of three important parts:

  • CLIP pre-training: the CLIP text and image encoders are trained to produce a shared multimodal embedding space for images and their captions.
  • Prior model: takes a text description and its CLIP text embedding and outputs a CLIP image embedding.
  • Decoder diffusion model (unCLIP): generates an image from the CLIP image embedding using diffusion.

The overall architecture is shown in Figure 8.
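At inference time, the three parts chain together roughly as sketched below; `clip_text_encoder`, `prior`, and `decoder` are hypothetical stand-ins for the trained DALL-E-2 modules.

```python
# Hedged, high-level sketch of the DALL-E-2 inference chain (hypothetical modules).
import torch

@torch.no_grad()
def dalle2_generate(prompt, tokenizer, clip_text_encoder, prior, decoder):
    text_tokens = tokenizer(prompt)
    text_emb = clip_text_encoder(text_tokens)   # CLIP text embedding
    image_emb = prior(text_tokens, text_emb)    # prior: text -> CLIP image embedding
    image = decoder(image_emb, text_tokens)     # unCLIP diffusion decoder -> pixels
    return image
```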

Figure 8: The overall architecture of DALL-E-2 [8]


Results and Comparison

In order to compare the previously presented models, the COCO Captions dataset [10] is used. COCO Captions consists of roughly 1.5M sentences describing over 330,000 images.
The metrics used for this comparison are reported in Table 1:

Metric | Goal | Values
Fréchet Inception Distance (FID) | Image quality | the lower, the better
Inception Score (IS) | Image quality | the higher, the better

Table 1: Overview of the evaluation metrics.
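For reference, a back-of-the-envelope sketch of how FID is computed from Inception activations (assumed to be precomputed NumPy arrays of features for real and generated images):

```python
# Sketch of the FID computation from precomputed Inception features.
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, fake_feats):
    mu_r, mu_f = real_feats.mean(0), fake_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):      # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    return np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * covmean)
```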

Table 2 shows the overall comparison results. From this table, we can see that DALL-E and DALL-E-2 are the best-performing models among those presented, with DALL-E-2 performing best. Both of these models make use of CLIP, as discussed further in the Review section below.

Model | IS | FID
StackGAN | 8.45 | 74.05
AttnGAN | 25.89 | 35.20
DALL-E | 17.90 | 27.50
DALL-E-2 | - | 10.39

Table 2: Comparison table reporting Inception Score (IS) and Fréchet Inception Distance (FID).

Review

DALL-E and DALL-E-2 are two powerful models that can generate good-quality images. DALL-E generates realistic images from text descriptions, but it utilizes CLIP only at the end, to select the best-matching image among all the generated candidates. DALL-E-2, in contrast, can capture fine-grained interconnections between the two modalities because it is built directly on CLIP representations. DALL-E-2 also produces results faster than DALL-E because it uses a diffusion decoder rather than an autoregressive one. Besides producing higher-resolution images faster, DALL-E-2 can produce several variations of an image in a few seconds.

Limitations of DALL-E-2:

Biases and stereotypes:

When the prompt is vague, DALL-E-2 has a tendency to portray individuals and settings as White or Western. It also reproduces gender stereotypes (for instance, a builder is a man and a flight attendant is a woman). This is what the model produces when given these occupations:

This phenomenon, known as representational bias, happens when models like DALL-E-2 or GPT-3 confirm prejudices found in the dataset that classify individuals in one way or another depending on their identity (e.g. race, gender, etc.).

Harassment and Bullying:

DALL-E-2 can add or modify objects in an image, and this could be abused as a harassment tool by:

  • Modifying clothing: removing or adding religious clothing (e.g., a hijab)
  • Adding people to an image: e.g., inserting a person holding hands with the original person in the image (someone who is not their partner)

Misinformation:

DALL-E-2 can produce plausible scenarios of many kinds. It could, for example, be instructed to produce photos of burning buildings, or of people walking or talking happily with a famous building in the background. Such images could be used to deceive and misinform people about the actual events taking place there.

Spelling:

DALL-E-2 excels at drawing but fails miserably at spelling. A possible cause is that the spelling of the text appearing in dataset photos is not captured in the CLIP embeddings, and DALL-E-2 cannot accurately draw something that is not represented in those embeddings. When given "a sign that says deep learning," DALL-E-2 produces the following:

Incoherence:

Most of the time, DALL-E-2 images look good, but occasionally they lack a kind of consistency that human works would never lack. This shows that DALL-E-2 is quite good at acting as though it understands how the world works, when it actually does not. Most people could never paint like DALL-E-2, but they certainly would not make these kinds of mistakes or incoherences accidentally.

In the next example, the hands are well drawn. The lines of the skin and its varying tones of light and dark are clear, and the fingers even look as though they have just been digging in the earth. However, the palms of both hands merge where the plant is growing, and one of the fingers belongs to neither hand. Even though DALL-E-2 rendered two hands in great detail, it overlooked the fact that hands are normally separate from one another.


Conclusion

The multimodal image synthesis task is important because, first, different modalities can provide complementary information and thus improve the quality of the generated images, and second, it facilitates the interaction between humans and AI. Several types of models have been proposed. The survey by Zhan et al. [1] proposes a classification of these models into four types: a) GAN-based methods (e.g., StackGAN, AttnGAN), b) autoregressive methods (e.g., DALL-E), c) diffusion-based methods (e.g., DALL-E-2), and d) NeRF-based methods. Since NeRF-based methods pursue a different goal, outputting a 3D scene representation, they are left for a future blog post.

One major problem in dealing with multimodality is the modality gap between, e.g., text and image inputs. To bridge this gap, CLIP [4] is an effective method that maps different modalities into the same latent space. CLIP has been widely used in state-of-the-art text-to-image synthesis methods, which are mainly autoregressive or diffusion-based, such as DALL-E-2.

Although multimodal image synthesis methods can be a great artistic tool for creating creative images and helping humans, they also carry risks and could be dangerous in some cases. These powerful methods could be used for bullying and harassment, or to misinform people. Hence, they should be used wisely and ethically.

References:

  1. Zhan, F., Yu, Y., Wu, R., Zhang, J., & Lu, S. (2021). Multimodal image synthesis and editing: A survey. arXiv preprint arXiv:2112.13592.
  2. Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., & Metaxas, D. N. (2017). Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference on computer vision (pp. 5907-5915)
  3. Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., & He, X. (2018). Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1316-1324).
  4. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748-8763). PMLR.
  5. Lee, S. H., Roh, W., Byeon, W., Yoon, S. H., Kim, C., Kim, J., & Kim, S. (2022). Sound-guided semantic image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3377-3386).
  6. Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., ... & Sutskever, I. (2021, July). Zero-shot text-to-image generation. In International Conference on Machine Learning(pp. 8821-8831). PMLR.
  7. Van Den Oord, A., & Vinyals, O. (2017). Neural discrete representation learning. Advances in neural information processing systems, 30.
  8. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.
  9. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with CLIP latents. URL: https://doi.org/10.48550/arXiv.2204.06125
  10. Chen, X., Fang, H., Lin, T. Y., Vedantam, R., Gupta, S., Dollár, P., & Zitnick, C. L. (2015). Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.

