Blog post written by: Alexandra Samoylova

Based on: L. Zhang, A. Rao and M. Agrawala, "Adding Conditional Control to Text-to-Image Diffusion Models," 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2023, pp. 3813-3824, doi: 10.1109/ICCV51070.2023.00355.

Introduction

In this blog post, I will explain the concept and theory behind ControlNet1, which has transformed text-to-image generative models by making them not only impressive in terms of performance metrics but also functional for real-world tasks.

Text-to-Image models: any limitations in practice?

Text-to-image diffusion models consistently astonish new users by producing aesthetic and realistic images that can also emulate specific artistic styles—such as cartoons or oil paintings—upon request. Users typically specify the style details and desired image composition through a text control commonly known as a text prompt. However, the initial excitement often turns to frustration due to the practical challenges in using text-to-image models. It's quite difficult to prompt the model to produce outputs that closely resemble one's mental imagery, especially when conveying specific spatial details.

To illustrate the difficulty of accurately conveying spatial details to a model—details that may seem straightforward to a human—I share my experience with the Stable Diffusion model2,3 in Fig.1. I attempted to generate an image of a patient lying inside a Magnetic Resonance Imaging (MRI) scanner. Despite explicitly stating in the text prompt that the patient should be inside the scanner while the scan was in progress, the model repeatedly failed to capture this detail. The results shown in Fig.1 are realistic and visually appealing, yet the model consistently placed the patient outside the scanner. How can we effectively communicate human-intuitive concepts like spatial arrangement to a model, especially when prompt engineering appears ineffective?



Fig.1 Stable Diffusion model output. Text prompt: A patient lying inside MRI scanner, the head and trunk are inside scanner opening, the patient is being scanned.

How can conditional controls be provided?

The need to repeatedly modify text prompts to achieve desired results is a major limitation that hinders the use of text-to-image diffusion models in everyday life and work. Usage could be greatly simplified if users could also provide an image reference, illustrating the spatial arrangement of the desired output, alongside the text prompt. This is precisely what ControlNet realizes: the model accepts inputs such as sketches or feature maps (including edge maps, depth maps, and segmentation maps, among others) and converts them into conditional controls for pre-trained text-to-image diffusion models, with Stable Diffusion as one example.


To understand the design choices behind ControlNet and discuss the method in detail, let's first examine the available approaches to provide conditional controls to text-to-image diffusion models, including their benefits and drawbacks.

Training from scratch

One option for providing conditional control to a text-to-image diffusion model is to retrain the model from scratch, using both the text control and an image feature-map control, such as a depth map, during training. Apart from the vast computational resources required, another challenge is the significant disparity between the data used to train state-of-the-art text-to-image diffusion models like Stable Diffusion—5 billion text-image pairs4—and the much smaller datasets available for specific controls, usually around 200,000 pairs but ranging from 80,000 to 3 million pairs1.

Finetuning, Continued Learning

Alternatively, depending on the data available for a desired control, we could fine-tune a pre-trained model (if data is limited) or opt for a continued-learning strategy. However, both approaches have been shown to potentially disrupt the pre-trained backbone, causing overfitting and catastrophic forgetting5.

Going deeper?

The authors of ControlNet chose a customizable approach that leverages the capabilities of a pre-trained text-to-image diffusion model without starting from scratch. Importantly, this method preserves the integrity of the pre-trained backbone and effectively addresses the training data limitations mentioned earlier. Curious about how researchers achieved this? Let’s dive into the methods behind ControlNet!


Placing Stable Diffusion under Control

Method details

Let’s first discuss the general outline of the ControlNet architecture. To train a ControlNet for a specific conditional control, for instance a depth map, one takes a pre-trained text-to-image diffusion model and locks its parameters. A trainable copy of this model is then created; it receives the original diffusion-model input combined with the feature-map control. The outputs of the frozen and trainable parts are then added together to produce the final output. However, noisy parameter updates at the start of training could compromise the capabilities of the pre-trained backbone, so protective measures are needed.

To safeguard the functionality of the pre-trained backbone, zero-convolution layers are introduced at two points: before the conditional feature map is added to the diffusion-model input, and before the outputs of the two models are combined. A zero-convolution layer is a 1x1 convolution whose weights and biases are initialized to zero; these parameters are then learned and gradually move away from zero during training. This ensures that the trainable copy contributes nothing to the output at the first training step and that its influence grows progressively as training continues. During training, only the parameters of the trainable copy and the zero-convolution layers are updated, keeping the pre-trained backbone intact. The outline of ControlNet is illustrated in Fig.2.
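To make this concrete, here is a minimal PyTorch sketch of the idea; the class and variable names are my own illustration, not the authors' code. A 1x1 convolution initialized to zero guards both the condition input and the trainable copy's output, so at initialization the wrapped block behaves exactly like the frozen backbone.

```python
import torch
import torch.nn as nn

class ZeroConv2d(nn.Conv2d):
    """1x1 convolution whose weights and bias start at exactly zero."""
    def __init__(self, channels):
        super().__init__(channels, channels, kernel_size=1)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)

class ControlledBlock(nn.Module):
    """A frozen pre-trained block plus a trainable copy guarded by zero convolutions."""
    def __init__(self, frozen_block, trainable_copy, channels):
        super().__init__()
        self.frozen = frozen_block
        for p in self.frozen.parameters():       # lock the pre-trained weights
            p.requires_grad = False
        self.copy = trainable_copy               # trainable clone of the block
        self.zero_in = ZeroConv2d(channels)      # guards the condition input
        self.zero_out = ZeroConv2d(channels)     # guards the copy's output

    def forward(self, x, condition):
        y = self.frozen(x)                               # unchanged backbone path
        y_ctrl = self.copy(x + self.zero_in(condition))  # copy sees input + condition
        return y + self.zero_out(y_ctrl)                 # equals y at initialization
```

Because both zero convolutions output zeros before any update, the first forward pass reproduces the pre-trained model exactly; the zero convolutions themselves still receive non-zero gradients, so their weights move away from zero and the copy's influence grows as training proceeds.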


Fig.2 Conceptual representation of ControlNet.


ControlNet can be applied to various diffusion models1, but in the original paper1 its capabilities are illustrated through an application to the Stable Diffusion model. Stable Diffusion operates in a latent image space instead of pixel space, which reduces the computational resources required for training and has also been shown to stabilize the training process3. Hence, the conditional-control feature map is first passed through a 4-layer convolutional neural network that transforms it into a 64x64 latent-space embedding, as shown in Fig. 3.
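As an illustration of that step, the sketch below builds a small 4-layer convolutional encoder that maps a 512x512 conditioning image into a 64x64 feature map. The kernel sizes, channel widths, and final number of output channels are my own assumptions for the sketch, not the exact values from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical 4-layer condition encoder: 512x512x3 image -> 64x64 latent-sized map.
condition_encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),   # 512 -> 256
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 256 -> 128
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # 128 -> 64
    nn.ReLU(),
    nn.Conv2d(64, 4, kernel_size=3, stride=1, padding=1),   # project to 4 channels (assumed)
)

c_lf = condition_encoder(torch.randn(1, 3, 512, 512))
print(c_lf.shape)  # torch.Size([1, 4, 64, 64])
```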

Fig.3 Feature map extraction from an image used as a conditional control for the Stable Diffusion Model.


The architecture of the Stable Diffusion model is built around a U-Net, whose three main parts are a stack of Encoder Blocks, a Middle Block, and a stack of Decoder Blocks, with skip connections linking each Encoder Block to its corresponding Decoder Block. ControlNet implements a trainable copy that includes only the Encoder Blocks and the Middle Block of the U-Net. The latent embedding of the control feature map, c_lf, is passed through a zero-convolution layer and added to the latent-space input of the pre-trained backbone, z_t. The resulting latent vector passes through the trainable Encoder Blocks and the Middle Block, whose outputs are added to the Decoder Blocks of the pre-trained backbone through zero-convolution layers, providing the desired conditional control to the Stable Diffusion model (Fig. 4).

The prediction of the combined model is the noise ε(t, z_t, c_lf, c_t) added to the image at a particular time step t; the prediction depends on the noisy input z_t, the text prompt c_t, and the conditional control c_lf. The objective function used to fit ControlNet is the same as the one used to train Stable Diffusion: minimize the L2 norm between the noise actually added to the image and the predicted noise. The trainable copy is optimized together with the 4-layer convolutional encoder, while the pre-trained backbone keeps its fixed parameters, which are unaffected by backpropagation. During training, the text prompt was omitted in 50% of cases to teach the model to infer semantics from the feature maps provided as conditional controls.
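Written out in the notation of this post, and following the objective given in the paper1, the training loss is the standard denoising objective with the extra condition attached:

```latex
\mathcal{L} \;=\; \mathbb{E}_{z_0,\; t,\; c_t,\; c_{lf},\; \epsilon \sim \mathcal{N}(0,1)}
\left[ \left\lVert \epsilon \;-\; \epsilon_\theta\!\left(z_t,\; t,\; c_t,\; c_{lf}\right) \right\rVert_2^{2} \right]
```

Here z_0 is the clean latent, z_t its noised version at time step t, and ε_θ the noise predicted by the combined frozen-plus-ControlNet model.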


Fig.4 ControlNet applied to Stable Diffusion Model.

Experiments and Results

One important experiment described in the ControlNet paper1 is an ablation study that illustrates the importance of the zero-convolution layers. The authors compared the original ControlNet to a variant in which the zero-convolution layers were replaced with standard convolutional layers. Another model in the ablation was ControlNet-Lite, a lightweight variant whose only trainable units are single convolution layers. The authors expected ControlNet-Lite to struggle to follow the provided conditional controls, so it can be seen as a baseline. The question the ablation answers is whether removing the zero-convolution layers degrades ControlNet's performance down to the level of this baseline. Fig.5 illustrates the performance of the three models in the "No Prompt", "Insufficient Prompt", "Conflicting Prompt" and "Perfect Prompt" settings. The original ControlNet performs well in all four, generating aesthetic and realistic images that strictly follow the provided conditional control (an edge map). In contrast, both ControlNet-Lite and the ControlNet without zero-convolution layers fail in the "No Prompt" and "Insufficient Prompt" settings. Their high performance in the "Perfect Prompt" and "Conflicting Prompt" settings stems from the pre-trained Stable Diffusion backbone being part of the model. The similarity of these two models, and their failure when the text prompt provides little or no information, shows that the zero-convolution layers are essential for ControlNet's performance.

Fig.5 Ablation study results                                  

ControlNet in Action

The range of available features

Now, let me show you how ControlNet can be used in real life! Let's tackle the problem I shared with you before: I'd like to convince Stable Diffusion to generate an image of a patient lying inside an MRI scanner. I have a sketch or a reference image of the picture I have in mind; finally, I can make good use of them and provide a control to the Stable Diffusion model. If I am diligent enough to draw a sketch, I can use my drawing right away and feed it to a sketch-based ControlNet, available online6. If I am not in the mood to make a sketch, I can take a reference image and use one of the feature extractors available alongside ControlNet online. The feature range includes the Canny edge map, Normal map, Depth map, Lineart map, Open pose map and Segmentation map (Fig.6). Some features work better for fine-grained images, others for landscapes and sceneries. The Lineart feature turned out to be the best for my reference image.

Fig.6 Feature extraction step.                                                        
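To give a flavour of this feature-extraction step, here is a small sketch that turns a reference photo into a Canny edge map with OpenCV; the file names are placeholders, and the online demo6 performs the analogous extraction for the other feature types (depth, line art, pose, segmentation).

```python
import cv2
import numpy as np
from PIL import Image

# Load a reference photo (placeholder path) and extract a Canny edge map.
image = cv2.imread("reference_photo.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, threshold1=100, threshold2=200)  # thresholds are worth tuning per image

# ControlNet expects a 3-channel control image, so replicate the edge map.
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))
control_image.save("canny_control.png")
```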

Changing contents of a photo

After opting for one of the available feature extractors, we can run ControlNet with or without a text prompt. Fig. 7 shows my results of generating images from a Lineart feature map. In the "No Prompt" setting, I got an aesthetic image of a girl inside an MRI scanner as an output, which perfectly fits my original goal: a patient accurately located inside the MRI scanner opening! Next, I controlled the contents of the output image by providing text prompts mentioning 'a baby', 'a woman', 'an old lady'. The images I received as ControlNet output followed both the feature-map control and the text-prompt control very well, even though they are not fully realistic.
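For readers who want to try this themselves, the sketch below shows one way to run a Lineart-conditioned Stable Diffusion with the Hugging Face diffusers library. The checkpoint names are examples of publicly shared weights and the control-image path is a placeholder, so treat this as a starting point rather than a definitive recipe.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Example checkpoints; swap in the ControlNet variant matching your feature map.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_lineart", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

control_image = Image.open("lineart_control.png")  # placeholder path to the feature map

# With a text prompt steering the contents...
with_prompt = pipe(
    "a woman lying inside an MRI scanner, photorealistic",
    image=control_image, num_inference_steps=30,
).images[0]

# ...or with an empty prompt, letting the feature map alone drive the semantics.
no_prompt = pipe("", image=control_image, num_inference_steps=30).images[0]

with_prompt.save("with_prompt.png")
no_prompt.save("no_prompt.png")
```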


Fig.7 Image generation from Lineart feature map. 



Changing the style of a PowerPoint slide

Next, I tried to change the style of a PowerPoint slide using ControlNet. With the Lineart feature map, in the "No Prompt" condition, the model struggled to infer the semantics of the input map, and I received a slide full of dishes with meat and celery as an output (Fig.8). When I provided a text prompt explicitly stating that the input was a PowerPoint slide containing brain scans and images of MRI scanners, the output improved a lot (Fig.9). I was able to change the style of the slide to a neon-lights style and an oil-painting style.

Here, I'd like to stress how important it is to choose a feature type appropriate for your task. In Fig.10, I tried to change the style of the same PowerPoint slide using the segmentation-map and depth-map feature extractors. Due to the nature of these features, all fine-grained details such as text or brain sulci were lost; the output generally followed the slide outline but hallucinated the contents of each slide section. Hence, it is important to choose a feature type that effectively captures the aspects one would like to transfer from the control image to the output image.

Fig.8  Changing the style of a PowerPoint slide without a prompt (feature type: Lineart)

Fig.9  Changing the style of a PowerPoint slide with a prompt: appropriate features used



Fig.10  Changing the style of a PowerPoint slide with a prompt: inappropriate features used


Potential application to bioimaging

One potential application of ControlNet in the field of bioimaging is resolution enhancement of functional Magnetic Resonance Imaging (fMRI) scans, using a structural MRI scan as a control. Fig. 11 shows the drastic difference in spatial resolution between these two imaging modalities. Importantly, fMRI datasets often include at least one structural MRI scan of each subject, so it would be natural to leverage the structural MRI data to enhance the spatial resolution of noisy fMRI scans.


Fig.11  Spatial resolution difference between structural and functional MRI scans.


As illustrated in Fig.12, we might reach this goal by applying ControlNet to a diffusion model pre-trained for a super-resolution task. We could provide the low-resolution fMRI scan as a condition to the pre-trained model and train a separate ControlNet using structural MRI scans as conditioning controls. This approach could not only increase the resolution of noisy fMRI scans but also inform them with anatomical detail from the structural MRI scans.



Fig.12  ControlNet could be used to enhance the spatial resolution of fMRI scans, using a structural MRI scan as a control.

Conclusion

ControlNet is a model that effectively learns conditional controls for text-to-image diffusion models by reusing large-scale pre-trained layers. A component essential to its performance is the zero-convolution layers, which protect the pre-trained backbone. ControlNet can steer Stable Diffusion with one or several conditions, with or without prompts, and is applicable to a wide range of diffusion models.

References:

  1. L. Zhang, A. Rao and M. Agrawala, "Adding Conditional Control to Text-to-Image Diffusion Models," 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2023, pp. 3813-3824, doi: 10.1109/ICCV51070.2023.00355.
  2. https://stablediffusionweb.com/
  3. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. "High-Resolution Image Synthesis with Latent Diffusion Models",  2022. https://doi.org/10.48550/arXiv.2112.10752
  4. Christoph Schuhmann, Romain Beaumont, and Jenia Jitsev et al. "LAION-5b: An open large-scale dataset for training next generation image-text models", 2022. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  5. Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation," arXiv preprint arXiv:2208.12242, 2022.
  6. https://huggingface.co/spaces/hysts/ControlNet-v1-1

ChatGPT's assistance

Prompts used during the preparation of this blog post:

  1. Please explain the theory behind Latent Diffusion Models.
  2. Please explain time step embedding method details in Stable Diffusion model.
  3. Please check the grammar and make this text concise.

