This blog post introduces Vision Transformers (ViT) and their applications, such as transformers for classification, image segmentation, and medical image segmentation. To set the stage, attention and self-attention, the building blocks of the transformer network, are covered first. Finally, the strengths and weaknesses of the approach are analyzed.
Attention for CNN
The attention idea started with CNN architectures: the network focuses on the more informative regions of the image in order to solve the main task.
Designing a network that pays more attention to the relevant areas of the input improves overall performance, as in Fig.1. Background information is not needed to detect the stop sign; if the model focuses on the region to be detected, both the performance and the robustness of the network improve.
Attention has also been used for image captioning[1]; Fig.2 highlights each word and its corresponding image region. The parts the model looks at while describing the scene are similar to the points a person would look at. Besides improving performance, this also makes the model's predictions easier to analyze.
Self-Attention
There are different ways of obtaining attention, such as adding a branch that predicts a mask or using Grad-CAM. Recently, the most popular approach has been self-attention[2], which models long-range dependencies over feature maps much better than earlier attention mechanisms. The overall self-attention architecture is depicted in Fig.3. There are three branches that project the same feature map into different spaces; the projection can be implemented with a fully connected (FC) layer or a 1x1 convolution. The first branch is called the query, the second the key, and the last the value.
First, the query and key are multiplied (matrix multiplication); the result measures the correlation between them. A softmax is applied to normalize it, and the normalized attention map shows the similarity between different image regions. Finally, a new feature map is constructed by multiplying the attention map with the value features.
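To make this concrete, here is a minimal NumPy sketch of the computation just described. All sizes and weight matrices are illustrative assumptions, not taken from any particular paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes: N spatial positions (e.g. H*W of a feature map), C channels.
N, C, d = 64, 256, 64          # d is the projection dimension (assumed)
x = np.random.randn(N, C)      # flattened feature map

# FC / 1x1-conv projections are just matrix multiplies on flattened features.
W_q, W_k, W_v = (np.random.randn(C, d) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

attn = softmax(Q @ K.T)        # (N, N): similarity between all pairs of positions
out = attn @ V                 # new feature map: each position aggregates all others
print(out.shape)               # (64, 64)
```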
Non-Local Neural Networks
Non-local neural networks are another example of self-attention, used for video action classification[3]. The architecture of the network is shown in Fig.4. There are three branches, as in self-attention, and the same operations are applied. Attention is even more meaningful for this model because videos have a temporal dimension, so attention should assign high values to similar parts across frames.
In Fig.4 there is a person and a ball whose positions change in every video frame. Attention should still highlight these objects even as they move, and tracking them this way contributes positively to the performance of the action classification model.
Sequence to Sequence Modelling
Self-attention is also used for sequence-to-sequence modeling, which became widely known through language-to-language translation. Before the transformer, RNNs and LSTMs were used; however, they struggle to model long-range dependencies, since RNNs suffer from the vanishing gradient problem and LSTMs tend to overfit.
If the sequence is very long, as in Fig.5, the first and last parts of the sequence have little effect on each other. In translation, however, such long-range interactions can be important: a distant word may matter more than a neighboring one.
An example can be seen in Fig.6: although "eating" and "green" are closer to each other in the sentence, the attention value between "eating" and "apple" is higher because they are more strongly related semantically. Self-attention handles such long-range dependencies better than RNNs and LSTMs.
Attention is All You Need - Transformer
Attention is All You Need[4] is the paper that introduced the transformer network. It uses a self-attention mechanism to model dependencies between distant positions and capture global relations without recurrence.
Fig.7(left) shows the scaled dot-product attention used in the original transformer model[4]; this self-attention structure is the core building block, which is then extended in the final architecture. Depending on the task, there is an optional mask operation that can filter out unwanted attention. There is also a scaling operation and a softmax block to normalize the scores, which yields the attention map. Finally, a matrix multiplication is applied between the attention map and the values. To increase the representation power, multi-head attention is used: the module in Fig.7(left) is replicated in parallel within the same layer, and the number of these parallel heads, h, controls how much representation power the layer gains.
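Written out, the formulation in Fig.7 corresponds to the scaled dot-product attention and its multi-head extension from [4]:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\qquad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\quad
\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V})
```

Here d_k is the key dimension (the source of the scaling factor mentioned above) and h is the number of parallel heads.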
Vision Transformer
The main topic of this blog, the Vision Transformer, applies the transformer to vision problems. It was introduced in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale[5]. The input is reorganized so that the NLP transformer can be reused: the image is split into patches, which are treated like words. According to the original ViT paper, ViT performs better when trained on larger datasets; without such large-scale pre-training, CNNs still outperform a vanilla ViT.
Fig.8 shows how ViT works. The input is split into small patches (the so-called patchifying operation), and each patch is projected into feature space by a linear projection layer. There is also a position embedding block. It is important because there is no convolution in this network: position embeddings must be added to preserve location information, assigning each patch a value that encodes where it sits relative to its neighbors. The position embeddings and patch embeddings are then passed through the transformer encoder, and the final output features are projected onto the classes.
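A rough PyTorch sketch of the patchifying and embedding step is shown below. The sizes (224x224 RGB image, 16x16 patches, embedding dimension 768) are only for illustration, and the strided convolution is one common, equivalent way to implement split-and-project; it is not necessarily how any particular codebase does it.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumed): 224x224 RGB image, 16x16 patches, embedding dim 768.
img = torch.randn(1, 3, 224, 224)
patch, dim = 16, 768
num_patches = (224 // patch) ** 2                        # 14 * 14 = 196

# Patchify + linear projection in one step: a conv with kernel = stride = patch size.
to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
tokens = to_patches(img).flatten(2).transpose(1, 2)      # (1, 196, 768)

# Prepend the learnable [class] token and add learnable position embeddings.
cls_token = nn.Parameter(torch.zeros(1, 1, dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
tokens = torch.cat([cls_token, tokens], dim=1) + pos_embed   # (1, 197, 768)
```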
The transformer encoder block combines multi-head self-attention with normalization layers: a normalization layer, a multi-head attention layer with a residual connection, another normalization layer, and an MLP with a residual connection. This is one block, and it is repeated L times to increase the depth of the transformer architecture.
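A minimal pre-norm encoder block matching this description might look as follows. The dimensions (768, 12 heads, MLP width 3072, L = 12) are the ViT-Base values and are used here only for illustration; this is a sketch, not the reference implementation.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-norm transformer encoder block, as described above (illustrative sizes)."""
    def __init__(self, dim=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, x):
        # Norm -> multi-head self-attention -> residual connection
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Norm -> MLP -> residual connection
        return x + self.mlp(self.norm2(x))

blocks = nn.Sequential(*[EncoderBlock() for _ in range(12)])   # repeated L = 12 times
```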
The transformer encoder itself is the same as in the language model; the components added for vision are the linear patch projection and the position embeddings. The last layer, the MLP head, is also different since it solves the classification problem.
The hybrid architecture is another variant of the vision transformer; the model is depicted in Fig.9. The orange layers are convolution layers, and the layers in between are transformer blocks. With this network, CNN features can be used instead of raw pixels, and the extracted CNN features carry more contextual information about the input image. The hybrid architecture lets us avoid relying solely on transformer blocks, since convolution has strengths of its own.
There are different strategies for training ViT. One strategy is to pre-train on a large dataset and then fine-tune for the downstream task; in this case, the prediction head is replaced for the new task. Another is to fine-tune at a higher resolution than was used for pre-training, which experiments show is often beneficial. When the model is fed higher-resolution images, the patch size stays the same, so the number of patches increases and the pre-trained position embeddings no longer match the new grid. Therefore, the pre-trained position embeddings are interpolated to the new size.
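A sketch of this interpolation step is shown below, assuming a ViT-style layout where the [class] token embedding comes first; the helper and its sizes are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Bilinearly interpolate pre-trained patch position embeddings to a new grid size.
    pos_embed: (1, 1 + old_grid**2, dim), with the [class] token embedding first (assumed layout)."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid), mode="bilinear",
                             align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pe, patch_pe], dim=1)

# e.g. pre-trained at 224px (14x14 patches), fine-tuned at 384px (24x24 patches)
new_pe = resize_pos_embed(torch.randn(1, 1 + 14 * 14, 768), 14, 24)
print(new_pe.shape)   # torch.Size([1, 577, 768])
```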
Fig.10 compares the performance of ViT, ResNet, and hybrid models against pre-training compute. The plot on the left shows the average over five different datasets. Comparing the results in both plots, ViT mostly outperforms ResNets and needs 2-4x less compute to reach results on par with the CNNs. It can also be seen that the hybrid network improves on the performance of pure transformers.
Representative examples of attention from the output token to the input space can be seen in Fig.11. These examples show that the model attends to image regions that are semantically relevant for classification.
Applications
Transformers in Classification
The Reversible Vision Transformer[6] is an example of using a transformer for image classification; the network architecture can be seen in Fig.12. It can also be used for video classification.
The formulation of the reversible transformation is shown in Fig.13, and it is adapted to the ViT model as in Fig.12. The input is partitioned into two tensors, and the transformation is applied so that the tensors remain reversible, as in Fig.12(a).
The two tensors then exchange information through the functions F and G. Both the reversible ViT and the reversible MViT are two-residual-stream architectures. The MViT model builds a feature hierarchy and is adapted to the reversible design: in Fig.12(b) there is a connection between the residual streams, together with channel upsampling and resolution downsampling operations. Most of the computational graph lies in the blocks of Fig.12(c), where information is propagated while the input feature size is maintained.
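As a schematic illustration, the two-stream reversible pattern can be sketched as below. F and G stand for the attention and MLP sub-blocks in the paper, but here they are replaced by toy linear layers; this is only meant to show why activations can be recomputed from the outputs instead of being stored, and it is not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Two-residual-stream reversible block (RevNet-style form; schematic only)."""
    def __init__(self, F, G):
        super().__init__()
        self.F, self.G = F, G

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2)      # mix information from the second stream into the first
        y2 = x2 + self.G(y1)      # then from the (updated) first stream into the second
        return y1, y2

    def inverse(self, y1, y2):
        # Inputs can be recomputed from the outputs, so activations need not be stored.
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2

blk = ReversibleBlock(nn.Linear(64, 64), nn.Linear(64, 64))   # toy F and G
a, b = torch.randn(2, 64), torch.randn(2, 64)
y1, y2 = blk(a, b)
x1, x2 = blk.inverse(y1, y2)
print(torch.allclose(x1, a, atol=1e-5), torch.allclose(x2, b, atol=1e-5))  # True True
```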
The proposed method increases efficiency compared with other methods, as shown in Fig.14. Model performance improves as network complexity increases, while the number of operations and parameters decreases according to the results; consequently, the model uses less memory and requires less computational power.
Transformers in Image Segmentation
Masked-attention Mask Transformer for Universal Image Segmentation[7] is an example of using a transformer for image segmentation; the network architecture can be seen in Fig.15.
The authors aim for a generalized transformer that can be used for three different segmentation tasks: panoptic, semantic, and instance segmentation. The backbone extracts features, which pass through the pixel decoder and then the transformer decoder. The model exploits high-resolution features from the pixel decoder by feeding one scale of the multi-scale features to one transformer decoder layer at a time. The output of the transformer decoder and one of the pixel decoder outputs are multiplied to obtain a mask.
The transformer decoder differs from the standard one: it contains a masked attention module, which experiments show improves both results and convergence. In addition, the order of self-attention and cross-attention is switched, which helps make the query features learnable, and dropout is removed to make computation more efficient.
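The core idea of masked attention is to restrict cross-attention to the foreground region predicted by the previous layer, by pushing the attention logits outside that region to minus infinity before the softmax. Below is a rough single-head sketch; tensor shapes and names are illustrative, not the paper's code.

```python
import torch

def masked_attention(Q, K, V, fg_mask):
    """Q: (num_queries, d), K/V: (num_pixels, d), fg_mask: (num_queries, num_pixels) bool,
    True where the previously predicted mask marks foreground (illustrative layout)."""
    logits = Q @ K.T / K.shape[-1] ** 0.5
    logits = logits.masked_fill(~fg_mask, float("-inf"))   # ignore background pixels
    return torch.softmax(logits, dim=-1) @ V

Q = torch.randn(100, 256)          # e.g. 100 object queries
K = V = torch.randn(64 * 64, 256)  # flattened pixel-decoder features
fg = torch.rand(100, 64 * 64) > 0.5
out = masked_attention(Q, K, V, fg)
print(out.shape)                   # torch.Size([100, 256])
```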
The panoptic segmentation results are shown in Fig.16, where the introduced method is compared with other architectures using different backbones. It achieves better results than previous work with less training time and fewer parameters, and it improves further with a different backbone. In the last configuration there is a much larger step up (67.4), but the number of parameters also increases.
The instance segmentation results are shown in Fig.17, where the proposed model is compared with other networks using different backbones. The overall results are better than the compared methods, improve further with a different backbone, and mostly require fewer parameters. On the other hand, the AP scores show that the model is still not good enough at segmenting small objects.
The semantic segmentation results are shown in Fig.18, where the proposed model is compared with other networks using different backbones. It performs better than the other methods.
Overall, the model achieves results on par with state-of-the-art CNN architectures for segmentation, uses three times less memory, requires less computational power, and converges faster than the other methods. On the other hand, it has problems segmenting small objects. It still needs to be trained for a specific task (e.g., panoptic segmentation) to get the best results, even though the goal is a generalizable model, and exploiting multi-scale features remains difficult.
Transformers in Medical Image Segmentation
Another application area of ViT is medical image segmentation. In Transformer based Generative Adversarial Network for Liver Segmentation[8], a transformer is used for liver segmentation; the network architecture can be seen in Fig.19.
The architecture follows an encoder-decoder design with transformer blocks in between. The encoder takes the input image and extracts convolutional features, the transformer blocks apply multi-head self-attention, and the decoder projects these features to the prediction mask. In this example, the model is trained in an adversarial framework: a discriminator network tries to distinguish between real and generated (fake) examples.
As shown in Fig.20, the compared methods achieve Dice coefficients very close to that of the transformer-based model. Interestingly, the model with the lowest Dice score achieves the highest precision, while the model with the highest Dice score achieves the highest recall. Although the number of samples is not large, combining the transformer with a GAN provides a slight improvement.
This work shows that transformers can be used as a generative network. The authors use a convolutional transformer[9]: whereas the vanilla transformer block uses fully connected layers for the query, key, and value projections, the convolutional transformer uses convolutions. This difference reduces training time and the memory footprint.
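One way such a convolutional projection can be realized, loosely in the spirit of CvT[9], is to reshape the token sequence back to a 2D grid and apply a depthwise-separable convolution instead of a fully connected layer. The sketch below is an illustration under these assumptions (kernel size, layout, and class name are my own), not the paper's implementation.

```python
import torch
import torch.nn as nn

class ConvProjection(nn.Module):
    """Depthwise-separable convolutional projection for Q/K/V, in the spirit of CvT[9]
    (a sketch; kernel size and layout are assumptions, not the paper's exact code)."""
    def __init__(self, dim=256, kernel=3):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel, padding=kernel // 2, groups=dim)  # depthwise
        self.pw = nn.Conv2d(dim, dim, 1)                                         # pointwise

    def forward(self, tokens, h, w):
        # tokens: (B, h*w, dim) -> 2D grid -> conv -> back to a token sequence
        B, N, C = tokens.shape
        x = tokens.transpose(1, 2).reshape(B, C, h, w)
        x = self.pw(self.dw(x))
        return x.flatten(2).transpose(1, 2)

proj = ConvProjection()
q = proj(torch.randn(2, 14 * 14, 256), 14, 14)
print(q.shape)   # torch.Size([2, 196, 256])
```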
Conclusion and Personal Review
Strengths
ViT reduces the reliance on inductive biases: due to its design, the model makes fewer image-specific assumptions. Convolution is a local operation that assumes local patterns are the most important, whereas ViT can focus on global structure.
ViT is more interpretable than CNNs, as shown in Fig.11. Because of its internal attention mechanism, we can inspect which parts of the image are used to make a prediction.
ViT is more scalable: it can reach results similar to the most popular CNN models, such as ResNet and Mask R-CNN, using less computation.
Weaknesses
- Some ViT variants still require long training times to reach their best results.
- Training on a large dataset is costly in terms of time and energy consumption, and convergence on large datasets is slow.
- ViT requires a large amount of memory to get better results, although recent models try to address this issue.
Comparison of the Papers
I discussed how ViT is used in different applications[6,7,8]. All of the aforementioned transformers require long training times. The first and second papers mostly focus on building a more generalizable vision transformer, whereas the third paper focuses only on image segmentation. When the results are merely close to those of CNN models, the number of parameters decreases, but obtaining better results comes at the cost of more parameters. The third paper reaches results on par with its compared methods using relatively few data samples, while most image transformers need large datasets just to get close to or slightly ahead of CNN models. Unlike the other papers, the results in the third paper are not very detailed. The second and third papers improve memory consumption compared to vanilla ViT. Although the compared studies aim for a generalizable model, this goal has not been fully achieved.
References
[1] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, 2015
[2] Han Zhang, Ian J. Goodfellow, Dimitris N. Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In ICML, 2019
[3] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018
[4] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 2017
[5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021
[6] Karttikeya Mangalam, Haoqi Fan, Yanghao Li, Chao-Yuan Wu, Bo Xiong, Christoph Feichtenhofer, and Jitendra Malik. Reversible vision transformers. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
[7] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
[8] Ugur Demir, Zheyuan Zhang, Bin Wang, Matthew Antalek, Elif Keles, Debesh Jha, Amir Borhani, Daniela Ladner, and Ulas Bagci. Transformer based generative adversarial network for liver segmentation. ICPAI, 2021
[9] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing convolutions to vision transformers. CoRR, abs/2103.15808, 2021