Abstract
Computer vision (CV) tasks were dominated by Convolutional Neural Network (CNN)-based architectures for many years. One of these tasks is segmentation, i.e., assigning every image pixel either to a countable instance object, such as a person, or to a background class, such as sky. Before Vaswani et al. enriched the field in 2017 with an encoder-decoder architecture called the Transformer, the research community observed a deficit of global attention in CNN architectures due to the locality of convolutions. The Transformer architecture, in contrast, naturally captures global context through self- and cross-attention and achieved state-of-the-art (SOTA) results in Natural Language Processing (NLP) despite a comparatively short development period. With the Vision Transformer, Dosovitskiy et al. showed in 2020 how Transformers can also be applied to CV tasks. In this work, we give an introduction to the key concepts that allow Transformer networks to be used for vision and, in particular, segmentation tasks. We first revisit the original Vision Transformer and then continue with major developments and concepts, for example Segmenter (Strudel et al.), Swin Transformer (Liu et al.), TransUNet (Chen et al.), Mask2Former (Cheng et al.) and the very recent SOTA segmentation methods Mask DINO (Li et al.) and Vision Transformer Adapter (ViT-Adapter) (Chen et al.). The latter two lead notable segmentation benchmarks, such as ADE20K (Mask DINO with 60.8 mIoU), Cityscapes (ViT-Adapter with 85.2 mIoU), PASCAL Context (ViT-Adapter with 68.2 mIoU) and COCO (Mask DINO with 54.5 mask AP on minival and 59.4 PQ on test-dev).
Introduction
In the field of computer vision, CNNs [13] have achieved state-of-the-art results over the past years and are still very competitive. Accordingly, intensive research effort has been spent on developing and fine-tuning CNN-based architectures, such as Residual Network (ResNet) [10], U-Net [21] and V-Net [17], and on exploring their natural performance limits. The success, but also the challenge, of convolutions lies in their local operation. Convolutions can learn image features of different sizes depending on their kernel size and naturally preserve the dimensionality of the input, i.e., 2D image convolutions operate on a 2D image. Their advantages are an easily scalable number of learnable features and a significant reduction of network parameters compared to Fully Connected Networks (FCNs). However, the locality of convolutions prevents the network from gaining a global context, i.e., from attending to image features of different scales and locations at once. A cascade of convolution layers, essentially a Deep Convolutional Neural Network, is one solution to increase the receptive field and to recognize image features at different scales. Nevertheless, CNNs with self-attention layers [24, 11, 26, 25] and especially recent attempts to use Transformers [23] for vision tasks [9] achieved promising results despite a short development period and thereby justified the need for global attention. The Transformer [23] is an attention-focused encoder-decoder architecture that models sequence-to-sequence translation tasks and therefore started its great success in NLP [1, 8, 19]. After its release, the research community found ways to model other applications, such as CV, as sequences in order to benefit from the powerful attention mechanism. While at the beginning mostly only the encoder was used to translate the input into a higher-dimensional embedding space, there nowadays also exist creative concepts that entail a full encoder-decoder architecture.
Related Work
Transformer
The Transformer architecture consists of an encoder and a decoder to translate input sequences into output sequences. Every sequence element is usually projected to an embedding in a D-dimensional space and summed with a positional encoding. The main ingredient for gaining global attention is the multi-head attention block (Figure 3). It projects the sequence embeddings to queries, keys and values and computes self-attention between every pair of sequence items within the encoder and the decoder. An additional multi-head attention layer in the decoder enables cross-attention between the encoder and decoder sequence embeddings. These computations result in softmax-normalized attention maps for the respective sequences (Figure 2). An FCN finalizes each attention layer of the encoder or decoder by learning and selecting from the attended features. The architecture can easily be scaled by cascading multiple such attention layers, as indicated in Figure 1 by the N\!\times. Residual connections [10] and normalization [12] layers improve the gradient flow and therefore help to stabilize network training. Because every pair of items within the sequences is compared, the computational cost of a Transformer is quadratic with respect to the sequence length.
Figure 1: Transformer overall architecture [23]. | Figure 2: Scaled dot-product attention [23]. | Figure 3: Multi-head attention [23]. |
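To make the attention computation concrete, the following minimal PyTorch sketch implements the scaled dot-product attention of Figure 2. The function name, tensor shapes and the optional mask argument are our own illustrative choices, not the original implementation.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, sequence length, d_k)
    d_k = q.size(-1)
    # pairwise similarity of all query/key items, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        # optional masking, e.g. for padded tokens or causal decoding
        scores = scores.masked_fill(mask == 0, float("-inf"))
    attn = F.softmax(scores, dim=-1)   # softmax-normalized attention map
    return attn @ v                    # weighted sum of the values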
Architectures
Encoder
Since the publication of the Vision Transformer [9], the decomposition of an input image into a sequence of encoded patches has been a standard preprocessing step for a Transformer-based encoder. In pure Transformer architectures, the original input image serves as the basis for this decomposition, while in hybrid encoder architectures the input image is often first processed by a convolution-based network, for example a pre-trained ResNet. The decomposition is then performed on the resulting feature map instead of the input image directly. The idea of hybrid approaches is to combine the advantages of local feature generation by convolution kernels with the global attention mechanism of Transformers. We will see that the Swin Transformer [16] mimics the local processing of convolutions in its adapted multi-head attention modules.
Patch encoding
In general, an input image x \in \mathbb{R}^{H\times W\times C} is split into N patches. H and W describe the image height and width in pixels, while C stands for the number of channels, e.g., 3 for RGB. With a patch size of P\times P, we obtain a sequence of N=\frac{HW}{P^2} patches: [x_1,...,x_N] \in \mathbb{R}^{N\times P^2\times C}. Hence each patch, in the context of Transformers also called a token, has a feature size of P^2 \times C. It is first flattened to a 1D vector and then linearly projected to a D-dimensional embedding.
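As a sketch of this decomposition, the following PyTorch module (illustrative code with hypothetical names and default values) splits an image into P x P patches, flattens them and projects each one to a D-dimensional embedding:

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches and project each to a D-dimensional embedding."""
    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size * patch_size * in_channels, embed_dim)

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # extract non-overlapping P x P patches: (B, C, H/P, W/P, P, P)
        x = x.unfold(2, P, P).unfold(3, P, P)
        # flatten each patch to a 1D vector: (B, N, P*P*C) with N = H*W / P^2
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        return self.proj(x)                    # (B, N, D)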
Positional encoding
One of the controversial topics in the early use of Transformers for vision is whether there should be an inductive bias about the nature of 2D images, i.e., whether a positional encoding should be included and what it should look like [9]. Dosovitskiy et al. [9] argued that such an inductive bias is not necessary given enough training data. A positional encoding is therefore learnt in the Vision Transformer, as well as in Segmenter, where D-dimensional positional embeddings are added to the previously computed sequence of patch embeddings: [x_1+pos_1,...,x_N+pos_N] \in \mathbb{R}^{N\times D}. As shown in Figure 4, the encoded patch sequence is finally processed by several multi-head self-attention layers following the originally proposed Transformer architecture. The self-attention between every pair of patches provides a global information context over the whole input image, respectively feature map.
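A learned positional encoding can be sketched as a single trainable parameter that is added to the patch embeddings; the module name and the zero initialization below are our own assumptions:

import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    """Adds a learnable D-dimensional position embedding to every patch token."""
    def __init__(self, num_patches, embed_dim):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):          # x: (B, N, D) patch embeddings
        return x + self.pos        # broadcast over the batch dimension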
Figure 4: Encoder of Segmenter [22] architecture, demonstrating the processing of an image patch sequence. | Figure 5: Swin Transformer [16] encoder, illustrating the hierarchical representation of the input image with patch merging. |
Hierarchical representation
The Swin Transformer goes one step further. After processing the first set of output embeddings with the Transformer architecture, the number of patches is reduced by merging neighborhoods of 2\times 2 = 4 patches (Figure 5). This corresponds to a reduction of the spatial resolution by a factor of two, analogous to the hierarchical representation of convolutional networks, for example with strided convolutions [16]. Information about small image structures like edges can therefore be condensed over larger image regions in order to recognize subjects and objects.
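The merging step can be sketched as follows; in line with the description above, the four neighboring tokens are concatenated and linearly reduced, here from 4C to 2C channels (a common choice, taken as an assumption):

import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Merge every 2x2 neighborhood of patch tokens into a single token."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, h, w):    # x: (B, h*w, C) tokens on an h x w grid
        B, _, C = x.shape
        x = x.view(B, h, w, C)
        # collect the four tokens of each 2x2 block and concatenate their features
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)   # (B, h/2, w/2, 4C)
        x = x.view(B, -1, 4 * C)
        return self.reduction(self.norm(x))                           # (B, h*w/4, 2C)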
Shifted windows
The self-attention mechanism of Transformers requires computing the relationship between every token and all other tokens. Hence, if a token represents an image pixel or a fixed-size patch, for example a 4\times 4 pixel image patch, the computational complexity grows quadratically with the image size. Considering the dimensionality of 2D/3D images or even video sequences, the use of Transformers is therefore more challenging than for NLP applications with comparatively short text sequences. Shifted windows (Swin), introduced in the Swin Transformer [16], tackle this problem by restricting self-attention to non-overlapping windows laid over the image. Instead of attending over all patches globally, the Swin Transformer computes attention within windows of M\times M patches each, which results in linear computational complexity with respect to the image size [16]. A two-step shifting operation, i.e., the alternation between two window partitioning configurations as shown in Figure 6, allows cross-connectivity between the windows. The proposed cyclic shift is designed such that the windows keep their size in both configurations, which avoids padding and allows efficient batched computation [16]. Connections between tokens that are not actually adjacent in the image, for example those introduced by the cyclic shift across image borders (e.g., top with bottom), are masked out during the attention computation. These window computations are encapsulated in the Swin Transformer block, which is otherwise entirely based on the original Transformer blocks. Together with the patch merging blocks, these components form the Swin Transformer architecture (Figure 7) for image feature encoding.
Figure 6: Cyclic shift of the window partition along with masking to respect the natural neighborhoods of the 2D image [16]. |
Figure 7: Swin Transformer overall architecture [16]. |
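The window partitioning and the cyclic shift described above can be sketched as follows (illustrative code; the attention mask that removes the artificial connections across image borders is omitted for brevity):

import torch

def window_partition(x, M):
    """Split a (B, H, W, C) feature map into non-overlapping M x M windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    # (num_windows * B, M*M, C): each window becomes one short attention sequence
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

def cyclic_shift(x, M):
    """Shift the feature map by half a window size so that the next block
    attends over a shifted window partition (the roll is reversed afterwards)."""
    return torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2))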
Decoder
The second part of a Transformer-based architecture for image segmentation is the decoding of the feature embeddings generated by the encoder. Again, several different approaches are available to project the embedding space to segmentation masks.
Naive upsampling
A naive solution is to linearly map the patch encodings z \in \mathbb{R}^{N\times D} to class logits. Assuming there are K segmentation classes, a point-wise linear layer reduces the D-dimensional feature space to class scores \hat{z} \in \mathbb{R}^{N\times K}. Afterwards the patch sequence is reshaped back into a 2D feature map and bilinearly upsampled to the original image size. A softmax layer finally converts the logits to a segmentation map [22, 27].
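A sketch of this naive decoder in PyTorch (illustrative names and arguments; the final softmax is included to produce the pixel-wise class scores):

import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearDecoder(nn.Module):
    """Point-wise linear classification of patch tokens, then bilinear upsampling."""
    def __init__(self, embed_dim, num_classes, patch_size):
        super().__init__()
        self.head = nn.Linear(embed_dim, num_classes)
        self.patch_size = patch_size

    def forward(self, z, im_size):                 # z: (B, N, D), im_size: (H, W)
        H, W = im_size
        logits = self.head(z)                      # (B, N, K) class logits per patch
        h, w = H // self.patch_size, W // self.patch_size
        logits = logits.transpose(1, 2).reshape(z.size(0), -1, h, w)   # (B, K, h, w)
        logits = F.interpolate(logits, size=(H, W),
                               mode="bilinear", align_corners=False)   # (B, K, H, W)
        return logits.softmax(dim=1)               # pixel-wise class scores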
Convolution-based upsampling
Instead of a bilinear upsampling, SEgmentation TRansformer (SETR) [27] proposes two additional decoder designs using multiple convolution and upsampling layers. Figure 8 shows the SETR-PUP decoder, using alternating convolutions and upsampling by a factor of two.
Figure 8: Architecture of CNN-based decoder SETR-PUP [27]. |
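A sketch of such a progressive upsampling decoder is given below; the number of stages and the channel halving per stage are our own simplifying assumptions rather than the original SETR-PUP configuration:

import torch.nn as nn

class ProgressiveUpsamplingDecoder(nn.Module):
    """Alternate 3x3 convolutions with 2x bilinear upsampling, then classify per pixel."""
    def __init__(self, embed_dim, num_classes, num_stages=4):
        super().__init__()
        layers, channels = [], embed_dim
        for _ in range(num_stages):
            layers += [
                nn.Conv2d(channels, channels // 2, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels // 2),
                nn.ReLU(inplace=True),
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            ]
            channels //= 2
        self.blocks = nn.Sequential(*layers)
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, feat):       # feat: (B, D, h, w) reshaped patch embeddings
        return self.classifier(self.blocks(feat))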
U-Net architecture
The CNN-based U-Net [21] architecture is a common choice, especially in medical image segmentation. With TransUNet [3], Chen et al. propose a U-Net whose contracting path feeds downsampled image features into a Transformer (Figure 9). The feature map produced at the deepest level of the U-Net encoder is converted into a 2D patch sequence and processed by a Transformer encoder. The resulting output embeddings contain global attention context over the high-level features and help to reconstruct the semantically relevant information during the expanding (decoder) path. The skip connections of the U-Net allow segmentation masks to be constructed at high resolution.
Figure 9: TransUNet overall architecture [3]. |
Pure Transformer
More recently, there have also been efforts to use Transformers not only for encoding but also for decoding. As Transformers are a sequence-to-sequence translation method, an initial challenge is how a discrete set of outputs, e.g., segmentation masks, can be modelled. DEtection TRansformer (DETR) [2] introduces end-to-end object detection with Transformers (Figure 10). DETR uses bipartite matching to directly predict a set of outputs. Compared to other successful object detection methods like Faster R-CNN [20], DETR does not need Non-Maximum Suppression (NMS) to filter duplicate detection proposals [2]. The discrete set of output predictions is produced from so-called object queries and is predicted in parallel, instead of sequentially as in the original Transformer implementation [2]. The decoded object queries are finally transformed via an FCN into an object class and relative coordinates describing the surrounding detection rectangle. DETR always predicts a fixed-size set of object queries, so in the common situation of fewer actual objects, DETR assigns a no-object class to the remaining queries.
Figure 10: DETR overall architecture [2]. |
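The bipartite matching between the fixed-size prediction set and the ground-truth objects can be sketched with the Hungarian algorithm. This simplified cost combines only the class probability and an L1 box distance, whereas DETR additionally uses a generalized IoU term; names and the weight value are illustrative assumptions:

import torch
from scipy.optimize import linear_sum_assignment

def bipartite_match(pred_logits, pred_boxes, tgt_classes, tgt_boxes, l1_weight=5.0):
    """Return the one-to-one assignment of predictions to ground-truth objects."""
    probs = pred_logits.softmax(-1)                    # (num_queries, num_classes + 1)
    cost_class = -probs[:, tgt_classes]                # (num_queries, num_targets)
    cost_box = torch.cdist(pred_boxes, tgt_boxes, p=1) # pairwise L1 box distance
    cost = cost_class + l1_weight * cost_box
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return rows, cols   # matched (query index, ground-truth index) pairs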
Based on the concepts of DETR, Strudel et al. presented with the Mask Transformer [22] a Transformer-based decoder, which they use in combination with their Transformer-based encoder (Figure 11). The Mask Transformer decodes the patch embeddings generated by the encoder and predicts a discrete, fixed-size set of class embeddings from randomly initialized, learnable class queries. We denote the decoded patch embeddings by z_M \in \mathbb{R}^{N\times D} and the predicted class embeddings by c \in \mathbb{R}^{K\times D} for K object classes. The product z_Mc^T \in \mathbb{R}^{N\times K} yields a matrix describing K segmentation mask scores for each image patch. After reshaping the patch sequence back into a 2D mask, a bilinear upsampling layer projects the segmentation masks to the original input image size H\times W. As a last step, softmax and layer norm generate the final segmentation maps s \in \mathbb{R}^{H\times W\times K} with pixel-wise class scores, where all masks are softly exclusive to each other, i.e. \sum^{K}_{k=1} s_{i,j,k} = 1\;\;\forall\;(i,j) \in H\times W. The class embeddings refer to semantic segmentation, but can easily be exchanged for object instance embeddings in panoptic segmentation tasks [22].
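The mask generation from patch and class embeddings can be sketched as follows (illustrative code; the layer norm of the original method is omitted and a square patch grid is assumed):

import torch
import torch.nn.functional as F

def segmenter_masks(z_m, c, im_size):
    """z_m: (B, N, D) decoded patch embeddings, c: (B, K, D) class embeddings."""
    B, N, D = z_m.shape
    H, W = im_size
    masks = torch.einsum("bnd,bkd->bnk", z_m, c)       # (B, N, K) patch-level mask scores
    h = w = int(N ** 0.5)                              # assumes a square patch grid
    masks = masks.transpose(1, 2).reshape(B, -1, h, w) # (B, K, h, w)
    masks = F.interpolate(masks, size=(H, W), mode="bilinear", align_corners=False)
    return masks.softmax(dim=1)                        # classes softly exclusive per pixel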
Figure 11: Segmenter [22] overall architecture consisting of a pure Transformer encoder and decoder based on the DETR object-queries concept. |
Cheng et al. proposed with Mask2Former [5] a non-specialized segmentation framework capable of achieving good results on any segmentation task, i.e., semantic, instance and panoptic segmentation. Mask2Former constrains the cross-attention of the decoder to the foreground region of the mask predicted by the previous layer (Figure 13) and therefore extracts localized image features. In addition, feature maps from the pixel decoder are fed into the Transformer decoder layers in a round-robin fashion. The dot product of the resulting N mask embeddings from the Transformer decoder with the highest-resolution per-pixel embeddings of the pixel decoder generates the segmentation masks. Mask2Former inherits this procedure from its predecessor MaskFormer [6], illustrated in Figure 12.
Figure 12: MaskFormer overall architecture [6]. |
Figure 13: Mask2Former [5] architecture, showing the round robin inclusion of feature maps from the pixel decoder pyramid to generate high-resolution segmentation masks (left). A masked attention adaptation in the multi-head attention layer restricts the attention context to local masks (right). |
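The masked attention of Figure 13 (right) can be sketched as a cross-attention whose scores receive an additive -inf bias at background locations of the previously predicted masks. Names follow the description above, the 0.5 threshold and the fallback for empty masks are our own assumptions:

import torch

def masked_cross_attention(queries, keys, values, mask_probs, threshold=0.5):
    """queries: (B, Q, D), keys/values: (B, HW, D), mask_probs: (B, Q, H, W)."""
    d = queries.size(-1)
    scores = queries @ keys.transpose(-2, -1) / d ** 0.5     # (B, Q, HW)
    fg = mask_probs.flatten(2) >= threshold                  # foreground of previous masks
    bias = torch.zeros_like(scores).masked_fill(~fg, float("-inf"))
    # if a predicted mask is entirely background, fall back to full attention
    empty = fg.sum(dim=-1, keepdim=True) == 0
    bias = bias.masked_fill(empty, 0.0)
    attn = (scores + bias).softmax(dim=-1)                   # attention restricted to foreground
    return attn @ values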
The basic concept of the Mask2Former segmentation approach is also adopted in the very recent Mask DINO [14]. Although the object queries in Mask DINO are designed to predict box offsets and class membership, they are additionally used in a dot product with the per-pixel embedding map to yield state-of-the-art segmentation results, i.e., at the time of writing first rank in the ADE20K [28, 29] semantic segmentation (60.8 mIoU), COCO minival [15] instance segmentation (54.5 mask AP) and COCO test-dev panoptic segmentation (59.4 PQ) benchmarks (Figure 15). The pixel embedding map is a fusion of the highest-resolution feature map of the backbone with an upsampled 2D map of the final encoder embeddings. This allows Mask DINO to combine local, high-resolution features from the backbone with global context information from the Transformer encoder (Figure 14).
Figure 14: Mask DINO overall architecture [14]. |
Figure 15: Mask DINO state-of-the-art results in semantic, instance and panoptic segmentation [14]. |
Vision Transformer Adapter (ViT-Adapter) for Dense Predictions [4] is yet another work that adopts Mask2Former as its segmentation head and yields state-of-the-art semantic segmentation results on the ADE20K [28, 29] (60.5 mIoU), Cityscapes test [7] (85.2 mIoU), COCO-Stuff test [15] (54.2 mIoU) and PASCAL Context [18] (68.2 mIoU) benchmarks (Figures 16, 17, 18).
Figure 16: Semantic segmentation results of ViT-Adapter in ADE20K benchmark [4]. |
Figure 17: Semantic segmentation results of ViT-Adapter in Cityscapes test benchmark [4]. |
Figure 18: Semantic segmentation results of ViT-Adapter in COCO-Stuff (left) and PASCAL Context (right) benchmarks [4]. |
Datasets
ADE20K dataset
ADE20K [28, 29] is a comprehensive scene parsing benchmark with 20,210 densely annotated images for training, 2,000 for validation and 3,000 for testing. The object categories form a tree of depth three, so even parts of objects and, in turn, parts of these parts are annotated. This makes the benchmark a challenging semantic segmentation task.
PASCAL Context dataset
The PASCAL Context [18] dataset comprises 10,103 trainval images of the PASCAL VOC 2010 detection challenge. Every image has pixel-wise annotations from a pool of 540 possible categories, divided into the three types objects, stuff and hybrids.
Cityscapes dataset
The Cityscapes [7] dataset contains 5,000 high-resolution (\(2048\times 1024\)) images of urban scenes with 30 annotated object categories. It is split into 2,975, 500 and 1,525 samples for training, validation and testing, respectively, and additionally provides 20,000 coarsely annotated images for further training.
COCO dataset
Common Objects in Context (COCO) [15] is an object detection, segmentation, key-point detection and captioning dataset. It contains 80 object and 91 stuff categories in more than 328,000 images.
Conclusion
We have seen powerful designs and benchmark results that allow us to conclude that Transformers clearly advance the field of computer vision. They are often successfully used in combination with feature maps created by large, pre-trained CNN backbone networks. In this combination they complement the locality of convolutions with global attention and thereby extend the receptive field to the whole image right away. We can also recognize a strong research direction of increasing the inductive bias, i.e., providing the network with more information about the vision task, for example the image patch location. One key idea of Chen et al. with the ViT-Adapter is inspiring: first pre-train the network extensively on all sorts of data, e.g., text, images and audio, to benefit from more available data and better generalization, and then fine-tune it on the specific task at hand. Their results show that this can be a successful path and it could potentially lead to stronger general AI.
References
[1] Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. “Language Models Are Few-Shot Learners.” arXiv, July 22, 2020. https://doi.org/10.48550/arXiv.2005.14165.
[2] Carion, Nicolas, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. “End-to-End Object Detection with Transformers.” arXiv, May 28, 2020. https://doi.org/10.48550/arXiv.2005.12872.
[3] Chen, Jieneng, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L. Yuille, and Yuyin Zhou. “TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation.” arXiv, February 8, 2021. http://arxiv.org/abs/2102.04306.
[4] Chen, Zhe, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. “Vision Transformer Adapter for Dense Predictions.” arXiv, May 17, 2022. http://arxiv.org/abs/2205.08534.
[5] Cheng, Bowen, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. “Masked-Attention Mask Transformer for Universal Image Segmentation.” arXiv, December 10, 2021. http://arxiv.org/abs/2112.01527.
[6] Cheng, Bowen, Alexander G. Schwing, and Alexander Kirillov. “Per-Pixel Classification Is Not All You Need for Semantic Segmentation.” arXiv, October 31, 2021. http://arxiv.org/abs/2107.06278.
[7] Cordts, Marius, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. “The Cityscapes Dataset for Semantic Urban Scene Understanding.” In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3213–23. Las Vegas, NV, USA: IEEE, 2016. https://doi.org/10.1109/CVPR.2016.350.
[8] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv, May 24, 2019. https://doi.org/10.48550/arXiv.1810.04805.
[9] Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, et al. “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.” arXiv, June 3, 2021. https://doi.org/10.48550/arXiv.2010.11929.
[10] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition.” arXiv, December 10, 2015. https://doi.org/10.48550/arXiv.1512.03385.
[11] Huang, Zilong, Xinggang Wang, Yunchao Wei, Lichao Huang, Humphrey Shi, Wenyu Liu, and Thomas S. Huang. “CCNet: Criss-Cross Attention for Semantic Segmentation.” arXiv, July 9, 2020. https://doi.org/10.48550/arXiv.1811.11721.
[12] Ioffe, Sergey, and Christian Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” ArXiv:1502.03167 [Cs], March 2, 2015. http://arxiv.org/abs/1502.03167.
[13] LeCun, Yann, Leon Bottou, Yoshua Bengio, and Patrick Haffner. “Gradient-Based Learning Applied to Document Recognition.” Proceedings of the IEEE 86, no. 11 (1998): 2278–2324.
[14] Li, Feng, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M. Ni, and Heung-Yeung Shum. “Mask DINO: Towards A Unified Transformer-Based Framework for Object Detection and Segmentation.” arXiv, June 6, 2022. http://arxiv.org/abs/2206.02777.
[15] Lin, Tsung-Yi, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. “Microsoft COCO: Common Objects in Context.” arXiv, February 20, 2015. https://doi.org/10.48550/arXiv.1405.0312.
[16] Liu, Ze, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows.” arXiv, August 17, 2021. https://doi.org/10.48550/arXiv.2103.14030.
[17] Milletari, Fausto, Nassir Navab, and Seyed-Ahmad Ahmadi. “V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation.” arXiv, June 15, 2016. https://doi.org/10.48550/arXiv.1606.04797.
[18] Mottaghi, Roozbeh, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. “The Role of Context for Object Detection and Semantic Segmentation in the Wild.” In 2014 IEEE Conference on Computer Vision and Pattern Recognition, 891–98. Columbus, OH, USA: IEEE, 2014. https://doi.org/10.1109/CVPR.2014.119.
[19] Otter, Daniel W., Julian R. Medina, and Jugal K. Kalita. “A Survey of the Usages of Deep Learning in Natural Language Processing.” arXiv, December 21, 2019. https://doi.org/10.48550/arXiv.1807.10854.
[20] Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.” arXiv, January 6, 2016. https://doi.org/10.48550/arXiv.1506.01497.
[21] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. “U-Net: Convolutional Networks for Biomedical Image Segmentation.” arXiv, May 18, 2015. https://doi.org/10.48550/arXiv.1505.04597.
[22] Strudel, Robin, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. “Segmenter: Transformer for Semantic Segmentation.” arXiv, September 2, 2021. https://doi.org/10.48550/arXiv.2105.05633.
[23] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. “Attention Is All You Need.” arXiv, December 5, 2017. https://doi.org/10.48550/arXiv.1706.03762.
[24] Wang, Qilong, Banggu Wu, Pengfei Zhu, Peihua Li, Wangmeng Zuo, and Qinghua Hu. “ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks.” arXiv, April 7, 2020. https://doi.org/10.48550/arXiv.1910.03151.
[25] Wang, Xiaolong, Ross Girshick, Abhinav Gupta, and Kaiming He. “Non-Local Neural Networks.” arXiv, April 13, 2018. https://doi.org/10.48550/arXiv.1711.07971.
[26] Woo, Sanghyun, Jongchan Park, Joon-Young Lee, and In So Kweon. “CBAM: Convolutional Block Attention Module.” arXiv, July 18, 2018. https://doi.org/10.48550/arXiv.1807.06521.
[27] Zheng, Sixiao, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, et al. “Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers.” arXiv, July 25, 2021. https://doi.org/10.48550/arXiv.2012.15840.
[28] Zhou, Bolei, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. “Scene Parsing through ADE20K Dataset.” In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5122–30. Honolulu, HI: IEEE, 2017. https://doi.org/10.1109/CVPR.2017.544.
[29] Zhou, Bolei, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. “Semantic Understanding of Scenes through the ADE20K Dataset.” arXiv, October 16, 2018. http://arxiv.org/abs/1608.05442.