I. Introduction
Semantic segmentation is a computer vision problem that involves taking raw data (e.g., 2D images) as input and converting it into a mask with regions of interest highlighted. It can be viewed as a pixel-level classification problem, where a class label is predicted for each pixel of the input image.
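To make the pixel-level classification view concrete, below is a minimal sketch (in PyTorch, not taken from the paper; the shapes and the 19-class label set are illustrative assumptions) showing how a segmentation network's per-pixel class scores are trained with a pixel-wise cross-entropy loss and converted to a label mask.

```python
# A minimal sketch (assumption: PyTorch) of segmentation as per-pixel
# classification: the network outputs one score per class at every pixel,
# and a standard cross-entropy loss is applied pixel-wise.
import torch
import torch.nn.functional as F

num_classes = 19                                       # e.g., the Cityscapes label set
logits = torch.randn(2, num_classes, 128, 128)         # (batch, classes, H, W) network output
labels = torch.randint(0, num_classes, (2, 128, 128))  # (batch, H, W) ground-truth class ids

loss = F.cross_entropy(logits, labels)                 # averaged over all pixels
mask = logits.argmax(dim=1)                            # predicted class label per pixel
print(loss.item(), mask.shape)                         # mask: torch.Size([2, 128, 128])
```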
Among the numerous approaches that have been made to solve this problem, the most popular ones involve FCN-based encoder-decoder architectures. In this paper, however, the authors approach semantic segmentation as a sequence-to-sequence problem and use a transformer-only architecture (SETR) for feature extraction.
II. Motivation
Since the publication of the “Fully Convolutional Networks for Semantic Segmentation” paper, FCN-based models have strongly influenced most of the approaches to semantic segmentation. A standard FCN follows an encoder-decoder architecture, where the encoder learns the feature representation and the decoder makes pixel-level predictions based on the representation learnt by the encoder.
Figure 1: Convolutional Encoder-Decoder for Semantic Segmentation.
Like most CNNs, the encoder is built from stacked convolutional layers. To keep the computational cost manageable, the spatial dimension of the feature map is reduced gradually, which allows learning both low-level and high-level features based on a progressively increasing receptive field. While this design works well for most image understanding problems, it also brings a limitation for semantic segmentation: although the receptive field increases gradually, it remains limited, so the model struggles with long-range dependencies, which are an essential property of a good semantic segmentation model. Different approaches have been proposed to tackle this problem, including larger kernel sizes, feature pyramids[16] and dilated convolutions[15]. There have also been approaches[17] to integrate attention modules within FCNs, which try to model the global interactions among pixels in the feature map.
There have even been attempts[8] to remove convolutions altogether and build a model with an attention-only encoder. However, even those models are not free from the influence of the FCN design, in which the encoder progressively reduces the spatial dimension to obtain a latent feature representation and the decoder upsamples it back to the original dimension. In this paper, the authors replace the stacked-convolution encoder (with its gradual reduction in spatial size) with a pure transformer. This not only brings a different perspective to semantic segmentation, but also pushes the state of the art on some of the most competitive datasets.
III. Background and methodology
III. 1. FCN-based semantic segmentation
An FCN[18] encoder is built upon a stack of sequentially connected convolutional layers. The first layer takes an H × W × 3 image as input and outputs an h × w × d tensor, where h, w and d represent the height, width and channel dimension respectively. Applying the convolutional layers sequentially allows one location in the tensors of later layers to hold information from many locations in the tensors of previous layers; the extent of this is called the receptive field.
With increasing depth, the spatial dimension typically decreases while the receptive field increases linearly, due to the local nature of the convolution operation. As a result, in the FCN architecture, long-range dependencies can be modelled only by higher layers with large receptive fields. However, it has been shown that the benefits of adding more layers diminish rapidly once a certain depth is reached. Having a limited receptive field for context modelling is thus an intrinsic limitation of the vanilla FCN architecture.
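To make the linear growth concrete, here is a minimal sketch (plain Python, not from the paper; the layer configurations are illustrative assumptions) that computes the theoretical receptive field of a stack of convolutions using the standard recurrence r_l = r_{l-1} + (k_l - 1) · j_{l-1}, with the cumulative stride j_l = j_{l-1} · s_l.

```python
# Minimal sketch (not from the paper): receptive-field growth of stacked
# convolutions, illustrating why an FCN needs many layers (or striding)
# before one output location can "see" a large image region.

def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, applied in order."""
    rf, jump = 1, 1                    # receptive field and cumulative stride ("jump")
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Ten 3x3, stride-1 convolutions: the receptive field grows only linearly.
print(receptive_field([(3, 1)] * 10))          # -> 21 pixels
# Interleaving stride-2 downsampling grows it much faster, but at the cost
# of spatial resolution -- the trade-off discussed above.
print(receptive_field([(3, 2), (3, 1)] * 5))   # -> 187 pixels, at 1/32 resolution
```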
III. 2. Transformers
Before the advent of the transformer[19], recurrent neural network based models, specifically LSTM/GRU, were the state of the art for sequence-based machine learning problems. The paper “Attention Is All You Need”[19] introduced the transformer, a sequence-to-sequence architecture that handles long-range dependencies within sequences. Owing to this architecture, transformers are particularly good at natural language processing, and quickly became the state of the art for NLP and sequence-based problems in general. However, transformers had not really been utilised to their full potential in image understanding.
Figure 2: Architecture of a transformer
Since CNN-based models have limited receptive fields, they cannot handle the long-range dependencies that are pivotal in semantic segmentation tasks. As transformers can handle long-range dependencies, they are a natural candidate for solving semantic segmentation problems.
Figure 3: Intuition of attention module
More recently, the pure vision transformer, or ViT [1], has been used effectively for image classification. It provides direct evidence that the traditional stacked-convolution (i.e., CNN) design can be rethought, and that image features are not bound to be learned progressively from local to global context by reducing spatial resolution. However, extending a pure transformer from image classification to semantic segmentation, a spatial-location-sensitive task, is non-trivial. Some state-of-the-art methods have found the attention mechanism particularly effective for long-range context learning when combined with an FCN architecture. These methods work well, but since they restrict attention learning to higher layers due to the quadratic complexity of attention with respect to the number of pixels, they still have limitations in learning dependencies.
III. 3. Segmentation Transformers (SETR)
The segmentation transformer (SETR) is heavily inspired by the attention modules used in NLP problems, and hence follows the same input-output structure as NLP models (a 1D sequence). As the image spatial dimension is 2D, processing is needed before an image can be fed to the model. As the figure suggests, the transformer takes a 1D feature embedding Z \in R^{L × C}, where L and C are the length of the embedding sequence and the channel dimension respectively. The most trivial way to sequentialise an H × W × 3 image is to flatten it into a 1D array of length 3HW. However, with this procedure, even for a moderately sized 480 × 480 × 3 image, the transformer would have to deal with a 1D sequence of length 691,200. Since the transformer's complexity is quadratic in the sequence length, processing a sequence of such length is not feasible.
To tackle this, the authors decide to sequentialise the image into a feature map of dimension H/16 × W/16 × C. This design decision keeps the sequence length manageable for the transformer while mimicking the downsampling a typical FCN encoder performs. Moreover, upsampling back to the original size is straightforward for a decoder. To obtain a sequence of length HW/256, the image is divided into a grid of H/16 × W/16 patches (each of size 16 × 16). The grid is then flattened into a sequence. Each vectorised patch p is mapped to a C-dimensional embedding using a linear projection function f: p -> e \in R^C. A patch position embedding p_i is also learnt for each location i and added to e_i, giving the final input E = {e_1 + p_1, e_2 + p_2, ..., e_L + p_L}; this adds order information to the otherwise order-agnostic attention model.
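The sequentialisation step can be sketched as follows (a minimal PyTorch sketch, not the authors' code; the 480 × 480 input, 16 × 16 patches and C = 1024 are assumptions). A stride-16 convolution is used as an equivalent way to flatten each patch and apply the linear projection f.

```python
# A minimal sketch of the sequentialisation described above: split the image
# into a H/16 x W/16 grid of patches, project each flattened patch to a
# C-dimensional embedding e_i, and add a learnable position embedding p_i.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=480, patch_size=16, in_chans=3, embed_dim=1024):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2    # L = HW / 256
        # A stride-16 conv is equivalent to flattening each 16x16x3 patch
        # and applying the linear projection f: p -> e in R^C.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable position embeddings p_i, one per patch location i.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                       # x: (B, 3, H, W)
        e = self.proj(x)                        # (B, C, H/16, W/16)
        e = e.flatten(2).transpose(1, 2)        # (B, L, C) with L = HW/256
        return e + self.pos_embed               # E = {e_i + p_i}

tokens = PatchEmbedding()(torch.randn(1, 3, 480, 480))
print(tokens.shape)                             # torch.Size([1, 900, 1024])
```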
A transformer-based encoder takes the input embedding E and learns the feature representation. Here, each transformer layer has a global receptive field, as opposed to an FCN encoder. The encoder consists of L_e layers of multi-head self-attention (MSA) and multilayer perceptron (MLP) blocks. At each layer l, the (query, key, value) triplet is computed from the input Z^{l-1} \in R^{L \times C} as

query = Z^{l-1} W_Q,   key = Z^{l-1} W_K,   value = Z^{l-1} W_V

Here, W_Q, W_K, W_V \in R^{C \times d} are learnable parameters, and d is the dimension of the (query, key, value) triplet.

The attention equation from [19] is:

Attention\left(Q, K, V\right) = softmax\left(QK^T / \sqrt{d}\right) V

Substituting the values above, each self-attention (SA) head becomes:

SA\left(Z^{l-1}\right) = Z^{l-1} + softmax\left(\frac{(Z^{l-1} W_Q)(Z^{l-1} W_K)^T}{\sqrt{d}}\right) \left(Z^{l-1} W_V\right)

M such self-attention heads are then concatenated to build the multi-head self-attention:
MSA\left(Z^{l−1} \right) = [SA_1\left(Z^{l−1}\right); SA_2\left(Z^{l−1} \right); · · · ; SA_M\left(Z^{l−1} \right)] W_O
An MLP block with a residual connection is then applied to the MSA output to obtain the input of the next layer (layer norm is applied before the MSA and MLP blocks):

Z^l = MSA\left(Z^{l-1}\right) + MLP\left(MSA\left(Z^{l-1}\right)\right) \in R^{L \times C}
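One encoder layer described by the equations above can be sketched as follows (a minimal PyTorch sketch, not the authors' implementation; the head count, MLP width and use of nn.MultiheadAttention, which fuses the W_Q/W_K/W_V projections and the output projection W_O, are assumptions).

```python
# A minimal sketch of one encoder layer: multi-head self-attention over the
# full L = HW/256 sequence (global receptive field) followed by an MLP,
# both with residual connections and pre-applied layer norm.
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, dim=1024, heads=16, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # nn.MultiheadAttention forms Q, K, V internally, applies
        # softmax(QK^T / sqrt(d)) V per head, then concatenates and projects (W_O).
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z):                       # z: (B, L, C)
        h = self.norm1(z)
        z = z + self.attn(h, h, h)[0]           # Z' = Z^{l-1} + MSA(Z^{l-1})
        z = z + self.mlp(self.norm2(z))         # Z^l = Z' + MLP(Z')
        return z

z = torch.randn(1, 900, 1024)                   # L = 900 tokens for a 480x480 image
layer = TransformerLayer()
for _ in range(24):                             # "T-Large": L_e = 24 layers
    z = layer(z)                                # weight reuse here is only for the shape demo
print(z.shape)                                  # torch.Size([1, 900, 1024])
```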
III. 3. I. Decoder design
The authors introduce three different decoder designs to upsample to the target dimension; these are used to evaluate the performance of SETR's encoder. In the decoder phase, the HW/256 × C feature sequence is first reshaped into a feature map of shape H/16 × W/16 × C. The three decoder designs are described below [Figure 4]:
- Naive upsampling: The naive decoder first projects the channel dimension of the transformer feature Z^{L_e} to the number of categories, using a 2-layer network (1 × 1 conv + sync batch norm with ReLU + 1 × 1 conv). Full image resolution is then reached via bilinear upsampling, followed by a classification layer with a per-pixel cross-entropy loss.
- Progressive upsampling: Since one-step upsampling may produce noisy predictions, progressive upsampling (PUP) is introduced as an alternative decoder, in which conv layers and upsampling operations are used alternately. The upsampling is restricted to 2×, so 4 upsampling steps in total are needed to reach full resolution from Z^{L_e} of size H/16 × W/16 (a sketch of this idea follows the list).
- Multi-level feature aggregation: The multi-level feature aggregation (MLA) decoder is inspired by the feature pyramid network, although unlike a feature pyramid, MLA maintains the same resolution at every level, without the pyramid shape. Feature representations from multiple layers are taken as input, and M streams are deployed, each attached to a specific layer. In each stream, the encoder feature of size HW/256 × C is reshaped to H/16 × W/16 × C, and a 3-layer network (kernel sizes 1 × 1, 3 × 3, 3 × 3) is applied, where the channel size is halved in the first and third layers and the spatial resolution is upsampled 4× by bilinear interpolation after the third. In addition, after the first layer, top-down aggregation via element-wise addition is added to increase interaction between streams. Finally, the fused features are bilinearly upsampled to full size.
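As a concrete illustration of the progressive-upsampling (PUP) idea referenced above, here is a minimal PyTorch sketch (not the authors' code; the intermediate channel width, normalisation choice and the 19-class Cityscapes head are assumptions): the token sequence is reshaped back to an H/16 × W/16 map, then convolution and 2× bilinear upsampling alternate over four stages to reach full resolution.

```python
# A minimal sketch of progressive upsampling (PUP): reshape the HW/256 x C
# token sequence to an H/16 x W/16 x C map, then alternate conv and 2x
# bilinear upsampling four times to recover the full H x W resolution.
import torch
import torch.nn as nn

class PUPDecoder(nn.Module):
    def __init__(self, embed_dim=1024, width=256, num_classes=19):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
                nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False))
        self.stages = nn.Sequential(                 # 4 stages of 2x = 16x total
            block(embed_dim, width), block(width, width),
            block(width, width), block(width, width))
        self.classifier = nn.Conv2d(width, num_classes, kernel_size=1)

    def forward(self, z, hw):                        # z: (B, L, C), hw = (H/16, W/16)
        B, L, C = z.shape
        x = z.transpose(1, 2).reshape(B, C, *hw)     # (B, C, H/16, W/16)
        return self.classifier(self.stages(x))       # (B, num_classes, H, W)

logits = PUPDecoder()(torch.randn(1, 900, 1024), (30, 30))
print(logits.shape)                                  # torch.Size([1, 19, 480, 480])
```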
Figure 4: Schematic illustration of the proposed SEgmentation TRansformer (SETR) (a). An image is first split into fixed-size patches, each of which is linearly embedded; position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder. To perform pixel-wise segmentation, different decoder designs are introduced: (b) progressive upsampling (resulting in a variant called SETR-PUP); and (c) multi-level feature aggregation (a variant called SETR-MLA).
IV. Experiments
IV. 1. Experiment Setup
IV. 1. I. Datasets
Three different datasets are used to evaluate the SETR model. They are as follows:
- Cityscapes [2] is a dataset of urban scenes with 19 object categories and 5000 finely annotated images (high resolution of 2048 × 1024).
- ADE20K [3] is a scene parsing benchmark dataset of around 20,000 images with 150 fine-grained semantic concepts.
- PASCAL Context [4] is a dataset with pixel-wise semantic labels for the whole scene that contains 5000 images and 60 object classes (59 classes and the background class).
IV. 1. II. Baseline Model
Dilated FCN [5] and Semantic FPN [6] are taken as baseline models, with their results reported from [7]. SETR models are compared with the baselines after training and testing under the same settings. Although the dilated FCN has an output stride of 8, an output stride of 16 is used in SETR due to GPU memory constraints.
IV. 1. III. Variants of SETR
Besides the three decoder designs discussed above, two encoder variants, “T-Base” and “T-Large”, with 12 and 24 layers respectively, are used in the experiments (Table 1).
“T-Large” is the default encoder for the three decoder variants unless specified otherwise. The model with the “T-Base” encoder and the SETR-Naive decoder is denoted SETR-Naive-Base. A hybrid baseline, Hybrid (a ResNet-50 based FCN encoder whose output is fed into SETR), is also used for comparison; in effect, Hybrid combines ResNet-50 with SETR-Naive-Base.
IV. 1. IV. Pre-training
ViT [8] or DeiT [9] pre-trained weights are used to initialise both the transformer layers and the linear projection layer of the model, while layers without pre-training are randomly initialised. SETR-Naive-Base with DeiT initialisation is denoted SETR-Naive-DeiT. The FCN encoder of Hybrid uses ImageNet-1k pre-trained weights, while its transformer part uses ViT, DeiT or random initialisation.
IV. 1. V. Evaluation Metric
For most of the comparisons, mean Intersection over Union (mIoU), averaged over all classes, is used as the metric. The only exception is ADE20K, where pixel-wise accuracy is reported alongside mIoU.
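For reference, mIoU can be sketched as follows (a minimal NumPy sketch, not tied to the evaluation code used in the paper; the ignore_index convention and the handling of absent classes are simplifying assumptions): per-class IoU is computed from a pixel-level confusion matrix and then averaged over classes.

```python
# A minimal sketch of mean Intersection over Union (mIoU): per-class IoU is
# computed from the confusion matrix over all pixels, then averaged.
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """pred, target: integer label maps of the same shape."""
    mask = target != ignore_index
    # Confusion matrix: rows = ground truth, columns = prediction.
    cm = np.bincount(num_classes * target[mask] + pred[mask],
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(cm)
    union = cm.sum(0) + cm.sum(1) - intersection
    iou = intersection / np.maximum(union, 1)        # avoid division by zero
    return iou.mean()                                # simple average over classes

pred   = np.random.randint(0, 19, size=(1024, 2048))
target = np.random.randint(0, 19, size=(1024, 2048))
print(mean_iou(pred, target, num_classes=19))
```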
IV. 2. Ablation studies
In Table 2, different variants of SETR are compared with the baseline FCN, Semantic FPN and Hybrid. It also compares different pre-training strategies against the FCN baseline.
From Table 2, the following observations can be made:
(i) On Cityscapes, SETR-PUP performs best among all the variants. SETR-MLA cannot reach its full potential because the feature outputs of different transformer layers do not enjoy the resolution-pyramid benefits of a feature pyramid network (FPN).
(ii) SETR-MLA and SETR-Naive (with the “T-Large” encoder) perform better than SETR-MLA-Base and SETR-Naive-Base (with the “T-Base” encoder).
(iii) Initially, SETR-PUP-Base cannot match Hybrid-Base in performance, but it catches up when trained for more iterations, suggesting that the FCN encoder design can be successfully replaced in semantic segmentation.
(iv) While a randomly initialised SETR-PUP gives only 42.27% mIoU on Cityscapes, the DeiT-initialised SETR-PUP gives the best performance on Cityscapes, slightly better than the ViT-initialised SETR-PUP. This shows that pre-training is an important aspect of the model.
Table 3 shows the performance of different pre-training strategies.
For a fair comparison with the FCN baseline, a ResNet-101 pre-trained on ImageNet-21k is used to initialise a dilated FCN for semantic segmentation on ADE20K. According to Table 3, the FCN with ImageNet-21k pre-training outperforms the FCN with ImageNet-1k pre-training. However, as the SETR model outperforms both FCN variants, this shows that a larger pre-training dataset is not always essential, and that the sequence-to-sequence model can outperform models pre-trained on more data.
IV. 3. Comparison to state-of-the-art
Results on ADE20K
Results of different SETR models on the ADE20K dataset are shown in Table 4:
SETR-MLA achieves a superior mIoU of 48.64% with single-scale (SS) inference and 50.28% with multi-scale (MS) inference, achieving state of the art with the latter.
Figure 5 shows the qualitative results of SETR model and dilated FCN on ADE20K.
Figure 5: Qualitative results on ADE20K: SETR (right column) vs. dilated FCN baseline (left column) in each pair.
Results on Pascal Context
Results of different SETR models (along with SOTA models) on the Pascal Context dataset are shown in Table 5:
While the dilated FCN with ResNet-101 achieves an mIoU of 45.74%, both SETR variants significantly outperform it, achieving mIoUs of 54.40% (SETR-PUP) and 54.87% (SETR-MLA). When multi-scale (MS) inference is used, SETR-MLA improves further to 55.83%, outperforming APCNet, its nearest rival.
Figure 6 gives some qualitative results of SETR and dilated FCN.
Figure 6: Qualitative results on Pascal Context: SETR (right column) vs. dilated FCN baseline (left column) in each pair.
Figure 7 shows that SETR can extract foreground regions that are semantically meaningful and demonstrates the ability to learn discriminative feature representations useful for segmentation.
Figure 7: Examples of attention maps from SETR trained on Pascal Context.
Results on Cityscapes
Results of different SETR models (along with SOTA models) on the Cityscapes validation and test sets are shown in Tables 6 and 7 respectively:
Both tables show that SETR outperforms the FCN baselines and the FCN-plus-attention approaches such as Non-local [10] and CCNet [11], and produces results comparable to the best reported so far. SETR-PUP remains superior to Axial-DeepLab [12] when multi-scale inference is adopted on the Cityscapes validation set. Trained for 100k iterations, the SETR model outperforms Axial-DeepLab-XL by a clear margin on the test set. Figure 8 shows the qualitative results of SETR and the dilated FCN on Cityscapes.
Figure 8: Qualitative results on Cityscapes: SETR (right column) vs. dilated FCN baseline (left column) in each pair.
V. Conclusion
To summarise, the authors have presented an alternative perspective, a sequence-to-sequence prediction framework, for semantic segmentation. Existing FCN-based methods struggle with limited receptive fields, and the most popular remedies enlarge the receptive field at the component level with dilated convolutions and attention modules. In this paper, the authors instead make a change at the architectural level, removing the FCN entirely and addressing the limited receptive field from a different perspective. They implement this idea with transformers, which use the attention module to model global context during feature learning. Together with a variety of decoder designs of different complexity, the complete architecture is built without the influence of previous FCN-based designs. Extensive experiments show that SETR variants push the state of the art on ADE20K and Pascal Context and achieve competitive results on Cityscapes.
VI. Own review
Strengths:
- Extensive experiments with different datasets and parameters.
- First to provide a completely non-FCN-style architecture for semantic segmentation.
- Achieved state of the art on multiple datasets at the time of publishing.
- Code available on GitHub [13].
Weaknesses:
- The performance variation among the different decoders is not confidently explained.
- The decoders are still not fully transformer-based; a fully transformer-based design was achieved by another team[14] at around the same time, while pushing the state of the art.
REFERENCES
[1] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[2] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[3] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. arXiv preprint, 2016.
[4] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.
[5] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[6] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollar. Panoptic feature pyramid networks. In CVPR, 2019.
[7] OpenMMLab. mmsegmentation. https://github.com/open-mmlab/mmsegmentation, 2020.
[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[9] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint, 2020.
[10] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
[11] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In ICCV, 2019.
[12] Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Axial-deeplab: Standalone axial-attention for panoptic segmentation. In ECCV, 2020.
[13] https://github.com/fudan-zvg/SETR
[14] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. arXiv preprint arXiv:2105.05633, 2021.
[15] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
[16] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
[17] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In ICCV, 2019
[18] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[19] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.