This blog post is about "Vision Transformers for Dense Prediction" by René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun (Intel Labs), published in 2021.

Introduction

  In past studies, the common approach to dense prediction has been a convolutional encoder-decoder architecture. However, while a CNN backbone downsamples the image to extract features at multiple scales, feature resolution and granularity are lost in the process. This loss matters more in the encoder than in the decoder, because information discarded by the encoder is hard to reconstruct in the decoder, which is especially harmful for dense prediction. Furthermore, CNNs require large amounts of computation for this downsampling. How to reduce the loss of feature granularity while keeping the architecture efficient is therefore a challenging issue.

Related Work

  The well-known Transformer architecture shown in Figure 1 was introduced in "Attention Is All You Need" by Google [1]. The Transformer originally earned its reputation in natural language processing. Through its attention mechanism, which relies entirely on queries, keys, and values, the Transformer relates every unit in a sequence to every other unit and models their relationships directly. It also uses multiple attention heads to compute in parallel, reducing the amount of sequential computation: the Transformer was the first sequence transduction model to rely entirely on attention, replacing the recurrent layers commonly used in encoder-decoder models with multi-headed self-attention. Compared with CNNs and RNNs, the Transformer can not only be trained in parallel but also achieves better performance at lower training cost. Its attention-based architecture has since been applied to audio, images, and time-series prediction. With its self-attention mechanism and modest computation, it is therefore worth adopting for the task of dense prediction.

Figure 1: Transformer - Model Architecture [1]


Architecture

Figure 2: DPT Overview

  The DPT model proposed in this paper (see Figure 2) uses vision transformers [2] as the backbone and shows how the representation produced by the transformer encoder can be effectively reconstructed into dense predictions by a convolutional decoder.

  First, the input image is transformed into tokens (orange in Figure 2) either by extracting non-overlapping patches followed by a linear projection of their flattened representation (DPT-Base with 12 transformer layers and DPT-Large with 24 transformer layers) or by applying a ResNet-50 feature extractor (DPT-Hybrid with 12 transformer layers). The tokens can be regarded as a bag-of-words representation of the image. They have a one-to-one correspondence with image patches, so the spatial resolution of the initial embedding is maintained throughout all transformer stages.
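
  To make the patch-based tokenization concrete, here is a minimal sketch in PyTorch. The 16x16 patch size and 768-dimensional embedding are illustrative ViT-Base-style values and the class name is my own; this is not the authors' code.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Turns an image into patch tokens: a strided convolution is equivalent to
    extracting non-overlapping patches and linearly projecting their flattened
    pixels. Patch size and embedding dimension are illustrative values."""
    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, D, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, N, D) patch tokens


tokens = PatchEmbed()(torch.randn(1, 3, 384, 384))
print(tokens.shape)                            # torch.Size([1, 576, 768])
```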

  Next, the image embedding is augmented with a learnable positional embedding, and a patch-independent readout token (red in Figure 2) is added. The readout token is not tied to any particular image patch; it plays a role similar to ViT's class token and can carry global information.
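
  A minimal sketch of this embedding step, assuming the same illustrative dimensions as above (the class name and the 24x24 token grid are my own choices, not the paper's):

```python
import torch
import torch.nn as nn

class TokenPreparation(nn.Module):
    """Sketch of the embedding step: prepend a learnable, patch-independent
    readout token and add a learnable positional embedding."""
    def __init__(self, num_patches=24 * 24, embed_dim=768):
        super().__init__()
        self.readout_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, patch_tokens):                       # (B, N, D)
        readout = self.readout_token.expand(patch_tokens.shape[0], -1, -1)
        x = torch.cat([readout, patch_tokens], dim=1)      # (B, N + 1, D)
        return x + self.pos_embed                          # add positional information
```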

  Then, the tokens are passed through multiple transformer stages. Multi-headed self-attention not only relates tokens to each other to transform the representation, but is also an inherently global operation: every token can attend to, and thus influence, every other token. Unlike CNNs, which only gradually increase their receptive field as features pass through consecutive convolution and downsampling layers, the transformer has a global receptive field at every stage.
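
  The global nature of self-attention can be seen directly from the shape of the attention weights; a tiny sketch using PyTorch's built-in multi-head attention (head count and dimensions are illustrative, not quoted from the paper):

```python
import torch
import torch.nn as nn

# One transformer stage in miniature: multi-headed self-attention over the full
# token sequence, so every token attends to (and can influence) every other token.
attention = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
x = torch.randn(1, 577, 768)            # 1 readout token + 576 patch tokens
out, weights = attention(x, x, x)       # weights: (1, 577, 577) -> globally dense
```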

  The tokens from different transformer stages are then reassembled into image-like representations at multiple resolutions (green in Figure 2). As shown in Figure 3, the Reassemble operation turns a set of tokens into an image-like feature representation at a given resolution; these feature representations are later fused into the final dense prediction. The three-stage Reassemble operation can recover image-like representations from the output tokens of arbitrary layers of the transformer encoder. The first stage, Read, handles the readout token; although the readout token is not strongly beneficial for dense prediction, it can help capture and distribute global information. The second stage spatially concatenates the remaining tokens into a feature map according to the positions of their patches. The third stage applies a 1x1 convolution to project the representation and passes it to a spatial resampling layer that rescales it to the target resolution: features from deeper transformer layers are assembled at lower resolution, whereas features from early layers are assembled at higher resolution.
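
  A rough sketch of the Reassemble operation is given below. Dropping the readout token, the channel count, and the upsampling factor are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class Reassemble(nn.Module):
    """Sketch of the three-stage Reassemble operation: Read (handle the readout
    token), spatial concatenation into a feature map, and projection/resampling."""
    def __init__(self, embed_dim=768, out_channels=256, grid_size=24, scale=4):
        super().__init__()
        self.grid_size = grid_size
        self.project = nn.Conv2d(embed_dim, out_channels, kernel_size=1)
        # Early layers get rescaled to higher resolution (scale > 1),
        # deeper layers stay at lower resolution.
        self.resample = (nn.ConvTranspose2d(out_channels, out_channels,
                                            kernel_size=scale, stride=scale)
                         if scale > 1 else nn.Identity())

    def forward(self, tokens):                              # (B, 1 + N, D)
        x = tokens[:, 1:]                                   # Read: drop the readout token
        B, N, D = x.shape
        x = x.transpose(1, 2).reshape(B, D, self.grid_size, self.grid_size)
        return self.resample(self.project(x))               # 1x1 projection + rescaling


features = Reassemble()(torch.randn(1, 577, 768))
print(features.shape)                                       # torch.Size([1, 256, 96, 96])
```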

Figure 3:  Reassemble operation


  Finally, fusion modules (purple in Figure 2) progressively fuse and upsample the representations to generate a fine-grained prediction. As shown in Figure 4, each fusion block combines features using residual convolutional units and upsamples the feature map; the final representation has half the resolution of the input image and is used to produce the final prediction.
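
  A rough sketch of such a fusion block, assuming RefineNet-style residual convolutional units (the exact layer ordering is my assumption, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualConvUnit(nn.Module):
    """Residual convolutional unit used inside the fusion block."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = self.conv2(F.relu(self.conv1(F.relu(x))))
        return out + x                       # residual connection


class FusionBlock(nn.Module):
    """One fusion stage: merge the reassembled features of the current stage
    with the output of the previous (coarser) fusion block, refine with
    residual conv units, and upsample by a factor of two."""
    def __init__(self, channels=256):
        super().__init__()
        self.rcu_skip = ResidualConvUnit(channels)
        self.rcu_out = ResidualConvUnit(channels)

    def forward(self, reassembled, previous=None):
        x = reassembled if previous is None else reassembled + self.rcu_skip(previous)
        x = self.rcu_out(x)
        return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=True)
```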

Figure 4: Fusion Block Overview


  Furthermore, the proposed DPT model can handle varying image sizes, since the transformer encoder trivially handles a varying number of tokens. As long as the embedding stage can be applied, the reassemble and fusion modules work as well; the main constraint is that the input size must be aligned with the stride of the convolutional decoder.
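
  One common way for ViT-based models to handle a varying number of tokens is to bilinearly resize the learned positional embedding to the new token grid; the sketch below illustrates this idea (the function name and details are assumptions, not quoted from the paper):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, new_grid):
    """Adapt a learned positional embedding of shape (1, 1 + H*W, D) to a new
    token grid by bilinear interpolation, keeping the readout token's entry."""
    readout_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(patch_pos.shape[1] ** 0.5)
    dim = patch_pos.shape[2]
    grid = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=new_grid, mode="bilinear", align_corners=False)
    patch_pos = grid.permute(0, 2, 3, 1).reshape(1, -1, dim)
    return torch.cat([readout_pos, patch_pos], dim=1)


resized = resize_pos_embed(torch.zeros(1, 1 + 24 * 24, 768), new_grid=(32, 48))
print(resized.shape)                    # torch.Size([1, 1537, 768])
```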

Experiments and Results

  The paper mainly focuses on two tasks: monocular depth estimation and semantic segmentation.

Monocular Depth Estimation

  Monocular depth estimation is a dense regression problem. Two challenges are worth noting. The first is how to unify different representations of depth from diverse datasets into a common representation while handling common ambiguities, for example scale ambiguity. The second is the large amount of training data needed for the transformer to be effective.
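
  To make the scale ambiguity concrete: in the protocol of [3], which this paper follows, a prediction $d$ is aligned to the ground truth $d^{\mathrm{gt}}$ with a least-squares scale and shift before an error is computed, so predictions are only required up to an affine transformation. Roughly,

$$
(s^{*}, t^{*}) \;=\; \arg\min_{s,\,t} \sum_{i} \bigl( s\, d_i + t - d_i^{\mathrm{gt}} \bigr)^2 .
$$

  This sketches only the alignment step; the full scale- and shift-invariant loss in [3] contains further components.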

  The experimental setup closely follows the protocol of Ranftl et al. [3] and uses a meta-dataset, referred to as MIX 6, with about 1.4 million images. For optimization, the Adam optimizer is used with a learning rate of 1e-5 for the encoder and 1e-4 for the decoder; the encoder is initialized with ImageNet-pretrained weights, while the decoder is initialized randomly. Training runs for 60 epochs with a batch size of 16.
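
  As a sketch, the two learning rates can be realized with Adam parameter groups in PyTorch; the dummy encoder and decoder below only stand in for the real ViT backbone and convolutional decoder:

```python
import torch
import torch.nn as nn

class DummyDPT(nn.Module):
    """Placeholder model: encoder/decoder attributes stand in for DPT's parts."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(768, 768)                           # placeholder backbone
        self.decoder = nn.Conv2d(256, 1, kernel_size=3, padding=1)   # placeholder decoder


model = DummyDPT()
optimizer = torch.optim.Adam([
    {"params": model.encoder.parameters(), "lr": 1e-5},   # ImageNet-pretrained encoder
    {"params": model.decoder.parameters(), "lr": 1e-4},   # randomly initialized decoder
])
```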

Table 1: Comparison to the state of the art on monocular depth estimation. Evaluation for zero-shot cross-dataset transfer according to the protocol defined in [3]. Relative performance is computed with respect to the original MiDaS model [3]. Lower is better for all metrics.

  In terms of performance, compared with the baseline MiDaS model [3], the average improvement of DPT-Hybrid is over 23%, and the average improvement of DPT-Large is about 28%. In terms of model size (Table 2), DPT-Hybrid achieves this with a network size comparable to MiDaS [3], while DPT-Large is roughly three times larger. Furthermore, to show that DPT does not simply rely on a larger training dataset to improve its performance, the paper also compares MiDaS [3] trained on its original dataset MIX 5 with MiDaS trained on the extended MIX 6 dataset introduced in this paper. Both DPT-Large and DPT-Hybrid still outperform the fully convolutional MiDaS even when it is trained on MIX 6, indicating that the gains come from the transformer-based architecture rather than merely from more training data.


Table 2: Model statistics. DPT has comparable inference speed to the state of the art.

  In terms of qualitative results, as shown in Figure 5, DPT reconstructs fine details while also improving global coherence compared to the CNN baseline, for example in large homogeneous regions and in the relative depth arrangement across the image.


Figure 5: Sample results for monocular depth estimation. Compared to the fully-convolutional network used by MiDaS [3], DPT shows better global coherence (e.g., sky, second row) and finer-grained details (e.g., tree branches, last row)

Semantic Segmentation

  Semantic segmentation is a representative discrete labeling task and is widely applied in various fields.

  The experimental setup closely follows the protocol of Zhang et al. [4] on the ADE20K dataset, using random horizontal flipping and random rescaling for data augmentation. For optimization, SGD with momentum 0.9 is used with a learning rate of 2e-3, dropout of 0.1, and batch normalization, together with a cross-entropy loss accompanied by an auxiliary loss with weight 0.1. Training runs for 240 epochs with a batch size of 48.
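
  A sketch of this objective and optimizer in PyTorch; the placeholder prediction head and the ignore_index value are my assumptions, not taken from the paper:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(256, 150, kernel_size=1)         # placeholder head (ADE20K has 150 classes)
criterion = nn.CrossEntropyLoss(ignore_index=255)  # ignore_index is an assumption

def segmentation_loss(logits, aux_logits, target, aux_weight=0.1):
    # Cross-entropy on the main output plus a 0.1-weighted auxiliary loss.
    return criterion(logits, target) + aux_weight * criterion(aux_logits, target)

optimizer = torch.optim.SGD(model.parameters(), lr=2e-3, momentum=0.9)
```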

Table 3: Semantic segmentation results

  In Table 3, DPT-Hybrid is clearly more effective than the other, fully convolutional methods. DPT-Large performs slightly worse, which may be because the ADE20K dataset is much smaller than MIX 6.

  In terms of qualitative results, as shown in Figure 6, DPT tends to produce cleaner and finer-grained delineations of object boundaries, and its predictions are in some cases less cluttered.

Figure 6:  Sample results for semantic segmentation on ADE20K (first and second column) and Pascal Context (third and fourth column). Predictions are frequently better aligned to object edges and less cluttered.

Ablation Studies

  Table 4 compares different backbones (encoders) for monocular depth estimation on three additional datasets. ViT-Large performs best, but its model is about three times larger than both ViT-Base and ViT-Hybrid. ViT-Hybrid is more effective than ViT-Base, and the two are about the same size. ViT-Base also performs better than ResNeXt101-WSL [5], a convolutional baseline pre-trained on a billion-scale corpus of weakly supervised data followed by ImageNet pre-training. Further, DeiT [6] is a ViT variant with a more data-efficient pre-training procedure. DeiT-Base performs on par with ViT-Base, while DeiT-Base-Dist [6] introduces an additional distillation token, which is ignored in the Reassemble operation. DeiT-Base-Dist turns out to be more effective than ViT-Base, indicating that, similarly to convolutional architectures, improvements in image-classification pre-training can also benefit dense prediction tasks.

Table 4: Ablation of backbones. Both the ViT-Hybrid and ViT-Large backbones consistently outperform the convolutional baselines. The base architecture outperforms the convolutional baseline when using better pre-training (DeiT-Base-Dist [6]).

  Regarding inference speed for the different network architectures (Table 2), timings were conducted on an Intel Xeon Platinum 8280 CPU @ 2.70 GHz with 8 physical cores and an Nvidia RTX 2080 GPU. DPT-Hybrid and DPT-Large show latency comparable to the fully convolutional MiDaS architecture [3]. Although DPT-Large has more parameters than any of the other models, its latency remains competitive, since its wide and comparatively shallow structure lends itself to parallel computation.


Conclusion

  This paper proposes an architecture called the Dense Prediction Transformer (DPT). Its encoder is based on the vision transformer (ViT) [2], and the network reassembles the bag-of-words representation provided by ViT into image-like feature representations at various resolutions. The decoder is a convolutional network that combines these feature representations into the final dense prediction. The encoder-decoder design abandons explicit downsampling operations and maintains a representation with constant dimensionality throughout all processing stages. With a global receptive field at every stage, DPT produces fine-grained and globally coherent dense predictions. The experiments on monocular depth estimation and semantic segmentation confirm that DPT produces more fine-grained and globally coherent predictions than fully convolutional architectures.

Own Review

  The contribution of this paper is that it applies the Transformer architecture, instead of CNNs, to image processing for dense prediction. The Transformer, originally developed for natural language processing (NLP), had already earned a stronger reputation than common CNN or RNN variants. Remarkably, this structure can also be applied to image processing, not only for classification but also for accurate per-pixel prediction. DPT, the variant proposed here, keeps computation manageable and exploits the parallelism of multi-headed self-attention to avoid feature granularity loss in dense prediction. Furthermore, large amounts of training data clearly support feature learning for this task. The ablation studies also show that DPT can be fine-tuned on small datasets and still achieve strong results compared with other CNN-based methods. For semantic segmentation, DPT works well and tends to produce cleaner and finer-grained delineations of object boundaries, with predictions that are in some cases less cluttered. This is helpful for developing applications in other domains, such as medical image recognition and object detection for autonomous vehicles. Future work could therefore go in two directions: improving the DPT architecture itself, and applying it to other fields.

Reference

[1]. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. (2017). Attention is all you need.

[2]. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. (2020). An image is worth 16x16 words: Transformers for image recognition at scale.

[3]. René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. (2020). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. TPAMI, 2020.

[4]. Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, and Alexander Smola. (2020). ResNeSt: Split-attention networks.

[5]. Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. (2018). Exploring the limits of weakly supervised pretraining. In ECCV, 2018.

[6]. Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. (2020). Training data-efficient image transformers & distillation through attention.



