Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick

Blog post written by: Katharina Kessler

Based on the paper: He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16000-16009).


1. Introduction: What is the Difference between Language and Images?

In recent years – as hardware performance has improved and new applications for machine learning have emerged – deep learning has increasingly run up against the challenge of obtaining suitable input data. To be trained, neural networks generally need large input datasets, and the higher their quality, the better the output accuracy [1]. For networks trained on image tasks such as image classification (assigning labels to images based on their content) or image segmentation (dividing an image into segments or regions, often to identify objects or boundaries), this means that a large number of labelled images is needed [1]. These are images annotated with specific tags or labels that describe their content and serve as ground truth for verifying the outcome of training the network.

In natural language processing (NLP), the field of artificial intelligence concerned with processing human language, great progress has been made by using a technique called masked language modelling. Here, a certain percentage of the words (tokens) in a sentence is randomly replaced with mask tokens, and the model then tries to predict the original words based on the context. By learning to predict the masked words, the model gains an understanding of how words are used in context, leading to a better comprehension of language structure and meaning [2].
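To make the idea concrete, here is a minimal Python sketch of the masking step (not the actual BERT implementation): the word-level tokens and the 15% masking probability are simplifying assumptions for illustration.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly replace a fraction of tokens with a mask token.
    The model is then trained to predict the original tokens at the masked positions."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)      # the model must recover this word
        else:
            masked.append(tok)
            targets.append(None)     # no prediction target at visible positions
    return masked, targets

print(mask_tokens("the cat sat on the mat".split()))
```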

In principle, this masking idea is applicable not only to NLP but also to computer vision and image-related tasks. To understand why it has driven progress in one field but, for a long time, not in the other, the authors investigated what exactly makes language different from images:

Language consists of human-generated signals that are rich in meaning and dense in information. To accurately predict missing words in a sentence, a model needs a deep understanding of language, including grammar, context, and the relationships between words. In contrast, images consist of natural signals with heavy spatial redundancy: neighbouring pixels or regions often contain similar information, so that a missing part of the image can often be predicted from nearby parts [3].

To bridge this difference, the masked autoencoder technique randomly masks a very large portion of the input image. This reduces the spatial redundancy and forces the model to build a more holistic understanding of the image content rather than simply filling in gaps from the low-level statistics of neighbouring pixels.


2. Method

The masked autoencoder (MAE) uses an asymmetric encoder–decoder design: the encoder receives only the visible (non-masked) image patches as input and transforms them into a latent representation, while the decoder receives both this latent representation and mask tokens standing in for the removed patches, and reconstructs the image from them.


Fig. 1: Asymmetric MAE architecture [3]


The masking strategy divides the image into small, non-overlapping patches. A random subset of these patches is kept visible and the rest are masked (removed). The sampling is uniform, so every patch has an equal chance of being kept, which avoids any bias towards certain areas of the image. Furthermore, a high masking ratio has proven effective: removing a large portion of the patches ensures that the task cannot be solved by simply filling in missing parts from neighbouring visible patches.
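The following NumPy sketch illustrates this uniform random masking for a single image. The shapes (a 224×224 image split into 196 patches of 16×16 pixels) and the per-image formulation are simplifying assumptions; the official implementation works on batched tensors.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Keep a random subset of patches and drop (mask) the rest.

    patches: array of shape (num_patches, patch_dim)
    Returns the visible patches, the indices that were kept, and a
    binary mask (1 = masked) used later when computing the loss.
    """
    rng = rng or np.random.default_rng()
    num_patches = patches.shape[0]
    num_keep = int(num_patches * (1 - mask_ratio))

    # Uniformly random permutation -> every patch is equally likely to be kept.
    shuffle = rng.permutation(num_patches)
    keep_idx = np.sort(shuffle[:num_keep])

    mask = np.ones(num_patches, dtype=np.int64)
    mask[keep_idx] = 0                       # 0 = visible, 1 = masked
    return patches[keep_idx], keep_idx, mask

# Example: a 224x224 image split into 16x16 patches -> 196 patches,
# of which only 49 (25%) are passed on to the encoder.
patches = np.random.rand(196, 16 * 16 * 3)
visible, keep_idx, mask = random_masking(patches)
print(visible.shape, mask.sum())  # (49, 768) 147
```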

Fig. 2: Reconstruction (masked image - reconstructed image - original image) [3]


The encoder is a modified Vision Transformer that processes only the visible, unmasked patches of an image. It embeds them with a linear projection (a layer that applies a linear transformation to its input data) and positional embeddings (which provide information about the position of elements in a sequence), followed by a series of Transformer blocks. Because it operates only on this reduced subset of the patches, it saves a large amount of computational resources. The MAE decoder takes both the encoded visible patches and the mask tokens representing the missing patches as input and uses positional embeddings to give the mask tokens spatial context. It consists of additional Transformer blocks and reconstructs the image during pre-training. Since the decoder is only used for pre-training and not for the actual recognition tasks, it is designed to be smaller and more efficient than the encoder.
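The sketch below shows how these pieces fit together in a heavily simplified PyTorch model. The layer sizes, the use of nn.TransformerEncoder as a stand-in for the ViT blocks, and the omission of the class token and other details are all simplifications for illustration, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Simplified sketch of the asymmetric MAE design: a larger encoder that
    only sees visible patches and a smaller decoder that also receives
    learned mask tokens for the removed patches."""

    def __init__(self, num_patches=196, patch_dim=768,
                 enc_dim=256, dec_dim=128, enc_depth=4, dec_depth=2):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, enc_dim)            # linear projection
        self.enc_pos = nn.Parameter(torch.zeros(num_patches, enc_dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True),
            num_layers=enc_depth)

        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))  # shared mask token
        self.dec_pos = nn.Parameter(torch.zeros(num_patches, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=8, batch_first=True),
            num_layers=dec_depth)
        self.head = nn.Linear(dec_dim, patch_dim)                   # predict pixel values

    def forward(self, patches, keep_idx):
        # Encoder: embed and process ONLY the visible patches.
        x = self.patch_embed(patches[:, keep_idx]) + self.enc_pos[keep_idx]
        latent = self.encoder(x)

        # Decoder: place encoded patches back, fill the rest with mask tokens.
        B, N = patches.shape[0], self.dec_pos.shape[0]
        tokens = self.mask_token.expand(B, N, -1).clone()
        tokens[:, keep_idx] = self.enc_to_dec(latent)
        tokens = tokens + self.dec_pos
        return self.head(self.decoder(tokens))                      # reconstructed patches

model = TinyMAE()
patches = torch.rand(2, 196, 768)                  # batch of patchified images
keep_idx = torch.randperm(196)[:49].sort().values  # 75% of patches masked
recon = model(patches, keep_idx)
print(recon.shape)  # torch.Size([2, 196, 768])
```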

The MAE decoder reconstructs the masked patches by predicting their pixel values; its final layer is a linear projection whose output vectors are reshaped into the reconstructed image. The loss function is the mean squared error (MSE) between the reconstructed and original images, computed only on the masked patches. In an alternative variant, the target pixel values of each masked patch are first normalized using that patch's mean and standard deviation, which was found to improve the quality of the learned representations.
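A sketch of this reconstruction loss, consistent with the description above, could look as follows: the per-patch normalization is applied to the targets, and the mean squared error is averaged over the masked patches only (the 1e-6 constant is an arbitrary small value to avoid division by zero).

```python
import torch

def mae_loss(pred, target, mask, norm_pix=True):
    """MSE reconstruction loss computed only on the masked patches.

    pred, target: (batch, num_patches, patch_dim) pixel values
    mask:         (batch, num_patches), 1 for masked patches, 0 for visible
    norm_pix:     normalize each target patch by its own mean and std first
    """
    if norm_pix:
        mean = target.mean(dim=-1, keepdim=True)
        var = target.var(dim=-1, keepdim=True)
        target = (target - mean) / (var + 1e-6).sqrt()

    loss = (pred - target) ** 2
    loss = loss.mean(dim=-1)                    # per-patch MSE
    return (loss * mask).sum() / mask.sum()     # average over masked patches only
```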

As can be seen in Figure 2, the reconstruction of images generally works quite well (left side), but the model sometimes reconstructs content that does not match the original image (right side). The "wrong" reconstruction is nevertheless still a plausible image; in this example, the cow was turned into a dog [3].

3. Experiments

Experiments were conducted on the ImageNet-1K (IN1K) training set: first self-supervised pre-training, then supervised training to evaluate the learned representations, either with end-to-end fine-tuning (taking the pre-trained model and further training all of its weights on the target task) or with linear probing (training only a linear classifier on top of the frozen features to evaluate their transferability). The results are as follows:
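The difference between the two evaluation protocols can be sketched as follows; the `feat_dim` parameter and the assumption that the encoder outputs a single pooled feature vector per image are illustrative simplifications.

```python
import torch.nn as nn

def build_classifier(encoder: nn.Module, feat_dim: int,
                     num_classes: int = 1000, linear_probe: bool = True):
    """Attach a classification head to a pre-trained encoder.

    linear_probe=True : freeze the encoder and train only the linear head.
    linear_probe=False: end-to-end fine-tuning, all weights are updated.
    Assumes the encoder maps an image to a (batch, feat_dim) feature vector.
    """
    if linear_probe:
        for p in encoder.parameters():
            p.requires_grad = False   # keep the pre-trained features fixed
    return nn.Sequential(encoder, nn.Linear(feat_dim, num_classes))
```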


Fig. 3: Effects of different masking ratios [3]


The optimal masking ratio was found to be around 75% and worked well for both linear probing and fine-tuning. Moreover, the decoder's design is flexible, allowing its depth and width to be tuned: deeper decoders improve pixel reconstruction but have less impact on recognition, while narrower decoders remain efficient for both fine-tuning and linear probing. Regarding mask tokens, it was found that skipping them in the encoder – so that the encoder only processes real patches – improves accuracy. It also reduces training computation, achieving up to a 3.3× reduction in FLOPs (the number of calculations performed by the model) and a 2.8× wall-clock speedup (actual elapsed time), which is particularly valuable for training larger models efficiently.

Furthermore, comparing different reconstruction targets showed that per-patch normalization of pixels enhances accuracy by improving local contrast, whereas using PCA coefficients or token prediction via dVAE tokenization did not yield accuracy benefits and added complexity. Data augmentation experiments showed that MAE pre-training performs well with cropping-only augmentation, does not benefit from adding color jittering, and even performs decently with no augmentation at all, in contrast to contrastive learning methods that rely heavily on augmentation. Comparing different mask sampling strategies, simple random sampling proved most effective, allowing higher masking ratios and achieving better accuracy than block-wise and grid-wise sampling. Finally, accuracy improves steadily with longer training schedules, whereas contrastive methods such as MoCo v3 saturate much earlier despite seeing more patches per epoch.

Additionally, the results were compared to other models. The MAE consistently outperformed supervised pre-training, particularly with larger ViT models such as ViT-L, and competes favourably with token-based approaches like BEiT while being simpler and faster. Finally, the pixel-based MAE remains effective across the different tasks, demonstrating that tokenization is not necessary for achieving good results [3].


4. Experiments on Medical Images

A follow-up study [4] applied MAE self-pretraining to several medical imaging datasets, with the following setup and results:

Firstly, the experimental setup used the PyTorch and MONAI frameworks with ViT-B/16 as the backbone model. The AdamW optimizer was chosen for stable training and to help prevent overfitting. Data preprocessing varied by dataset: ChestX-ray14 images underwent histogram equalization and random flips/crops, BTCV CT scans were clipped and scaled with volume-wise augmentation, and MSD MRI scans were normalized with volume-wise augmentation. Training used batch sizes of 256 for ChestX-ray14 and 6 for BTCV and MSD, with MAE pre-training running for 800 epochs on ChestX-ray14, 10,000 epochs on BTCV, and 500 epochs on MSD.
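As a rough illustration of such a setup, the optimizer could be configured as below; the stand-in model, learning rate, and weight decay are placeholder values and not taken from the paper.

```python
import torch

# Stand-in module instead of the actual ViT-B/16 backbone (illustration only).
model = torch.nn.Linear(768, 14)

# AdamW decouples weight decay from the gradient update, which helps
# regularize large Transformer models during long training schedules.
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4, weight_decay=0.05)
```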


Fig. 4: MAE reconstruction (first row: original image, second row: masked image, third row: reconstructed image; left: CXR - right: BTCV) [4]


For lung disease classification (on the ChestX-ray14 dataset), MAE self-pretraining was applied to 112,120 images and obtained a competitive AUC across the disease classes compared to supervised pre-training. The AUC (area under the ROC curve) measures how well a model separates positive from negative cases for a class; averaging it over all classes gives a single score, where a higher value means better performance. The MAE outperformed the ViT pre-trained on ImageNet by 0.8%.
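For illustration, a class-averaged AUC can be computed with scikit-learn as in the toy example below; the labels and scores are made up, and ChestX-ray14 has 14 disease labels rather than the 3 shown here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy example: 4 samples, 3 disease labels (multi-label setting, one column per disease).
y_true = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.1], [0.3, 0.8, 0.7], [0.6, 0.9, 0.2], [0.1, 0.3, 0.8]])

# Macro-averaged AUC: compute one ROC curve per label and average the areas.
print(roc_auc_score(y_true, y_score, average="macro"))
```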

The second set of experiments was conducted on the BTCV dataset for abdominal multi-organ segmentation. Here, MAE self-pretraining was applied to 30 abdominal CT scans. The results showed an improvement of the average Dice similarity coefficient (DSC) from 78.8% to 83.5%, with MAE self-pretraining outperforming ImageNet pre-training despite the small dataset size. The DSC evaluates the performance of image segmentation algorithms: it provides a single scalar value that quantifies the agreement between the predicted and ground-truth segmentations, taking into account both precision and recall aspects of segmentation performance.

\text{DSC } = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}

(TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives)
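A straightforward implementation of the DSC for binary segmentation masks, matching the formula above, might look like this (assuming NumPy arrays of equal shape and at least one foreground voxel in either mask):

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice similarity coefficient, i.e. 2*TP / (2*TP + FP + FN),
    for binary segmentation masks of identical shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    return 2 * tp / (2 * tp + fp + fn)

print(dice_coefficient(np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0])))  # 0.5
```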


Lastly, for brain tumour segmentation (on the MSD dataset), MAE self-pretraining was applied to 484 MRI brain scans. This increased UNETR performance to 78.91% DSC and reduced the 95th-percentile Hausdorff distance (HD95) to 7.22 mm. The HD95 evaluates the dissimilarity between two sets of points: for each point in one set, the distance to the nearest point in the other set is computed, and instead of taking the maximum of these distances (as the standard Hausdorff distance does), HD95 takes the 95th percentile, which makes it less sensitive to outliers. While the DSC measures the overlap between two segmentations, HD95 captures the deviation in boundary detection. Furthermore, a mask ratio of 12.5% was found to be optimal for the segmentation tasks.

\text{HD95}(A, B) = \max\left( P_{95}\big\{ \min_{b \in B} d(a, b) : a \in A \big\},\; P_{95}\big\{ \min_{a \in A} d(b, a) : b \in B \big\} \right)

(d(a,b): the Euclidean distance between a point a from set A and a point b from set B; \min: the distance from a point to the closest point in the other set; P_{95}: the 95th percentile of these distances, which replaces the maximum used in the standard Hausdorff distance.)
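A possible implementation using SciPy is sketched below; note that conventions differ slightly between toolkits (some take the percentile over the pooled distances of both directions), and the inputs are assumed to be the boundary voxel coordinates of the two segmentations.

```python
import numpy as np
from scipy.spatial import cKDTree

def hd95(points_a: np.ndarray, points_b: np.ndarray) -> float:
    """95th-percentile Hausdorff distance between two point sets,
    e.g. the boundary voxel coordinates of two segmentation masks."""
    d_a_to_b, _ = cKDTree(points_b).query(points_a)  # distance of each a to its nearest b
    d_b_to_a, _ = cKDTree(points_a).query(points_b)  # distance of each b to its nearest a
    return max(np.percentile(d_a_to_b, 95), np.percentile(d_b_to_a, 95))
```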


To conclude, MAE self-pretraining improved the results of medical image classification and segmentation, demonstrating its advantages over established supervised and ImageNet pre-training methods [4].

5. Outlook

In conclusion, the paper highlights the significance of simple yet scalable algorithms in deep learning. While self-supervised learning methods in NLP have benefited from increasingly large models, computer vision has until recently relied mainly on supervised paradigms, even though progress has been made in self-supervised techniques. The paper demonstrates that masked autoencoders, a straightforward self-supervised approach similar to those in NLP, can yield scalable benefits in vision tasks. Despite the inherent differences between image and language signals, the MAE shows that it can infer complex visual concepts and produce convincing reconstructions. This paves the way for future research and innovation in the field of computer vision.



6. References

[1] Stojnev, D., & Stojnev Ilić, A. (2020). Preprocessing image data for deep learning. Paper presented at Sinteza 2020 - International Scientific Conference on Information Technology and Data Related Research.

[2] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[3] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16000-16009).

[4] Zhou, L., Liu, H., Bae, J., He, J., Samaras, D., & Prasanna, P. (2023, April). Self pre-training with masked autoencoders for medical image classification and segmentation. In 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI) (pp. 1-6). IEEE.

