1- Introduction

1.1- What is Masking?

In the simplest terms, masking means removing some parts of the data.

Figure 1: Word masking example in the BERT language model.

Figure 1 shows an example of the masked language model BERT, proposed by Jacob Devlin et al. [1]. In the input, one word is masked (removed), and the model outputs a list of possible words for the masked position.
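This masked-word prediction can be tried directly with a pre-trained BERT model, for example via the Hugging Face transformers library. The sentence below is an arbitrary illustrative input, not the example shown in Figure 1.

```python
from transformers import pipeline

# Load a pre-trained BERT model with its masked-language-modelling head.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# [MASK] marks the removed word; the model returns a ranked list of candidate words.
for prediction in unmasker("The goal of life is [MASK]."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```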

Figure 2: Grid masking examples with 75%, 85%, and 95% mask coverage, respectively.

Figure 2 shows an example of masking in the image domain. In this sample, the grid masking technique is applied with different coverage ratios.

1.2- Masking for Machine Learning

Masking techniques are widely used in many different applications. Some popular applications are:

  • Image inpainting
  • Reconstruction of intentionally or unintentionally damaged images
  • Language models
  • Object removal
  • Object detection

The main contributions of masking techniques can be summarized as follows:

  • It helps to train more robust and generalized models.
  • It can be used as a data augmentation technique.
  • It helps to reduce the required training data size; in other words, it helps to cope with overfitting and to obtain more generalized models with less data.
  • As a result of the decreased training data size, it saves computational cost.

1.3- Mask Types in Image Domain

1.3.1- Shape-Aware Masks:

Shape-aware masking is the most sophisticated masking technique. The mask boundaries exactly match the shape of an organ or an object in the original image, as shown in Figure 3.

Figure 3: Shape aware masking example.

1.3.2- Regular Masks:

Figure 4: Regular mask examples. The left side shows random patches, the middle one shows grid masking and the right one shows box masking.

1.3.2.1- Random Patches:

The image is divided into a grid of square regions, and some of the squares are selected randomly to be masked.

1.3.2.2- Grid Masks:

Grid masking is very similar to the random patches method, but rather than choosing the masked squares randomly, it follows a checkerboard pattern of alternating masked and unmasked squares.

1.3.2.3- Box Masks:

Box masks consist of one or more rectangular masks. These boxes can have various sizes and positions on the image.
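As a minimal sketch, the three regular mask types above can be generated with a few lines of NumPy; the patch size, box size, and coverage values here are arbitrary illustrative choices rather than settings taken from any of the reviewed papers.

```python
import numpy as np

def _cells_to_mask(cells, patch, h, w):
    """Blow each grid cell up to a patch x patch block of pixels and crop to (h, w)."""
    mask = np.kron(cells.astype(np.uint8), np.ones((patch, patch), dtype=np.uint8))
    return mask[:h, :w].astype(bool)

def random_patch_mask(h, w, patch=16, coverage=0.5, rng=None):
    """Random patches: divide the image into square cells and mask a random subset."""
    rng = np.random.default_rng() if rng is None else rng
    gh, gw = -(-h // patch), -(-w // patch)              # ceil division
    cells = rng.random((gh, gw)) < coverage              # True = masked cell
    return _cells_to_mask(cells, patch, h, w)

def grid_mask(h, w, patch=16):
    """Grid mask: alternate masked and unmasked cells in a checkerboard pattern."""
    gh, gw = -(-h // patch), -(-w // patch)
    cells = (np.add.outer(np.arange(gh), np.arange(gw)) % 2) == 0
    return _cells_to_mask(cells, patch, h, w)

def box_mask(h, w, box_h=64, box_w=64, rng=None):
    """Box mask: a single rectangle of the given size at a random position."""
    rng = np.random.default_rng() if rng is None else rng
    top = int(rng.integers(0, h - box_h + 1))
    left = int(rng.integers(0, w - box_w + 1))
    mask = np.zeros((h, w), dtype=bool)
    mask[top:top + box_h, left:left + box_w] = True
    return mask

# Example: a 256x256 random-patch mask with roughly 50% coverage.
mask = random_patch_mask(256, 256, coverage=0.5)
```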

1.3.3- Irregular Masks:

Figure 5: Irregular mask examples.

Irregular masks consist of completely random shapes. The mask does not have to be a single connected piece.
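Irregular masks are often generated procedurally, for example by drawing random brush strokes. The sketch below (with arbitrary stroke count, length, and thickness) is one simple way to do this, not the exact procedure used in any of the reviewed papers.

```python
import numpy as np

def irregular_mask(h, w, strokes=5, steps=30, thickness=8, rng=None):
    """Draw a few random-walk brush strokes; the result may be several disconnected pieces."""
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros((h, w), dtype=bool)
    yy, xx = np.mgrid[0:h, 0:w]
    for _ in range(strokes):
        y, x = int(rng.integers(0, h)), int(rng.integers(0, w))
        for _ in range(steps):
            mask |= (yy - y) ** 2 + (xx - x) ** 2 <= thickness ** 2   # stamp a disc
            y = int(np.clip(y + rng.integers(-thickness, thickness + 1), 0, h - 1))
            x = int(np.clip(x + rng.integers(-thickness, thickness + 1), 0, w - 1))
    return mask
```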

1.4- Mask Sizes

Figure 6: Effect of different mask coverages onto the inpainting results

Mask coverage is the pixel-wise ratio of masked pixels to the total number of pixels in the image. Masking can be applied at various coverages. Very small coverage rates might lead to overfitting, and very large coverage rates might make it impossible to reconstruct some details of the image. There is no single optimal coverage rate; it is highly dependent on the problem domain. One example of reconstruction under different coverage rates is shown in Figure 6.
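Written out, with M_{hole} a binary matrix whose entries are 1 for masked pixels and 0 otherwise, the coverage of an H × W image is:

coverage = \frac{1}{H \cdot W} \sum_{i=1}^{H} \sum_{j=1}^{W} M_{hole}(i, j)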

Figure 6: Examples of wide and narrow masks. The top five samples show narrow masks; the bottom five show wide masks.

The size of each mask piece is also an important parameter. Wide masks make inpainting harder, since it is more probable that some details are completely removed.

1.5- Widely used architectures for dealing with masked data

The most commonly used architectures in the inpainting tasks are:

  • Partial and gated convolutions: they use a normalization mechanism based on the mask coverage inside the kernel window.
  • Fast Fourier convolutions: a recent architecture with an image-wide receptive field, which makes it very successful at understanding the global structure of an image.
  • Masked autoencoders: a type of generative network consisting of an encoder and a decoder. The network takes a masked input; the encoder operates only on the visible patches, and the decoder then reconstructs the image from the latent representation and mask tokens.

2- Selected Papers

2.1- Image inpainting for irregular holes using partial convolutions

2.1.1 Architecture

In this paper, Guilin Liu et al. [2] proposed a U-Net [3]-like architecture that uses the newly proposed partial convolution layers instead of conventional convolution layers. A partial convolution layer consists of two components: the partial convolution itself and a mask update step.

Partial convolution:

x' = \begin{cases} W^{T}(X \odot M)\,\frac{\text{sum}(\mathbf{1})}{\text{sum}(M)} + b, & \text{if } \text{sum}(M) > 0 \\ 0, & \text{otherwise} \end{cases}

W: weights, b: bias, X: feature values in the current convolution window, M: the corresponding binary mask, sum(1)/sum(M): scaling factor

Before diving into the formula, it is worth explaining what the binary mask is. It is simply a 0-1 matrix in which valid (unmasked) pixels are marked with ones and masked pixels with zeros. As the formula shows, wherever the window contains at least one valid pixel, the weights are applied to the element-wise product of the input features and the binary mask. The result is then renormalized according to the number of valid pixels inside the window; this renormalization is performed by the scaling factor sum(1)/sum(M).

Mask update:

m' = \begin{cases} 1, & \text{if } \text{sum}(M) > 0 \\ 0, & \text{otherwise} \end{cases}

m': updated mask value

The mask update strategy is very simple: if the convolution was able to condition its output on at least one valid input value, the mask at that location is removed (i.e. the location is marked as valid, m' = 1).
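A minimal PyTorch sketch of a partial convolution layer that follows the two formulas above is given below, using the paper's convention of 1 = valid pixel and 0 = hole in the mask. It is a simplified illustration under those assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class PartialConv2d(nn.Module):
    """Simplified partial convolution: convolve X * M, rescale by sum(1)/sum(M), update M."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=True)
        # Fixed all-ones kernel, used only to count the valid pixels under each window.
        self.register_buffer("ones", torch.ones(1, in_ch, kernel_size, kernel_size))
        self.stride, self.padding = stride, padding

    def forward(self, x, mask):
        # mask: 1 for valid pixels, 0 for holes; broadcast it over the input channels.
        mask = mask.expand_as(x)
        with torch.no_grad():
            # sum(M) for every output location; sum(1) equals the window size ones.numel().
            valid = F.conv2d(mask, self.ones, stride=self.stride, padding=self.padding)
        out = self.conv(x * mask)
        bias = self.conv.bias.view(1, -1, 1, 1)
        scale = self.ones.numel() / valid.clamp(min=1.0)          # sum(1) / sum(M)
        # Rescale the features (not the bias) where at least one valid pixel was seen,
        # and zero the output everywhere else.
        out = torch.where(valid > 0, (out - bias) * scale + bias, torch.zeros_like(out))
        # Mask update: a location becomes valid if the window contained any valid pixel.
        new_mask = (valid > 0).float()
        return out, new_mask
```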

2.1.2 Loss Function

The loss function used in this paper is a linear combination of the following terms:

  • Lvalid: the pixel-wise loss on the unmasked pixels.
  • Lhole: the pixel-wise loss on the masked pixels.
  • Lperceptual: the L1 distance between higher-level features of the output and the ground truth, extracted with an ImageNet [4] pre-trained VGG-16 [5] network (a small sketch is given after this list).
  • Lstyle: similar to the perceptual loss, but an autocorrelation (Gram matrix) is applied to each feature map first.
  • Ltv: a total variation smoothness penalty on a one-pixel dilation of the masked region.
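As an illustration of the perceptual loss referenced above, the following sketch compares VGG-16 feature maps of the prediction and the ground truth with an L1 distance. The chosen feature layers (the three pooling outputs) and the torchvision weight loading (assuming a recent torchvision) are assumptions made for this example rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F
from torch import nn
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    """L1 distance between VGG-16 feature maps of the prediction and the target."""

    def __init__(self, layer_ids=(4, 9, 16)):    # pool1, pool2, pool3 (illustrative choice)
        super().__init__()
        features = vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in features.parameters():
            p.requires_grad_(False)               # the feature extractor stays frozen
        self.features = features
        self.layer_ids = set(layer_ids)

    def forward(self, pred, target):
        loss, x, y = 0.0, pred, target
        for i, layer in enumerate(self.features):
            x, y = layer(x), layer(y)
            if i in self.layer_ids:
                loss = loss + F.l1_loss(x, y)
            if i >= max(self.layer_ids):          # no need to run deeper layers
                break
        return loss
```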

2.1.3 Results

Figure 7: Comparison of partial convolutions with various methodologies.

Compared methodologies:

  • PM: PatchMatch [6], the state-of-the-art non-learning based approach
  • GL: Globally and locally consistent image completion [7]
  • GntIpt: Generative image inpainting with contextual attention [8]
  • PConv: Partial convolutions [2]
  • GT: Ground truth

Figure 7 clearly shows that partial convolutions outperform the other methodologies.

Figure 8: Quantitative comparisons with various methods. Columns represent different hole-to-image area ratios. N = no border, B = border.

In Figure 8, the quantitative comparison shows that partial convolutions perform best on the L1 loss, peak signal-to-noise ratio (PSNR), structural similarity index (SSIM) [9], and inception score (IScore) [10] metrics.

2.2- Resolution-robust large mask inpainting (LAMA) with fast Fourier convolutions (FFC)

2.2.1- Architecture

Figure 9: Architecture of LAMA with FFC model.

In this paper, Roman Suvorov et al. [11] proposed the LAMA with FFC architecture to deal with large masks. It is shown that a large receptive field in the early layers is very important for understanding the global structure of the image. There are two important components in this architecture:

Mask Generator: It concatenates the generated mask with the input image. Masks are generated in one of two ways, each chosen with probability 0.5 (a small sketch of this sampling policy follows the list):

  • In the first method, the masks are generated as irregular masks.
  • In the second method, rectangular box masks of random size and position are used.
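A minimal sketch of this 50/50 sampling policy, reusing the irregular_mask and box_mask helpers from Section 1.3, is shown below; the size ranges are illustrative and not taken from the paper.

```python
import numpy as np

def training_mask(h, w, rng=None):
    """Pick one of the two mask generation strategies with probability 0.5 each."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < 0.5:
        return irregular_mask(h, w, rng=rng)                 # random brush strokes
    return box_mask(h, w,                                    # rectangle of random size/position
                    box_h=int(rng.integers(h // 8, h // 2)),
                    box_w=int(rng.integers(w // 8, w // 2)),
                    rng=rng)
```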

Fast Fourier Convolutions: FFC is based on a channel-wise fast Fourier transform (FFT). It splits the channels into a local and a global branch, applies conventional convolutions on the local branch, and applies the FFT-based spectral transform on the global branch. Finally, the outputs of the global and local branches are fused together. The main advantage of fast Fourier convolutions is an image-wide receptive field starting from the early layers, which helps the model understand the global context better.
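A simplified sketch of an FFC block is shown below: the channels are split into a local and a global branch, the global branch applies a spectral transform (FFT, a 1×1 convolution in the frequency domain, inverse FFT), and the two branches are concatenated again. The split ratio and layer sizes are illustrative, and the cross-branch connections of the full FFC block are omitted.

```python
import torch
from torch import nn

class SpectralTransform(nn.Module):
    """Global branch: real FFT -> 1x1 convolution in the frequency domain -> inverse FFT."""

    def __init__(self, channels):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis, hence 2x channels.
        self.freq_conv = nn.Sequential(
            nn.Conv2d(channels * 2, channels * 2, kernel_size=1),
            nn.BatchNorm2d(channels * 2),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        spec = torch.fft.rfft2(x, norm="ortho")                 # (b, c, h, w//2 + 1), complex
        spec = torch.cat([spec.real, spec.imag], dim=1)         # (b, 2c, h, w//2 + 1)
        spec = self.freq_conv(spec)                             # image-wide receptive field
        real, imag = spec.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")


class FFCBlock(nn.Module):
    """Split channels into local/global branches, process each, and fuse (simplified)."""

    def __init__(self, channels, global_ratio=0.5):
        super().__init__()
        self.c_g = int(channels * global_ratio)
        self.c_l = channels - self.c_g
        self.local_conv = nn.Conv2d(self.c_l, self.c_l, kernel_size=3, padding=1)
        self.global_branch = SpectralTransform(self.c_g)

    def forward(self, x):
        x_l, x_g = torch.split(x, [self.c_l, self.c_g], dim=1)
        return torch.cat([self.local_conv(x_l), self.global_branch(x_g)], dim=1)
```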

2.2.2 Loss Function

L_{final} = \kappa L_{Adv} + \alpha L_{HRFPL} + \beta L_{DiscPL} + \gamma R_1

  • L_{Adv}: adversarial loss, responsible for the generation of natural-looking local details.
  • L_{DiscPL}: discriminator-based perceptual (feature matching) loss, which helps to stabilize training.
  • L_{HRFPL}: high receptive field perceptual loss, responsible for the consistency of the global structure.
  • R_1: gradient penalty.
  • \kappa, \alpha, \beta, \gamma: weights of the individual loss terms.

The final loss is a linear combination of the adversarial loss, the discriminator-based perceptual loss, the high receptive field perceptual loss, and the gradient penalty.

2.2.3 Results

Figure 10: Visual comparison of LAMA with FFC model

As shown in Figure 10, LAMA with FFC outperforms the other architectures in the reconstruction of repetitive textures.

Figure 11: Quantitative evaluation of inpainting on the Places and CelebA-HQ datasets, reported with the learned perceptual image patch similarity (LPIPS) [12] and Fréchet inception distance (FID) [13] metrics. Scores that deteriorate or improve relative to the LaMa-Fourier model (presented in the first row) are marked accordingly.

FID and LPIPS are commonly used metrics for reconstruction quality; for both, lower is better. The results are reported per masking strategy. As shown in Figure 11, LAMA with FFC outperforms most of the models in every category. Only the CoModGAN [14] and MADF [15] architectures perform similarly in the narrow-mask cases, but their parameter counts are 3-4 times larger than that of LAMA with FFC.

2.3- Masked Autoencoders (MAE) Are Scalable Vision Learners

2.3.1- Architecture

Figure 12: Masked Autoencoder Architecture

In this paper, Kaiming He et al. [16] proposed the masked autoencoder architecture. It is a variant of the autoencoder architecture that receives a masked input and uses an encoder operating only on the visible part of it.

Masking Input: The input is divided into patches, and some of the patches are randomly selected as visible patches; the rest are masked out.

Encoder: It operates only on the visible patches and generates the latent code from them (shown as blue boxes in Figure 12).

Decoder: It takes the latent code and the mask tokens (shown as grey boxes in Figure 12) and reconstructs the target image.
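The per-sample random masking can be sketched as follows; patch embedding and the encoder/decoder networks themselves are omitted, and the masking ratio is simply a parameter of the sketch.

```python
import torch

def random_masking(patches, mask_ratio=0.8):
    """Keep a random subset of patches visible; the encoder only sees those.

    patches: (batch, num_patches, dim) tensor of embedded image patches.
    Returns the visible patches, a binary mask (1 = removed patch), and the
    indices needed to restore the original patch order in the decoder.
    """
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))

    noise = torch.rand(b, n)                          # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)         # random permutation of the patches
    ids_restore = torch.argsort(ids_shuffle, dim=1)   # inverse permutation

    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))

    mask = torch.ones(b, n)                           # 1 = masked, replaced by mask tokens
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)         # back to the original patch order
    return visible, mask, ids_restore
```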

2.3.2 Loss Function

MSE = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \widehat{Y}_i\right)^2

  • MSE: mean squared error
  • n: number of data points
  • Y_i: observed (ground truth) values
  • \widehat{Y}_i: predicted values

In this paper, the mean squared error is used as the loss function. It is applied pixel-wise between the reconstructed output and the target image, and it is computed only on the masked patches.

2.3.3 Results

Figure 13: MAE results at an 80% mask coverage ratio. Each triplet consists of the masked input on the left, the MAE output in the middle, and the ground truth on the right.

MAE is a very fast and simple architecture. As shown in Figure 13, even though the results are blurry, it reconstructs the global structure very well at very high masking coverage ratios.

2.4- Incremental transformer structure enhanced image inpainting with masking positional encoding

2.4.1 Architecture

Figure 14: Zero-initialized Residual Addition based Incremental Transformer Structure (ZITS) 

In this paper, Qiaole Dong et al. [17] proposed the ZITS architecture. It is shown that transformers can be used to understand the global structure of images. Additionally, a masking positional encoding (MPE) method is proposed to improve the generalization of the model.

Input: There are four different inputs: a mask, the masked input image, masked edges, and masked lines. The mask and masked input image are self-explanatory. The masked edges are the output of an edge detector applied to the image, and the masked lines capture the continuous lines in the image. All of them are concatenated and fed into the transformer structure restoration (TSR) module.

Transformer Structure Restoration (TSR): It consists of a down-sampling CNN, transformer blocks, and an up-sampling CNN. The main aim of this component is to understand the global structure of the image using the capabilities of transformers. The output of TSR is the recovered edges and lines, as shown in Figure 14.

Simple Structure Upsampler: It is a learning-based up-sampler consisting of simple CNNs.

Structure Feature Encoder: It takes the upsampled recovered lines and edges. It is an autoencoder architecture consisting of a down-sampling encoder with gated convolutions [18], three residual blocks [19] with dilated convolutions [20], and an up-sampling decoder with gated convolutions.

Masking Positional Encoding: It encodes, for each masked pixel, its position relative to the rest of the masked region. For example, comparing the masked image and its positional encoding in Figure 14, pixels that are far away from the border of the masked region appear in brighter green, and the brightness decreases as the pixels get nearer to the border.
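One way to obtain such a distance-to-border encoding is a distance transform over the hole region; the sketch below is an illustrative approximation of the idea, not necessarily the exact formulation used in the paper.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def masking_positional_encoding(hole_mask):
    """Distance of each masked pixel to the nearest unmasked pixel, normalized to [0, 1].

    hole_mask: boolean array, True for masked pixels. Pixels deep inside the hole get
    values close to 1 (the brighter green in Figure 14); pixels near the hole border get
    values close to 0; unmasked pixels are 0.
    """
    dist = distance_transform_edt(hole_mask)   # 0 outside the hole, grows towards its center
    if dist.max() > 0:
        dist = dist / dist.max()
    return dist
```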

Fourier CNN Texture Restoration (FTR): It is used for texture restoration, and its main building block is the fast Fourier convolution (FFC).

2.4.2 Loss Function

L_{total} = \lambda_{L1} L_{L1} + \lambda_{adv} L_{adv} + \lambda_{fm} L_{fm} + \lambda_{hrf} L_{hrf}

  • L_{L1}: L1 loss, calculated only on the unmasked pixels.
  • L_{adv}: adversarial loss, responsible for the generation of natural-looking local details.
  • L_{fm}: feature matching loss, used for stable generative network training.
  • L_{hrf}: high receptive field perceptual loss, responsible for the consistency of the global structure.
  • \lambda_{L1}, \lambda_{adv}, \lambda_{fm}, \lambda_{hrf}: weights of the loss terms in the final loss.

The loss function used in this architecture is a linear combination of these terms.

2.4.3 Results

Figure 15: Visual comparison of the outputs of ZITS with other architectures

It is shown that the ZITS architecture is very successful at reconstructing indoor scenes and clearly outperforms the other methodologies.

Figure 16: Quantitative results on different datasets

Figure 16 shows a quantitative comparison using the PSNR, SSIM, FID [13], and LPIPS [12] metrics. For PSNR and SSIM higher is better; for FID and LPIPS lower is better. ZITS outperforms the other models on all datasets and metrics.

Figure 17: Ablation studies of MPE on 512×512 Places2 finetuned with dynamic resolutions from 256 to 512.

Figure 17 shows the effect of using MPE in terms of the PSNR, SSIM, FID, and LPIPS metrics; MPE improves all of them.

3- Review

Image inpainting for irregular holes using partial convolutions

Strengths
  • Its base architecture, U-Net, is a very simple and well-known architecture.
  • The proposed partial convolution layer is a very effective way of reconstructing images with irregular masks.

Weaknesses
  • It is not very successful when the samples are sparsely structured.
  • It is also not very powerful on images with large masks.

Resolution-robust large mask inpainting (LAMA) with fast Fourier convolutions (FFC)

Strengths
  • The LAMA with FFC architecture is very powerful, especially on images with repetitive structures.
  • Thanks to the FFT, it has an image-wide receptive field from the early layers, which makes it better at understanding the global structure.
  • It can generalize to never-seen high resolutions with a very parameter-efficient structure compared to the state of the art at those resolutions.

Weaknesses
  • The architecture is not very successful on samples with strong perspective distortion.

Masked Autoencoders (MAE) Are Scalable Vision Learners

Strengths
  • It has a very simple architecture.
  • It learns the global structure well even with very high mask coverage.

Weaknesses
  • The reconstructed scenes are generally blurry.

Incremental transformer structure enhanced image inpainting with masking positional encoding

Strengths
  • It showed that transformers can be used to understand the global structure of images.
  • The proposed MPE improves the inpainting results.
  • It is very powerful in the reconstruction of indoor scenes.

Weaknesses
  • It has many complex components and therefore a complex architecture.

Summary

All four papers present strong approaches to the inpainting problem. The first paper was far ahead of its time with the proposal of partial convolutions and led to many new research directions. MAE gives promising results for those looking for a simple inpainting architecture. LAMA with FFC and ZITS were both published in 2022 and achieve state-of-the-art results on some datasets, but they have not been compared with each other yet. ZITS appears to be more successful in its own quantitative comparison than LAMA; however, LAMA is the simpler architecture, and it also has fewer parameters than the architectures that perform similarly to it, so it seems to be the more efficient choice.

4- References

  1. Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

  2. Liu, Guilin, et al. "Image inpainting for irregular holes using partial convolutions." Proceedings of the European conference on computer vision (ECCV). 2018.

  3. Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.

  4. Deng, Jia, et al. "Imagenet: A large-scale hierarchical image database." 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009.

  5. Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

  6. Barnes, C., Shechtman, E., Finkelstein, A., Goldman, D.B.: PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics-TOG 28(3), 24 (2009)

  7. Iizuka, S., Simo-Serra, E., Ishikawa, H.: Globally and locally consistent image completion. ACM Transactions on Graphics (TOG) 36(4), 107 (2017)

  8. Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.S.: Generative image inpainting with contextual attention. arXiv preprint arXiv:1801.07892 (2018)

  9. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4), 600–612 (2004)

  10. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems. pp. 2234–2242 (2016)

  11. Suvorov, Roman, et al. "Resolution-robust large mask inpainting with fourier convolutions." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2022.

  12. Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.

  13. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. arXiv preprint arXiv:1706.08500, 2017.

  14. Shengyu Zhao, Jonathan Cui, Yilun Sheng, Yue Dong, Xiao Liang, Eric I Chang, and Yan Xu. Large scale image completion via co-modulated generative adversarial networks. In International Conference on Learning Representations (ICLR), 2021.

  15. Manyu Zhu, Dongliang He, Xin Li, Chao Li, Fu Li, Xiao Liu, Errui Ding, and Zhaoxiang Zhang. Image inpainting by end-to-end cascaded refinement with mask awareness. IEEE Transactions on Image Processing, 30:4855–4866, 2021.

  16. He, Kaiming, et al. "Masked autoencoders are scalable vision learners." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

  17. Dong, Qiaole, Chenjie Cao, and Yanwei Fu. "Incremental transformer structure enhanced image inpainting with masking positional encoding." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

  18. Dauphin, Yann N., et al. "Language modeling with gated convolutional networks." International conference on machine learning. PMLR, 2017.
  19. He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
  20. Yu, Fisher, and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions." arXiv preprint arXiv:1511.07122 (2015).