Introduction
Anomaly detection identifies samples that deviate from a data set's normal behavior. This normal behavior is learned by feeding the network samples from a single class for which an overwhelming amount of data, deemed regular, is available. The network then learns to represent this class so well that it can easily distinguish an irregular sample. Application areas include industry and medicine: in industrial settings, anomaly detection can reveal defects in production, while in medical settings it can indicate whether focal lesions are present in a medical image.
The rise of anomaly detection can be credited to advances in deep learning, especially convolutional neural networks (CNNs). CNNs take an image as input and learn various aspects of it in order to perform classification, localization, detection or segmentation. For detection, common architectures built on CNN layers include U-Net [12], auto-encoders (AEs) [1] and, to some extent, generative adversarial networks (GANs) [7]. Many anomaly detection solutions are variants or combinations of these architectures [4, 5, 6, 13, 14, 17]. In recent years, transformers have also increasingly been used in anomaly detection solutions [10, 15]. These architectures can be employed to detect anomalies through supervised, self-supervised and unsupervised learning.
Supervised Learning
Supervised learning involves feeding a model input-output pairs with the objective of learning to map each input to its correct output. The data fed to these networks are referred to as labeled data, and the labels guide the network's predictions. Supervised anomaly detection solutions typically pre-train a model on a large data set (e.g., ImageNet), then perform feature adaptation and finally anomaly scoring [11]. For this approach it is important that the extracted low-level features transfer well to the model they are being adapted to. Consider pre-training a model on ImageNet and adapting its features for anomaly detection in medical images: the difference between these data sets makes it challenging to achieve good performance. Another shortcoming is the dependency on labeled data. Staying with the medical field, expert annotation at the scale such networks require is costly and may not be feasible. Furthermore, since experts are human, there is also a chance of data being mislabeled. Perhaps self-supervised methods can overcome these challenges.
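As a concrete illustration of this pipeline, below is a minimal PyTorch sketch in which an ImageNet-pretrained ResNet18 supplies features and a simple k-nearest-neighbor distance to the features of the normal training set serves as the anomaly score. The kNN scoring and all function names are illustrative assumptions, not the exact method of [11]:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Penultimate-layer features from an ImageNet-pretrained ResNet18.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def embed(x):
    # x: (B, 3, 224, 224), already ImageNet-normalized.
    return F.normalize(backbone(x), dim=1)

@torch.no_grad()
def knn_score(test_images, normal_features, k=5):
    # Mean distance to the k nearest features of the normal training set;
    # larger distances indicate more anomalous samples.
    d = torch.cdist(embed(test_images), normal_features)
    return d.topk(k, largest=False).values.mean(dim=1)
```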
Self-supervised Learning
In self-supervised learning, no labels are given. The aim is to leverage one part of the data to predict or discover information about other parts of the data and to generate labels accordingly. Some solutions pre-train a model so that it is better suited to the target task. In the context of anomaly detection, pre-training can be done with a self-supervised model whose features are then adapted to a state-of-the-art unsupervised anomaly detection (UAD) model. These self-supervised models are usually quite robust and commonly use contrastive learning for optimization. Contrastive learning teaches a model to distinguish which parts of the data are similar and which are different. One notable solution in this area is Constrained Contrastive Distribution learning (CCD), proposed by [14]. CCD performs anomaly detection by extracting patches from the images and training a network on those patches, training a network on the entire image, and finally combining the outputs of both training processes in an image-level anomaly score. CCD's encoder used ResNet18 [8], a CNN that is eighteen (18) layers deep. The results of pre-training with CCD and then applying state-of-the-art UAD methods can be seen in table 1: CCD improved performance, but the gain varies depending on which UAD method is used. Additionally, training these models takes quite long compared to unsupervised methods because of the pre-training and fine-tuning steps.
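Contrastive learning is commonly optimized with a loss of the following form. This is a minimal sketch of the standard InfoNCE loss, on which CCD builds with additional constraints; the function name and temperature value are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.5):
    # z1, z2: (B, D) embeddings of two augmented views of the same batch.
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (B, B) pairwise similarities
    # The i-th sample's positive is its own second view (the diagonal);
    # every other sample in the batch acts as a negative.
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```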
Unsupervised Learning
The goal of unsupervised learning is to find underlying patterns within a data set. The data used to train these models are unlabeled and consist of only one class. This suits applications where one class holds an overwhelming amount of the data, as in medical imaging, where far more normal images are available than abnormal ones. The idea is that a model trained on normal images learns their representation, so at test time it can identify anomalous images because it never learned to represent them. Most solutions use reconstruction methods to achieve this task.
Reconstruction Methods
Reconstruction methods leverage the fact that there are more images representing normal behavior than images representing abnormal behavior. A model is trained only on images depicting normal behavior and learns to represent that class well, so it can reconstruct a normal image with a low reconstruction error. At inference, an image containing an anomaly is either poorly reconstructed where the anomaly exists, or the anomaly is not reconstructed at all and a healthy-looking image is produced. In both cases the reconstruction error increases, since the input and the reconstructed image differ where the anomaly lies. The higher the reconstruction error, the higher the chance that the image is abnormal. A popular architecture for image reconstruction is the variational auto-encoder (VAE). VAEs consist of an encoder that converts an input into its latent-space representation and a decoder that reconstructs an image from that representation. Their advantage is a regularized, continuous latent space that allows sampling from the latent space and obtaining useful results. [4] proposed Segmentation Regularized Anomaly (StRegA), a VAE-based approach tailored to medical data with the aim of improving VAEs for the clinical field.
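To make the reconstruction-based scoring concrete, here is a minimal PyTorch sketch of a convolutional VAE with a mean-squared reconstruction error as the anomaly score. The architecture sizes are illustrative, not those of any model discussed here:

```python
import torch
import torch.nn as nn

class SmallVAE(nn.Module):
    # A minimal convolutional VAE for 64x64 grayscale slices.
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),   # -> 32x32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # -> 16x16
            nn.Flatten(),
        )
        self.to_mu = nn.Linear(64 * 16 * 16, latent_dim)
        self.to_logvar = nn.Linear(64 * 16 * 16, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z from the regularized latent space.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def anomaly_score(model, x):
    # Mean pixel-wise reconstruction error; higher values suggest an anomaly.
    x_hat, _, _ = model(x)
    return ((x - x_hat) ** 2).mean(dim=(1, 2, 3))
```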
StRegA
StRegA has three main components: pre-processing, a compact context-encoding VAE (cce-VAE) and post-processing.
1) Pre-Processing
The first two steps in pre-processing use the FSL library, an analysis tool for brain imaging data. In the context of this paper, the FSL BET step removes non-brain tissue from an image, and the FSL FAST step separates the image into different tissue types: gray matter, white matter, cerebrospinal fluid (CSF) and the background. Thereafter, both intensity-based (e.g., Gaussian noise) and spatial (e.g., horizontal flip) augmentations were applied to the brain image. The volumes were then divided into 2D slices along the axial orientation. Finally, bi-linear interpolation was applied, in which each assigned value is an intermediate value between the four nearest pixels.
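A hypothetical sketch of this pipeline is shown below, calling the FSL command-line tools and slicing the volume with nibabel and SciPy. The file names are placeholders, and the exact FSL flags used in [4] are not reproduced here:

```python
import subprocess
import nibabel as nib
from scipy.ndimage import zoom

# FSL brain extraction (BET) and tissue segmentation (FAST); flags omitted,
# consult the FSL documentation for the options used in [4].
subprocess.run(["bet", "t1.nii.gz", "t1_brain.nii.gz"], check=True)
subprocess.run(["fast", "t1_brain.nii.gz"], check=True)

volume = nib.load("t1_brain.nii.gz").get_fdata()  # shape (X, Y, Z)

# Split the 3D volume into axial 2D slices and resize each slice with
# bilinear interpolation (spline order 1) to a fixed in-plane resolution.
slices = [
    zoom(volume[:, :, k],
         (256 / volume.shape[0], 256 / volume.shape[1]), order=1)
    for k in range(volume.shape[2])
]
```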
2) cce-VAE
This proposed solution is an extension of the context-encoding VAE (ceVAE) proposed by [17]. The objective of context encoding is to reproduce the contents of a randomly masked-out image region based on its surroundings, which forces the model to understand the image as well as possible. The VAE part of this solution functions as previously described, except that the reconstruction term is combined with density-based anomaly scoring. The objective function of ceVAE consists of a Kullback-Leibler loss, a reconstruction loss and a context-encoding loss, as shown below:
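Written out, with weighting factors $\lambda_i$ included for generality (the exact formulation and weights are those of [17]):

$$\mathcal{L}_{\text{ceVAE}} = \lambda_1\, D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\Vert\,p(z)\big) + \lambda_2\, \mathcal{L}_{\mathrm{rec}}(x, \hat{x}) + \lambda_3\, \mathcal{L}_{\mathrm{CE}}\big(x, g_\theta(f_\phi(\tilde{x}))\big)$$

where $\hat{x}$ is the VAE reconstruction of the input $x$, $\tilde{x}$ is the input with a randomly masked-out region, and $f_\phi$ and $g_\theta$ denote the encoder and decoder.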
Note that the Kullback-Leibler loss measures how two distributions differ from each other. The ceVAE approach detects anomalies on a sample and pixel level, facilitating better reconstruction and incorporating model-internal variations such as deviations of the latent representation from the normal range. cce-VAE differs from ceVAE only in the feature maps and the latent-variable size: ceVAE has a symmetric encoder and decoder with 16, 64, 256 and 1024 feature maps and a latent-variable size of 1024 [17], while cce-VAE is a more compact version with 64, 128 and 256 feature maps and a latent-variable size of 256 [4].
3) Post-processing
After receiving the model output from cce-VAE, all negative values are set to 0 and Otsu thresholding is applied. Otsu thresholding exhaustively searches for the threshold that minimizes the weighted intra-class variance (equivalently, maximizes the weighted between-class variance), based on the gray-value histogram of the image. After thresholding, morphological opening removes unwanted pixels from the image based on some criteria before the slices are converted back into a 3D representation.
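A minimal sketch of these three post-processing steps on a single 2D slice, using scikit-image (the structuring-element size is an assumption):

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.morphology import binary_opening, disk

def postprocess_slice(residual: np.ndarray) -> np.ndarray:
    # residual: a 2D anomaly/residual map produced by the cce-VAE.
    # Step 1: set all negative values to zero.
    residual = np.clip(residual, 0.0, None)
    # Step 2: Otsu picks the threshold on the gray-value histogram that
    # minimizes the weighted intra-class variance.
    mask = residual > threshold_otsu(residual)
    # Step 3: morphological opening removes small spurious foreground pixels.
    return binary_opening(mask, disk(2))
```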
Implementation and Evaluation
T1- and T2-weighted magnetic resonance imaging (MRI) brain images were used for training. Two models were built: one trained on T1 images only and one trained on both T1 and T2 images. The training images were merged from two data sets (the MOOD and IXI data sets), which increases the robustness of the model. The data sets used for testing were the MOOD toy data, synthetic anomalous data generated by the authors, and the BraTS data set. When testing, T1 contrast-enhanced images were used for the T1 model and regular T2 images for the T2 model.
DICE was used to evaluate both the T1 and T2 models. As tables 2 and 3 show, StRegA outperformed its counterparts. The tables also show that the T2 model performs better than the T1 model, which is intuitive since tumors are more visible in T2 images. In figures 5 and 6, under-sampling is visible in every sample when comparing column (e) with the ground truth in (f). Again the T2 model performs better here, its final output being closer to the ground truth than the T1 model's even though under-sampling is present. It is worth mentioning that the T1 model was trained on regular T1 images but tested on T1 contrast-enhanced images; it is unknown whether and how this affected the model's performance. A noticeable weakness was that subtle anomalies were not detected, as can be seen in the second row of figure 5. That subtle anomaly was in a T1 image; it is unknown whether the T2 model would have recognized an anomaly of similar size. If it would, then perhaps the T2 model should become the standard. Overall, reconstructing subtle anomalies with an acceptably low error (thus labeling them as normal) is a shortcoming of reconstruction methods, especially VAE-based approaches.
MemMC-MAE
Memory-augmented Multi-level Cross-attentional Masked Auto-encoder (MemMC-MAE) is a transformer-based reconstruction method that seeks to tackle the aforementioned shortcoming [15]. The approach masks out patches of the image and feeds only the visible parts to the encoder; a minimal masking sketch follows the list below. In this experiment, 75% of the image was masked while 25% remained visible and was fed to the encoder. The encoder's output is the latent representation of the unmasked section of the image. This is then paired with the masked section and fed to the decoder, whose objective is to reconstruct the image. In this model design:
- The generation of patches is random.
- Training is faster since the encoder processes only 25% of the image.
- The accuracy increases since the model has to be thorough to reconstruct the image.
- Each masked token is a shared, learned vector that indicates the presence of a missing patch.
- Positional encodings on the masked tokens allow the decoder to know where the patches are on the original image.
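A minimal PyTorch sketch of the random patch masking described above; the patch size, tensor layout and function names are illustrative, not MemMC-MAE's exact implementation:

```python
import torch

def random_patch_mask(images, patch=16, mask_ratio=0.75):
    # images: (B, C, H, W) with H and W divisible by the patch size.
    B, C, H, W = images.shape
    # Cut each image into non-overlapping patches: (B, N, C, patch, patch).
    patches = images.unfold(2, patch, patch).unfold(3, patch, patch)
    patches = patches.reshape(B, C, -1, patch, patch).transpose(1, 2)
    n = patches.size(1)
    keep = int(n * (1 - mask_ratio))            # e.g. keep 25% of the patches
    # A random permutation of patch indices per image; the first `keep`
    # indices are the visible patches, the rest are masked out.
    idx = torch.rand(B, n, device=images.device).argsort(dim=1)
    visible_idx = idx[:, :keep]
    visible = torch.gather(
        patches, 1,
        visible_idx[:, :, None, None, None].expand(-1, -1, C, patch, patch),
    )
    return visible, visible_idx                 # only these go to the encoder
```

The shared mask tokens and positional encodings mentioned in the list would then be supplied on the decoder side for the masked positions.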
Transformers are unable to model long-term memories effectively: the computation grows and the memory capacity is finite, causing models to drop information. To address this, the key and value operations in the encoder are given a learnable memory matrix that stores normal patterns, as shown in figure 7. For self-attention, a weighted sum of value vectors is computed from the cosine-similarity distribution between query and key. These techniques allow normal patterns to be captured at every layer. As a result, the decoder is forced to reconstruct a normal patch where an anomaly exists, increasing the reconstruction error, as shown in figure 8.
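A simplified single-head sketch of such a memory-augmented, cosine-similarity attention layer is given below. The number of memory slots and the exact placement of the memory are assumptions for illustration, not the precise MemMC-MAE layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAugmentedAttention(nn.Module):
    # A learnable memory matrix is concatenated to the keys and values so
    # that attention can also retrieve stored normal patterns.
    def __init__(self, dim: int, mem_slots: int = 64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.mem_k = nn.Parameter(torch.randn(mem_slots, dim))
        self.mem_v = nn.Parameter(torch.randn(mem_slots, dim))

    def forward(self, x):                       # x: (B, N, dim)
        B = x.size(0)
        q = self.q(x)
        k = torch.cat([self.k(x), self.mem_k.expand(B, -1, -1)], dim=1)
        v = torch.cat([self.v(x), self.mem_v.expand(B, -1, -1)], dim=1)
        # Cosine-similarity attention: normalize queries and keys before
        # the dot product, then take a softmax-weighted sum of the values.
        sim = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(1, 2)
        return F.softmax(sim, dim=-1) @ v
```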
The decoder uses both the masked and unmasked patches for image reconstruction. The key feature of this decoder is the use of cross attention which is computed using the outputs from all the encoding layers and the decoder layer output from the self-attention operator.
The anomaly score for MemMC-MAE was calculated using the multi-scale structural similarity (MS-SSIM), which considers image details at varying resolutions [18]. MemMC-MAE was evaluated using the area under the curve (AUC). The data sets used for testing are the Hyper-Kvasir data set [2] and the Covid-X data set [16]. Residual connections were used in both the encoder and decoder to transfer information between the blocks. As seen in table 4, MemMC-MAE outperformed its counterparts; on the Covid-X data set it surpassed the other models by at least 17.1%. These results show that the model improves UAD of subtle anomalies. However, scaling MemMC-MAE to other medical images, such as pathology images, would be difficult: pathology images are more complex, and pixel-wise anomaly measures do not perform well on them. A more plausible architecture in that case is s2-AnoGAN.
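As an illustration, an MS-SSIM-based anomaly score can be computed as one minus the multi-scale similarity between input and reconstruction. The sketch below assumes the third-party pytorch_msssim package:

```python
import torch
from pytorch_msssim import ms_ssim  # third-party package, an assumption here

def msssim_anomaly_score(x, x_hat):
    # x, x_hat: (B, C, H, W) in [0, 1]; H and W must be large enough
    # (roughly >= 161 px) for the default five MS-SSIM scales.
    sim = ms_ssim(x, x_hat, data_range=1.0, size_average=False)  # (B,)
    # High similarity means a faithful reconstruction, so invert it.
    return 1.0 - sim
```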
s2-AnoGAN
This GAN-based UAD model was designed specifically for pathology images. GAN models consist of a generator that performs the reconstruction and a discriminator that tries to determine whether an image is reconstructed or real. s2-AnoGAN uses StyleGAN2 [9] as its generator, which features several improvements over the original StyleGAN architecture: adaptive instance normalization (AdaIN) is replaced with modulation and normalization, the noise and bias are moved outside the style block, and the weights are adjusted through weight demodulation. StyleGAN's baseline configuration is Progressive GAN, an approach for reconstructing high-quality images; given the high complexity of pathology data, such an architecture is needed. In addition to a powerful reconstruction model, s2-AnoGAN incorporates edge information into the anomaly score using the Canny edge detector [3]. The intuition is that edge characteristics should be similar in the generated and original images, and considering the structure of pathology images it is easy to see how this edge information might be beneficial.
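A hypothetical sketch of such an edge-aware anomaly score, combining an intensity residual with the disagreement between Canny edge maps via OpenCV; the thresholds and the equal weighting are assumptions, not s2-AnoGAN's exact score:

```python
import cv2
import numpy as np

def edge_aware_score(original: np.ndarray, generated: np.ndarray) -> float:
    # original, generated: uint8 grayscale images of the same size.
    e1 = cv2.Canny(original, 100, 200)   # edge map of the real image
    e2 = cv2.Canny(generated, 100, 200)  # edge map of the reconstruction
    pixel_term = np.mean(
        (original.astype(np.float32) - generated.astype(np.float32)) ** 2
    )
    edge_term = np.mean(
        (e1.astype(np.float32) - e2.astype(np.float32)) ** 2
    )
    # Combine intensity residual with edge disagreement (equal weights assumed).
    return float(pixel_term + edge_term)
```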
This model was trained on grayscale images of healthy pathology samples. The authors used grayscale to ease comparison with previous results and to reduce the color variation between images sourced from different hospitals. Figure 10 shows how different GAN models performed on the test data: reconstructions of healthy images with s2-AnoGAN show less blurring, hence quite decent reconstructions, while tumor images are reconstructed poorly, which is exactly what is expected. Based on the results in table 5, s2-AnoGAN produced the better score in two of the three cases. Thus far, this seems like a promising model.
Review
The UAD methods discussed above produced acceptable results and are very promising for the development of anomaly detection in the medical field. However, they are quite model-specific and will not scale well to other, differing data sets. StRegA works solely on brain images: using the model as-is on MRI images of another anatomical area would not yield useful results, because the FSL library in the pre-processing step only works on brain imaging data, and images of other modalities will not work either. MemMC-MAE is less model-specific, since it was trained on endoscopic and X-ray images, but applying it to MRI or computed tomography (CT) images could still require pre-processing steps such as choosing an orientation. The scalability of s2-AnoGAN is a bit trickier to analyze. In some cases, performing well on complex data would suggest that good performance on less complicated data is plausible; this is not quite the case for anomaly detection, because everything depends on the anomaly-scoring method.
Anomaly scores can be sample-wise, pixel-wise, or a combination of both, and they can use structural information, edge information or any other image information deemed useful for the task. This wide range of possibilities makes choosing an anomaly score difficult, and given all the information such scores can include, it is natural for them to differ across image types. Circling back to s2-AnoGAN, which uses edge information in its anomaly score for pathology images: it is not clear how helpful that would be for diagnostic images. When detecting anomalies in axial brain images, intensity values should probably receive more consideration than edge information. In essence, the performance of these models hinges on two things: the reconstruction method and the anomaly-scoring method.
The models analyzed used robust reconstruction methods. MemMC-MAE's encoder-decoder pair deserves highlighting for its ability to capture normal patterns at every encoder layer. The only improvement to suggest is better data augmentation: the only technique employed is RandomResizedCrop, and perhaps the MedMix augmentation used in [13] could be applicable here. s2-AnoGAN's generator is also impressive, producing satisfactory reconstructions. Its only noted limitation is the long training time caused by gradient-descent optimization, which is in no way problematic for the anomaly detection task; the entire model design remains quite robust.
Regardless of the impressive results achieved thus far, UAD solutions can only be used in an assistive way in the medical realm: they can indicate to doctors that an anomaly may exist, but the final decision belongs to the doctor. Additionally, note that UAD methods do not provide information about the anomalies themselves. More experiments, especially with an increased focus on detecting subtle anomalies, need to happen in this field to create confidence in these models. Thereafter, the scalability of these models to different image types, modalities and anatomical structures should be assessed.
References
[1] Dor Bank, Noam Koenigstein, and Raja Giryes. Autoencoders. arXiv preprint arXiv:2003.05991, 2020.
[2] Hanna Borgli, Vajira Thambawita, Pia H Smedsrud, Steven Hicks, Debesh Jha, Sigrun L Eskeland, Kristin Ranheim Randel, Konstantin Pogorelov, Mathias Lux, Duc Tien Dang Nguyen, et al. Hyperkvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy. Scientific data, 7(1):1–14, 2020.
[3] John Canny. A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence, (6):679–698, 1986.
[4] Soumick Chatterjee, Alessandro Sciarra, Max Dünnwald, Pavan Tummala, Shubham Kumar Agrawal, Aishwarya Jauhari, Aman Kalra, Steffen Oeltze-Jafra, Oliver Speck, and Andreas Nürnberger. StRegA: Unsupervised anomaly detection in brain MRIs using a compact context-encoding variational autoencoder. arXiv preprint arXiv:2201.13271, 2022.
[5] Yuanhong Chen, Yu Tian, Guansong Pang, and Gustavo Carneiro. Unsupervised anomaly detection and localisation with multi-scale interpolated gaussian descriptors. arXiv preprint arXiv:2101.10043, 1(2):5, 2021.
[6] Thomas Defard, Aleksandr Setkov, Angelique Loesch, and Romaric Audigier. Padim: a patch distribution modeling framework for anomaly detection and localization. In International Conference on Pattern Recognition, pages 475–489. Springer, 2021.
[7] Ian Goodfellow. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[9] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
[10] Walter Hugo Lopez Pinaya, Petru-Daniel Tudosiu, Robert Gray, Geraint Rees, Parashkev Nachev, Sébastien Ourselin, and M Jorge Cardoso. Unsupervised brain anomaly detection and segmentation with transformers. arXiv preprint arXiv:2102.11650, 2021.
[11] Tal Reiss, Niv Cohen, Liron Bergman, and Yedid Hoshen. Panda: Adapting pretrained features for anomaly detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2806–2814, 2021.
[12] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
[13] Yu Tian, Fengbei Liu, Guansong Pang, Yuanhong Chen, Yuyuan Liu, Johan W Verjans, and Rajvinder Singh. Self-supervised multi-class pre-training for unsupervised anomaly detection and segmentation in medical images. 2021.
[14] Yu Tian, Guansong Pang, Fengbei Liu, Yuanhong Chen, Seon Ho Shin, Johan W Verjans, Rajvinder Singh, and Gustavo Carneiro. Constrained contrastive distribution learning for unsupervised anomaly detection and localisation in medical images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 128–140. Springer, 2021.
[15] Yu Tian, Guansong Pang, Yuyuan Liu, Chong Wang, Yuanhong Chen, Fengbei Liu, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. Unsupervised anomaly detection in medical images with a memory- augmented multi-level cross-attentional masked autoencoder. arXiv preprint arXiv:2203.11725, 2022.
[16] Linda Wang, Zhong Qiu Lin, and Alexander Wong. Covid-net: A tailored deep convolutional neural network design for detection of covid-19 cases from chest x-ray images. Scientific Reports, 10(1):1–12, 2020.
[17] David Zimmerer, Simon AA Kohl, Jens Petersen, Fabian Isensee, and Klaus H Maier-Hein. Context-encoding variational autoencoder for un-supervised anomaly detection. arXiv preprint arXiv:1812.05941, 2018.
[18] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. IEEE, 2003.