Introduction

Problem Statement and Motivation

Detection and localization of imaging biomarkers that indicate the state of a disease is a time-consuming and expensive task in many medical fields [1].
As a well-defined problem in need of automation, it represents an active and highly promising research field for machine learning.

Supervised machine learning paradigms already achieve expert-level accuracy in corresponding individual applications [1]. However, such paradigms depend on rare and expensive expert-labeled training data, and their predictive power is limited to known markers. In contrast, Schlegl et al., the authors of "f-AnoGAN", propose a generative model that follows an unsupervised paradigm to overcome these limitations.

Contribution

In March 2017, Schlegl et al. published the widely recognized paper "Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery", presenting the AnoGAN architecture (see Additional Read: AnoGAN).

"To the best of our knowledge, this is the first work, where GANs are used for anomaly or novelty detection" [2]

Two years later, "Fast"-AnoGAN embodies the authors' progress in this field and presents an improved and, at least during inference, computationally more efficient approach to using GANs for anomaly detection.

Methodology

Data

The authors' goal is to detect retinal fluid in Spectral-Domain Optical Coherence Tomography (SD-OCT) volumes of the retina. To be able to process the data, the volumes are intensity-normalized, cropped to the relevant area, flattened, and cut into 64x64 pixel patches.

Pre-processing of the SD-OCT volumes [1]

For the training, the authors use patches of solely normal appearance, meaning patches in which no retinal fluid is present.
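
As an illustration of this final pre-processing step, the following minimal sketch (Python/NumPy; function and variable names are hypothetical and not taken from the authors' code) cuts an already normalized, cropped and flattened 2D B-scan into 64x64 pixel patches:

```python
import numpy as np

def extract_patches(bscan: np.ndarray, patch_size: int = 64, stride: int = 64) -> np.ndarray:
    """Cut a 2D B-scan into patch_size x patch_size patches (non-overlapping for stride == patch_size)."""
    h, w = bscan.shape
    patches = []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patches.append(bscan[y:y + patch_size, x:x + patch_size])
    return np.stack(patches)

# Example: a dummy normalized slice of height 128 and width 512 yields 2 x 8 = 16 patches.
dummy = np.random.rand(128, 512).astype(np.float32)
print(extract_patches(dummy).shape)  # (16, 64, 64)
```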

Architecture



f-AnoGAN Architecture for the Detection of Anomalies [1]


The proposed f-AnoGAN architecture consists of a Wasserstein-GAN (WGAN) and a Convolutional Autoencoder (AE).

A WGAN differs from a classical Deep Convolutional GAN (DCGAN) in a couple of ways. For example, the cost function of a WGAN has a smoother gradient everywhere, which allows the model to learn even if the Generator (G) does not perform well. Further, the Discriminator (D) does not have an output sigmoid function and therefore outputs a scalar score rather than a probability; to enforce the Lipschitz constraint that this formulation requires, weight clipping is applied to D [3].

In contrast to the original AnoGAN model (see section Additional Read: AnoGAN), the addition of the AE enables f-AnoGAN to map from image space X to latent space Z.
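
To make the division of labor between the components concrete, the following PyTorch sketch outlines the three modules; the layer configurations and latent dimensionality are placeholder assumptions, not the architecture reported in the paper:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):          # maps a latent code z in Z to a 64x64 image
    def __init__(self, z_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 64 * 64), nn.Tanh())
    def forward(self, z):
        return self.net(z).view(-1, 1, 64, 64)

class Encoder(nn.Module):            # maps an image x to a point in Z
    def __init__(self, z_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, z_dim))
    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):      # WGAN critic: scalar score, no sigmoid
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.LeakyReLU(0.2))
        self.score = nn.Linear(256, 1)
    def forward(self, x):
        f = self.features(x)         # intermediate feature representation f(x)
        return self.score(f), f
```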

Training 

The training is split into two stages. In the first stage, the WGAN is trained. In the second stage, the weights of the WGAN are frozen and the AE is trained using the previously trained WGAN.

GAN Training

Wasserstein-GAN Training [1]


During the WGAN training, the Generator (G) and the Discriminator (D) are trained jointly, with alternating updates. The G of the WGAN is trained to generate images that fit the distribution of the training images [1]. The D of the WGAN learns to discriminate between real and generated samples. Concurrently, D learns a feature representation of the data that can be extracted from its intermediate layers and that is relevant for the training of the AE as well as for the anomaly detection [1].
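
A hedged sketch of this first training stage, following the standard WGAN procedure with weight clipping [3], might look as follows; the network stubs, hyper-parameters, and random stand-in data are illustrative assumptions only:

```python
import torch
import torch.nn as nn

z_dim, clip, n_critic = 128, 0.01, 5
G = nn.Sequential(nn.Linear(z_dim, 64 * 64), nn.Tanh())         # generator stub
D = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256),
                  nn.LeakyReLU(0.2), nn.Linear(256, 1))          # critic stub
opt_G = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_D = torch.optim.RMSprop(D.parameters(), lr=5e-5)

for step in range(100):                                          # toy number of outer iterations
    for _ in range(n_critic):                                    # several critic updates per G update
        real = torch.rand(32, 1, 64, 64)                         # stands in for normal patches
        z = torch.randn(32, z_dim)
        fake = G(z).view(-1, 1, 64, 64)
        loss_D = -(D(real).mean() - D(fake.detach()).mean())     # maximize the Wasserstein estimate
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()
        for p in D.parameters():                                 # weight clipping enforces the Lipschitz constraint
            p.data.clamp_(-clip, clip)
    z = torch.randn(32, z_dim)
    loss_G = -D(G(z).view(-1, 1, 64, 64)).mean()                 # generator maximizes the critic score of fakes
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```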


AE Training

Encoder Training [1]


In the second stage, the weights of the WGAN are frozen and the AE is trained. For the training of the AE, the authors propose a novel izi_f training strategy based on the following loss function:

L_{izi_f}(x) = \frac{1}{n} ||x - G(E(x))||^2 + \frac{1}{n_d} ||f(x)-f(G(E(x)))||^2 \quad [1]

The strategy extends the "image to latent space to image" (izi) strategy. For the izi strategy, the AE learns to encode an image, which in turn is decoded by the trained G of the WGAN. The weights of the AE are then updated based on the residual loss between the query image x and the reconstruction G(E(x)).

L_R(x) = \frac{1}{n} ||x - G(E(x))||^2 \quad [1]

For the izi_f strategy, the authors additionally utilize the D of the WGAN. By feeding x and G(E(x)) to D, an additional discriminator loss is calculated from the richer feature representation of D's intermediate layers f(\cdot). Following the concept of feature matching, the authors thereby minimize not only the visual dissimilarity but also the statistical dissimilarity.

L_D(x) = \frac{1}{n_d} ||f(x)-f(G(E(x)))||^2 \quad [1]

The leading fractions of L_R(x) and L_D(x) are normalizing constants based on the image size and the dimensionality of the intermediate feature representation.
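
Put together, one izi_f encoder update could look like the following sketch, assuming a trained and frozen generator G and a frozen feature extractor f(·) taken from D; all module definitions, names, and sizes are placeholders:

```python
import torch
import torch.nn as nn

z_dim = 128
E = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, z_dim))                      # encoder to be trained
G = nn.Sequential(nn.Linear(z_dim, 64 * 64), nn.Tanh())                         # frozen generator stub
feat = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.LeakyReLU(0.2))  # frozen f(.) of D
for p in list(G.parameters()) + list(feat.parameters()):
    p.requires_grad_(False)                                                      # freeze the WGAN weights

opt_E = torch.optim.Adam(E.parameters(), lr=1e-4)
x = torch.rand(32, 1, 64, 64)                                                    # stands in for normal patches
x_rec = G(E(x)).view(-1, 1, 64, 64)                                              # image -> latent -> image

n = x[0].numel()                                                                 # image dimensionality
n_d = feat(x).shape[1]                                                           # feature dimensionality
loss_R = ((x - x_rec) ** 2).sum(dim=(1, 2, 3)).mean() / n                        # residual term L_R
loss_D = ((feat(x) - feat(x_rec)) ** 2).sum(dim=1).mean() / n_d                  # discriminator feature term L_D
loss = loss_R + loss_D                                                           # izi_f loss as defined above
opt_E.zero_grad(); loss.backward(); opt_E.step()                                 # only the encoder is updated
```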

Anomaly Detection

After the training of the WGAN and the AE is complete, a query image x can be tested for anomalies by following the same procedure as during the AE training.

The anomaly score of the query image is then calculated via:

A(x) = A_R(x) + \kappa \cdot A_D(x) \quad [1]

Here, A_R(x) = \frac{1}{n} ||x - G(E(x))||^2 [1] is the residual loss and A_D(x) = \frac{1}{n_d} ||f(x)-f(G(E(x)))||^2 [1] is the discriminative loss. Again, f(\cdot) is the feature-rich representation of the intermediate layers of D, while the leading fractions are normalizing constants based on the image size and the dimensionality of the intermediate feature representation.

A pixel-level anomaly localization is performed via the pixel-wise residuals \dot{A}_R(x) =||x - G(E(x))||^2 [1].
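
A minimal sketch of this scoring step, again with placeholder stubs for E, G and f(·) and an assumed value for \kappa, could look like this:

```python
import torch
import torch.nn as nn

z_dim, kappa = 128, 1.0
E = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, z_dim))                      # trained encoder stub
G = nn.Sequential(nn.Linear(z_dim, 64 * 64), nn.Tanh())                         # trained generator stub
feat = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.LeakyReLU(0.2))  # f(.) of the trained D

@torch.no_grad()
def anomaly_score(x: torch.Tensor):
    x_rec = G(E(x)).view_as(x)                                   # reconstruction G(E(x))
    n = x[0].numel()
    n_d = feat(x).shape[1]
    A_R = ((x - x_rec) ** 2).sum(dim=(1, 2, 3)) / n              # residual score per image
    A_D = ((feat(x) - feat(x_rec)) ** 2).sum(dim=1) / n_d        # discriminative score per image
    A = A_R + kappa * A_D                                        # combined anomaly score A(x)
    residual_map = (x - x_rec) ** 2                              # pixel-wise residuals for localization
    return A, residual_map

scores, maps = anomaly_score(torch.rand(4, 1, 64, 64))
print(scores.shape, maps.shape)                                  # torch.Size([4]) torch.Size([4, 1, 64, 64])
```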

Results

Generation

To evaluate the generative power of the model, the authors apply a Visual Turing Test: they present a set of generated and real images to two domain-specific experts and have them classify each image as real or generated.

The mean accuracy of the two experts turned out to be 44%; accordingly, the experts were wrong more often than right. The consensus between the classifications of both raters was 58%, meaning that they picked a different label for the same sample nearly as often as they picked the same one.

To further evaluate the generative power of the model and to test for mode collapse, the authors assessed the smoothness of the latent space Z by randomly selecting pairs of points in Z and sampling along the line connecting the two sampled z's.

Linear Interpolation from Z [1]
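
Such a latent interpolation could be generated along the following lines; the generator stub and the number of interpolation steps are illustrative assumptions:

```python
import torch
import torch.nn as nn

z_dim, steps = 128, 8
G = nn.Sequential(nn.Linear(z_dim, 64 * 64), nn.Tanh())          # trained generator stub
z0, z1 = torch.randn(z_dim), torch.randn(z_dim)                  # random point pair in Z

with torch.no_grad():
    alphas = torch.linspace(0.0, 1.0, steps).unsqueeze(1)        # interpolation weights along the line
    zs = (1 - alphas) * z0 + alphas * z1                         # points on the line connecting z0 and z1
    images = G(zs).view(steps, 1, 64, 64)                        # should change smoothly if Z is smooth
print(images.shape)                                              # torch.Size([8, 1, 64, 64])
```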


Detection

To evaluate the model's capability to detect anomalies, the authors compare its performance to that of, among others, an Adversarial Autoencoder (AdvAE), an Adversarially Learned Inference model (ALI), and an instance of AnoGAN.

Qualitatively

Samples from the Evaluation of the Different Models [1]

The image depicts three main columns. The first depicts patches from the training set, which are accordingly free of anomalies. The second column depicts patches of healthy samples from the test set. The third column shows patches from diseased samples and therefore contains both normal and anomalous patches. The first row shows the original samples and the second the corresponding ground-truth markers. The following rows show, for each model, the generated images and then the generated images overlaid with the model's anomaly localization.

Based on the visualization above, the authors deduce that the AdvAE has a good general anomaly localization performance but performs poorly on pixel-level anomaly detection. Further, the model shows more false positives on normal images.

The Adversarially Learned Inference model (ALI) generates compellingly realistic images. However, deviations in pixel-level appearance lead to over-segmentation.

By comparison, f-AnoGAN shows the best image-level anomaly detection and the best pixel-level anomaly localization, according to the authors.

Quantitatively

The performance gain becomes clearer when looking at the receiver operating characteristic curve (ROC-Curve).


ROC-Curve of the Models Performances [1]



The left graph shows that f-AnoGAN outperforms all the other approaches, based on the Area Under Curve (AUC). The second-best approach, based on AUC, is the iterative approach of the original AnoGAN paper. The authors once again emphasize the computational performance gain of f-AnoGAN in detecting anomalies. 

The right graph of the image above compares different AE training strategies showing the superiority of the novel izi_f strategy proposed by the authors.
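
The image-level ROC curves and AUC values of such a comparison can in principle be computed from the anomaly scores A(x) and the image-level labels, for example with scikit-learn; the scores and labels below are dummy values, not the paper's results:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

labels = np.array([0, 0, 0, 1, 1, 1])                 # 0 = normal patch, 1 = anomalous patch
scores = np.array([0.1, 0.2, 0.15, 0.7, 0.4, 0.9])    # anomaly scores A(x)

fpr, tpr, thresholds = roc_curve(labels, scores)      # points of the ROC curve
print("AUC:", roc_auc_score(labels, scores))          # area under the ROC curve
```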

Conclusion and Discussion

Authors' Conclusion

The authors conclude that, based on the Visual Turing Test, their model captures the anatomical variability within SD-OCT volumes and generates compelling images that fool even experts. However, the authors also admit that said experts are used to rating full-width OCT slices, not 64 × 64 pixel patches. Further, the authors point out that the calculated detection accuracy depends on the annotations and that apparent false positives might actually be true anomalies.

Overall the authors conclude that their approach shows very good detection performance. However, the localization of the anomalies is rather coarse and serves as a hint for further assessment.

General Discussion

The discussion surrounding AnoGAN and f-AnoGAN criticizes that the approaches are technically not purely unsupervised due to their dependency on anomaly-free data during training [4].

AnoGAN in particular gets criticized for its dependency on the accurate reconstructions by a DCGAN, which is known to suffer from mode collapse [4]. The authors of ADGAN propose to at least ”account for the non-convexity of the underlying optimization by seeding from multiple areas in the latent space” [5] when using AnoGAN.

f-AnoGAN is criticized for not being end-to-end trainable [6] and that the WGAN and AE should be trained jointly to derive a smoother representation of the latent space [4]. Additionally, it is repeatedly pointed out that contaminated training data reduces the performance which is consistent with the first criticism [7].

Student's View

The paper is well structured, and the references and visualizations are informative. Overall, the paper is easy to follow.

Methodology-wise, I believe that the switch from a DCGAN to a WGAN was an important advance. From my own experience, a DCGAN is extremely difficult to train, and the idea of achieving a latent space that is smooth enough to be effectively navigated via gradient descent is mind-boggling. Achieving the same with a WGAN seems far more realistic and reproducible.

Further, I question the informative value of the Visual Turing Test for the evaluation of the generative power. Generally, the concept of the Visual Turing Test is convincing, and I am aware of the challenge of qualitatively evaluating the performance of generative models. However, for this particular problem, I got the impression that a 64 × 64 pixel patch differs too much from a high-resolution SD-OCT slice for the test to be of great value.

Additionally, I find it hard to believe, considering the vast advancement in the field, that for the evaluation the authors could not find a model that performs better than their original, two-year-old AnoGAN approach (see "Iterative" in the left ROC-curve plot). To me, this suggests that the chosen approaches, like AdvAE and ALI, were either not configured and trained sufficiently or are not methodologically sound for the given task.

At this point, it would have been interesting to see f-AnoGAN's performance compared to one of the formerly mentioned supervised approaches.

Finally, I share the opinion that AnoGAN and f-AnoGAN are not truly unsupervised. Contaminated training data would diminish the detective power. Accordingly, pre-filtering of the training data seems a necessity, which in turn contradicts the claim of being unsupervised.

Additional Read: AnoGAN

AnoGAN

The AnoGAN paper presents a model based on a Deep Convolutional GAN (DCGAN) that learns the manifold of anatomical variability within SD-OCT volumes (see section Data) of solely normal appearance [2].

The novelty of the paper comprises two things:

First, the loss function is extended by a discriminative loss L_D(z_\gamma) = \sum | f(x) - f(G(z_\gamma))| [2], in addition to the classical residual loss L_R(z_\gamma) = \sum | x - G(z_\gamma)| [2].

L(z_\gamma) = (1-\lambda)\cdot L_R(z_\gamma)+\lambda\cdot L_D(z_\gamma) \quad [2]

The discriminative loss evaluates the statistical similarity of the generated and the query image by comparing the feature-rich data representations of the intermediate layers f(\cdot) of the Discriminator (D) of the DCGAN, following the concept of feature matching. Importantly, L(z_\gamma) provides a gradient for the update of the coefficients of z within the latent space Z.

Secondly, the authors propose to search Z via gradient descent to locate the z whose generated image G(z) is visually most similar to the query image [2]. For this to work, the authors assume that Z has smooth transitions.

For the anomaly detection, a random z is sampled from Z and the corresponding G(z) is generated. Then the visual and statistical similarity of the generated image to the query image is computed using the novel loss function. Based on the loss, the location of z is updated via backpropagation. This iterative z-mapping is repeated until it converges to the z that generates the image G(z) most similar to the query image [2].
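
A minimal sketch of this iterative z-mapping, with placeholder stubs for G and f(·) and illustrative values for \lambda, the step count, and the learning rate, could look like this:

```python
import torch
import torch.nn as nn

z_dim, lambda_, steps = 128, 0.1, 50
G = nn.Sequential(nn.Linear(z_dim, 64 * 64), nn.Tanh())                         # trained generator stub
feat = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.LeakyReLU(0.2))  # f(.) of the trained D
for p in list(G.parameters()) + list(feat.parameters()):
    p.requires_grad_(False)                                       # only z is optimized, not the networks

x = torch.rand(1, 1, 64, 64)                                      # query image
z = torch.randn(1, z_dim, requires_grad=True)                     # random starting point in Z
opt = torch.optim.SGD([z], lr=0.1)

for _ in range(steps):
    x_gen = G(z).view_as(x)
    loss_R = (x - x_gen).abs().sum()                              # residual loss L_R(z)
    loss_D = (feat(x) - feat(x_gen)).abs().sum()                  # discriminative loss L_D(z)
    loss = (1 - lambda_) * loss_R + lambda_ * loss_D              # combined loss L(z)
    opt.zero_grad(); loss.backward(); opt.step()                  # gradient step moves z towards the best match
```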

AnoGAN vs f-AnoGAN

The authors transitioned from a DCGAN to a WGAN and added an AE. The addition of the AE allows a mapping from image space to latent space, while the WGAN is more stable during training and leads to a latent space with smoother transitions.

For the detection of anomalies, f-AnoGAN now uses a learned mapping instead of AnoGAN's gradient descent based iterative z-mapping.

In comparison, f-AnoGAN is considerably faster at detecting anomalies, it is more flexible as it can, e.g., use different strategies for the AE training, and it achieves better detection values, as shown in the Results section above.

References

[1] Schlegl, T., P. Seeböck, S. M. Waldstein, G. Langs, et al. (2019). “f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks”.
In: Medical Image Analysis 54, pp. 30–44. issn: 13618423. doi: 10.1016/j.media.2019.01.010.

[2] Schlegl, T., P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, et al. (2017). “Unsupervised anomaly detection with generative adversarial networks to guide marker discovery”.
In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 10265 LNCS, pp. 146–147. issn: 16113349. doi: 10.1007/978-3-319-59050-9_12. arXiv: 1703.05921

[3] Arjovsky, M., S. Chintala, and L. Bottou (2017). “Wasserstein GAN”. In: arXiv: 1701.07875. url: http://arxiv.org/abs/1701.07875.

[4] Berg, A., J. Ahlberg, and M. Felsberg (2019). “Unsupervised Learning of Anomaly Detection from Contaminated Image Data using Simultaneous Encoder Training”. In: arXiv: 1905.11034. url: http://arxiv.org/abs/1905.11034.

[5] Deecke, L. et al. (2019). “Image anomaly detection with generative adversarial networks”. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 11051 LNAI, pp. 3–17. issn: 16113349. doi: 10.1007/978-3-030-10925-7_1.

[6] Zhou, K. et al. (n.d.). “Sparse-GAN: Sparsity-constrained Generative Adversarial Network for Anomaly Detection in Retinal OCT Image”. In: arXiv: 1911.12527. url: http://arxiv.org/abs/1911.12527.

[7] Beggel, L., M. Pfeiffer, and B. Bischl (2020). “Robust Anomaly Detection in Images Using Adversarial Autoencoders”. In: pp. 206–222. issn: 16113349. doi: 10.1007/978-3-030-46150-8_13. arXiv: 1901.06355.
