This blog post was written by Mykhailo Kulakov for the paper 'Contrastive Training for Improved Out-of-Distribution Detection' by Jim Winkens, Rudy Bunel, Abhijit Guha Roy, Robert Stanforth, Vivek Natarajan, Joseph R. Ledsam, Patricia MacWilliams, Pushmeet Kohli, Alan Karthikesalingam, Simon Kohl, Taylan Cemgil, S. M. Ali Eslami and Olaf Ronneberger.
Introduction
Deep neural networks have demonstrated significant improvements in solving classification tasks over the last decade, but they still struggle to make meaningful predictions for data from unseen distributions. In medical imaging and other safety-critical domains, the presence of out-of-distribution (OOD) samples can lead to incorrect predictions due to inconsistent data. A human expert could sort out OOD samples from the training cohort, but this would require manual effort. This problem has motivated the research community to find a reliable and scalable solution that allows automatic detection and removal of OOD samples from training data.
The main idea of modern OOD detection techniques [3, 4, 6, 7] is to calculate a scalar score s(z) from the activation values z taken from multiple hidden layers of the network, and to detect OOD samples by comparing the scores of samples drawn from the inlier and outlier distributions. The quality of such a process depends on the quality of the hidden feature space, i.e., how well the model is able to capture the semantic differences of the training data objects (e.g., pose, shape, texture) and to detect variations in the imaging process (e.g., lighting and camera positions).
The majority of modern OOD detection approaches [3, 4, 6] use supervised learning models to capture the semantic differences of the training data, but only to the extent sufficient to assign the data the correct class label. In contrast, the authors of this paper propose to use a contrastive learning model based on the recent SimCLR [1] approach shown in Figure 1. The key idea of contrastive training is to "learn visual representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space" [1]. Learning visual representations in addition to supervised learning allows the model to capture both semantic differences and differences in the imaging process, even between samples from the same class, which leads to better OOD detection. No samples from outlier distributions are used during model training.
Figure 1: SimCLR contrastive learning approach. The input x is transformed using augmentations t and t', which are randomly selected from the family of augmentations \mathcal{T}. Then, the two augmented inputs \tilde{x}_i and \tilde{x}_j are fed into the encoder network f(\cdot) and afterwards into the projection head (a 2-layer MLP) g(\cdot). Finally, the contrastive loss function is used to maximize the agreement between the latent input representations z_i and z_j [1].
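To make the idea of "differently augmented views" concrete, a possible augmentation pipeline in the spirit of SimCLR is sketched below; the specific transforms and parameters are illustrative assumptions rather than the exact ones used in the paper.

```python
# Hypothetical SimCLR-style augmentation pipeline producing two views of one image.
import torchvision.transforms as T

simclr_augment = T.Compose([
    T.RandomResizedCrop(32),                                      # random crop-and-resize
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),   # color distortion
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])

# x is a PIL image; two independent draws give the two views fed to the encoder:
# x_i, x_j = simclr_augment(x), simclr_augment(x)
```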
To demonstrate the advantage of the contrastive learning approach compared to a fully supervised model, a toy example is shown in Figure 2. The task is to classify inputs with only 2 features (x_1, x_2). As can be seen, it is sufficient to learn only the input feature x_1 to perfectly distinguish between the 2 classes. With supervised learning alone, the model ignores the presence of the second input feature x_2, which leads to a complete inability to detect OOD samples. However, the supervised model with contrastive learning takes both the x_1 and x_2 features into account, resulting in a perfect separation of out-of-distribution samples.
Figure 2: Comparison of the model with and without contrastive training. z is a penultimate (second-to-last) activation value, s(z) = log p(z) is the OOD score obtained from a trained network. Left: Supervised training without contrastive training fails to learn features that are unnecessary for the classification task but required for OOD detection. Right: The supervised model with contrastive training makes z sensitive to both input features.
Another challenging problem is to measure the 'similarity' (distance) between the inlier and outlier distributions. Based on this measure we can distinguish near and far OOD samples. From a medical imaging perspective, near OOD means that a model which works with medical pathologies should still provide reliable outputs for atypical combinations of pathologies. Far OOD samples usually occur by accident (e.g., a broken sensor) and should be easy to detect for both human experts and the proposed model. To measure the OOD similarity of inlier/outlier pairs, the authors propose a new metric called Confusion Log Probability (CLP), described below.
Methodology
Architecture
The proposed architecture (Figure 3) is similar to the SimCLR approach introduced above (Figure 1). First, two different augmentations (T^0 and T^1), drawn from the pool of transformations \mathcal{T} (e.g., crop-resize or color distortions), are applied to the input x_i. Next, the augmented inputs x^0_i and x^1_i are fed into an encoder network f_\theta to produce the input representations z^0_i = f_\theta(x^0_i) and z^1_i = f_\theta(x^1_i) in latent space, i.e., z^0_i and z^1_i are penultimate activations of the encoder network f_\theta. The obtained representations are fed into two projection heads: g_\phi maps a representation to the class predictions, while h_v maps the same representation to the low-dimensional embeddings \hat{z}^0_i = h_v(z^0_i) and \hat{z}^1_i = h_v(z^1_i) used for the contrastive loss.
Figure 3: The architecture. x_i, x_j - training samples, T - image transformation (cropping, brightness etc.), f_\theta - wide ResNet-50 encoder network, \textbf{z} - image representation in latent space, g_\phi - linear model to project image representation to k classes, h_v - 2-layer MLP network to project representation to a lower-dimensional space.
The image representation is learned by applying a cosine similarity function sim(\textbf{u}, \textbf{w}) = \textbf{u}^\top\textbf{w} / (\lVert\textbf{u}\rVert\lVert\textbf{w}\rVert), with the aim of maximizing the similarity of image transformations originating from the same input (i.e., sim(\hat{\textbf{z}}^0_i, \hat{\textbf{z}}^1_i) \to 1) and minimizing it for all other pairs of image representations (i.e., sim(\hat{\textbf{z}}^a_i, \hat{\textbf{z}}^b_j) \to 0, with j \in \{1, ..., N\} \setminus i and a \in \{0, 1\}). Hence, following SimCLR [1], the contrastive loss for sample i is defined as follows:

L^{(i)}_{con} = -\sum_{a \in \{0,1\}} \log \frac{\exp(sim(\hat{\textbf{z}}^a_i, \hat{\textbf{z}}^{1-a}_i)/\tau)}{\sum_{j=1}^{N}\sum_{b \in \{0,1\}} \mathbb{1}_{[(j,b) \neq (i,a)]} \exp(sim(\hat{\textbf{z}}^a_i, \hat{\textbf{z}}^b_j)/\tau)}

where \tau is a scaling temperature parameter and \mathbb{1} is an indicator that excludes a view's similarity with itself from the denominator.
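A minimal sketch of how this loss can be computed for a batch of N samples with two augmented views each is shown below; this is an assumed implementation written for clarity, not the authors' code, and the temperature value is illustrative.

```python
# Sketch of the SimCLR-style contrastive (NT-Xent) loss described above.
import torch
import torch.nn.functional as F

def contrastive_loss(z0, z1, tau=0.1):
    # z0, z1: (N, d) embeddings of the two views, e.g. h_v(f_theta(x^0)), h_v(f_theta(x^1))
    n = z0.size(0)
    z = F.normalize(torch.cat([z0, z1], dim=0), dim=1)     # unit norm -> dot product = cosine similarity
    sim = z @ z.t() / tau                                   # (2N, 2N) similarity matrix
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))              # exclude each view's similarity with itself
    # the positive for view a of sample i is view 1-a of the same sample i
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)                    # averaged over all 2N views
```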
The model is trained in two stages: first, it is trained only with L_{con} for a large number of epochs to explicitly learn the internal representations of the images; second, it is optimized using the combined loss L_{con} + \lambda L_{class} to learn the classification task.
Density estimation
The detection of OOD samples is implemented similarly to the method introduced by Lee et al. [6], who showed that the softmax classifier

P(y = c \mid \textbf{x}) = \frac{\exp(\textbf{w}_c^\top\textbf{z} + b_c)}{\sum_{c'}\exp(\textbf{w}_{c'}^\top\textbf{z} + b_{c'})}

(where \textbf{w}_c and b_c are the weights and bias of the discriminative classifier, respectively) is equivalent to the posterior distribution defined by Gaussian Discriminant Analysis (GDA) (see the detailed explanation in Appendix A). Based on this property, a density estimate for each sample can be computed by fitting Gaussian distributions to the model activations \textbf{z} = f_\theta(\textbf{x}). To achieve better OOD detection results, two properties are applied:
- the above-mentioned contrastive loss encourages the model to learn semantic differences not only between images from different classes but also between inputs from the same class, i.e., with contrastive learning the model takes into account all the input features and not only the features required to solve the classification task;
- label smoothing is added to the cross-entropy loss L_{class}; it is a regularization technique that prevents model overconfidence, i.e., it keeps the model logits from growing without bound by replacing the one-hot label vector y_{hot} with a mixture of y_{hot} and a uniform distribution, controlled by a smoothing parameter \alpha: y_{ls} = (1-\alpha)y_{hot} + \frac{\alpha}{K}, where K is the number of classes.
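As a small illustration of the label-smoothing step (the value of \alpha and the number of classes below are illustrative, not taken from the paper):

```python
# Hypothetical label-smoothing helper mixing one-hot labels with a uniform distribution.
import numpy as np

def smooth_labels(y, num_classes, alpha=0.1):
    y_hot = np.eye(num_classes)[y]                        # one-hot encode integer class labels
    return (1.0 - alpha) * y_hot + alpha / num_classes    # y_ls = (1 - alpha) * y_hot + alpha / K

print(smooth_labels(np.array([2]), num_classes=5, alpha=0.1))
# [[0.02 0.02 0.92 0.02 0.02]]
```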
For each class c the authors estimate an n-dimensional multivariate Gaussian \mathcal{N}(\bm{\mu}_c, \bm{\Sigma}_c), where n = dim(\textbf{z}). The highest score over all class-conditional Gaussians is taken as the OOD score s(\textbf{x}), computed as follows:

s(\textbf{x}) = \max_c \log \mathcal{N}(\textbf{z} \mid \bm{\mu}_c, \bm{\Sigma}_c)

where \bm{\mu}_c, \bm{\Sigma}_c are obtained empirically using standard estimators. A high OOD score s(\textbf{x}) means that the sample representation \textbf{z} in embedding space is similar to one of the inlier distributions used for training. A low s(\textbf{x}) score means that the input sample is far from all inlier distributions and is likely to be an OOD example.
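A rough sketch of this density-estimation step is given below; it is an assumed implementation for illustration (in particular, the small diagonal term added to the covariance for numerical stability is my addition, and fitting full covariances in a 6144-D space is costly in practice).

```python
# Sketch: fit class-conditional Gaussians to penultimate activations and score samples.
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussians(Z_train, y_train, num_classes):
    # standard empirical mean/covariance estimators per class
    return [(Z_train[y_train == c].mean(axis=0),
             np.cov(Z_train[y_train == c].T) + 1e-6 * np.eye(Z_train.shape[1]))
            for c in range(num_classes)]

def ood_score(z, gaussians):
    # s(x) = max_c log N(z | mu_c, Sigma_c); low values indicate likely OOD samples
    return max(multivariate_normal.logpdf(z, mu, cov) for mu, cov in gaussians)
```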
Confusion Log Probability (CLP) as a Measure of Dataset Distance
Current benchmarks only report the area under the receiver operating characteristic (AUROC) curve, a metric that shows how well the model distinguishes samples from the inlier and outlier distributions. But this metric cannot distinguish near from far OOD samples. To solve this problem, the authors propose a novel metric called Confusion Log Probability (CLP), which measures the 'similarity' of a given sample based on the probability with which a classifier confuses the outlier with inlier samples. An ensemble of N_e independent classifiers \{\hat{p}^j\}^{N_e}_{j=1} (in this paper, 5 ResNet-34 models) is trained on the joint dataset \mathcal{D} = \mathcal{D}_{in} \cup \mathcal{D}_{out} with labels \mathcal{C} = \mathcal{C}_{in} \cup \mathcal{C}_{out}. The expected probability for a single unseen test sample \textbf{x} to belong to class k can be calculated as follows:

c_k(\textbf{x}) = \frac{1}{N_e}\sum_{j=1}^{N_e}\hat{p}^j(y = k \mid \textbf{x})

So we basically find the average probability that a given sample \textbf{x} belongs to class k. The generalized formula for all \mathcal{C}_{in} classes and the whole \mathcal{D}_{test} dataset, i.e., the confusion log probability (CLP) of \mathcal{D}_{test}, looks as follows:

CLP_{\mathcal{C}_{in}}(\mathcal{D}_{test}) = \log\left(\frac{1}{|\mathcal{D}_{test}|}\sum_{\textbf{x} \in \mathcal{D}_{test}}\sum_{k \in \mathcal{C}_{in}} c_k(\textbf{x})\right)

If CLP is low, the sample is far OOD; if CLP is high, the sample is near OOD. CLP can also be computed class-wise to indicate how near or far the samples of each \mathcal{D}_{out} class are with respect to the inlier distributions of \mathcal{D}_{in} (Figure 4).
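The computation reduces to a few lines; the sketch below is an assumed implementation that takes the per-model softmax outputs as input.

```python
# Sketch of CLP: average the ensemble's softmax outputs and take the log of the
# total probability mass assigned to the inlier classes, averaged over the test set.
import numpy as np

def confusion_log_probability(probs_per_model, inlier_class_ids):
    # probs_per_model: list of arrays of shape (num_test, num_classes), one per ensemble member
    c_k = np.mean(np.stack(probs_per_model), axis=0)        # expected class probabilities c_k(x)
    inlier_mass = c_k[:, inlier_class_ids].sum(axis=1)      # probability of being confused with C_in
    return np.log(inlier_mass.mean())                       # low CLP -> far OOD, high CLP -> near OOD
```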
Figure 4: Confusion Log Probability (CLP). Each point on the chart shows the AUROC detection probability for each class of the CIFAR-100 [5] dataset (\mathcal{D}_{out}) when the model ensemble was trained on the inlier CIFAR-10 [5] data (\mathcal{D}_{in}). As can be seen, the model produces less confident CLP scores for near OOD samples, i.e., for CIFAR-100 classes which are semantically close to CIFAR-10 classes. For instance, CIFAR-10 has classes similar to leopard (e.g., cat, dog) but none similar to oak tree or orange.
Experiments
Setup
The following datasets were used to explore and test the OOD detection:
- CIFAR-10 [5]
- CIFAR-100 [5]
- Street View House Numbers (SVHN) [8]
It is worth mentioning that the classes of CIFAR-10 and CIFAR-100 are mutually exclusive.
The area under the receiver operating characteristic (AUROC) curve was chosen as the main evaluation metric for convenient comparison with other papers' results. An additional metric called OOD rank was introduced to compare the OOD score s(\textbf{x}) of an outlier sample \textbf{x} with the OOD scores of inlier samples. The OOD rank is the percentage of inlier test samples whose s(\textbf{x}) scores are lower (meaning they are further away from the inlier distribution) than the score of an OOD sample.
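Both quantities are straightforward to compute from the scores; the sketch below is an assumed implementation using scikit-learn for the AUROC part.

```python
# Sketch of the two evaluation quantities: AUROC over inlier/outlier scores and
# the OOD rank of a single outlier sample.
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc(scores_in, scores_out):
    labels = np.r_[np.ones_like(scores_in), np.zeros_like(scores_out)]  # 1 = inlier
    return roc_auc_score(labels, np.r_[scores_in, scores_out])

def ood_rank(score_x, scores_in):
    # percentage of inlier test samples whose OOD score is lower than s(x)
    return 100.0 * np.mean(scores_in < score_x)
```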
The authors use a wide ResNet-50 as the encoder network f_\theta, which produces a fixed penultimate activation of size 6144-D, i.e., input representations \textbf{z} \in \mathbb{R}^{6144}. Given an output vector \textbf{z}, the OOD score s(\textbf{x}) can be computed for an input sample \textbf{x}. Next, \textbf{z} is fed into the supervised linear projection head g_{\phi}, which benefits from label smoothing to get tighter output class clusters for the inlier distributions, and, in parallel, into the contrastive head h_v, which transforms the input representation \textbf{z} to a lower (128-D) dimension using batch normalization and ReLU activation.
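A rough sketch of the two heads on top of the encoder is shown below; the hidden width of the contrastive head is an assumption, only the 6144-D input and 128-D output follow from the text.

```python
# Hypothetical definition of the two projection heads described above (PyTorch).
import torch.nn as nn

num_classes, feat_dim = 10, 6144                  # e.g. CIFAR-10 classes, wide ResNet-50 features
g_phi = nn.Linear(feat_dim, num_classes)          # supervised (classification) head
h_v = nn.Sequential(                              # contrastive projection head -> 128-D embedding
    nn.Linear(feat_dim, 2048),
    nn.BatchNorm1d(2048),
    nn.ReLU(),
    nn.Linear(2048, 128),
)
```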
Results
The main results are shown in Table 1. The proposed approach achieves state-of-the-art results in all three setups if we exclude methods that use data explicitly labeled as OOD for training, since such labeling would require manual effort from a human expert. Accordingly, state-of-the-art average performance is achieved on all three dataset pairs, even outperforming the previous state-of-the-art result [12]. Moreover, Figure 5a demonstrates that the contrastive loss indeed helps significantly in the identification of near OOD samples, while the baseline (supervised) approach performs worse than random. For the far OOD setting the same tendency can be observed: even though the baseline model performs better here, the contrastive model is still significantly better at detecting far OOD samples (Figure 5b). Figure 5c confirms that contrastive learning outperforms the baseline model by 18 AUROC points on the most challenging task, the detection of near OOD samples, and approximately the same performance gap remains as we shift towards far OOD samples.
Figure 5: Comparison of the approaches with and without contrastive training (CT). \mathcal{D}_{in} is the CIFAR-10 dataset and \mathcal{D}_{out} is CIFAR-100. Histograms of OOD scores s(\textbf{x}) with respect to the inlier dataset are shown for (a) a near OOD class (leopard) and (b) a far OOD class (oak tree) taken from the CIFAR-100 dataset. Chart (c) demonstrates the performance difference between the baseline (supervised) and the CT approach across the whole CLP spectrum.
Ablation Study
In addition, an ablation study was conducted to explore the effect of the proposed label smoothing and contrastive learning properties, i.e., which of them has the larger impact on the OOD detection results. It can be seen from Table 2 that applying a single technique, either label smoothing (LS) or contrastive training (CT), already yields a substantial improvement in AUROC as well as in OOD rank (lower is better because then the probability of a truly OOD sample is higher). But the best results are obtained by applying both LS and CT. Label smoothing results in tighter class clusters, and in combination with contrastive training the model can learn semantically richer features. As a result, better OOD detection is achieved.
Failure mode analysis
The authors demonstrate the worst failure cases for both the baseline and contrastive training (Figure 6). As can be seen, for most baseline failures (Figure 6a) contrastive training was able to detect those samples as outliers, i.e., the percentile rank was >50% (higher is better because then the sample has a lower OOD score, i.e., a higher probability of being an OOD sample). Figure 6b demonstrates that the CT model was trained with the classes 'automobile' and 'truck' and, understandably, could not flag some samples from the new classes 'pickup truck' and 'bus' as OOD samples because they are too similar (near) to the inlier distributions.
Figure 6: Failure mode analysis. \mathcal{D}_{in} is CIFAR-10 and \mathcal{D}_{out} is CIFAR-100. BL: baseline, CT: contrastive training. The percentages represent the percentile rank of the OOD score. The text in brackets indicates the most probable inlier class activated by the Gaussian density estimation. Green text means that the CT method successfully detected a sample as an outlier where the baseline method did not.
Discussion and Conclusion
The authors have proposed a novel approach to detecting out-of-distribution samples using contrastive training. Additionally, a new metric, the confusion log probability (CLP), was introduced to capture the differences between test samples from the inlier and outlier distributions. Numerous experiments confirm that contrastive training leads to better OOD detection because the contrastive loss pushes image representations apart even within the same class while, at the same time, supervised training guarantees a decent clustering of samples by class. Another advantage compared to prior methods is that no additional OOD-labeled data is required, i.e., no manual labeling by a human expert is necessary.
Using the fitted Gaussian distributions to estimate the density of unseen test samples allows the extraction of useful information from unlabelled images coming from arbitrary distributions. This can be beneficial, for instance, for OOD detection in medical imaging, where a lot of unlabelled data is available and can be used to better distinguish between inlier and outlier distributions. The especially challenging part is detecting near OOD samples, but this is a crucial task because in high-risk domains (e.g., medical imaging, self-driving cars) correct OOD detection is essential to avoid catastrophic consequences caused by a model error.
Student Review
This paper was submitted to the NeurIPS 2020 conference. It is very well structured, written in a clear way, and does not contain any errors or inconsistencies. The quite detailed introduction provides many OOD domain-specific details, which are necessary to understand the content of the paper, and it can be followed by a non-expert reader. The main novelty of the paper is that the authors looked at the problem of OOD detection from a different perspective and adapted a model from the self-supervised domain to learn visual representations of images with contrastive training. As a result, new state-of-the-art results were achieved, and the authors proposed a novel metric to measure the 'similarity' between inlier and outlier samples, which allowed the introduction of two groups of OOD samples, namely near and far OOD.
However, I would like to mention a few minor remarks:
- The section on density estimation is not described clearly, and the reader has to read another paper [6] to get a basic understanding of what is meant by the density estimation approach. A short mathematical derivation and some intuition could be added, for instance, in an appendix.
- The authors mention several times that the proposed solution is scalable and applicable to medical imaging, but no experiments were performed to verify that; only results on non-medical benchmark datasets (e.g., CIFAR-10, CIFAR-100) were demonstrated.
- Also, the authors demonstrate that the proposed solution is effective for classification tasks, but in medical imaging a segmentation task is often more crucial, e.g., to detect cancer regions or pathologies. Unfortunately, no arguments are provided on how the proposed approach could be adapted for segmentation.
References
[1] Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 2020.
[2] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[3] Hendrycks, D., Mazeika, M., and Dietterich, T. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations, 2019.
[4] Hendrycks, D., Mazeika, M., Kadavath, S., and Song, D. Using self-supervised learning can improve model robustness and uncertainty. In Advances in Neural Information Processing Systems, pp. 15637–15648, 2019.
[5] Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.
[6] Lee, K., Lee, K., Lee, H., and Shin, J. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pp. 7167–7177, 2018.
[7] Liang, S., Li, Y., and Srikant, R. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations, 2018.
[8] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In Advances in Neural Information Processing Systems, 2011.
Appendix A
In this section, a brief derivation is provided of why density estimation using Gaussian distributions is possible. From the discriminative classifier perspective, a classification task can be defined using a posterior distribution P(y|\textbf{x}), where \textbf{x} is a random input variable and y its label. A popular choice for categorical classification tasks is to use a softmax classifier at the output layer:

P(y = c \mid \textbf{x}) = \frac{\exp(\textbf{w}_c^\top\textbf{x} + b_c)}{\sum_{c'}\exp(\textbf{w}_{c'}^\top\textbf{x} + b_{c'})}

where \textbf{w}_c and b_c are the weights and bias for class c, respectively.

On the other hand, from the generative classifier perspective, a posterior distribution P(y|\textbf{x}) can be indirectly defined as P(y|\textbf{x}) = \frac{P(\textbf{x}, y)}{P(\textbf{x})}, where P(\textbf{x}, y) = P(y)P(\textbf{x}|y) is the joint distribution. One of the popular choices for a generative classifier is Gaussian discriminant analysis (GDA), in which the class-conditional distribution follows a multivariate Gaussian and the class prior is categorical:

P(\textbf{x} \mid y = c) = \mathcal{N}(\textbf{x} \mid \bm{\mu}_c, \bm{\Sigma}_c), \qquad P(y = c) = \frac{\beta_c}{\sum_{c'}\beta_{c'}}

where \bm{\mu}_c, \bm{\Sigma}_c are the mean and covariance of the multivariate Gaussian distribution, and \beta_c is the unnormalized prior for class c. To recover precisely the softmax expression, a special case of GDA, Linear Discriminant Analysis (LDA), should be considered. In LDA we assume that all classes share the same covariance matrix, i.e., \bm{\Sigma}_c = \bm{\Sigma}. Applying Bayes' rule and dropping the terms that do not depend on c, the following expression can be derived:

P(y = c \mid \textbf{x}) = \frac{\exp(\bm{\mu}_c^\top\bm{\Sigma}^{-1}\textbf{x} - \frac{1}{2}\bm{\mu}_c^\top\bm{\Sigma}^{-1}\bm{\mu}_c + \log{\beta_c})}{\sum_{c'}\exp(\bm{\mu}_{c'}^\top\bm{\Sigma}^{-1}\textbf{x} - \frac{1}{2}\bm{\mu}_{c'}^\top\bm{\Sigma}^{-1}\bm{\mu}_{c'} + \log{\beta_{c'}})}

It can be easily seen that this expression is equivalent to the softmax classifier if we substitute \textbf{w}_c^\top = \bm{\mu}^\top_c\bm{\Sigma}^{-1} and b_c = -\frac{1}{2}\bm{\mu}^{\top}_c\bm{\Sigma}^{-1}\bm{\mu}_c + \log{\beta_c}. This leads to the conclusion that \textbf{x} can be fitted with class-conditional Gaussian distributions when the model is trained with a softmax classifier.
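A small numerical sanity check of this equivalence can be written in a few lines; the toy data and dimensions below are purely illustrative and not from the paper.

```python
# Toy check: the LDA/GDA posterior (Bayes' rule with shared covariance) equals the
# softmax with w_c^T = mu_c^T Sigma^{-1} and b_c = -0.5 mu_c^T Sigma^{-1} mu_c + log beta_c.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
K, d, n = 3, 2, 300
X = np.concatenate([rng.normal(rng.normal(size=d) * 3, 1.0, size=(n, d)) for _ in range(K)])
y = np.repeat(np.arange(K), n)

mu = np.stack([X[y == c].mean(axis=0) for c in range(K)])                                 # class means
Sigma = sum(np.cov(X[y == c].T, bias=True) * (y == c).sum() for c in range(K)) / len(X)   # shared covariance
beta = np.array([(y == c).mean() for c in range(K)])                                      # class priors

x = X[0]
lik = np.array([multivariate_normal.pdf(x, mu[c], Sigma) for c in range(K)])
post_gda = lik * beta / (lik * beta).sum()                                                # generative posterior

Sinv = np.linalg.inv(Sigma)
w = mu @ Sinv                                                  # rows: w_c^T = mu_c^T Sigma^{-1}
b = -0.5 * np.einsum('cd,de,ce->c', mu, Sinv, mu) + np.log(beta)
logits = w @ x + b
post_softmax = np.exp(logits - logits.max())
post_softmax /= post_softmax.sum()

assert np.allclose(post_gda, post_softmax)                     # the two posteriors match
```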