This blogpost is on Visual Explanation by Interpretation: Improving Visual Feedback Capabilities of Deep Neural Networks, a paper by José Oramas M., Kaili Wang and Tinne Tuytelaars.

Introduction

It is interesting that, despite all the advances in the field of deep learning, we are still far away from understanding exactly what is going on inside neural networks. This is not surprising: a highly non-linear convolutional neural network with skip connections, max pooling and dropout is admittedly more complex than a linear regression model, where you could simply look at the coefficients and immediately tell which variables were the most important. For many people, a DNN (deep neural network) is more or less a magic black box: only the final performance matters and there is no need to bother with the inner mechanisms. Well, they are wrong.

A DNN may very well be looking at the wrong parts of the image and still be making the right decisions by taking advantage of some correlations inside the dataset. Have a look at the following images, for example:


Figure 1: DNN fails big time [1]


All the boat images in the dataset that this DNN was trained on showed boats on water. Apparently, the DNN also associated water with the “boat” class and used the “water” information as a helping hand when deciding whether an image was a boat image. Confronted with images of boats outside the water, it failed miserably. It also hallucinated a boat when given an image of the sea. If you look at the rightmost image, you can see that it was the presence of water that led the network to decide that it was a boat.

Now, that was rather unimportant. Misclassifying a boat lying outside the water is probably not a matter of life and death. However, we cannot say the same for a similar scenario in medicine or in a self-driving car, such as misclassifying an X-ray or failing to detect a kid crossing the street.

This might have been too dramatic for an introduction, but the point I want to make is that if you really want your DNN to be robust and generalise well to unseen, different data, then you should make sure that the network is focusing exactly where it is supposed to focus.

Related Work

Having established the importance of interpretation in DNNs, let us have a look at what kind of methods are being used to interpret neural networks.

  1. Manual inspection of filters (Zeiler & Fergus [2]): We feed an input image to the neural network, store the activations of all channels and manually inspect visualisations of these activations, trying to find something meaningful, such as a channel that detects edges or a channel that detects circular objects. In addition to being manual, this method is also inherently subjective.
  2. Concept dataset (Bau et al. [3]): We compare the visualisations of channel activations in our network with pixel-level annotations from a concept dataset. This method suffers from the limited number of “concepts” that can be defined.
  3. Ablation study (Zeiler & Fergus [2]): By covering different parts of the input image and measuring the effect on the final performance, we can construct a kind of heatmap that shows how important each region is for the determination of a certain class. However, “covering” usually means making the region black or gray, or filling it with noise, which actually introduces incorrect information into the input and can, for example, activate edge-detecting channels.


As you can see, each of these methods has a drawback. This is the first observation of this paper.

Next, a quick refresher on what kind of visualisation methods we have at our disposal. The most popular methods can, in essence, be grouped into two classes:

  1. Deconv-based methods: Also called saliency maps, these methods compute the gradient of an output (which can be a logit score or a channel activation) with respect to the input image. They all use deconvolution operations (also called transposed convolutions) to go from a higher-layer channel activation to a lower-layer channel activation, and they employ “switches”, recorded during the forward pass, to store the positions of the highest activations in max pooling operations. Gradients by Simonyan et al. [4], DeconvNet by Zeiler & Fergus [2] and Guided Backpropagation by Springenberg et al. [5] all belong to this class. These three methods differ only in how they handle the ReLU activation layer during backpropagation.
  2. Class-Activation-Map-based methods: These methods generate visualisations by taking a weighted average of the final convolutional layer activations (the activations before Global Average Pooling and the FC layer). Class Activation Maps by Zhou et al. [6], Grad-CAM and Guided Grad-CAM by Selvaraju et al. [7] are the main methods in this class. 


There are drawbacks to both of these groups. Deconv-based methods usually generate visualisations with finer details that are much more pleasing to the eye. However, they tend to display “checkerboard” or “grid” artifacts, especially in the visualisations of lower-layer channels. CAM-based methods, on the other hand, suffer from low-resolution, “crude” visualisations. This is due to the fact that the weighted average of activations at the final convolutional layer is often of a much smaller dimension than the input image; interpolating it up to the input dimensions inevitably results in visibly pixelated heatmaps. Another downside of CAM-based methods is that they cannot be applied to FC layers.

Can we come up with a better visualisation method? This is the second question raised in this paper.

Let us assume that, one way or another, we have obtained a visualisation for a certain input image in our dataset. This visualisation is supposed to show us which parts of the input image were the most important for the DNN. As an example, have a look at the visualisations below for the class “brushing teeth”:



Figure 2: Two images and their visualisations for "brushing teeth" class [6]


Now, just by visual inspection, we can definitely say that these localisations look really accurate. However, unless we come up with a numerical measure, our evaluations are going to remain subjective. The first idea, by Selvaraju et al. [7], is to conduct user studies: they show their visualisations to workers on Amazon Mechanical Turk and measure what percentage of the visualisations were correctly identified. As is the case for most user studies, this is also highly subjective. Another approach is to use the visualisations we have obtained in a proxy task and measure the difference in performance, as done by Zhou et al. [6]. The problem with this second approach is that it changes our objective from finding the most accurate visual explanation to finding a visualisation that maximises performance in the proxy task. Clearly this is not ideal.

Here comes the third and the last idea of the paper: we need an objective and quantitative method for the evaluation of visualisations.

As a remedy for each of these 3 issues we talked about, the authors propose 3 ideas:

  1. To address the problem of model interpretation, they propose a method to find the most important channels for any given model.
  2. To address the problem of visualisation of models, they propose a modification to DeconvNet, which preserves the fine details of DeconvNet while at the same time eliminating the grid artifacts.
  3. To address the problem of objective evaluation of visualisations, they release an8flower, an artificially constructed dataset.


Methodology

The starting point of the authors’ work is that, for any given class, different channels encode features of different importance, as shown by Bau et al. [3] and Yosinski et al. [8]. This means that in the determination of a certain class, some channels are very important while others are not important at all. As an example, the features that are relevant for identifying a ball are probably not very critical for identifying a dog. The problem is to identify the “relevant” features for each class.

The method proposed in this paper is as follows (a code sketch of these steps follows the list):

  1. Take a training input image and feed that to the DNN.
  2. Store the activations for all of the channels in the DNN architecture.
  3. For each of these stored activations, find the L2 norm, which I will refer to as the “score” of that channel from this point on.
  4. L1 normalise the scores that belong to the same layer. For example, if the 14th layer has 3 channels with scores [15.3, 6.7, 8.9], normalise these scores to [0.49, 0.22, 0.29]. Do the same thing for all layers, to cover all the channel scores. 
  5. Put these normalised scores into an array. At the end, if you have 100 channels in total in your DNN architecture, you are going to end up with a score vector of length 100.
  6. Repeat steps 1-5 for all the training images in your dataset, and form a matrix X. So, if you have 100 channels in your architecture and 2000 images in your training dataset, this X matrix (strictly speaking, its transpose X^T) is going to have 2000 rows and 100 columns.
  7. Similar to step 6, construct a matrix L (strictly speaking, its transpose L^T) that contains the one-hot-encoded ground-truth labels for all training images. So if there are 2000 images in your training set and the total number of classes is 50, then this matrix is going to have 2000 rows and 50 columns.


(As a side note, I am not sure what they wanted to achieve by L1 normalising the scores inside each layer. I accept that there needs to be some normalisation, because an activation might get a high L2 norm just thanks to its large dimension, so we need to correct for size, but I do not get how their normalisation helps.)
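
To make steps 1-7 concrete, here is a minimal sketch of how the score vectors and the X and L matrices could be assembled. It is my own illustration, not the authors' code, and it assumes a hypothetical helper collect_activations(img) that runs the trained DNN on one image and returns a list of per-layer activation arrays of shape (channels, ...):

import numpy as np

def score_vector(activations_per_layer):
    """Steps 2-5: per-channel L2 norms, L1-normalised within each layer,
    concatenated into one score vector for a single image."""
    scores = []
    for act in activations_per_layer:            # one array per layer, shape (channels, ...)
        flat = act.reshape(act.shape[0], -1)
        l2 = np.linalg.norm(flat, axis=1)        # step 3: L2 norm of each channel
        scores.append(l2 / (l2.sum() + 1e-12))   # step 4: L1-normalise within the layer
    return np.concatenate(scores)                # step 5: one entry per channel in the network

def build_matrices(images, labels, num_classes, collect_activations):
    """Steps 6-7: stack the score vectors and the one-hot labels of the training set."""
    X_T = np.stack([score_vector(collect_activations(img)) for img in images])  # (N, #channels)
    L_T = np.eye(num_classes)[labels]                                           # (N, #classes)
    return X_T, L_T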

Now that we have constructed the X and L matrices, we try to find the optimum W matrix for the following matrix system:

X^T W ≈ L^T

By “optimal”, we mean that this W should be the solution to the problem:

W* = argmin_W ‖X^T W − L^T‖²_F

What we are trying to achieve here is as follows: We are trying to find a weight associated with each channel and class pair such that when we multiply a score vector with its corresponding weights, we will get as close an approximation of the one-hot encoded label vector as possible. The W matrix has dimensions (# channels) x (# classes), which means that for each class, the score vector needs to be multiplied with a different weight vector. 

As an example, suppose that we have a cat image and that we have calculated the score vector for this image. When we multiply this score vector with the weight vector for the cat class (the entries of the W matrix in the cat column), the result should be close to 1. On the other hand, when we multiply this score vector with the weight vector for the dog class, the result should be close to 0.

One constraint in this problem is that the weight vector for each class should have an L1 norm lower than μ:

‖w_c‖_1 ≤ μ   for each class c (w_c being the column of W that belongs to class c)

This μ is the sparsity parameter and it controls the number of non-zero entries in the solution. A higher μ value results in more non-zero entries whereas a lower μ gives you a W matrix with lots of zeros. This constraint makes the optimisation problem a “μ-lasso optimisation”.
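
To give an idea of how W could be obtained in practice, below is a per-class lasso fit. Note that scikit-learn's Lasso solves the penalised form (with a regularisation weight alpha) rather than the constrained form with an explicit μ; the two are equivalent for a suitable, data-dependent correspondence between alpha and μ. This is my own sketch, not necessarily the solver the authors used:

import numpy as np
from sklearn.linear_model import Lasso

def fit_relevance_weights(X_T, L_T, alpha=0.01):
    """Fit one sparse weight column per class so that X^T W approximates L^T."""
    n_channels, n_classes = X_T.shape[1], L_T.shape[1]
    W = np.zeros((n_channels, n_classes))
    for c in range(n_classes):
        model = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
        model.fit(X_T, L_T[:, c])   # regress the one-hot column of class c on the scores
        W[:, c] = model.coef_       # sparse weights: most entries end up exactly zero
    return W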

Now, we can talk about the pipeline proposed in the paper. First, we train our DNN using our training data: nothing different here. Then, again using the training data, we solve the μ-lasso optimisation problem we have just talked about to get the W matrix. At test time, we feed the image to the DNN, extract its score vector x and get the DNN’s class estimate. Then we element-wise multiply this score vector with the weight vector corresponding to the estimated class to obtain a “response” vector. The largest entries of the response vector tell us which channels have been the most critical in the estimation of this class. Finally, we take the top 3 (or 5, for that matter) responding channel activations and visualise them.
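
In code, the test-time channel selection boils down to a couple of lines (again, my own sketch):

import numpy as np

def top_relevant_channels(x, W, predicted_class, k=3):
    """Element-wise multiply the image's score vector with the weight column of the
    predicted class and return the indices of the k highest-responding channels."""
    response = x * W[:, predicted_class]
    return np.argsort(response)[::-1][:k]

The returned channel indices are the ones whose activations are then visualised with the modified DeconvNet described below.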


Figure 3: The proposed pipeline [9]

Now, the authors use a modified DeconvNet to visualise these identified relevant features (for a detailed discussion of deconvolution, you can refer to Zeiler & Fergus [2] and Springenberg et al. [5]). As was pointed out in the previous section, a drawback of deconvolution-based methods is that, when applied with a stride larger than 1, the visualisations display a checkerboard (grid) artifact. To overcome this problem, the authors propose first interpolating the original input activation to another size and then applying a deconvolution with a stride of 1, so as to end up with the same dimensions as the original approach with a higher stride. You can have a look at the images below for a visual explanation:

Figure 4: Vanilla deconvolution


Figure 5: Oramas et al.'s modification


Instead of going from a dimension A to a dimension B with a stride-2 deconvolution, we first interpolate the original input activation from dimension A to a dimension Â. Afterwards, we apply a deconvolution with stride 1 to end up with dimension B. The dimension Â that will result in dimension B after a stride-1 deconvolution can easily be calculated: for a kernel of size k (and no padding), a stride-1 deconvolution maps Â to Â + k − 1, so we need

Â = B − k + 1,   where B = s·(A − 1) + k is the output size of the vanilla stride-s deconvolution.
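
A minimal PyTorch sketch of this resize-then-stride-1-deconvolution idea (my own illustration; the shapes, kernel size and bilinear interpolation mode are assumptions made for the example):

import torch
import torch.nn.functional as F

def deconv_without_grid(activation, weight, kernel_size, stride):
    """Replace a stride-`stride` deconvolution by bilinear resizing to A_hat followed by
    a stride-1 deconvolution, so the output still has the original target size B."""
    A = activation.shape[-1]                        # assume a square (N, C, A, A) activation
    B = stride * (A - 1) + kernel_size              # size the vanilla deconvolution would give
    A_hat = B - kernel_size + 1                     # size that a stride-1 deconvolution maps to B
    resized = F.interpolate(activation, size=(A_hat, A_hat),
                            mode='bilinear', align_corners=False)
    return F.conv_transpose2d(resized, weight, stride=1)

x = torch.randn(1, 8, 7, 7)                         # dummy 7x7 activation with 8 channels
w = torch.randn(8, 4, 3, 3)                         # (in_channels, out_channels, k, k)
out = deconv_without_grid(x, w, kernel_size=3, stride=2)
print(out.shape)                                    # torch.Size([1, 4, 15, 15]), same as stride 2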


Results & Discussion 

 

The authors conduct 4 sets of experiments on 4 datasets:

  • MNIST by LeCun & Cortes (2010): Hand-written digits
  • Subset of Fashion144k by Simo-Serra et al. (2015): Instagram photos
  • ImageNet by Russakovsky et al. (2015)
  • ImageNet-cats, which is actually a subset of ImageNet


Now let’s go one by one over the experiments they conducted.

Experiment 1: Relevant Filters

We established that we could identify the most critical channels for any given class by looking at the channel responses, which we find by multiplying the channel scores with the channel weights for that class. However, are these channels really the most important ones? This is what we try to find out in this experiment.

The authors reason that if the features we have identified for a certain class are indeed highly relevant, then their removal (setting their activations to zero) should have a larger effect on the final performance than the removal of some random features. Below you can see a figure showing the results on the 4 datasets. Let’s concentrate on the MNIST dataset first.



Figure 6: Changes in mean classification accuracy as channels are removed [9]


The reddish column (rightmost) shows the original performance, i.e. without removing any channels. The light blue column (first) shows the performance when we remove the identified channels. The striped-blue column (second) is the performance when some random channels were removed. The performance drop is evidently greater in the case of identified filters, which means that the identified features are meaningful. Honestly, just because these filters are more important than some random filters does not prove in any way that these filters are the MOST important. In that sense, I do not think that this experiment has much meaning.

In the orange and striped-orange columns, you can see the performance when the relevant-feature identification and the random feature selection were limited to convolutional layers. The performance in both cases is inferior to the performance obtained in the first and second columns, which shows us that FC layers also contain important information.

Now, if you look at the results for the other datasets, you might notice that the performance difference between identified features and randomly selected features is not as dramatic. There is no explanation or comment on this point in the paper.
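
For concreteness, the channel-removal test of this experiment could be implemented roughly as follows; the forward-hook zeroing below is my own illustration of the idea (zero the selected channels, re-measure accuracy), not the authors' code:

import torch

def zero_channels(module, channel_indices):
    """Forward hook that zeroes the selected channels of a layer's output."""
    def hook(mod, inputs, output):
        output = output.clone()
        output[:, channel_indices] = 0.0
        return output
    return module.register_forward_hook(hook)

def accuracy_with_channels_removed(model, loader, removals):
    """`removals` is a list of (module, [channel indices]) pairs: either the identified
    relevant channels or an equally sized random selection."""
    handles = [zero_channels(m, idx) for m, idx in removals]
    correct, total = 0, 0
    model.eval()
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    for h in handles:
        h.remove()                                  # restore the unmodified model
    return correct / total

Comparing this accuracy for the identified channels against the same number of randomly chosen channels gives the kind of comparison shown in Figure 6.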

Experiment 2: Qualitative Evaluation of Visualisation Methods

We talked about how the authors came up with a modification to DeconvNet to address the checkerboard artifact problem. In this experiment, they compare their visualisations with the visualisations obtained through upsampled activation maps and Guided Back Propagation:


Figure 7: Visualisations of identified filters with different methods [9]


Here, we see how the identified "relevant" features were visualised. (2/21 means that the channel is in the 2nd of 21 layers, and the pink colour for the 20/21 channel means that layer is an FC layer, for which no upsampled activation map can be obtained.) We notice that the modification has indeed solved the artifact problem (compare the second and third images in the first column) and that it has also been able to retain the fine details of GBP.

Experiment 3: Quantitative Evaluation of Visualisation Methods

Visualisations are nice to look at, but we cannot really come to any conclusions just by looking. For an objective evaluation, we need some numbers, and this is where the an8flower-single-6c dataset comes in.


Figure 8: Classes in an8flower-single-6c dataset [9]


an8flower-single-6c is an artificially constructed dataset which was released with this paper. A 3D plant image was rotated 360 degrees and snapshots were taken at different rotation angles. There are 6 classes in total and the only thing that differentiates one class from another is the color of the flower (or the petals, to be exact). The main body of the plant is the same in all classes. Accordingly, a neural network trained on this dataset can only differentiate between different classes by looking at the flower part, because that is the only discriminative region. An ideal visualisation method should also highlight only this region, since the DNN cannot be focusing on the main body. Based on this idea, the authors use the petal region as the ground truth to evaluate visualisation methods on. 



Figure 9: Ground-truth masks for different input images [9]


In the figure below, we see a comparison of visualisations from different methods and the ground-truth mask for each image. The authors’ visualisation seems to have captured the ground-truth region more or less accurately, and we also note that Guided Grad-CAM++ is a serious competitor:


Figure 10: Authors' method (rightmost) compared with other visualisation techniques [9]


(As a side note, there is actually a second part of this dataset, an8flower-double-12c. In this dataset, there are again 6 different colours but this time it is either the petals or the main body that is coloured, so this makes a total of 12 classes. The authors again use the coloured region as the ground truth for the visualisations, but it actually makes no sense this time. In the case of 6 classes, only the petals are discriminative but when the body is also allowed to have different colours, then actually the DNN should be looking at the whole plant when making its decision. I have therefore decided to exclude the results for this second dataset, because they have no significance.)

Visualisations are basically heatmaps; they are not binary. To be able to measure their performance on a ground truth mask, we need to binarise them through thresholding. After thresholding, we can use the IoU (intersection over union) as a measure for the accuracy. However, the thresholded visualisation (and therefore the IoU score) depends on the selected threshold. To solve this issue, we extract the IoU score for a large number of threshold values. Putting these scores together on a graph, we can construct an IoU curve. Similar to the evaluation of a binary classification network, we can then use the AUC (area under the curve) of this IoU curve to compare different methods. Below you can see how different methods have fared in this experiment:


Figure 11: AUC scores of different methods on an8flower dataset [9]


The authors’ method achieves the highest AUC score, which they present as further evidence of how well their method works. However, I would take these results with a pinch of salt, because there is a caveat here: they use the identified relevant features only for their own method, which means that for GBP they applied deconvolution on the logit class score, not on the identified features. In this sense, I think this is not an apples-to-apples comparison for their modification to DeconvNet; it is more a comparison of their whole pipeline with other visualisation methods. In CAM-based methods, relevant filters would not be of any use anyway, so it makes sense to compare whole pipelines, but for GBP I would still want to see an additional evaluation using the identified filters, because right now there is no evidence that their modification to DeconvNet is quantitatively better than the baseline.
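
For completeness, the IoU-AUC evaluation described above can be computed along these lines (my own sketch; the exact normalisation and set of thresholds used in the paper may differ):

import numpy as np

def iou_auc(heatmap, mask, num_thresholds=100):
    """Threshold a heatmap at many levels, compute the IoU against the binary
    ground-truth mask at each level, and return the area under the IoU curve."""
    heatmap = (heatmap - heatmap.min()) / (np.ptp(heatmap) + 1e-12)   # normalise to [0, 1]
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    ious = []
    for t in thresholds:
        binary = heatmap >= t
        inter = np.logical_and(binary, mask).sum()
        union = np.logical_or(binary, mask).sum()
        ious.append(inter / union if union > 0 else 0.0)
    return np.trapz(ious, thresholds)               # area under the IoU-vs-threshold curve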

Experiment 4: Sanity Check for Visual Explanations

In the last experiment, the authors perform a sanity check for the generated visualisations. More specifically, they test whether their visualisations are class-sensitive. A class-sensitive visualisation method should yield different visualisations for different classes; this makes sense because the DNN is certainly not looking at the same regions of the input image when making predictions for different classes. In the figures below, we see a cat as the input image on the left. On the right are visualisations for different classes, the first one being the actual class of this image, namely “cat”. In our case, these visualisations are (to the best of my understanding) obtained by finding the top-responding channel for each class; in the case of GBP, this would mean applying deconvolution on the corresponding class logit score. The key point is that the input image is the same cat for all of these visualisations. Just as we have a score associated with each class for a single input, we can also generate a visualisation for each class.


Figure 12: Visual explanations for different classes obtained by using the authors' method [9]


As you can see, these visualisations, which correspond to different classes, are quite different from each other, so we can say that the authors’ method (the whole pipeline) is indeed class-sensitive. You might think that this is very natural and straightforward, but actually it is not. Nie et al. [10] and Adebayo et al. [11] conduct many sanity checks in their work, including a test for class sensitivity, and they show that two very popular visualisation techniques, DeconvNet and Guided Back Propagation, are not class-sensitive at all. As it turns out, these two methods actually behave much like edge detectors. Have a look at the visualisations obtained through DeconvNet:



Figure 13: Visualisations for different classes obtained by using DeconvNet [10]


The visualisations on the right are almost completely identical. The first visualisation is for the tabby class, whereas the other two to its right are for some random classes. This shows us that not all that glitters is gold; just because the visualisations look nice does not mean that the method producing them is a good, or even a valid, visualisation method.

Although the authors’ method seems to have passed this sanity check, some questions remain. First of all, Nie et al. [10] and Adebayo et al. [11] argue that the reason why DeconvNet and GBP do not work, while the vanilla “Gradients” method by Simonyan et al. [4] does, comes down to how the backpropagated signal is treated at the ReLU layers; the three methods are basically identical otherwise. When backpropagating through a ReLU nonlinearity, the Gradients method zeroes the signal at positions whose activations were negative during the forward pass. DeconvNet, on the other hand, zeroes the positions where the backpropagated signal itself is negative (so, in a sense, it applies a ReLU during backpropagation). GBP combines the two and zeroes every position that satisfies either condition. Here is a nice visual to clarify this difference:


Figure 14: Comparison of backpropagation, deconvnet and GBP [5]
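
To make the difference concrete, here is a tiny numerical illustration of the three backward rules for a single ReLU layer (my own sketch):

import numpy as np

def relu_backward(grad_out, forward_input, method):
    """Pass the backpropagated signal through a ReLU according to each method."""
    if method == 'gradients':     # Simonyan et al.: mask by the sign of the forward input
        return grad_out * (forward_input > 0)
    if method == 'deconvnet':     # Zeiler & Fergus: mask by the sign of the backward signal
        return grad_out * (grad_out > 0)
    if method == 'guided_bp':     # Springenberg et al.: apply both masks
        return grad_out * (forward_input > 0) * (grad_out > 0)
    raise ValueError(method)

x = np.array([-1.0, 2.0, 3.0, -0.5])   # forward input to the ReLU
g = np.array([0.4, -0.3, 0.8, 0.2])    # signal arriving from the layer above
for m in ('gradients', 'deconvnet', 'guided_bp'):
    print(m, relu_backward(g, x, m))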


Now, it seems a little bit strange that, just by replacing the high-stride deconvolutions, the authors have been able to achieve class sensitivity, because the underlying technique (how it deals with the ReLU layers) is exactly the same. In their defence, however, the authors admit that this visualisation method may turn out to be faulty, but they maintain that the main point of their work is the identification of relevant features and that if somebody someday comes up with a better way to visualise these features, they would be happy to adopt it.

Conclusion

In this paper, we saw the authors’ attempt to come up with a solution to each of the three problems they identified in the field of DNN interpretation: How do we interpret a trained model? How can we visualise a model, or a filter of a model? How can we quantitatively evaluate a visualisation?

The main contribution of the paper is, clearly, their answer to the first question. Their method for finding the relevant filters is simple and easy to understand; however, expressing the final class labels as weighted sums of L2 norms of channel activations might be too crude an approximation. The field of deep learning is full of these “it-just-works” heuristics (CAM can also be said to be a very simple approximation, for that matter), so I do not mean to say I am categorically against their method. Nevertheless, as I noted before, there is no evidence that these “relevant” features are more than just “above average”. More experiments are needed for a more accurate analysis.

I also liked their idea of a dataset for the objective evaluation of visualisations (bar the second dataset that I talked about). However, this dataset might be too small and simplistic compared to ImageNet, where you would need tens of layers, skip connections and so on to get decent classification performance. I therefore have doubts about the generalisability of this evaluation: a visualisation method that performs well on this dataset may not work at all on a different dataset with a different architecture.

A serious downside of this paper is that, although there were a lot of images, the explanations of how these images were obtained were sometimes lacking. The inconsistent usage of the words “visualisation”, “interpretation”, “visual explanation” etc. did not make understanding any easier. Their code was supposed to be accessible at https://homes.esat.kuleuven.be/~joramas/projects/visualExplanationByInterpretation/ but at the time of writing (Nov 2019), more than 6 months after the publication of the paper, the code section was still “under construction” for some reason.

As a final "verdict", I think they presented some simple ideas, which might really be working well, but I should say that we definitely need to see more in-depth experiments to be convinced (and, ideally, the code).


References

[1] Kunpeng Li, Ziyan Wu, Kuan-Chuan Peng, Jan Ernst and Yun Fu. Tell Me Where to Look: Guided Attention Inference Network. 2018. URL https://arxiv.org/abs/1802.10171

[2] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), 2014. URL https://arxiv.org/abs/1311.2901

[3] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition (CVPR), 2017. URL https://arxiv.org/abs/1704.05796

[4] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In International Conference on Learning Representations (ICLR) Workshops, 2014. URL https://arxiv.org/abs/1312.6034

[5] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller. Striving for simplicity: the all convolutional net. In International Conference on Learning Representations (ICLR) Workshops, 2015. URL https://arxiv.org/abs/1412.6806

[6] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning Deep Features for Discriminative Localization. In Computer Vision and Pattern Recognition (CVPR), 2016. URL https://arxiv.org/abs/1512.04150

[7] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In The IEEE International Conference on Computer Vision (ICCV), 2017. URL https://arxiv.org/abs/1610.02391

[8] Jason Yosinski, Jeff Clune, Anh Mai Nguyen, Thomas J. Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. In International Conference on Machine Learning (ICML) Workshops, 2015. URL https://arxiv.org/abs/1506.06579

[9] José Oramas M., Kaili Wang, Tinne Tuytelaars. Visual Explanation By Interpretation: Improving Visual Feedback Capabilities Of Deep Neural Networks. International Conference on Learning Representations (ICLR), 2019. URL https://arxiv.org/abs/1712.06302

[10] Weili Nie, Yang Zhang, Ankit B. Patel. A Theoretical Explanation for Perplexing Behaviors of Backpropagation-based Visualizations. 2018. URL https://arxiv.org/abs/1805.07039

[11] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, Been Kim. Sanity Checks for Saliency Maps. 2018. URL https://arxiv.org/abs/1810.03292

  



