Paper review

Problem statement/Big picture

Introduction

This paper was presented at the International Conference on Computer Vision (ICCV). It presents a method for understanding the decisions and the inner workings of a convolutional neural network applied to image tasks (detection, classification, or segmentation).
Neural networks outperform previous methods on these tasks; however, they act like black boxes and are as a result almost impossible to interpret. This can be a problem in many domains, including medical applications, where we would like a human to be able to verify that the decision proposed by the AI is the right one.

It was written by Ruth Fong and Mandela Patrick, both from the University of Oxford, and Andrea Vedaldi from Facebook AI Research.

Related work

The problem of understanding a CNN has two parts:

  1. understanding which part of the image is responsible for the output (the attribution part)
  2. understanding the hidden convolutional layers, which act as feature extractors (the visualization part)

The method presented in this paper helps with both of them.
Concerning 1, there exist three kinds of methods:

  1. Backpropagation methods, which use the back-propagation algorithm to relate the information of the output back to the input. The most classical of them is the gradient method [1]. There exist improvements of this method, as presented in another paper for this class.
  2. Approximation-based methods:
    These are general methods to interpret any neural network model (not just for images). The principle is to approximate the neural network with another AI model that is more interpretable, for instance a decision tree [2].
  3. Perturbation methods. These consist in applying a perturbation to the input and observing the effect on the output. For instance, the occlusion method proposes to occlude a part of the image. The extremal perturbation method presented in the current paper is a perturbation method as well.

Concerning the visualization part (2), related work proposes two ways of visualizing the intermediate layers: either find the images of the dataset that activate the intermediate layer the most, or reconstruct an image from a given activation of the intermediate layer (using back-propagation) [3].

The best approach, though, is to use combinations of these methods in order to reconstruct images that are close to the images of the dataset and thus more realistic and understandable.

Methodology

Previous concept

The method builds on a previous paper written by Ruth Fong and Andrea Vedaldi, two of the authors of the current paper.
As mentioned before, the idea is a perturbation method: specifically, we look at the effect of applying a mask to the input image.
In the previous paper, the idea was to optimize three things at the same time: the activation of the output, the size of the mask, and the smoothness of the mask. Therefore, they defined an energy with these three terms and applied an optimization algorithm to it.

Hence, the energy was defined as: \boldsymbol{m}_{\lambda, \beta}=\underset{\boldsymbol{m}}{\operatorname{argmax}}\ \Phi(\boldsymbol{m} \otimes \boldsymbol{x})-\lambda\|\boldsymbol{m}\|_{1}-\beta S(\boldsymbol{m})

With:

  • \Phi the activation (the network output for the target class)
  • m the mask
  • x the input image
  • \lambda and \beta weights balancing the three terms
  • S a smoothness measure function
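The three-term trade-off can be sketched numerically. Below is a toy illustration, assuming a simple total-variation penalty for S and a dummy "network" \Phi that just returns mean brightness; none of this is the paper's actual implementation.

```python
import numpy as np

def total_variation(m):
    """Toy smoothness penalty S(m): sum of squared finite differences."""
    return (np.diff(m, axis=0) ** 2).sum() + (np.diff(m, axis=1) ** 2).sum()

def energy(phi, x, m, lam=1.0, beta=1.0):
    """Three-term energy: activation of the masked input,
    minus a mask-size term, minus a smoothness term."""
    return phi(m * x) - lam * np.abs(m).sum() - beta * total_variation(m)

# Dummy "network": activation = mean brightness of the masked image.
phi = lambda img: img.mean()
x = np.ones((4, 4))           # dummy input image
m_full = np.ones((4, 4))      # mask that keeps everything
m_empty = np.zeros((4, 4))    # mask that deletes everything

print(energy(phi, x, m_full))   # the size term penalizes the full mask
print(energy(phi, x, m_empty))
```

The point of the example is only that \lambda and \beta trade off against the activation term, which is exactly the balancing problem the authors set out to remove.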

New idea

The main problem of this method, which the authors of the paper solve here, is that it is hard to know how to balance the three terms, i.e. to set \lambda and \beta correctly. Therefore, they propose to work with a fixed mask size and a predefined space of smooth masks, so that the last two terms of the equation are no longer needed during optimization.

We obtain the following equation, where a is the ratio \frac{\text{size of the mask}}{\text{size of the image}}: \boldsymbol{m}_{a}=\underset{\boldsymbol{m}:\|\boldsymbol{m}\|_{1}=a|\Omega|}{\operatorname{argmax}}\ \Phi(\boldsymbol{m} \otimes \boldsymbol{x})

Note that m now depends on a instead of \lambda and \beta. For the concept of extremal perturbation, we need to define a threshold of activation \Phi_0. The paper does not really specify how this threshold should be chosen, but in the later examples they take, for instance, a proportion of the activation of the original image (i.e. before applying a mask).

Then we define a^{*}=\min \left\{a: \Phi\left(\boldsymbol{m}_{a} \otimes \boldsymbol{x}\right) \geq \Phi_{0}\right\}

It is the minimum size for which there exists a mask whose activation exceeds the threshold.

An extremal perturbation is a mask \boldsymbol{m}_{a^{*}} associated with this minimum size.
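Conceptually, finding a^{*} is a search over area fractions. A minimal sketch, assuming the per-area optimal masks m_a have already been computed (here they are hand-made, and \Phi is a toy function, not a real network):

```python
import numpy as np

def extremal_area(phi, x, masks_by_area, phi0):
    """Return (a, mask) for the smallest area fraction a whose mask m_a
    reaches the activation threshold phi0, or None if none qualifies."""
    for a in sorted(masks_by_area):
        m = masks_by_area[a]
        if phi(m * x) >= phi0:
            return a, m
    return None

# Toy setup: activation = fraction of the image kept by the mask.
phi = lambda img: img.mean()
x = np.ones((2, 2))
masks = {0.25: np.array([[1., 0.], [0., 0.]]),
         0.50: np.array([[1., 1.], [0., 0.]]),
         1.00: np.ones((2, 2))}

a_star, m_star = extremal_area(phi, x, masks, phi0=0.5)
print(a_star)  # 0.5: the smallest area whose mask reaches the threshold
```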

How to implement it

To optimize this energy, they want to use a gradient-based technique, which requires working on a continuous space for the masks. So, instead of using masks in \{0,1\}^{\Omega}, we use masks in [0,1]^{\Omega}. In the end, though, we still want a binary mask (otherwise it is impossible to fix its size, for instance). Therefore, they reintroduce a term in the equation that takes care of the size of the mask. But this time, instead of a term that makes the mask as small as possible, the term forces the mask to quickly become binary with area a|\Omega|.

To do so, they introduce the regularizer R_{a}(\boldsymbol{m})=\|\operatorname{vecsort}(\boldsymbol{m})-\mathbf{r}_{a}\|^{2}, where vecsort sorts the mask values in decreasing order and \mathbf{r}_{a} is a template vector containing a|\Omega| ones followed by zeros.
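The regularizer is easy to reproduce: sort the mask values in decreasing order and compare them to the template of a|\Omega| ones followed by zeros. A small numpy sketch (my own rendering, not the authors' code):

```python
import numpy as np

def R_a(m, a):
    """||vecsort(m) - r_a||^2: penalizes masks that are not binary
    with area fraction a."""
    v = np.sort(m.ravel())[::-1]        # vecsort: values in decreasing order
    k = int(round(a * v.size))          # number of pixels the mask may keep
    r = np.zeros(v.size)
    r[:k] = 1.0                         # template r_a: k ones, then zeros
    return ((v - r) ** 2).sum()

m_binary = np.array([[1., 1.], [0., 0.]])   # binary mask of area 0.5
m_soft   = np.full((2, 2), 0.5)             # uniform non-binary mask
print(R_a(m_binary, 0.5))  # 0.0: matches the template exactly
print(R_a(m_soft, 0.5))    # 1.0: penalized for not being binary
```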

We obtain the final equation of the energy that is effectively optimized: \boldsymbol{m}_{a}=\underset{\boldsymbol{m} \in \mathcal{M}}{\operatorname{argmax}}\ \Phi(\boldsymbol{m} \otimes \boldsymbol{x})-\lambda R_{a}(\boldsymbol{m})

We still need, however, to define the space of masks \mathcal{M}. It is a bit involved, but to summarize, the masks are obtained by smoothing with a normalized Gaussian kernel: \pi_{g}(\boldsymbol{x} ; u, \sigma)=\frac{\sum_{v \in \Omega} g_{\sigma}(u-v) \boldsymbol{x}(v)}{\sum_{v \in \Omega} g_{\sigma}(u-v)}, \quad g_{\sigma}(u)=e^{-\frac{\|u\|^{2}}{2 \sigma^{2}}}

This smoothing is then combined with a convolution step so that the maximum of the mask is restored to 1.
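As a rough illustration of the parameterization, here is a simplified stand-in that applies the normalized Gaussian weighting of the formula above to a low-resolution mask and then rescales so the maximum is 1. This is only a sketch; it does not reproduce the paper's actual smooth-mask operator.

```python
import numpy as np

def smooth_mask(m, sigma=1.0):
    """Normalized Gaussian smoothing of a mask, then rescale so that
    the maximum of the mask is restored to 1."""
    h, w = m.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)  # |u - v|^2
    g = np.exp(-d2 / (2.0 * sigma ** 2))                           # g_sigma(u - v)
    out = (g @ m.ravel()) / g.sum(axis=1)     # weighted average per pixel u
    out = out.reshape(h, w)
    return out / out.max()                    # restore the maximum to 1

m = np.zeros((5, 5))
m[2, 2] = 1.0                    # a single "on" pixel
s = smooth_mask(m, sigma=1.0)
print(round(float(s.max()), 6))  # 1.0 after rescaling
```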


Attribution at intermediate layers

Instead of applying the mask to the input image, it is possible to apply it to an intermediate layer. The energy is therefore modified as: \boldsymbol{m}_{a}=\underset{\boldsymbol{m}}{\operatorname{argmax}}\ \Phi_{l+}\left(\boldsymbol{m} \otimes \Phi_{l}(\boldsymbol{x})\right)-\lambda R_{a}(\boldsymbol{m})

With \Phi_{l+} the activation of the subnetwork after layer l and \Phi_l the activation up to layer l. A feature-inversion method can then be used to observe the results.
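The split of the network at layer l can be mimicked with two composed functions. A toy sketch (the two lambdas below are stand-ins for the real subnetworks):

```python
import numpy as np

def masked_intermediate_activation(phi_l, phi_l_plus, x, m):
    """Apply the mask to the layer-l features instead of the input:
    Phi_{l+}(m * Phi_l(x))."""
    return phi_l_plus(m * phi_l(x))

# Toy split network: Phi_l doubles the input, Phi_{l+} sums the features.
phi_l  = lambda x: 2.0 * x
phi_lp = lambda f: f.sum()

x = np.ones((2, 2))
m = np.array([[1., 0.], [0., 0.]])  # keep only one feature location
print(masked_intermediate_activation(phi_l, phi_lp, x, m))  # 2.0
```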

Results and conclusion


Qualitative results

There is no direct way to measure how well a method explains a neural network. As we will see in the next part, it can be measured by indirect methods, but these never measure exactly what we want, so it is also important to show some qualitative results. Here are some masks produced by the algorithm (the image comes from the paper).

The green/red bar represents the ratio \frac{\Phi(\boldsymbol{m}_a \otimes \boldsymbol{x})}{0.25 \cdot \Phi(\boldsymbol{x})} (the activation of the masked image over a quarter of the original image's activation); it becomes red when this ratio exceeds 1. Relevant parts of the image for classification are correctly included in the mask.



Sanity check

A recurrent problem with previous work was giving the same result regardless of the neural network. Here that is not the case, as shown in this image.


Pointing game

The pointing game is an indirect way to measure how well the method explains the network. More specifically, it measures the quality of a saliency map (an image representing the importance of each pixel of the image). We can create a saliency map with extremal masks by summing the (Gaussian-blurred) masks of different sizes.

The pointing game scores a point when the maximum of the saliency map for a given object class effectively falls on the object. The number of points is then divided by the total number of objects. Here are the results presented. All/Diff means all images or difficult images only, i.e. the score measured only on the difficult images of the dataset.
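The scoring rule itself is simple to state in code. A sketch, assuming we already have one saliency map and one ground-truth object mask per case:

```python
import numpy as np

def pointing_game_score(saliency_maps, object_masks):
    """Fraction of cases where the saliency map's maximum
    falls inside the ground-truth object mask (a 'hit')."""
    hits = 0
    for sal, obj in zip(saliency_maps, object_masks):
        idx = np.unravel_index(np.argmax(sal), sal.shape)  # location of the max
        hits += bool(obj[idx])
    return hits / len(saliency_maps)

# Toy example: one hit, one miss.
sal1 = np.array([[0., 1.], [0., 0.]]); obj1 = np.array([[0, 1], [0, 0]])
sal2 = np.array([[1., 0.], [0., 0.]]); obj2 = np.array([[0, 0], [1, 1]])
print(pointing_game_score([sal1, sal2], [obj1, obj2]))  # 0.5
```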


Monotonicity of visual evidence


One intuitive property of this method, which computes masks of different sizes, is that the larger the mask, the larger the output \Phi should get; i.e. \Phi(\boldsymbol{m}_a \otimes \boldsymbol{x}) is increasing in a. In practice, this holds in 98.45% of the cases. It means the areas of the image can be sorted by importance by considering which masks they belong to.

Using that, we can generate such a saliency map by simply adding the masks of different sizes.
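Summing the masks directly gives such a map: pixels that already appear in the small masks tend to be covered by the larger masks as well (by the monotonicity above, though the masks need not be perfectly nested), so they accumulate the highest values. A minimal sketch:

```python
import numpy as np

def saliency_from_masks(masks):
    """Sum binary masks of increasing area; pixels kept at small
    areas end up with the highest saliency."""
    return np.sum(masks, axis=0)

m_small = np.array([[1., 0.], [0., 0.]])  # area 0.25 mask
m_big   = np.array([[1., 1.], [0., 0.]])  # area 0.5 mask
s = saliency_from_masks([m_small, m_big])
print(s)  # top-left pixel scores 2 (kept at every area), top-right scores 1
```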


The curve represents the activation with respect to the mask size.

Visualizing per-class channel attribution

The best way to observe the masks produced for hidden layers is to sum them over every instance of a class. If we then look at this sum for each channel, we can automatically find class-specific channels! This is what has been done in the following example.

Conclusion

The extremal perturbation method improves on a method that tries to find the mask that most activates the channel of a class. It introduces new computational concepts to get smooth masks of a fixed size without having to tune parameters. The concept of extremal perturbation then follows as the minimum mask size that activates the output above a certain threshold.

They also extend the method to create masks that show what happens at intermediate layers. The results are very convincing and enable some new kinds of network observation using the masks in different ways.


My own review

Strengths

  • The paper shows some use cases of the method
  • The computational details are very interesting (for instance the vecsort operator)

Weaknesses

  • They don't prove that the method achieves better results than previous work
  • The pointing game measures the accuracy of a saliency map, while the idea of the method is to create a mask

Suggestions for improvement/future work

  • I wonder why it is necessary to force a binary mask when we could output a saliency map directly by keeping a continuous mask (we could then still constrain the size by restricting the 2-norm of the mask, for instance)