Introduction
Convolutional neural networks achieve state-of-the-art performance in many computer vision tasks. But even if a model performs well according to all available reasonable metrics, it is not guaranteed to follow logic that we would consider correct. We do not know why a prediction is correct or what the reasons behind the failures are.
Sometimes a bias in the dataset or feature representation may be decisive for a network prediction [1]. Should we use these models? Think about the application of deep learning models in the medical field, where doctors may decide on a certain treatment for a patient based on the network predictions. How should the doctor act if the prediction contradicts his or her experience? If we do not know why the model fails, we cannot trust it.
Trust is the key component of the successful deployment of a model in practice.
There is a lot of research on the interpretability or explainability of neural networks, but the currently used methods are restricted to visualizing pixel-level features and correlations between input and output [2, 3, 4]. They still require human interaction and interpretation for every image to find potential problems with the model, and they can be prone to subjectivity.
Unlike previous work, the paper Explaining neural networks semantically and quantitatively introduced a method to explain network predictions semantically and quantitatively. It uses clear, meaningful concepts, e.g. head or wings, and calculates the numerical contribution of each concept to the final prediction score.
Moreover, the method also overcomes two main problems related to making networks interpretable: the accuracy vs. interpretability trade-off [5] and the bias-interpreting problem [6].
Methodology
The method proposed in the paper distills knowledge [7] from a classifier network into an explainable additive model. The goal is to approximate the class prediction score, even if the prediction is incorrect. The approximation is a sum of quantitative contributions of visual concepts, which are used to explain the prediction.
The general framework consists of three components:
- Performer, a classification network. Let’s denote its class prediction score for a given input image I as \hat{y}.
- Detection models for visual concepts, n pre-trained models that are used to detect n pre-defined visual concepts. For the i-th visual concept the detection confidence y_{i}(I) is the output of the corresponding detection model.
- Explainer, a network which estimates weights \alpha_{i}(I) for all visual concepts. The weights depend on the image because we want to keep the initial flexibility of the performer, i.e. to correctly explain different input images of the same class in which the same visual concepts are represented differently. The product \alpha_i y_i is the quantitative contribution of the i-th visual concept; we convert it into a percentage value to produce the final explanation results.
Once the performer and detection models are learned, the approximation equation looks as follows:
\hat{y} \approx \alpha_1(I) \cdot y_1 + \alpha_2(I) \cdot y_2 + \dots + \alpha_n(I) \cdot y_n + b,
where b denotes the bias term. We only need to train the explainer to estimate the weights \alpha_i that minimize the approximation error, using the knowledge distillation loss
L = \| \hat{y} - \sum_{i=1}^{n} \alpha_i \cdot y_i - b \|^2.
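The approximation and the distillation loss can be illustrated with a minimal sketch in plain Python. The function names and all numbers are illustrative assumptions, not taken from the paper; in the actual method the weights \alpha_i(I) are produced by the explainer network rather than fixed by hand.

```python
def approximate_score(alphas, confidences, bias):
    """Additive approximation: sum_i alpha_i * y_i + b."""
    return sum(a * y for a, y in zip(alphas, confidences)) + bias


def distillation_loss(y_hat, alphas, confidences, bias):
    """Squared error between the performer score and its approximation."""
    return (y_hat - approximate_score(alphas, confidences, bias)) ** 2


def contribution_percentages(alphas, confidences):
    """Share of each |alpha_i * y_i| in the total absolute contribution.

    Absolute values are used so that the shares sum to 100 even when
    some contributions are negative.
    """
    contributions = [a * y for a, y in zip(alphas, confidences)]
    total = sum(abs(c) for c in contributions)
    return [100.0 * abs(c) / total for c in contributions]


# Hypothetical values for three visual concepts (e.g. head, wings, tail).
alphas = [2.0, 1.0, 0.5]
confidences = [0.5, 1.0, 2.0]
bias = 0.25
y_hat = 3.5  # assumed performer score for this image

approx = approximate_score(alphas, confidences, bias)
loss = distillation_loss(y_hat, alphas, confidences, bias)
shares = contribution_percentages(alphas, confidences)
```

In this toy setting each concept contributes the same amount, so the three shares are equal and the loss only reflects the residual that neither the concepts nor the bias term account for.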
Two approaches
For this simple and intuitive algorithm, the authors suggest two different network settings and approaches.
Case I: Performer contains interpretable filters in the top convolutional layers
Each filter is a detection model, since it is activated by a unique visual concept. For more details see [5]. In this setting, the detection confidence is given by y_i = \sum_{h,w} x_{hwi}, where x \in \mathbb{R}^{H \times W \times n} is the feature map of the top convolutional layer and i indexes its filters.
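As a small illustration of the case I detection confidence, the sketch below sums each channel of a feature map over all spatial locations. Storing the feature map as a nested list indexed as x[h][w][i] and the concrete values are assumptions made only for this example.

```python
def detection_confidences(feature_map):
    """Compute y_i = sum over (h, w) of x[h][w][i] for every channel i."""
    n_channels = len(feature_map[0][0])
    return [
        sum(pixel[i] for row in feature_map for pixel in row)
        for i in range(n_channels)
    ]


# Hypothetical 2 x 2 feature map with n = 3 channels, stored as x[h][w][i].
x = [
    [[1.0, 0.0, 2.0], [0.0, 0.0, 1.0]],
    [[3.0, 0.0, 0.0], [0.0, 0.5, 1.0]],
]
y = detection_confidences(x)
```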
Case II: Detection models are neural networks
The detection models share low-layer features with the performer. The detection confidence y_i is simply the output of the network head responsible for the i-th visual concept.
Note an important similarity of both cases. The detection models are either part of the performer or have a common architecture. The reason behind this design choice is that we want to mimic the logic, e.g. feature extraction, of the performer.
Bias interpreting problem
Using the knowledge distillation loss only, we face the bias-interpreting problem, where the bias mostly explains the prediction and only very few features are selected to support it. The explainer does not learn the true relationship between the visual concepts and the prediction.
To overcome this problem, the authors propose to guide the training of the explainer with prior weights w_i for \alpha_i and to extend the original loss with a prior loss \mathcal{L}:
Loss = L + \lambda(t) \cdot \mathcal{L}(\alpha, w), \quad s.t. \ \lim_{t \to \infty}\lambda(t) = 0.
The prior loss penalizes dissimilarity between \alpha and w. The choice of the loss function depends on whether we require all weights to be non-negative or not.
\mathcal{L}(\alpha, w) = \begin{cases} \mathrm{crossEntropy}\left(\frac{\alpha}{\|\alpha\|_1}, \frac{w}{\|w\|_1}\right), & \text{if } \forall i: \alpha_i, w_i \geq 0 \\ \left\| \frac{\alpha}{\|\alpha\|_2} - \frac{w}{\|w\|_2} \right\|_2^2, & \text{otherwise} \end{cases}
The guidance of the learning process is necessary only in early epochs. To enforce this, the authors introduce \lambda(t) = \frac{\beta}{t}, where \beta is a constant and t is the epoch number.
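The two variants of the prior loss and the decaying weight \lambda(t) can be sketched in plain Python as follows. This is only an illustration: the function names are hypothetical, and the direction of the cross entropy (normalized w as the target distribution, normalized \alpha as the distribution being trained) is an assumption, since the text above does not fix it.

```python
import math


def l1_normalize(v):
    total = sum(abs(x) for x in v)
    return [x / total for x in v]


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def prior_loss_cross_entropy(alpha, w, eps=1e-12):
    """Cross entropy between L1-normalized vectors; needs non-negative entries."""
    target = l1_normalize(w)         # assumed target distribution
    estimate = l1_normalize(alpha)   # distribution being trained
    return -sum(p * math.log(q + eps) for p, q in zip(target, estimate))


def prior_loss_l2(alpha, w):
    """Squared distance between L2-normalized vectors; signs are allowed."""
    a, b = l2_normalize(alpha), l2_normalize(w)
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def prior_weight(epoch, beta):
    """lambda(t) = beta / t for epochs t = 1, 2, ..."""
    return beta / epoch


def total_loss(distill, alpha, w, epoch, beta, non_negative):
    """Distillation loss plus the prior loss, weighted by lambda(t)."""
    if non_negative:
        prior = prior_loss_cross_entropy(alpha, w)
    else:
        prior = prior_loss_l2(alpha, w)
    return distill + prior_weight(epoch, beta) * prior


# The influence of the prior shrinks as training proceeds.
early = total_loss(0.5, [1.0, 0.0], [0.0, 1.0], epoch=1, beta=0.2, non_negative=False)
late = total_loss(0.5, [1.0, 0.0], [0.0, 1.0], epoch=100, beta=0.2, non_negative=False)
```

Because both vectors are normalized before comparison, only the relative pattern of the weights is pulled towards the prior, not their overall scale.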
Regarding the estimation of the prior weights, we have to consider the two cases from above. Note that the prior weights differ from image to image.
For case I, we proceed similarly to GradCAM [3, 1] and backpropagate the weights, or importance, of the components of the very last feature vector to the interpretable convolutional filters:
w_i = \sum_{h,w} \frac{\partial \hat{y}}{\partial x_{hwi}},
where x \in \mathbb{R}^{H \times W \times n} is the feature map of the interpretable conv-layer and x_{hwi} is the activation unit at location (h,w) of the i-th channel.
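The case I prior weights can be checked numerically on a toy example. The sketch below replaces backpropagation with central finite differences, which is a swapped-in technique chosen only to keep the example dependency-free; the toy performer, its weights v, and the step size are all hypothetical.

```python
def toy_performer(x, v):
    """Hypothetical performer: y_hat = sum_i v[i] * sum_{h,w} x[h][w][i]."""
    return sum(
        v[i] * pixel[i] for row in x for pixel in row for i in range(len(v))
    )


def prior_weights(performer, x, eps=1e-4):
    """w_i = sum_{h,w} d y_hat / d x[h][w][i], via central finite differences."""
    n = len(x[0][0])
    w = [0.0] * n
    for h in range(len(x)):
        for col in range(len(x[h])):
            for i in range(n):
                original = x[h][col][i]
                x[h][col][i] = original + eps
                up = performer(x)
                x[h][col][i] = original - eps
                down = performer(x)
                x[h][col][i] = original  # restore the activation
                w[i] += (up - down) / (2 * eps)
    return w


v = [0.5, -1.0, 2.0]  # assumed linear weights of the toy performer
x = [
    [[1.0, 0.0, 2.0], [0.0, 0.0, 1.0]],
    [[3.0, 0.0, 0.0], [0.0, 0.5, 1.0]],
]
w = prior_weights(lambda fm: toy_performer(fm, v), x)
```

For this linear toy performer every partial derivative with respect to channel i equals v[i], so each w_i is approximately H * W * v[i]. A real performer would be differentiated with an automatic differentiation framework instead.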
For case II, we also use gradients and convolution operations to estimate the prior weights as the ratio of the change in the performer's prediction score to the change in the detection confidence of the visual concepts. For mathematical details, please refer to the paper and its bibliography.
Experiments
Experiment 1 - Explaining animal detection using body parts as visual concepts
In this experiment we follow the case I approach introduced in the section above. We use the Pascal-Part dataset [8]. For six different animals we learn a binary classification CNN performer and consider the filters in its top convolutional layer to be the detection models. The reason behind this binary classification setting is that we work with interpretable filters: each animal has different body parts, so a different set of filters is learned for each animal. A single body part can be associated with several filters, and we have to sum up the contributions of the involved filters to compute the final quantitative contribution of that part.
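Summing filter contributions per body part can be sketched as a simple grouping step. The mapping from filters to parts and all numbers below are made up for illustration; in the experiment this assignment comes from the interpretable filters themselves.

```python
def part_contributions(alphas, confidences, filter_to_part):
    """Sum alpha_i * y_i over all filters assigned to the same body part."""
    totals = {}
    for a, y, part in zip(alphas, confidences, filter_to_part):
        totals[part] = totals.get(part, 0.0) + a * y
    return totals


# Hypothetical example: four filters, two of which respond to the head.
alphas = [1.0, 2.0, 0.5, 1.0]
confidences = [1.0, 0.5, 2.0, 3.0]
filter_to_part = ["head", "head", "torso", "leg"]

per_part = part_contributions(alphas, confidences, filter_to_part)
```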
For the performer, the authors used the AlexNet, VGG-M, VGG-S, and VGG-16 networks. The explainer was a 152-layer ResNet with a different number of output channels depending on the performer’s architecture. The prior weights were estimated as in case I, with the cross-entropy prior loss and \beta = 10.
Experiment 2 - Explaining general face attributes based on other face attributes
The CelebA dataset [9] with 40 face attributes was used. The performer and the detection models have the VGG-16 structure and share the low-layer features, following the case II approach with the corresponding prior weights estimation and the L2-norm prior loss with \beta = 0.2. The explainer is the same ResNet-152 network as in the first experiment, but now it may produce negative weights. For example, detection of the “male” attribute decreases the prediction score for the “heavy makeup” attribute. Please refer to the supplementary materials for more qualitative and numerical results.
Evaluation
For the qualitative evaluation, we use GradCAM as the baseline and compare the results with the feature map visualization.
The body parts that contribute most to the prediction are also highlighted by GradCAM, which supports the correctness of the proposed method.
For quantitative evaluation the authors introduced three new metrics to assess the correctness of the explanation. Here the baseline denotes a model with the knowledge-distillation loss only.
- Error of the estimated contribution for different parts:
\mathbb{E}\left[\left|\sum_{i \in \Omega_p} \alpha_i y_i - y^\star_p\right|\right] / \mathbb{E}[y], \quad y^\star_p = y \frac{\Delta y_p}{\sum_{p'} \Delta y_{p'}}
- Bias interpreting method (entropy of the contribution distribution):
c_i = |\alpha_i y_i| / \sum_{j} |\alpha_j y_j|
- Accuracy and relative deviation, to measure the information that could not be represented by visual concepts:
|\hat{y}_I - \sum_{i} \alpha_{I,i} y_{I,i} - b| / (\max_{I' \in I} \hat{y}_{I'} - \min_{I' \in I} \hat{y}_{I'})
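To make the entropy of the contribution distribution and the relative deviation more tangible, here is a minimal sketch in plain Python. The function names and the numbers are illustrative assumptions, and the entropy is computed in nats, which is a choice made here rather than something stated above.

```python
import math


def contribution_distribution(alphas, confidences):
    """c_i = |alpha_i * y_i| / sum_j |alpha_j * y_j|."""
    magnitudes = [abs(a * y) for a, y in zip(alphas, confidences)]
    total = sum(magnitudes)
    return [m / total for m in magnitudes]


def entropy(distribution):
    """Shannon entropy in nats; zero entries contribute nothing."""
    return -sum(c * math.log(c) for c in distribution if c > 0)


def relative_deviation(y_hat, alphas, confidences, bias, score_min, score_max):
    """Approximation error relative to the range of performer scores."""
    approx = sum(a * y for a, y in zip(alphas, confidences)) + bias
    return abs(y_hat - approx) / (score_max - score_min)


# An explanation that relies on a single concept has zero entropy ...
concentrated = entropy(contribution_distribution([10.0, 0.0, 0.0, 0.0], [1.0] * 4))
# ... while one that spreads evenly over four concepts reaches log(4).
spread = entropy(contribution_distribution([1.0, 1.0, 1.0, 1.0], [1.0] * 4))

# Hypothetical scores: the performer outputs range from 0 to 5 on the dataset.
deviation = relative_deviation(3.5, [2.0, 1.0, 0.5], [0.5, 1.0, 2.0], 0.25, 0.0, 5.0)
```

A low entropy signals that only very few concepts carry the explanation, which is the symptom of the bias-interpreting problem described earlier.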
To sum up the evaluation results, the explainer approximates the prediction score of the performer well, and the bias-interpreting problem is successfully solved.
Conclusion
The paper proposed a completely new strategy for explaining the predictions of different benchmark CNNs. It decomposes the prediction score into a sum of quantitative contributions of semantically meaningful visual concepts. Moreover, the knowledge distillation method, with the loss extended by a prior-weight term, avoids both hurting the discrimination power of the performer and the bias-interpreting problem of the explainer. The method looks convincing according to the introduced qualitative and quantitative evaluation metrics.
Student's View
In my opinion, the paper is well structured, and the motivation and methodology are in general clearly explained. I found the introduction and related work sections very useful: they contain an overview of existing and trending approaches in network interpretability research and provide a very informative introduction to the field. In particular, the authors emphasize the novelty of the proposed method compared to the common pixel-level network explanations. They developed the theory for two different network settings, which I found a little confusing at first, but the experiments and their description make it clearer. The interpretation of the experiment results is available in the extensive supplementary materials section.
Parts of the paper, such as the prior weights estimation and the experiment 1 section, rely heavily on related work, and hence important details of the methodology are skipped. I had to read three further papers thoroughly to understand the algorithm and experiment 1 in detail.
Among the potential drawbacks I would count the missing discussion and future work sections. Regarding the content, a weak point is that the authors only considered the performer to be a binary classifier. It would be interesting to see whether there are any issues in extending the performer to a multi-class classifier. Moreover, the paper lacks information about the influence of the prior loss and the \beta hyperparameter on the explainer training. Another interesting topic for discussion would be the selection of visual concepts. What would happen if we intentionally introduced some concepts for bias (e.g. sky for the bird category, couch for the cat category) or noise? Could we still explain the performer’s predictions?
Regarding the applicability of the proposed method in the medical field, I am very optimistic. It would be possible to follow the second experiment to explain the detection of a disease that can be described by visual concepts. One concrete example is skin lesion analysis and melanoma detection: there is already a large dataset of skin lesion images annotated with different visual attributes.
References
All images are taken from [10] and modified by myself.
[1] Q. Zhang, W. Wang, and S.-C. Zhu. Examining CNN representations with respect to dataset bias. In AAAI, 2018.
[2] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
[3] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
[4] M. T. Ribeiro, S. Singh, and C. Guestrin. “Why should I trust you?” Explaining the predictions of any classifier. In KDD, 2016.
[5] Q. Zhang, Y. N. Wu, and S.-C. Zhu. Interpretable convolutional neural networks. In CVPR, 2018.
[6] J. Vaughan, A. Sudjianto, E. Brahimi, J. Chen, and V. N. Nair. Explainable neural networks based on additive index models. In arXiv:1806.01933, 2018.
[7] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In arXiv:1503.02531, 2015.
[8] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, 2014.
[9] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015.
[10] R. Chen, H. Chen, G. Huang, J. Ren, and Q. Zhang. Explaining neural networks semantically and quantitatively. In arXiv:1812.07169, 2018.