This is the blog post for the paper 'Uncertainty and interpretability in convolutional neural networks for semantic segmentation of colorectal polyps' written by Kristoffer Wickstrøm, Michael Kampffmeyer and Robert Jenssen.

Introduction

Problem Statement

Over the last ten years, neural networks (NNs) have experienced an enormous growth in popularity. Greater available computing power and results that outperform other machine learning methods have led to their wide adoption. However, NNs do not come without flaws. Their major disadvantages are low robustness (slight fluctuations in the input can lead to entirely different predictions), black-box behavior (no apparent reasoning behind predictions) and a lack of uncertainty representation (no measure of an NN's confidence in its prediction). In safety-critical domains like autonomous driving or medicine, all three flaws can lead to potentially fatal situations and therefore need solutions.

Figure 1: Slight perturbations can lead to totally different predictions. Adversarial attacks can make use of that to harm others.

(Adversarial Examples that Fool Detectors [1])


Figure 2: NN trained for disease diagnosis on x-ray images used metal tokens for inference (top right). Accuracy on test images without metal tokens dropped significantly.

(Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs [2])


Proposal

The authors try to equip a Fully Convolutional Network (FCN) with a measure of uncertainty and with means of interpretability. For that they propose a combination of two existing methods, namely Monte Carlo Dropout for the uncertainty part and Guided Backpropagation for the interpretability part. To test their idea in practice, they chose the segmentation of colorectal polyps.

The Problem of Colorectal Polyps

A colorectal polyp is an abnormal growth of tissue in the colon. If not treated, a polyp can evolve into cancer, one of the leading causes of cancer-related deaths [3]. The good news is that polyps can be detected and safely removed by a doctor during a colonoscopy (a visual examination of the inside of a patient's colon). However, as the colon is a very large organ, especially in terms of surface area, it can happen that the doctor misses some polyps. The estimated miss rate lies between 8% and 37% [3]. Each missed polyp can have potentially fatal consequences.

Using the proposed FCN can help detect these polyps and thus decrease the miss rate. As the network acts as a decision support system here, it is important that it provides explanations for its decisions and shows how confident it is in them. In other words, a measure of uncertainty and means of interpretability are necessary.

Methodology

Uncertainty: Monte Carlo Dropout

Bayesian Statistics

When hearing the word uncertainty, the first thing that might come to mind is Bayesian statistics, more specifically the posterior predictive distribution (Figure 3). Let x^* be the input sample, y^* the target, W the weights of the network and D the (training) data. Then the posterior predictive distribution looks as follows: p(y^*|x^*,D) = \int{p(y^*|x^*,W) \, p(W|D) \, dW}. The posterior predictive distribution models the uncertainty in W (i.e., it does not rely on just a single best estimate of W) and would thus be perfect to equip the neural network with an uncertainty property.

Figure 3: Posterior Predictive Distributions. The blue line shows all possible values. 

The wider the line, the higher the uncertainty in that region. (https://stats.stackexchange.com/q/71037 [4])


However, the integral over all possible weights W is computationally intractable, especially for deep neural networks. Usually, neural networks just compute p(y^*|x^*, W) using a single best estimate of W and thereby lose the uncertainty property. 

Monte Carlo Dropout

With Monte Carlo Dropout we try to keep the positive aspect (uncertainty) of the posterior predictive distribution without the negative aspect (computational intractability). To avoid the integral over the enormous weight space, we approximate it using Monte Carlo integration. To break down Monte Carlo integration, just think of it as drawing T samples from the domain of integration, i.e., the weights, evaluating the integrand at those samples, and averaging. The equation then looks as follows: p(y^*|x^*,D) \approx \frac{1}{T} \sum_{t=1}^{T}{p(y^*|x^*,W_t)}.

Now, to draw these T samples of W, dropout comes into play. Recall that dropout deactivates a random subset of neurons in our network and thus creates a set of active weights W_t. If we do that T times, we get our T samples of W.

In practice this looks as shown in Figure 4. 


Figure 4: Monte Carlo Dropout to generate an uncertainty map.

(Uncertainty and Interpretability in Convolutional Neural Networks for Semantic Segmentation of Colorectal Polyps [3])


The input is fed to the network T times with T different weight activations due to dropout. Of course, this leads to T more or less different predictions (as you can see, each run segments the polyp differently, especially at the edges). Then we compute the standard deviation over the T runs and map it to an uncertainty map. Here, bright spots indicate a high standard deviation (i.e., the T predictions differed a lot there) and thus high uncertainty, and vice versa for the dark spots.
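To make this loop concrete, here is a minimal PyTorch sketch of how such an uncertainty map could be produced. The model, the number of samples T, and the single-channel sigmoid output are placeholders of mine, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model, x, T=16):
    """Run T stochastic forward passes with dropout kept active and return
    the mean segmentation and a per-pixel uncertainty map (standard deviation)."""
    model.eval()
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d)):
            m.train()  # keep only the dropout layers stochastic at test time

    with torch.no_grad():
        # (T, batch, 1, H, W): T "more or less different" predictions
        preds = torch.stack([torch.sigmoid(model(x)) for _ in range(T)])

    mean_pred = preds.mean(dim=0)    # averaged segmentation
    uncertainty = preds.std(dim=0)   # bright = the T runs disagree, i.e. high uncertainty
    return mean_pred, uncertainty
```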


Interpretability: Guided Backpropagation

Let's get to the interpretability part. The good thing is that Guided Backpropagation builds on a concept you should be more than familiar with: backpropagation. In backpropagation we propagate the gradients of the loss function back to the inputs using the chain rule. In Guided Backpropagation we do exactly the same, but we clip negative gradients to zero. Let's have a look at a simple pet classifier to demonstrate what is happening.

As it is a classifier, we have Softmax as last activation function: S(y_{cat}) = \frac{e^{y_{cat}}} {\sum_j{e^{y_j}}}

Now we compute the gradients with respect to the input pixels x using the chain rule. Important: we compute the gradients of our desired class (in our case, cat) only.
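As a toy illustration, plain backpropagation to the input for one chosen class could look like the following PyTorch sketch; the classifier, the class index, and the input image are hypothetical placeholders:

```python
import torch

def class_saliency(model, image, target_class):
    """Gradient of the chosen class score with respect to the input pixels."""
    model.eval()
    image = image.detach().clone().requires_grad_(True)  # track gradients on the input
    scores = model(image)                                # (1, num_classes) class scores
    scores[0, target_class].backward()                   # backpropagate only the desired class, e.g. "cat"
    return image.grad                                    # high magnitude = pixel matters for this class
```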


 

Figure 5 shows what happens inside the network (in the second and third image the numbers indicate the magnitude of the gradients):

Figure 5: Backpropagation and Guided Backpropagation operations.

(STRIVING FOR SIMPLICITY [5])


Now comes the key part. Gradients of high magnitude have a large impact on the prediction. In other words, slight changes of pixels that correspond to high gradients will have a large impact on the output of the network. Thus, if we visualize all gradients of the input layer, we end up with the following image that shows us which features are considered important and which are not (Figure 6).

Figure 6: Visualization of Guided Backpropagation. Negative gradients are set to zero. Only focus on features that are important for our desired class. 

(STRIVING FOR SIMPLICITY [5])


The attentive reader might still wonder why we do the negative gradient clipping. The answer is not mathematical; it is rather a heuristic with the intuition of visualizing only the features that speak in favor of the prediction (pixels with large positive gradients) and not the ones that speak against it (pixels with negative gradients). For example, in Figure 5 the eyes speak in favor of the classification as a cat. As I said, this is a heuristic, but it turns out to work very well. In Figure 7 you can see what happens if we do not apply the heuristic, i.e., the visualization of vanilla backpropagation.

Figure 7: Visualization of Backpropagation. The negative gradients suppress the features that speak for the classification of a cat. 

(STRIVING FOR SIMPLICITY [5])
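In code, the only change compared to plain backpropagation is the gradient clipping at every ReLU. A hedged PyTorch sketch using backward hooks, assuming the network uses ordinary (non-inplace) nn.ReLU modules:

```python
import torch
import torch.nn as nn

def guided_backprop(model, image, target_class):
    """Like class_saliency above, but negative gradients are clipped to zero
    each time they pass backwards through a ReLU."""
    handles = []
    for module in model.modules():
        if isinstance(module, nn.ReLU):
            # the normal ReLU backward already zeroes gradients where the forward input
            # was negative; additionally clip negative incoming gradients (the heuristic)
            handles.append(module.register_full_backward_hook(
                lambda mod, grad_in, grad_out: (torch.clamp(grad_in[0], min=0.0),)))

    model.eval()
    image = image.detach().clone().requires_grad_(True)
    model(image)[0, target_class].backward()

    for h in handles:
        h.remove()  # restore the normal backward behaviour
    return image.grad
```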


Sidenote:

Some might wonder how to transfer Guided Backpropagation from the classification case, where we have only one output, to the segmentation case, where we have num_pixels outputs. As this is not made clear in the paper, I asked the author Kristoffer Wickstrøm about it and got the following answer:

You sum up the difference between the predictions and the desired class to get a loss for the entire image, and calculate the gradients of the score for all prediction with respect to the input. The score is related to each neuron in the output layer, which again is related back to the input through the backpropagation algorithm. 

In other words, they create a new loss function by summing over the outputs of all pixels (each a value between 0 and 1 indicating how strongly the pixel is considered part of a polyp) and then treat this loss function just like the loss function of the pet classifier explained above.
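So the only difference to the classifier case is the scalar that gets backpropagated. A minimal sketch under the assumption of a single-channel sigmoid output head (the ReLU hooks from the sketch above would be registered beforehand):

```python
import torch

def guided_backprop_segmentation(model, image):
    """Sum the per-pixel polyp scores into one scalar and backpropagate it,
    so that every output pixel contributes to the input-gradient map."""
    model.eval()
    image = image.detach().clone().requires_grad_(True)
    scores = torch.sigmoid(model(image))  # (1, 1, H, W) per-pixel "polyp-ness" in [0, 1]
    scores.sum().backward()               # one scalar score over the entire image
    return image.grad
```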



Merger: Monte Carlo Guided Backpropagation

If we combine Monte Carlo Dropout with Guided Backpropagation (i.e., we approximate the Posterior Predictive Distribution over the input gradients using Monte Carlo Dropout), we get the following equation (Figure 8):

Figure 8: Predictive Distribution over the input gradients and its approximation with Monte Carlo Dropout


In practice this sums up to the following steps:

  1. Run T forward- and backward passes with random Dropout

  2. Compute standard deviation of T predictions = uncertainty of prediction

  3. Gradients from backward pass indicate importance of pixel

  4. Standard Deviation of gradients from T runs = uncertainty of pixel importance

As a result, the NN can assess the uncertainty of its prediction and the importance of certain input features. In the case of polyp segmentation, this has a twofold advantage: not only can a doctor see whether the NN is confident in its prediction, they can also see why it came to that conclusion. Depending on which features caused the prediction, the doctor can approve or discard it.
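A hedged sketch of the full loop, combining the pieces from the earlier sketches; the dropout-enabling helper and the sigmoid output are my assumptions, and the guided-backprop ReLU hooks are assumed to be registered beforehand:

```python
import torch
import torch.nn as nn

def mc_guided_backprop(model, image, T=16):
    """T stochastic forward/backward passes: mean and std of the predictions give the
    segmentation and its uncertainty; mean and std of the input gradients give the
    feature importance and its uncertainty."""
    model.eval()
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d)):
            m.train()  # keep dropout stochastic for Monte Carlo sampling

    preds, grads = [], []
    for _ in range(T):
        x = image.detach().clone().requires_grad_(True)
        p = torch.sigmoid(model(x))  # per-pixel polyp probability
        p.sum().backward()           # scalar score over all pixels (see the sidenote above)
        preds.append(p.detach())
        grads.append(x.grad.detach())

    preds, grads = torch.stack(preds), torch.stack(grads)
    return (preds.mean(0), preds.std(0),   # step 2: segmentation and its uncertainty
            grads.mean(0), grads.std(0))   # steps 3-4: pixel importance and its uncertainty
```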

Results

Quantitative Results

The authors tested their network on the EndoScene dataset. The dataset consists of 912 RGB images from colonoscopies of 36 patients. The ground truth was labeled by physicians: pixels belonging to a polyp are white and pixels belonging to the colon, i.e., the background, are black.

As evaluation metrics they used the Intersection over Union (IoU) score, IoU(c) = \frac{\sum_i{(\hat{y}_i == c \, \wedge \, y_i == c)}}{\sum_i{(\hat{y}_i == c \, \vee \, y_i == c)}}, and the global accuracy, GlobalAcc = \frac{\sum_n{(\hat{y}_n == y_n)}}{N}.
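Written out as a small NumPy sketch over predicted and ground-truth label maps (the variable names y_hat and y are placeholders of mine):

```python
import numpy as np

def iou(y_hat, y, c):
    """Intersection over Union for class c, computed over all pixels."""
    intersection = np.logical_and(y_hat == c, y == c).sum()
    union = np.logical_or(y_hat == c, y == c).sum()
    return intersection / union

def global_accuracy(y_hat, y):
    """Fraction of pixels whose predicted class matches the ground truth."""
    return (y_hat == y).mean()
```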


Figure 9: Quantitative results. (Uncertainty and Interpretability in Convolutional Neural Networks for Semantic Segmentation of Colorectal Polyps [3])

As these results don't give much insight into the impact of uncertainty and interpretability in NNs, I will not go into detail here. Two aspects that might be interesting, though, are the performance gap between the NNs and a traditional machine learning method (SDEM [6], an algorithm that uses energy maps for polyp segmentation), and the fact that Monte Carlo Guided Backpropagation can be applied to a variety of architectures, the only requirement being that dropout layers exist. (You can find the detailed architectures in Kristoffer's master's thesis, Appendix C [7], on which this paper is based.)


Qualitative Results

Now to the interesting part. As a quick recap, a prediction now consists of three parts:

  1. The segmentation with uncertainty measure (Figure 10 image 3), generated by the Monte Carlo Dropout.
  2. The visualization of feature importance (Figure 10 image 4), generated by Guided Backpropagation.
  3. The visualization of uncertainty of feature importance (Figure 10 image 5), generated by Monte Carlo Guided Backpropagation.


Figure 10: Correct prediction by the network, little uncertainty (image 3), some features misinterpreted (image 4), but high confidence in correctly interpreted features and low confidence in misinterpreted features (image 5). 

(Uncertainty and Interpretability in Convolutional Neural Networks for Semantic Segmentation of Colorectal Polyps [3])  



Figure 11: Wrong prediction by the network, high uncertainty (image 3), most features misinterpreted (image 4), but low confidence in misinterpreted features (image 5). 

(Uncertainty and Interpretability in Convolutional Neural Networks for Semantic Segmentation of Colorectal Polyps [3]) 


It is now clearly visible which features the network uses for its prediction and how certain it is about that prediction. Especially in Figure 11, we can see what happens when the network's prediction is wrong: high uncertainty, the features in the image that confuse the network, and that the network considers these features important for the polyp prediction while being very unsure about them. For a doctor, all this helps to quickly and transparently assess the prediction as wrong and to confidently discard it.

Conclusion

The combination of Monte Carlo Dropout and Guided Backpropagation equips neural networks with very interesting properties. The uncertainty property is important in all safety-critical domains, and the interpretability property is a step away from black-box neural networks. The paper showed the usefulness of the method in the medical domain. However, I think one could also use it as a tool for analyzing and debugging a network. For example, if a network performs considerably worse on certain inputs, one can explore which features of the input are causing trouble. Adding more training samples containing these features can then help improve the network's performance.

Comments

Now some last comments on the paper from my side. It was not always easy to follow the authors' argumentation, especially when it came to the equations. Switching variable names and skipping steps in the derivations (e.g., the approximation of p(W) with q(W) by minimizing the Kullback-Leibler divergence) made it difficult to fully comprehend the math behind the methods. On the positive side, however, the paper provides clear explanations of how to implement the proposed methods in practice and stands out with its many visualizations of results, of which I showed only a fraction in this blog. Furthermore, because I had some questions, I wrote to one of the authors, Kristoffer, and received a quick and detailed explanation, which I appreciated very much; it is always good to know that authors care about their readers' understanding. As a final statement, the paper provided new and interesting insights to me, and if you enjoyed this blog, I strongly recommend having a look at the paper yourself.

References

[1] Lu, Jiajun & Sibai, Hussein & Fabry, Evan. Adversarial Examples that Fool Detectors.

[2] Zech, John & Badgeley, Marcus & Liu, Manway & Costa, Anthony & Titano, Joseph & Oermann, Eric. (2018). Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Medicine. 15. e1002683. 10.1371/journal.pmed.1002683.

[3] Wickstrøm, Kristoffer & Kampffmeyer, Michael & Jenssen, Robert. (2019). Uncertainty and Interpretability in Convolutional Neural Networks for Semantic Segmentation of Colorectal Polyps. Medical Image Analysis. 60. 101619. 10.1016/j.media.2019.101619.

[4] (modified) user25658, What is the difference between posterior and posterior predictive distribution?, URL (version: 2019-06-18): https://stats.stackexchange.com/q/71037

[5] Springenberg, Jost Tobias & Dosovitskiy, Alexey & Brox, Thomas & Riedmiller, Martin. Striving for Simplicity: The All Convolutional Net.

[6] Bernal, Jorge & Núñez, Joan & Sánchez, F. & Vilariño, Fernando. (2014). Polyp Segmentation Method in Colonoscopy Videos by Means of MSA-DOVA Energy Maps Calculation. 8680. 41-49. 10.1007/978-3-319-13909-8_6. 

[7] Wickstrøm, Kristoffer. (2018). Uncertainty Modeling and Interpretability in Convolutional Neural Networks for Polyp Segmentation. FYS-3900 Master's thesis in physics, 60 SP, May 2018.

