“Neural Encoding with Visual Attention”
This is a blog post on the paper 'Neural Encoding with Visual Attention' by
Meenakshi Khosla, Gia H. Ngo, Keith Jamison, Amy Kuceyeski, and Mert R. Sabuncu.
Introduction
This work tries to advance the understanding of the brain with the help of brain-inspired neural networks. It is a computational neuroscience paper, which means it employs a mathematical model, here a deep neural network (DNN), to understand the nervous system. The tackled problem is neural encoding: the study of how neurons represent information, i.e., characterizing the relationship between sensory stimuli or actions and neural signals.
As sensory stimuli, the work uses movies, while the neural signals are represented by fMRI acquisitions. The goal is to model which areas of the cortex (the brain structure associated with cognition) are used to process the movies. The authors take a classical deep neural network to model this and extend it with an attention component.
Humans do not process all the data we are exposed to; we filter what we pay attention to. Thus, an attention component might lead to better performance, as has been demonstrated in natural language processing (Transformers [1]). Moreover, it improves the interpretability of the mathematical model.
Methodology
Model
A typical neural encoding model takes the raw stimuli, performs feature extraction, and then uses a response model to output neural activation patterns. The authors take this classical architecture and add an extra trainable attention module that takes the stimulus representation from the feature extraction step and outputs an attention map, or, in other words, a saliency map. The extracted features are then weighted with the saliency map and passed to the response model. The architecture is depicted in figure 1 below.
Fig. 1: The architecture of the neural encoding model
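To make the data flow concrete, below is a minimal PyTorch-style sketch of the forward pass in figure 1. The function and module names, as well as the tensor shapes, are assumptions for illustration, not the authors' implementation.

```python
import torch

def encode(frames, feature_extractor, attention_module, response_model):
    """Hypothetical forward pass: stimuli -> features -> attention weighting -> response."""
    features = feature_extractor(frames)    # (B, C, H, W) stimulus representation
    saliency = attention_module(features)   # (B, 1, H, W) saliency map, sums to 1 over space
    weighted = features * saliency          # features weighted by the saliency map
    return response_model(weighted)         # predicted fMRI voxel responses
```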
State-of-the-art feature extraction is done via a ResNet-50 [2], an object classification convolutional neural network (CNN) pre-trained on ImageNet [3]. One of the last layers is taken because it represents higher-level feature information; in this case, it is the last residual layer, whose output has dimensions 2048x23x32.
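For illustration, such a feature extractor could be built from torchvision's pre-trained ResNet-50 by dropping its pooling and classification head. The 736x1024 input resolution below is an assumption chosen only so that the output matches the quoted 2048x23x32 shape.

```python
import torch
import torchvision.models as models

# Pre-trained ResNet-50 without its average pooling and classification head,
# i.e., everything up to and including the last residual stage.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()

frame = torch.randn(1, 3, 736, 1024)  # one movie frame (batch, RGB, height, width)
with torch.no_grad():
    features = feature_extractor(frame)
print(features.shape)                  # torch.Size([1, 2048, 23, 32])
```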
The attention module has one convolutional layer with a 5x5 filter for each channel, followed by a ReLU, a blur with a 5x5 Gaussian, and finally a spatial softmax.
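A minimal sketch of such an attention branch is given below. It assumes the 5x5 convolution collapses the 2048 feature channels into a single saliency channel and uses an arbitrary Gaussian sigma; both are assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionModule(nn.Module):
    """Sketch of the attention branch: 5x5 conv -> ReLU -> 5x5 Gaussian blur -> spatial softmax."""

    def __init__(self, in_channels=2048, blur_sigma=1.0):
        super().__init__()
        # 5x5 convolution mapping the feature channels to one raw attention channel
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=5, padding=2)
        # fixed (non-trainable) 5x5 Gaussian kernel for blurring the attention scores
        coords = torch.arange(5).float() - 2
        g = torch.exp(-coords ** 2 / (2 * blur_sigma ** 2))
        kernel = torch.outer(g, g)
        self.register_buffer("blur_kernel", (kernel / kernel.sum()).view(1, 1, 5, 5))

    def forward(self, features):                      # features: (B, 2048, 23, 32)
        a = F.relu(self.conv(features))               # raw attention scores, (B, 1, 23, 32)
        a = F.conv2d(a, self.blur_kernel, padding=2)  # Gaussian blur
        b, _, h, w = a.shape
        a = F.softmax(a.view(b, -1), dim=1)           # spatial softmax over all locations
        return a.view(b, 1, h, w)                     # saliency map summing to 1 over space
```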
The response model is a convolutional network, the architecture of which is depicted in figure 2. Interestingly, the choice of a convolutional network instead of a dense one is a novel approach for the response model. All the typical advantages of a convolutional model over a dense one apply.
Fig. 2: The architecture of the convolutional response model
Data
The data comes from the Human Connectome Project (HCP). The authors use a dataset of 158 volunteers watching four 15-minute movie clips. For each clip, one has the movie itself, the fMRI scans acquired during viewing, and the recorded eye-gaze locations.
The neural response, which the proposed model tries to predict, is given by the fMRI scans. The abbreviation stands for functional magnetic resonance imaging. A scan depicts a 4D neural activation map (3D in space plus time) by measuring blood flow changes in the brain, relying on the association between blood flow and neural activation [11].
The goal is to predict responses common to all humans; therefore, the fMRI scans are averaged across participants, giving a single ground-truth scan per movie frame. The authors have estimated a 4-second hemodynamic delay between scans and frames, meaning that a scan is paired with the frame shown 4 seconds earlier. This comes from the nature of fMRI: the blood flow changes caused by neural activation occur with this so-called hemodynamic delay.
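As a toy illustration of this pairing, the sketch below matches each scan to the latest frame shown at least 4 seconds earlier; the timestamp lists and their sampling are made-up inputs, not the actual HCP timing.

```python
HEMODYNAMIC_DELAY_S = 4  # assumed delay between stimulus and measured response

def pair_scans_with_frames(frame_times, scan_times):
    """Pair each fMRI scan with the index of the frame shown ~4 s earlier."""
    pairs = []
    for scan_idx, scan_time in enumerate(scan_times):
        stimulus_time = scan_time - HEMODYNAMIC_DELAY_S
        if stimulus_time < frame_times[0]:
            continue  # no frame precedes this scan by the full delay
        # latest frame that was on screen at the delayed time point
        frame_idx = max(i for i, t in enumerate(frame_times) if t <= stimulus_time)
        pairs.append((frame_idx, scan_idx))
    return pairs
```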
The eye-gaze locations have been directly measured by eye-tracking. They correspond to visual attention [12].
Evaluation
There are two things of concern: the improvement in neural response prediction and the saliency (attention) prediction.
To evaluate the first, the authors train eight models in total. Three are linear models, included because using a CNN instead of a linear model for response prediction is novel; they differ in the attention used: one without attention, one with a center-weighted attention map, and one with the ground-truth gaze-weighted map. The other five models employ the convolutional response model: one without the initial spatial pooling and attention step (for computational reasons, its dimensionality also had to be quartered), one without attention weighting, another center-weighted one, a gaze-weighted one, and the proposed learned attention.
They then use Pearson's correlation coefficient to compare the predicted output to the measured fMRI. Only the fMRI voxels with high inter-group correlation are taken into account; that is, voxels whose activations vary between participants are discarded. This is acceptable because the purpose of the model is to investigate the visual system common across people.
The prediction accuracy is further compared across the different cortical regions.
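As a concrete illustration of this voxelwise comparison, here is a minimal NumPy sketch of the Pearson correlation between predicted and measured responses; the (time, voxels) array layout is an assumption.

```python
import numpy as np

def voxelwise_pearson(pred, truth):
    """Pearson correlation per voxel for (time, voxels) arrays."""
    pred = pred - pred.mean(axis=0)
    truth = truth - truth.mean(axis=0)
    num = (pred * truth).sum(axis=0)
    den = np.sqrt((pred ** 2).sum(axis=0) * (truth ** 2).sum(axis=0))
    return num / den  # one correlation value per voxel
```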
The evaluation of the saliency prediction consists of comparing the attention maps with the human fixation maps. How to compare them quantitatively is still an active research area, though, so five different indicators have been measured (two of them are sketched in code after the list):
- Similarity or histogram intersection (SIM)
- Pearson's correlation coefficient (CC)
- Normalized scanpath saliency (NSS)
- Area under the ROC curve (AUC)
- Shuffled AUC (sAUC)
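As a rough sketch of how two of these indicators could be computed (assuming a binary fixation map and a dense fixation-density map of the same shape as the saliency map):

```python
import numpy as np

def nss(saliency, fixations):
    """Normalized scanpath saliency: mean z-scored saliency at fixated pixels."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return s[fixations.astype(bool)].mean()

def sim(saliency, fixation_density):
    """Similarity / histogram intersection between two maps normalized to sum to 1."""
    p = saliency / saliency.sum()
    q = fixation_density / fixation_density.sum()
    return np.minimum(p, q).sum()
```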
Three models are taken as baselines:
- The Itti-Koch model: an older popular model that does not use neural networks. [13]
- DeepGaze II [14]: the state-of-the-art neural network model. It extracts high-level features from the pre-trained object classification network VGG19 [10], which are then passed to a network predicting fixations. This second network is trained on gaze data.
- The intensity contrast features (ICF) model [14]: it uses the same readout architecture as DeepGaze II but replaces the deep VGG19 features with simple low-level intensity and intensity-contrast features.
Training
The training of all models is kept the same: the loss function is the mean squared error with respect to the fMRI response, the optimizer is Adam with a learning rate of 0.0004, and training runs for 25 epochs.
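A minimal sketch of this training setup, assuming a PyTorch encoding model and a data loader yielding (frames, fMRI) pairs; both are placeholders, not the authors' code:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)  # learning rate 0.0004
loss_fn = torch.nn.MSELoss()

for epoch in range(25):
    for frames, fmri in train_loader:       # placeholder data loader
        optimizer.zero_grad()
        predicted = model(frames)           # predicted voxel responses
        loss = loss_fn(predicted, fmri)     # mean squared error against the fMRI
        loss.backward()
        optimizer.step()
```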
Results
Neural response
The neural response prediction improves with the use of attention. The ground-truth eye-tracking attention gives the best results, followed by the trainable attention module. Another observation is that the convolutional response model performs better than the linear one.
Among the ventral stream areas, the regions that show significant improvement are the posterior inferior temporal cortex (PIT) and the fusiform face complex (FFC). Parts of the lateral occipital complex (LOC1, LOC2, LOC3) also improve considerably, as does an area in the temporo-parieto-occipital junction.
The PIT and FFC are regions of the ventral stream. The ventral stream is the so-called vision-for-perception pathway, associated with the recognition and discrimination of visual shapes and objects. Its counterpart is the dorsal stream, or vision-for-action pathway, associated with processing the spatial location of objects. They are also called the "what" and "where" pathways. The two pathways are depicted in figure 3.
Fig. 3: The dorsal and ventral stream
In the PIT, the transformation from retinal representation to object representation takes place [9]. The LOC is selective for object shapes, while the FFC deals more with face and body recognition than with object recognition. The temporo-parieto-occipital junction is involved in object recognition and the representation of facial attributes.
Figure 4 shows the accuracy of the different models across different regions in detail.
Fig. 4: Regions-level analysis. Error bars represent 95% confidence intervals around mean estimates.
Saliency prediction
The saliency maps output by the attention module correspond well with the eye-tracking data. Some of the frames and the predicted saliency maps, along with the tracked human fixations, are shown in figure 5. One can observe that the attention module has indeed learned meaningful saliency maps. Nevertheless, there are outliers in the last column. In the bottom triplet, it can be reasoned that human attention is drawn to the speaker in the frame; however, the model has information neither about the audio nor about the previous frames, which might signal the speaker through lip movement. The top scene is quite cluttered, and the human fixations might also have resulted from contextual information in previous frames.
Fig. 5: Qualitative assessment of learned saliency maps.
The quantitative assessment against the chosen baseline models is shown in table 1, with the best results in bold. One can note that the attention network performs on par with the other models. This is especially impressive given that the baselines are trained with eye-gaze data, whereas the attention network from the paper is only implicitly trained to predict saliency from fMRI scans.
Table 1: Saliency evaluation against the baseline models.
Conclusion
The work proposes an extended neural network model for the task of neural encoding of visual stimuli. The extension is an attention module that implicitly learns a saliency map with which the extracted image features are weighted. The authors show that the module improves neural response prediction and that the learned saliency maps correspond well to the ground-truth attention measured by eye-tracking.
Student's review
The work clearly demonstrates the advantages of integrating the proposed attention network into the task. It compares the suggested architecture to well-chosen alternatives and shows clear improvement in neural response prediction when using the saliency map. The qualitative and quantitative analysis of the saliency map prediction is equally well done. Overall, this provides good evidence for the value of modeling attention: it increases performance in neural encoding, implicitly generates additional information regarding saliency, and provides interpretability to the method.
Additionally, the paper stands out with its commentary on ethical considerations. It mentions bias embedded in the data, data privacy issues, and informed consent, all of which should be taken seriously when developing applications based on work in this field.
A weaker point of the paper is that saliency is implied to equal attention, which is common usage in computer science. However, neuroscience treats saliency only as a stimulus property driving one particular type of attention: bottom-up, memory-free, or reactive attention. This attention is drawn to salient objects, i.e., objects that statically stand out from their neighbors. Conversely, top-down, memory-dependent, or voluntary attention is active in dynamic settings, e.g., when looking ahead of moving objects or attending in a task-oriented manner [5, 6, 7, 8].
Another point I found missing is a mention of class activation maps [4]. I would have been interested to see an evaluation of the encoding using a class activation map from the feature extraction network as the attention map.
References:
[1] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention Is All You Need.” In Advances in Neural Information Processing Systems 30, 2017.
[2] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition.” ArXiv:1512.03385 [Cs], December 10, 2015. http://arxiv.org/abs/1512.03385.
[3] Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. “ImageNet Large Scale Visual Recognition Challenge.” International Journal of Computer Vision 115, no. 3 (December 2015): 211–52. https://doi.org/10.1007/s11263-015-0816-y.
[4] Selvaraju, Ramprasaath R., Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. “Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization.” ArXiv:1610.02391 [Cs], October 7, 2016. http://arxiv.org/abs/1610.02391.
[5] Poltoratski, Sonia, Sam Ling, Devin McCormack, and Frank Tong. “Characterizing the Effects of Feature Salience and Top-down Attention in the Early Visual System.” Journal of Neurophysiology 118, no. 1 (April 5, 2017): 564–73. https://doi.org/10.1152/jn.00924.2016.
[6] Schneider, Daniel, Christian Beste, and Edmund Wascher. “On the Time Course of Bottom-up and Top-down Processes in Selective Visual Attention: An EEG Study.” Psychophysiology 49, no. 11 (2012): 1660–71. https://doi.org/10.1111/j.1469-8986.2012.01462.x.
[7] “Introduction to Visual Attention.” In Selective Visual Attention, 1–24. John Wiley & Sons, Ltd, 2013. https://doi.org/10.1002/9780470828144.ch1.
[8] Borji, A., and L. Itti. “State-of-the-Art in Visual Attention Modeling.” IEEE Transactions on Pattern Analysis and Machine Intelligence 35, no. 1 (January 2013): 185–207. https://doi.org/10.1109/TPAMI.2012.89.
[9] Conway, Bevil R. “The Organization and Operation of Inferior Temporal Cortex.” Annual Review of Vision Science 4 (September 15, 2018): 381–402. https://doi.org/10.1146/annurev-vision-091517-034202.
[10] Simonyan, Karen, and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” ArXiv:1409.1556 [Cs], September 4, 2014. http://arxiv.org/abs/1409.1556.
[11] Logothetis, Nikos K. “What We Can Do and What We Cannot Do with FMRI.” Nature 453, no. 7197 (June 2008): 869–78. https://doi.org/10.1038/nature06976.
[12] O’Connell, Thomas P., and Marvin M. Chun. “Predicting Eye Movement Patterns from FMRI Responses to Natural Scenes.” Nature Communications 9, no. 1 (December 4, 2018): 5159. https://doi.org/10.1038/s41467-018-07471-9.
[13] Itti, Laurent, Christof Koch, and Ernst Niebur. “A Model of Saliency-Based Visual Attention for Rapid Scene Analysis.” Pattern Analysis and Machine Intelligence, IEEE Transactions On 20 (December 1, 1998): 1254–59. https://doi.org/10.1109/34.730558.
[14] Kümmerer, M., T. S. A. Wallis, L. A. Gatys, and M. Bethge. “Understanding Low- and High-Level Contributions to Fixation Prediction.” In 2017 IEEE International Conference on Computer Vision (ICCV), 4799–4808, 2017. https://doi.org/10.1109/ICCV.2017.513.