This blog post gives an overview of the most important state-of-the-art methods for the medical visual question answering (VQA) task.


1. Introduction

Visual question answering (VQA) is the task of answering questions about the content of images. The field has gained a lot of attention in recent years and combines approaches from machine learning, deep learning, natural language processing, and computer vision. Because it requires expertise in all of these areas, it remains a difficult task. Medical visual question answering (MVQA) is a subcategory of VQA. It is a young research area that aims to support physicians and radiologists in making diagnoses on different types of medical images, mainly X-ray, MRI, CT, and pathology images. The motivation behind it is a better workflow in the daily clinical routine. For example, such models could provide a faster diagnosis, and if no abnormality is found, the radiologist may not have to inspect the image at all. Another advantage could be detecting lesions that the radiologist has overlooked. MVQA could also help to address the shortage of human experts by providing a reliable online diagnosis.

The challenges organised by ImageCLEF, the CLEF Cross-Language Image Retrieval Track [1], had a major impact on the MVQA field. They started the first medical visual question answering competition in 2018 and with it introduced the first dataset, VQA-Med-2018 [2]. This marked the beginning of research in the MVQA field. Since then, the challenge has taken place every year, and each edition has introduced a new dataset. As an orientation for this work, the datasets and the winning models of these challenges were investigated. The papers and methods were chosen by their performance (accuracy and BLEU score), and only recent models published between 2020 and now were selected. In total, seven different approaches were chosen, which use four different datasets.

This blog post is structured as follows: the datasets are described in section 2. Section 3 describes the general approach of MVQA models and compares the models along these steps. A conclusion is drawn in section 4, and a personal review concludes this work in section 5.

2. Datasets

Compared to general VQA, there are fewer datasets for MVQA. Most datasets originate from the ImageCLEF challenges, which extract their images from the MedPix database. There are nine important datasets: VQA-Med-2018 [2], VQA-RAD [3], VQA-Med-2019 [4], RadVisDial silver-standard [5], RadVisDial gold-standard [5], PathVQA [6], VQA-Med-2020 [7], SLAKE [8], and VQA-Med-2021 [9]. They all contain either radiology or pathology images. A main difference between the datasets is how the question-answer pairs were created: naturally, meaning manually by physicians, or synthetically, meaning generated from the images and their corresponding captions. Each dataset contains open-ended and closed-ended questions. Open-ended questions can have a variety of correct answers, whereas closed-ended questions have only a few predefined answers, such as yes and no. An overview of the datasets is shown in Table 1.


Dataset | Images | QA Pairs | QA Creation | Type of Image
VQA-Med-2018 [2] | 2,866 | 6,413 | Synthetic | CT
VQA-RAD (2018) [3] * | 315 | 3,515 | Natural | CT, MRI, X-Ray
VQA-Med-2019 [4] * | 4,200 | 15,292 | Synthetic | CT, MRI, X-Ray, US
RadVisDial (2019, silver-standard) [5] | 91,060 | 455,300 | Synthetic | Chest X-Ray
RadVisDial (2019, gold-standard) [5] | 100 | 500 | Natural | Chest X-Ray
PathVQA (2020) [6] * | 4,998 | 32,799 | Synthetic | Pathology
VQA-Med-2020 [7] | 5,000 | 5,000 | Synthetic | CT, MRI, X-Ray, US
SLAKE (2021) [8] * | 642 | 14,000 | Natural | CT, MRI, X-Ray
VQA-Med-2021 [9] | 5,000 | 5,000 | Synthetic | CT, MRI, X-Ray

Table 1: Overview of the datasets used for medical visual question answering tasks. The datasets marked with an asterisk (*) were used by the models described in this work. [10]


For this work, only models that utilise the VQA-RAD [3], VQA-Med-2019 [4], PathVQA [6], and SLAKE [8] datasets were considered, because the accuracy and BLEU scores of the models were best on these datasets. For this reason, only these four datasets are described in more detail.

2.1 VQA-RAD

The VQA-RAD dataset, created in 2018, is one of the first datasets for MVQA tasks. It contains radiology images of the head, chest, and abdomen, extracted from the MedPix database. The whole dataset comprises 315 images with 3,515 question-answer pairs, which were created manually by clinicians. [3]

2.2 VQA-Med-2019

This dataset was created for the 2019 ImageCLEF challenge. The images were also taken from the MedPix database and include not only MRI, CT, and X-ray images but also ultrasound images. The questions are categorized into modality, plane, organ system, and abnormality. Abnormality questions were assigned to the answer-generation task and the remaining categories to the answer-classification task. The dataset is closely related to the VQA-RAD dataset. [4]

2.3 PathVQA

The PathVQA dataset differs from the others in that it contains pathology images instead of radiology images. The images were collected from textbooks and online libraries, and their captions were transformed into question-answer pairs with a semi-automated pipeline. Half of the questions are closed-ended with yes or no as the answer. The remaining questions fall into the categories what, where, how, how much/many, when, and whose. [6]


2.4 SLAKE

SLAKE consists of CT, MRI, and X-ray images of different body parts (head, chest, abdomen, pelvic cavity, and neck) and contains different question types (vision-only, knowledge-based, and bilingual). The images were annotated by doctors. For questions that need external information to produce a correct answer, a knowledge graph is provided. [8]


Figure 1: Overview of the observed datasets. [10]

3. Methods

3.1 General Approach

All methods that were considered can be divided into four steps: an image encoder, a question encoder, a fusion algorithm, and a classifier or generator that produces the answer. For image feature extraction, pretrained models such as ResNet and VGG networks are often used. Common choices for the question features are transformer architectures or LSTM networks. The biggest differences occur in the fusion algorithm: most networks use attention mechanisms together with customized additional components. A minimal code sketch of this framework follows Figure 2.

Figure 2: General Framework of medical visual question answering models. It consists of four steps: image encoder, question encoder, fusion algorithm, and classifier/generator. [10]
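To make this four-step structure concrete, the following minimal PyTorch sketch shows how such a pipeline could be wired together. It is only an illustration under simplifying assumptions (a concatenation-based fusion, a fixed answer vocabulary, and torchvision >= 0.13); it is not the architecture of any specific paper.

```python
# Minimal sketch of the generic MVQA pipeline: image encoder, question encoder,
# fusion, and answer classifier. All sizes and module choices are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class SimpleMVQA(nn.Module):
    def __init__(self, vocab_size=5000, num_answers=500, hidden=1024):
        super().__init__()
        # Image encoder: a pretrained CNN with the classification head removed.
        resnet = models.resnet152(weights="IMAGENET1K_V1")
        self.image_encoder = nn.Sequential(*list(resnet.children())[:-1])  # -> (B, 2048, 1, 1)
        # Question encoder: word embeddings followed by an LSTM.
        self.embedding = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, hidden, batch_first=True)
        # Fusion: here a simple concatenation + MLP; real models use attention.
        self.fusion = nn.Sequential(nn.Linear(2048 + hidden, hidden), nn.ReLU())
        # Answer classifier over a fixed answer vocabulary.
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, image, question_tokens):
        v = self.image_encoder(image).flatten(1)              # (B, 2048) image features
        _, (h, _) = self.lstm(self.embedding(question_tokens))
        q = h[-1]                                             # (B, hidden) question features
        fused = self.fusion(torch.cat([v, q], dim=1))         # joint representation
        return self.classifier(fused)                         # answer logits
```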

3.2 Image Encoder

An image encoder reduces the dimensionality of the data by extracting essential features. Features are easier to process and represent the images in a uniform way; examples of image features are shape, edges, or motion. Most models analysed in this work use a ResNet as image encoder.

The Multimodal BERT Pretraining for Improved Medical VQA (MMBERT) [12] team uses a ResNet-152 and pretrains the network on the ROCO (Radiology Objects in Context) dataset, which contains radiology images with related captions. Afterwards, the model is fine-tuned on a medical VQA dataset [12].

A similar approach is taken by the authors of Cross-Modal Self-Attention with Multi-Task Pre-Training for MVQA (CMSA) [13]. They use three ResNet-34 networks for multi-task pretraining on an external dataset to extract image features, together with a decoder for segmentation and a 3-layer multi-layer perceptron for image classification. The pretraining can be divided into two parts: feature extraction and a test of whether the image is compatible with the question. To model the relation between image and question, cross-modal self-attention (CMSA) is used, which is explained in section 3.4. With this pretraining, the model can classify whether a question matches an image; for example, if the image shows a chest X-ray and the question asks about anomalies in the brain, the two cannot fit together. [13]

Another interesting image-encoder approach is presented by the authors of Contrastive Pre-training and Representation Distillation for MVQA based on Radiology Images (CPRD) [14], who use a teacher-student model. The idea is to train a teacher, a ResNet-8, with a self-supervised learning method on unlabelled data, so that the teacher can differentiate between the three main categories brain, chest, and abdomen. The teacher's knowledge is then distilled into a student model, which learns the intra- and inter-region features. The student model is afterwards used as the image feature extractor for the medical visual question answering task. [14]

MedFuseNet [15] uses a standard ResNet-152 to extract the image features, and the researchers of the Vision-Language Transformer for Interpretable Pathology VQA (TraP-VQA) [16] similarly use a ResNet-50. The Knowledge Embedded Meta-learning (KEML) [11] and From Image type point to Sentence (FITS) [17] teams apply different encoders: KEML takes a VGG-16 trained on ImageNet with a global average pooling strategy [11], while FITS utilises a contextual transformer network (CoTNet-152) to extract image features and type points (head, chest, and abdomen). The type points are then fed to the question encoder. [17]
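As an illustration of the representation-distillation idea behind CPRD, here is a hedged sketch in which a small student network learns to mimic a frozen teacher's image features. The network sizes (ResNet-34 teacher, ResNet-18 student) and the plain MSE objective are assumptions for readability and do not match the paper's exact setup, which uses contrastive pre-training and a lightweight ResNet-8.

```python
# Hedged sketch of teacher-student representation distillation for the image
# encoder: the student is trained to reproduce the frozen teacher's features.
import torch
import torch.nn as nn
import torchvision.models as models

teacher = models.resnet34(weights=None)   # stands in for the self-supervised teacher
teacher.fc = nn.Identity()                # expose the 512-d feature vector
teacher.eval()                            # teacher stays frozen during distillation
for p in teacher.parameters():
    p.requires_grad = False

student = models.resnet18(weights=None)   # smaller student, later used as VQA image encoder
student.fc = nn.Identity()

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
mse = nn.MSELoss()

def distillation_step(images: torch.Tensor) -> float:
    """One training step: the student matches the teacher's representation."""
    with torch.no_grad():
        target = teacher(images)          # (B, 512) teacher features
    pred = student(images)                # (B, 512) student features
    loss = mse(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```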

3.3 Question Encoder

Question encoders work similarly to image encoders: the text is converted into a vector or matrix of features so that it can be processed more easily. As mentioned in the general approach, most networks apply LSTMs or transformers, often in combination with BERT [18]. Of the models in this survey, four use LSTMs and three use transformers.

The question encoders of CMSA [13] and CPRD [14] feed a word embedding into an LSTM network to obtain the question features. In contrast, MedFuseNet [15] uses BERT together with an LSTM for feature extraction, and TraP-VQA [16] uses BioELMo with a bidirectional LSTM. One of the newest methods is BioM-ELECTRA [19], a BERT-like generator-discriminator architecture built from transformer encoder blocks; this technique is used in FITS [17]. KEML [11] and MMBERT [12] take a plain BERT model to extract the language features. A speciality of KEML is its few-shot classifier learning, which consists of meta-training on training samples and meta-testing on support samples. With this technique, the label of a query can be detected. [11]
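A transformer-based question encoder can be sketched with an off-the-shelf BERT model from the Hugging Face transformers library. The checkpoint name and the choice of the [CLS] embedding as pooling are assumptions for illustration, not the exact setup of MMBERT or KEML.

```python
# Hedged sketch of a BERT-based question encoder returning one feature vector.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def encode_question(question: str) -> torch.Tensor:
    """Return a single 768-d feature vector for the question."""
    inputs = tokenizer(question, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    # Use the [CLS] token embedding as the question representation.
    return outputs.last_hidden_state[:, 0]

q_feat = encode_question("Is there evidence of a pleural effusion in this chest x-ray?")
print(q_feat.shape)  # torch.Size([1, 768])
```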

3.4 Fusion Algorithm

The main part of every MVQA model is the fusion algorithm, whose aim is to bring the image and language features together in order to generate an answer. Precisely for this reason, the analysed models differ most in this step. In general, all models except KEML [11] apply an attention mechanism. KEML instead utilises only a block fusion model: it takes the two feature vectors (image and language) as input and projects them into a K-dimensional space. To reduce the number of parameters, the resulting tensor is decomposed into blocks. With that technique, it is possible to capture the fine interactions between the modalities while keeping the mono-modal representations. As an additional component, KEML adds a text-description part with gated graph neural networks: the Natural Language Toolkit extracts the adjectives and nouns of a question as keywords and sets them in relation to the images. The knowledge-graph output, together with the output of the fusion module, is fed into a neural network, and its output is then multiplied with the output of the fusion network. Finally, a relation model compares the support samples and meta-training samples to produce the answer. [11]
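A simplified, hedged sketch of block-wise bilinear fusion in the spirit of KEML's block fusion is shown below: both modalities are projected into a shared space, split into blocks, and a small bilinear interaction is computed per block, which keeps the parameter count far below a full bilinear layer. All dimensions and the number of blocks are assumptions, not the paper's values.

```python
# Simplified block-wise bilinear fusion: per-block bilinear interactions instead
# of one huge bilinear tensor over the full feature vectors.
import torch
import torch.nn as nn

class BlockFusion(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, proj_dim=512, n_blocks=8, out_dim=512):
        super().__init__()
        assert proj_dim % n_blocks == 0 and out_dim % n_blocks == 0
        self.n_blocks = n_blocks
        self.proj_v = nn.Linear(img_dim, proj_dim)   # project image features
        self.proj_q = nn.Linear(txt_dim, proj_dim)   # project question features
        chunk_in, chunk_out = proj_dim // n_blocks, out_dim // n_blocks
        # One small bilinear map per block keeps the number of parameters low.
        self.blocks = nn.ModuleList(
            nn.Bilinear(chunk_in, chunk_in, chunk_out) for _ in range(n_blocks)
        )

    def forward(self, v, q):
        v_chunks = self.proj_v(v).chunk(self.n_blocks, dim=1)
        q_chunks = self.proj_q(q).chunk(self.n_blocks, dim=1)
        fused = [blk(vi, qi) for blk, vi, qi in zip(self.blocks, v_chunks, q_chunks)]
        return torch.cat(fused, dim=1)               # (B, out_dim) joint representation

# fusion = BlockFusion()
# z = fusion(torch.randn(4, 2048), torch.randn(4, 768))   # -> (4, 512)
```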

The CMSA team developed its fusion model based on self-attention. The visual, spatial, and question features are concatenated into a multimodal feature map. This map is projected into three feature maps by 1x1x1 convolutional layers; from two of them, an attention map is calculated with the softmax function, and the attention map is then multiplied with the third feature map to obtain a multimodal representation. The whole procedure is repeated, and a mean pooling is applied at the end. [13]
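The following hedged sketch follows this description: query, key, and value maps are obtained with 1x1 convolutions (a 2-D simplification of the 1x1x1 convolutions mentioned above), a softmax attention map re-weights the values, and mean pooling yields a joint vector. Channel sizes and the residual connection are assumptions.

```python
# Hedged sketch of a cross-modal self-attention block over a fused feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalSelfAttention(nn.Module):
    def __init__(self, channels=512, inner=64):
        super().__init__()
        self.query = nn.Conv2d(channels, inner, kernel_size=1)
        self.key = nn.Conv2d(channels, inner, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, fused_map):
        # fused_map: (B, C, H, W), e.g. visual + spatial + tiled question features
        B, C, H, W = fused_map.shape
        q = self.query(fused_map).flatten(2)             # (B, inner, HW)
        k = self.key(fused_map).flatten(2)               # (B, inner, HW)
        v = self.value(fused_map).flatten(2)             # (B, C, HW)
        attn = F.softmax(q.transpose(1, 2) @ k, dim=-1)  # (B, HW, HW) attention map
        out = v @ attn.transpose(1, 2)                   # (B, C, HW) re-weighted values
        out = out.reshape(B, C, H, W) + fused_map        # residual connection (assumption)
        return out.mean(dim=(2, 3))                      # mean pooling to a joint vector

# attn_block = CrossModalSelfAttention()
# joint = attn_block(torch.randn(2, 512, 7, 7))          # -> (2, 512)
```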

CPRD uses a bilinear attention network (BAN) to fuse the features [14]. The MedFuseNet team created a multimodal factorized bilinear (MFB) algorithm: the image and textual features are each multiplied by a projection matrix, which is learned through matrix factorization into two low-rank matrices. The low-rank matrices and the corresponding feature vectors are multiplied, and everything is fused through sum pooling. [15] The speciality of TraP-VQA is that it uses a transformer encoder and decoder with multi-head attention and a feed-forward network to fuse the features [16]. FITS [17] and MMBERT [12] follow a similar approach, but in contrast to TraP-VQA they use only a transformer encoder.
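A hedged sketch of multimodal factorized bilinear (MFB) pooling, as described for MedFuseNet, could look as follows; the factor k, the dimensions, and the power/L2 normalisation are common choices assumed here for illustration rather than the paper's exact configuration.

```python
# Hedged sketch of MFB pooling: two low-rank projections, an element-wise
# product, and sum pooling over groups of k factors.
import torch
import torch.nn as nn

class MFBFusion(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, out_dim=1000, k=5):
        super().__init__()
        self.k = k
        self.proj_v = nn.Linear(img_dim, out_dim * k)   # low-rank factor for the image
        self.proj_q = nn.Linear(txt_dim, out_dim * k)   # low-rank factor for the question
        self.pool = nn.AvgPool1d(kernel_size=k, stride=k)

    def forward(self, v, q):
        joint = self.proj_v(v) * self.proj_q(q)         # element-wise product, (B, out_dim*k)
        # Average pooling times k equals sum pooling over each group of k factors.
        pooled = self.pool(joint.unsqueeze(1)).squeeze(1) * self.k    # (B, out_dim)
        # Power and L2 normalisation are commonly applied to stabilise training.
        pooled = torch.sign(pooled) * torch.sqrt(torch.abs(pooled) + 1e-8)
        return nn.functional.normalize(pooled, dim=1)

# fusion = MFBFusion()
# z = fusion(torch.randn(4, 2048), torch.randn(4, 768))  # -> (4, 1000)
```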

3.5 Output Mode and Metrics

Most of the evaluated models treat answering as classification; only MMBERT [12] is a pure generation model, and MedFuseNet [15] is the only model that covers both classification and generation. To evaluate and compare the models, accuracy and BLEU score are the most commonly applied metrics. Some models also report precision, recall, AUC-ROC, AUC-PRC, and F-measure. For better comparability, only accuracy is considered here and, where available, the BLEU score. The results are shown in Table 3.
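As a side note before the results: exact-match accuracy and sentence-level BLEU can be computed as in the following hedged sketch. The NLTK smoothing choice and the lower-casing are assumptions; the ImageCLEF challenges use their own official evaluation scripts.

```python
# Hedged sketch of the two comparison metrics: exact-match accuracy for
# classification-style answers and average sentence-level BLEU for generation.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def accuracy(predictions, references):
    """Exact-match accuracy over answer strings."""
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return correct / len(references)

def mean_bleu(predictions, references):
    """Average sentence-level BLEU between generated and reference answers."""
    smooth = SmoothingFunction().method1
    scores = [sentence_bleu([r.lower().split()], p.lower().split(),
                            smoothing_function=smooth)
              for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)

preds = ["no abnormality detected", "left lung"]
refs  = ["no abnormality detected", "right lung"]
print(accuracy(preds, refs), mean_bleu(preds, refs))
```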



Method | VQA-RAD | VQA-Med-2019 | PathVQA | SLAKE
KEML (2020) [11] | – | Acc.: 0.912, BLEU: 0.938 | – | –
MMBERT (2021) [12] | Acc.: 0.72 | Acc.: 0.672, BLEU: 0.69 | – | –
CMSA (2021) [13] | Acc.: 0.732 | – | – | –
CPRD (2021) [14] | Acc.: 0.727 | – | – | Acc.: 0.821
MedFuseNet (2021) [15] | – | Acc.: 0.789, BLEU: 0.276 | Acc.: 0.636, BLEU: 0.605 | –
TraP-VQA (2022) [16] | – | – | Acc.: 0.6482 | –
FITS (2022) [17] | Acc.: 0.765 | – | – | –

Table 3: Results of the analysed methods. [10]

4. Conclusion

Based on the results in Table 3, all compared methods reach an accuracy higher than 0.6. By far the best results are achieved by KEML on the VQA-Med-2019 dataset. Moreover, the two newest MVQA methods perform best on the VQA-RAD and PathVQA datasets, which shows that there is still potential to build better models on top of previous ones. MMBERT and MedFuseNet are also networks to consider, because they reached high scores on two datasets rather than only one.

5. Personal Review

Strengths: All methods showed that it is possible to solve the medical visual question answering task on a specific dataset and to reach accuracy scores higher than 0.63. Every method follows the same four general steps (image encoder, question encoder, fusion algorithm, and generator/classifier), which makes it easier to understand the differences between the methods.

Weaknesses: All models are trained only on small datasets and often on a single specific dataset. Most datasets are produced synthetically, so they may not represent reality well. Additionally, only pathology and radiology image datasets exist; other image types, for example from dermatology or dentistry, are not available. It is difficult to compare the results of the methods because they use different datasets and different metrics, and the BLEU metric does not always reflect the quality of an answer. Furthermore, it is difficult to verify the answers of the models because they are based on deep learning techniques, for which it is hard to analyse why and how exactly they generate an answer.

Future work: It would be interesting to consider additional information about the patient to generate a better answer, for example blood values, medical history, and hereditary factors. Images other than MRI, X-ray, CT, and pathology images should also be taken into account; for example, images of skin cancer or of the eye could support a better diagnosis. Moreover, each model should be trained on several datasets to obtain more reliable results. Furthermore, there is a need for techniques to better interpret the results of deep neural networks, so that the models can be improved and their answers trusted. In general, there is a need for more and bigger datasets, so that the models are not trained for only one specific task.


6. References

[1] ‘ImageCLEF - The CLEF Cross Language Image Retrieval Track | ImageCLEF / LifeCLEF - Multimedia Retrieval in CLEF’. https://www.imageclef.org/ (accessed Jun. 20, 2022).

[2] S. A. Hasan, Y. Ling, O. Farri, J. Liu, H. Müller, and M. Lungren, Eds., ‘Overview of imageCLEF 2018 medical domain visual question answering task’, Proc. CLEF 2018 Work. Notes.

[3] J. J. Lau, S. Gayen, A. Ben Abacha, and D. Demner-Fushman, ‘A dataset of clinically generated visual questions and answers about radiology images’, Sci. Data, vol. 5, p. 180251, Nov. 2018, doi: 10.1038/sdata.2018.251.

[4] A. B. Abacha, S. A. Hasan, V. Datla, J. Liu, D. Demner-Fushman, and H. Müller, ‘VQA-Med: Overview of the Medical Visual Question Answering Task at ImageCLEF 2019’, 2019.

[5] O. Kovaleva et al., ‘Towards Visual Dialog for Radiology’, in Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, Online, Jul. 2020, pp. 60–69. doi: 10.18653/v1/2020.bionlp-1.6.

[6] X. He, Y. Zhang, L. Mou, E. Xing, and P. Xie, ‘PathVQA: 30000+ Questions for Medical Visual Question Answering’. arXiv, Mar. 07, 2020. Accessed: Jun. 20, 2022. [Online]. Available: http://arxiv.org/abs/2003.10286

[7] A. B. Abacha, V. V. Datla, S. A. Hasan, and H. Muller, ‘Overview of the VQA-Med Task at ImageCLEF 2020: Visual Question Answering and Generation in the Medical Domain’, p. 9.

[8] B. Liu, L.-M. Zhan, L. Xu, L. Ma, Y. Yang, and X.-M. Wu, ‘SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering’. arXiv, Feb. 18, 2021. Accessed: Jun. 20, 2022. [Online]. Available: http://arxiv.org/abs/2102.09542

[9] A. B. Abacha, M. Sarrouti, D. Demner-Fushman, S. A. Hasan, and H. Müller, ‘Overview of the VQA-Med Task at ImageCLEF 2021: Visual Question Answering and Generation in the Medical Domain’, p. 8.

[10] Z. Lin et al., ‘Medical Visual Question Answering: A Survey’, 2021, doi: 10.48550/ARXIV.2111.10056.

[11] W. Zheng, L. Yan, F.-Y. Wang, and C. Gou, ‘Learning from the Guidance: Knowledge Embedded Meta-learning for Medical Visual Question Answering’, in Neural Information Processing, Cham, 2020, pp. 194–202. doi: 10.1007/978-3-030-63820-7_22.

[12] Y. Khare, V. Bagal, M. Mathew, A. Devi, U. D. Priyakumar, and C. V. Jawahar, ‘MMBERT: Multimodal BERT Pretraining for Improved Medical VQA’, arXiv, arXiv:2104.01394, Apr. 2021. doi: 10.48550/arXiv.2104.01394.

[13] H. Gong, G. Chen, S. Liu, Y. Yu, and G. Li, ‘Cross-Modal Self-Attention with Multi-Task Pre-Training for Medical Visual Question Answering’. arXiv, Apr. 30, 2021. Accessed: Jun. 10, 2022. [Online]. Available: http://arxiv.org/abs/2105.00136

[14] B. Liu, L.-M. Zhan, and X.-M. Wu, ‘Contrastive Pre-training and Representation Distillation for Medical Visual Question Answering Based on Radiology Images’, in Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, Cham, 2021, pp. 210–220. doi: 10.1007/978-3-030-87196-3_20.

[15] D. Sharma, S. Purushotham, and C. K. Reddy, ‘MedFuseNet: An attention-based multimodal deep learning model for visual question answering in the medical domain’, Sci. Rep., vol. 11, no. 1, Art. no. 1, Oct. 2021, doi: 10.1038/s41598-021-98390-1.

[16] U. Naseem, M. Khushi, and J. Kim, ‘Vision-Language Transformer for Interpretable Pathology Visual Question Answering’, IEEE J. Biomed. Health Inform., pp. 1–1, 2022, doi: 10.1109/JBHI.2022.3163751.

[17] A. Zhang, W. Tao, Z. Li, H. Wang, and W. Zhang, ‘Type-Aware Medical Visual Question Answering’, in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2022, pp. 4838–4842. doi: 10.1109/ICASSP43922.2022.9747087.

[18] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, ‘BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding’. arXiv, May 24, 2019. doi: 10.48550/arXiv.1810.04805.

[19] S. Alrowili and V. Shanker, ‘BioM-Transformers: Building Large Biomedical Language Models with BERT, ALBERT and ELECTRA’, in Proceedings of the 20th Workshop on Biomedical Language Processing, Online, Jun. 2021, pp. 221–227. doi: 10.18653/v1/2021.bionlp-1.24.

