Introduction
Radiology images are essential for diagnosing diseases and conditions of the heart, lungs, bones, and other structures in the chest. They are also used to monitor support devices such as pacemakers. Interpreting the subtle details in these images and producing a report of the findings requires expert radiologists. However, the volume of radiology imaging keeps growing while the workforce available to interpret it remains limited: Rwanda, for example, had only 11 radiologists for a population of 12 million in 2015 (1). Report writing is also time-consuming, so automating radiology report generation is a natural way to speed up the radiologists' workflow. The task is similar to image captioning, where an input image is used to generate output text, with two important differences: reports require longer output and follow more sensitive textual patterns, and the datasets contain many normal cases, while abnormal findings, when present, occupy only a small region of the image. Relying solely on image-report pairs to train models is therefore limited, which is why the idea of leveraging external knowledge emerges. Hence, we explore a selected number of papers, examine how they leverage external knowledge, and discuss potential directions for future work.
Related Work
Image Captioning
Image captioning is a popular task in which the model outputs text describing the content of an image, as shown in Figure 1. It is therefore important to understand the relationship between visual and textual signals. A number of works have achieved remarkable performance on this task. mPLUG (26) introduces novel cross-modal skip-connections and trains with both discriminative and generative objectives. ExpansionNet v2 (27) introduces a novel Expansion Mechanism, distinct from attention (8), which changes the original input length by distributing the input over an arbitrary number of elements during the forward pass and then retrieves the original length during the backward pass. BLIP-2 (28) combines cross-modal signals using a Querying Transformer, while PaLI (29) reuses existing Transformer-based models and jointly scales the visual and textual components. GIT (30) proposes a simple architecture and instead scales the data and the model, without relying on object detectors, object tags, or optical character recognition. M2 (2) uses mesh-like connectivity between encoder and decoder, as shown in Figure 1, to encode both low- and high-level features, and GRIT (5) uses only transformers and extracts two kinds of image features during training.
Radiology Report Generation
Radiology report generation is a task in which the model is given an input X-ray image and produces a report of findings evaluating the presence of different pathologies, as shown in Figure 2. It differs from image captioning in that it requires longer output text and more domain-specific knowledge. Another factor is the limited amount of data, which is constrained by security and privacy concerns. Notably, METransformer (35) introduces learnable "expert" tokens that attend to different image regions and are also used in cross-attention between words and visual tokens. VLCI (36) proposes mitigating visual-linguistic confounders through causal front-door intervention. R2GenCMN (3), on the other hand, leverages a cross-modal memory network.
Radiology Report Generation Using Knowledge
Incorporating additional external knowledge is a key ingredient for improving performance on the radiology report generation task. One approach, M2KT (31), extracts knowledge from visual and textual features and stores it in a memory module. Similarly, another paper (32) uses a cross-modal memory to extract and store the alignment between images and reports. PPKED (33) imitates the workflow of radiologists: it assigns disease labels to the abnormal regions and then relies on prior and posterior medical knowledge to generate the report. Another paper (34) relies on graph convolutional neural networks to model a pre-constructed graph embedding and also proposes a novel evaluation metric. In the upcoming discussion, we take a closer look at two models, KAD (14) and KiUT (20), which rely on medical knowledge graphs, and at XPRONET (25), which uses a cross-modal prototype matrix.
Background
In this section, we give a brief overview of the main concepts that come up in the following discussion of the models. These concepts are well established and not specific to the discussed models, so we do not cover them in detail.
Architectures
The encoder-decoder architecture is a well-established way to learn an intermediate representation, or latent space. It is used in natural language processing tasks such as text classification, machine translation, question answering, and text generation (8, 36, 42), and it is also widely used for vision tasks such as image segmentation, object detection, and image captioning (6, 43, 14). In essence, the encoder maps an input sequence to an intermediate representation space, and the decoder takes this latent representation and generates the output. U-Net (6) is a good example: an encoder in the contracting path captures the context of the image, and a decoder in the expansive path recovers the spatial information, as shown in Figure 3. The additional use of skip connections helps retain fine-grained information.
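To make the idea concrete, the following is a minimal sketch of an encoder-decoder with a single skip connection, loosely in the spirit of U-Net (6); the channel sizes, depth, and layer choices are illustrative assumptions, not the original architecture.

```python
import torch
import torch.nn as nn


class TinyEncoderDecoder(nn.Module):
    def __init__(self, in_ch=1, hidden=16, out_ch=2):
        super().__init__()
        # Encoder (contracting path): capture context while reducing resolution.
        self.enc = nn.Sequential(nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(hidden, hidden * 2, 3, padding=1), nn.ReLU())
        # Decoder (expansive path): recover spatial resolution.
        self.up = nn.ConvTranspose2d(hidden * 2, hidden, 2, stride=2)
        # After concatenating the skip connection the channel count doubles.
        self.dec = nn.Sequential(nn.Conv2d(hidden * 2, hidden, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(hidden, out_ch, 1)

    def forward(self, x):
        skip = self.enc(x)                    # fine-grained features kept for later
        z = self.bottleneck(self.down(skip))  # coarse, context-rich latent
        up = self.up(z)
        # Skip connection: concatenate encoder features with upsampled features.
        out = self.dec(torch.cat([up, skip], dim=1))
        return self.head(out)


x = torch.randn(1, 1, 64, 64)         # e.g. a single-channel image
print(TinyEncoderDecoder()(x).shape)  # torch.Size([1, 2, 64, 64])
```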
Another issue that arises in deeper networks is vanishing and exploding gradients, which hinder training. ResNet (7) proposes residual blocks, as shown in Figure 4, to train deeper networks and demonstrates the performance gains obtained with increased depth. The identity, or skip, connection makes it easy for a residual block to fall back to the identity mapping, and the authors provide empirical evidence that residual blocks are easier to optimize.
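A minimal sketch of a residual block in the spirit of ResNet (7) is shown below: the block computes a residual F(x) and adds the identity back; the exact channel counts are illustrative.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        # If the residual is driven to zero, the block reduces to the identity,
        # which is what makes deeper stacks easier to optimize.
        return self.relu(x + residual)


x = torch.randn(2, 64, 32, 32)
print(ResidualBlock()(x).shape)  # torch.Size([2, 64, 32, 32])
```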
Transformer
The Transformer (8) is one of the recent breakthroughs in deep learning and opened the door for a tremendous number of models that set new state-of-the-art performance on natural language processing tasks such as text classification, machine translation, question answering, text generation, and image captioning (10, 36, 37, 28). The same holds for vision tasks through vision transformers, for example in image classification, segmentation, and object detection (17, 38, 39, 40, 41). As shown in Figure 5, the Transformer uses an encoder-decoder architecture in which an input sequence generates an output sequence one element at a time, and each output element is fed back as input for generating the next element in an auto-regressive (9) manner. Each encoder layer consists of multi-head self-attention and a fully connected feed-forward network, each wrapped in a residual connection (7). Each decoder layer has the same two components plus an additional masked multi-head attention sub-layer, which does not attend to subsequent positions and thus ensures that only known positions are used. Attention maps a query and a set of key-value pairs to an output. Multi-head attention applies attention several times with different learned parameters, allowing the model to attend to different representations at different positions. Positional encodings are added to preserve sequence order, since the model has no recurrence or convolution. Self-attention is favorable because it reduces per-layer complexity, enables parallel computation, and allows learning long-range dependencies.
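The core operation is scaled dot-product attention as described in (8); the sketch below also shows the causal mask used by the decoder's masked attention. Tensor shapes are illustrative.

```python
import math
import torch


def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d_k); mask: optional (seq_len, seq_len) of bools
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Masked attention (as in the decoder) forbids attending to future positions.
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v


q = k = v = torch.randn(1, 5, 8)
causal_mask = torch.triu(torch.ones(5, 5, dtype=torch.bool), diagonal=1)
out = scaled_dot_product_attention(q, k, v, causal_mask)
print(out.shape)  # torch.Size([1, 5, 8])
```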
Language Models
Language models have been very successful in many downstream tasks, such as text classification, machine translation, question answering, and text generation (10, 36, 37, 28). Bidirectional Encoder Representations from Transformers (BERT) (10) is a language model pre-trained on unlabelled sentence pairs and fine-tuned with labeled data of the downstream task, as shown in Figure 6. Importantly, it does not require any architectural change during fine-tuning. BERT employs a masked language modelling objective (11) to train deep bidirectional representations: a random percentage of the input tokens is masked and the model is trained to predict them. In addition, it uses next sentence prediction, where the model is trained on sentence pairs to predict whether the second sentence follows the first. The model is a Transformer encoder that follows the implementation in (8).
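The following toy sketch illustrates the masking step of the masked language modelling objective: roughly 15% of token positions are replaced with a [MASK] id and only those positions contribute to the loss. The token ids and the single-replacement scheme are simplifying assumptions (BERT additionally mixes in random and unchanged tokens).

```python
import torch

MASK_ID = 103      # assumed id of the [MASK] token in the vocabulary
MASK_RATIO = 0.15  # BERT masks roughly 15% of input tokens


def mask_tokens(input_ids):
    labels = input_ids.clone()
    # Choose ~15% of positions to mask.
    mask = torch.rand(input_ids.shape) < MASK_RATIO
    # Positions that are not masked are ignored by the loss (label -100 in PyTorch).
    labels[~mask] = -100
    masked_inputs = input_ids.clone()
    masked_inputs[mask] = MASK_ID
    return masked_inputs, labels


ids = torch.randint(1000, 2000, (2, 12))  # a toy batch of token ids
masked, labels = mask_tokens(ids)
print(masked[0])
print(labels[0])
```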
Contrastive Learning
Contrastive learning is an approach to representation learning in which the contrastive loss maximizes the agreement between positive pairs while minimizing the agreement between negative pairs. One example is CLIP (37), which uses an image-text contrastive loss to predict the correct pairings of images and text. The model jointly trains an image encoder and a text encoder to learn a multimodal embedding space: given a batch of image-text pairs, it maximizes the cosine similarity between the real pairs while minimizing it for all other, incorrect pairings. Similarly, ConVIRT (13) learns medical visual representations using bidirectional contrastive learning with both an image-to-text and a text-to-image contrastive loss, as shown in Figure 7.
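A minimal sketch of such a bidirectional image-text contrastive loss, in the style of CLIP (37) and ConVIRT (13), is shown below; the embedding dimension, batch size, and temperature are illustrative choices, not the papers' exact settings.

```python
import torch
import torch.nn.functional as F


def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Cosine similarity = dot product of L2-normalized embeddings.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature    # (batch, batch)
    targets = torch.arange(logits.size(0))          # positives are on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)     # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets) # text-to-image direction
    return (loss_i2t + loss_t2i) / 2


img = torch.randn(8, 256)  # image embeddings from the image encoder
txt = torch.randn(8, 256)  # report embeddings from the text encoder
print(image_text_contrastive_loss(img, txt))
```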
Knowledge-enhanced Visual-language Pre-training on Chest Radiology Images
This paper (14) introduces a novel model, KAD, which leverages a medical knowledge graph as a source of external knowledge to obtain a knowledge encoder. The knowledge encoder is then used to extract textual features from reports, guide the visual encoder, and provide the queries for the transformer decoder that generates the report.
In more detail, they leverage the Unified Medical Language System (UMLS) (15) to provide the model with a knowledge base. Each node in the graph contains a concept, its definition, a Concept Unique Identifier (CUI), synonyms, and a semantic type. A concept and its definition constitute a concept-definition pair, which is used to fine-tune the knowledge encoder. The edges represent the relations between two concepts and form the other elements used for fine-tuning, called concept-relation-concept triplets. They start from the English version of PubMedBERT (16) as the base knowledge encoder and fine-tune it on the UMLS graph using contrastive learning, maximizing the similarities between concept-definition pairs and concept-relation-concept triplets, which amounts to minimizing the distances between embeddings that refer to the same CUI. Once this fine-tuning is done, the knowledge encoder is ready to be used in different parts of the pipeline.
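As a purely illustrative sketch, UMLS entries could be serialized into text views that the knowledge encoder embeds and pulls together when they refer to the same CUI; the field names, serialization format, and placeholder CUIs below are assumptions, not KAD's exact scheme.

```python
from dataclasses import dataclass


@dataclass
class Concept:
    cui: str         # Concept Unique Identifier (placeholder values below)
    name: str
    definition: str


def concept_definition_pair(c: Concept):
    # Two views of the same CUI: their embeddings should be pulled together.
    return c.name, c.definition


def concept_relation_triplet(head: Concept, relation: str, tail: Concept):
    # A triplet is flattened to text so the knowledge encoder can embed it.
    return f"{head.name} {relation} {tail.name}"


effusion = Concept("C0000001", "pleural effusion",
                   "accumulation of fluid in the pleural cavity")
lung = Concept("C0000002", "lung", "organ of respiration")
print(concept_definition_pair(effusion))
print(concept_relation_triplet(effusion, "located in", lung))
```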
During pre-training, they employ either ResNet-50 (7) or ViT-16 (17) as the image encoder, and image-to-text and text-to-image contrastive losses are used on pairs of images and reports. In addition, each report is processed to extract its entities, which are then passed to the knowledge encoder. An entity can be an anatomy, such as "lung", or an observation, such as "effusion", together with whether it is present or absent. The most common entities in the reports are compiled into an entity set Q. Three extraction methods are evaluated: heuristic rules, RadGraph (18), and ChatGPT (19). Heuristic rules are a manual way of extracting the entities, RadGraph is a tool for extracting radiology entities and their presence, and ChatGPT is a large language model that is asked to return a list of findings in the report, which is then processed to determine presence.
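To give a flavour of the rule-based option, the toy sketch below extracts entities from a report and flags them as present or absent based on negation cues; the entity list and negation cues are illustrative assumptions, not the rules used in (14).

```python
ENTITY_SET_Q = ["effusion", "pneumothorax", "cardiomegaly", "edema"]
NEGATION_CUES = ["no ", "without ", "free of "]


def extract_entities(report: str):
    report = report.lower()
    findings = {}
    for sentence in report.split("."):
        for entity in ENTITY_SET_Q:
            if entity in sentence:
                # Mark the entity absent if a negation cue appears in the same sentence.
                negated = any(cue in sentence for cue in NEGATION_CUES)
                findings[entity] = 0 if negated else 1
    return findings


print(extract_entities("There is a small right pleural effusion. No pneumothorax."))
# {'effusion': 1, 'pneumothorax': 0}
```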
The Disease Query Network (DQN) consists of standard Transformer decoders (8). It receives keys, values, and queries and outputs pathology probabilities. A Random Select module randomly assigns the textual and visual features interchangeably as keys and values, meaning the visual features can serve as either keys or values. The entity set Q is encoded by the knowledge encoder and passed to the DQN, which outputs the presence labels and is trained with a binary cross-entropy loss. The final loss combines the image-text contrastive loss and the binary cross-entropy loss of the DQN.
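The sketch below shows the general shape of such a disease query module: encoded entity queries attend to visual features through a transformer decoder, and a linear head predicts presence with binary cross-entropy. The dimensions, single decoder layer, and omission of the Random Select step are simplifying assumptions.

```python
import torch
import torch.nn as nn

d_model, n_entities, n_patches = 256, 6, 49

decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
dqn = nn.TransformerDecoder(decoder_layer, num_layers=1)
presence_head = nn.Linear(d_model, 1)

entity_queries = torch.randn(1, n_entities, d_model)  # from the knowledge encoder
visual_features = torch.randn(1, n_patches, d_model)  # from the image encoder (key/value)

hidden = dqn(tgt=entity_queries, memory=visual_features)
logits = presence_head(hidden).squeeze(-1)             # (1, n_entities)

labels = torch.randint(0, 2, (1, n_entities)).float()  # presence label per entity
loss = nn.BCEWithLogitsLoss()(logits, labels)
print(logits.shape, loss.item())
```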
For zero-shot inference, the model receives an input X-ray image and a list of pathologies as the query and outputs the probability that each queried pathology is present in the image.
They pre-train the model on MIMIC-CXR (44) and evaluate it on four different datasets: PadChest (45), NIH ChestX-ray14 (46), CheXpert (47), and ChestX-Det10 (48). KAD outperforms SOTA models in the zero-shot setting and achieves results comparable to fully supervised models and to radiologists. It also performs well in the out-of-distribution setting and exceeds self-supervised models, as shown in Figure 9, Figure 10, and Figure 11.
The model exploits intrinsic and extrinsic relationships between image regions. It also allows an arbitrary list of pathologies as the input query, works on unseen pathologies, provides the best zero-shot performance, and produces grounded heatmaps. On the other hand, minor limitations in the zero-shot and data-transfer settings are that it requires a small validation set for hyper-parameter tuning and cannot adapt to all unseen pathologies, especially those unrelated to existing ones.
KiUT: Knowledge-injected U-Transformer for Radiology Report Generation
This paper (20) proposes a model, KiUT, that is similar to KAD (14) in terms of the knowledge base used, which is a medical knowledge graph in both cases. However, KiUT uses a pre-defined medical graph compiled by experts to obtain clinical knowledge. The clinical, visual, and contextual signals are then combined to guide the final decoding step that generates the report.
The pipeline starts by extracting visual features using a ResNet-101 (7) pre-trained on ImageNet (21). The patch features are then passed to an encoder-decoder that follows the Transformer architecture. The encoder and decoder layers are connected using skip connections, which they call U-connections, hence the name U-Transformer. To make use of the extrinsic and intrinsic relationships among image regions, they propose an encoder with a number of adjustments. Extrinsic relations refer to the locations of organs with respect to each other, while intrinsic relations refer to cause-and-effect relations between entities. To preserve the spatial extrinsic information, they incorporate relative region coordinates following (22, 23); for the intrinsic relations, they add learnable parameters to self-attention following (2).
Figure 13 shows the different connection schemes they compared. The "last" connection means that the last encoder layer connects to the first decoder layer. The 1-to-1 connection means that the first encoder layer connects to the first decoder layer and the remaining layers follow the same pattern. The meshed connection connects each encoder layer with all decoder layers. The U-connection, which they use in the pipeline, connects the first encoder layer with the last decoder layer, with the remaining layers following the same pattern.
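The schematic sketch below illustrates only the wiring of these four patterns; the layers are placeholder identity modules and the way encoder outputs are combined (summation, addition to the decoder input) is an assumption for illustration, not KiUT's actual fusion.

```python
import torch
import torch.nn as nn

L = 3  # number of encoder and decoder layers
enc_layers = nn.ModuleList(nn.Identity() for _ in range(L))
dec_layers = nn.ModuleList(nn.Identity() for _ in range(L))

x = torch.randn(1, 4, 8)
enc_outputs = []
for layer in enc_layers:
    x = layer(x)
    enc_outputs.append(x)


def source_for(pattern, i):
    # "last": a single connection from the last encoder layer.
    if pattern == "last":
        return enc_outputs[-1]
    # "1-to-1": decoder layer i is paired with encoder layer i.
    if pattern == "1-to-1":
        return enc_outputs[i]
    # "meshed": decoder layer i sees all encoder outputs (here simply summed).
    if pattern == "meshed":
        return torch.stack(enc_outputs).sum(dim=0)
    # "u-connection": decoder layer i is paired with encoder layer L-1-i.
    if pattern == "u-connection":
        return enc_outputs[L - 1 - i]


y = torch.zeros_like(x)
for i, layer in enumerate(dec_layers):
    y = layer(y + source_for("u-connection", i))
print(y.shape)
```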
A pre-trained BERT (10), fine-tuned on the reports and coupled with masked multi-head attention (8), is used to extract the textual features. With the help of experts, a medical knowledge graph was constructed to serve as the knowledge base. For each image, a pre-trained TorchXRayVision classification model (24) produces a probability distribution over the symptoms in the graph, and a graph attention network (GAT) is then used to obtain the knowledge signal. Finally, the three signals are combined with the output of the last decoder layer to generate the final report. An overview of the proposed pipeline is shown in Figure 12.
They test the model on the IU-Xray (49) and MIMIC-CXR (44) datasets, and it improves over SOTA models in most cases, except on the smaller dataset where XPRONET (25) performs better, as shown in Table 1.
One advantage of KiUT is that it performs well on large datasets and exploits intrinsic and extrinsic relationships between image regions. However, its performance lags somewhat on smaller datasets.
Cross-modal Prototype Driven Network for Radiology Report Generation
This paper (25) proposes a novel end-to-end cross-modal prototype driven network called XPRONET. It learns a cross-modal prototype matrix, a form of intermediate representation combining visual and textual features, which serves as the external knowledge. In essence, the model combines the knowledge signal from the cross-modal prototype matrix with the visual and textual signals, and the knowledge-enhanced features are then passed to the Transformer encoder and decoder layers to generate the report, as shown in Figure 14.
Initially, the training data is used to initialize the cross-modal prototype matrix, where the visual features are extracted using ResNet-101 (7) and the textual features using BERT (10). Because the cross-modal prototypes require a category label per sample, they leverage CheXbert (47) to generate a pseudo label for each image, which serves as the category label used for report generation. K-Means (12) is then used to cluster the features, and the average of each cluster is taken. Finally, the matrix is initialized with the visual and textual feature groups of each class. The prototype matrix is explicitly shared and captures intermediate representations.
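A minimal sketch of this initialization step is given below: features of each (pseudo-)class are clustered with K-Means and the cluster centers become that class's prototype vectors. The matrix sizes, the use of scikit-learn, and the random placeholder features are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

n_classes, n_protos, feat_dim = 4, 5, 128
features = np.random.randn(1000, feat_dim)              # placeholder visual/textual features
pseudo_labels = np.random.randint(0, n_classes, 1000)   # e.g. from an automatic labeler

prototype_matrix = np.zeros((n_classes, n_protos, feat_dim))
for c in range(n_classes):
    class_feats = features[pseudo_labels == c]
    km = KMeans(n_clusters=n_protos, n_init=10).fit(class_feats)
    # Each cluster center becomes one prototype vector for class c.
    prototype_matrix[c] = km.cluster_centers_

print(prototype_matrix.shape)  # (4, 5, 128)
```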
During pre-training, image features are extracted using ResNet-101 (7) and concatenated to form regions that are treated as visual word tokens. Cross-modal prototype querying embeds cross-modal information into the single-modal features by measuring the similarity between the single-modal representation and the cross-modal prototype vectors of the same class. Cross-modal prototype responding transforms the prototype vectors into the representation space of the corresponding query vectors, and a feature interaction module combines the single-modal features with the cross-modal responses. Finally, the knowledge-enhanced visual signal is passed to the encoder and the knowledge-enhanced textual signal to the decoder, which generates the report one word at a time, with each output word used to predict the next.
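The sketch below illustrates the querying/responding idea in its simplest form: a single-modal feature is compared with its class's prototypes, and a similarity-weighted combination of prototypes is fused back into the feature. The similarity measure and the additive fusion are simplifying assumptions, not XPRONET's exact formulation.

```python
import torch
import torch.nn.functional as F

n_protos, feat_dim = 5, 128
prototypes = torch.randn(n_protos, feat_dim)  # prototypes of the sample's class
feature = torch.randn(1, feat_dim)            # a single-modal (visual or textual) feature

# Querying: similarity between the feature and each prototype vector.
sims = F.cosine_similarity(feature, prototypes, dim=-1)  # (n_protos,)
weights = torch.softmax(sims, dim=0)

# Responding: a weighted cross-modal response built from the prototypes.
response = weights @ prototypes                           # (feat_dim,)

# Feature interaction: fuse the single-modal feature with the response.
enhanced = feature + response
print(enhanced.shape)  # torch.Size([1, 128])
```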
XPRONET improves over the SOTA models of its time on the IU-Xray (49) and MIMIC-CXR (44) datasets, as shown in Table 2.
One advantage of the model is that it performs well on small datasets thanks to the prototype matrix. However, the cross-modal prototype matrix is difficult to learn on large datasets. To address this, they propose a multi-label contrastive loss in which samples sharing at least one label are considered a positive pair.
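The sketch below shows one way such a multi-label contrastive loss can be written: two samples form a positive pair if their multi-hot label vectors share at least one active label. The temperature and the supervised-contrastive formulation are illustrative assumptions rather than the paper's exact weighting.

```python
import torch
import torch.nn.functional as F


def multi_label_contrastive_loss(embeddings, labels, temperature=0.1):
    # embeddings: (N, D); labels: (N, C) multi-hot vectors
    z = F.normalize(embeddings, dim=-1)
    sims = z @ z.t() / temperature                          # (N, N) similarity logits
    # Positive mask: pairs sharing at least one label, excluding self-pairs.
    shared = (labels.float() @ labels.float().t()) > 0
    eye = torch.eye(len(labels), dtype=torch.bool)
    pos_mask = shared & ~eye
    # Softmax over all other samples; self-similarity is excluded.
    logits = sims.masked_fill(eye, float("-inf"))
    log_prob = F.log_softmax(logits, dim=1).masked_fill(eye, 0.0)
    n_pos = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / n_pos        # average over positive pairs
    return loss.mean()


emb = torch.randn(6, 64)
lab = torch.randint(0, 2, (6, 4))  # 4 possible findings per sample
print(multi_label_contrastive_loss(emb, lab))
```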
Comparison
The three models all leverage BERT (10) in one or more steps of their pipelines, as the base for a text or knowledge encoder that learns embeddings, and they all use the ResNet (7) architecture for image feature extraction. KAD (14) and KiUT (20) both use a medical knowledge graph as the external knowledge signal, whereas XPRONET (25) uses a cross-modal prototype matrix initialised on the training data. KAD uses the knowledge signal to guide the visual encoder, KiUT uses it in the final decoding step, and XPRONET uses it to guide the internal representation of visual and contextual features during the cross-modal prototype querying and responding process.
Review
KAD (14) provides a simple architecture, uses a well-established and well-structured knowledge graph representation, and supports zero-shot inference. Using the definition of each concept and the relations between concepts provides a rich source of knowledge to the model, which may explain the strong zero-shot performance, since it likely yields more accurate embeddings. Additionally, the Random Select module makes the model more robust, which is desirable.
KiUT (20), on the other hand, employs a more complex architecture, which raises many questions when trying to evaluate each part. For example, the authors measure the effect of removing different inputs to the IK-Distiller, but they do not analyse its internal structure, such as the effect of using the knowledge signal in the last decoding step. In addition, the input image is processed twice: once to obtain the symptom probabilities and once to extract the visual features.
XPRONET (25) could be considered somewhat lacking in generalisability, since its performance drops on larger datasets. This could be attributed to the fact that it does not provide explicit relationships between concepts and only uses visual and textual signals to construct the knowledge signal. Another reason could be that it only uses the training dataset to initialize the cross-modal prototype matrix, which might cause a drop in performance in the out-of-distribution setting.
From a different angle, none of these models undergo a bias analysis to monitor performance on under-represented groups and ensure fairness. It has been shown that some models can classify patients' race from chest X-ray images without being trained for that task and, more importantly, without ever being given race labels (4).
Although some work explores the effect of overly complex knowledge (33) and of adopting knowledge directly (50), it would be important to further explore the effect of the knowledge representation and of the knowledge base size, and how these affect zero-shot inference.
Conclusion
We have discussed three models that use medical knowledge to generate radiology reports: two that represent the knowledge with external medical knowledge graphs, and one that uses a different kind of knowledge, a matrix holding an intermediate representation of the visual and textual information in the training data. These models improve state-of-the-art performance and provide insights for future work.
References
- Rosman, D., Nshizirungu, J., Rudakemwa, E., Moshi, C., de Dieu Tuyisenge, J., Uwimana, E. and Kalisa, L., 2015. Imaging in the land of 1000 hills: Rwanda radiology country report. Journal of Global Radiology, 1(1).
- Cornia, M., Stefanini, M., Baraldi, L. and Cucchiara, R., 2020. Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10578-10587).
- Chen, Z., Shen, Y., Song, Y. and Wan, X., 2022. Cross-modal memory networks for radiology report generation. arXiv preprint arXiv:2204.13258.
- Banerjee, I., Bhimireddy, A.R., Burns, J.L., Celi, L.A., Chen, L.C., Correa, R., Dullerud, N., Ghassemi, M., Huang, S.C., Kuo, P.C. and Lungren, M.P., 2021. Reading race: AI recognises patient's racial identity in medical images. arXiv preprint arXiv:2107.10356.
- Nguyen, V.Q., Suganuma, M. and Okatani, T., 2022, October. Grit: Faster and better image captioning transformer using dual visual features. In European Conference on Computer Vision (pp. 167-184). Cham: Springer Nature Switzerland.
- Ronneberger, O., Fischer, P. and Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18(pp. 234-241). Springer International Publishing.
- He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. Advances in neural information processing systems, 30.
- Graves, A., 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
- Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Taylor, W.L., 1953. “Cloze procedure”: A new tool for measuring readability. Journalism quarterly, 30(4), pp.415-433.
- Lloyd, S., 1982. Least squares quantization in PCM. IEEE transactions on information theory, 28(2), pp.129-137.
- Zhang, Y., Jiang, H., Miura, Y., Manning, C.D. and Langlotz, C.P., 2022, December. Contrastive learning of medical visual representations from paired images and text. In Machine Learning for Healthcare Conference (pp. 2-25). PMLR.
- Zhang, X., Wu, C., Zhang, Y., Xie, W. and Wang, Y., 2023. Knowledge-enhanced visual-language pre-training on chest radiology images. Nature Communications, 14(1), p.4542.
- Bodenreider, O., 2004. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic acids research, 32(suppl_1), pp.D267-D270.
- Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., Naumann, T., Gao, J. and Poon, H., 2021. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1), pp.1-23.
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S. and Uszkoreit, J., 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- Jain, S., Agrawal, A., Saporta, A., Truong, S.Q., Duong, D.N., Bui, T., Chambon, P., Zhang, Y., Lungren, M.P., Ng, A.Y. and Langlotz, C.P., 2021. Radgraph: Extracting clinical entities and relations from radiology reports. arXiv preprint arXiv:2106.14463.
- OpenAI, 2023. Introducing ChatGPT. https://openai.com/blog/chatgpt/
- Huang, Z., Zhang, X. and Zhang, S., 2023. KiUT: Knowledge-injected U-Transformer for Radiology Report Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 19809-19818).
- Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25.
- Guo, L., Liu, J., Zhu, X., Yao, P., Lu, S. and Lu, H., 2020. Normalized and geometry-aware self-attention network for image captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10327-10336).
- Herdade, S., Kappeler, A., Boakye, K. and Soares, J., 2019. Image captioning: Transforming objects into words. Advances in neural information processing systems, 32.
- Cohen, J.P., Viviano, J.D., Bertin, P., Morrison, P., Torabian, P., Guarrera, M., Lungren, M.P., Chaudhari, A., Brooks, R., Hashir, M. and Bertrand, H., 2022, December. TorchXRayVision: A library of chest X-ray datasets and models. In International Conference on Medical Imaging with Deep Learning (pp. 231-249). PMLR.
- Wang, J., Bhalerao, A. and He, Y., 2022, October. Cross-modal prototype driven network for radiology report generation. In European Conference on Computer Vision (pp. 563-579). Cham: Springer Nature Switzerland.
- Li, C., Xu, H., Tian, J., Wang, W., Yan, M., Bi, B., Ye, J., Chen, H., Xu, G. and Cao, Z., 2022. mPLUG: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005.
- Hu, J.C., Cavicchioli, R. and Capotondi, A., 2023, December. Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning. In 2023 IEEE International Conference on Big Data (BigData) (pp. 2173-2182). IEEE.
- Li, J., Li, D., Savarese, S. and Hoi, S., 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
- Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A.J., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L. and Kolesnikov, A., 2022. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794.
- Wang, J., Yang, Z., Hu, X., Li, L., Lin, K., Gan, Z., Liu, Z., Liu, C. and Wang, L., 2022. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100.
- Yang, S., Wu, X., Ge, S., Zheng, Z., Zhou, S.K. and Xiao, L., 2023. Radiology report generation with a learned knowledge base and multi-modal alignment. Medical Image Analysis, 86, p.102798.
- Chen, Z., Shen, Y., Song, Y. and Wan, X., 2022. Cross-modal memory networks for radiology report generation. arXiv preprint arXiv:2204.13258.
- Liu, F., Wu, X., Ge, S., Fan, W. and Zou, Y., 2021. Exploring and distilling posterior and prior knowledge for radiology report generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 13753-13762).
- Zhang, Y., Wang, X., Xu, Z., Yu, Q., Yuille, A. and Xu, D., 2020, April. When radiology report generation meets knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 07, pp. 12910-12917).
- Wang, Z., Liu, L., Wang, L. and Zhou, L., 2023. METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert Tokens. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11558-11567).
- Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V. and Zettlemoyer, L., 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
- Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J. and Krueger, G., 2021, July. Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.
- Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P. and Joulin, A., 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9650-9660).
- Bao, H., Dong, L., Piao, S. and Wei, F., 2021. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254.
- He, K., Chen, X., Xie, S., Li, Y., Dollár, P. and Girshick, R., 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16000-16009).
- Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P. and Joulin, A., 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9650-9660).
- Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A. and Zagoruyko, S., 2020, August. End-to-end object detection with transformers. In European conference on computer vision(pp. 213-229). Cham: Springer International Publishing.
- Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q. and Wang, M., 2022, October. Swin-unet: Unet-like pure transformer for medical image segmentation. In European conference on computer vision (pp. 205-218). Cham: Springer Nature Switzerland.
- Johnson, A., Pollard, T., Mark, R., Berkowitz, S. and Horng, S., 2019. MIMIC-CXR Database (version 2.0.0). PhysioNet. Available at: https://doi.org/10.13026/C2JT1Q.
- Bustos, A., Pertusa, A., Salinas, J.M. and De La Iglesia-Vaya, M., 2020. Padchest: A large chest x-ray image dataset with multi-label annotated reports. Medical image analysis, 66, p.101797.
- Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M. and Summers, R.M., 2017. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition(pp. 2097-2106).
- Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K. and Seekins, J., 2019, July. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence(Vol. 33, No. 01, pp. 590-597).
- Liu, J., Lian, J. and Yu, Y., 2020. Chestx-det10: chest x-ray dataset on detection of thoracic abnormalities. arXiv preprint arXiv:2006.10550.
- Pavlopoulos, J., Kougia, V. and Androutsopoulos, I., 2019, June. A survey on biomedical image captioning. In Proceedings of the second workshop on shortcomings in vision and language (pp. 26-36).
- Li, M., Cai, W., Verspoor, K., Pan, S., Liang, X. and Chang, X., 2022. Cross-modal clinical graph transformer for ophthalmic report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 20656-20665).