Cheng-Yi Li et al., arXiv:2407.02235 (2024). 

Blog post author: Gianluca Procopio


Introduction

"Modern models (...) fall short in providing the natural-language expressivity and interactivity required in hands-on clinical workflows."

Ever since the publication of Attention Is All You Need and the advent of Transformers, we have witnessed the rise of Generative AI. The use of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) has spread to many fields, and the medical field is no exception.

In this field, collaboration between humans and AI can be decisive in saving lives. Despite the many liability and privacy issues, integrating such models into clinical workflows can help physicians streamline and speed up manual processes, allowing them to focus primarily on the patient and the diagnosis.

Of course, the final call remains with the doctors: these models are helpful tools with a supportive role, not a leading one.

Multimodal models can be particularly useful for improving decision making and image classification, but above all for automating report generation from MRI, CT, or X-ray scans. This paper focuses specifically on report generation from 3D brain CT scans and is currently the first of its kind in the literature. Models such as LLaVA-Med (Microsoft) and Med-PaLM M (Google) have achieved promising results on 2D images, but they struggle with real-world 3D images.



 MLLMs only have a supportive role


Innate Limitations

The goal of generating reports from 3D CT scans is ambitious and challenging: it presents several problems that the authors must address in order to achieve it.


Data Complexity

The human body is a 3D entity, and 2D datasets are not able to capture the complex 3D neurovascular anatomy of the brain and heart.

Another problem is the so-called sharpshooter fallacy: it is hard to precisely extract, from a 3D volume, the single 2D slice that happens to contain the lesion.


The sharpshooter fallacy



Solution: New 3D dataset!

To effectively challenge MLLMs, they collected a novel dataset comprising 3D images that accurately represent real-world diagnostic scenarios. From 2010 to 2022, Taipei Veterans General Hospital collected 18,885 brain CT scans from 9,689 Alzheimer’s patients. The overall population is quite elderly, with an average age exceeding 82 years. The dataset includes images of normal brains, past infarcts, chronic conditions, and acute lesions.

Model Capacity

MLLMs excel at generating reports from 2D slices but struggle with 3D images. Recent medical MLLM studies show promise in generating X-ray and single-slice CT reports, but 3D brain CT data poses a challenge.

Solution: Visual Instruction Tuning

They applied clinical visual instruction tuning (CVIT) to train the BrainGPT model on top of the open-source Otter framework. Otter’s multi-image captioning and instruction-tuning capabilities enabled them to generate clinically relevant captions from volumetric brain CT scans.





Evaluation Metric Fidelity

Traditional metrics fail to assess the information density of diagnostic reports. Metrics designed for short translations and summaries are inadequate for capturing the clinical essence of BrainGPT reports.


Solution: FORTE

FORTE measures the clinical relevance of captions based on disease objectives. It categorizes radiology keywords into subsets to provide a multi-faceted performance evaluation.

The authors mentioned two relevant papers covering report generation: Med-PaLM-M and CT2Rep. Looking at the evaluation metrics they used:

Metrics from the mentioned papers


There are many other NLP metrics in the literature, but here is a quick overview of the ones mentioned in these papers.

BLEU (Bilingual Evaluation Understudy): quantifies n-gram co-occurrence frequencies, used for Machine Translation.

METEOR (Metric for Evaluation of Translation with Explicit ORdering): word-to-word matching, used for Machine Translation.

ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation): based on the Longest Common Subsequence (LCS), used for Machine Translation or Document Summarization.
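To make these definitions concrete, below is a toy Python sketch of the core quantities behind BLEU (modified n-gram precision) and ROUGE-L (the longest common subsequence). It is only an illustration, not the official implementations, and the function names are my own.

```python
# Toy sketch of the ideas behind BLEU and ROUGE-L (not the official implementations).
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    """Modified n-gram precision (the core of BLEU) for a single sentence pair."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return overlap / max(sum(cand_ngrams.values()), 1)

def lcs_length(a: list[str], b: list[str]) -> int:
    """Longest common subsequence length, the quantity behind ROUGE-L."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

gt = "no ct evidence of acute infarction of brain"
gen = "no evidence of acute infarction in the brain"
print(ngram_precision(gen, gt, n=2))        # bigram precision
print(lcs_length(gen.split(), gt.split()))  # LCS length used by ROUGE-L
```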



BrainGPT

The final model, named BrainGPT, was built on the open-source Otter framework and developed by refining Otter with Visual Instruction Tuning. Otter is based on the OpenFlamingo architecture and was obtained through instruction tuning on the MIMIC-IT dataset, which significantly improved OpenFlamingo's instruction-following capabilities. OpenFlamingo already demonstrated remarkable zero-shot performance, and it also supports multi-image in-context learning, which enables effective instruction tuning.


OpenFlamingo was created to address the lack of open-source autoregressive vision-language models: models like Google's Flamingo are not publicly released, making them unusable for research purposes. It replicates Flamingo's architecture, extending the (image, text) → text format to handle interleaved image-text inputs, which enables in-context learning.





Architecture

The authors kept the original architecture of the Otter model, which is OpenFlamingo's architecture. While tuning the model, they did not change the underlying structure; instead, they worked at a higher level, as I will outline later.

The architecture consists of a vision encoder (OpenAI's CLIP ViT-L/14), a Perceiver Resampler, and a Large Language Model (LLaMA-7B). Most of the blocks and modules in the architecture are frozen to reduce training time.

OpenFlamingo's architecture


CLIP ViT-L/14

This open-source vision encoder is used to encode the images and extract meaningful visual features from the set of 24 slices. This module is frozen.

Perceiver Resampler

This module maps the variable number of visual features to a fixed set of visual tokens, which are then fed into the Language Model through the gated cross-attention dense layers. This module is trainable.

LLaMA-7B

This open-source Large Language Model processes the text and visual tokens. Gated cross-attention layers are added to help the model attend to the CT scan's visual features. The Language Model blocks are frozen, so during fine-tuning only the gated cross-attention layers and the input/output embeddings are trained.
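To make the frozen/trainable split concrete, here is a minimal, hypothetical PyTorch sketch of a Flamingo-style gated cross-attention block of the kind inserted in front of a frozen LM layer. This is not the actual OpenFlamingo/Otter code; module names and dimensions are assumptions (a 4096 hidden size and 32 heads roughly match LLaMA-7B).

```python
# Illustrative sketch of a Flamingo-style gated cross-attention block.
# NOT the OpenFlamingo/Otter source; names and dimensions are assumptions.
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, d_model: int = 4096, n_heads: int = 32):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        # tanh gates initialised at zero: the block starts as an identity mapping,
        # so the frozen LLaMA weights are not disturbed at the start of tuning
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        # text tokens attend to the visual tokens produced by the Perceiver Resampler
        attn_out, _ = self.cross_attn(text_tokens, visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffw_gate) * self.ffw(x)
        return x

# During fine-tuning only the new blocks (and the input/output embeddings) learn;
# the vision encoder and the LM blocks stay frozen, e.g.:
block = GatedCrossAttentionBlock()
for p in block.parameters():
    p.requires_grad = True            # trainable
# llama_layer.requires_grad_(False)   # hypothetical frozen LM layer
```

The zero-initialised tanh gates are the key design choice: at the beginning of tuning the block behaves as an identity, so the frozen language model is not perturbed.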



Model tuning

As previously mentioned, the Otter model demonstrated strong multi-image in-context learning capabilities, allowing it to process interleaved images and text as input. Therefore, the data is structured as triplets, with each set of 24 slices paired with an instruction and the corresponding ground truth (the true report). 
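As a purely illustrative example (the field names and file layout below are my assumptions, not the actual Otter/MIMIC-IT schema), a single training triplet could look like this:

```python
# Hypothetical representation of one training triplet (field names are illustrative).
training_example = {
    "images": [f"study_001/slice_{i:02d}.png" for i in range(24)],  # 24 axial CT slices
    "instruction": (
        "You are an AI assistant specialized in radiology topics. "
        "You are provided with brain CT slices from a single study. "
        "The number of slices is 24. "
        "Please generate medical descriptions based on the images."
    ),
    "ground_truth": (
        "Findings: low density change in the periventricular white matter ... "
        "Conclusion: 1. subcortical arteriosclerotic encephalopathy."
    ),
}
```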

To fine-tune the model, Visual Instruction Tuning was applied to the Otter foundation model. This process includes two variants: Regular and Clinical, with the Clinical variant being more structured and tailored to clinical contexts.


Regular VIT (RVIT)

Plain Instruction

This simply conveys the model's role as a radiology assistant:

You are an AI assistant specialized in radiology topics. You are provided with brain CT slices from a single study. The number of slices is 24.

Please generate medical descriptions based on the images.

Example Instruction

This adds three in-context examples to the plain instruction:

You are an AI assistant specialized in radiology topics. You are provided with brain CT slices from a single study. The number of slices is 24.

Please generate medical descriptions based on the images in a consistent style.

<Impression: arteriosclerotic encephalopathy>

Findings

  • low density change in the periventricular white matter, most likely as subcortical arteriosclerotic encephalopathy
  • no intracranial hemorrhage.
  • normal appearance of insular cortex and no definite effacement of cerebral cortex. no ct evidence of acute infarction of brain.
  • no ventricular dilatation nor midline shift.
  • no space-occupying lesion in the brain parenchyma.
  • bilateral paranasal sinuses and mastoid air cells are well pneumatized.
  • skull bones appear intact without space-occupying lesion.

Conclusion

  1. low density change in the periventricular white matter, most likely as subcortical arteriosclerotic encephalopathy
  2. no ct evidence of acute infarction of brain.

..

..

[2 MORE EXAMPLES]





Clinical VIT (CVIT)

Template Instruction

This adds a clinically defined template to the plain instruction:

You are an AI assistant specialized in radiology topics. You are provided with brain CT slices from a single study. The number of slices is 24.

Please generate medical descriptions based on the images in a consistent style.

<Style template>

Findings:

  • <is there a midline shift or hemorrhage? (intracranial hematoma/epidural hematoma/subdural hematoma)>
  • <is there a change in ventricular and sulci system?>
  • <Is there a white matter lesion? (lacunar infarction/cortical infarction/subcortical infarction)>
  • <Is the brain parenchyma healthy? (tissue loss(atrophy)/tissue swelling)>
  • <Is there abnormality in high density area? (meningioma/fracture/calcified plaque/arachnoid cyst)>
  • <(Use your domain knowledge) Is there any other abnormality? (herniation/arteriosclerotic encephalopathy/encephalomalacia/wall calcification of cavernous ICA/Air-fluid level)>

Conclusion:

  1. <summarizing important finding 1>
  2. <summarizing important finding 2>

Keyword Instruction

This adds to the plain instruction some categorical guidelines that help the model concentrate on specific aspects when generating the report:

You are an AI assistant specialized in radiology topics. You are provided with brain CT slices from a single study. The number of slices is 24.

Please generate medical descriptions based on the images in a consistent style.

Use the following guidelines:

  • Degree: Indicate the intensity or state (e.g., normal, mild, chronic, old, etc).
  • Landmark: Specify the area of interest (e.g., intracerebral, midline, parenchyma, sulci, etc).
  • Feature: Describe any observed abnormalities (e.g., hemorrhage, atrophy, infarcts, etc).
  • Impression: Conclude with a clinical impression (e.g., arteriosclerotic encephalopathy, intracerebral hemorrhage, dementia, etc).

Ensure consistency and clarity in the report.

This keyword-based instruction with categories will prove particularly useful once the FORTE metric is introduced.



Example BrainGPT

Ultimately, four fine-tuned BrainGPT models were derived. Moving from RVIT to CVIT fine-tuned models led to reports that used clinical keywords more effectively, were less verbose, and showed greater clinical accuracy.

While the clinical relevance of BrainGPT-keyword's reports is clear when reviewing the generated output, the question remains whether traditional metrics can capture these differences.

BrainGPT reports examples



Evaluation with traditional metrics

They evaluated the BrainGPT-generated reports with the traditional metrics mentioned above and, not surprisingly, found such metrics quite insensitive to the increase in clinical relevance of the reports. Traditional NLP metrics focus only on superficial text similarity and are unable to capture the increasing density of clinical keywords, and hence the increasing clinical relevance.

Evaluation with traditional metrics


The only exception is the CIDEr-R (Robust Consensus-based Image Description Evaluation) metric. This metric is sensitive to keyword usage: it is computed from the cosine similarity of term frequency-inverse document frequency (TF-IDF) vectors. Hence, CIDEr-R is able to capture the increase in clinical relevance from the RVIT to the CVIT generated reports.
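As a rough illustration of the TF-IDF idea behind CIDEr-R (the real metric also involves n-gram clipping, a length penalty, and consensus over reference captions), here is a minimal sketch using scikit-learn:

```python
# Sketch of the TF-IDF cosine-similarity idea underlying CIDEr-R (not the full metric).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ground_truth = "low density change in the periventricular white matter, no acute infarction"
generated    = "periventricular white matter low density change without acute infarction"

# CIDEr uses 1- to 4-grams; in practice the IDF weights are estimated over the whole
# report corpus, which is what up-weights rare radiology keywords.
vectorizer = TfidfVectorizer(ngram_range=(1, 4))
tfidf = vectorizer.fit_transform([ground_truth, generated])
score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"TF-IDF cosine similarity: {score:.3f}")
```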

Nevertheless, the absolute scores were still poor, so they looked for a way to improve them by working at a higher level.



Sentence Pairing

To boost the scores, they tried Sentence Pairing. The idea is simple: a report structured list by list is easier to evaluate than a single paragraph. After generating the report, they paired generated and ground-truth sentences based on cosine similarity, and then evaluated only the paired sentences rather than the whole text.
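A minimal sketch of the pairing idea is shown below; the embedding model and the greedy best-match strategy are my assumptions rather than the paper's exact setup.

```python
# Sketch of sentence pairing: match each ground-truth sentence to its most similar
# generated sentence and evaluate only those pairs.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def pair_sentences(gt_sentences, gen_sentences):
    gt_emb = embedder.encode(gt_sentences)
    gen_emb = embedder.encode(gen_sentences)
    sim = cosine_similarity(gt_emb, gen_emb)          # |GT| x |GEN| similarity matrix
    pairs = []
    for i, row in enumerate(sim):
        j = int(np.argmax(row))                       # best-matching generated sentence
        pairs.append((gt_sentences[i], gen_sentences[j], float(row[j])))
    return pairs

gt = ["no intracranial hemorrhage.", "no midline shift."]
gen = ["there is no midline shift.", "no evidence of intracranial hemorrhage."]
for gt_s, gen_s, score in pair_sentences(gt, gen):
    print(f"{score:.2f}  GT: {gt_s}  <->  GEN: {gen_s}")
```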

Despite Sentence Pairing, the baseline Otter registered even poorer results. The BrainGPT models, on the other hand, saw large gains across all the metrics, especially for CIDEr-R, where the scores increased dramatically for all the BrainGPT variants.


Sketch sentence pairing



Scores after sentence pairing

Given CIDEr-R’s ability to capture hierarchical clinical essence across Visual Instruction Tuning conditions, they hypothesized that its TF-IDF component reacts to the use of rare radiology keywords in the reports. They investigated term frequencies in the ground truth and in the test outputs and noticed that the baseline Otter model has low recall for clinical keywords, while the BrainGPT models have high recall, confirming that the BrainGPT-generated reports have higher clinical relevance.



FORTE (Feature-Oriented Radiology Task Evaluation)

This approach evaluates the medical content of the generated report by analyzing the density of radiology information. Radiology keywords are divided into four categories - degree, landmark, feature, and impression - to provide a complete assessment of the system's performance. These categories mirror the keyword guidelines used during CVIT, so the evaluation targets exactly the aspects the model was instructed to focus on.


The categories are the following:

  1. Degree: size and intensity of a lesion;
  2. Landmark: anatomic location of a lesion;
  3. Feature: lesion characteristics and disease traits;
  4. Impression: final diagnosis.

On the right you can see an example with a pair of sentences: the blue one is from the ground truth (GT), while the red one is from the AI-generated report (GR).

FORTE example


For each category they carefully curated a bank of keywords and their synonyms to cover all the clinically relevant terms. Since FORTE is based on a closed set of keywords per category, they obtain a numerical score by computing the F1 score for each category.
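A simplified sketch of this per-category keyword F1 is shown below. The tiny keyword sets are illustrative samples only; the published banks are much larger and also map synonyms (e.g., "old" and "chronic"), which this toy version does not.

```python
# Simplified sketch of FORTE's per-category keyword F1 (illustrative keyword sets only).
KEYWORDS = {
    "degree":     {"mild", "chronic", "old", "acute"},
    "landmark":   {"periventricular", "midline", "parenchyma", "sulci"},
    "feature":    {"hemorrhage", "atrophy", "infarct", "infarction"},
    "impression": {"encephalopathy", "dementia"},
}

def extract(text: str, vocab: set[str]) -> set[str]:
    tokens = set(text.lower().replace(",", " ").replace(".", " ").split())
    return tokens & vocab

def forte_f1(ground_truth: str, generated: str) -> dict[str, float]:
    scores = {}
    for category, vocab in KEYWORDS.items():
        gt_kw, gen_kw = extract(ground_truth, vocab), extract(generated, vocab)
        tp = len(gt_kw & gen_kw)
        precision = tp / len(gen_kw) if gen_kw else 0.0
        recall = tp / len(gt_kw) if gt_kw else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[category] = f1
    return scores

gt = "chronic infarct in the periventricular white matter, arteriosclerotic encephalopathy"
gen = "old periventricular infarct, compatible with encephalopathy"
# "old" vs "chronic" would be matched by the real synonym banks, not by this toy version
print(forte_f1(gt, gen))
```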

The table below collects a sample of keywords (and synonyms) for each category. The full keyword sets are provided as .json files on their GitHub. They also published keyword banks for chest X-ray, abdominal CT, and chest CT scans.

FORTE bank of keywords for brain CT scans



 Negation Removal

By further investigating the keyword frequencies, they noticed that, among the usual grammatical filler words, the word "no" had a remarkably high frequency. They therefore manually reviewed the generated reports and noticed two things:


Interpretation Spree

BrainGPT's tendency to overgenerate negative findings, making reports more verbose and less clinically accurate.


No "reporting bias"

When writing a report, negative descriptions should be overlooked in favor of positive findings, which keeps the report concise and clinically relevant.

Essentially, to address this problem they removed all the negative sentences after sentence pairing, and they obtained very good results across all the metrics for all the BrainGPT models. The foundation model Otter's performance, on the other hand, dropped again.
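A minimal sketch of such a negation-removal step, assuming a simple cue-word filter (the authors' exact rules are not reproduced here):

```python
# Sketch of negation removal: drop sentences dominated by negative findings before
# scoring. The cue list and the sentence splitting are simplifications of my own.
import re

NEGATION_CUES = ("no ", "not ", "without ", "negative for ", "nor ")

def remove_negations(report: str) -> str:
    sentences = re.split(r"(?<=[.;])\s+", report)
    kept = [s for s in sentences
            if not any(cue in " " + s.lower() for cue in NEGATION_CUES)]
    return " ".join(kept)

report = ("Low density change in the periventricular white matter. "
          "No intracranial hemorrhage. No midline shift. "
          "Old lacunar infarct in the left basal ganglia.")
print(remove_negations(report))
# -> "Low density change in the periventricular white matter. Old lacunar infarct in the left basal ganglia."
```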


w.r.t. sentence pairing



Negation Removal example


External Validation

So far, everything seems great, and the performances are impressive. But how does the model perform on unseen data? And what about its natural language expressivity?

The authors carried out two experiments to validate the model. The first external validation was performed on the CQ500 dataset to assess the generalization of the diagnostic capabilities. The second one, the Turing Test, was performed to assess how good the model is at replicating human writing style.


CQ500 ICH Dataset

The first validation was performed on the CQ500 dataset, an intracranial hemorrhage (ICH) dataset, meaning that the covered diseases were slightly different from those in the training set.

Indeed, the keyword usage rates differed considerably between the two datasets: for example, the "hemorrhage" feature occurred in more than 50% of the validation dataset, but in less than 10% of the training set.

Therefore, they wondered whether the model could capture these features even though they are underrepresented in the training set.



Features occurrence





Turing Test

The second external validation is the Turing Test. Essentially, they asked 11 physicians to guess if a report was written by a human or an AI. They were given pairs of text reports and had to answer a survey.

After that, they were provided with the original CT scan and had to re-evaluate their answers. They were also asked to rate their confidence and explain what made them choose that answer.




Turing test survey



Results

To wrap up, is BrainGPT actually good? And is FORTE really useful? Let's now explore the collected results.


Results CQ500 Validation

The results of the first external validation experiment on the CQ500 dataset gave several interesting insights into the diagnostic capabilities of BrainGPT. First of all, as you can see from the chart, especially after the Negation Removal is performed, BrainGPT is able to capture certain features with fair accuracy levels, even if they are underrepresented in the training set.

Results are represented as intervals because they encompass all the BrainGPT fine-tuned models, from -plain to -keyword. The CVIT fine-tuned models always outperformed the RVIT fine-tuned ones.

This test highlighted how much better and more accurate the CVIT fine-tuned models are, and how important it is to perform Negation Removal on the generated reports to improve clinical accuracy and relevance.


Moreover, by further investigating the results, they also noticed that BrainGPT does not suffer from the sharpshooter fallacy, since it proved able to recognize multi-slice and multi-object lesions.

The conclusion: yes, BrainGPT's diagnostic capabilities generalize well!



Results Turing Test

And what about the natural language expressivity? Again, the results are impressive.


When only the textual reports were provided, more than 74% of the AI-written reports were mistakenly identified as human-written, whereas less than 47% of the human-written reports were identified as such.

On the other hand, when the original CT scan was provided to the evaluators, the percentage of AI-written reports classified as human-written dropped to slightly above 56%, which is still impressive considering that only about 50% of the human-written reports were correctly classified. The evaluators' confidence in their answers also increased once they were provided with the original CT scan.

The authors also outlined the main factors influencing the evaluators' answers - familiarity and voice, continuity and coherence, specificity and vagueness of details, and sentence-level writing quality - stating that these are the factors researchers should concentrate on when fine-tuning a model in order to improve its natural language expressivity.



Traditional Metrics

They also evaluated BrainGPT's reports with the traditional metrics and compared the results with the current state of the art, noticing that, despite the small size of the model (7B parameters), it matches or outperforms bigger and more complex models.


It's impressive how much better BrainGPT is compared to Med-PaLM-M (84B), even though we need to specify that the latter is a generalist model applied to a broader set of tasks (report generation, image classification, medical Q&A, etc.).



FORTE

In contrast to the trend shown by the traditional metrics, FORTE clearly captures the increase in clinical relevance of the generated reports, highlighting that reports generated by CVIT fine-tuned models have a higher average F1 score, i.e., these models make better use of clinically relevant keywords and their reports are closer to the real ones.

They also investigated the relationship between FORTE and the traditional metrics, and noticed that the latter have a very high internal positive correlation, meaning they express the same behaviour. FORTE, on the other hand, shows low positive internal correlations and low inter-correlation with the traditional metrics; the authors attribute this to FORTE capturing broader and more distinct aspects of the reported diseases.



FORTE F1 scores


Correlation matrix



Comments

It's impressive how well the model replicates human writing, as confirmed by its performance on the Turing Test. However, when it comes to report generation, I wouldn't concentrate on natural language expressivity too much: I'd much rather see reports that are concise and diagnostically accurate.

On the diagnostic side, the model also shows good results, but there is a problem: there aren't any other brain-specific MLLMs to compare it to yet. Still, the results are remarkable, especially considering BrainGPT's architecture: it's open-source, light, and incredibly efficient. The fine-tuning process was fast (12 hours on two NVIDIA A100 GPUs), far less than the resources required by models like Google's Med-Gemini-3D, which can also handle 3D CT report generation but demands much more hardware.

I think that FORTE is the best advancement, as it's the first attempt in the literature to evaluate the clinical relevance of a diagnostic report. The strength of this metric is that it's highly generalizable and can be applied in other diagnostic domains. The only issue is that physicians have to carefully curate a domain-specific bank of keywords for each category.










One issue I noticed is the imbalance in the training dataset. It leans heavily toward degeneration-related conditions, leaving other diseases underrepresented. For example, the model sometimes failed to identify brain tumors, which is critical in real-world applications. To improve, I'd recommend using a more balanced dataset that represents a wider range of diseases more evenly.

Another idea for future work: leveraging a more advanced model architecture. The authors achieved very good results by working on high-level techniques like sentence pairing and negation removal on top of a lightweight framework; a more complex architecture could improve the performance even further.


References

Li et al., Towards a Holistic Framework for MLLMs in 3D Brain CT Report Generation. arXiv:2407.02235 (2024).

Source Code, https://github.com/charlierabea/FORTE/tree/main?tab=readme-ov-file.

Tu, T., et al. Towards Generalist Biomedical AI. NEJM AI 1, AIoa2300138 (2024).

Li, C., et al. LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day. arXiv:2306.00890 (2023).

Ethem Hamamci, I., Er, S. & Menze, B. CT2Rep: Automated Radiology Report Generation for 3D Medical Imaging. arXiv:2403.06801 (2024).

Li et al., Otter: A Multi-Modal Model with In-Context Instruction Tuning, arXiv:2305.03726 (2023).

Awadalla et al., OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models, arXiv:2308.01390 (2023).

Alayrac et al., Flamingo: a Visual Language Model for Few-Shot Learning. arXiv:2204.14198 (2022).

Vaswani et al., Attention Is All You Need. arXiv:1706.03762 (2017).



