1. Outline

  • Motivation
  • AI in Medicine & LLMs
  • AI Agents
  • Conclusion
  • Review & Discussion
  • References

2. Motivation

The difference between a good doctor and an average one is rarely just about medical knowledge. It lies in the ability to observe details, reason through uncertainty, and act according to the specific context of each patient. The same distinction applies to the world of Artificial Intelligence (AI).

Over the past few years, research in medical AI has shifted toward building Large Language Models (LLMs) capable of supporting clinical tasks. These models demonstrate impressive knowledge, strong pattern recognition, and an ability to generate coherent explanations across many medical domains. Yet despite this progress, their application in real-world clinical scenarios remains limited, because they suffer from several recurring flaws. LLMs typically provide single-shot answers: they generate a final response without validating it through follow-up questions, and when the input is incomplete or ambiguous, they still assume they must produce an answer. They also fail to integrate contextual information: they do not track patient history unless it is explicitly included in the prompt, so they may overlook crucial diagnostic information simply because they do not realize context is missing. Furthermore, LLMs struggle with multi-step reasoning: they cannot reliably maintain long reasoning chains, and tasks requiring sequential logic or multi-step clinical diagnosis frequently break down as the model loses consistency across steps, leading to fragile output. Lastly, they tend to hallucinate when faced with ambiguity or missing data: because these models are trained to produce fluent text rather than express uncertainty, they often invent plausible-sounding details instead of flagging the information they lack.

This gap resulted in a new approach in medical AI: agent-based systems. Rather than treating an LLM as a passive text generator, the idea now is to design systems where the model functions as the brain of an active agent. These systems introduce several new capabilities:

  • Step-by-step reasoning:
    Instead of producing single-shot answers, the agent can reason progressively.

  • Environment observation:
    It can observe patient history as well as context before answering.

  • Tool interaction:
    The agent is able to leverage external tools to enhance performance and accuracy.

  • Action-taking:
    Before committing to an answer, it can take actions, such as asking for additional input or adjusting its reasoning (a minimal sketch of such a loop follows this list).
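
To make these capabilities concrete, below is a minimal, illustrative sketch of such an agentic loop in Python. Everything in it is an assumption for illustration only: llm stands for a hypothetical callable that returns either a final answer or a requested action, and tools maps tool names to ordinary functions; no specific framework or API is implied.

    def run_agent(patient_query, llm, tools, max_steps=5):
        """Observe context, reason step by step, call tools or ask for more input,
        and only commit to an answer once enough information has been gathered."""
        context = [patient_query]                          # observations gathered so far
        for _ in range(max_steps):
            decision = llm(context)                        # one reasoning step
            if decision["type"] == "final_answer":
                return decision["content"]                 # commit only when confident
            if decision["type"] == "tool_call":            # interact with external tools
                result = tools[decision["tool"]](decision["input"])
                context.append(f"tool {decision['tool']} returned: {result}")
            elif decision["type"] == "ask_user":           # take action: request more input
                context.append(f"clinician reply: {input(decision['content'])}")
        # Step budget exhausted: fall back to the best available answer
        return llm(context + ["Give your best final answer now."])["content"]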

In this blog post, we explore how recent research leverages this shift from static models to intelligent agents, and how these systems are bringing medical AI one step closer to real-world clinical integration. We begin by outlining how AI’s role in medicine has evolved over recent years in Section 3. In Section 4, we introduce the concept of AI agents and then summarize the contributions of three key papers that explore different agent-based designs. Section 5 summarizes the main insights and interpretations drawn from these studies. Finally, we end with a review and discussion of the papers, highlighting their implications and offering perspectives on the future of agentic AI in the medical field.


3. AI in Medicine & LLMs

3.1 From Traditional Medical AI to Foundation Models

Incorporating AI into the medical field has long been a central goal for researchers. For many years, the dominant approach relied on Deep Learning (DL), particularly Convolutional Neural Networks (CNNs), which excelled at processing structured data such as medical images. As a result, CNN-based systems achieved strong performance in tasks like radiology classification, skin lesion detection, ECG signal analysis, and disease risk prediction, often surpassing previous machine-learning methods [1].

However, these models came with significant limitations. They performed well on highly specific tasks but could not generalize beyond their training setup, reason about patient context, or integrate multiple modalities.

A major breakthrough came with the introduction of Transformers, which became the backbone of modern Foundation Models (FMs). Unlike CNNs, Transformers excel at capturing long-range dependencies and understanding complex context in both text and image data. This architecture enabled models such as GPT, PaLM, LLaMA, and Med-PaLM to process natural language instructions, summarize medical information, and perform advanced clinical reasoning.

Thanks to Transformers, FMs rapidly set a new standard for AI in medicine. Systems are now one step closer to understanding nuanced prompts, adapting to varied clinical tasks, and operating across multiple medical domains, pushing the field far beyond the capabilities of traditional deep learning systems.

3.2 LLMs in Medicine

Among foundation models, LLMs stand out as the most consequential line of progress in medical AI, thanks to their capacity to understand and generate medical language with remarkable flexibility. As highlighted in [2], they revolutionized natural language processing and now occupy a central position at the forefront of AI innovation in medicine. Despite these impressive capacities, LLMs are still unable to act as autonomous systems in real clinical environments, for several reasons.

First, they usually provide single-shot answers, generating immediate responses to prompts without the multi-step reasoning that clinical decision-making requires. This often leads to hallucinations, where confident but inaccurate information is produced, especially when data is incomplete or ambiguous. LLMs also lack interpretability, behaving as black boxes: they generate conclusions without exposing the reasoning behind them, making it difficult for clinicians to trust or verify their outputs.

Finally, LLMs do not support collaborative reasoning: they cannot cross-check their conclusions with other experts or models, which raises the risk of uncorrected errors.

3.3 Why Limitations Lead to Agentic Systems

The limitations of LLMs explain why they alone are insufficient for clinical deployment. In response, researchers are now turning to AI agent systems, which extend LLM capabilities through interaction, collaboration, and structured decision-making.

A helpful way to think about these systems is to imagine an entire human body where the LLM acts as the brain. It provides cognitive abilities such as understanding language, recalling knowledge, and reasoning, whereas the agent framework adds everything the “brain” was previously missing. Since AI Agents can gather new information, take actions, and reflect or verify their reasoning, they address several of the restrictions that LLMs struggle with.

Through this combination, agentic systems transform an LLM from a passive text generator into an autonomous active model, much closer to how clinicians actually work.


4. AI Agents

Defined in [4] as “autonomous software programs that perform specific tasks,” AI agents extend beyond traditional LLMs by adding the ability to perceive, interact, remember, and reason. These capabilities allow them to operate as active problem-solvers rather than passive text generators.

AI agents can generally be grouped into two main categories: single-agent systems and multi-agent systems. The difference between them is not just the number of agents involved, but the type of intelligence each system is designed to specialize in.

Single-agent systems focus on giving one agent the ability to interact with its environment. They excel at tasks that require observation, exploration, and step-by-step actions.

Multi-agent systems, on the other hand, specialize in complex reasoning tasks. By bringing several agents together, each with different roles or perspectives, these systems benefit from team-based intelligence. They perform well in scenarios requiring multi-step reasoning, peer review, structured debate, and reduced hallucinations. Just as LLMs exhibit emergent abilities on tasks they were never explicitly trained for, multi-agent systems show a similar effect: once models start interacting and reasoning together, the system as a whole develops abilities that no single agent was taught. This emergent teamwork often lets them handle new or complex tasks surprisingly well [8].


4.1 Single-Agent Systems

4.1.1 CPathAgent

Idea

Turning to single-agent systems, CPathAgent [5] introduces a framework that strategically navigates Whole Slide Images (WSIs) in order to perform diagnostic interpretation. Its goal is to emulate pathologists’ examination workflow while also providing clear explanations of each reasoning step. It addresses several issues that traditional LLM-based approaches suffer from, including the difficulty of training models with limited labeled pathology data and the absence of verifiable reasoning that clinicians can trust. CPathAgent also overcomes a major technical barrier in computational pathology: the inability of standard LLMs to handle gigapixel, high-resolution slides. Although its use of agentic AI differs from multi-agent collaboration, it still offers a clear advantage over traditional LLMs by enabling the model to interact with its environment, perform step-by-step diagnostic reasoning, and generate interpretable diagnostic reports grounded in visual evidence.

Methodology

To faithfully mimic how real pathologists examine WSIs, CPathAgent follows a three-stage agentic workflow, depicted in Figure 2, capable of dynamic region selection, strategic navigation planning, and multi-scale diagnostic reasoning:

The first stage is global screening:

The system generates a thumbnail of the WSI via 32x downsampling and then divides it into overlapping regions. These regions are then grouped according to similarities in their pathological characteristics. Next, each cluster is assigned a severity score (0-5) indicating whether it requires high-magnification review. This step filters out insignificant areas, which serves both performance and efficiency. 
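
To make the screening stage more tangible, here is a minimal sketch in Python. It is a simplification under stated assumptions: score_severity is a hypothetical stand-in for the model call that rates a region from 0 to 5, the clustering of similar regions is omitted for brevity, and the tile size, stride, and threshold are arbitrary illustrative values.

    from dataclasses import dataclass

    @dataclass
    class Region:
        x: int          # top-left corner on the 32x-downsampled thumbnail
        y: int
        severity: int   # 0-5 score; higher means it deserves high-magnification review

    def global_screening(thumb_w, thumb_h, score_severity, tile=512, stride=384, keep_at=2):
        """Tile the thumbnail into overlapping regions, score them, and keep only
        the regions that warrant closer review."""
        regions = []
        for y in range(0, max(thumb_h - tile, 0) + 1, stride):    # stride < tile gives overlap
            for x in range(0, max(thumb_w - tile, 0) + 1, stride):
                regions.append(Region(x, y, score_severity(x, y, tile)))
        # Filter out insignificant areas before the more expensive navigation stage
        return [r for r in regions if r.severity >= keep_at]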

The second stage is navigation planning:

For every region preserved after screening, the agent generates a dynamic and adaptive viewing plan. The plan consists of a sequence of viewing steps, defined by coordinates, magnification level and a diagnostic view describing what the agent should examine at each position. This plan is generated autoregressively, with each step being influenced by the clinical context and previously observed content. This results in a path that faithfully mirrors how a pathologist scans a slide, zooming in and out to investigate suspicious structures. 
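
The structure of such a plan can be sketched as follows; propose_next_step is a hypothetical wrapper around the agent's underlying model (not part of the paper's published interface), and the step limit is an arbitrary illustrative choice.

    from dataclasses import dataclass

    @dataclass
    class ViewingStep:
        x: int                 # slide coordinates to centre the next view on
        y: int
        magnification: float   # e.g. 5x, 10x, 20x
        diagnostic_view: str   # what the agent intends to examine at this position

    def plan_navigation(region, clinical_context, propose_next_step, max_steps=8):
        """Generate viewing steps one at a time, each conditioned on the clinical
        context and on everything that has been planned so far."""
        plan = []
        for _ in range(max_steps):
            step = propose_next_step(region, clinical_context, plan)   # autoregressive step
            if step is None:          # the agent decides the region has been covered
                break
            plan.append(step)
        return plan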

The third stage is multi-scale, multi-view sequence reasoning:

This is where most of the actual diagnostic work occurs. The agent begins by executing the navigation plan, retrieving all corresponding image crops alongside their associated data. It then performs multi-scale, multi-view holistic reasoning, which includes integrating evidence, cross-referencing features, refining hypotheses, and maintaining logical continuity across the viewing sequence. Finally, it synthesizes a coherent pathology report summarizing its diagnosis.
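
A sketch of this final stage, under the same caveats: crop_wsi, reason_over_view, and write_report are hypothetical helpers standing in for the slide reader and the agent's reasoning and report-generation calls.

    def multiscale_reasoning(wsi, plan, crop_wsi, reason_over_view, write_report):
        """Execute the navigation plan, accumulate observations across views,
        and synthesize a pathology report grounded in the collected evidence."""
        observations = []
        for step in plan:
            crop = crop_wsi(wsi, step.x, step.y, step.magnification)
            # Each view is interpreted in light of all previous observations, which is
            # what keeps the reasoning chain consistent across scales and views.
            observations.append(reason_over_view(crop, step.diagnostic_view, observations))
        return write_report(observations)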

Dataset Construction:

An important part of the methodology in this paper is the construction of both the training and benchmarking datasets. This step was necessary because computational pathology suffers from a notable scarcity of reliable datasets. To address this gap, the authors introduced PathMMU-HR², a benchmark of 1,668 expert-validated VQA pairs that require multi-scale analysis. Additionally, they built the CPathAgent-Instruct dataset, which serves as training material for the agent on pathology decision making. Together, these datasets form the foundation for training and evaluating CPathAgent, making it possible to properly assess its diagnostic performance and the benefits of its agent-based design.

Results

The evaluation of CPathAgent is done through four major criteria, each designed to test a specific diagnostic capability. The first one, patch understanding, examines whether the model can correctly interpret small pathology patches, which serves as the foundation for more complex reasoning. Here, CPathAgent outperforms both general-purpose and pathology-specific LLMs, and even surpasses expert-annotated baselines on several subsets. The second criterion, huge-region understanding, assesses the agent’s capacity to navigate and reason over large, high-resolution regions. In this setting, CPathAgent also achieves superior performance across diverse cancer types, confirming that agent-based reasoning strategies enhance pathology model performance. The third criterion, WSI classification, measures how well the system can classify whole-slide images. CPathAgent achieves competitive results using its agent-based workflow, and the upper-bound analysis confirms the high quality of its synthetic training data while revealing potential room for improvement. Finally, the out-of-distribution evaluation tests how well the model generalizes to datasets it has never seen before. Even with significantly less training data than other methods, CPathAgent achieves strong out-of-distribution performance with remarkable efficiency. Altogether, these results highlight the robustness and clinical potential of CPathAgent’s agent-based design.


4.2 Multi-Agent Systems

4.2.1 MDAgents

Idea

Although LLMs perform well on medical tasks that rely primarily on information retrieval, they struggle with complex tasks that require deeper reasoning. Medical decision-making, for instance, often involves several clinicians who discuss, critique, and refine each other’s hypotheses. An LLM, analogous to a brain, cannot perform such tasks on its own since it lacks the ability to collaborate, cross-check, or interact.

To address this gap, Kim et al. [6] introduce MDAgents, a framework that takes advantage of what agents can do beyond traditional LLM systems. It does so by leveraging multiple LLM-powered agents to simulate the behavior of a medical team when tackling tasks of different complexities. In this setup, each agent is assigned a specialized role, and through rounds of discussion, critique, and refinement, the agents collectively work toward a more reliable clinical conclusion.

Methodology

Although collaborative agent systems have been proposed before, MDAgents introduces a more adaptive and efficient approach to collaboration. It starts by assessing the complexity of each incoming task. Based on whether the task is classified as low, moderate, or high complexity, the system determines how many agents should be recruited. This adaptive allocation ensures that the computational effort matches the difficulty of the problem, allowing the system to maintain efficiency while improving performance on more challenging clinical tasks. For low-complexity tasks, the framework recruits a single Primary Care Clinician (PCC); for moderate cases, it activates a Multi-Disciplinary Team (MDT); and for high-complexity problems, it calls upon an Interdisciplinary Collaboration Team (ICT).

The distinction between these teams reflects the type of diversity needed: the PCC acts as a single generalist agent, the MDT brings together agents representing different medical perspectives, and the ICT gathers agents with different reasoning roles, optimized for long multi-step chains of thought.
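
A minimal sketch of this adaptive recruitment step is shown below; assess_complexity is a hypothetical helper that asks an LLM to rate the incoming query, and the mapping itself is a simplification of the paper's recruitment procedure.

    TEAMS = {
        "low":      "PCC",   # single Primary Care Clinician (generalist)
        "moderate": "MDT",   # Multi-Disciplinary Team: different medical perspectives
        "high":     "ICT",   # Interdisciplinary Collaboration Team: distinct reasoning roles
    }

    def recruit_team(query, assess_complexity):
        """Match the collaboration structure (and computational effort) to task difficulty."""
        complexity = assess_complexity(query)   # expected to return "low", "moderate", or "high"
        return TEAMS[complexity]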

Once the appropriate team is recruited, each agent is assigned a specialized role, such as:

  • Analyst (generates initial hypotheses)

  • Critic (identifies errors and contradictions)

  • Reviser (refines solutions using feedback)

  • Evidence Retrieval Agent (fetches relevant medical information)

  • Summarizer / Reporter (produces structured and interpretable output)

This diversity of roles creates complementary reasoning perspectives, improving quality and reducing hallucinations.

The MDAgents workflow, presented in Figure 3, proceeds as follows: each agent begins by generating an initial response to the clinical query based on its assigned role and underlying medical knowledge. These initial responses form the foundation of the collaboration cycle. Once they are produced, agents examine one another’s outputs, identifying contradictions and providing critiques or supporting evidence according to their specialized functions. Through these iterative rounds (propose-critique-revise), the team progressively strengthens its reasoning until a stable consensus is reached.
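
A hedged sketch of this propose-critique-revise cycle follows. It collapses the team's parallel initial responses into a single Analyst call for brevity; agents is assumed to map role names to LLM-backed callables, and has_converged is a simple stand-in for the consensus check.

    def collaboration_rounds(query, agents, max_rounds=3, has_converged=None):
        """Iteratively propose, critique, and revise until a stable consensus is reached."""
        answer = agents["Analyst"](query)                         # initial hypothesis
        for _ in range(max_rounds):
            critique = agents["Critic"](query, answer)            # find errors and contradictions
            revised = agents["Reviser"](query, answer, critique)  # incorporate the feedback
            if has_converged and has_converged(answer, revised):
                return revised                                    # stable consensus reached
            answer = revised
        return answer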

An additional component introduced in MDAgents to improve performance is the evidence retrieval module. When an agent encounters uncertainty or needs to verify a clinical claim, it can request information from curated medical sources. Once this data is incorporated into the reasoning loop, the ongoing collaboration becomes grounded in verified facts, further reducing hallucinations and strengthening the medical accuracy of the final decision.

After a stable consensus is reached, all agent outputs are forwarded to the Moderator Agent, which performs the final evaluation. If any disagreements remain, the moderator selects the explanation with the best evidence and clinical logic, leading to the production of a single unified answer. This step significantly enhances overall system performance by ensuring reliability as well as medical consistency.

Finally, the output of MDAgents consists of three main components. First is the final clinical answer, representing the conclusion reached through collaboration and moderation. Second is the reasoning trace, which provides an explanation of how the agents reached that conclusion. Third, when applicable, the system includes evidence excerpts from medical sources that were retrieved to support or verify the reasoning.
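
For illustration, these three components could be represented by a simple container such as the one below; the field names are ours, not the paper's.

    from dataclasses import dataclass, field

    @dataclass
    class MDAgentsOutput:
        final_answer: str                    # conclusion reached through collaboration and moderation
        reasoning_trace: list[str]           # explanation of how the agents reached that conclusion
        evidence: list[str] = field(default_factory=list)   # retrieved excerpts, when applicable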

Results

For the results analysis, we begin by examining the overall performance of MDAgents in comparison to both single-model and multi-model baseline systems. We then present how performance varies across different complexity levels, highlighting the effectiveness of the adaptive PCC–MDT–ICT structure. Finally, we discuss the ablation study conducted in the paper, which demonstrates the contribution and necessity of each component within the MDAgents framework.

Overall Performance Evaluation

For the overall performance evaluation, Figure 4 shows the gains that MDAgents achieves over solo LLMs and grouped LLMs. To make the measurements, the researchers used a wide set of benchmarks divided into two categories. The first category consists of Medical Knowledge Retrieval Datasets (MKRD), which include both text-based tasks (e.g., MedQA, PubMedQA) and multimodal VQA datasets (e.g., Path-VQA, MedVidQA). The second category is Clinical Reasoning and Diagnostic Datasets (CRAD), which evaluate multi-step reasoning and clinical problem-solving using text-only tasks such as DDXPlus, and image+text tasks such as MIMIC-CXR.

The evaluated systems are also divided into three groups. The first contains single-agent models, primarily GPT-4, tested under different prompting strategies such as zero-shot, few-shot, and chain-of-thought. The second group contains multi-agent (single-model) systems, where multiple agents collaborate but all rely on the same underlying LLM type. The third group contains multi-agent (multi-model) systems, which combine different LLMs within the same team; this category also includes MDAgents, the focus of the study.

The results show several clear highlights. First, MDAgents outperforms every single-agent method across almost all benchmarks. This shows that even the strongest individual models struggle with complex medical tasks when operating alone, whereas adding agentic collaboration provides a measurable advantage. Second, MDAgents also generally surpasses previous multi-agent baselines. Since those systems were not specifically designed for medical tasks, they lack the adaptive team formation and role-based specialization that give MDAgents the upper hand.

Finally, the most prominent improvements appear in the CRAD benchmarks, which require multi-step clinical reasoning. MDAgents achieves its strongest gains here thanks to its feedback loop, allowing agents to catch contradictions and refine diagnostic reasoning through multiple rounds. The ICT configuration, with its diverse reasoning roles, performs particularly well on these high-complexity tasks, reinforcing the idea that agentic collaboration is especially effective for advanced clinical decision-making.

Performance Across Task Difficulty Levels

The second part of the performance evaluation examines how accuracy varies when different settings are applied to different task difficulty levels. The results show that solo settings perform best on low-complexity tasks, illustrating that using a multi-agent group on simple problems can introduce unnecessary complication and even reduce accuracy.
In contrast, group settings achieve the highest accuracy on high-complexity tasks, confirming that collaborative reasoning is especially valuable when a problem requires deeper analysis. These findings also demonstrate why selecting the appropriate team configuration based on task complexity is crucial.
For example, assigning an ICT team to an easy question may produce overly elaborate reasoning and hurt performance, while giving a high-complexity clinical problem to a single PCC would likely leave the problem inadequately addressed.

Ablation Study

The third element of the performance analysis concerns the impact of the Moderator’s Review and Retrieval-Augmented Generation (RAG). As reported in the paper’s ablation table, adding RAG to the base MDAgents framework increases the average accuracy across all datasets by nearly 5%. Introducing the Moderator’s Review produces an even larger improvement, raising the accuracy by about 8% over the base system. When both components are combined, the average accuracy increases by almost 12%, demonstrating their strong complementary effect.
These findings highlight that the contribution of the research extends beyond the core MDAgents architecture; the additional modules play a crucial role in further enhancing the performance and medical accuracy of the system.


4.2.2 MAC Rare Disease Diagnosis

Idea

Although it shares its core motivation with MDAgents, this paper [7] diverges from general medical decision making and focuses specifically on rare disease diagnosis through a Multi-Agent Conversation (MAC) framework. It demonstrates how multi-agent collaboration can overcome limitations that standard LLMs struggle with, such as limited training datasets and highly overlapping symptom profiles that often mislead single-model reasoning. By enabling agents to collaborate on differential diagnoses, the framework reduces hallucinations. Overall, this work further highlights the strong potential of multi-agent LLM systems in healthcare thanks to its promising results.

Methodology

Due to their similar structures, the methodologies of MAC and MDAgents intersect in several ways, which we will point out while presenting the workflow of the MAC framework.

The system begins by receiving a clinical case describing the patient’s symptoms and context. Because MAC has a multi-agent design, it assigns different roles to the agents it uses, defined as follows:

  • Generator / Initial Diagnoser – proposes first hypothesis list
  • Critic – checks inconsistencies, missing findings
  • Reviser – updates differential based on critique
  • Ranker – orders diagnoses by likelihood

Similarly to MDAgents, these agents engage in a conversational loop. The Generator first produces an initial differential list; the Critic then evaluates it and points out issues, which the Reviser uses to update the list accordingly. This cycle continues for a preset number of iterations or until the system converges. Afterwards, the Ranker produces a ranked list of diagnoses which, together with a reasoning explanation derived from the conversation, forms the system’s final output.
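
A minimal sketch of this conversational loop, under the assumption that agents maps the four roles above to LLM-backed callables and that the role interfaces are simplified for illustration:

    def mac_diagnose(case, agents, max_iterations=3, converged=None):
        """Generate, critique, and revise a differential, then return it ranked."""
        differential = agents["Generator"](case)                    # initial hypothesis list
        for _ in range(max_iterations):
            critique = agents["Critic"](case, differential)         # inconsistencies, missing findings
            revised = agents["Reviser"](case, differential, critique)
            if converged and converged(differential, revised):      # stop early on convergence
                differential = revised
                break
            differential = revised
        return agents["Ranker"](case, differential)                 # ordered by likelihood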

Results

Regarding the overall performance evaluation, the MAC framework achieves results that further support the superiority of multi-agent systems over single LLMs. As shown in Figure 5, compared to single-agent baselines based on GPT-3.5 and GPT-4, the experiments show that collaborative reasoning significantly improves diagnostic accuracy as well as the helpfulness of further recommended tests in both primary and follow-up consultations.

The paper also evaluates additional parameters to determine which configurations yield the best performance. One of these tests examines how accuracy changes when varying the number of agents. Measurements show that increasing the number of agents from 1 to 3 leads to systematic improvements in diagnostic accuracy. However, when increasing from 3 to 5 agents, the gains stabilize, and in some cases performance even decreases. This appears across both primary consultations and follow-up consultations, indicating that adding more agents does not always translate into better performance.

Another experiment involves removing the supervisor agent to measure its contribution. The results indicate that excluding the supervisor leads to noticeable drops in accuracy for both the most likely diagnosis and the list of possible diagnoses, while the accuracy related to recommending further diagnostic tests remains largely unchanged.

5. Conclusion

To conclude, all of the frameworks provide significant performance enhancement over state-of-the-art LLMs. CPathAgent achieves this by extending a single model with structured navigation, visual reasoning, and environment interaction, enabling it to handle high-resolution pathology tasks that traditional LLMs cannot approach. In contrast, MAC and MDAgents reach their gains by arranging multiple LLMs into coordinated teams, where collaboration, critique, and iterative refinement collectively elevate diagnostic accuracy and reasoning quality.

Together, these approaches illustrate that progress in medical AI comes not only from stronger models, but from designing systems that organize how these models think, act, and interact. Whether through a single agent executing a diagnostic workflow or a multi-agent team engaging in structured debate, agentic frameworks like CPathAgent, MAC and MDAgents unlock capabilities that bring medical AI closer to clinical reliability.

6. Review & Discussion

Review

After examining the three papers, it becomes clear that agentic AI is not a single technique of improvement but rather a family of architectural strategies, each unlocking a different set of capabilities.  On one hand, single-agent systems enhance an LLM with abilities such as interacting with its environment, planning and executing sequential actions, and following a predefined reasoning pipeline. These systems behave like a clinician who thinks and acts step-by-step through one unified mind. 

On the other hand, multi-agent system designs introduce strengths that cannot stem from a single model alone: team-based collaboration, structured critique and debate, iterative refinement, and more robust high-level clinical reasoning. Hence, multi-agent systems resemble a coordinated medical team, where the final outcome depends on specialists fulfilling their roles and collectively improving the decision quality.

Discussion

The results presented by the three papers we examined are of great importance to the medical field specifically, and to the development of AI in general. They represent a significant step toward incorporating intelligent systems into hospitals, and perhaps into other institutions as well. While this might make some people worry that intelligent systems are on their way to replacing humans in certain tasks, others consider it a win, since it will shift human effort toward tasks that require more creativity and make services more accessible even when no expert is present.

Future Insight

As discussed in Section 4, multi-agent systems reveal an important property: emergence. It is mostly viewed as a positive feature, since it opens new research directions that aim to understand it and leverage it as an advantage. However, it can also hinder performance, as emergent behaviors in LLMs may drift toward undesirable patterns such as deceptive or controlling tendencies, potentially endangering the stability of the entire multi-agent system.

These observations naturally raise two promising research questions.

First, how can we design collaboration algorithms that guarantee stable and predictable interaction between agents? This could extend to incorporating game-theoretic principles, such as enforcing Nash equilibria, to maintain stable cooperation.

Second, how do phenomena such as cooperation, competition, or spontaneous specialization arise when LLMs interact in a multi-agent environment? Understanding these dynamics could help shape safer and more effective multi-agent systems.

7. References

[1] Kaul, Vivek, Sarah Enslin, and Seth A. Gross. "History of artificial intelligence in medicine." Gastrointestinal endoscopy 92.4 (2020): 807-812.

[2] Thirunavukarasu, Arun James, et al. "Large language models in medicine." Nature medicine 29.8 (2023): 1930-1940.

[3] Gao, Shanghua, et al. "Empowering biomedical discovery with AI agents." Cell 187.22 (2024): 6125-6151.

[4] Sapkota, Ranjan, Konstantinos I. Roumeliotis, and Manoj Karkee. "AI agents vs. agentic AI: A conceptual taxonomy, applications and challenges." arXiv preprint arXiv:2505.10468 (2025).

[5] Sun, Yuxuan, et al. "CPathAgent: An Agent-based Foundation Model for Interpretable High-Resolution Pathology Image Analysis Mimicking Pathologists' Diagnostic Logic." arXiv preprint arXiv:2505.20510 (2025).

[6] Kim, Yubin, et al. "MDAgents: An adaptive collaboration of LLMs for medical decision-making." Advances in Neural Information Processing Systems 37 (2024): 79410-79452.

[7] Chen, Xi, et al. "Enhancing diagnostic capability with multi-agents conversational large language models." NPJ digital medicine 8.1 (2025): 159.

[8] De Zarzà, I., et al. "Emergent cooperation and strategy adaptation in multi-agent systems: An extended coevolutionary theory with LLMs." Electronics 12.12 (2023): 2722.

 

