Author: Asmaa Elghitany
Supervisor: Kamilia Zaripov
Introduction
As we stand on the precipice of the digital age, the field of artificial intelligence (AI) has grown exponentially, interweaving into practically every business domain [1]. AI has the potential to integrate machines with human intellect, enabling them to replicate human behavior and perform tasks such as vision, speech recognition, and decision-making [2]. It is, without a doubt, one of the most disruptive technologies of the modern era, and along with its subsets, it has triggered a paradigm shift in almost every sector [3]. Its reach has far outgrown the realm of research, becoming a fundamental accelerator in many fields, particularly healthcare, and driving an unparalleled industrial revolution [4]. The medical field has always demonstrated an unwavering willingness to adopt emerging technologies to ensure that patients receive the best care possible [5]. As a result, it has paved the way for the development of countless techniques that automate the diagnostic process [7]. Increasingly over the last decade, researchers have been experimenting with deep learning and convolutional neural networks to identify the most predictive aspects of diseases directly from medical images [8]. One of AI's latest innovations, Large Language Models (LLMs), is being explored in medical diagnostic tools capable of interpreting complex patient data and assisting medical personnel in clinical decision-making [9]. Results have shown that deep learning algorithms may be a viable way to expand the availability of medical expertise, as they have demonstrated precise and rapid detection with improved clinical outcomes [10]. However, with the surge of research in this area, it is crucial to pause and critically assess these methods to ensure their effectiveness, accuracy, and safety [11].
Motivation
LLMs are now at the forefront of many AI research areas, with major potential still foreseen on the horizon. To get a glimpse of what LLMs are capable of and where they may go next, we must examine the latest research in the field. A deeper look into recent advancements and trends provides insight into the capabilities, limitations, and potential growth of this method. Despite the rapid advancement of other AI techniques like symbolic AI, rule-based systems, and classical machine learning, these methods required custom algorithms and carefully curated data for each specific task [18]. LLMs have overcome this major strain on time and resources with their versatility and unparalleled ability to generalize across different tasks with minimal customization. Their superior understanding of natural language allows them to maintain coherence and context in long texts, making them suitable for human-centered fields like healthcare [19]. LLMs have also shown great scalability, handling massive datasets and billions of parameters while maintaining remarkable performance. New approaches like few-shot and zero-shot learning show that extensive retraining is often unnecessary. LLMs can also be fine-tuned for specific use cases, such as medical diagnostics, and can process multimodal data, opening new avenues for diagnostic tools. Although training LLMs is computationally expensive, they can function across multiple use cases with little modification, reflecting a natural evolution in AI systems that surpasses their predecessors [20][3].
Related Work
In recent years, the field of medical artificial intelligence has undergone a major transformation with the rise of large language models (LLMs). Transformer models in particular, such as GPT-3, GPT-4, and BERT, have the remarkable ability to process vast amounts of unstructured data. This is why they have become the leading candidates for handling medical information such as medical records, patient histories, and clinical notes. This development ushered in a new era in medical diagnostics, built on LLMs' ability to understand and generate natural language [12].
Text-based approaches typically excel at processing complicated clinical data such as patient histories and clinical notes, extracting the relevant symptoms, and then predicting potential diagnoses. When incorporating retrieval-augmented generation (RAG), their diagnostic accuracy improved by up to 18%. When fine-tuned on biomedical datasets like PubMed, disease classification and clustering improved by 15%. Hybrid approaches have the advantage of combining LLMs' natural language capabilities with EHR data: structured data ranging from lab results to ICD codes. This integration bridges the gap between clinical narrative and quantifiable metrics, boosting the precision of disease diagnosis by 13% when using decision trees. Other hybrid models that integrated structured and unstructured data achieved accuracy gains of 21%. Multi-modal diagnostic approaches go further by integrating textual data with imaging and genetic information, aiming for greater model synergy and detection. These frameworks were tested with radiology reports and imaging data, yielding a 19% improvement in cardiovascular diagnostics, and with genetic profiles combined with clinical notes, yielding a 16% improvement. LLMs' performance is reflected in F1 scores of 0.80, with up to 30% improvement over traditional methods for diagnosing rare diseases [24].
Figure 1: Overview of the investigated LLM approaches
Figure 2: Overview of LLM techniques for disease diagnosis
LLM Approaches
Text-Based Approach
Large Language Models with Retrieval-Augmented Generation for Zero-Shot Disease Phenotyping[23]
This approach leverages Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG) techniques to enable zero-shot disease phenotyping. Zero-shot disease phenotyping is the process of identifying the characteristics unique to each disease in a zero-shot manner, meaning the model should be able to recognize and predict diseases it has never seen before, entirely avoiding the need of other approaches to train on labeled data.
Methodology
The core of the approach lies in the RAG framework, a hybrid model that consists of two consecutive steps: retrieval and generation.
Figure 8: Overview of the LLM-based phenotype architecture
Retrieval Step
In this step, the model performs an extensive search for information related to the target disease phenotype in external sources such as clinical databases, typically using vector-based search techniques or search engines that rank results by relevance to a query. The process starts by formulating a query that describes the targeted phenotype, encompassing a short description or simple symptoms. Afterwards, a pre-trained language model encodes the query into a vector representation, placing it in a high-dimensional space where similar information lies close together. The query is then used in a large-scale search to retrieve documents relevant to the disease; common methods are BM25 (an advanced ranking function) or dense retrieval with transformer-based models (like BERT or its variants). The retrieved information is then ranked by contextual relevance, with the top documents used in the next step for further processing. In this framework, regular expressions (regex) are used to match potential references to the targeted disease, such as mentions of PH or symptom patterns.
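As an illustrative sketch, regex-based snippet extraction for PH might look like the following. The patterns and the 40-character context window here are assumptions for demonstration, not the paper's actual rule set (which is shown in Figure 9):

```python
import re

# Hypothetical patterns standing in for the paper's PH-matching rules.
PH_PATTERNS = [
    re.compile(r"pulmonary\s+hypertension", re.IGNORECASE),
    re.compile(r"\bPH\b"),
    re.compile(r"elevated\s+pulmonary\s+artery\s+pressure", re.IGNORECASE),
]

def extract_snippets(note: str, window: int = 40) -> list[str]:
    """Return text windows around every match of a PH-related pattern."""
    snippets = []
    for pattern in PH_PATTERNS:
        for match in pattern.finditer(note):
            start = max(0, match.start() - window)
            end = min(len(note), match.end() + window)
            snippets.append(note[start:end])
    return snippets

note = "Echo suggests elevated pulmonary artery pressure; consider PH workup."
print(extract_snippets(note))
```

Each extracted snippet, rather than the whole note, is what gets passed downstream, keeping the text within the LLM's context budget.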
Figure 9: Regular expression rules utilized to identify pertinent text snippets for diagnosing pulmonary hypertension (PH)
Figure 10: Diagnostic and medication codes that make up the structured phenotype for PH.
Querying and Processing (MapReduce Approach)
In this step, MapReduce is used to process the retrieved snippets of clinical notes, which present a challenge because together they may exceed the context window of an LLM if processed all at once. In the map phase, each snippet is passed independently to the LLM with guided instructions on the context of that specific snippet. The LLM processes each snippet independently and evaluates whether it is relevant to the targeted disease phenotype. After all snippets are processed in parallel, the results are aggregated in a reduce phase to form a final decision. This aggregation ensures all relevant information is extracted and accounted for by the model. This map-then-reduce methodology efficiently handles large clinical texts without requiring the LLM to process them in one pass.
Aggregation
The outputs from the snippets are aggregated through two different strategies, LLM-based aggregation and max aggregation, to form the final diagnosis. LLM-based aggregation collects the outcome for each snippet, along with the reasoning behind it, into a larger prompt that the LLM then summarizes. This can be same-prompt aggregation, where the same prompt is used for every snippet, or diagnostic-focus aggregation, where a different prompt checks for a positive diagnosis suggested by any response. In contrast, max aggregation simply checks whether any snippet suggests the disease; if so, a positive diagnosis is assigned to the patient, ensuring that even the smallest mention of the disease is included in the final decision. After the aggregation step, the model generates a final output that includes the potential diseases and any other relevant clinical information such as diagnostic tests.
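The map-then-aggregate flow can be sketched as below, with a keyword stub standing in for the LLM relevance call; `llm_judge` is a made-up placeholder, not the paper's model:

```python
def llm_judge(snippet: str) -> bool:
    """Map phase: judge one snippet independently (stub for an LLM call)."""
    return "pulmonary hypertension" in snippet.lower()

def max_aggregate(snippets: list[str]) -> bool:
    """Reduce phase: max aggregation — any positive snippet yields a positive diagnosis."""
    return any(llm_judge(s) for s in snippets)

snippets = [
    "Patient denies chest pain.",
    "Findings consistent with pulmonary hypertension.",
]
print(max_aggregate(snippets))  # True
```

LLM-based aggregation would instead collect every per-snippet verdict and rationale into one prompt for a final LLM pass, trading simplicity for richer reasoning.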
Results and Discussion
The results proved the robustness of the methodology in extracting only the relevant information and generating accurate disease phenotypes without additional fine-tuning. The system retrieved data from 29 different data types, proving its ability to handle diverse clinical data. The authors tested several prompt designs, and the final one, using chain-of-thought (CoT) prompting, improved the F1 score greatly to 0.75. The max aggregation strategy showed the highest and most stable F1 at 0.73, compared with the other aggregation methods' range of 0.67-0.68. The LLM-based model outperformed the structured phenotype baseline in F1 score and identified twice as many PH cases.
Figure 11: An illustrative example of zero-shot prompt.
Figure 12: Test set performance of three variants of the LLM-based phenotype compared to structured phenotype.
Hybrid Models Approach
Health-LLM: A Personalized Retrieval-Augmented Disease Prediction System[21]
Patient diagnosis varies significantly on a case-by-case basis, requiring the patient's specific medical history, symptoms, and case complexity to be accounted for through individual adaptation. The proposed approach addresses the shortcomings of models that ignore contextual patient data, providing predictions tailored to each unique patient's case. This solution also aims to improve the interpretability of EHR data for predictions. The system's novel approach combines LLMs with a retrieval-augmented question answering (RAQA) mechanism for more accurate and personalized predictions. The novelty lies in using the RAQA mechanism to dynamically draw on EHRs, giving the model contextual understanding of each patient's data.
Methodology
This approach combines LLMs with retrieval-augmented techniques, centering on the use of patient-specific data combined with large-scale healthcare information for more accurate prediction. The general idea is that the system retrieves the needed information from Electronic Health Records (EHRs), then uses a large language model for contextual understanding, on which the predictions are later built. The method collects all available patient medical records from EHRs or clinical datasets like MIMIC-III, which are typically structured data such as lab results, symptoms, demographics, and medical history. The data then goes through standard data cleaning and feature engineering, with additional natural language processing to extract information from unstructured data. The retrieval-augmented approach, a focal point of Health-LLM, retrieves all relevant clinical data based on the current query's symptoms and condition. The approach follows a step-by-step procedure for each query.
- The retrieval process is initiated with query generation: the patient's specific symptoms, history, or condition are entered as the input query for the retrieval system. The query does not need to be highly detailed or time-consuming; a short, precise description of the symptoms, such as "fatigue, chest pains", is enough for the system to operate.
- The input query is then used to retrieve similar cases or historical medical records from the database using search techniques like vector embeddings or nearest-neighbor search. The retrieval process essentially looks for trends or patterns shared between the current query and the historical data, to find similarities that would suggest relevant findings for a possible diagnosis.
- The system then passes all the relevant retrieved records to the LLM (e.g., GPT, BERT, etc.) for processing. The contextual nature of LLMs enables them to interpret the retrieved information in the context of the specific case at hand. This interpretation is then combined with the structured data of the case and used to analyze and generate predictions.
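The nearest-neighbor retrieval step above can be sketched with cosine similarity over toy embeddings. The three-dimensional vectors below are invented stand-ins for what a real encoder (e.g., a BERT-style model) would produce:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings for three historical records.
records = {
    "record_1 (cardiac history)": [0.9, 0.1, 0.0],
    "record_2 (fatigue, chest pain)": [0.8, 0.6, 0.1],
    "record_3 (dermatology visit)": [0.0, 0.1, 0.9],
}

query = [0.85, 0.5, 0.05]  # stand-in embedding of the query "fatigue, chest pains"
ranked = sorted(records, key=lambda r: cosine(query, records[r]), reverse=True)
print(ranked[0])  # the most similar historical record
```

The closest record by cosine similarity is handed to the LLM as retrieved context for the current case.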
Upon completion of the retrieval and contextualization of the data, the prediction process starts with classification. After in-context learning with the LLM, the model uses machine learning classifiers trained on historical patient records to predict disease likelihood. The classification framework covers 61 disease labels, ranging from common mild cases like indigestion to complex conditions. The proposed approach leverages LlamaIndex for extracting feature scores, in addition to XGBoost for feature learning. The predictions are expressed in a binary format, with 1 indicating a strong association or higher severity of the condition and 0 indicating no apparent association or lower severity.
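The binary 1/0 output format can be illustrated by thresholding per-disease scores. The three labels, the scores, and the 0.5 cutoff below are placeholders for demonstration, not values from the paper:

```python
# A small subset standing in for the framework's 61 disease labels.
disease_labels = ["indigestion", "hypertension", "diabetes"]

# Hypothetical classifier scores for one patient.
scores = {"indigestion": 0.82, "hypertension": 0.31, "diabetes": 0.66}

# 1 = strong association / higher severity, 0 = no apparent association.
predictions = {d: int(scores[d] >= 0.5) for d in disease_labels}
print(predictions)  # {'indigestion': 1, 'hypertension': 0, 'diabetes': 1}
```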
Figure 1: An example of the interactive process in case study.
Domain- and context-specific knowledge proved most significant when improving the predictions through feature preprocessing. The Context-Aware Automated Feature Engineering (CAAFE) methodology, powered by LLMs, is used to iteratively generate features with semantic relevance to the working context. The system is fully automated, relying solely on LLMs not only to craft dataset descriptions but also to generate health suggestions based on the disease predictions. Upon completion of the prediction process, the system generates health recommendations specific to the patient's unique medical history. The system was evaluated in a case study using the IMCS-21 dataset to test its accuracy and compare it with other models. The IMCS-21 dataset includes annotated samples of medical dialogues that were converted into Electronic Patient Records (EPRs) for preprocessing. The validation metrics for the binary classification framework are accuracy (ACC) and macro F1-score.
Results and Discussion
The results from Health-LLM demonstrated its ability to improve disease detection accuracy compared to traditional methods. By using retrieval-augmented techniques, this approach achieved higher accuracy than models that relied on fixed features. The model showed marked improvement in the prediction of both common and rare diseases, averaging 15-20% over the baseline AI models. Time efficiency improved as well, with sub-2-second response times on average, a remarkable result given how computationally intensive LLMs are. This indicates that the system can perform in a highly speed-oriented environment like healthcare, where clinical applications must remain viable tools in emergency or resource-constrained situations. The contextual nature of the method gave the model the ability to recognize patterns related to each patient's unique clinical history. The retrieval-augmented techniques proved to be of great help in prediction, as they integrate both structured and unstructured data into the decision-making process. The fast retrieval also makes the system applicable in the real world, as it can incorporate new clinical knowledge from recent research, ensuring that predictions stay up to date and closing possible gaps in clinicians' knowledge. The system also provided insights into the reasoning behind the predictions, so a clinician can trace back the logic behind a diagnosis. This is sorely needed in the healthcare system, where predictions must be highly interpretable while dealing with multimodal data and sometimes complex diseases. Accordingly, physicians rated the approach highly on explainability metrics, as it even provides visualizations of the reasoning and symptoms that drive each prediction.
In addition, the model generalized well across different disease categories, as shown in experiments using very diverse datasets like MIMIC-III, where conditions ranged from heart disease to diabetes. The system demonstrated higher diagnostic accuracy, with 83% accuracy and a 0.762 F1-score, than traditional methods and direct LLM inference with GPT-3.5 and GPT-4 in zero-shot and few-shot settings, which can be attributed to those models' weaker grasp of multi-class prediction and long medical dialogue. Fine-tuned LLaMA-2 performed well, with scores of 0.710 and 0.730, but remained below Health-LLM. The retrieval mechanisms and Context-Aware Automated Feature Engineering (CAAFE) proved critically important to the performance, as removing either technique drops the metrics by a considerable amount.
Figure 2: Comparing with existing methods on the Health-LLM diagnostic test.
Figure 3: Ablation studies.
Hybrid Models with Structured Data Approach
LLMs-based Few-Shot Disease Predictions using EHR: A Novel Approach Combining Predictive Agent Reasoning and Critical Agent Instruction[22]
Despite the progress made with LLMs in clinical settings, they still face unresolved challenges hindering full deployment. The annotated data needed for training AI models is a scarce resource in the medical domain, and large-scale annotation is expensive and time-consuming. While vital clinical data sources like Electronic Health Records (EHRs) are available, their noisy and unstructured nature makes them difficult to use for disease prediction. Additionally, traditional models struggle with context understanding, which is essential for reasoning through clinical scenarios in disease prediction. This novel framework seeks to close these gaps by combining Predictive Agent Reasoning (PAR) and Critical Agent Instruction (CAI) over EHRs to improve the prediction process, meeting the practical requirements of a clinical environment while leveraging the latest AI for accurate diagnostic decisions.
Methodology
This proposed framework aims to investigate two main aspects:
- Evaluating the performance of LLMs on EHR-based prediction
- Proposing the use of various LLM agents in a collaborative setting to produce a higher-performing model
Evaluating the performance of LLMs on EHR-based prediction
This approach transforms EHR data into an unstructured format by mapping three types of medical codes (medications, procedures, and diseases) to their names. It then evaluates the disease prediction accuracy of LLMs in zero-shot and few-shot settings.
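The code-to-name mapping can be sketched as a simple lookup that turns structured codes into readable text for the LLM; the codes and names below are illustrative, not drawn from the paper's vocabulary:

```python
# Hypothetical code-to-name dictionary covering the three code types
# (disease, medication, procedure).
code_names = {
    "ICD9:272.4": "hyperlipidemia",
    "RX:617312": "atorvastatin",
    "CPT:93000": "electrocardiogram",
}

def ehr_to_text(codes: list[str]) -> str:
    """Render a coded EHR entry as natural-language text an LLM can read."""
    names = [code_names.get(c, c) for c in codes]  # unknown codes pass through
    return "Patient record: " + ", ".join(names) + "."

print(ehr_to_text(["ICD9:272.4", "RX:617312", "CPT:93000"]))
```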
In the zero-shot approach, prompt engineering was used to guide the LLMs and improve their performance. To give the LLMs added guidance and additional context, a set of prompts was developed specifically for EHR-based predictions.
- Chain-of-thought (CoT) reasoning was used to prompt the LLMs to give an explanation for each step.
- Factor interactions were incorporated to prompt the LLMs to observe the relations between different medical factors like diseases, medications, and procedures.
- Prevalence information was added to improve the predictions by providing prevalence statistics as extra context.
In the few-shot approach, prediction categories were established by selecting exemplars in the form of a small number of randomly chosen labeled samples (positive and negative). The exemplars' main role is to provide the LLMs with task-specific examples that can be utilized in the learning process. Through these steps the LLMs can adapt to the characteristics of the EHR prediction task while simultaneously utilizing their existing knowledge. This approach guides LLMs to identify the most task-relevant patterns within each established prediction category.
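Assembling a few-shot prompt from labeled exemplars might look like this sketch; the wording and the exemplar records are assumptions for illustration, not the paper's actual prompts:

```python
def build_few_shot_prompt(exemplars: list[tuple[str, bool]], query: str) -> str:
    """Assemble a few-shot prompt: labeled exemplars followed by the new case."""
    parts = ["Predict whether the patient will develop the target disease."]
    for record, label in exemplars:
        parts.append(f"Record: {record}\nAnswer: {'yes' if label else 'no'}")
    parts.append(f"Record: {query}\nAnswer:")
    return "\n\n".join(parts)

# Made-up positive and negative exemplars.
exemplars = [
    ("hyperlipidemia, metformin, ECG", True),
    ("seasonal allergies, loratadine", False),
]
prompt = build_few_shot_prompt(exemplars, "hypertension, insulin, stress test")
print(prompt)
```

The trailing "Answer:" leaves the completion slot for the LLM, which is the usual few-shot pattern.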
Collaborative LLM Framework
Figure 4: The framework of EHR-CoAgent employs two LLM agents
The potential of LLMs has surpassed the limits of single-agent applications; thus a collaborative framework with multiple LLMs, each assigned a specific role, is well suited to addressing complex problems while simultaneously enhancing performance. The paper proposes EHR-CoAgent, a novel approach that uses multiple LLM agents to improve the prediction performance on EHRs. EHR-CoAgent consists of two main agents: a predictor agent 𝒫LLM and a critic agent 𝒦LLM.
Predictor agent: an LLM in a few-shot prediction setting, focused on generating explanatory reasoning based on the input EHR data. The agent analyzes the patient's medical history ℋi to gather the relevant information, then generates the most likely predictions 𝒟i^. It then provides a detailed explanation of the reasoning behind the prediction, ℛi, which is crucial for enhancing the interpretability of the predictions for further analysis and validation. The detailed prompt for the predictor agent is shown in Figure 5.
Figure 5: Prompt for Predictor Agent in EHR-CoAgent for the CRADLE dataset
Critic agent: an LLM agent that observes sampled wrong predictions from the predictor agent and analyzes the inconsistencies between the generated predictions and the ground truth labels. The critic agent does this for each batch ℬj to identify patterns in the errors and refine the prediction process. Based on this analysis, the process repeats iteratively m times for each batch to generate a set of instructional feedback {ℱj}. The detailed prompt for the critic agent is shown in Figure 6. The framework aims to create in-context learning by incorporating the instructional feedback from the critic agent back into the predictor agent's prompt, enhancing prediction accuracy.
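The predictor-critic loop can be sketched with stubbed agents. In EHR-CoAgent both roles are LLM calls; here `predictor` and `critic` are toy keyword functions invented purely to show the feedback flow:

```python
def predictor(record: str, feedback: list[str]) -> bool:
    """Predict positive iff a risk keyword appears, unless feedback refines the rule."""
    risky = "statin" in record
    if "do not rely on medication names alone" in feedback:
        risky = risky and "lipid" in record  # require corroborating evidence
    return risky

def critic(predictions: list[bool], labels: list[bool]) -> str:
    """Inspect mispredictions and emit one instruction for the predictor."""
    errors = [p for p, y in zip(predictions, labels) if p != y]
    if errors:
        return "do not rely on medication names alone"
    return ""

records = ["statin refill", "lipid panel, statin"]
labels = [False, True]

feedback: list[str] = []
preds = [predictor(r, feedback) for r in records]  # first pass: [True, True]
instruction = critic(preds, labels)
if instruction:
    feedback.append(instruction)                   # feedback re-enters the prompt
preds = [predictor(r, feedback) for r in records]  # refined pass: [False, True]
print(preds)
```

The essential point is that the critic never predicts; it only turns error patterns into instructions that reshape the predictor's next prompt.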
Figure 6: Prompt for Critic Agent in EHR-CoAgent for the CRADLE dataset.
An experiment was conducted to test the proposed framework using two datasets, MIMIC-III and CRADLE. MIMIC-III covers acute conditions like disorders of lipid metabolism, while CRADLE predicts cardiovascular disease endpoints in type 2 diabetes patients. Both datasets exhibited class imbalance, with lipid disorders at 27.6% in MIMIC-III and cardiovascular disease endpoints at 21.4% in CRADLE. Accuracy, sensitivity, specificity, and F1 score were used for evaluation. The study compares EHR-CoAgent with traditional machine learning methods like decision trees, logistic regression, and random forests, and with single-agent LLM approaches using GPT-4 and GPT-3.5. The LLM models are evaluated in zero-shot, zero-shot with additional prompt information, and few-shot settings, while the ML models are trained in fully supervised and few-shot settings. The study aims to compare the different models to assess the advantages of collaborative LLM agents over traditional methods.
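The four evaluation metrics can be computed from a confusion matrix as follows. The prediction and label vectors are made-up toy data, and the sketch assumes each class occurs at least once so the divisions are safe:

```python
def confusion_metrics(preds: list[int], labels: list[int]) -> dict:
    """Accuracy, sensitivity, specificity, and F1 from binary predictions."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    sensitivity = tp / (tp + fn)   # recall on positives
    specificity = tn / (tn + fp)   # recall on negatives
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / len(labels),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

preds  = [1, 1, 0, 0, 1, 0]
labels = [1, 0, 0, 0, 1, 1]
print(confusion_metrics(preds, labels))
```

With this toy data the model trades a false positive for a false negative, so sensitivity and specificity come out equal; the real results below show LLMs skewing toward higher sensitivity.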
Results and Discussion
Traditional machine learning models trained on relatively large datasets perform well at predicting diseases. Decision trees and logistic regression achieved good accuracies on datasets such as MIMIC-III and CRADLE (11,353 and 34,404 samples, respectively). However, these traditional ML models struggle with few-shot learning, highlighting their reliance on large amounts of labeled data, which is not ideal for the medical environment, with its scarce data and need for adaptability.
Large language models (LLMs) demonstrated strong performance in zero-shot and few-shot learning compared to traditional ML models. In these settings, LLMs exhibit lower specificity (higher false-positive rates) but higher sensitivity in identifying positive cases. This reflects a cautious bias: LLMs accept more false positives to minimize the risk of missing true positives. Additional prompt information can improve zero-shot performance, but the gains depend heavily on the prompts used, emphasizing quality and design in prompt engineering. Few-shot learning consistently outperformed zero-shot learning, even with additional prompts: supplying task-specific examples, even in small quantities, is essential for better accuracy and for navigating the nuances of EHR prediction tasks. The proposed EHR-CoAgent framework outperformed single-agent LLMs and traditional ML models, attaining an F1 score of 60.21% on the CRADLE dataset and 73.88% on the MIMIC-III dataset.
Figure 7: presents the experimental results on the two datasets.
Comparison
| Paper | Health-LLM: A Personalized Retrieval-Augmented Disease Prediction System | LLMs-based Few-Shot Disease Predictions using EHR | Large Language Models with Retrieval-Augmented Generation for Zero-Shot Disease Phenotyping |
|---|---|---|---|
| Methodology | Personalized EHR-based disease prediction | EHR data with few-shot learning | Retrieval-Augmented Generation (RAG) for disease phenotyping with Zero-Shot |
| Application | Personalized health diagnostics with retrieval-augmented systems | Disease prediction by combining both predictive agent reasoning and critical agent instruction | Zero-shot learning for identifying diseases and phenotyping using structured EHR data |
| Base Model | GPT-4 with retrieval augmentation | GPT-based architecture with additional specialized reasoning and instruction modules | GPT with RAG integration |
| Advantages | Personalized, context-aware predictions with sub-2-second responses and explainable reasoning | Works with scarce labeled data via few-shot learning; critic feedback iteratively refines predictions | Requires no fine-tuning or labeled data; handles diverse clinical data types |
| Limitations | Performance drops considerably if the retrieval or CAAFE components are removed | Lower specificity (more false positives) than supervised baselines | Constrained by the LLM context window, requiring MapReduce processing; sensitive to prompt and aggregation design |
Conclusion
In this post we briefly discussed three equally exciting approaches to disease prediction using LLMs. Each tackled a different challenge, from high levels of personalization, to limited training data, to generalization to rare diseases. However, their real-world adaptability depends heavily on how they address their respective limitations. Together, they offer insight into how each approach uniquely tackles the problems of LLMs, forming complementary solutions that bridge the gap between research and clinical practice.
References
- Russell, S., & Norvig, P. (2020). Artificial Intelligence: A Modern Approach. Pearson.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- Chui, M., Manyika, J., & Miremadi, M. (2016). Where machines could replace humans—and where they can’t (yet). McKinsey Quarterly.
- Topol, E. J. (2019). Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again. Basic Books.
- Krittanawong, C., et al. (2017). "Artificial Intelligence in Cardiology." Journal of the American College of Cardiology, 69(21), 2634-2645.
- Esteva, A., et al. (2019). "A guide to deep learning in healthcare." Nature Medicine, 25(1), 24-29.
- Litjens, G., et al. (2017). "A survey on deep learning in medical image analysis." Medical Image Analysis, 42, 60-88.
- Rajpurkar, P., et al. (2017). "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning." arXiv preprint arXiv:1711.05225.
- Huang, G., et al. (2020). "Deep Learning for Medical Image Analysis: Overview, Challenges and the Future." Medical Image Analysis, 66, 101797.
- McKinsey Global Institute. (2019). "AI in healthcare: The future is now." McKinsey & Company.
- Obermeyer, Z., & Emanuel, E. J. (2016). "Predicting the Future—Big Data, Machine Learning, and Health Care." New England Journal of Medicine, 375(13), 1216-1219.
- Caruana, R., Gehrke, J., Koch, P., & Sturm, M. (2015). Intelligible Models for Healthcare: Predicting Pneumonia Risk and Hospital 30-Day Readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1721-1730.
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019.
- Liu, Z., Ott, M., Goyal, N., & Du, J. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
- Obermeyer, Z., Powers, B. W., Vogeli, C., & Mullainathan, S. (2019). Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations. Science, 366(6464), 447-453.
- Peng, Y., Yan, S., & Sun, M. (2021). MedBERT: Pretrained Contextualized Embeddings on Large-Scale Structured Electronic Health Record Data for Disease Prediction. IEEE Transactions on Biomedical Engineering, 68(3), 956-965.
- Rajkomar, A., Dean, J., & Kohane, I. (2019). Machine Learning for Healthcare: On the Verge of a Major Transformation.
- Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems (NeurIPS 2020).
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Radford, A., Wu, J., Child, R., & Luan, D. (2019). Language Models are Unsupervised Multitask Learners. OpenAI.
- Jin, M., Yu, Q., Shu, D., Zhang, C., Fan, L., Hua, W., Zhu, S., Meng, Y., Wang, Z., Du, M., & Zhang, Y. (2024). Health-LLM: A personalized retrieval-augmented disease prediction system. arXiv. https://arxiv.org/abs/2402.00746
- Cui, H., Shen, Z., Zhang, J., Shao, H., Qin, L., Ho, J. C., & Yang, C. (2024). LLMs-based few-shot disease predictions using EHR: A novel approach combining predictive agent reasoning and critical agent instruction. arXiv. https://arxiv.org/abs/2403.15464
- Thompson, W., Vidmar, D., De Freitas, J., Altay, G., Manghnani, K., Nelsen, A., Morland, K., Pfeifer, J., Fornwalt, B., Chen, R., Stumpe, M., & Miotto, R. (2023). Large Language Models with Retrieval-Augmented Generation for Zero-Shot Disease Phenotyping. Proceedings of NeurIPS Workshop: Deep Generative Models for Health. Retrieved from https://neurips.cc
- Zhou, S., Xu, Z., Zhang, M., Xu, C., Guo, Y., Zhan, Z., Ding, S., Wang, J., Xu, K., Fang, Y., Xia, L., Yeung, J., Zha, D., Melton, G. B., Lin, M., Zhang, R. (2024). Large Language Models for Disease Diagnosis: A Scoping Review. Journal of Computational Health Sciences, 1(1), 1-23.













