
GPT-3.5 and GPT-4 excel at clinical reasoning

by NFI Editorial Team

A recent study published in npj Digital Medicine describes the development of diagnostic reasoning prompts to investigate whether large language models (LLMs) can simulate diagnostic clinical reasoning.

Study: Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. Image credit: chayanuphol/Shutterstock.com

LLMs, artificial intelligence systems trained on massive amounts of text, can perform tasks such as composing clinical notes and passing medical examinations. Understanding their clinical diagnostic reasoning abilities is crucial for their integration into medical care.

Prior studies focused on open-ended clinical questions and indicated that newer large language models such as GPT-4 have the potential to diagnose complex patient cases. Because LLM performance varies with the type of prompt and question, researchers have begun to use prompt engineering to address this variability.

About the Study

In this study, researchers evaluated the diagnostic reasoning of GPT-3.5 and GPT-4 on open-ended clinical questions, hypothesizing that the GPT models would perform better with diagnostic reasoning prompts than with conventional chain-of-thought prompts.

The team used a modified version of the MedQA USMLE (United States Medical Licensing Examination) dataset and the case series from the New England Journal of Medicine (NEJM) to compare conventional chain-of-thought prompts with several diagnostic reasoning prompts. These prompts mimicked the cognitive processes of forming a differential diagnosis, analytical reasoning, Bayesian inference, and intuitive reasoning.

They examined whether large language models could simulate clinical thinking using specific prompts, combining clinical domain knowledge with advanced prompt engineering techniques.

The team used prompt engineering to develop the diagnostic reasoning prompts and converted the questions from multiple-choice to free-response by removing the answer choices. They included only USMLE Step 2 and Step 3 questions that asked for a patient's diagnosis.
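The conversion from multiple-choice to free-response described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`question`, `options`, `answer` keys) and the rewording of "Which of the following" are hypothetical and are not taken from the study.

```python
import re

def to_free_response(item: dict) -> dict:
    """Drop the answer choices and keep only the stem and reference answer.

    Assumes a hypothetical record layout with 'question', 'options',
    and 'answer' keys; the study's actual data format may differ.
    """
    stem = item["question"].strip()
    # Assumption: reword the multiple-choice phrasing so the stem reads
    # naturally once no options follow it.
    stem = re.sub(r"\bWhich of the following\b", "What", stem)
    return {"question": stem, "answer": item["answer"]}

# Toy record for illustration only (not from MedQA).
example = {
    "question": "A 45-year-old man presents with chest pain. "
                "Which of the following is the most likely diagnosis?",
    "options": {"A": "Myocardial infarction", "B": "Pneumonia"},
    "answer": "Myocardial infarction",
}
free_response = to_free_response(example)
print(free_response["question"])
```

Keeping the reference answer alongside the stem is what lets human reviewers grade the model's free-text reply later.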

Each round of prompt engineering included an assessment of GPT-3.5 accuracy on the MedQA training set. The training and test sets contained 95 and 518 questions, respectively, and the test set was reserved for evaluation.

The researchers also evaluated GPT-4 performance on 310 cases recently published in the NEJM. They excluded 10 cases that had no final diagnosis or exceeded the maximum context length for GPT-4. On these cases, they compared conventional chain-of-thought prompts with the best-performing clinical diagnostic reasoning prompt from the MedQA dataset (differential diagnosis reasoning).

Each prompt included two example questions with rationales that used the target reasoning technique, a form of few-shot learning. Free-response questions from the USMLE and the NEJM case series were used to compare the prompt strategies during the study evaluation.
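To make the prompt structure concrete, here is a minimal sketch of how a prompt with two worked examples might be assembled. The `Example` class, the `build_prompt` function, the field labels, and the placeholder wording are all assumptions for illustration; the study's actual prompt text is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str   # clinical vignette
    rationale: str  # reasoning written in the target strategy
    answer: str     # final diagnosis

def build_prompt(instruction: str, examples: list[Example], question: str) -> str:
    """Assemble instruction + worked examples + the new question."""
    parts = [instruction.strip()]
    for ex in examples:
        parts.append(
            f"Question: {ex.question}\n"
            f"Rationale: {ex.rationale}\n"
            f"Answer: {ex.answer}"
        )
    # End with an open 'Rationale:' so the model reasons before answering.
    parts.append(f"Question: {question}\nRationale:")
    return "\n\n".join(parts)

# Placeholder content only; a real prompt would carry full vignettes.
shots = [
    Example("<vignette 1>", "<differential diagnosis reasoning 1>", "<diagnosis 1>"),
    Example("<vignette 2>", "<differential diagnosis reasoning 2>", "<diagnosis 2>"),
]
prompt = build_prompt(
    "Use step-by-step differential diagnosis reasoning.", shots, "<new vignette>"
)
print(prompt)
```

Swapping the rationales in `shots` for ones written in another style (analytical, Bayesian, intuitive) would change the reasoning strategy while leaving the rest of the prompt untouched.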

The physician authors, including attending physicians and an internal medicine resident, assessed the models' responses, with each question reviewed by two blinded physicians. A third researcher resolved any discrepancies, and the physicians verified answer accuracy using software when necessary.
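The review process above, two blinded graders plus a third who settles disagreements, can be expressed as a short Python sketch. The function name `adjudicate`, the boolean grading scheme, and the toy data are assumptions for illustration and do not reflect the study's actual tooling or results.

```python
def adjudicate(reviewer_a, reviewer_b, tiebreaks):
    """Combine two reviewers' correct/incorrect grades.

    reviewer_a, reviewer_b: lists of booleans, one grade per question.
    tiebreaks: dict mapping a question index to the third reviewer's
               grade, needed only where the first two disagree.
    Returns (final_grades, agreement_rate, accuracy).
    """
    if len(reviewer_a) != len(reviewer_b) or not reviewer_a:
        raise ValueError("grade lists must be non-empty and of equal length")
    final, agreed = [], 0
    for i, (a, b) in enumerate(zip(reviewer_a, reviewer_b)):
        if a == b:
            agreed += 1
            final.append(a)
        else:
            final.append(tiebreaks[i])  # third reviewer decides
    agreement_rate = agreed / len(final)
    accuracy = sum(final) / len(final)
    return final, agreement_rate, accuracy

# Toy grades for four questions (not study data).
grades_a = [True, True, False, True]
grades_b = [True, False, False, True]
final, agreement, accuracy = adjudicate(grades_a, grades_b, {1: True})
print(f"agreement={agreement:.0%}, accuracy={accuracy:.0%}")
```

The agreement rate here is simple percent agreement between the first two reviewers, which is the kind of figure the article reports as inter-reviewer consensus.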


The study shows that GPT-4 can be prompted to mimic physicians' clinical reasoning without compromising diagnostic accuracy. This is crucial for building trust in LLMs for patient care. The approach can help overcome the black-box limitations of LLMs and bring them closer to safe and effective use in medicine.

GPT-3.5 answered 46% of evaluation questions accurately with standard chain-of-thought prompts and 31% with zero-shot prompts without chain of thought. GPT-3.5 performed best with intuitive reasoning prompts (48%, versus 46% for traditional chain-of-thought prompts).

Compared to traditional chain-of-thought prompts, GPT-3.5 performed significantly worse with analytical reasoning prompts (40%) and differential diagnosis prompts (38%), while the lower accuracy with Bayesian inference prompts (42%) fell short of statistical significance. Inter-reviewer agreement for the GPT-3.5 MedQA evaluations was 97%.

The GPT-4 API returned errors for 20 test questions, limiting the test dataset to 498. GPT-4 showed higher accuracy than GPT-3.5. GPT-4 achieved accuracy of 76%, 77%, 78%, 78%, and 72% for traditional chain-of-thought, intuitive thinking, differential diagnostic thinking, prompts for analytical thinking, and Bayesian inferences, respectively. The consensus between reviewers for GPT-4 MedQA evaluations was 99%.

On the NEJM dataset, GPT-4 achieved an accuracy of 38% with conventional chain-of-thought prompts compared to 34% with differential diagnosis prompts (a difference of 4.2 percentage points). Inter-reviewer agreement for the GPT-4 NEJM evaluation was 97%. The authors report GPT-4's responses and rationales for the entire NEJM dataset. Prompts that promote stepwise reasoning and focus on a single diagnostic reasoning strategy performed better than those combining multiple strategies.

Overall, the study's findings indicate that diagnostic reasoning prompts let GPT-3.5 and GPT-4 imitate clinical reasoning but did not improve their accuracy. GPT-3.5 performed similarly with conventional and intuitive reasoning chain-of-thought prompts but fared worse with analytical and differential diagnosis prompts. Bayesian inference prompts also performed worse than traditional chain-of-thought prompts.

The authors propose three explanations for the difference: GPT-4's reasoning mechanisms could differ fundamentally from those of human providers; GPT-4 could be explaining its diagnostic assessments post hoc in the requested reasoning format; or GPT-4 could already be reaching the maximum accuracy attainable with the vignette data provided.
