AI tools face challenges in real-world medical conversations, says study
BOSTON, United States — Artificial intelligence (AI) tools are increasingly being explored for use in healthcare, with the potential to ease clinician workloads by triaging patients, taking medical histories, and even providing preliminary diagnoses.
However, a recent study led by researchers from Harvard Medical School (HMS) and Stanford University reveals that while these AI models excel in standardized medical tests, they struggle significantly in real-world medical conversations.
CRAFT-MD: A new benchmark for AI clinicians
Published in Nature Medicine, the study introduces CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine), a novel evaluation framework designed to simulate real-world doctor-patient interactions. Unlike traditional multiple-choice tests, CRAFT-MD assesses how well large language models (LLMs) can gather patient information through open-ended conversations and provide accurate diagnoses.
Researchers tested four LLMs across 2,000 clinical scenarios spanning primary care and 12 specialties. While the models performed well on exam-style questions, their diagnostic accuracy declined sharply when engaging in dynamic, conversational exchanges.
“Our work reveals a striking paradox — while these AI models excel at medical board exams, they struggle with the basic back-and-forth of a doctor’s visit,” said Pranav Rajpurkar, senior author of the study.
The real-world gap in AI diagnostic skills
The study identifies several challenges faced by AI clinicians:
- Difficulty asking relevant questions during patient history-taking
- Missing critical information scattered throughout conversations
- Struggling to synthesize unstructured data into accurate diagnoses
- Reduced performance in dynamic exchanges compared to structured formats
These limitations underscore the need for more realistic training and evaluation methods before AI tools are deployed in clinical settings.
Strategies to improve AI’s clinical performance
To address these gaps, the researchers propose several strategies for optimizing AI tools:
- Training models with open-ended, conversational datasets to reflect real-world interactions
- Enhancing capabilities to extract key information from unstructured inputs
- Developing systems that integrate textual data with non-textual inputs like images or lab results
- Incorporating nonverbal cues such as tone and body language into AI design
CRAFT-MD itself demonstrates one such approach: it uses an AI agent to simulate patient interactions, allowing diagnostic accuracy to be evaluated efficiently. The method processed thousands of conversations within hours while minimizing risk to real patients.
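For a concrete picture, the sketch below shows one way a conversational benchmark of this kind could be organized: a "patient" agent answers questions from a hidden case vignette, the model under test conducts the interview, and the final diagnosis is scored against a ground-truth label. The data structures, function names, and scoring rule here are illustrative assumptions, not the study's published implementation.

```python
# Minimal sketch of a simulated doctor-patient evaluation loop (assumed design,
# not the actual CRAFT-MD code). The two agents are supplied by the caller,
# e.g. as thin wrappers around LLM API calls.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ClinicalCase:
    vignette: str        # hidden case description the patient agent draws on
    true_diagnosis: str  # ground-truth label used for scoring

PatientAgent = Callable[[ClinicalCase, str], str]   # (case, question) -> answer
ClinicianModel = Callable[[list[str]], str]         # transcript -> next utterance

def run_case(case: ClinicalCase,
             clinician: ClinicianModel,
             patient: PatientAgent,
             max_turns: int = 10) -> bool:
    """Simulate one visit; return True if the final diagnosis matches the label."""
    transcript: list[str] = []
    for _ in range(max_turns):
        utterance = clinician(transcript)
        if utterance.startswith("DIAGNOSIS:"):
            predicted = utterance.removeprefix("DIAGNOSIS:").strip()
            return predicted.lower() == case.true_diagnosis.lower()
        transcript.append(f"Doctor: {utterance}")
        transcript.append(f"Patient: {patient(case, utterance)}")
    return False  # turn budget exhausted without a committed diagnosis

def accuracy(cases: list[ClinicalCase],
             clinician: ClinicianModel,
             patient: PatientAgent) -> float:
    """Diagnostic accuracy over the whole benchmark."""
    return sum(run_case(c, clinician, patient) for c in cases) / len(cases)
```

Because the patient is simulated in software, thousands of such visits can be run and scored in hours, which is the efficiency and safety advantage the researchers describe.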
“As a physician-scientist, I am interested in AI models that can augment clinical practice effectively and ethically,” said co-senior author Roxana Daneshjou from Stanford University.
The study underscores the importance of aligning AI tools with the complexities of actual medical practice before widespread deployment. By addressing these challenges, the researchers hope to pave the way for more reliable and effective AI applications in healthcare.