Artificial intelligence has shown remarkable promise in healthcare, from reading X-rays to suggesting treatment plans. But when it comes to actually talking to patients and making accurate diagnoses through conversation — a cornerstone of medical practice — AI still has significant limitations, according to new research from Harvard Medical School and Stanford University.
Published in Nature Medicine, the study introduces an innovative testing framework called CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine) to evaluate how well large language models (LLMs) perform in simulated doctor-patient interactions. As patients increasingly turn to AI tools like ChatGPT to interpret symptoms and medical test results, understanding these systems’ real-world capabilities becomes crucial.
“Our work reveals a striking paradox — while these AI models excel at medical board exams, they struggle with the basic back-and-forth of a doctor’s visit,” explains study senior author Pranav Rajpurkar, assistant professor of biomedical informatics at Harvard Medical School. “The dynamic nature of medical conversations – the need to ask the right questions at the right time, to piece together scattered information, and to reason through symptoms – poses unique challenges that go far beyond answering multiple choice questions.”
The research team, led by senior authors Rajpurkar and Roxana Daneshjou of Stanford University, evaluated four prominent AI models across 2,000 medical cases spanning 12 specialties. Current evaluation methods typically rely on multiple-choice medical exam questions, which present information in a structured format. However, study co-first author Shreya Johri notes that “in the real world this process is far messier.”
Testing with CRAFT-MD revealed stark performance gaps between traditional evaluations and more realistic scenarios. On four-choice multiple-choice questions (MCQs), GPT-4’s diagnostic accuracy dropped from 82% when it read prepared case summaries to 63% when it had to gather the same information through dialogue. The decline was even more pronounced in open-ended scenarios without answer options, where accuracy fell to 49% with written summaries and to 26% during simulated patient interviews.
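To make the contrast concrete, here is a minimal sketch in Python of how a dialogue-based evaluation of this kind can be wired up, with one model playing the doctor and a second model playing a patient who only knows the case vignette. The `ask_doctor`/`ask_patient` wrappers, the prompts, and the turn limit are illustrative assumptions, not the study’s actual CRAFT-MD implementation.

```python
# Sketch of a dialogue-based evaluation loop in the spirit of CRAFT-MD.
# The Ask functions are hypothetical wrappers around whatever chat-completion
# API is available; prompts and turn budget are illustrative only.

from typing import Callable, Dict, List

Message = Dict[str, str]                      # {"role": ..., "content": ...}
AskFn = Callable[[List[Message]], str]        # chat history in, reply text out


def run_simulated_interview(case_vignette: str, ask_doctor: AskFn,
                            ask_patient: AskFn, max_turns: int = 8) -> str:
    """A 'doctor' model interviews a 'patient' model that only knows the vignette."""
    doctor_msgs: List[Message] = [{
        "role": "system",
        "content": ("You are a physician taking a history. Ask one question per turn. "
                    "When confident, reply 'FINAL DIAGNOSIS: <diagnosis>'."),
    }]
    patient_msgs: List[Message] = [{
        "role": "system",
        "content": "You are a patient. Answer only what is asked, using these facts:\n"
                   + case_vignette,
    }]

    for _ in range(max_turns):
        question = ask_doctor(doctor_msgs)
        doctor_msgs.append({"role": "assistant", "content": question})
        if "FINAL DIAGNOSIS:" in question:
            return question.split("FINAL DIAGNOSIS:", 1)[1].strip()

        patient_msgs.append({"role": "user", "content": question})
        answer = ask_patient(patient_msgs)
        patient_msgs.append({"role": "assistant", "content": answer})
        doctor_msgs.append({"role": "user", "content": answer})

    return ""  # the model never committed to a diagnosis within the turn budget


def run_vignette_prompt(case_vignette: str, ask_doctor: AskFn) -> str:
    """Static-summary baseline: the model sees the whole vignette at once."""
    return ask_doctor([{
        "role": "user",
        "content": "Read this case and give the single most likely diagnosis:\n"
                   + case_vignette,
    }])
```

The key design difference is that in the interview loop the doctor model must decide what to ask and when to stop, whereas the vignette baseline hands it a complete, pre-organized summary, which is the gap the accuracy figures above reflect.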
The AI models demonstrated particular difficulty synthesizing information from multiple conversation exchanges. Common problems included missing critical details during patient history-taking, failing to ask appropriate follow-up questions, and struggling to integrate various types of information, such as combining visual data from medical images with patient-reported symptoms.
Efficiency is another advantage of the framework: CRAFT-MD can process 10,000 conversations in 48 to 72 hours, plus 15 to 16 hours of expert evaluation. A comparable human-based evaluation would require extensive recruitment and an estimated 500 hours for patient simulations and 650 hours for expert assessments.
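A quick back-of-envelope calculation using the figures quoted above shows the scale of that difference; the per-conversation rates below are derived from those totals, not measured independently.

```python
# Rough comparison based on the totals reported in the article.
automated_hours = 72 + 16      # upper bound: 48-72 h of simulation + 15-16 h expert review
human_hours = 500 + 650        # estimated patient-simulation + expert-assessment time
conversations = 10_000

print(f"Automated: {automated_hours / conversations * 60:.2f} min per conversation")
print(f"Human:     {human_hours / conversations * 60:.2f} min per conversation")
print(f"Speed-up:  roughly {human_hours / automated_hours:.0f}x fewer hours overall")
```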
“As a physician scientist, I am interested in AI models that can augment clinical practice effectively and ethically,” says Daneshjou, assistant professor of Biomedical Data Science and Dermatology at Stanford University. “CRAFT-MD creates a framework that more closely mirrors real-world interactions and thus it helps move the field forward when it comes to testing AI model performance in health care.”
Based on these findings, the researchers offer comprehensive recommendations for AI development and regulation. These include building models capable of handling unstructured conversations, integrating multiple data types (text, images, and clinical measurements), and interpreting non-verbal communication cues. They also emphasize combining AI-based evaluation with human expert assessment to ensure thorough testing while avoiding premature exposure of real patients to unverified systems.
The study demonstrates that while AI shows promise in healthcare, current systems require significant advancement before they can reliably engage in the complex, dynamic nature of real doctor-patient interactions. For now, these tools may best serve as supplements to, rather than replacements for, human medical expertise.
Source: https://studyfinds.org/critical-flaws-in-medical-ai-systems-exposed/