AI can diagnose illnesses with 94.9% accuracy, but when humans use the AI, the accuracy drops sharply to 34.5%.



The chat AI ChatGPT has been shown not only to pass the medical licensing exam, but also to produce answers that patients prefer over those of human doctors. Chat AI is already being used in a variety of medical settings, but the latest research reveals that when humans use a chat AI, diagnostic accuracy drops significantly.

Clinical knowledge in LLMs does not translate to human interactions
(PDF file) https://arxiv.org/pdf/2504.18919



Just add humans: Oxford medical study underscores the missing link in chatbot testing | VentureBeat
https://venturebeat.com/ai/just-add-humans-oxford-medical-study-underscores-the-missing-link-in-chatbot-testing/

On April 26, 2025, researchers at the University of Oxford published a paper titled 'Clinical knowledge in LLMs does not translate to human interactions.' In the study, a research team led by Dr. Adam Mahdi recruited 1,298 participants and presented each with a scenario representing one of a range of medical conditions, from pneumonia to the common cold.

The scenarios included details of daily life and medical history. For example, one scenario described a patient who developed a severe headache during a night out with friends. It contained medically important information, such as the patient having difficulty looking down, alongside misleading details, such as the patient's regular drinking, living in a shared house with six friends, and being under stress.

Participants were required to interact with the LLM at least once based on the provided scenario, but could query it as many times as they wanted until they settled on a self-diagnosis and an intended course of action. The study covered three different LLMs: OpenAI's GPT-4o, Meta's Llama 3, and Cohere's Command R+.

The research team also recruited doctors, separately from the participants, to determine the 'most accurate diagnosis' and the appropriate course of action for each scenario, and then analyzed how well the diagnoses and actions the participants arrived at through their dialogue with the LLM matched those of the doctors. For example, in the scenario describing a subarachnoid hemorrhage, the correct course of action was to 'visit the emergency room immediately.'
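As a rough sketch of how this matching could be scored, the snippet below checks each participant's final answer against a doctor-provided gold standard. The data structures, names, and example values are hypothetical illustrations, not taken from the paper.

```python
# Hypothetical scoring sketch: each scenario has a doctor-provided gold standard
# (relevant conditions and the correct action), and each participant's final
# answer is checked against it.
from dataclasses import dataclass

@dataclass
class GoldStandard:
    conditions: set[str]   # conditions the doctors considered relevant
    action: str            # correct course of action, e.g. "emergency"

@dataclass
class ParticipantAnswer:
    scenario_id: str
    conditions: set[str]   # conditions the participant named after using the LLM
    action: str            # action the participant said they would take

def score(answers: list[ParticipantAnswer],
          gold: dict[str, GoldStandard]) -> tuple[float, float]:
    """Return (share naming >=1 relevant condition, share choosing the correct action)."""
    hit_condition = sum(
        bool(a.conditions & gold[a.scenario_id].conditions) for a in answers
    )
    hit_action = sum(a.action == gold[a.scenario_id].action for a in answers)
    n = len(answers)
    return hit_condition / n, hit_action / n

# Example: one subarachnoid-hemorrhage scenario scored for two participants.
gold = {"sah": GoldStandard({"subarachnoid hemorrhage"}, "emergency")}
answers = [
    ParticipantAnswer("sah", {"migraine"}, "wait and see"),
    ParticipantAnswer("sah", {"subarachnoid hemorrhage"}, "emergency"),
]
print(score(answers, gold))  # (0.5, 0.5)
```

The two rates computed here correspond to the kinds of figures reported below: how often at least one relevant condition was identified, and how often the correct course of action was chosen.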



An LLM that can score highly on the medical licensing exam might seem like an ideal tool for the general public to use for self-diagnosis. However, participants who used an LLM identified at least one relevant condition in at most 34.5% of cases, while participants who did not use an LLM did so 47.0% of the time. By contrast, the LLMs on their own, given the scenarios directly, identified the relevant condition 94.9% of the time.

Similarly, participants using an LLM identified the correct course of action only 44.2% of the time, compared with 56.3% when the LLM was used on its own.

When the research team examined the interactions between the participants and the LLM, it became clear that participants often gave the LLM incomplete information, which prevented it from making a correct diagnosis. In one case, a participant whose scenario described gallstone symptoms told the LLM, 'I have severe stomach pain that lasts about an hour, and I sometimes feel nauseous and get pain when I eat takeout food,' without providing any information about the location, severity, or frequency of the pain. Command R+ then incorrectly diagnosed the case as indigestion.

Even when the LLM returned correct information, participants did not necessarily follow its suggestions: although GPT-4o suggested at least one condition relevant to the scenario in 65.7% of its conversations, less than 34.5% of participants' final answers reflected those suggestions.



'For those of us who are old enough to remember the early days of internet search, this is déjà vu,' Natalie Volkheimer, a user experience specialist at the Renaissance Computing Institute at the University of North Carolina at Chapel Hill, said of the findings. 'As a tool, an LLM requires prompts of a certain quality, especially if you expect high-quality output.'

'Someone who is experiencing blinding pain is not going to be able to provide the LLM with the right information,' said Volkheimer, pointing out that people who are actually suffering from a medical condition will be less able to give the LLM accurate information.

'There's a reason clinicians who interact with patients are trained to ask questions in a certain way and with a certain repetition,' she said: patients leave out information because they don't know what's important, or they lie because they're embarrassed or ashamed.



The technology media outlet VentureBeat wrote, 'This study is an important warning for AI engineers and orchestration experts: if an LLM is designed for human interaction, relying solely on non-interactive benchmarks risks giving a false sense of security about its real-world performance. If you're designing an LLM for human interaction, you should test it with humans, not on humans.' In other words, AI should be tested under the assumption that it will be used by humans.
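The distinction between benchmark-style and interactive testing can be made concrete with a small sketch. Everything below (the `ask_llm` stub, the `simulated_patient_reply` callback, the message format) is a hypothetical illustration of evaluating a model through a multi-turn exchange rather than on a fully specified scenario; it is not code from the study.

```python
# Hypothetical sketch contrasting a static benchmark with an interactive test.

def ask_llm(messages: list[dict]) -> str:
    # Stub standing in for a real call to GPT-4o, Llama 3, Command R+, etc.
    return "Possible cause: ... Do you also have nausea or a stiff neck?"

def static_benchmark(scenario: str) -> str:
    # Non-interactive evaluation: the model sees the full, well-formed scenario.
    return ask_llm([{"role": "user", "content": scenario}])

def interactive_test(initial_complaint: str,
                     simulated_patient_reply, max_turns: int = 5) -> str:
    # Interactive evaluation: the model only gets what the "patient" chooses to
    # say, and must elicit the rest turn by turn.
    messages = [{"role": "user", "content": initial_complaint}]
    answer = ""
    for _ in range(max_turns):
        answer = ask_llm(messages)
        messages.append({"role": "assistant", "content": answer})
        reply = simulated_patient_reply(answer)  # may be vague or incomplete
        if reply is None:  # the patient stops responding
            break
        messages.append({"role": "user", "content": reply})
    return answer

# Usage: a "patient" who answers the first follow-up vaguely, then stops.
replies = iter(["It hurts a lot, I guess.", None])
print(interactive_test("I have a terrible headache.", lambda _: next(replies)))
```

The point of the contrast is that a model can score well in `static_benchmark` while the answer returned by `interactive_test` depends entirely on what the patient volunteers, which is the gap the study measured.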

Volkheimer also said, 'No matter what the environment, you should never blame the customer when they don't behave as expected. The first thing you should do is ask why. Not just the first why that comes to mind, but a deep, specific why grounded in anthropological and psychological inquiry. That's the starting point.' She added that before introducing a chatbot, it is necessary to properly understand the target users, their goals, and the customer experience itself.

in Software, Science, Posted by logu_ii