May 07, 2026 19:00:00

A study found that OpenAI's o1 preview made more accurate diagnoses than both earlier models and human physicians, using only electronic medical records and a few sentences of information from nurses.

Researchers from Harvard Medical School and Beth Israel Deaconess Medical Center have published a study in the journal Science measuring how well OpenAI models perform compared to human physicians. The study reports that 'OpenAI o1 preview' performed at least as well as two physicians and showed a significant advantage in diagnostic triage.

Performance of a large language model on the reasoning tasks of a physician | Science
https://www.science.org/doi/10.1126/science.adz4433

Study Suggests AI Is Good Enough at Diagnosing Complex Medical Cases To Warrant Clinical Testing | Harvard Medical School
https://hms.harvard.edu/news/study-suggests-ai-good-enough-diagnosing-complex-medical-cases-warrant-clinical-testing

AI outperforms doctors in Harvard trial of emergency triage diagnoses | AI (artificial intelligence) | The Guardian
https://www.theguardian.com/technology/2026/apr/30/ai-outperforms-doctors-in-harvard-trial-of-emergency-triage-diagnoses

A team of physicians and computer scientists from Harvard Medical School and Beth Israel Deaconess Medical Center evaluated how OpenAI's o1 preview performed against hundreds of clinicians, using criteria developed in the 1950s for physician training and evaluation. The evaluation included case study-based diagnostic tasks, reasoning exercises, and real-world emergency room cases.

In one experiment, the research team instructed o1 preview to evaluate patients at successive stages of a standard emergency room visit. At each stage, it was given only the information available at that point and asked to generate a likely diagnosis and recommend the next course of action. The study used electronic medical records from 76 actual patients who visited the emergency room, and o1 preview and the physicians were given exactly the same information when making their diagnoses.
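The staged setup described above, where each stage exposes only what is known so far, can be illustrated with a minimal Python sketch. This is not the researchers' code: the `Note` type, the function names, the number of stages, and the placeholder notes are all hypothetical and serve only to show the cumulative-reveal idea.

```python
from dataclasses import dataclass


@dataclass
class Note:
    stage: int  # index of the stage at which this note became available
    text: str


def visible_notes(notes: list[Note], stage: int) -> list[str]:
    """Return the text of every note available at or before the given stage."""
    return [n.text for n in notes if n.stage <= stage]


def staged_views(notes: list[Note], n_stages: int) -> list[list[str]]:
    """Build one cumulative view of the record per stage, earliest first."""
    return [visible_notes(notes, s) for s in range(n_stages)]


# Placeholder record invented for illustration; not data from the study.
record = [
    Note(0, "nurse triage note"),
    Note(1, "physician evaluation note"),
    Note(2, "later test results"),
]

views = staged_views(record, 3)
for stage, view in enumerate(views):
    # Each evaluator, human or model, would see only `view` at this stage.
    print(f"stage {stage}: {len(view)} note(s) visible")
```

Giving the same `view` to both the model and the physicians at each stage is what keeps such a comparison like-for-like.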

The study reports that human physicians made accurate or very close diagnoses from standard electronic medical records in 50-55% of cases, while o1 preview recorded a significantly higher accuracy of 67%. When more detailed information was available, o1 preview reached 82% accuracy, but the accuracy of human experts was also high at 70-79%, so the difference was not statistically significant. The researchers report that AI shows its advantage particularly in the early stages of diagnosis, such as triage, where rapid decisions are required with minimal information.

The study also had o1 preview and 46 physicians review five clinical cases and develop longer-term treatment plans, such as antibiotic administration plans and end-of-life care plans. The results showed that while human plans created with conventional resources such as search engines were only 34% successful, o1 preview achieved a significantly higher success rate of 89%. However, the researchers point out that this comparison was based on only five cases and that further verification is needed.

In addition, the study reported that o1 preview showed significant improvement in 'diagnostic reasoning,' an area where conventional LLMs tended to struggle. In the study's comparison of diagnostic accuracy, o1 preview achieved approximately 78%, significantly outperforming GPT-4 (approximately 64%) and other conventional diagnostic systems.

Arjun Manrai, lead author of the study and head of the AI Lab at Harvard Medical School, said, 'We don't think these findings mean that AI will replace doctors. However, I do think they indicate that a very significant technological innovation is underway that will fundamentally change the way we practice medicine.'

Adam Rodman, also a lead author of the study and a physician at Beth Israel Deaconess Medical Center, stated, 'AI is one of the most influential technologies of the last few decades. In the next decade, AI will not replace doctors, but will be part of a new tripartite healthcare model consisting of doctors, patients, and AI systems.'

The researchers pointed out that while this study required judgments based solely on text from medical records, real-world practice requires attention to much more information, including images, sounds, and nonverbal cues. They emphasized that while previous versions of the AI model did not perform well in dealing with uncertainty or generating symptom descriptions, o1 preview has improved greatly and has reached a level where it can be considered for practical use.

May 07, 2026 19:00:00 in AI, Science, Posted by log1e_dh