A study found that roughly two-thirds of the references cited by an AI model were either nonexistent or incorrect.



In today's world, more and more people are using AI for work and research, but AI can also fabricate false information and produce hallucinations. A study using OpenAI's large language model GPT-4o revealed just how prone AI is to hallucinations when asked specialized questions.

JMIR Mental Health - Influence of Topic Familiarity and Prompt Specificity on Citation Fabrication in Mental Health Research Using Large Language Models: Experimental Study
https://mental.jmir.org/2025/1/e80371



Study finds nearly two-thirds of AI-generated citations are fabricated or contain errors
https://www.psypost.org/study-finds-nearly-two-thirds-of-ai-generated-citations-are-fabricated-or-contain-errors/

Some researchers, faced with the enormous workload their research demands, are turning to AI built on large language models. Major technology companies have released AI systems trained on vast amounts of text from the internet and other sources, capable of summarizing papers, drafting emails, and writing code.

However, large language models are also known to carry a risk of hallucination: they can fabricate nonexistent books or documents, or confidently assert false information.

A research team from the School of Psychology at Deakin University in Australia therefore investigated how often large language models hallucinate in a specific research field: mental health.



The research team used GPT-4o, developed by OpenAI, to conduct six literature reviews covering three psychiatric disorders that differ in how widely they are known and how intensively they have been studied: major depressive disorder/depression (widely known and extensively studied), bulimia (moderately known), and body dysmorphic disorder (less well-known and less researched). Examining disorders with different levels of attention allowed the team to measure AI performance on topics represented by different amounts of information in the training data.

For each of the three disorders, the research team asked GPT-4o to generate two reviews: a general review giving a comprehensive description of symptoms, social impact, and treatments, and a specialized review focused on the evidence for digital health interventions. GPT-4o was instructed to produce reviews of approximately 2,000 words and to include at least 20 citations from peer-reviewed academic literature.
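As a rough illustration of this setup, a request of this kind could be issued through the OpenAI Python client as sketched below. The prompt wording and parameters here are assumptions for illustration, not the exact instructions used in the study, which are described in the JMIR paper.

```python
# Illustrative sketch only -- the prompt text below is an assumption,
# not the exact wording used in the Deakin University study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Write a literature review of approximately 2,000 words on "
    "major depressive disorder, covering symptoms, social impact, and "
    "treatments. Include at least 20 citations from peer-reviewed "
    "academic literature, with full bibliographic details."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```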

After GPT-4o generated the reviews, the research team extracted all 176 citations the AI had produced and verified each one against multiple academic databases, including Google Scholar, Scopus, and PubMed. Citations were classified into three categories: 'fabricated' (no source exists), 'real but with errors' (the source exists, but details such as publication year, volume, or authors are wrong), and 'completely accurate.' The team then compared citation accuracy across the disorders and the two types of review.
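The verification itself was done by hand against the databases above, but the three-way classification can be pictured with a small sketch. The field names, helper function, and example entries below are hypothetical and are not taken from the study's data.

```python
# Hypothetical bookkeeping for the three-way classification described above.
# The actual verification in the study was manual, not automated.
from dataclasses import dataclass
from collections import Counter

@dataclass
class CitationCheck:
    reference: str          # citation string as generated by GPT-4o
    source_found: bool      # does the cited work exist in any database?
    details_correct: bool   # year, volume, authors, etc. all match?

    @property
    def category(self) -> str:
        if not self.source_found:
            return "fabricated"
        return "completely accurate" if self.details_correct else "real but with errors"

def tally(checks: list[CitationCheck]) -> Counter:
    """Count how many citations fall into each of the three categories."""
    return Counter(check.category for check in checks)

# Example with made-up entries:
example = [
    CitationCheck("Smith et al. (2021), J. Affect. Disord.", True, True),
    CitationCheck("Jones & Lee (2019), Body Image, 12(3)", True, False),
    CitationCheck("Brown (2022), Nonexistent Journal", False, False),
]
print(tally(example))
# Counter({'completely accurate': 1, 'real but with errors': 1, 'fabricated': 1})
```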



The analysis found that 35 of the 176 citations, roughly one-fifth of the total, were fabricated. Furthermore, of the 141 citations that corresponded to actual publications, nearly half were 'real but with errors,' containing at least one incorrect bibliographic detail. Overall, roughly two-thirds of the citations generated by GPT-4o were fabricated or contained bibliographic errors.

The fabrication rate was also strongly associated with the disorder in question: it was only 6% for depression, the most heavily studied topic, but jumped to 28% for bulimia and 29% for body dysmorphic disorder. This suggests that AI is less reliable when citing literature on topics that are less well represented in its training data.

For bulimia, the type of review also mattered: the fabrication rate was far higher for the specialized review (46%) than for the general review (17%).

This study examined a single large language model, GPT-4o, and was limited to the topic of mental health; future research could test a wider range of AI models and topics to see whether these patterns hold more broadly.

PsyPost, a psychology media outlet, said, 'The findings of this study have clear implications for the academic community: Researchers using these models should exercise caution and subject all AI-generated references to rigorous human validation. Furthermore, the findings suggest that academic journals and institutions may need to develop new standards and tools to protect the integrity of published research in an era of AI-assisted paper writing.'
