OpenAI has released a new research paper examining why large language models (LLMs) such as GPT-5, and the chatbots built on them, like ChatGPT, continue to produce hallucinations, false but plausible statements, and whether the problem can be reduced.
In a blog post summarizing the findings, reported by TechCrunch, OpenAI describes hallucinations as “plausible but false statements generated by language models” and acknowledges that, despite improvements, they “remain a fundamental challenge for all large language models”, a challenge that is unlikely to be fully resolved.
To illustrate the problem, the researchers asked a popular chatbot for the title of Adam Tauman Kalai’s Ph.D. dissertation. It gave three different answers, all of them incorrect. When asked for his date of birth, it again produced three different answers, none of them accurate. Kalai is one of the paper’s authors.
Root Cause and Proposed Solutions for Hallucinations
According to the researchers, part of the issue stems from the pretraining process. During this phase, models learn to predict the next word in a sequence without being shown whether statements are true or false. As the paper explains, “The model sees only positive examples of fluent language and must approximate the overall distribution.”
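As a rough, framework-level sketch of that objective (illustrative only, not OpenAI's training code), next-word prediction rewards assigning high probability to whatever token actually came next in the training text; nothing in the loss labels a statement as true or false:

```python
# Simplified illustration of the pretraining objective (not OpenAI's code):
# the loss only measures how well the model predicts the next token that
# actually appeared in the training text. There is no "true/false" label.
import torch
import torch.nn.functional as F

vocab_size = 1000
seq = torch.randint(0, vocab_size, (1, 16))      # stand-in token sequence
logits = torch.randn(1, 16, vocab_size)          # stand-in model output

# Shift by one: predict token t+1 from the context ending at token t.
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),   # predicted distributions
    seq[:, 1:].reshape(-1),                      # the next tokens that occurred
)
# Nothing in this objective distinguishes a factually correct continuation
# from a fluent but false one; both are just "the next token in the data".
print(loss.item())
```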
The paper adds that while “spelling and parentheses follow consistent patterns, so errors there disappear with scale,” arbitrary, low-frequency facts, like a pet’s birthday, “cannot be predicted from patterns alone and hence lead to hallucinations.”
Instead of focusing on pretraining, the paper directs attention to how these models are evaluated. It argues that the evaluations themselves do not cause hallucinations but create misleading incentives.
The researchers liken this to multiple-choice tests, where guessing may yield a correct answer by chance, whereas leaving a question blank guarantees no credit. “In the same way, when models are graded only on accuracy, the percentage of questions they get exactly right, they are encouraged to guess rather than say ‘I don’t know’,” the paper states.
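A quick back-of-the-envelope sketch (the numbers here are illustrative, not from the paper) shows why: under accuracy-only grading, any guess with a nonzero chance of being right beats abstaining in expectation.

```python
# Illustrative only: expected score per question under accuracy-only grading,
# where a correct answer earns 1 point and everything else (a wrong answer or
# an explicit "I don't know") earns 0.

def expected_score_accuracy_only(p_correct: float, abstain: bool) -> float:
    """Expected points for one question when only accuracy is counted."""
    if abstain:
        return 0.0       # saying "I don't know" never earns credit
    return p_correct     # guessing earns p_correct points on average

# Even a long-shot guess (say, a 10% chance of being right) beats abstaining,
# so a model optimized against this metric learns to always answer.
for p in (0.1, 0.3, 0.7):
    print(f"p={p:.1f}  guess: {expected_score_accuracy_only(p, False):.2f}"
          f"  abstain: {expected_score_accuracy_only(p, True):.2f}")
```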
To address this, the authors propose an evaluation method similar to certain standardized tests, where wrong answers are penalized and uncertainty is treated more favorably. According to the paper, evaluations should “penalize confident errors more than [they] penalize uncertainty, and give partial credit for appropriate expressions of uncertainty.”
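As a rough illustration of this kind of scheme (the penalty value below is an assumption for the sketch, not the paper's proposal), a scorer can award full credit for a correct answer, subtract points for a wrong one, and give zero (or, in a fuller version, partial credit) for an explicit “I don’t know”, which makes low-confidence guessing a losing strategy:

```python
# Illustrative scoring rule in the spirit of negative-marking standardized tests:
# +1 for a correct answer, a deduction for a wrong one, 0 for "I don't know".
# The penalty value is an assumption for this sketch, not taken from the paper.

WRONG_PENALTY = 1.0   # a confident error costs more than admitting uncertainty

def score(answer: str | None, truth: str) -> float:
    """Score one question; None means the model said 'I don't know'."""
    if answer is None:
        return 0.0
    return 1.0 if answer == truth else -WRONG_PENALTY

def expected_score(p_correct: float, abstain: bool) -> float:
    """Expected points when the model's answer is right with probability p_correct."""
    if abstain:
        return 0.0
    return p_correct * 1.0 + (1.0 - p_correct) * (-WRONG_PENALTY)

print(score("guessed answer", truth="actual answer"))   # -1.0: confident error
print(score(None, truth="actual answer"))               #  0.0: honest uncertainty

# Guessing now pays off only when confidence exceeds
# WRONG_PENALTY / (1 + WRONG_PENALTY), which is 0.5 here, so low-confidence
# guesses are discouraged instead of rewarded.
for p in (0.3, 0.5, 0.7):
    print(f"p={p:.1f}  guess: {expected_score(p, False):+.2f}  abstain: +0.00")
```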
They stress that minor adjustments are insufficient. Rather than introducing “a few new uncertainty-aware tests on the side,” the researchers argue that “the widely used, accuracy-based evals need to be updated so that their scoring discourages guessing.”
