OpenAI has released a new research paper exploring why large language models (LLMs) such as GPT-5, and the chatbots built on them such as ChatGPT, continue to produce hallucinations (plausible but false statements) and whether the problem can be reduced.
Understanding Hallucinations
As reported by TechCrunch, OpenAI, in a blog post summarizing the findings, defines hallucinations as “plausible but false statements generated by language models.” It acknowledges that, despite recent improvements, they “remain a fundamental challenge for all large language models,” and one that is unlikely ever to be fully resolved.
To illustrate the problem, the researchers tested a popular chatbot by asking for the title of Adam Tauman Kalai’s Ph.D. dissertation. The chatbot provided three different responses, all incorrect. When asked for his date of birth, it again gave three differing answers, none accurate. Kalai himself is one of the authors of the paper.
According to the researchers, a key part of the issue stems from the pretraining process. During this phase, models learn to predict the next word in a sequence without being explicitly told whether statements are true or false. As the paper explains, “The model sees only positive examples of fluent language and must approximate the overall distribution.”
It adds that while errors in consistent patterns, such as spelling and balanced parentheses, disappear as models scale, “arbitrary low-frequency facts, like a pet’s birthday, cannot be predicted from patterns alone and hence lead to hallucinations.”
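A toy sketch can make this concrete (the code and corpus below are illustrative, not taken from the paper): a next-token model is fit only to the text it observes, so its training loss rewards fluent continuation and never checks whether a statement is factually true.

```python
import math
from collections import Counter, defaultdict

# Tiny stand-in corpus; real pretraining uses vastly more text.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Count bigrams: for each word, how often each possible next word follows it.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Probability of each candidate next word, estimated from the corpus alone."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

# The training signal is just the negative log-likelihood of the observed text:
# the model sees only positive examples of fluent language and is never told
# whether anything it might generate is true or false.
nll = 0.0
for prev, nxt in zip(corpus, corpus[1:]):
    nll += -math.log(next_word_probs(prev)[nxt])
print(f"average next-token negative log-likelihood: {nll / (len(corpus) - 1):.3f}")
```

Fit this way, a larger model will nail frequent, regular patterns, but a rare fact that appears once or not at all in the data gives it almost nothing to generalize from.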
Proposed Solution: Rethinking Evaluations
Instead of focusing on pretraining, the paper turns its attention to how these models are evaluated. It argues that evaluations do not cause hallucinations directly, but that the way they are scored creates misleading incentives.
The researchers liken the situation to a multiple-choice test: a blind guess can earn credit by luck, whereas leaving a question blank guarantees none. “In the same way, when models are graded only on accuracy… they are encouraged to guess rather than say ‘I don’t know’,” the paper states.
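A back-of-the-envelope comparison, using hypothetical numbers rather than anything from the paper, shows why accuracy-only grading rewards guessing:

```python
# Hypothetical values for illustration only; the paper gives no such figures.
p_lucky_guess = 0.25        # chance a blind guess happens to be right

# Accuracy-only grading: 1 point for a correct answer, 0 otherwise.
expected_if_guessing = p_lucky_guess * 1.0   # 0.25 points on average
expected_if_abstaining = 0.0                 # "I don't know" always scores zero

# Guessing strictly beats abstaining, however small p_lucky_guess is.
print(expected_if_guessing > expected_if_abstaining)  # True
```

However small the chance of a lucky guess, it beats the guaranteed zero for abstaining, so a model tuned to maximize such scores learns to answer every question.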
To address this, the authors propose an evaluation method similar to certain standardized tests, where wrong answers are penalized and uncertainty is treated more favorably. According to the paper, evaluations should “penalize confident errors more than [they] penalize uncertainty, and give partial credit for appropriate expressions of uncertainty.”
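As a minimal sketch of what such a rule could look like, with penalty and partial-credit values chosen purely for illustration (the quoted passage does not prescribe specific numbers), the scoring function below gives a confident wrong answer the worst score and an honest “I don’t know” a small positive one:

```python
from typing import Optional

# Illustrative scoring rule; the specific values are assumptions, not the paper's.
def score_answer(correct: Optional[bool], confident: bool) -> float:
    """Score a single answer.

    correct: True/False for an attempted answer, None for "I don't know".
    confident: whether the answer was expressed with confidence.
    """
    if correct is None:
        return 0.25                      # partial credit for admitting uncertainty
    if correct:
        return 1.0                       # full credit for a correct answer
    return -1.0 if confident else -0.25  # confident errors are penalized hardest

print(score_answer(None, confident=False))  # 0.25: better than...
print(score_answer(False, confident=True))  # -1.0: ...a confident wrong guess
```

Under a rule like this, abstaining beats guessing whenever the model is likely to be wrong, which is precisely the incentive the authors want evaluations to encode.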
They stress that minor adjustments are insufficient. Rather than introducing “a few new uncertainty-aware tests on the side,” the researchers argue that “the widely used, accuracy-based evals need to be updated so that their scoring discourages guessing.”

