    Artificial Intelligence & The Future

    OpenAI’s new reasoning models found to hallucinate more frequently

By precious · April 25, 2025 (Updated June 13, 2025)

OpenAI’s latest reasoning models, o3 and o4-mini, represent significant advances in the ever-evolving world of Artificial Intelligence (AI). However, these new models hallucinate, generating false or fabricated information, at substantially higher rates than their predecessor, o1.

The problem of AI hallucination, where a model generates plausible but false information, has long been recognized as one of the most persistent challenges in AI development. Traditionally, each new generation of models has hallucinated less than the one before it. The recent release of OpenAI’s o3 and o4-mini, however, breaks that pattern of progress.

These new models, built for state-of-the-art performance on complex reasoning tasks, have unexpectedly become overconfident in their answers. Internal evaluations from OpenAI reveal that both o3 and o4-mini hallucinate more frequently than earlier reasoning models like o1, o1-mini, and o3-mini, as well as OpenAI’s conventional “non-reasoning” model, GPT-4o.

Internal testing also shows that o3 hallucinates in 33% of responses on OpenAI’s PersonQA benchmark, roughly double the rate of previous models (16% for o1 and 14.8% for o3-mini). o4-mini performs even worse, hallucinating in 48% of responses, nearly half of all cases.
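For readers unfamiliar with how such benchmark figures are produced, here is a minimal sketch of the underlying arithmetic. The grading procedure and data below are hypothetical, not OpenAI's actual PersonQA pipeline: each response is simply flagged as containing a fabricated claim or not, and the rate is the flagged fraction.

```python
# Hypothetical sketch of how a PersonQA-style hallucination rate is computed.
# Each graded response is marked True if a grader flagged a fabricated claim.
def hallucination_rate(graded: list[bool]) -> float:
    """Return the fraction of responses flagged as containing a hallucination."""
    if not graded:
        return 0.0
    return sum(graded) / len(graded)

# Illustrative data only: 33 flagged responses out of 100 mirrors the
# 33% rate reported for o3 above.
sample = [True] * 33 + [False] * 67
print(hallucination_rate(sample))
```

The point is that these percentages are simple response-level fractions, so a higher rate directly means more individual answers contained fabricated information.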

The regression is puzzling because o3 and o4-mini excel at coding and math tasks. For example, o3 scores 69.1% on the SWE-bench coding test, outperforming many rivals, according to OpenAI’s report.

According to OpenAI’s system card, o3 tends to assert more statements overall, which produces both more accurate assertions and more inaccurate or hallucinated ones. This suggests that the models’ increased verbosity and willingness to make claims may be directly related to their higher hallucination rates.
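The system card's observation is easy to see with a toy calculation. Assuming (hypothetically) that per-claim accuracy stays fixed, a model that simply asserts more statements ends up with more correct claims and more hallucinated claims at the same time:

```python
# Toy illustration: with per-claim accuracy held fixed, asserting more
# statements raises the absolute counts of BOTH accurate and inaccurate claims.
def claim_counts(n_claims: int, accuracy: float) -> tuple[float, float]:
    """Return expected (accurate, inaccurate) claim counts."""
    accurate = n_claims * accuracy
    return accurate, n_claims - accurate

# Hypothetical models, both 80% accurate per claim; the second is twice as verbose.
print(claim_counts(10, 0.8))  # (8.0, 2.0)
print(claim_counts(20, 0.8))  # (16.0, 4.0): more right answers, and more wrong ones
```

This is only an illustrative model of the trade-off, but it shows why a benchmark that counts hallucinated responses can worsen even as a model also produces more correct material.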

In practice, this could fuel misinformation and erode trust, especially in fields like healthcare and finance where accuracy is critical. Organizations in those fields may find older models like o1 safer despite their weaker reasoning.

The unexpected regression in factual reliability raises important questions about the trade-offs involved in enhancing AI reasoning capabilities, and about how difficult it will be to ensure accuracy in highly sophisticated AI systems. One possibility is that the advanced reasoning methods these models employ prioritize complex problem-solving over factual accuracy.

This situation is particularly concerning because OpenAI itself has acknowledged the uncertainty surrounding the increase in hallucinations in its newer models. In its technical documentation, the company says that “more research is needed” to understand why hallucinations worsen as reasoning models are scaled up.

    For now, older models like o1 remain safer for factual queries, while o3 and o4-mini are best suited for tasks where creativity outweighs precision. Transparency about these limitations will be important to maintain trust as the world of AI continues to evolve.

Niko Felix, an OpenAI spokesperson, said in an email to TechCrunch: “Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability.”

    I’m Precious Amusat, Phronews’ Content Writer. I conduct in-depth research and write on the latest developments in the tech industry, including trends in big tech, startups, cybersecurity, artificial intelligence and their global impacts. When I’m off the clock, you’ll find me cheering on women’s footy, curled up with a romance novel, or binge-watching crime thrillers.
