    Artificial Intelligence & The Future

    OpenAI’s new reasoning models found to hallucinate more frequently

By precious · April 25, 2025 (Updated June 13, 2025)

OpenAI’s latest reasoning models, o3 and o4-mini, represent significant advances in the ever-evolving world of Artificial Intelligence (AI). However, these new models hallucinate, generating false or fabricated information, at substantially higher rates than their predecessor, o1.

The problem of AI hallucination, where a model generates plausible but false information, has long been recognized as one of the most persistent challenges in AI development. Traditionally, each new generation of models has hallucinated less than the one before it. The recent release of OpenAI’s o3 and o4-mini, however, breaks that pattern of progress.

These new models, built for state-of-the-art performance on complex reasoning tasks, have unexpectedly become overconfident in their answers. Internal evaluations from OpenAI reveal that both o3 and o4-mini hallucinate more frequently than earlier reasoning models like o1, o1-mini, and o3-mini, as well as OpenAI’s conventional “non-reasoning” model, GPT-4o.

Internal testing also shows that o3 hallucinates in 33% of responses on OpenAI’s PersonQA benchmark, roughly double the rate of previous models (16% for o1 and 14.8% for o3-mini). o4-mini performs even worse, hallucinating in 48% of responses, nearly half of all cases.
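For readers unfamiliar with how such benchmark figures are produced, here is a minimal sketch of the underlying arithmetic. The grading procedure and data below are hypothetical, not OpenAI's actual PersonQA pipeline: each response is simply flagged as containing a fabricated claim or not, and the rate is the flagged fraction.

```python
# Hypothetical sketch of how a PersonQA-style hallucination rate is computed.
# Each graded response is marked True if a grader flagged a fabricated claim.
def hallucination_rate(graded: list[bool]) -> float:
    """Return the fraction of responses flagged as containing a hallucination."""
    if not graded:
        return 0.0
    return sum(graded) / len(graded)

# Illustrative data only: 33 flagged responses out of 100 mirrors the
# 33% rate reported for o3 above.
sample = [True] * 33 + [False] * 67
print(hallucination_rate(sample))
```

The point is that these percentages are simple response-level fractions, so a higher rate directly means more individual answers contained fabricated information.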

The regression is puzzling because o3 and o4-mini excel at coding and math tasks. For example, o3 scores 69.1% on the SWE-bench coding test, outperforming many rivals, according to OpenAI’s report.

According to OpenAI’s system card, o3 tends to assert more statements overall, which produces both more accurate assertions and more inaccurate or hallucinated ones. This suggests that the models’ increased verbosity and willingness to make claims may be directly related to their higher hallucination rates.
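The system card's observation is easy to see with a toy calculation. Assuming (hypothetically) that per-claim accuracy stays fixed, a model that simply asserts more statements ends up with more correct claims and more hallucinated claims at the same time:

```python
# Toy illustration: with per-claim accuracy held fixed, asserting more
# statements raises the absolute counts of BOTH accurate and inaccurate claims.
def claim_counts(n_claims: int, accuracy: float) -> tuple[float, float]:
    """Return expected (accurate, inaccurate) claim counts."""
    accurate = n_claims * accuracy
    return accurate, n_claims - accurate

# Hypothetical models, both 80% accurate per claim; the second is twice as verbose.
print(claim_counts(10, 0.8))  # (8.0, 2.0)
print(claim_counts(20, 0.8))  # (16.0, 4.0): more right answers, and more wrong ones
```

This is only an illustrative model of the trade-off, but it shows why a benchmark that counts hallucinated responses can worsen even as a model also produces more correct material.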

In practice, this could fuel misinformation and erode trust, especially in fields like healthcare and finance where accuracy is critical. Organizations in those fields may find older models like o1 safer despite their weaker reasoning.

The unexpected regression in factual reliability raises important questions about the trade-offs involved in enhancing AI reasoning capabilities, and about how difficult it will be to ensure accuracy in highly sophisticated AI systems. One possibility is that the advanced reasoning methods these models employ prioritize complex problem-solving over factual accuracy.

This situation is particularly concerning because OpenAI itself has acknowledged the uncertainty surrounding the increase in hallucinations in its newer models. In its technical documentation, the company says that “more research is needed” to understand why hallucinations worsen as reasoning models are scaled up.

    For now, older models like o1 remain safer for factual queries, while o3 and o4-mini are best suited for tasks where creativity outweighs precision. Transparency about these limitations will be important to maintain trust as the world of AI continues to evolve.

Niko Felix, an OpenAI spokesperson, said in an email to TechCrunch: “Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability.”

    I’m Precious Amusat, Phronews’ Content Writer. I conduct in-depth research and write on the latest developments in the tech industry, including trends in big tech, startups, cybersecurity, artificial intelligence and their global impacts. When I’m off the clock, you’ll find me cheering on women’s footy, curled up with a romance novel, or binge-watching crime thrillers.
