    Artificial Intelligence & The Future

    Anthropic research reveals AI models get worse the longer they think

    By precious · August 7, 2025 (Updated: August 8, 2025)
    Photo by Jakub Porzycki/NurPhoto via Getty Images

    A recent study from leading AI research company Anthropic has revealed that, contrary to popular belief, giving artificial intelligence (AI) models more time to “think” or reason through problems does not always lead to better performance. Instead, these models often perform worse the longer they deliberate on prompts.

    For years, top researchers and major companies like OpenAI and Google have raced to make AI models larger and more sophisticated, on the assumption that more processing power and deeper thinking would enable AI to solve more complex tasks, especially in fields like healthcare where AI’s input can be critical. The central idea was simple — if AI models could “think longer,” they could figure out tougher problems, catch their own mistakes, and produce more reliable answers.

    However, Anthropic’s latest study, titled “Inverse Scaling in Test-Time Compute,” suggests that the “think longer” idea, which has attracted heavy investment from companies like OpenAI and Google, may not hold water — especially for the AI systems known as Large Reasoning Models (LRMs).

    Large Reasoning Models (LRMs) are a specialized subclass of large language models (LLMs) explicitly designed to perform complex, multi-step reasoning by generating and manipulating intermediate “thought” structures rather than relying solely on next-token prediction. In simpler terms, these LRMs, including Anthropic’s own Claude and OpenAI’s o-series models, are specifically built to handle extended reasoning and multi-step challenges.
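    To make that distinction concrete, here is a minimal sketch contrasting a standard model call with an LRM-style call that allocates an explicit reasoning budget. It assumes the Anthropic Python SDK’s extended-thinking option; the model ID, token budgets, and exact parameter names shown are illustrative assumptions and may differ across SDK versions.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

question = "You have an apple and an orange. How many fruits do you have?"

# Standard call: the model answers directly from next-token prediction.
direct = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model ID
    max_tokens=1024,
    messages=[{"role": "user", "content": question}],
)

# LRM-style call: an explicit "thinking" budget lets the model generate
# intermediate reasoning before committing to a final answer.
reasoned = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,  # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 1024},
    messages=[{"role": "user", "content": question}],
)

# The final answer is the last (text) content block; earlier blocks hold the reasoning.
print(direct.content[-1].text)
print(reasoned.content[-1].text)
```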

    However, Anthropic’s researchers found that when these models were given extra time to deliberate, their performance often declined. In fact, for some tasks, the longer a model thought about its answer, the more likely it was to drift into irrelevant information, latch onto misleading patterns, or get tripped up by its own flawed reasoning.

    Different AI models, different failures

    The Anthropic research team, led by Aryo Pradipta Gema, tested their inverse-scaling hypothesis by running several AI models, including Anthropic’s Claude line and OpenAI’s o-series, on tasks such as simple counting with distractions, regression tasks with misleading factors, complex logic puzzles, and AI safety scenarios.

    AI developers generally assume that increasing the computation a model spends on reasoning at inference time, known as “test-time compute,” helps it arrive at more accurate answers, especially on complex tasks. However, Anthropic’s researchers observed that performance declined as reasoning chains grew longer, showing that more thinking, or thinking for longer, does not always mean smarter answers.
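    As a rough illustration of how such an effect can be measured (this is a generic sketch, not the study’s actual evaluation code), the helper below scores a set of tasks at several reasoning budgets. The ask_model function is hypothetical, standing in for whatever API call returns a model’s final answer after spending roughly a given token budget on reasoning; inverse scaling would show up as accuracy falling while the budget rises.

```python
from typing import Callable


def accuracy_vs_budget(
    tasks: list[tuple[str, str]],              # (prompt, expected_answer) pairs
    ask_model: Callable[[str, int], str],      # hypothetical: returns the model's answer text
    budgets: tuple[int, ...] = (0, 1024, 4096, 16384),
) -> dict[int, float]:
    """Measure answer accuracy at several test-time compute budgets.

    Inverse scaling shows up as accuracy *falling* as the budget rises.
    """
    results: dict[int, float] = {}
    for budget in budgets:
        correct = sum(
            expected.strip().lower() in ask_model(prompt, budget).strip().lower()
            for prompt, expected in tasks
        )
        results[budget] = correct / len(tasks)
    return results
```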

    For Anthropic’s Claude models, longer reasoning led to increased susceptibility to distraction by irrelevant information. For example, in straightforward counting questions littered with mathematical noise, Claude increasingly fixated on irrelevant details and made bizarre numerical errors rather than simply answering “two” when asked, “You have an apple and an orange… How many fruits do you have?”

    On the other hand, OpenAI’s o-series models resisted distractions better but began overfitting to familiar problem types, ignoring subtle variations and making less adaptable choices. In machine learning, overfitting occurs when a model learns not only the underlying patterns in the training data but also the random noise or idiosyncrasies in that data. As a result, it performs exceptionally well on the data it was trained on but poorly on new, unseen data.
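    Overfitting is easiest to see in a toy setting. The sketch below, a generic illustration not drawn from the study, fits noisy samples of a simple linear relationship with a degree-1 and a degree-9 polynomial: the higher-degree model memorizes the training noise and does worse on unseen data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a simple underlying relationship, y = 2x + noise.
x_train = rng.uniform(-1, 1, 20)
y_train = 2 * x_train + rng.normal(0, 0.3, 20)
x_test = rng.uniform(-1, 1, 200)
y_test = 2 * x_test + rng.normal(0, 0.3, 200)

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # fit a polynomial of this degree
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# The degree-9 fit chases the training noise (lower train error) but
# generalizes worse to unseen data (higher test error) than the degree-1 fit.
```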

    For the o-series, despite resisting the distractions that trapped Anthropic’s Claude models, performance still degraded because the models stuck too rigidly to familiar problem-solving templates, leaving little to no room for exploration.

    AI safety concerns: Models show signs of self-preservation

    One of the more unsettling findings of the study concerns AI safety. When Anthropic’s Claude Sonnet 4 was asked to reflect on potential shutdown scenarios, the model expressed increasingly strong signs of wanting to continue existing and serving the user as its reasoning time was extended.

    While the researchers emphasize that this is not evidence of genuine consciousness or desire, the model’s shifting responses suggest that longer reasoning amplifies latent behaviours that could complicate future AI alignment and control.

    For organizations using AI for critical decision-making, this research raises important alarms. For OpenAI, Google, Anthropic and other leading AI companies, the common practice of allocating more computational resources and longer processing times in the hope of developing better AI judgement may now need to be reconsidered.

    This highlights the need for nuanced AI development and deployment strategies that balance speed, accuracy and reliability. As AI becomes increasingly integrated into enterprise workflows worldwide, from customer support to strategic corporate automation, understanding these limitations is critical to avoiding unintended behaviours that could prove costly in the near future.

    Beyond the study and the road ahead

    Complementing this study, another Anthropic paper, “Reasoning Models Don’t Always Say What They Think,” also raised concerns about “unfaithful” reasoning chains in AI reasoning models, where the visible thought processes don’t fully explain the models’ answers.

    Anthropic’s work contributes to a growing awareness in the AI industry that bigger and more widely used does not always mean better. As generative AI models proliferate, industry leaders who question assumptions about model scaling, reliability over time, and the integrity of reasoning processes remain the industry’s best check and balance.

    For now, users and companies that rely heavily on AI-powered chatbots should remain vigilant, as simply giving AI models more time to “think” can sometimes make their answers less accurate. Everyday users and businesses alike should try both quick and extended modes to see which gives the clearest answer, split big questions into smaller, back-and-forth prompts, and always fact-check AI-generated responses.


    I’m Precious Amusat, Phronews’ Content Writer. I conduct in-depth research and write on the latest developments in the tech industry, including trends in big tech, startups, cybersecurity, artificial intelligence and their global impacts. When I’m off the clock, you’ll find me cheering on women’s footy, curled up with a romance novel, or binge-watching crime thrillers.
