Do Large Language Models Really Reason? Apple Sparks Debate

Apple’s latest research paper, The Illusion of Thinking, has stirred quite a discussion in the AI community. According to the authors, today’s Large Language Models (LLMs) don’t truly reason - they simply recognize and replay patterns seen during training. But not everyone agrees. Many researchers argue that the paper’s methodology fails to capture how these models actually work.

What Apple’s Paper Claims

At the core of the paper is a bold claim: even when using advanced techniques like Chain-of-Thought or Self-Reflection, LLMs aren’t really reasoning. They’re mimicking patterns they’ve already learned. In other words, what looks like reasoning may just be clever pattern reproduction.

The paper also critiques the benchmarks typically used to evaluate reasoning - math problems, coding challenges - saying they’re often contaminated. Models may have already seen similar examples during training. So simply looking at the final output can be misleading.

To address this, Apple designed a set of controlled puzzles, including:

  • Tower of Hanoi
  • Checker Jumping
  • River Crossing (rule-based logic)
  • Blocks World

The goal? To measure step-by-step reasoning, not just end results.

Key Findings

  • On simple problems, non-reasoning models often outperformed reasoning models.
  • On moderately difficult problems, reasoning models did slightly better.
  • On complex tasks (like the Tower of Hanoi with 10 disks), all models failed to maintain coherent reasoning - even when given the correct algorithm in the prompt.

The takeaway? Current LLMs don’t truly generalize reasoning, and there’s a practical limit to how well they scale as task complexity increases. The paper suggests that building models capable of real reasoning will likely require new architectures.

Example: When prompted with “Execute the correct sequence to move 10 disks in the Tower of Hanoi,” the model starts making errors after just a few moves - even when provided with the correct algorithm.
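For reference, the “correct algorithm” here is the classic recursive procedure for the Tower of Hanoi. A minimal Python sketch (an illustration, not code taken from the paper) also shows why the full answer gets so long:

    def hanoi(n, source, target, spare, moves):
        """Append the optimal move sequence for n disks to `moves`."""
        if n == 0:
            return
        hanoi(n - 1, source, spare, target, moves)  # park the n-1 smaller disks on the spare peg
        moves.append(f"move disk {n} from {source} to {target}")
        hanoi(n - 1, spare, target, source, moves)  # stack them back on top of the moved disk

    moves = []
    hanoi(10, "A", "C", "B", moves)
    print(len(moves))  # 1023: the optimal solution has 2**10 - 1 moves

Even with this recipe spelled out, a model answering in plain text still has to transcribe all 1,023 moves without a single slip, which is where the reported breakdowns occur.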

The Critics Respond

Unsurprisingly, Apple’s paper has drawn its share of criticism.

First, some argue the paper doesn’t clearly define what “reasoning” actually is. Others point to a clear domain bias: current reasoning models are optimized for tasks like coding and math, not visual logic puzzles.

In other words, testing reasoning via puzzle games may not reflect what these models do best. It’s a bit like judging someone’s math ability by asking them questions about literature or history.

There are also concerns about the testing methodology:

  • The models weren’t allowed to use external tools (like Python code or visualizations), which are now part of how they reason in practice.
  • The collapse in performance beyond 10 disks may stem as much from context window limitations as from reasoning limitations. Today’s LLMs can only hold so much information at once (see the rough calculation after this list).
  • The models weren’t given extra time to process or plan their responses, which may have limited their potential.
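To make the context-window concern concrete, here is a rough back-of-the-envelope sketch in Python (the tokens-per-move figure is an assumption for illustration, not a measurement from the paper):

    TOKENS_PER_MOVE = 10  # assumed average length of one written move (illustrative only)

    for disks in (10, 12, 15):
        moves = 2 ** disks - 1  # optimal Tower of Hanoi move count
        print(f"{disks} disks: {moves:>6} moves, roughly {moves * TOKENS_PER_MOVE:>7} tokens")

    # 10 disks:   1023 moves, roughly   10230 tokens
    # 15 disks:  32767 moves, roughly  327670 tokens - already past most context windows

And that is just the final answer; reasoning models also spend tokens on their thinking trace before writing it.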

Critics argue that with the right tools (code generation, CSV outputs), models like GPT-4o or Gemini 2.5 can handle even complex puzzles effectively.

And when a pure text response isn’t practical (such as writing out the 1,023-move sequence for the 10-disk Tower of Hanoi), today’s models are smart enough to propose alternative solutions - like generating executable code - demonstrating a form of meta-reasoning.
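As a sketch of that kind of workaround (illustrative only: the helper name, file name, and CSV format are assumptions, not something prescribed by the critics or the paper), a model could answer the 10-disk task with a short program built on the same recursion shown earlier, instead of a 1,023-line transcript:

    import csv

    def hanoi_moves(n, source="A", target="C", spare="B"):
        """Yield every move of the optimal n-disk Tower of Hanoi solution."""
        if n == 0:
            return
        yield from hanoi_moves(n - 1, source, spare, target)
        yield (n, source, target)
        yield from hanoi_moves(n - 1, spare, target, source)

    # Write the full sequence to a CSV file rather than spelling it out in text.
    with open("hanoi_10_disks.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["disk", "from", "to"])
        writer.writerows(hanoi_moves(10))  # 1,023 rows, generated rather than hand-written

The critics’ point is that producing such a program is itself a legitimate display of reasoning about the problem, even if the model never enumerates the moves one by one.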

The Bottom Line

Apple’s paper clearly shows that reasoning models aren’t perfect - but it doesn’t prove they can’t reason at all.

The observed failures may say more about the paper’s experimental setup than about the models themselves. The methodology was selective and didn’t capture the full range of today’s LLM capabilities. And these models reason in ways different from humans - for instance, by finding creative workarounds when output space is limited.

So... Who’s Right?

In short: both sides raise valid points.

It’s true that current reasoning benchmarks need improvement - a point well made by the paper. But the path to AGI isn’t binary (“we have it” or “we don’t”); it’s a continuum of ongoing progress.

Modern LLMs should be used with the right tools, in the right contexts. It’s misleading to draw sweeping conclusions about their intelligence from narrow, puzzle-based tests alone.

That said, Apple’s paper has sparked an important conversation: we do need better reasoning benchmarks to more accurately assess how LLMs think.

But claiming that they “don’t reason at all” is an overreach. Today’s models already exhibit practical forms of reasoning - though different from and less generalized than human reasoning.

The road to truly reasoning models remains long - but we’ve come a long way already.

Andrea Minini

06 / 12 / 2025

 
 
