Maybe the Illusion Wasn’t in the Model After All
What a viral AI debate reveals about how we measure reasoning and why that shapes everything from trust to entire companies.
June has been interesting.
It started with one paper claiming that reasoning models, the ones designed to “think aloud,” collapse under pressure. Then came a rebuttal, co-authored by an AI model (yes, you read that right). What began as a technical debate quickly spilled into public view. Reddit lit up. Substack filled with takes. Everyone suddenly had an opinion.
Depending on which one you read first, you either walked away convinced these models are reasoning or convinced we’ve all been fooled by language again.
I’ve taught machine learning to over 1,000 students at Georgia Tech. And if there’s one thing I’ve seen, both in classrooms and in the research world, it’s how easy it is to confuse a fluent response with a thoughtful one.
That’s why I’ve been paying close attention to this debate. Not because I’m here to defend the models, or the papers, or pick sides. But because this moment reveals something more fundamental: we still don’t know how to evaluate reasoning in machines. And until we do, we’ll keep misreading what these systems are doing — sometimes overhyping it, sometimes dismissing it entirely.
So this post isn’t a technical deep dive. And it’s not another thread of hot takes.
It’s an attempt to slow down, explain what’s actually being argued, and share what I think this whole conversation really teaches us, especially if you’re not an AI researcher, but still care about where the field is going.
Because beneath all the noise is a much harder, more important question: are these models failing to reason or are we just bad at measuring reasoning in the first place?
First, some quick definitions
I know many of my subscribers aren’t deeply technical, but care a lot about how AI is evolving, what these models can actually do, and what the research really means beneath the buzzwords. So before we get into what “collapsed” and what didn’t, I want to define a few core terms that sit at the heart of this debate.
You don’t need to know how these models are built. But it helps to know what kind of model is being discussed and what it was trained to do.
LLMs: Large Language Models
These are the general-purpose AI models most people have interacted with — ChatGPT, Claude, Gemini, and others. They’re trained to predict the next word in a sequence of text. By default, they don’t reason — they generate.
But with the right kind of prompt, like “let’s think this through step by step,” they can simulate structured reasoning fairly well. This technique is often referred to as chain-of-thought prompting, and you’ll see that term come up frequently in both of the papers we’re about to unpack.
In short:
They can reason — but only if you ask them to.
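If you want to see what that nudge actually looks like, here’s a minimal sketch of my own (not from either paper), using the OpenAI Python client purely as an example; any chat-style model works the same way, and the model name and question are just placeholders. The only difference between the two calls is the instruction to think step by step.

```python
# Minimal illustration of chain-of-thought prompting (illustrative only).
# Assumes the `openai` package is installed and an API key is configured;
# the model name and question are placeholders, not recommendations.
from openai import OpenAI

client = OpenAI()
question = "A train leaves at 2:40pm and the trip takes 95 minutes. When does it arrive?"

# Plain prompt: the model simply generates an answer.
direct = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": question}],
)

# Chain-of-thought prompt: same question, plus a nudge to reason out loud first.
step_by_step = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": question + "\nLet's think this through step by step before giving the final answer.",
    }],
)

print(direct.choices[0].message.content)
print(step_by_step.choices[0].message.content)
```

That one extra sentence is the whole trick: the second response will usually walk through the arithmetic before answering, while the first often jumps straight to a time. An LRM, by contrast, behaves like the second call by default.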
LRMs: Large Reasoning Models
This is a newer category. These models are fine-tuned specifically to “think aloud” by default. They don’t wait for a reasoning prompt — they’re trained to break down problems step by step, often with built-in self-checks, internal reflections, or multiple attempts.
A good example of an LRM is Claude Opus 4, which was actually listed as a co-author on the rebuttal paper we’ll get to in a moment. It’s designed to surface its reasoning process without needing a special nudge — that’s the whole point.
Think of the difference this way:
An LLM is a student who shows their work only when the teacher asks.
An LRM is the student who always shows their work, even when you didn’t ask.
You might be wondering why this distinction matters. The answer: it’s the whole reason this debate blew up in the first place.
The original paper that kicked it off wasn’t critiquing ordinary language models. It was focused specifically on LRMs — the ones that are supposed to know how to reason, by design. And it argued that even those models collapse when the task gets too hard.
That’s what sparked the rebuttal, and all the commentary that followed.
What the first paper claimed
The spark for all this was a paper titled The Illusion of Thinking, authored by a team of Apple researchers. They tested reasoning models on a set of classic logic puzzles — not trivia questions, but structured, multi-step tasks like:
Tower of Hanoi (where disks have to be moved across pegs in a specific order, with each step depending on the last),
River Crossing (where people or objects need to cross a river under a set of constraints — like “the wolf can’t be left alone with the goat”),
and other problems where the difficulty can be scaled up systematically.
Their setup wasn’t trying to catch models off guard — it was about seeing how well these so-called reasoning models handle structured complexity. And the results weren’t great.
The models did okay on simple or moderately difficult tasks. But as the problems got harder, performance didn’t just decline — it fell apart. In some cases, models landed on the right answer early on… and then kept talking until they contradicted themselves. In others, their reasoning output, the part that’s supposed to get longer with complexity, actually got shorter as the puzzle got harder.
The authors’ main takeaway was this: even models explicitly trained to “think aloud” still aren’t truly reasoning. They’re simulating it. And when the pressure rises, that simulation starts to break.
Some context worth noting: the paper dropped just days before Apple’s WWDC, where the company showcased its latest “Apple Intelligence” features, its ongoing push into the genAI space. Some even speculated that the timing was intentional: a way to preempt hype, set a more cautious tone, or frame Apple as the voice of restraint in a field that often overpromises. Regardless of the intent, it made the message all the more striking. Even as Apple was stepping into the spotlight, its own researchers were essentially saying: don’t confuse fluent output with actual reasoning.
What the rebuttal said
Then came the rebuttal: The Illusion of the Illusion of Thinking (yes, that’s the actual title). It was written by Alex Lawsen (who, interestingly, notes that he’s not a formal researcher) and co-authored by Claude Opus 4, the very AI model critiqued in the original paper.
They didn’t claim the models were flawless. But their main point was this: a lot of what the first paper called “reasoning failures” were actually evaluation failures.
Here’s what they meant:
Token limits were the real bottleneck. Some puzzles, like the more complex Tower of Hanoi variants, require extremely long answers — sometimes thousands of words. But large language models have a hard cap on how much they can output (a constraint called a token limit). In several examples, the models didn’t fail — they literally ran out of space. In fact, some even warned they were about to.
Some tasks were mathematically unsolvable. At least one River Crossing puzzle had no valid answer. But the models weren’t told that. When they refused to hallucinate or gave up on an impossible task, they still got penalized — even though, logically, that was the right thing to do.
Format mattered more than expected. When the models were asked to list every single step of a long puzzle, they often stumbled. But when the same question was reframed, for example as “write a function to solve this” or “explain the general logic,” they performed much better. So the failure wasn’t necessarily in reasoning; it was in how the question was asked. (The sketch just below shows what that reframing looks like.)
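To make those last two points concrete, here’s a rough sketch of my own (not taken from either paper) showing what the “write a function” reframing looks like for Tower of Hanoi, and why listing every move gets out of hand so quickly: an n-disk puzzle needs 2^n - 1 moves, so by 15 disks that’s 32,767 lines of output before a model has reasoned about anything.

```python
# A rough, illustrative sketch (not from either paper): the "write a function"
# framing for Tower of Hanoi, plus why enumerating every move hits output limits.

def hanoi(n, source, target, spare, moves):
    """Append the optimal move sequence for n disks to `moves`."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the smaller disks off
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # stack the smaller disks back on

for n in (3, 7, 10, 15):
    moves = []
    hanoi(n, "A", "C", "B", moves)
    # The minimum move count is 2^n - 1, so the full transcript grows exponentially.
    print(f"{n} disks -> {len(moves):,} moves (2^{n} - 1 = {2**n - 1:,})")
```

A ten-line function captures the whole solution, but spelling out all 32,767 moves for 15 disks would exhaust most models’ output budgets long before their reasoning did, which is exactly the distinction the rebuttal was pointing at.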
Their broader argument wasn’t that the models are smarter than we think. It was that if we’re going to test reasoning, we need to be a lot more careful about how we define success and failure — otherwise we risk misinterpreting technical artifacts (like output length or formatting) as signs of cognitive collapse.
Also worth noting: Lawsen later shared in a Substack post that the rebuttal actually began as a joke — from the dramatic title to listing Claude as a co-author. But the points it raised resonated. Despite its playful origin, the paper ended up becoming one of the most widely discussed responses — which feels appropriate for a debate about appearances versus reality.
How I’m thinking about it
What stood out most to me in the rebuttal wasn’t that it tried to paint the model in a better light. It’s that it exposed how fragile our evaluation frameworks are. If a model gets marked wrong for refusing to hallucinate, or for running out of space mid-answer, then we’re not testing reasoning. We’re testing obedience to formatting. We’re penalizing the system for respecting its own limits.
I’ve spent the last few years toggling between research, product, and teaching, and one pattern keeps showing up: we want clean answers to messy questions. Can the model reason — yes or no? Is the illusion real, or is the illusion of the illusion real?
But maybe that’s the wrong question entirely. The rebuttal doesn’t make a case for model brilliance — it makes a case for evaluative humility. That maybe, before we make sweeping claims about collapse or capability, we should pause and ask: what are we actually measuring?
If a student stops solving a problem because they run out of paper, do we call that a failure?
If they recognize a puzzle has no valid answer, do we punish them for not guessing?
If they succeed when the question is framed one way, but not another, is that on them — or on the test?
The more I look at this, the more I wonder if we’re mistaking clarity for rigor. We’ve built tests that feel precise, but in doing so, may have missed the actual signals of reasoning — reflection, hesitation, context awareness. The kinds of things we overlook because they don’t fit neatly in a scoring rubric.
No, today’s models aren’t reasoning in a human sense. They don’t revise beliefs or form mental abstractions. But they aren’t just parroting either. There’s structure in the noise, especially when the task, the format, and the framing are aligned.
We haven’t cracked machine reasoning. But maybe what this whole debate shows is that we haven’t cracked how to measure it either.
Still, why does it matter?
You might be thinking: okay, cool puzzle experiments, but why should anyone outside AI research care?
Because how we define and evaluate intelligence, whether in machines or in people, shapes everything that follows: trust, deployment, accountability.
If an AI tells a doctor that a treatment plan is flawed, do we trust it?
If it flags a legal risk or explains a financial clause, do we believe it?
What happens when it sounds confident, but collapses under scrutiny?
This isn’t just theoretical. These models are already embedded in real workflows — in education, healthcare, finance, and more. And when their “reasoning” is graded using brittle benchmarks or impossible tasks, we risk misreading their capabilities entirely.
This matters even more in startups, where things move fast and decisions are made with limited data. Many founders aren’t training models from scratch — they’re building on top of existing ones (I’ve written about that here). They assume the reasoning is good enough to power copilots, automate workflows, or make sense of complex data. But if those assumptions are based on evaluations that mistake formatting for understanding, or token limits for failure, then entire product strategies can be built on the wrong mental model of how the AI actually works.
And when those startups raise capital, investors are often betting on that same assumption: that the model underneath is doing something intelligent. But if our evaluation frameworks inflate capabilities or misread limitations, we’re not just misjudging the model — we’re mispricing the businesses built on top of it.
So yes, this debate is technical.
But it’s also foundational.
It’s about whether we understand what we’re building on — and whether we’re honest about what’s real, what’s noise, and what’s still unsolved.
In that sense, the real illusion might not be in the model — it might be in us.
We’ve seen, over and over, that polish can mask shaky logic. That step-by-step output can look like reasoning without actually being it. And that once we believe a system is thinking, we start designing around that belief, even if it’s false.
That’s why this matters. Because in a world where more and more decisions are shaped by machine output, getting that distinction right isn’t optional. It’s everything.
Hope this was insightful—subscribe to catch my next blog post right when it goes live!