Why your shiny chatbot still fumbles a children’s puzzle
Open a tech feed and you’ll find boosterish claims that “the era of thinking machines is here.” Scroll a little farther and you’ll find the counter-tweet: “AI can’t think at all.” Two extremes, same timeline, five minutes apart. My inbox looks the same. That dissonance nudged a group of Apple researchers to ask a pesky question: What happens when we stop scoring the final answer and start peeking at the thinking?
Their study threw Large Reasoning Models—yes, the new clique of “Chain-of-Thought” champions—into four deceptively simple puzzle arenas: Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World. Each arena lets us crank up complexity one click at a time. The result is a stress test that turns hype into hard numbers and unearths stubborn AI limitations hiding behind pretty demos.
From one disk to twenty: the three regimes of struggle

Picture Tower of Hanoi. With one disk, a toddler wins in a single move. Three disks already take seven moves. Twenty disks? Over a million moves, far more than any model will patiently write out (a short sketch after the list below makes the arithmetic concrete). Apple’s team discovered that every current “thinking” model obeys the same heartbeat:
- Low complexity: Regular language models, the ones that spit the answer without narrating their thought stream, actually do better here. They’re lean, rarely overrun their token budget, and sidestep many AI limitations driven by verbosity.
- Medium complexity: Now the Chain-of-Thought brigade shines. They write long, branching rationales, explore dead ends, self-correct, then land the right plan. This is the marketing sweet spot—tweetable demos of Dragonslayer-9000 solving Sudokus.
- High complexity: Everything collapses. Accuracy flatlines. Worse, the models think less precisely when you beg them to think more. Their generated token count peaks, then falls off a cliff even though plenty of context window remains. That counter-intuitive drop is a flashing red sign of deeper AI limitations we can’t patch with more GPUs.
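To feel that complexity knob yourself, here is a minimal, self-contained sketch (my own illustration, not the paper’s evaluation harness): the optimal Hanoi solution roughly doubles in length with every added disk, which is exactly the staircase that walks a model from the easy regime into the collapse regime.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Yield the optimal move sequence for n disks as (disk, from_peg, to_peg)."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, source, spare, target)
    yield (n, source, target)
    yield from hanoi_moves(n - 1, spare, target, source)


if __name__ == "__main__":
    # Sanity check on a small instance: 3 disks should take exactly 2**3 - 1 = 7 moves.
    assert sum(1 for _ in hanoi_moves(3)) == 7

    # The optimal move count doubles (plus one) with every added disk: 2**n - 1.
    for disks in (1, 3, 8, 15, 20):
        print(f"{disks:>2} disks -> {2**disks - 1:>9,} optimal moves")
```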
Why does the engine stall?

Large Reasoning Models are trained to produce reasoning traces that look good. They’re reinforced by reward models that score a plausible chain. But those rewards don’t penalize early hallucinations that derail later steps. Imagine telling a chess student, “I’ll grade your plan, not the outcome.” The student writes an eloquent plan, sacrifices the queen on move nine, and still passes.
Inside the puzzles the paper uses, that dynamic appears as overthinking. On easy tasks the model finds a valid solution quickly, then keeps exploring nonsense until it ruins its own answer. On harder ones, it stumbles early, digs a trench, and never recovers. The symptom list adds to our growing tally of AI limitations:
- redundancy that wastes compute
- shallow look-ahead—no long causal chain survives token-level noise
- brittle self-verification routines (they stop checking after the first false “got it!”)
- inability to follow even an explicit algorithm without drifting off course
Claude vs DeepSeek vs OpenAI: same cliff, different mile markers

Apple’s team compared Claude 3.7 Sonnet (thinking on), Claude 3.7 Sonnet (thinking off), DeepSeek-R1, DeepSeek-V3, and OpenAI’s o3-mini. Swap the logos and the pattern stays. DeepSeek fires more tokens before toppling; Claude stays concise but still crashes; o3-mini at its highest reasoning setting tumbles a little later yet falls just as hard. Brand loyalty doesn’t buy immunity from AI limitations.
Tokens aren’t cheap, and they aren’t saving you
Let’s talk carbon and cash. Every extra generated token means more FLOPs and higher billable seconds. The study shows that “non-thinking” models often hit the right answer with a tenth of the tokens. On any pay-per-token API that gap converts into real money. Token thrift is also good engineering hygiene. Flooding a context window might earn style points on social media, but it drains your budget, your battery, and your patience.
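A rough back-of-the-envelope sketch makes the point. The per-token price and traffic numbers below are made-up placeholders, not any vendor’s rate card; plug in your own figures to see how fast the gap compounds.

```python
# Back-of-the-envelope token economics with assumed numbers, purely illustrative.
PRICE_PER_1K_OUTPUT_TOKENS = 0.015  # assumed dollars per 1,000 generated tokens

def daily_cost(avg_output_tokens: int, calls_per_day: int) -> float:
    """Daily spend for a given average response length and call volume."""
    return avg_output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS * calls_per_day

concise = daily_cost(avg_output_tokens=800, calls_per_day=50_000)    # terse, answer-only style
verbose = daily_cost(avg_output_tokens=8_000, calls_per_day=50_000)  # full chain-of-thought trace

print(f"concise: ${concise:,.0f}/day  verbose: ${verbose:,.0f}/day  ratio: {verbose / concise:.0f}x")
```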
Turning puzzles into X-rays
Why not stick with standard math benchmarks? Because popular test sets leak. A model trained on five terabytes of web text probably saw last year’s AIME problems. Puzzle generators, by contrast, create infinite, never-published instances on demand. That eradicates contamination, shows raw generalization, and lays bare core AI thinking limits.
In Hanoi, complexity is just disk count. In Checker Jumping, it’s number of pieces. River Crossing scales by passenger pairs and boat size. Blocks World spirals into factorial growth the moment you add a seventh block. All four let you ask: If the model really understood the rules, would accuracy slide gracefully or would it nosedive? We got the nosedive.
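A generator can be embarrassingly small. The sketch below is one way to mint fresh Blocks World instances on demand; the `random_blocks_world` helper and its stacking heuristic are my own assumptions for illustration, not the paper’s generator.

```python
import random

def random_blocks_world(n_blocks: int, seed: int | None = None) -> dict:
    """Mint a fresh Blocks World instance: a random start stacking and goal stacking.

    Each configuration is a list of stacks; each stack lists blocks bottom-to-top.
    Because instances are generated on the fly, they are (almost certainly) absent
    from any pretraining corpus, which is the whole point of generator benchmarks.
    """
    rng = random.Random(seed)
    blocks = [f"B{i}" for i in range(n_blocks)]

    def random_stacking() -> list[list[str]]:
        shuffled = rng.sample(blocks, k=n_blocks)
        stacks, current = [], []
        for block in shuffled:
            current.append(block)
            if rng.random() < 0.4:  # randomly close the current stack and start a new one
                stacks.append(current)
                current = []
        if current:
            stacks.append(current)
        return stacks

    return {"start": random_stacking(), "goal": random_stacking()}

print(random_blocks_world(n_blocks=7, seed=42))
```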
Overthinking, under-planning
Digging into the reasoning streams, the researchers noticed a timing quirk. For smaller puzzles, correct plans show up early in the token trace, yet the model keeps poking at wrong detours until the trace ends. That’s cognitive noise translated into compute waste. For mid-level puzzles, the first half of the trace is junk, the fix appears near the end, and the model squeaks by. For high complexity, a valid partial plan never appears. Each of these stages underscores persistent AI limitations around memory, control flow, and long-range credit assignment.
The “follow the algorithm” trap
What if we spoon-feed the model the exact recursive algorithm for Tower of Hanoi? Zero effect. Collapse happens at the same disk count. The model parrots the pseudo-code, then fails to apply it consistently. This exposes another layer of AI limitations: token-level reasoning is not algorithm execution. It’s prediction of the next token that looks like algorithm execution. The gulf matters when you want airtight correctness.
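If you want to measure that gulf yourself, the cheapest instrument is a simulator that replays the model’s claimed moves against the actual rules. Below is a minimal sketch for Tower of Hanoi; the `replay_hanoi` helper and its move format are my own assumptions, not the paper’s harness.

```python
def replay_hanoi(n_disks, moves, pegs=("A", "B", "C"), source="A", target="C"):
    """Simulate a claimed Hanoi solution.

    `moves` is a sequence of (from_peg, to_peg) pairs. Returns (ok, detail): ok is True
    only if every move is legal and all disks finish on the target peg.
    """
    state = {p: [] for p in pegs}
    state[source] = list(range(n_disks, 0, -1))  # largest disk at the bottom

    for i, (src, dst) in enumerate(moves):
        if src not in state or dst not in state:
            return False, f"move {i}: unknown peg {src!r} or {dst!r}"
        if not state[src]:
            return False, f"move {i}: peg {src} is empty"
        disk = state[src][-1]
        if state[dst] and state[dst][-1] < disk:
            return False, f"move {i}: disk {disk} placed on smaller disk {state[dst][-1]}"
        state[dst].append(state[src].pop())

    if state[target] == list(range(n_disks, 0, -1)):
        return True, "solved"
    return False, "every move was legal, but the goal state was never reached"

# Example: a 3-disk attempt that breaks the rules on its second move.
print(replay_hanoi(3, [("A", "C"), ("A", "C")]))
```

Code like this never “almost” follows the algorithm; it either reaches the goal or tells you exactly where it went wrong. That is the fidelity gap the experiment exposes.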
Taking the scalpel to Chain-of-Thought reveals seven hard truths
Large Reasoning Models dress their answers in verbose “thoughts” because we reward them for transparency. That transparency is a two-edged sword. It gives us a microscope; it also shows how messy the machinery is. Let’s walk through what the Apple paper teaches about reading those thoughts—and why they shout AI limitations louder than the final answer ever could.
- Good thoughts arrive early, bad ones stick around longer: Plot the position of each intermediate solution inside the trace. For easy puzzles green dots (right ideas) cluster near 10% of the token budget, while red dots (wrong turns) smear across the remaining 90%. The model finds gold fast then digs holes anyway. On harder puzzles the dots flip—green crawls toward the back of the trace, red dominates the front. Either way, noise wins most of the airtime. This is not just verbosity. It’s a structural generative AI limitation tied to the token-by-token objective. (A small sketch of this analysis follows the list.)
- Self-correction is finite and fragile: A human chess player can spot an early blunder, backtrack mentally, and rebuild. Language models rarely backtrack. Once they assert a false claim, the follow-up reasoning often builds on it. This cascading error chain caps the depth of solvable tasks. That cap is one more nail in the coffin of “just scale it to AGI” takes—and a key talking point for engineers who must handle AI limitations in production systems.
- Token budgets hide, not cure, scaling walls: Modern context windows look gigantic. Sixty-four thousand tokens feels infinite until you remember that optimal Hanoi solutions double with every added disk: a 15-disk instance already needs more than 32,000 moves, and each move costs several tokens. Even before hitting the hard cap, reasoning quality drops. More context didn’t finish the puzzle; it padded the search tree with fluff. The study shows that AI thinking falls apart for logical reasons, not for lack of window size.
- Verifier tricks have ceiling effects: Several open-source stacks now pair a “thinker” with a “verifier.” The thinker spits chains, the verifier checks each step. Apple’s paper suggests the combo helps only while complexity rides the middle lane. Once you cross the cliff, both thinker and verifier cascade into failure. That’s a sobering AI limitation for safety engineers who hope to bolt a spell-checker onto superhuman reasoning.
- Memorization isn’t the villain you think: Remember River Crossing with N > 2? The models tank. The likely culprit isn’t just deep logic, it’s rarity. Web text hardly contains those variants. Meanwhile, Tower of Hanoi with five disks is a meme, so models ace it. The line between reasoning and pattern exposure blurs. Better data can stretch the curve, yet pure memorization cannot erase the deeper AI limitations revealed when novelty spikes.
- Non-thinking models sometimes fail later: Counter-intuitive but repeatable: non-thinking variants occasionally survive deeper into a move sequence. How so? They avoid the self-imposed overhead of evaluating and discarding false plans, thus stumbling a few logical steps later. That doesn’t make them smarter; it highlights that Chain-of-Thought adds noise as well as signal. Every extra token is a chance to hallucinate. Noise is another core member of the AI limitations family.
- Execution trumps explanation: The “algorithm provided” experiment is the smoking gun. The language model recites the right algorithmic steps, yet deviates mid-stream. Explanation alone doesn’t guarantee execution fidelity. If you need an AI to run a production workflow, you must checkpoint every critical step or embed it in symbolic machinery that enforces state transitions. Treat LLM output like advice from an intern—smart, helpful, and casually wrong. That attitude keeps AI limitations from leaking into customer-facing bugs.
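Here is the promised sketch of the trace-position analysis, written under assumptions: you have already pulled candidate solutions (with their token offsets) out of each reasoning trace, and you have a puzzle-specific validator. Both the candidate format and the stand-in validator below are placeholders, not anything the paper ships.

```python
from statistics import mean

def solution_positions(trace_len, candidates, validate):
    """Split candidate solutions into valid/invalid and report where they sit in the trace.

    `candidates`: list of (token_index, candidate) pairs pulled from one reasoning trace.
    `validate`: callable that returns True if the candidate actually solves the puzzle.
    Positions are normalized to [0, 1] so traces of different lengths stay comparable.
    """
    valid, invalid = [], []
    for token_index, candidate in candidates:
        (valid if validate(candidate) else invalid).append(token_index / trace_len)
    return valid, invalid

# Toy usage with a stand-in validator: pretend any 7-move plan solves 3-disk Hanoi.
candidates = [(90, ["move"] * 7), (400, ["move"] * 5), (700, ["move"] * 9)]
valid, invalid = solution_positions(trace_len=1000, candidates=candidates,
                                    validate=lambda plan: len(plan) == 7)
print(f"valid candidates sit at ~{mean(valid):.0%} of the trace; "
      f"invalid ones at ~{mean(invalid):.0%}")
```

Averaged over many traces and plotted against puzzle complexity, numbers like these reproduce the green-early, red-everywhere pattern described above.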
What this means for builders
- Choose verbosity wisely: When a project manager asks for “transparent reasoning,” show them the token invoice. Decide whether the insight outweighs the cost and the extra surface for error.
- Gate complexity upstream: Before you send a problem to a Large Reasoning Model, estimate its compositional depth. If it lives in the collapse regime, break it into sub-tasks or shift to a symbolic solver.
- Audit intermediate states: Pull the reasoning trace, run your own validators after each major step, and kill the run on the first contradiction. Think of it as circuit breakers for text (a minimal sketch follows this list).
- Pair learning with constraints: Hybrids that mix neural generation with hard-coded rules—think SQL-aware natural-language systems—can compensate for many of these generative AI limitations. Pure autoregression can’t.
- Measure, don’t assume: Accuracy-only benchmarks mask where a system wastes tokens or collapses without warning. Collect trace-level telemetry. Count thinking tokens. Plot collapse curves. Those graphs drive product-ready reliability.
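As promised, a minimal circuit-breaker sketch. Everything here is an assumption about your stack: `generate_steps` stands in for whatever streams the model’s reasoning steps, and `validate_step` is your own domain check (legal puzzle move, consistent SQL, passing test).

```python
from dataclasses import dataclass

@dataclass
class RunTelemetry:
    steps_checked: int = 0
    tokens_spent: int = 0
    aborted_on: str | None = None

def guarded_run(generate_steps, validate_step, max_tokens=8_000):
    """Consume reasoning steps one by one, validate each, and stop on the first failure."""
    telemetry = RunTelemetry()
    accepted = []
    for step_text, token_cost in generate_steps:
        telemetry.steps_checked += 1
        telemetry.tokens_spent += token_cost
        if telemetry.tokens_spent > max_tokens:
            telemetry.aborted_on = "token budget exceeded"
            break
        if not validate_step(step_text, accepted):
            telemetry.aborted_on = f"invalid step: {step_text!r}"
            break
        accepted.append(step_text)
    return accepted, telemetry

# Toy usage: steps are (text, token_cost) pairs; the validator rejects anything with "jump".
steps = [("move disk 1 from A to C", 12), ("jump disk 3 over disk 2", 15)]
accepted, telemetry = guarded_run(steps, lambda step, _history: "jump" not in step)
print(accepted, telemetry)
```

The telemetry object is the point: it gives you the collapse curves and token counts the “measure, don’t assume” item asks for, without trusting the model’s own summary of its run.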
Open research lanes
- Long-horizon credit assignment: Reinforcement Learning from Human Feedback gave us baseline reward models, but they score final output, not step quality. We need credit signals that seep backward through thousands of tokens.
- Adaptive stopping: Early-exit triggers that detect when the model has a valid answer would slash compute cost and reduce overthinking (a toy version is sketched after this list).
- Trace compression: Can we encourage minimal valid chains instead of verbose rambling? Compression might fight certain AI limitations tied to noise.
- Symbolic hybrids: Feed Chain-of-Thought into a prover that checks each inference against a formal rule base. Fail fast, rewind, retry.
- Curriculum-driven fine-tuning: If a model practices on controlled-complexity generators the way AlphaGo practiced ladder sequences, perhaps the collapse point nudges to the right.
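Here is the toy version of adaptive stopping, sketched under assumptions: `extract_answer` and `is_valid` are placeholders for your own parser and checker, and a real deployment would hook into a streaming API rather than a list of characters.

```python
def stream_with_early_exit(token_stream, extract_answer, is_valid):
    """Accumulate streamed tokens and stop as soon as a valid complete answer appears.

    Returns (answer_or_None, tokens_consumed).
    """
    buffer = []
    for token in token_stream:
        buffer.append(token)
        candidate = extract_answer("".join(buffer))
        if candidate is not None and is_valid(candidate):
            return candidate, len(buffer)  # cut generation here instead of letting it ramble
    return None, len(buffer)

# Toy usage: the "answer" is the last digit seen so far, and only "7" counts as valid.
tokens = list("thinking... maybe 3? no, wait... it is 7 because ...")
answer, used = stream_with_early_exit(
    tokens,
    extract_answer=lambda text: next((ch for ch in reversed(text) if ch.isdigit()), None),
    is_valid=lambda ans: ans == "7",
)
print(f"stopped with answer {answer!r} after {used} of {len(tokens)} tokens")
```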
A parting reality check
Large Reasoning Models are breathtaking in scope, yet the moment we make the puzzle a hair tougher, they stall. That stall doesn’t vanish because we tune a few hyper-parameters or slap a bigger GPU cluster on the problem. It’s structural.
The good news? We now own a sharper diagnostic toolkit. Puzzles with tunable knobs let us catch failure regimes early. They turn mystical “intelligence” into concrete engineering metrics. Embrace the grim graphs—they’re the fastest path to sturdier systems.
Next time someone claims their product “thinks like a PhD,” hand them a Tower of Hanoi with eight disks and a stopwatch. When the chatbot locks up, you’ll share a laugh—and maybe spark a deeper conversation about the enduring, fascinating, utterly human problem of AI limitations.
Azmat — Founder of Binary Verse AI | Tech Explorer and Observer of the Machine Mind Revolution. Looking for the smartest AI models ranked by real benchmarks? Explore our AI IQ Test 2025 results to see how today’s top models stack up. For questions or feedback, feel free to contact us or explore our website.
- Large Reasoning Model (LRM) – A language model fine-tuned or prompted to show its step-by-step “thoughts” rather than just final answers, often via Chain-of-Thought prompting.
- Chain-of-Thought (CoT) – A prompting technique that elicits a natural-language reasoning trace; useful for transparency but prone to verbosity and error propagation.
- Reasoning Cliff – The abrupt drop-off in accuracy once puzzle complexity crosses a threshold; it is a visible symptom of AI limitations when token noise overwhelms planning depth.
- Reward Model – A secondary model that scores generated text (e.g., a reasoning trace) during reinforcement learning; if it rewards fluency over correctness, it can reinforce bad plans.
- Cascading Error – A failure mode where an early hallucination becomes the foundation for later steps, causing the entire solution chain to collapse.
- Self-Verification – The model’s attempt to check its own intermediate outputs; brittle routines may stop validating after a single false “looks good” signal.
- Credit Assignment – In reinforcement learning, the challenge of mapping final rewards back to individual early actions; long token sequences make accurate assignment difficult.
- Context Window – The maximum number of tokens a model can consider at once; adding more space delays but does not eliminate AI limitations tied to reasoning depth.
- Token Budget – The count of generated tokens allowed before cost or latency becomes prohibitive; overspending this budget amplifies AI limitations in both efficiency and reliability.
- Symbolic Hybrid – A system that combines neural text generation with rule-based or formal logic components, using the symbolic part to constrain or verify the LLM’s output.
1. Why can’t AI think like humans?
Large language models generate the next token that looks plausible; they don’t build an internal world model the way a brain does. Apple’s puzzle study shows how this prediction-only objective runs straight into AI limitations such as brittle memory and shallow look-ahead once complexity rises beyond a few steps.
2. What is AI thinking and why does it feel convincing?
“AI thinking” is really a narrative the model weaves moment-by-moment. Chain-of-thought traces give us the prose of reasoning, but—as the study highlights—they’re often an illusion of thinking: eloquent plans that can still derail mid-stream.
3. What are the main challenges and limitations of AI in complex puzzles?
Key obstacles include token-level noise, cascading errors after a single bad assumption, and reward models that prize fluent explanations over correct execution. All three belong to the broader family of AI limitations that become obvious in Tower of Hanoi, River Crossing, and similar stress tests.
4. Do larger context windows or more tokens solve AI’s reasoning cliff?
Not really. DeepSeek fires off more tokens before toppling, while Claude stays concise but still crashes, and OpenAI’s o3-mini follows the same curve. Bigger windows can postpone the plunge, yet they don’t remove the cliff itself.
5. How can builders mitigate AI limitations when deploying chain-of-thought models?
Break problems into smaller sub-tasks, checkpoint intermediate states with verifiers, and combine neural generators with symbolic guards (SQL, formal provers, or rule engines). These techniques fence in AI limitations so they don’t surface in customer-facing workflows.