A model can riff on Gödel, write code, and pass bar exams, yet ask it to multiply two four-digit numbers and it fumbles. If you have ever wondered why, you’re in good company. The answer lives in the mathematics of large language models, and the story is both technical and refreshingly practical. A recent paper pried open a small Transformer that does learn multi-digit multiplication and showed, step by step, how it succeeds where a standard fine-tuned model fails. What drops out is a clean diagnosis of the long-range dependency problem and a fix that is simple enough to try this week.
Before we dive in, here’s the promise. We’ll cut through the “autocomplete” myth, walk a real example by hand, peek at the attention patterns that make multiplication work, and highlight a compact training tweak that unlocks the skill. Along the way we’ll keep our compass set to the mathematics of large language models, not hype or hope.
1. The Misconception: “It Just Predicts The Next Token”

“Why can’t AI do math” shows up in search because the common story sounds plausible. If a model only predicts the next token, then arithmetic is a stretch. That story is neat. It’s also incomplete. The Transformer architecture can represent and execute algorithms. The question isn’t capability in theory. The question is which algorithm a trained model actually learns. For the mathematics of large language models, that difference is everything.
A multiplication result depends on many digits that are far apart in the sequence. That is a stress test for the long-range dependencies transformers often miss during vanilla fine-tuning. When an LLM math problem requires combining many remote signals in the right order, the statistical shortcut fails. The paper we’re unpacking shows this cleanly by contrasting two training recipes on the same architecture.
2. What Makes Multiplication Hard For A Transformer
Humans solve multi-digit multiplication by keeping a scratchpad. We compute partial products, add them, and carry. A model has to replicate the same long-distance data flow inside its activations.
The paper uses a crisp formulation with an intermediate “running sum” that captures everything needed at each output position. Let the digits of the two numbers be (a_0..a_3) and (b_0..b_3), least significant first, and the output digits (c_0..c_7). Define the pairwise-product sum (s_k = \sum_{i+j=k} a_i b_j), the running sum (\hat{c}_k = s_k + r_{k-1}) with (r_{-1} = 0), the output digit (c_k = \hat{c}_k \bmod 10), and the carry (r_k = \lfloor \hat{c}_k / 10 \rfloor).
The key signal is (\hat{c}_k). If the model can reconstruct (\hat{c}_k) at step (k), it can emit (c_k) and pass the carry forward. That is the long-range relay race most models drop in the middle digits. This is the beating heart of the mathematics of large language models when they face arithmetic.
2.1 A Concrete Example You Can Verify
Compute (12 \times 34) with the scheme above, writing numbers least significant first:
- (a_0=2, a_1=1).
- (b_0=4, b_1=3).
Now walk the positions:
- (k=0): (s_0 = a_0 b_0 = 2\times 4=8). (\hat{c}_0 = 8). So (c_0=8), (r_0=0).
- (k=1): (s_1 = a_1 b_0 + a_0 b_1 = 1\times 4 + 2\times 3 = 10).
- (\hat{c}_1 = 10 + r_0 = 10). So (c_1=0), (r_1=1).
- (k=2): (s_2 = a_1 b_1 = 1\times 3 = 3).
- (\hat{c}_2 = 3 + r_1 = 4). So (c_2=4), (r_2=0).
Read the result most significant first: (408). The model that learned to reconstruct (\hat{c}_k) nails this. The one that didn’t learn it fails. The paper uses this exact logic to probe what information sits inside the hidden states, a direct lens into the mathematics of large language models in action.
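Here is a minimal Python sketch that mirrors the hand computation above. It assumes the same least-significant-first digit convention; the function name is ours, not the paper’s.

```python
def multiply_digits(a_digits, b_digits):
    """Schoolbook multiplication driven by the running sum c_hat_k.

    a_digits, b_digits: digits least significant first, e.g. 12 -> [2, 1].
    Returns the product's digits, least significant first (the top position
    may be a leading zero).
    """
    n, m = len(a_digits), len(b_digits)
    out, carry = [], 0
    for k in range(n + m):
        # s_k: sum of pairwise products a_i * b_j with i + j = k
        s_k = sum(a_digits[i] * b_digits[k - i]
                  for i in range(n) if 0 <= k - i < m)
        c_hat = s_k + carry      # running sum c_hat_k, includes the carry r_{k-1}
        out.append(c_hat % 10)   # emitted output digit c_k
        carry = c_hat // 10      # carry r_k handed to the next step
    return out

# The worked example: 12 x 34 = 408, digits least significant first.
assert multiply_digits([2, 1], [4, 3]) == [8, 0, 4, 0]
```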
3. The Experiment: Standard Fine-Tuning Versus ICoT
Two small, comparable Transformers are trained on 4×4-digit multiplication:
- SFT. Standard fine-tuning on inputs and final answers only.
- ICoT. Implicit chain-of-thought, which starts with intermediate step tokens, then gradually deletes them across epochs so the model must internalize those steps.
Both models share a minimal 2-layer, 4-head setup. The ICoT model reaches 100 percent accuracy on 4×4 multiplication. The SFT model languishes below 1 percent, even when scaled deeper, and plateaus with big errors on the middle digits. That is not a vague “ai reasoning limitations” claim. It is a measured gap with tight controls on data and architecture, and it squarely concerns the mathematics of large language models.
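To make the ICoT curriculum concrete, here is a minimal sketch of the token-removal schedule. It assumes training sequences laid out as input digits, then chain-of-thought step tokens, then answer digits; the helper name and removal rate are illustrative, not the paper’s exact recipe.

```python
def icot_sequence(input_ids, cot_ids, answer_ids, epoch, remove_per_epoch=2):
    """Build one training sequence for an ICoT-style curriculum.

    Early epochs keep the full chain-of-thought; each epoch drops a few more
    leading CoT tokens, until only input -> answer remains and the model has
    to internalize the deleted steps.
    """
    n_removed = min(epoch * remove_per_epoch, len(cot_ids))
    return input_ids + cot_ids[n_removed:] + answer_ids

# By epoch 3 with remove_per_epoch=2, the first 6 CoT tokens are gone.
seq = icot_sequence([1, 2, 3], list(range(10)), [7, 8], epoch=3)
assert seq == [1, 2, 3] + [6, 7, 8, 9] + [7, 8]
```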
3.1 What The Probes Reveal
Two airtight checks make the case.
- Logit attribution. Perturb a single input digit and see which output digits’ logits move (a minimal sketch follows this list). The ICoT model shows the right long-range pattern, strongest influence for pairs with (i+j=k), and diminishing influence as pairs move away. The SFT model misses those middle-digit dependencies.
- Linear probes for (\hat{c}_k). From the final attention outputs, a simple linear regressor predicts the running sum almost perfectly for ICoT and very poorly for SFT. The signal is either present or it isn’t. In ICoT, it is present. In SFT, it isn’t. This is the mathematics of large language models made visible.
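A sketch of the first check, logit attribution, assuming a causal LM whose forward call returns logits of shape (batch, seq_len, vocab); the function and its interface are ours, not the paper’s code.

```python
import torch

@torch.no_grad()
def logit_shift(model, tokens, digit_pos, new_digit_id, answer_positions):
    """Perturb one input digit and measure how much each answer position's
    logits move. A strong shift at output k from input pairs with i + j = k
    is the long-range pattern ICoT shows and SFT lacks.
    """
    base = model(tokens)                    # (1, seq_len, vocab) logits
    perturbed = tokens.clone()
    perturbed[0, digit_pos] = new_digit_id  # swap a single input digit
    pert = model(perturbed)
    # L2 distance between logit vectors at each answer-producing position
    return [(base[0, p] - pert[0, p]).norm().item() for p in answer_positions]
```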
4. How The Successful Model “Thinks”: Attention Trees
The ICoT model doesn’t magic the answer out of thin air. Its attention heads organize a sparse directed acyclic graph that functions like a tiny expression tree.
- Layer 1 caches pairwise products. Heads focus on two digits at a time, compute the contribution (a_i b_j), and store that representation in the hidden state at different timesteps.
- Layer 2 retrieves the right set. When it is time to produce (c_k), heads query earlier positions that cached the needed pairs with (i+j=k), along with the previous running sum for the carry. The shape you see in the attention map is a neat binary tree spread across time.
That is a specific, testable mechanism that fits the task. It is also a blueprint for handling the long-range dependencies transformers often struggle with under SFT. If you care about the mathematics of large language models, this is a model-internals moment worth bookmarking.
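One way to quantify the tree is to score how much of a layer-2 head’s attention, at the step that emits (c_k), lands on the positions that cached the needed products. The sketch below assumes you have already mapped cached pairs to sequence positions (for instance with probes); both helpers are hypothetical, not the paper’s tooling.

```python
def needed_pairs(k, n_digits=4):
    """Digit pairs (i, j) with i + j = k that feed output digit c_k."""
    return [(i, k - i) for i in range(n_digits) if 0 <= k - i < n_digits]

def tree_score(attn_row, cache_positions, k, n_digits=4):
    """Fraction of one attention row (the step emitting c_k) that lands on
    positions caching the needed pairwise products.

    cache_positions: mapping (i, j) -> sequence position where layer 1 cached
    a_i * b_j. Building this mapping is assumed, not provided here.
    """
    wanted = {cache_positions[p] for p in needed_pairs(k, n_digits)
              if p in cache_positions}
    total = sum(attn_row)
    return sum(attn_row[pos] for pos in wanted) / max(total, 1e-9)
```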
4.1 Geometry Matters: The Model’s Digits Live On A Pentagonal Prism
There is a lovely geometric twist. The winning model’s hidden states arrange digits using a Fourier basis. Visualize the last-layer activations with PCA and you get two stacked pentagons, one for even digits and one for odd, forming a pentagonal prism. That basis makes adding and combining digits efficient for attention heads, which then assemble partial products as Minkowski sums. The SFT model lacks this clean structure. The geometry is not a party trick. It is a compact code that makes the downstream arithmetic easy to express, a fine example of the mathematics of large language models discovering the right coordinate system.
5. Why Standard Fine-Tuning Fails In The Middle

If the model learns the first two digits and the last one quickly, why does it stall on the middle? Gradients flow heavily to (c_0), (c_1), and (c_7) early, those losses drop, then training gets stuck in a local minimum that doesn’t route information correctly for (c_3..c_6). More depth doesn’t help. The training signal does not shape the right long-range computation. This isn’t a generic “transformer architecture” complaint. It’s a specific observation about learning dynamics, and it shows up again and again when you chart token-wise loss and gradient norms. Another entry in the notebook for the mathematics of large language models.
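You can watch this happen by logging cross-entropy per output digit instead of one averaged loss. A minimal PyTorch sketch, assuming logits of shape (batch, seq_len, vocab) and known answer positions:

```python
import torch.nn.functional as F

def per_digit_loss(logits, targets, answer_positions):
    """Cross-entropy broken out by output digit.

    The middle-digit stall shows up here: losses for c_0, c_1 and the top
    digit drop quickly while c_3..c_6 plateau.
    logits: (batch, seq_len, vocab); targets: (batch, seq_len) token ids.
    """
    return {
        f"c_{k}": F.cross_entropy(logits[:, pos, :], targets[:, pos]).item()
        for k, pos in enumerate(answer_positions)
    }
```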
6. A Small, Practical Fix: Add The Right Inductive Bias

The authors tried a simple idea. If the right internal signal is (\hat{c}_k), then supervise it. They add a tiny auxiliary loss that asks a linear head, attached to attention outputs at each step, to predict the running sum. That’s it. No chain-of-thought tokens at inference. No extra compute trickery. The model learns to keep track of the sum, and accuracy jumps to about 99 percent on 4×4 multiplication.
This is a minimalist example of inductive bias in LLMs paying off. You point the model toward the computation you want, without changing the architecture, and it learns it. If you’re tracking the mathematics of large language models as a distinct body of knowledge, this auxiliary-loss pattern is a keeper.
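A minimal PyTorch sketch of that auxiliary head, assuming access to the post-attention activations and the ground-truth running sums during training; the class name and the loss weight are ours, not the paper’s exact values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RunningSumHead(nn.Module):
    """Tiny linear head on post-attention activations that predicts the
    scalar running sum c_hat_k at each answer step. It is a training-time
    auxiliary only and is discarded at inference."""

    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, attn_out, answer_positions):
        # attn_out: (batch, seq_len, d_model) post-attention activations
        return self.proj(attn_out[:, answer_positions, :]).squeeze(-1)

def total_loss(token_ce, aux_pred, running_sums, aux_weight=0.1):
    """Next-token cross-entropy plus the MSE on c_hat_k (an assumed weighting)."""
    return token_ce + aux_weight * F.mse_loss(aux_pred, running_sums)
```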
6.1 Do We Get The Same Mechanism As ICoT?
Not identical, but close. The auxiliary-loss model still builds a sparse attention tree for most heads, and one head develops a broader “parallelogram” pattern that scoops up all relevant digits for the current position. Mechanisms can differ in detail, but they serve the same purpose, which is to keep and combine the right partial products. That is a recurring theme in the mathematics of large language models. Different internal circuits, same external skill.
7. Core Math From The Paper, In Plain Sight
Let’s tie the threads with a compact, inspectable recipe you can implement in a small model or even a notebook:
- Represent numbers as digit tokens, least significant first.
- Teach the model the running sum (\hat{c}_k) at each output step (k).
- Encourage sparse attention to pairs that satisfy (i+j=k).
- Decode (c_k) from (\hat{c}_k \bmod 10), then carry.
You can verify the presence of (\hat{c}_k) with a one-vector linear probe on the post-attention hidden state. When that probe’s mean absolute error drops near zero, you’ve wired the long-range path correctly. That probe is an engineer’s stethoscope on the mathematics of large language models.
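A minimal probe sketch with scikit-learn, assuming you have collected post-attention hidden states and the ground-truth running sums; in practice, fit on a training split and report the error on held-out examples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def probe_running_sum(hidden_states, running_sums):
    """Fit a linear probe from post-attention hidden states to c_hat_k and
    report mean absolute error.

    hidden_states: (num_examples, d_model) activations at answer step k;
    running_sums: (num_examples,) ground-truth c_hat_k values.
    Near-zero MAE means the long-range signal is present (ICoT-like);
    large MAE means it is not (SFT-like).
    """
    probe = LinearRegression().fit(hidden_states, running_sums)
    return float(np.abs(probe.predict(hidden_states) - running_sums).mean())
```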
8. Table: Three Paths To Multi-Digit Multiplication
| Approach | Training Signal | Mechanism Observed | Accuracy on 4×4 | Notes |
|---|---|---|---|---|
| Standard Fine-Tuning (SFT) | Inputs and final answer only | Weak long-range coupling, middle digits stall | < 1% | Fails even with a deeper model. Long-range dependencies are not learned. |
| Implicit Chain-of-Thought (ICoT) | Start with step tokens, gradually remove | Sparse attention tree, cached partial products, Fourier digit code | 100% | Forces internal scratchpad. Strong signal for the mathematics of large language models. |
| Auxiliary Running-Sum Loss | Add linear head predicting (\hat{c}_k) at each step | Attention tree plus one broad collector head | ~99% | A small inductive bias in LLMs that teaches the right algorithm without CoT at inference. |
9. Why This Matters Beyond Multiplication
Arithmetic is a toy problem with sharp edges. You either get the digit or you don’t. That makes it perfect for reverse-engineering. The broader lesson carries over to program synthesis, calendar logic, long-horizon tool use, and any LLM math problem that chains many operations.
- The mathematics of large language models rewards the right inductive bias in LLMs.
- Long-distance information flow needs supervision, curriculum, or structure.
- Mechanisms that build and cache intermediate results, like a scratchpad, win.
If your current stack is failing on multi-step reasoning, don’t reach for a bigger model by reflex. Reach for the mathematics of large language models. Bake in the signal that the desired algorithm needs, then check with probes that it emerges.
10. A Hands-On Recipe You Can Try In A Weekend
If you want something concrete, do this:
- Small model. A 2-layer, 4-head GPT-style model is enough. Start from scratch to avoid confounds.
- Data. Generate 80k random 4×4 multiplication pairs, with targets as digit sequences (a generation sketch follows this list).
- Aux loss. At each output step (k), add an MSE loss from a linear head on the post-attention tensor to the scalar (\hat{c}_k).
- Sparsity. Encourage head sparsity with mild attention entropy regularization to nudge the tree structure.
- Probing. Track (\hat{c}_k) probe error during training, not just token cross-entropy.
- Attn viz. Visualize attention at the step producing (c_2). You should see layer-1 heads caching products like (a_0b_2, a_1b_1, a_2b_0), and layer-2 heads retrieving them.
- Geometry check. Project last-layer activations for digits. Look for even-odd separation and a five-fold pattern. That pentagonal prism is a healthy sign that the mathematics of large language models fell into a good basis.
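Here is a minimal sketch of the data step, reusing the running-sum recurrence from earlier so every example also carries its auxiliary targets; the helper name and the 80k count mirror the recipe but are otherwise illustrative.

```python
import random

def make_example(n_digits=4):
    """One 4x4 multiplication example: operand digits least significant
    first, output digits, and the per-step running sums c_hat_k used as
    auxiliary targets."""
    a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    a_digits = [int(d) for d in str(a)[::-1]]
    b_digits = [int(d) for d in str(b)[::-1]]
    out_digits, running_sums, carry = [], [], 0
    for k in range(2 * n_digits):
        s_k = sum(a_digits[i] * b_digits[k - i]
                  for i in range(n_digits) if 0 <= k - i < n_digits)
        c_hat = s_k + carry
        running_sums.append(c_hat)
        out_digits.append(c_hat % 10)
        carry = c_hat // 10
    return a_digits, b_digits, out_digits, running_sums

dataset = [make_example() for _ in range(80_000)]
```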
This won’t just improve arithmetic. It will harden long-range pipelines. Your “ai reasoning limitations” bug reports will get shorter. Your users will notice.
11. Frequently Asked Pushbacks, Answered Briefly
- Isn’t this just laddering up model size? No. The SFT model stayed lost even when scaled. The win came from training signal, not parameter count. That is squarely about the mathematics of large language models, not brawn.
- Why not bolt on a tool? Tools help, and many systems route calculations to a solver. If you care about reliability without tool latency, teaching the core algorithm is still valuable.
- Is this special to multiplication? The pattern generalizes to any task that needs a running invariant or accumulator. Once you see it, you’ll start designing auxiliaries for date math, currency rounding, even symbolic derivations.
12. Where Research Goes Next
Three trails look promising.
- General accumulators. The auxiliary signal here is a running sum. Other tasks have other conserved quantities. Designing those signals is an open garden for the mathematics of large language models.
- Attention scaffolds. Sparse, tree-like patterns keep the receptive field tidy. We can bias heads toward those patterns without hardcoding them.
- Geometry as a guide. If digits self-organize on a Fourier-friendly manifold, maybe we can steer earlier layers toward those manifolds on purpose. The long-range dependencies transformers need would come for free.
These lines don’t replace the transformer architecture. They refine it. They teach it to keep a pencil behind its ear.
13. Closing: Build Models That Can Show Their Work
You came in with a simple worry: why can’t AI do math? You’ve seen that the barrier isn’t magic. It’s bookkeeping. Multiplication asks for a stable relay of intermediate results across many steps. The ICoT model learned that relay. The SFT model didn’t. A tiny auxiliary loss taught it, and attention maps proved it. This is the mathematics of large language models at its most useful, not a slogan but a set of engineering moves that transfer.
If you ship models, bake in a running-sum head for any task with a conserved quantity. If you run a research group, pick one long-range task and design the minimal auxiliary that makes its invariant visible during training. If you write about AI, hold systems to this standard. They shouldn’t just reach an answer. They should carry the torch from step to step.
Ready to put this into your stack? Start with a small reproducer. Add the auxiliary. Probe for (\hat{c}_k). When the middle digits stop failing, you’ll feel it. That is progress you can measure, and that is the mathematics of large language models earning its name.
References: Bai, Pres, Deng, Tan, Shieber, Viégas, Wattenberg, and Lee, “Why Can’t Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls.” arXiv:2510.00184v1.
Frequently Asked Questions
1) What is the mathematics of large language models?
It is the toolkit that explains how modern LLMs turn text into vectors, apply attention, and predict tokens with calibrated probabilities. In practice this spans linear algebra for embeddings, geometry for representation spaces, and optimization for training. When we study the mathematics of large language models, we focus on how queries, keys, and values interact, how softmax shapes attention weights, and how positional signals let models track order. The same math also clarifies where models fail, for example when long-range information decays or when gradients favor shortcuts over actual computation.
2) Why can’t AI do math reliably, and how does this relate to the mathematics of large language models?
LLMs learn patterns from text. Arithmetic and multi-step reasoning require stable intermediate results that must be carried across many positions. That is hard without a scratchpad or an inductive bias that rewards the right steps. The mathematics of large language models shows that attention can route information across long spans, yet training often encourages shallow correlations. Add structure, for example an auxiliary head that predicts a running sum, and reliability jumps. The failure is not capacity. It is the learned algorithm.
3) How does the transformer architecture implement computation in LLMs?
A transformer maps tokens to embeddings, then applies self-attention where each position forms a query to read from keys and values. Softmax turns dot products into weights that mix information. Stacked layers let the model build higher-level features, while a causal mask keeps prediction forward-only. In the mathematics of large language models, this becomes a sequence of matrix multiplications and nonlinearities that can emulate algorithms, provided the training signal rewards the right circuits.
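As a minimal illustration of that computation, here is a single-head causal self-attention sketch in NumPy; the weight matrices are placeholders, not any particular model’s parameters.

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """Single-head causal self-attention on a toy sequence.

    X: (seq_len, d_model) token embeddings; Wq, Wk, Wv: (d_model, d_head).
    Each position forms a query, dot-products against keys, a causal mask
    hides future positions, and softmax weights mix the values.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)   # future positions
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```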
4) What fixes help LLMs solve math and long-horizon reasoning tasks?
Three proven levers work well. First, curriculum or implicit chain-of-thought that exposes intermediate steps early, then removes them so the model internalizes the logic. Second, inductive bias such as an auxiliary loss that predicts a running sum or other conserved quantity. Third, data that stresses long-range dependencies instead of short memorization. Each choice reshapes gradients so the mathematics of large language models favors stable computation over guesswork.
5) Is chain-of-thought required, or can models reason without it?
Visible chain-of-thought can help but is not mandatory. You can train with implicit steps or auxiliary targets, then infer answers directly. The mathematics of large language models supports both modes. What matters is guiding the internal representations so they cache and retrieve intermediate results. When the internal signal is present, models solve arithmetic and similar tasks without emitting long explanations.
