Multimodal Chain Of Thought: DeepSeek’s Visual Primitives And The 7056x Compression Trick

The weird thing about vision models is that they can often describe an image beautifully, then fall apart when asked to do something a five-year-old does with a finger. Count the bears on the ground. Trace the line from the crown icon. Navigate the maze. Don’t hallucinate a shortcut through a wall.

That is where Multimodal chain of thought becomes interesting. DeepSeek’s Thinking with Visual Primitives argues that the next step for vision-language reasoning is not simply feeding models more pixels. It is giving them a way to point while they think. Instead of letting reasoning float around in prose, the model drops coordinates, boxes, and points into its own internal narration. The result is a cleaner bridge between language and the image.

1. What DeepSeek Actually Changed

DeepSeek’s paper names the central failure the Reference Gap. Current Multimodal AI models can often perceive an object, but they struggle to keep a stable reference to that object during multi-step reasoning. Natural language is fuzzy. “The small object near the left” sounds fine until there are twelve small objects, three left-ish regions, and one overconfident model trying to improvise geometry.

The fix is simple in spirit: make coordinates part of the reasoning process. Bounding boxes handle concrete objects. Points handle paths, routes, curves, and other spatial traces. This turns Multimodal chain of thought into something less like a monologue and more like a marked-up map. DeepSeek describes these points and boxes as “minimal units of thought,” and the model is trained to interleave them directly into its reasoning trajectory.

Idea | Old Habit | DeepSeek’s Move
Visual Reasoning | Describe the image in words | Attach reasoning to coordinates
Counting | Guess from a scene description | Box every candidate, then count
Maze Solving | Narrate directions vaguely | Trace reachable points step by step
Path Tracing | Follow color or shape cues loosely | Emit a coordinate trail along the curve
Efficiency | Spend more visual tokens | Compress aggressively, then point precisely

This is the useful intuition: Multimodal chain of thought should not be text pretending to be vision. It should be language plus spatial handles.

2. The DeepSeek Deleted Repo Github Moment

The paper also arrived with a small amount of internet theater, because of course it did. Developers noticed the project, saw chatter around the “DeepSeek deleted repo github” situation, and watched a community clone appear. Meanwhile, the official GitHub page is currently public and states that the technical report was released on April 30, 2026. The official page also says DeepSeek plans to release in-house benchmarks, a subset of cold-start data, and future model weights through its foundation model line.

That matters because the repo drama is less interesting than the availability question. Right now, the paper gives us the method and enough detail to evaluate the architecture. It does not yet give us the full artifact chain required for clean independent reproduction. For builders, that means optimism with a raised eyebrow. Bookmark the official repo, know the mirrors exist, but treat benchmark claims as “promising technical report,” not settled law.

The useful takeaway for readers is simple: the idea is bigger than the release hiccup. Multimodal chain of thought with coordinates is a design pattern that other labs can copy, challenge, or improve. Whether DeepSeek’s exact model becomes the default is less important than the fact that the paper gives the community a sharper vocabulary for a stubborn failure mode.

3. The Reference Gap AI Problem

The phrase Reference gap AI sounds academic, but the problem is painfully practical. Imagine asking a model whether the red capacitor is left of the inductor in a dense circuit diagram. The model may see both. It may even name them. Then, three reasoning steps later, “the red one” has quietly become the wrong red one.

This is why Multimodal chain of thought needs grounding. Text is a wonderful interface for ideas, but it is a mediocre coordinate system. Humans solve this with gestures. We point at the apple we mean. We run a finger along a maze. We tap each item while counting. DeepSeek’s proposal gives that habit to the model.

The paper separates this from the older “Perception Gap.” More resolution helps when the model cannot see a tiny object. It does not help enough when the model can see the object but cannot keep referring to it accurately. Seeing clearly and thinking clearly are not the same skill. The official project page describes the same bottleneck: natural language is too ambiguous to precisely point to dense spatial layouts.

4. Multimodal Chain Of Thought With Visual Primitives

Here is the core move in one sentence: Multimodal chain of thought becomes spatially grounded when the model writes points and boxes into its reasoning, not just at the end of its answer.

That sounds like a formatting trick. It is not. A final bounding box says, “I found the thing.” A bounding box inside the reasoning trace says, “I am thinking about this exact thing now.” That difference is the whole paper.
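
To make the distinction concrete, here is a minimal sketch of what coordinates inside the reasoning could look like. The inline <box> tags and the counting scenario are hypothetical; the paper interleaves boxes and points into the trace, but its exact serialization may differ.

```python
import re

# Hypothetical trace: each reasoning step carries its own spatial handle,
# so "the bear" three steps later still means this exact box.
trace = (
    "Candidate bear 1 on the ground <box>112,340,188,420</box>. "
    "Candidate bear 2, partially occluded <box>402,355,470,430</box>. "
    "The shape at the top left is a rock, not a bear <box>40,60,95,110</box>. "
    "Two grounded candidates remain, so the count is 2."
)

# Pull the spatial handles back out of the reasoning text.
boxes = [tuple(map(int, m.split(","))) for m in re.findall(r"<box>(.*?)</box>", trace)]
print(len(boxes), "boxes referenced while reasoning:", boxes)
```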

4.1. How It Mimics Human Cognition

The human analogy is almost suspiciously good. When we count a crowded team photo, we do not hold all faces in working memory as vague nouns. We scan, mark, group, and tally. When we trace a cable behind a desk, we slow down at crossings. When we solve a maze, we remember branches and backtrack. Very little of this feels like formal logic. It feels like using the world as scratch paper.

DeepSeek’s Multimodal chain of thought follows that same pattern. In counting tasks, boxes anchor each candidate object. In maze navigation and path tracing, points become breadcrumbs. The model no longer has to say “go down, then left, then toward the opening near the center,” which is exactly the sort of sentence that sounds helpful while being nearly useless.

5. The 7056x Compression Marvel

The paper’s other surprise is efficiency. For a 756 by 756 image, DeepSeek reports 571,536 raw pixels. The image becomes 2,916 ViT patch tokens, then 324 visual tokens after 3 by 3 spatial compression, and finally just 81 visual KV cache entries after Compressed Sparse Attention. That is the famous 7056x compression ratio from pixels to final KV cache entries.
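
The arithmetic is easy to sanity-check. The sketch below assumes 14 by 14 ViT patches (implied by 756 going to 2,916) and a 4x reduction from Compressed Sparse Attention (implied by 324 going to 81); neither factor is confirmed beyond the reported numbers.

```python
# Back-of-envelope check of the reported 756x756 figures.
side = 756
pixels = side * side                  # 571,536 raw pixels
vit_tokens = (side // 14) ** 2        # 54 * 54 = 2,916 patch tokens (assumed 14x14 patches)
visual_tokens = vit_tokens // 9       # 3x3 spatial compression -> 324 visual tokens
kv_entries = visual_tokens // 4       # Compressed Sparse Attention -> 81 entries (assumed 4x)

print(pixels, vit_tokens, visual_tokens, kv_entries)
print("pixels per KV entry:", pixels // kv_entries)   # 7056
```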

For an 800 by 800 image, the paper’s figure reports roughly 90 KV cache entries for DeepSeek’s model, compared with about 660 for Qwen3-VL, 740 for GPT-5.4, 870 for Claude-Sonnet-4.6, and 1100 for Gemini-3-Flash. The reported comparison is limited to selected visual reasoning benchmarks, not overall model quality. That caveat is important and refreshingly sane.

This is the architecture lesson: Multimodal chain of thought does not have to mean dumping a giant image into a giant context window and praying. If references are precise, fewer visual tokens may be enough.

6. DeepSeek Vs. Molmo, Qwen, And Older Grounding Tricks

Visual grounding is not new. Models have been drawing boxes, returning coordinates, and answering location questions for years. The meaningful distinction is timing.

Older grounding pipelines often produce coordinates as evidence, verification, or final output. DeepSeek makes coordinates part of the live reasoning process. That lets the model build a chain where each step is tied to a visual referent. In dense scenes, this is less glamorous than a bigger benchmark number, but it is probably more important.

This also explains why System 2 multimodal AI is the right framing. Slow reasoning is not just “think longer.” For visual tasks, thinking longer without stable references can make a model more eloquent and more wrong. A good Multimodal chain of thought has to maintain object identity, spatial continuity, and causal structure across steps.

7. Cold-Start Data And The Benchmaxxing Question

The skeptical read is obvious: did DeepSeek invent a general visual reasoning method, or did it build a very good maze-and-counting machine?

The paper tries to answer that with cold-start data and specialized reward design. It builds around four task families: counting, spatial reasoning and visual QA, maze navigation, and path tracing. The dataset sizes are not tiny. DeepSeek reports roughly:

  • 10,000 cold-start counting samples
  • 9,000 spatial reasoning and general VQA samples
  • 460,000 maze navigation samples
  • 125,000 path tracing samples

The maze work is especially telling. Solvable and unsolvable mazes are generated with algorithms such as DFS, Prim, and Kruskal. The model must explore, hit dead ends, backtrack, and output verified paths with points. In other words, it is rewarded for legal exploration, not just for guessing the final label. That is the right instinct.
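
For intuition, here is a small sketch of the kind of generator and verifier such data implies: carve a maze with randomized DFS, then check a candidate path expressed as a trail of points. The details are illustrative, not DeepSeek’s pipeline, and the grid encoding (0 for wall, 1 for open) is an assumption.

```python
import random

def carve_maze(w, h, seed=0):
    """Randomized-DFS maze on a (2h+1) x (2w+1) grid: 0 = wall, 1 = open."""
    random.seed(seed)
    grid = [[0] * (2 * w + 1) for _ in range(2 * h + 1)]
    grid[1][1] = 1
    stack, seen = [(0, 0)], {(0, 0)}
    while stack:
        x, y = stack[-1]
        nbrs = [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= x + dx < w and 0 <= y + dy < h and (x + dx, y + dy) not in seen]
        if not nbrs:
            stack.pop()                       # dead end: backtrack
            continue
        nx, ny = random.choice(nbrs)
        grid[y + ny + 1][x + nx + 1] = 1      # knock down the wall between the cells
        grid[2 * ny + 1][2 * nx + 1] = 1      # open the new cell
        seen.add((nx, ny))
        stack.append((nx, ny))
    return grid

def path_is_legal(grid, points):
    """A point trail is legal if every point is open and consecutive points are adjacent."""
    on_open_cells = all(grid[y][x] == 1 for x, y in points)
    unit_steps = all(abs(x1 - x2) + abs(y1 - y2) == 1
                     for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return on_open_cells and unit_steps

maze = carve_maze(5, 5)
print(path_is_legal(maze, [(1, 1), (1, 2), (1, 3)]))  # depends on the carved layout
```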

Still, the strongest version of the claim needs public weights, public data slices, and external replication. Until then, Multimodal chain of thought with visual primitives is a compelling design, not a courtroom verdict.

8. DeepSeek V4 Flash Vision Locally: The Hardware Reality

Now for the r/LocalLLaMA question: can you run DeepSeek V4 Flash vision locally?

Not in the casual “download it on your gaming PC tonight” sense. The language backbone is DeepSeek-V4-Flash, a mixture-of-experts model with 284B total parameters and 13B active during inference. The active parameter count is the compute story. The total parameter count is the storage story. You still need the experts somewhere, unless you are offloading, sharding, or using a specialized serving setup.

A rough 4-bit weight-only estimate for 284B parameters lands around 142GB before runtime overhead. In practice, you should think 150GB+ as the entry point, and more if you want sane context, vision components, routing overhead, and speed that does not feel like archaeology. FP16 would be wildly larger. 8-bit would still be uncomfortable for most hobbyist rigs.
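
The back-of-envelope math is straightforward; the sketch below covers weights only, which is exactly why the honest planning number sits above the raw quotient.

```python
# Weight-only memory for a 284B-parameter model at a few precisions.
# Excludes KV cache, activations, the vision tower, and runtime overhead.
params = 284e9
for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("4-bit", 0.5)]:
    print(f"{name:>5}: ~{params * bytes_per_param / 1e9:,.0f} GB")
# FP16: ~568 GB, INT8: ~284 GB, 4-bit: ~142 GB
```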

The compression helps a lot. It makes visual context cheaper. It does not magically turn a 284B MoE into a laptop model. Multimodal chain of thought may become efficient, but the underlying beast is still a beast. A very clever beast, but not a toaster.

9. Why This Matters For Agentic UI

The most obvious product use case is not “better image captioning.” It is agents.

A browser agent needs to click the right button, not describe the vibe of the button. A robotics planner needs to track the object being moved, not maintain a poetic memory of “the cup near the edge.” A design assistant needs to point to the exact component that violates spacing, contrast, or alignment. Ambiguity is not charming when software is taking actions.

This is where Multimodal chain of thought can become infrastructure. If a model can reason through a screenshot while binding each step to coordinates, the path from perception to action gets shorter. The model can say what it is doing, why it is doing it, and where it is doing it.
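
As a sketch of what that buys an executor, imagine each agent step shipping with the box it reasoned about. Everything here (the field names, the click-at-center policy) is hypothetical, not an API from the paper.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class GroundedAction:
    why: str                                  # the reasoning step, in words
    target_box: Tuple[int, int, int, int]     # (x1, y1, x2, y2) on the screenshot
    action: str                               # e.g. "click", "type", "scroll"

step = GroundedAction(
    why="The filled button below the form is the primary submit control.",
    target_box=(620, 512, 742, 548),
    action="click",
)
# The executor acts on the coordinates the model reasoned about, not on a
# text description it has to re-resolve against the screenshot.
cx = (step.target_box[0] + step.target_box[2]) // 2
cy = (step.target_box[1] + step.target_box[3]) // 2
print(step.action, "at", (cx, cy))
```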

For SaaS products, this is not just a benchmark curiosity. It is the shape of reliable UI automation. For robotics, it is a step toward visual plans that stay attached to the physical world. For education, it could let a tutor point at the exact algebra term, circuit node, or anatomical region being discussed.

10. The Limitations Are Not Fine Print

The paper is unusually clear about its limits. The method can still struggle in fine-grained scenarios because input resolution constrains precision. The visual primitive behavior currently depends on explicit trigger words. And point-based reasoning for complex topology still has limited cross-scenario generalization.

That last point deserves attention. Maze navigation and path tracing are useful laboratories, but the real world is messier. Wires occlude each other. Roads have signs and lane markings. Interfaces change state after clicks. A visual primitive that works in a synthetic maze must survive glare, clutter, animation, and bad screenshots.

The research direction is right. The production version will need stronger self-triggering, better uncertainty handling, and tighter integration with perception methods that still matter. The future is not “resolution versus references.” It is both, stitched together carefully.

11. The End Of Text-Only Reasoning

The big lesson from DeepSeek’s work is not that every model now needs to dump coordinates into every answer. Nobody wants a chatbot that responds to “What’s in this photo?” with a coordinate soup. The lesson is subtler: when reasoning depends on visual structure, text alone is a leaky abstraction.

Multimodal chain of thought is strongest when it becomes grounded, selective, and action-ready. A model should know when to narrate, when to point, when to box an object, and when to trace a path. That is closer to how people reason. We do not merely think in sentences. We gesture, sketch, count, mark, and look again.

DeepSeek’s Thinking with Visual Primitives feels like one of those papers that names a problem everyone has been stepping around. The Reference Gap was always there. It showed up whenever a model counted confidently and incorrectly, solved a maze through a wall, or confused two nearly identical objects while sounding very composed about it.

The next generation of Multimodal AI models will not win by seeing more pixels alone. They will win by making better references. More precise pointers. Cleaner memory. Less theatrical certainty. More contact with the thing actually being reasoned about.

If you build AI agents, visual tools, research workflows, or developer products, pay attention to this direction now. The useful frontier is moving from “Can the model see it?” to “Can the model keep hold of what it means?” That is where Multimodal chain of thought stops being a buzzword and starts becoming the control layer for real multimodal intelligence.
