Introduction
If you’ve ever tried to stitch together ten tabs, three PDFs, one spreadsheet, and a half-finished notebook into a clear answer, you know the bottleneck. It’s not the lack of information. It’s the plumbing. Tongyi DeepResearch turns that plumbing into a system: an AI research agent that reads, plans, checks, and synthesizes at web scale, then hands you a defensible result. This isn’t another chat toy. It’s a focused tool for long-horizon, deep information seeking.
1. What Is An Agentic AI? The “Deep Research” Difference

Most language models respond. Agentic systems act. An agentic LLM keeps a plan in working memory, calls tools, adjusts when evidence disagrees, and only then writes. Tongyi DeepResearch is built for that mode of work. It runs a loop of thought, action, and observation, using Search, Visit for targeted page reading, a Python Interpreter, Google Scholar, and a File Parser to gather and verify evidence before drafting a report. That loop is simple on purpose. The goal is reliable, cumulative progress on messy, multi-step questions, the kind you’ll recognize from real research and due diligence.
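To make that loop concrete, here is a minimal sketch of the thought-action-observation cycle. It is an illustration, not the project’s implementation: the Step shape is invented, and call_model and run_tool are hypothetical stand-ins you supply for the model call and the tool dispatch.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    thought: str     # the model's reasoning for this step
    action: str      # a tool name, or "finish"
    arguments: str   # e.g. a query or URL
    answer: str = "" # filled only when action == "finish"

TOOLS = {"search", "visit", "python", "scholar", "file_parser"}

def research_loop(question: str,
                  call_model: Callable[[list[str]], Step],
                  run_tool: Callable[[str, str], str],
                  max_steps: int = 20) -> str:
    """Thought -> action -> observation, until the model emits a final answer."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = call_model(transcript)              # model proposes the next move
        transcript.append(f"Thought: {step.thought}")
        if step.action == "finish":
            return step.answer                     # the final, cited report
        if step.action not in TOOLS:
            transcript.append(f"Observation: unknown tool {step.action!r}")
            continue
        observation = run_tool(step.action, step.arguments)
        transcript.append(f"Action: {step.action}({step.arguments})")
        transcript.append(f"Observation: {observation}")
    return "Step budget exhausted without a final answer."
```

The transcript grows with every step, which is exactly why the context management described later matters.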
2. Why Tongyi DeepResearch Stands Out
Efficiency. The model uses a Mixture-of-Experts backbone with 30.5B total parameters while activating only about 3.3B per token. You get a large model’s reach without paying for every parameter on every step. That design keeps throughput high and cost in check.
Openness. Tongyi DeepResearch ships as open source with paper, code, and model weights publicly available, which matters if you build systems and need transparency, repeatable evaluations, and the freedom to adapt the pipeline for your domain.
Specialization. The team trained for agency, not chat, combining agentic mid-training, supervised fine-tuning, and strictly on-policy reinforcement learning. The outcome is an AI web agent that treats research as an environment to navigate, not a paragraph to autocomplete.
3. How To Use Tongyi DeepResearch: From Zero To First Result
This section answers the practical questions that show up the same day a new tool lands: how do I try it now, how do I wire it into code, and how do I run it myself?
3.1 The Easiest Way, Online Demos And API Access
The fastest path: try an online demo on Hugging Face or ModelScope to get a feel for the behavior. For programmatic use without GPUs, call the model through OpenRouter with the model name alibaba/tongyi-deepresearch-30b-a3b. Send your research prompt, let the agent handle search and browsing, then retrieve the final report and citations. Wrap that call in a small job that sets a tool budget and a timeout, and log every action so you can review how the answer was produced. This keeps you focused on integration, not infrastructure.
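As a sketch of that integration: OpenRouter exposes an OpenAI-compatible endpoint, so the standard openai client works with a base_url override. The model name comes from above; the prompt, timeout, and environment variable name are illustrative choices, not official defaults.

```python
import os
from openai import OpenAI

# OpenRouter is OpenAI-compatible, so only the base_url changes.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="alibaba/tongyi-deepresearch-30b-a3b",
    messages=[{
        "role": "user",
        "content": "Survey the past year of open-source deep-research agents; cite sources.",
    }],
    timeout=600,  # long-horizon research runs need a generous timeout
)
print(response.choices[0].message.content)
```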
3.2 The Power User’s Path, Local Deployment
If you want control, run locally. Create a clean Python 3.10 environment, install the repository requirements, and copy .env.example to .env. Add API keys for a search provider, a page reader, and Scholar access, then set dataset and output paths. Run the provided inference script. You get the full loop, including tool orchestration and a saved report for review. The repository releases reproduction scripts and prompt configs so your settings match the paper.
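A minimal launch sketch, under the assumption that the repository exposes a single inference script; check the repo’s README for the exact entry point, since the script name below is a guess.

```python
import shutil
import subprocess
from pathlib import Path

# One-time setup: copy the template, then fill in your API keys by hand.
if not Path(".env").exists():
    shutil.copy(".env.example", ".env")

# Launch the full loop; the report is saved for review when it finishes.
subprocess.run(["bash", "run_react_infer.sh"], check=True)  # script name: assumption
```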
4. The Tools And API Keys You’ll Need

Results depend on the tools the agent can call. Tongyi DeepResearch expects five core tools, and you can swap in compatible services.
- Search. A web search API that returns a ranked list of candidate sources.
- Visit. A targeted page reader that fetches full content and extracts only the relevant bits. Many teams use Jina to parse pages, then summarize for the specific goal.
- Python Interpreter. For arithmetic, quick data checks, and small plots during the investigation.
- Google Scholar. For academic lookups and citation trails.
- File Parser. To read local files and media, convert everything to text, then answer directly from that unified view.
These are the levers that turn a language model into an AI research agent that can validate itself in the wild.
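A tool registry is one simple way to wire these five levers together. The sketch below is illustrative: the function bodies are stubs, and in practice each would wrap your chosen provider (a search API, a Jina-style page reader, and so on).

```python
from typing import Callable

# Stub signatures; replace each body with a call to your provider.
def search(query: str) -> str: ...
def visit(url: str, goal: str) -> str: ...
def python_interpreter(code: str) -> str: ...
def scholar(query: str) -> str: ...
def file_parser(path: str) -> str: ...

TOOL_REGISTRY: dict[str, Callable[..., str]] = {
    "search": search,              # ranked candidate sources
    "visit": visit,                # goal-directed page reading
    "python": python_interpreter,  # checks, arithmetic, small plots
    "scholar": scholar,            # academic lookups, citation trails
    "file_parser": file_parser,    # local files and media -> text
}

def run_tool(name: str, **kwargs) -> str:
    """Dispatch an agent action to the matching tool."""
    return TOOL_REGISTRY[name](**kwargs)
```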
5. Performance Benchmarks, An Open-Source Challenger With Range
Benchmarks are not the whole story. They’re a useful map. On the standard deep-research suites, Tongyi DeepResearch is competitive with proprietary systems while staying fully open. The team evaluates with fixed inference parameters, a 128K context window, and Avg@3, the average over three runs, for stability.
5.1 Snapshot Of Results
The table below lists representative Avg@3 scores across benchmarks reported in the technical report. Scores will shift as the ecosystem moves; the pattern is what matters.
Tongyi DeepResearch Benchmarks Overview
| Benchmark | Avg@3 |
|---|---|
| Humanity’s Last Exam | 32.9 |
| BrowseComp | 43.4 |
| BrowseComp-ZH | 46.7 |
| WebWalkerQA | 72.2 |
| GAIA | 70.9 |
| xbench-DeepSearch | 75.0 |
| FRAMES | 90.6 |
Source: figures and results in the technical report.
5.2 Heavy Mode, When You Want More Certainty

For hard problems, Tongyi DeepResearch can scale test time. Heavy Mode runs several parallel research rollouts, compresses each trajectory into a context-efficient report, then synthesizes a final answer. Because the reports are compact, the synthesis model stays within context. Heavy Mode lifts accuracy further, for example to 38.3 on Humanity’s Last Exam and 58.1 on BrowseComp-ZH, with a competitive 58.3 on BrowseComp.
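The pattern is easy to sketch. The code below is a hedged illustration of the rollout-compress-synthesize shape, not the project’s implementation; run_rollout, compress, and synthesize are stand-ins you supply, and the rollout count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def heavy_mode(question: str,
               run_rollout: Callable[[str], str],
               compress: Callable[[str], str],
               synthesize: Callable[[str, list[str]], str],
               n: int = 4) -> str:
    """Parallel rollouts -> compact reports -> one synthesis call."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        trajectories = list(pool.map(run_rollout, [question] * n))
    reports = [compress(t) for t in trajectories]  # keep synthesis within context
    return synthesize(question, reports)
```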
6. Hardware Reality Check
People often ask whether a single consumer GPU can run the full model. The honest answer: not comfortably. Tongyi DeepResearch in its unquantized 30B configuration generally wants server-class VRAM. You can experiment with quantized builds as the community produces them, and you can offload tool logic to CPUs. For production speed and consistency, plan for cloud or multi-GPU. Treat local runs as a development environment while you tune prompts, tool limits, and timeouts.
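The arithmetic behind that advice is simple. A rough weights-only estimate, ignoring KV cache, activations, and runtime overhead, which all add more:

```python
# Weights-only memory estimate at common precisions.
params = 30.5e9  # total parameters, per the model card

for precision, bytes_per_param in [("bf16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = params * bytes_per_param / 2**30
    print(f"{precision}: ~{gib:.0f} GiB for weights alone")

# bf16: ~57 GiB  -> beyond any single consumer GPU
# int8: ~28 GiB  -> still past a 24 GB card
# int4: ~14 GiB  -> plausible territory for quantized experiments
```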
6.1 Running On Modest Machines
Most teams do early research on laptops or a single workstation. You can still test ideas. Use a hosted endpoint for the model while running the tool stack locally. Cache search queries and normalized page text so retries are cheap. Add exponential backoff to handle provider QPS limits. Keep automated data extraction near your data and move only the minimal text summaries through the agent. With these basics, you get a realistic feel for costs and latency before you scale out.
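A hedged sketch of the cache-plus-backoff basics described above; fetch stands in for whatever search provider you wire up, and the retry count is an arbitrary choice.

```python
import random
import time
from typing import Callable

_cache: dict[str, str] = {}

def cached_search(query: str,
                  fetch: Callable[[str], str],
                  max_retries: int = 5) -> str:
    """Serve repeats from cache; back off exponentially on provider errors."""
    if query in _cache:
        return _cache[query]                        # retries and reruns are free
    for attempt in range(max_retries):
        try:
            _cache[query] = fetch(query)
            return _cache[query]
        except Exception:
            time.sleep(2 ** attempt + random.random())  # backoff with jitter
    raise RuntimeError(f"search failed after {max_retries} attempts: {query!r}")
```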
7. Architecture In Brief, Why The Training Recipe Matters
Tongyi DeepResearch isn’t a chat model wearing a lab coat. It learns the habits of research. The team trains in phases. Agentic mid-training on long sequences builds the inductive bias for planning, memory management, and multi-step tool use. Supervised fine-tuning supplies a clean starting policy. Strictly on-policy RL sharpens behavior in real or simulated environments with reward on answer correctness, not format tricks. That mix makes the agent deliberate and steady when the web gets noisy.
That setup pairs well with ReAct, which keeps a running chain of thought alongside explicit actions and observations. The implementation also uses context management that maintains a compressed report as working memory, so the agent can push deeper without drowning in its own transcripts. When you enable Heavy Mode, several parallel rollouts explore different tool strategies and a synthesis step merges their compact reports into one answer inside the same context window. It’s a pragmatic path to agentic LLM behavior that scales with your appetite for certainty.
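One way to picture that context management: fold the raw transcript into a running report whenever it outgrows a budget. The sketch below is illustrative, not the project’s code; summarize is a stand-in for a model call, and the character budget is an arbitrary choice.

```python
from typing import Callable

def fold_context(report: str,
                 recent_steps: list[str],
                 summarize: Callable[[str], str],
                 max_chars: int = 8_000) -> str:
    """Keep a compact running report instead of an ever-growing transcript."""
    combined = "\n".join([report, *recent_steps]).strip()
    if len(combined) <= max_chars:
        return combined                  # still fits; keep the detail
    return summarize(combined)           # compress into a fresh working report
```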
8. Practical Setup, A Minimal Recipe You Can Reproduce
You can bring Tongyi DeepResearch into a workflow without inventing new infrastructure. Start with a job runner that can queue tasks and checkpoint partial results. Give the agent budgeted access to the tools above. Add a cache for search and page fetches. Log every action and observation, then publish the final answer with links and working notes. The project includes reproduction scripts and fixed inference parameters so your numbers match the paper, which makes A/B testing straightforward.
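Logging every action can be as small as a wrapper around each tool. A minimal sketch, assuming JSON-serializable tool inputs; the log path and truncation limit are arbitrary choices:

```python
import json
import time
from typing import Any, Callable

def logged(tool_name: str, tool_fn: Callable[..., Any],
           log_path: str = "agent_actions.jsonl") -> Callable[..., Any]:
    """Wrap a tool so every call appends one reviewable JSON line."""
    def wrapper(**kwargs: Any) -> Any:
        result = tool_fn(**kwargs)       # assumes JSON-serializable inputs
        entry = {"ts": time.time(), "tool": tool_name,
                 "inputs": kwargs, "output": str(result)[:2000]}
        with open(log_path, "a") as f:
            f.write(json.dumps(entry) + "\n")
        return result
    return wrapper
```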
8.1 Configuration You’ll Actually Use
- SERPER_KEY_ID for web search.
- JINA_API_KEYS for page parsing and content extraction.
- DASHSCOPE_API_KEY if you use the file parsing pipeline.
- MODEL_PATH, DATASET, OUTPUT_PATH so runs are reproducible.
Set them once, then script the agent launch per dataset. Keep the rest of your stack unchanged.
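A sketch of that per-dataset scripting, using the variable names above; the dataset names and the launch command are illustrative, so match them to your checkout.

```python
import os
import subprocess

# Fail fast if a key from the list above is missing.
for key in ("SERPER_KEY_ID", "JINA_API_KEYS", "DASHSCOPE_API_KEY", "MODEL_PATH"):
    if not os.environ.get(key):
        raise SystemExit(f"{key} is not set; fill it in .env")

# One run per dataset, each with its own output directory.
for dataset in ["gaia_dev", "browsecomp_sample"]:   # illustrative dataset names
    env = {**os.environ, "DATASET": dataset, "OUTPUT_PATH": f"./runs/{dataset}"}
    subprocess.run(["bash", "run_react_infer.sh"], env=env, check=True)
```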
9. The Geopolitical Context, Answering The “CCP” Comments
Let’s address the elephant in the room directly and factually. Tongyi DeepResearch was developed by Tongyi Lab at Alibaba and released openly. It’s public and permissively licensed. Code, weights, and the technical report live on common developer hubs. The release broadens access to a capable open source research AI, which is healthy for the field and for practitioners who need transparency. Debate is fine. Shipping open tools is better.
10. Quick Reference Tables
10.1 Core Specs And Setup
Tongyi DeepResearch Core Specs
| Item | Value |
|---|---|
| Model Type | Agentic LLM with Mixture-of-Experts |
| Total Parameters | 30.5B |
| Activated Per Token | About 3.3B |
| Context Length | 128K |
| Inference Modes | ReAct, Heavy Mode |
| Core Tools | Search, Visit, Python, Scholar, File Parser |
| Reproduction | Official scripts and prompts available |
| License | Apache-2.0, open source |
Numbers and toolset per the technical report.
10.2 Benchmarks At A Glance
Tongyi DeepResearch Benchmark Scores
| Suite | Score Type | Tongyi DeepResearch |
|---|---|---|
| Humanity’s Last Exam | Avg@3 | 32.9 |
| BrowseComp | Avg@3 | 43.4 |
| BrowseComp-ZH | Avg@3 | 46.7 |
| WebWalkerQA | Avg@3 | 72.2 |
| GAIA | Avg@3 | 70.9 |
| xbench-DeepSearch | Avg@3 | 75.0 |
| FRAMES | Avg@3 | 90.6 |
As reported in the paper’s figures and results.
11. Risk And Failure Modes, What To Expect In The Wild
A deep research agent touches the open web, which is messy. Expect three families of failures. First, tool outages and throttling. Fix those with retries, circuit breakers, and provider fallbacks. Second, grounding errors, where the agent cites an off-topic page or misreads a chart. Reduce that with stricter page goals, conservative summarization, and a few Python checks. Third, synthesis drift, where a final paragraph softens or overstates claims. Add a short verification pass that re-reads the citations, re-computes any numbers, and flags unsupported sentences for a human to inspect. Measured this way, your deep research agent behaves like a careful assistant, not a confident storyteller.
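The verification pass can be mechanical. A hedged sketch: re-visit each cited page with the claim as the reading goal, and flag weak support for human review. visit and judge_support are hypothetical stand-ins you supply.

```python
from typing import Callable

def verify_citations(claims: list[tuple[str, str]],        # (sentence, url)
                     visit: Callable[[str, str], str],
                     judge_support: Callable[[str, str], bool]) -> list[str]:
    """Re-read each cited page against its sentence; return weak claims."""
    flagged = []
    for sentence, url in claims:
        page_text = visit(url, f"Does this page support: {sentence}")
        if not judge_support(sentence, page_text):
            flagged.append(sentence)     # route these to a human reviewer
    return flagged
```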
12. Integration Playbook, From Trial To Value
- Start with one narrow use case such as weekly competitor tracking or vendor diligence.
- Write a one-page spec that lists the tools you’ll allow, budgets, and success metrics.
- Set the context window to 128K and keep prompts short to reduce memory churn.
- Log every tool call with inputs and outputs, then sample twenty logs each week for review.
- Teach CI to run a small benchmark set before each deploy so regressions are obvious (see the sketch after this list).
- Publish a human-readable report template with citations, figures, and a one-line verdict.
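A minimal sketch of that CI gate, assuming a small JSON file of question-answer cases; run_agent and grade are stand-ins for your agent call and your scoring rule, and the baseline is an arbitrary choice.

```python
import json
import sys
from pathlib import Path
from typing import Callable

def ci_gate(cases_path: str,
            run_agent: Callable[[str], str],
            grade: Callable[[str, str], bool],
            baseline: float = 0.8) -> None:
    """Run a fixed mini-benchmark; exit nonzero if accuracy regresses."""
    cases = json.loads(Path(cases_path).read_text())  # [{"q": ..., "expected": ...}]
    correct = sum(grade(run_agent(c["q"]), c["expected"]) for c in cases)
    accuracy = correct / len(cases)
    print(f"mini-benchmark accuracy: {accuracy:.2%}")
    if accuracy < baseline:
        sys.exit(1)                      # block the deploy
```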
This cadence builds trust. You’re not shipping magic. You’re shipping a system that reads, reasons, and gives you a clear trail. That is how an AI research agent earns a place in real workflows.
13. Closing Thoughts, And A Concrete Next Step
Tongyi DeepResearch shows what a focused agent can do. It reads widely. It checks its own work. It scales test time when you ask it to be certain. It’s also open and hackable. If you build on Alibaba AI platforms, you can plug it in today. If you prefer a cloud broker, you can call it through OpenRouter. If you need full control, you can run it yourself.
Start small. Pick one question that matters to your team each week, and let Tongyi DeepResearch investigate with a fixed time and tool budget. Compare its answer to your baseline. Keep the logging and the citations. After a few cycles, wire the agent into a real workflow. The sooner you move from curiosity to use, the sooner you find where it shines. That’s the point. Tongyi DeepResearch is a tool for getting real work done.
14. Frequently Asked Questions
1) What is Tongyi DeepResearch, and how is it different from a normal chatbot?
Tongyi DeepResearch is an agentic AI research agent that plans tasks, calls tools like search and page readers, verifies evidence, then writes a cited answer. It is built for long, multi-step web investigations, not small talk.
2) How can I use Tongyi DeepResearch right now?
You can try online demos on Hugging Face or ModelScope, or call it through OpenRouter with the model name alibaba/tongyi-deepresearch-30b-a3b. Power users can run it locally by installing the GitHub repo and configuring API keys.
3) Is Tongyi DeepResearch free to use?
The code and weights are open source under Apache-2.0, so downloading is free. Running it still has costs, either API usage through providers like OpenRouter or hardware and tool API keys when you deploy locally.
4) How does its performance compare to OpenAI’s Deep Research or Gemini?
According to its technical report and model card, Tongyi DeepResearch achieves strong results on web-agent benchmarks such as HLE and BrowseComp, making it a leading open-source option. Proprietary systems may still hold edges on some tasks.
5) Can I run Tongyi DeepResearch on my own computer?
Most users cannot run the full 30B model comfortably on a single consumer GPU. It typically requires data-center class VRAM, while community quantized builds can reduce the footprint for experimentation.
