Introduction
If you’ve ever tried to stitch together ten tabs, three PDFs, one spreadsheet, and a half-finished notebook into a clear answer, you know the bottleneck. It’s not the lack of information. It’s the plumbing. Tongyi DeepResearch turns that plumbing into a system: an AI research agent that reads, plans, checks, and synthesizes at web scale, then hands you a defensible result. This isn’t another chat toy. It’s a focused tool for long-horizon, deep information seeking.
1. What Is An Agentic AI? The “Deep Research” Difference

Most language models respond. Agentic systems act. An agentic LLM keeps a plan in working memory, calls tools, adjusts when evidence disagrees, and only then writes. Tongyi DeepResearch is built for that mode of work. It runs a loop of thought, action, and observation, using Search, Visit for targeted page reading, a Python Interpreter, Google Scholar, and a File Parser to gather and verify evidence before drafting a report. That loop is simple on purpose. The goal is reliable, cumulative progress on messy, multi-step questions, the kind you’ll recognize from real research and due diligence.
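To make that loop concrete, here is a minimal sketch of the thought-action-observation cycle. It is an illustration, not the project’s implementation: the Step shape is invented, and call_model and run_tool are hypothetical stand-ins you supply for the model call and the tool dispatch.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    thought: str     # the model's reasoning for this step
    action: str      # a tool name, or "finish"
    arguments: str   # e.g. a query or URL
    answer: str = "" # filled only when action == "finish"

TOOLS = {"search", "visit", "python", "scholar", "file_parser"}

def research_loop(question: str,
                  call_model: Callable[[list[str]], Step],
                  run_tool: Callable[[str, str], str],
                  max_steps: int = 20) -> str:
    """Thought -> action -> observation, until the model emits a final answer."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = call_model(transcript)              # model proposes the next move
        transcript.append(f"Thought: {step.thought}")
        if step.action == "finish":
            return step.answer                     # the final, cited report
        if step.action not in TOOLS:
            transcript.append(f"Observation: unknown tool {step.action!r}")
            continue
        observation = run_tool(step.action, step.arguments)
        transcript.append(f"Action: {step.action}({step.arguments})")
        transcript.append(f"Observation: {observation}")
    return "Step budget exhausted without a final answer."
```

The transcript grows with every step, which is exactly why the context management described later matters.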
2. Why Tongyi DeepResearch Stands Out
Efficiency. The model uses a Mixture-of-Experts backbone with 30.5B total parameters while activating only about 3.3B per token. You get a large model’s reach without paying for every parameter on every step. That design keeps throughput high and cost in check.
Openness. Tongyi DeepResearch ships as open source with paper, code, and model weights publicly available, which matters if you build systems and need transparency, repeatable evaluations, and the freedom to adapt the pipeline for your domain.
Specialization. The team trained for agency, not chat, combining agentic mid-training, supervised fine-tuning, and strictly on-policy reinforcement learning. The outcome is an AI web agent that treats research as an environment to navigate, not a paragraph to autocomplete.
3. How To Use Tongyi DeepResearch: From Zero To First Result
This section answers the practical questions that show up the same day a new tool lands: how do I try it now, how do I wire it into code, and how do I run it myself?
3.1 The Easiest Way, Online Demos And API Access
The fastest path: try an online demo on Hugging Face or ModelScope to get a feel for the behavior. For programmatic use without GPUs, call the model through OpenRouter with the model name alibaba/tongyi-deepresearch-30b-a3b. Send your research prompt, let the agent handle search and browsing, then retrieve the final report and citations. Wrap that call in a small job that sets a tool budget and a timeout, and log every action so you can review how the answer was produced. This keeps you focused on integration, not infrastructure.
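As a sketch of that integration: OpenRouter exposes an OpenAI-compatible endpoint, so the standard openai client works with a base_url override. The model name comes from above; the prompt, timeout, and environment variable name are illustrative choices, not official defaults.

```python
import os
from openai import OpenAI

# OpenRouter is OpenAI-compatible, so only the base_url changes.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="alibaba/tongyi-deepresearch-30b-a3b",
    messages=[{
        "role": "user",
        "content": "Survey the past year of open-source deep-research agents; cite sources.",
    }],
    timeout=600,  # long-horizon research runs need a generous timeout
)
print(response.choices[0].message.content)
```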
3.2 The Power User’s Path, Local Deployment
If you want control, run locally. Create a clean Python 3.10 environment, install the repository requirements, and copy .env.example to .env. Add API keys for a search provider, a page reader, and Scholar access, then set dataset and output paths. Run the provided inference script. You get the full loop, including tool orchestration and a saved report for review. The repository releases reproduction scripts and prompt configs so your settings match the paper.
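A minimal launch sketch, under the assumption that the repository exposes a single inference script; check the repo’s README for the exact entry point, since the script name below is a guess.

```python
import shutil
import subprocess
from pathlib import Path

# One-time setup: copy the template, then fill in your API keys by hand.
if not Path(".env").exists():
    shutil.copy(".env.example", ".env")

# Launch the full loop; the report is saved for review when it finishes.
subprocess.run(["bash", "run_react_infer.sh"], check=True)  # script name: assumption
```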
4. The Tools And API Keys You’ll Need

Results depend on the tools the agent can call. Tongyi DeepResearch expects five core tools, and you can swap in compatible services.
- Search. A web search API that returns a ranked list of candidate sources.
- Visit. A targeted page reader that fetches full content and extracts only the relevant bits. Many teams use Jina to parse pages, then summarize for the specific goal.
- Python Interpreter. For arithmetic, quick data checks, and small plots during the investigation.
- Google Scholar. For academic lookups and citation trails.
- File Parser. To read local files and media, convert everything to text, then answer directly from that unified view.
These are the levers that turn a language model into an AI research agent that can validate itself in the wild.
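A tool registry is one simple way to wire these five levers together. The sketch below is illustrative: the function bodies are stubs, and in practice each would wrap your chosen provider (a search API, a Jina-style page reader, and so on).

```python
from typing import Callable

# Stub signatures; replace each body with a call to your provider.
def search(query: str) -> str: ...
def visit(url: str, goal: str) -> str: ...
def python_interpreter(code: str) -> str: ...
def scholar(query: str) -> str: ...
def file_parser(path: str) -> str: ...

TOOL_REGISTRY: dict[str, Callable[..., str]] = {
    "search": search,              # ranked candidate sources
    "visit": visit,                # goal-directed page reading
    "python": python_interpreter,  # checks, arithmetic, small plots
    "scholar": scholar,            # academic lookups, citation trails
    "file_parser": file_parser,    # local files and media -> text
}

def run_tool(name: str, **kwargs) -> str:
    """Dispatch an agent action to the matching tool."""
    return TOOL_REGISTRY[name](**kwargs)
```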
5. Performance Benchmarks, An Open-Source Challenger With Range
Benchmarks are not the whole story. They’re a useful map. On the standard deep-research suites, Tongyi DeepResearch is competitive with proprietary systems while staying fully open. The team evaluates with fixed inference parameters, a 128K context window, and Avg@3, the average over three runs, for stability.
5.1 Snapshot Of Results
The table below lists representative Avg@3 scores across benchmarks reported in the technical report. Scores will shift as the ecosystem moves; the pattern is what matters.
Tongyi DeepResearch Benchmarks Overview
| Benchmark | Avg@3 |
|---|---|
| Humanity’s Last Exam | 32.9 |
| BrowseComp | 43.4 |
| BrowseComp-ZH | 46.7 |
| WebWalkerQA | 72.2 |
| GAIA | 70.9 |
| xbench-DeepSearch | 75.0 |
| FRAMES | 90.6 |
Source: figures and results in the technical report.
5.2 Heavy Mode, When You Want More Certainty

For hard problems, Tongyi DeepResearch can scale test time. Heavy Mode runs several parallel research rollouts, compresses each trajectory into a context-efficient report, then synthesizes a final answer. Because the reports are compact, the synthesis model stays within context. Heavy Mode lifts accuracy further, for example to 38.3 on Humanity’s Last Exam and 58.1 on BrowseComp-ZH, with a competitive 58.3 on BrowseComp.
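The pattern is easy to sketch. The code below is a hedged illustration of the rollout-compress-synthesize shape, not the project’s implementation; run_rollout, compress, and synthesize are stand-ins you supply, and the rollout count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def heavy_mode(question: str,
               run_rollout: Callable[[str], str],
               compress: Callable[[str], str],
               synthesize: Callable[[str, list[str]], str],
               n: int = 4) -> str:
    """Parallel rollouts -> compact reports -> one synthesis call."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        trajectories = list(pool.map(run_rollout, [question] * n))
    reports = [compress(t) for t in trajectories]  # keep synthesis within context
    return synthesize(question, reports)
```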
6. Hardware Reality Check
People often ask whether a single consumer GPU can run the full model. The honest answer: not comfortably. Tongyi DeepResearch in its unquantized 30B configuration generally wants server-class VRAM. You can experiment with quantized builds as the community produces them, and you can offload tool logic to CPUs. For production speed and consistency, plan for cloud or multi-GPU. Treat local runs as a development environment while you tune prompts, tool limits, and timeouts.
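The arithmetic behind that advice is simple. A rough weights-only estimate, ignoring KV cache, activations, and runtime overhead, which all add more:

```python
# Weights-only memory estimate at common precisions.
params = 30.5e9  # total parameters, per the model card

for precision, bytes_per_param in [("bf16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = params * bytes_per_param / 2**30
    print(f"{precision}: ~{gib:.0f} GiB for weights alone")

# bf16: ~57 GiB  -> beyond any single consumer GPU
# int8: ~28 GiB  -> still past a 24 GB card
# int4: ~14 GiB  -> plausible territory for quantized experiments
```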
6.1 Running On Modest Machines
Most teams do early research on laptops or a single workstation. You can still test ideas. Use a hosted endpoint for the model while running the tool stack locally. Cache search queries and normalized page text so retries are cheap. Add exponential backoff to handle provider QPS limits. Keep automated data extraction near your data and move only the minimal text summaries through the agent. With these basics, you get a realistic feel for costs and latency before you scale out.
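A hedged sketch of the cache-plus-backoff basics described above; fetch stands in for whatever search provider you wire up, and the retry count is an arbitrary choice.

```python
import random
import time
from typing import Callable

_cache: dict[str, str] = {}

def cached_search(query: str,
                  fetch: Callable[[str], str],
                  max_retries: int = 5) -> str:
    """Serve repeats from cache; back off exponentially on provider errors."""
    if query in _cache:
        return _cache[query]                        # retries and reruns are free
    for attempt in range(max_retries):
        try:
            _cache[query] = fetch(query)
            return _cache[query]
        except Exception:
            time.sleep(2 ** attempt + random.random())  # backoff with jitter
    raise RuntimeError(f"search failed after {max_retries} attempts: {query!r}")
```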
7. Architecture In Brief, Why The Training Recipe Matters
Tongyi DeepResearch isn’t a chat model wearing a lab coat. It learns the habits of research. The team trains in phases. Agentic mid-training on long sequences builds the inductive bias for planning, memory management, and multi-step tool use. Supervised fine-tuning supplies a clean starting policy. Strictly on-policy RL sharpens behavior in real or simulated environments with reward on answer correctness, not format tricks. That mix makes the agent deliberate and steady when the web gets noisy.
That setup pairs well with ReAct, which keeps a running chain of thought alongside explicit actions and observations. The implementation also uses context management that maintains a compressed report as working memory, so the agent can push deeper without drowning in its own transcripts. When you enable Heavy Mode, several parallel rollouts explore different tool strategies and a synthesis step merges their compact reports into one answer inside the same context window. It’s a pragmatic path to agentic LLM behavior that scales with your appetite for certainty.
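One way to picture that context management: fold the raw transcript into a running report whenever it outgrows a budget. The sketch below is illustrative, not the project’s code; summarize is a stand-in for a model call, and the character budget is an arbitrary choice.

```python
from typing import Callable

def fold_context(report: str,
                 recent_steps: list[str],
                 summarize: Callable[[str], str],
                 max_chars: int = 8_000) -> str:
    """Keep a compact running report instead of an ever-growing transcript."""
    combined = "\n".join([report, *recent_steps]).strip()
    if len(combined) <= max_chars:
        return combined                  # still fits; keep the detail
    return summarize(combined)           # compress into a fresh working report
```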
8. Practical Setup, A Minimal Recipe You Can Reproduce
You can bring Tongyi DeepResearch into a workflow without inventing new infrastructure. Start with a job runner that can queue tasks and checkpoint partial results. Give the agent budgeted access to the tools above. Add a cache for search and page fetches. Log every action and observation, then publish the final answer with links and working notes. The project includes reproduction scripts and fixed inference parameters so your numbers match the paper, which makes A/B testing straightforward.
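Logging every action can be as small as a wrapper around each tool. A minimal sketch, assuming JSON-serializable tool inputs; the log path and truncation limit are arbitrary choices:

```python
import json
import time
from typing import Any, Callable

def logged(tool_name: str, tool_fn: Callable[..., Any],
           log_path: str = "agent_actions.jsonl") -> Callable[..., Any]:
    """Wrap a tool so every call appends one reviewable JSON line."""
    def wrapper(**kwargs: Any) -> Any:
        result = tool_fn(**kwargs)       # assumes JSON-serializable inputs
        entry = {"ts": time.time(), "tool": tool_name,
                 "inputs": kwargs, "output": str(result)[:2000]}
        with open(log_path, "a") as f:
            f.write(json.dumps(entry) + "\n")
        return result
    return wrapper
```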
8.1 Configuration You’ll Actually Use
- SERPER_KEY_ID for web search.
- JINA_API_KEYS for page parsing and content extraction.
- DASHSCOPE_API_KEY if you use the file parsing pipeline.
- MODEL_PATH, DATASET, OUTPUT_PATH so runs are reproducible.
Set them once, then script the agent launch per dataset. Keep the rest of your stack unchanged.
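A sketch of that per-dataset scripting, using the variable names above; the dataset names and the launch command are illustrative, so match them to your checkout.

```python
import os
import subprocess

# Fail fast if a key from the list above is missing.
for key in ("SERPER_KEY_ID", "JINA_API_KEYS", "DASHSCOPE_API_KEY", "MODEL_PATH"):
    if not os.environ.get(key):
        raise SystemExit(f"{key} is not set; fill it in .env")

# One run per dataset, each with its own output directory.
for dataset in ["gaia_dev", "browsecomp_sample"]:   # illustrative dataset names
    env = {**os.environ, "DATASET": dataset, "OUTPUT_PATH": f"./runs/{dataset}"}
    subprocess.run(["bash", "run_react_infer.sh"], env=env, check=True)
```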
9. The Geopolitical Context, Answering The “CCP” Comments
Let’s address the elephant in the room directly and factually. Tongyi DeepResearch was developed by Tongyi Lab at Alibaba and released openly. It’s public and permissively licensed. Code, weights, and the technical report live on common developer hubs. The release broadens access to a capable open source research AI, which is healthy for the field and for practitioners who need transparency. Debate is fine. Shipping open tools is better.
10. Quick Reference Tables
10.1 Core Specs And Setup
Tongyi DeepResearch Core Specs
| Item | Value |
|---|---|
| Model Type | Agentic LLM with Mixture-of-Experts |
| Total Parameters | 30.5B |
| Activated Per Token | About 3.3B |
| Context Length | 128K |
| Inference Modes | ReAct, Heavy Mode |
| Core Tools | Search, Visit, Python, Scholar, File Parser |
| Reproduction | Official scripts and prompts available |
| License | Apache-2.0, open source |
Numbers and toolset per the technical report.
10.2 Benchmarks At A Glance
Tongyi DeepResearch Benchmark Scores
| Suite | Score Type | Tongyi DeepResearch |
|---|---|---|
| Humanity’s Last Exam | Avg@3 | 32.9 |
| BrowseComp | Avg@3 | 43.4 |
| BrowseComp-ZH | Avg@3 | 46.7 |
| WebWalkerQA | Avg@3 | 72.2 |
| GAIA | Avg@3 | 70.9 |
| xbench-DeepSearch | Avg@3 | 75.0 |
| FRAMES | Avg@3 | 90.6 |
As reported in the paper’s figures and results.
11. Risk And Failure Modes, What To Expect In The Wild
A deep research agent touches the open web, which is messy. Expect three families of failures. First, tool outages and throttling. Fix those with retries, circuit breakers, and provider fallbacks. Second, grounding errors, where the agent cites an off-topic page or misreads a chart. Reduce that with stricter page goals, conservative summarization, and a few Python checks. Third, synthesis drift, where a final paragraph softens or overstates claims. Add a short verification pass that re-reads the citations, re-computes any numbers, and flags unsupported sentences for a human to inspect. Measured this way, your deep research agent behaves like a careful assistant, not a confident storyteller.
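The verification pass can be mechanical. A hedged sketch: re-visit each cited page with the claim as the reading goal, and flag weak support for human review. visit and judge_support are hypothetical stand-ins you supply.

```python
from typing import Callable

def verify_citations(claims: list[tuple[str, str]],        # (sentence, url)
                     visit: Callable[[str, str], str],
                     judge_support: Callable[[str, str], bool]) -> list[str]:
    """Re-read each cited page against its sentence; return weak claims."""
    flagged = []
    for sentence, url in claims:
        page_text = visit(url, f"Does this page support: {sentence}")
        if not judge_support(sentence, page_text):
            flagged.append(sentence)     # route these to a human reviewer
    return flagged
```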
12. Integration Playbook, From Trial To Value
- Start with one narrow use case such as weekly competitor tracking or vendor diligence.
- Write a one-page spec that lists the tools you’ll allow, budgets, and success metrics.
- Set the context window to 128K and keep prompts short to reduce memory churn.
- Log every tool call with inputs and outputs, then sample twenty logs each week for review.
- Teach CI to run a small benchmark set before each deploy so regressions are obvious (see the sketch after this list).
- Publish a human-readable report template with citations, figures, and a one-line verdict.
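A minimal sketch of that CI gate, assuming a small JSON file of question-answer cases; run_agent and grade are stand-ins for your agent call and your scoring rule, and the baseline is an arbitrary choice.

```python
import json
import sys
from pathlib import Path
from typing import Callable

def ci_gate(cases_path: str,
            run_agent: Callable[[str], str],
            grade: Callable[[str, str], bool],
            baseline: float = 0.8) -> None:
    """Run a fixed mini-benchmark; exit nonzero if accuracy regresses."""
    cases = json.loads(Path(cases_path).read_text())  # [{"q": ..., "expected": ...}]
    correct = sum(grade(run_agent(c["q"]), c["expected"]) for c in cases)
    accuracy = correct / len(cases)
    print(f"mini-benchmark accuracy: {accuracy:.2%}")
    if accuracy < baseline:
        sys.exit(1)                      # block the deploy
```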
This cadence builds trust. You’re not shipping magic. You’re shipping a system that reads, reasons, and gives you a clear trail. That is how an AI research agent earns a place in real workflows.
13. Closing Thoughts, And A Concrete Next Step
Tongyi DeepResearch shows what a focused agent can do. It reads widely. It checks its own work. It scales test time when you ask it to be certain. It’s also open and hackable. If you build on Alibaba AI platforms, you can plug it in today. If you prefer a cloud broker, you can call it through OpenRouter. If you need full control, you can run it yourself.
Start small. Pick one question that matters to your team each week, and let Tongyi DeepResearch investigate with a fixed time and tool budget. Compare its answer to your baseline. Keep the logging and the citations. After a few cycles, wire the agent into a real workflow. The sooner you move from curiosity to use, the sooner you find where it shines. That’s the point. Tongyi DeepResearch is a tool for getting real work done.
14. Frequently Asked Questions
1) What is Tongyi DeepResearch, and how is it different from a normal chatbot?
Tongyi DeepResearch is an agentic AI research agent that plans tasks, calls tools like search and page readers, verifies evidence, then writes a cited answer. It is built for long, multi-step web investigations, not small talk.
2) How can I use Tongyi DeepResearch right now?
You can try online demos on Hugging Face or ModelScope, or call it through OpenRouter with the model name alibaba/tongyi-deepresearch-30b-a3b. Power users can run it locally by installing the GitHub repo and configuring API keys.
3) Is Tongyi DeepResearch free to use?
The code and weights are open source under Apache-2.0, so downloading is free. Running it still has costs, either API usage through providers like OpenRouter or hardware and tool API keys when you deploy locally.
4) How does its performance compare to OpenAI’s Deep Research or Gemini?
According to its technical report and model card, Tongyi DeepResearch achieves strong results on web-agent benchmarks such as HLE and BrowseComp, making it a leading open-source option. Proprietary systems may still hold edges on some tasks.
5) Can I run Tongyi DeepResearch on my own computer?
Most users cannot run the full 30B model comfortably on a single consumer GPU. It typically requires data-center class VRAM, while community quantized builds can reduce the footprint for experimentation.
