Introduction
Speed claims are cheap. Latency is not.
Anyone can make a language model “faster” by picking an easy prompt, a short output, and a baseline that was never tuned. The harder problem is shaving seconds off the stuff people actually wait on. Code that must compile. Math that must stay consistent. Tool-call arguments that must match a schema.
WeDLM 8B is Tencent’s attempt to make diffusion-style parallel decoding pay off in the only place it matters: real serving stacks. This 8B model sits in a pragmatic size tier, big enough to be useful, small enough to run without a data center. The design goal is blunt: keep standard causal attention, keep prefix KV caching, and still predict more than one token per forward pass. The paper argues that this combination is why diffusion decoding can beat an optimized autoregressive engine under matched deployment settings, instead of winning only in toy demos.
1. WeDLM 8B In One Paragraph: What Tencent Shipped And Why It Matters
WeDLM 8B is a diffusion language model that stays inside standard causal attention, then uses Topological Reordering to give masked positions full observed context without switching to bidirectional attention. On top of that, it adds a streaming decoder that continuously commits confident tokens into a growing left-to-right prefix, so work turns into cache-valid progress instead of recomputation. In the authors’ own summary, that combination is what enables speedups that approach 3× on challenging reasoning benchmarks and up to 10× in low-entropy generation, against vLLM-served baselines under matched settings.
Right away, that tells you what this is really about. It is not a new backbone. It is a new decode loop that is trying to play nicely with the infrastructure everyone already uses.
Table 1. Quick Download Decision
WeDLM 8B Model Pick Guide
Choose Base vs Instruct in one glance.
| Pick | Best For | First Move |
|---|---|---|
| WeDLM 8B Base | Fine-tuning, research, controlled evals | Start here if you want to change behavior |
| WeDLM 8B Instruct | Chat, coding, day-to-day assistant use | Start here if you want answers today |
2. Base Vs Instruct: Two Models, Two Jobs
The naming is boring, but the choice matters.
The base checkpoint is for people who like to own the model. You fine-tune. You run ablations. You evaluate the decoding method without mixing in policy layers and chat formatting. If you are comparing this approach to a classic autoregressive model, base-to-base is the cleanest experiment.
The instruction-tuned checkpoint is the one most readers will actually use. WeDLM 8B Instruct is meant to follow directions, handle multi-turn dialogue, and behave like a modern assistant, not like a raw pretrain.
If you’re torn, download WeDLM 8B Base for experiments, and keep WeDLM 8B Instruct around for real work. One tells you what the method is doing, the other tells you whether it helps your day.
A simple decision rule:
- Shipping an assistant or a coding helper, start with the instruct model.
- Building a domain model or writing a paper, start with base.
- Doing both, fine-tune base, then benchmark against instruct as a reality check.
3. Why The Qwen3 8B Comparison Is The Only Fair Baseline
WeDLM does not pretend it was born from scratch. The paper states that WeDLM is initialized from pretrained Qwen-family autoregressive checkpoints, including Qwen3-8B, then adapted with continued pretraining and later supervised fine-tuning.
For WeDLM 8B, that means the honest baseline is Qwen3-8B in the same serving conditions. That is why WeDLM vs Qwen3 8B Instruct is the comparison that feels fair. Same size class, same lineage, and a baseline that already has a serious inference story. It also makes a better LLM latency benchmark, because you are not comparing a new method to an unoptimized reference implementation.
4. Diffusion Language Model Vs Autoregressive Model: The Only Difference That Pays Rent
Autoregressive decoding is sequential by construction. One token depends on the previous tokens, so you generate, commit, and move on. KV caching makes this efficient because the prefix is reusable.
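To see why the prefix matters, here is a toy count of token instances processed with and without cache reuse. This is a sketch, not a real decoder: `model_step` is a stub standing in for a forward pass, and the function names are mine.

```python
def decode_no_cache(model_step, prompt, n_new):
    """Recompute attention over the whole prefix for every new token."""
    seq, work = list(prompt), 0
    for _ in range(n_new):
        work += len(seq)                   # attention touches every prefix token
        seq.append(model_step(seq))
    return seq, work

def decode_with_cache(model_step, prompt, n_new):
    """Reuse cached key/value states: each step processes one new token."""
    seq, work = list(prompt), len(prompt)  # one prefill pass over the prompt
    for _ in range(n_new):
        work += 1                          # only the newest token is computed
        seq.append(model_step(seq))
    return seq, work

step = lambda seq: 0  # stub "model" that always predicts token 0
_, w_slow = decode_no_cache(step, [1] * 100, 100)
_, w_fast = decode_with_cache(step, [1] * 100, 100)
print(w_slow, w_fast)  # 14950 200
```

Same 100 tokens out, roughly 75× less attention work with the cache. That gap is the baseline any parallel-decoding scheme has to respect.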
A Diffusion LLM tries to fill multiple missing tokens per step. Conceptually, it can trade “more work per step” for “fewer steps,” which is exactly what you want on parallel hardware.
The reason diffusion text has been hard to deploy is not philosophical. It is mechanical. Many diffusion approaches rely on bidirectional attention so every masked position can see the full context, including tokens to its right. That breaks standard prefix caching and forces repeated contextualization, which is the opposite of what modern serving engines are built to do.
The rest of the paper is an argument that staying causal is the difference between “fast on paper” and “fast in production.”
5. KV Cache: The Constraint That Kills Parallel Dreams
If you want to understand deployment speed, stop staring at “tokens predicted per forward.” The more reliable question is: how many of those predictions become a committed prefix that the cache can reuse.
The paper formalizes this with prefix cacheability, p_cache. It defines p_cache as the ratio of final new tokens to the total token instances processed across all post-prefill forwards. Low p_cache means the model is doing work it will redo, which inflates compute and destroys latency.
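In code, the definition is a one-liner; the two scenarios below are illustrative numbers of mine, not from the paper.

```python
def p_cache(final_new_tokens: int, processed_token_instances: int) -> float:
    """Prefix cacheability: final committed tokens over all token
    instances processed across post-prefill forwards."""
    return final_new_tokens / processed_token_instances

# An engine that commits every prediction: perfect cacheability.
print(p_cache(256, 256))  # 1.0
# A loop that re-contextualizes a 32-token block four times before
# committing it processes 128 instances for 32 final tokens.
print(p_cache(32, 128))  # 0.25
```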
This is where the approach earns its keep. If you can commit earlier, you can turn “I predicted several tokens” into “I moved the cache forward,” which is what inference engines actually reward.
Once you see the problem this way, the solution path becomes obvious. Improve commitment, not just prediction.
6. Topological Reordering: How You Get Full Context Without Going Bidirectional

Topological Reordering is the core trick inside WeDLM 8B. Instead of changing the attention mask, it changes the physical order of computation. Observed tokens are permuted into the physical prefix, masked tokens follow, and logical positions are preserved via the position ids. Under a standard causal mask, that makes every masked token able to attend to all observed tokens, without any bidirectional attention hack.
This is the compatibility story in one paragraph. If the attention remains causal, the model fits the ecosystem. If it goes bidirectional, you are suddenly rewriting half your inference stack.
It also answers the quiet question engineers ask first: “Will this break my cache?” The whole point is that it should not.
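A minimal sketch of the idea, assuming a toy tuple representation (the helper name and data layout are mine, not the paper's API): observed tokens move to the physical front, masks move to the back, and each token carries its logical position id so rotary-style embeddings still see the true order.

```python
def topological_reorder(window):
    """Reorder a decode window so observed tokens physically precede
    masks, while every token keeps its logical position id.
    `window` is a list of (logical_pos, token); None marks a mask."""
    observed = [(p, t) for p, t in window if t is not None]
    masked = [(p, t) for p, t in window if t is None]
    physical = observed + masked                # physical order fed to the model
    position_ids = [p for p, _ in physical]     # logical positions ride along
    return physical, position_ids

# Logical window over positions 10..14, where 11 and 13 are still masked.
window = [(10, "The"), (11, None), (12, "ran"), (13, None), (14, ".")]
physical, pos_ids = topological_reorder(window)
print(pos_ids)  # [10, 12, 14, 11, 13]
```

Under a plain causal mask, the two masked slots now sit physically after every observed token, so each mask attends to all observed context, including tokens logically to its right, with no bidirectional attention anywhere.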
7. Streaming Parallel Decoding: Turning Predictions Into Cache-Valid Progress

Streaming is the second core trick in WeDLM 8B.
Block diffusion has an annoying habit: stop, predict a block, then wait until the whole block is finalized before committing anything. The paper illustrates that this stop-and-wait behavior prevents immediate caching, then contrasts it with a streaming approach where resolved tokens can be committed as soon as they become cache-valid.
Streaming Parallel Decoding keeps a fixed-size window W. Each step reorders filled tokens before masks, runs a causal forward conditioned on the persistent KV cache, commits the leftmost contiguous filled prefix, fills some masks based on confidence, then refills with new masks to keep parallelism constant.
Two details matter in practice:
- Commitment is left-to-right, because only a prefix is cacheable.
- Mask selection is biased toward earlier positions using an entropy rule plus a distance penalty, so contiguous prefixes form more often.
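The loop above can be sketched with a toy window. The `fill_fn` interface is a stand-in for the model's confidence-gated mask filling, not WeDLM's real API, and the reordering step is elided to keep the commit-and-refill mechanics visible.

```python
MASK = None

def stream_decode(fill_fn, window_size=8, steps=4):
    """Toy streaming loop: fill some masks, commit the leftmost
    contiguous filled prefix (only a prefix is cache-valid), then
    top the window back up with fresh masks."""
    committed, window = [], [MASK] * window_size
    for _ in range(steps):
        window = fill_fn(window)                 # model step may fill any masks
        n = 0
        while n < len(window) and window[n] is not MASK:
            n += 1                               # length of committable prefix
        committed.extend(window[:n])
        window = window[n:] + [MASK] * n         # refill to constant size
    return committed

def toy_fill(window):
    """Pretend the model confidently resolves physical slots 0 and 2."""
    out = list(window)
    for i in (0, 2):
        if out[i] is MASK:
            out[i] = "tok"
    return out

print(len(stream_decode(toy_fill)))  # 8 tokens committed in 4 steps
```

Notice the rhythm: a step that fills a non-contiguous pattern commits only one token, but the leftover filled token makes the next step commit three. Biasing fills toward early positions is exactly what keeps that second, high-commit case common.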
8. Benchmarks: Where The Gains Show Up, And Where They Don’t

The paper reports results for both base and instruct variants, and it evaluates against Qwen baselines plus several diffusion baselines. Across the board, the authors’ headline trend is that WeDLM preserves, and often improves, the capabilities of its underlying autoregressive checkpoints.
Here is the practical slice that most readers care about.
WeDLM 8B Benchmarks vs Qwen3 8B
Quality snapshot across Base and Instruct checkpoints.
| Benchmark | Qwen3-8B (Base) | WeDLM-8B (Base) | Qwen3-8B (Instruct) | WeDLM-8B (Instruct) |
|---|---|---|---|---|
| ARC-C (0-shot) | 92.66 | 92.92 (+0.26) | 91.47 | 92.92 (+1.45) |
| GSM8K (3-shot) | 85.97 | 90.20 (+4.23) | 89.91 | 92.27 (+2.36) |
| MATH (4-shot) | 50.80 | 53.60 (+2.80) | 69.60 | 64.80 (−4.80) |
| MMLU (5-shot) | 74.03 | 75.46 (+1.43) | 71.52 | 75.14 (+3.62) |
| GPQA-Diamond (5-shot) | 37.00 | 42.42 (+5.42) | 41.41 | 44.95 (+3.54) |
| HumanEval (4-shot) | 68.90 | 75.00 (+6.10) | 71.95 | 80.49 (+8.54) |
The one obvious wrinkle is MATH on the instruct side. WeDLM 8B improves MATH in the base setting, but the instruct variant is lower than the Qwen3 instruct baseline in this table.
Do not overreact. Do the thing engineers do when they get new machinery. Test your own workload. If your “math” is actually structured tool calls and deterministic steps, you may still see net wins. If your math is long, brittle proofs, measure first.
9. Speed Claims: When You Get 3×, When You Get 10×, And When You Won’t
The authors put real numbers in the abstract: speedups approaching 3× on challenging reasoning, and up to 10× in low-entropy generation regimes.
That last phrase, low entropy, is the cheat code for predicting where WeDLM 8B will feel great.
- Code generation often has low entropy because syntax is unforgiving.
- Structured reasoning often has lower entropy once the model commits to a plan.
- Open-ended writing stays high entropy for longer, and confidence-based acceptance becomes harder.
The paper explicitly calls out the disparity: low-entropy tasks can get over 8× speedup, while high-entropy generation sees diminishing returns because uncertainty limits parallel acceptance.
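You can see the mechanism in a few lines of arithmetic. The threshold here is a made-up illustration, not the paper's setting: the point is only that a peaked distribution clears an entropy gate and a flat one does not.

```python
import math

def entropy_nats(probs):
    """Shannon entropy of a next-token distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = [0.97, 0.01, 0.01, 0.01]   # code-like, low-entropy step
uncertain = [0.30, 0.30, 0.20, 0.20]   # open-ended, high-entropy step
tau = 0.5                              # hypothetical acceptance threshold

print(entropy_nats(confident) < tau)   # True: safe to accept in parallel
print(entropy_nats(uncertain) < tau)   # False: uncertainty blocks acceptance
```

Low-entropy text lets many positions clear the gate per step; high-entropy text forces the decoder back toward one-token-at-a-time behavior, which is why the speedup range is so wide.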
One more deployment detail is easy to miss and hard to overvalue. Streaming decoding reduces each step to a causal forward over a small window conditioned on the cached prefix, which is natively supported by common acceleration methods like FlashAttention, PagedAttention, and CUDA Graphs.
10. How To Run WeDLM 8B Locally: Fast Path, Then Transformers Path
If you want to feel the idea, start with the inference path, not the training path.
10.1. Fast Path: WeDLM Engine
WeDLM 8B Instruct: 60-Second Local Run
Install the engine, load the model, then generate a structured answer.
pip install git+https://github.com/tencent/WeDLM.git
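Once installed, a first run can stay this small. The script below is the same minimal flow, kept deliberately boring: one structured prompt, low temperature, one generation call.

```python
# Minimal first run; assumes the wedlm package from the repo above is installed.
from wedlm import LLM, SamplingParams

llm = LLM(model="tencent/WeDLM-8B-Instruct")
prompt = (
    "Solve step by step: A store sells apples for $2 each and oranges "
    "for $3 each. Tom bought 5 apples and 4 oranges. How much did he spend?"
)
out = llm.generate([prompt], SamplingParams(temperature=0.2, max_tokens=256))
print(out[0]["text"])
```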
from wedlm import LLM, SamplingParams
llm = LLM(model="tencent/WeDLM-8B-Instruct")
prompt = "Solve step by step: A store sells apples for $2 each and oranges for $3 each. Tom bought 5 apples and 4 oranges. How much did he spend?"
out = llm.generate([prompt], SamplingParams(temperature=0.2, max_tokens=256))
print(out[0]["text"])

Start with a low temperature and a structured prompt. You want the method to operate in the regime where confidence-based commitment works.
10.2. Dev Path: Transformers For Training And Forward Passes
Use Transformers when you are fine-tuning, probing, or integrating with existing training code. The paper’s thesis is about inference throughput, so treat Transformers as the flexible route, not the fastest route.
11. Hardware And Deployment Reality: GPU Vs CPU, VRAM, Quantization, Offloading
WeDLM 8B is still an 8B model. Memory math still applies.
A useful mental model:
- FP16 weights are roughly 16 GB, plus overhead.
- 8-bit quantization roughly halves that, and 4-bit goes lower.
- KV cache cost scales with context length, batch size, and generated length.
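The same list as back-of-envelope code. The layer and head counts below are illustrative assumptions in the style of an 8B GQA model, not confirmed WeDLM 8B architecture numbers, so treat the output as a sizing sanity check, not a spec.

```python
def vram_estimate_gb(params_b=8.0, weight_bits=16, layers=36, kv_heads=8,
                     head_dim=128, ctx=8192, batch=1, kv_bits=16,
                     overhead_gb=1.5):
    """Rough VRAM estimate in GB: weights + KV cache + fixed overhead.
    All defaults are hypothetical illustration values."""
    weights = params_b * 1e9 * weight_bits / 8 / 1e9
    # K and V per layer: 2 * kv_heads * head_dim values per cached position.
    kv = 2 * layers * kv_heads * head_dim * ctx * batch * kv_bits / 8 / 1e9
    return weights + kv + overhead_gb

print(round(vram_estimate_gb(weight_bits=16), 1))  # 18.7 (FP16 weights)
print(round(vram_estimate_gb(weight_bits=4), 1))   # 6.7 (4-bit weights, FP16 cache)
```

Note that quantizing weights does nothing for the KV cache term, which keeps growing with context and batch size either way.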
When you benchmark WeDLM 8B, do it on a GPU first. Parallel decoding is meant to keep accelerators saturated. On CPU, bandwidth and per-step overhead often dominate, so do not expect miracles. Establish a GPU baseline, then decide whether quantization or offloading buys enough cost savings to justify the complexity.
12. Ecosystem Support And Gotchas: vLLM, GGUF, Ollama, MLX
The paper’s biggest practical bet is compatibility. The evaluation setup notes that Qwen baselines are served with vLLM, and that WeDLM models are also served via vLLM to demonstrate seamless compatibility with industrial inference systems.
That makes the first recommendation easy. If you want to deploy WeDLM 8B, follow the path that keeps you close to vLLM-style serving and standard causal attention. That is the environment this method was designed for.
For everything else, assume friction.
GGUF and llama.cpp ecosystems are optimized around plain autoregressive decoding loops. Ollama has similar assumptions. MLX on Apple Silicon may load weights, but the real work is implementing the decode algorithm, not just loading tensors.
So here is the clean call to action: grab WeDLM 8B, run it on your own structured tasks, and publish the numbers. If it is faster, show the setup. If it is not, show the setup anyway. That feedback loop is how new decoding paradigms become engineering reality.
13. FAQ
What is the difference between an autoregressive and a diffusion LLM?
An autoregressive model generates text one token at a time, left to right, where each new token depends on the full prefix. A diffusion LLM starts with masked or noisy tokens and iteratively “denoises” them, often predicting many tokens in parallel. In practice, diffusion only wins when it can still reuse the prefix efficiently, which is why designs like WeDLM 8B focus on cache-friendly decoding.
Is ChatGPT an autoregressive model?
Yes. In general use, ChatGPT-style models generate text autoregressively, meaning they produce the next token based on the tokens already generated. That sequential structure is why KV cache is so important for speed at inference time.
How do you make LLM inference faster without losing quality?
You speed up inference by reducing wasted compute while keeping the same model behavior. The highest-impact levers are:
- KV cache (avoid recomputing attention for the whole prefix every token)
- Efficient attention kernels (FlashAttention-style implementations)
- Batching and continuous batching (keep the GPU busy)
- Smarter decoding (accept more tokens per forward only when confidence supports it)
- Quantization (if accuracy holds for your workload)
The trick is measuring end-to-end latency, not just tokens/sec in a cherry-picked setup.
How do you measure LLM inference speed, latency, and throughput?
Use three metrics, measured under realistic prompts and output lengths:
- Latency (ms): time to first token, and time to finish the full response
- Throughput (tokens/sec): steady-state generation rate over longer outputs
- Cost per output: GPU time or total FLOPs per completed answer (useful for comparisons)
For a fair LLM latency benchmark, lock the hardware, context length, batch policy, decoding settings, and serving engine, then report averages over many prompts.
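A minimal harness for those three numbers, written against a generic token-streaming callable. The interface is hypothetical; adapt it to whatever streaming API your engine actually exposes.

```python
import time

def measure(stream_fn, prompt):
    """Return time-to-first-token, total latency, and steady-state
    tokens/sec for any callable that yields tokens one at a time."""
    t0 = time.perf_counter()
    ttft, n = None, 0
    for _ in stream_fn(prompt):
        n += 1
        if ttft is None:
            ttft = time.perf_counter() - t0
    total = time.perf_counter() - t0
    rate = (n - 1) / (total - ttft) if n > 1 and total > ttft else float("nan")
    return {"ttft_s": ttft, "total_s": total, "tok_per_s": rate}

def fake_stream(prompt):
    """Stand-in for a real engine: emits one token per word, slowly."""
    for word in prompt.split():
        time.sleep(0.01)
        yield word

stats = measure(fake_stream, "four words of output")
print(stats["ttft_s"] < stats["total_s"])  # later tokens add latency
```

Run the same harness over many prompts and report averages; a single prompt tells you almost nothing about either latency or throughput.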
What is throughput in LLM inference, and why does KV cache matter?
Throughput is how many tokens you can generate per second once generation is underway. KV cache matters because, without it, each new token forces the model to recompute attention over the entire prompt-plus-generated prefix. With KV cache, the model reuses prior key/value states and only computes what’s needed for the newest token window, which is often the difference between “feels instant” and “feels laggy.”
