---
title: "Power Up Diffusion LLMs: Day‑0 Support for LLaDA 2.0"
author: "Ant Group DeepXPU Team, SGLang Team"
date: "December 19, 2025"
previewImg: /images/blog/dllm/preview.png
---

## TL;DR

We are excited to introduce the design and implementation of the Diffusion Large Language Model (dLLM) framework within SGLang. By leveraging the existing Chunked-Prefill mechanism, our system achieves:

- Seamless integration: Built into the SGLang ecosystem without core architectural changes.
- Inherited performance: The framework benefits from SGLang's existing inference optimizations.
- Maximum flexibility: Full flexibility for users to define and customize diffusion decoding algorithms.

## Background

### Motivation

Earlier this year, [LLaDA](https://arxiv.org/pdf/2502.09992) made its debut as the first Diffusion Large Language Model, immediately capturing significant attention from both the academic and industrial communities. This achievement, a collaboration between Renmin University of China and Ant Group, demonstrated that the unique execution paradigm of dLLMs exhibits superior data comprehension capabilities. Moreover, dLLMs enable faster inference speeds compared to Auto-Regressive models, especially in low-latency scenarios such as small batch sizes.

At the same time, as the parameter scale of dLLMs continues to grow, we have also observed scaling-law effects similar to those seen in AR LLMs. In pursuit of better dLLMs, we trained the 100B [LLaDA2.0-flash](https://github.com/inclusionAI/LLaDA2.0/blob/main/tech_report.pdf) model.

However, while training [LLaDA2.0-flash](https://github.com/inclusionAI/LLaDA2.0/blob/main/tech_report.pdf), we encountered a series of serious AI infrastructure engineering challenges. The most important ones are the efficiency and stability of model evaluation and RL post-training.

### Challenges

The inference engines currently available for dLLMs are insufficient to support the evaluation and RL post-training requirements of larger-scale dLLMs. For instance, frameworks like [Fast-dLLM](https://github.com/NVlabs/Fast-dLLM) are excellent research tools, well suited for algorithm researchers to tune and validate various diffusion decoding algorithms. However, they fall short in providing production-ready serving capabilities, such as batching, scheduling, RL ecosystem integration, and parallelism.

In contrast, SGLang is one of the most popular LLM inference engines today and has multiple advantages:

1. Production-ready: It has been deployed in inference services across thousands of companies, offering mature and reliable engineering capabilities.
2. Technological lead: SGLang incorporates a wide array of advanced inference optimization techniques, with a continuous flow of new optimizations emerging from the community.
3. Complete ecosystem: It integrates extremely well with the RL post-training ecosystem, particularly in areas like distributed GPU P2P weight updates.

However, the core issue is that SGLang currently only supports the auto-regressive computation paradigm and has not yet been adapted to the diffusion computation pattern of LLMs.

Therefore, the challenge we face is: how can we introduce support for dLLMs within the existing SGLang framework without compromising its current architecture? The goal is two-fold: allow dLLMs to benefit from all the optimization advantages SGLang offers, while avoiding invasive modifications to the SGLang framework just to accommodate diffusion computation.

## Design

### Key Insights

Based on our observations of current developments in dLLMs, we have identified several key insights:

1. Due to the enormous computational cost of bidirectional-attention diffusion and its inefficient utilization of the KV cache, mainstream dLLMs are increasingly moving toward the Block Diffusion architecture.
2. The computation pattern of Block Diffusion bears a high degree of similarity to SGLang's existing Chunked-Prefill process.
3. Unlike auto-regressive language models, diffusion language models utilize various decoding strategies, which require a dedicated interface for flexible decoding algorithm customization.

### Architecture

Our approach is to leverage SGLang's existing Chunked-Prefill pipeline to implement computational support for Block Diffusion LLMs. This method allows us to seamlessly integrate dLLMs into the SGLang ecosystem without changing the core SGLang framework, enabling dLLMs to directly benefit from all the inference optimization techniques SGLang has accumulated.

<p align="center">
<img src="/images/blog/dllm/main-flow.png" alt="main execution flow">
<br>
</p>

As illustrated in the diagram, our modifications to the SGLang framework are very restrained, barely touching its core. SGLang's original `generate request` execution flow remains unchanged. Our implementation primarily focuses on leveraging and modifying the existing Chunked-Prefill mechanism, with the specific work concentrated on two critical components: the `prefill adder` and `chunked reqs`.

In SGLang, the initial purpose of Chunked Prefill was to maximize GPU utilization. Consequently, the size of a single chunk is typically set quite large, ranging from 2K to 16K tokens in sequence length, depending on the GPU model. When the sequence is long enough, a chunk naturally contains only one request, which is how the current `prefill adder` and `chunked req` are implemented.

However, the decoding process for dLLMs differs: it segments the sequence at the block level. Taking LLaDA2.0 as an example, its block size is 32 tokens. If we were to follow SGLang's previous logic of processing only one large request at a time, GPU capacity would clearly be wasted. Therefore, batching is a crucial problem that must be solved. To achieve efficient batching, we modified both `chunked reqs` and the `prefill adder` to enable them to process multiple diffusion blocks within a single computation cycle.

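As a concrete illustration of the idea (not SGLang's actual `prefill adder` code), the toy sketch below packs one 32-token block per running request into a single forward batch under a token budget; `DiffusionReq` and `pack_block_batch` are hypothetical names introduced only for this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class DiffusionReq:
    """A toy running request: prompt already in the KV cache, one active block being decoded."""
    rid: int
    block: list[int] = field(default_factory=list)  # token ids of the current block (masked + decoded)

def pack_block_batch(running_reqs: list[DiffusionReq], token_budget: int, block_size: int = 32):
    """Collect one active block per request until the token budget is exhausted.

    Returns the requests chosen for this computation cycle plus a flat list of
    (request id, position-in-block) pairs that a model runner could batch together.
    """
    picked, flat = [], []
    used = 0
    for req in running_reqs:
        if used + block_size > token_budget:
            break  # leave the remaining requests for the next cycle
        picked.append(req)
        flat.extend((req.rid, pos) for pos in range(block_size))
        used += block_size
    return picked, flat

# Example: an 8K-token budget can batch many 32-token blocks in one forward pass.
reqs = [DiffusionReq(rid=i, block=[0] * 32) for i in range(4)]
picked, flat = pack_block_batch(reqs, token_budget=8192)
print(len(picked), len(flat))  # 4 requests, 128 token slots
```
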
Furthermore, at the actual decoding execution level, we inserted an abstraction layer for the diffusion algorithm between the TP Worker and the Model Runner.

Specifically:

- If the Worker identifies that it is handling a diffusion model, the execution flow enters this dedicated branch.
- The TP Worker then calls the diffusion algorithm's `run` function.
- Internally, this algorithm uses a forward iteration loop to continuously drive the Model Runner to perform inference computations until the entire block (e.g., all 32 tokens) is decoded (a minimal sketch follows below).

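To make this concrete, here is a minimal sketch of the kind of `run` loop such an algorithm can implement, using threshold-based low-confidence unmasking. It is only an illustration under assumptions: `model_forward`, `MASK_ID`, the 1-D block tensor, and the 0.95 threshold are placeholders, not SGLang's actual `LowConfidence` implementation or interface.

```python
import torch

MASK_ID = 156895  # hypothetical mask-token id; the real value comes from the tokenizer

def run_block(model_forward, block_ids: torch.Tensor, threshold: float = 0.95, max_steps: int = 32):
    """Iteratively decode one diffusion block.

    model_forward(block_ids) -> logits of shape [block_len, vocab_size] for the
    current block, conditioned on the cached context.
    Each step, masked tokens whose predicted confidence exceeds `threshold` are
    committed; the rest stay masked for the next iteration. At least one token is
    committed per step, so the loop always terminates within `max_steps`.
    """
    block_ids = block_ids.clone()
    for _ in range(max_steps):
        masked = block_ids == MASK_ID
        if not masked.any():
            break  # the whole block is decoded
        logits = model_forward(block_ids)
        probs = torch.softmax(logits.float(), dim=-1)
        conf, pred = probs.max(dim=-1)

        # Commit high-confidence predictions at masked positions only.
        accept = masked & (conf >= threshold)
        if not accept.any():
            # Fall back to committing the single most confident masked token.
            idx = torch.where(masked, conf, torch.full_like(conf, -1.0)).argmax()
            accept = torch.zeros_like(masked)
            accept[idx] = True
        block_ids = torch.where(accept, pred, block_ids)
    return block_ids
```

The acceptance threshold here mirrors the 0.95-threshold decoder discussed later in the Performance section; lowering it commits more tokens per forward pass, trading quality for speed, and the exact policy is what `--dllm-algorithm` lets users customize.
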
### Attention Mask

<p align="center">
<img src="/images/blog/dllm/casual-mask.png" alt="block-wise causal mask vs. token-wise causal mask">
<br>
</p>

The most significant difference between Block Diffusion and Chunked Prefill during a single model forward pass lies in the handling of the attention mask.

- Block Diffusion utilizes a block-wise causal mask.
- Chunked Prefill for AR models uses the traditional token-wise causal mask.

We can view Block Diffusion as a functional extension of the existing Chunked Prefill mechanism within SGLang. Regarding the attention calculation itself, a single forward pass involves two computational parts, whose outputs are concatenated:

1. Context Query: This uses the current `Q_curr` (the query vectors of the current block) to perform full, unmasked attention against the existing KV cache. This computation is completely identical for Block Diffusion and Chunked Prefill. The objective here is to ensure the current block attends to all historical information.
2. Intra-Block Query: This uses the current `Q_curr` against its own KV (i.e., the keys and values within the current block) to perform the forward calculation.
   - Block Diffusion employs bidirectional attention in this step.
   - Chunked Prefill must use a causal mask in this step.

Simply put, if we visualize the attention mask as a geometric shape for the `Q_curr` portion (see the sketch below):

- The calculation for Chunked Prefill (causal mask) corresponds to a trapezoidal (or triangular) mask.
- The calculation for Block Diffusion (bidirectional attention) corresponds to a rectangular mask.

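The two shapes can be written down directly. Below is a small PyTorch sketch that builds the boolean masks for the `Q_curr` rows, purely for illustration (the sizes are arbitrary, and production attention kernels do not materialize dense masks like this).

```python
import torch

def chunked_prefill_mask(ctx_len: int, cur_len: int) -> torch.Tensor:
    """Token-wise causal mask: each query attends to the full context plus the
    earlier positions of its own chunk (a trapezoid)."""
    ctx = torch.ones(cur_len, ctx_len, dtype=torch.bool)                # full context
    intra = torch.tril(torch.ones(cur_len, cur_len, dtype=torch.bool))  # causal inside the chunk
    return torch.cat([ctx, intra], dim=1)

def block_diffusion_mask(ctx_len: int, cur_len: int) -> torch.Tensor:
    """Block-wise causal mask: the current block attends to the full context and
    bidirectionally to itself (a rectangle)."""
    ctx = torch.ones(cur_len, ctx_len, dtype=torch.bool)
    intra = torch.ones(cur_len, cur_len, dtype=torch.bool)              # bidirectional inside the block
    return torch.cat([ctx, intra], dim=1)

# Example: 4 cached context tokens, a current chunk/block of 3 tokens.
print(chunked_prefill_mask(4, 3).int())
print(block_diffusion_mask(4, 3).int())
```
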
## Streaming Output Animation

Here is an animation comparing the streaming output of LLaDA2.0-flash (100B / BF16) and gpt-oss-120B (117B / MXFP4). LLaDA2.0-flash is served using SGLang dLLM with TP8 on 8 × H20 GPUs, while gpt-oss-120B is served using SGLang's standard AR pipeline on the same hardware.

Both models are asked to implement the quicksort algorithm in 10 programming languages, a task particularly well-suited for diffusion LLMs. As shown, LLaDA2.0-flash achieves significantly higher throughput at 935 tokens/s, compared to 263 tokens/s for gpt-oss-120B in this scenario.

<p align="center">
<img src="/images/blog/dllm/llada2-vs-gpt-oss.gif" alt="LLaDA2.0-flash vs gpt-oss-120B animation">
<br>
</p>

SGLang dLLM supports streaming output just like SGLang's auto-regressive models, but it emits one block (e.g., 32 tokens) at a time instead of one token. A minimal streaming-client sketch follows the animation below.

<p align="center">
<img src="/images/blog/dllm/dllm-animation.gif" alt="SGLang dLLM streaming output animation">
<br>
</p>

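For reference, here is a minimal streaming client sketch against the `/generate` endpoint used in the next section. It assumes a server is already running on port 30000 and that the server streams SSE-style `data: ...` JSON lines whose `text` field holds the cumulative output so far; treat these details as assumptions to verify against your SGLang version.

```python
import json
import requests

# Prompt taken from the curl example below.
prompt = ("<role>SYSTEM</role>detailed thinking off<|role_end|><role>HUMAN</role>"
          "Write a brief introduction of the great wall<|role_end|><role>ASSISTANT</role>")

resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": prompt,
        "stream": True,
        "sampling_params": {"temperature": 0, "max_new_tokens": 256},
    },
    stream=True,
)

printed = 0
for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data:"):
        continue
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        break
    chunk = json.loads(payload)
    text = chunk.get("text", "")
    # Assumed cumulative text; with dLLM it grows block by block rather than token by token.
    print(text[printed:], end="", flush=True)
    printed = len(text)
print()
```
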
## How to Use

### Example Launch Command

```shell
# --model-path accepts a Hugging Face repo id or a local path.
# --dllm-algorithm-config is optional; the algorithm's defaults are used if it is not set.
python3 -m sglang.launch_server \
  --model-path inclusionAI/LLaDA2.0-mini \
  --dllm-algorithm LowConfidence \
  --dllm-algorithm-config ./config.yaml \
  --host 0.0.0.0 \
  --port 30000
```

> NOTE: Use `--dllm-algorithm-config` for advanced configuration of the selected `--dllm-algorithm`. This feature decouples configuration from code, enabling flexible customization and argument passing for user-defined algorithms via a unified entry point.

### Example Client Code Snippet

Just like other supported models, dLLMs can be used via the REST API or the offline engine API.

Curl example for making a generation request to the running server:

```bash
curl -X POST "http://127.0.0.1:30000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "text": [
      "<role>SYSTEM</role>detailed thinking off<|role_end|><role>HUMAN</role>Write the number from 1 to 128<|role_end|><role>ASSISTANT</role>",
      "<role>SYSTEM</role>detailed thinking off<|role_end|><role>HUMAN</role>Write a brief introduction of the great wall<|role_end|><role>ASSISTANT</role>"
    ],
    "stream": true,
    "sampling_params": {
      "temperature": 0,
      "max_new_tokens": 1024
    }
  }'
```

The following code snippet illustrates how to use the offline engine to generate content from given inputs:

```python
import sglang as sgl

def main():
    llm = sgl.Engine(model_path="inclusionAI/LLaDA2.0-mini",
                     dllm_algorithm="LowConfidence",
                     max_running_requests=1,
                     trust_remote_code=True)

    prompts = [
        "<role>SYSTEM</role>detailed thinking off<|role_end|><role>HUMAN</role>Write a brief introduction of the great wall<|role_end|><role>ASSISTANT</role>"
    ]

    sampling_params = {
        "temperature": 0,
        "max_new_tokens": 1024,
    }

    outputs = llm.generate(prompts, sampling_params)
    print(outputs)

if __name__ == '__main__':
    main()
```

## Performance

<p align="center">
<img src="/images/blog/dllm/llada2_flash_main_bench.png" alt="LLaDA2.0-flash main results">
<br>
</p>

We assessed the task efficacy of LLaDA2.0-flash by benchmarking it against advanced Auto-Regressive (AR) models of a comparable scale on a wide range of standard evaluation tasks.

The overall results indicate that the LLaDA2.0 architecture is not only highly competitive, but also shows a promising trend of closing the capability gap with AR models.

<p align="center">
<img src="/images/blog/dllm/llada2_despine_comparison.png" alt="LLaDA2.0-flash performance">
<br>
</p>

The chart presents two complementary measurements for LLaDA2.0-flash:

- Average score and tokens-per-forward (TPF) obtained with and without Confidence-Aware Parallel (CAP) training across 12 benchmark tasks.
- Inference speed (tokens per second) of LLaDA2.0-flash, benchmarked against AR models of comparable size on the HumanEval, MBPP, GSM8K, and CRUXEval suites.

All numbers are collected under a consistent serving environment (SGLang with TP8 on H20), ensuring a fair comparison between the diffusion LLM and the auto-regressive baselines.

With a 0.95-threshold decoder, LLaDA2.0-flash-CAP achieved 500 TPS, significantly outperforming the standard LLaDA2.0-flash (383 TPS) and delivering up to a 1.9× speedup over the AR baselines (258 TPS and 237 TPS) at small batch sizes.

## Roadmap

### Implemented Key Features

The current implementation supports the following critical serving features:

- Block Diffusion LLM framework main logic
- Full KV cache support for sequence management
- Model integration for LLaDA2.0-mini/flash
- Support for custom decoding algorithms
- Full streaming I/O capability
- Batching support (under review)
- Tensor parallelism support
- CUDA graph optimization

### Mid- and Long-Term Roadmap

[Roadmap for 2025-Q4 and 2026-Q1](https://github.com/sgl-project/sglang/issues/14199)<br>
[RFC: Block Diffusion Large Language Model (dLLM) Framework In SGLang](https://github.com/sgl-project/sglang/issues/12766)<br>

- Support more system optimizations that auto-regressive language models already have
- Integrate additional common diffusion decoding strategies/algorithms (e.g., [Fast-dLLM v2](https://arxiv.org/pdf/2509.26328))
- Add compatibility for non-block dLLMs (e.g., LLaDA & RND1)

## References

[LLaDA technical report](https://arxiv.org/pdf/2502.09992)<br>
[LLaDA2.0 technical report](https://github.com/inclusionAI/LLaDA2.0/blob/main/tech_report.pdf)<br>
[Fast-dLLM v2 technical report](https://arxiv.org/pdf/2509.26328)

## Acknowledgements

- Ant Group DeepXPU Team: [Zehuan Li](https://github.com/Clawseven), [Tiwei Bie](https://github.com/btw616), Zhonghui Jiang, Jinghua Yao, Yusong Gao, [Mingliang Gong](https://github.com/brightcoder01), Jianfeng Tan
- Ant Group inclusionAI Team: Kun Chen, [Zenan Huang](https://lccurious.github.io/), Lin Liu, Fuyuan Chen, Lun Du, Da Zheng
- SGLang dLLM Team: [Jinwei Yao](https://kivi-yao.github.io/), [Mick Qian](https://github.com/mickqian), [Liangsheng Yin](https://www.lsyin.me/), [BBuf](https://github.com/BBuf), Banghua Zhu, [Chenyang Zhao](https://zhaochenyang20.github.io/Chayenne/)
- NVIDIA Fast-dLLM Team: [Chengyue Wu](https://hills-code.github.io/), [Hao Zhang](https://research.nvidia.com/person/hao-zhang), [Enze Xie](https://xieenze.github.io/), [Song Han](https://hanlab.mit.edu/songhan)
