blog/2025-09-22-sglang-deterministic.md (12 additions, 4 deletions)
@@ -26,7 +26,7 @@ Key enhancements include:
- **Implementation of batch-invariant attention kernels** with a fixed split-KV size. Multiple backends are supported, including FlashInfer, FlashAttention 3, and Triton.
- **Full compatibility with common inference features**, such as chunked prefill, CUDA graph, and radix cache, all of which remain supported when deterministic inference is enabled.
- **Exposure of a per-request seed** in the sampling arguments, allowing users to enable deterministic inference even when temperature > 0 (a usage sketch follows this list).
- **Better performance**: Compared to the **61.5%** slowdown reported in TML's blog, SGLang achieves an average slowdown of only **34.35%** with the FlashInfer and FlashAttention 3 backends. With CUDA graphs, a 2.8x speedup can be achieved compared to the minimal integration.
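To make the per-request seed concrete, here is a minimal, hedged sketch of sending the same seeded request twice to a locally running SGLang server. The `/generate` endpoint and the `text`/`sampling_params` payload follow SGLang's native HTTP API, but the exact name of the seed field (`seed` below) and the port are assumptions for illustration; check the server's sampling parameters for the real name.

```python
import requests

# Minimal sketch: issue the same seeded, non-greedy request twice.
payload = {
    "text": "Explain batch invariance in one sentence.",
    "sampling_params": {
        "temperature": 0.7,   # non-greedy sampling
        "max_new_tokens": 64,
        "seed": 42,           # assumed field name for the per-request seed
    },
}

resp_a = requests.post("http://localhost:30000/generate", json=payload).json()
resp_b = requests.post("http://localhost:30000/generate", json=payload).json()

# With deterministic inference enabled, the same (inputs, seed) pair should
# yield identical completions even at temperature > 0.
assert resp_a["text"] == resp_b["text"]
```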
## Results
@@ -77,7 +77,7 @@ CUDA graphs can accelerate the inference process by consolidating multiple kerne
We measured end-to-end latency for both non-deterministic and deterministic modes using three common RL rollout workloads (256 requests with varying input/output lengths).
Deterministic inference is generally usable, with most slowdowns ranging from 25% to 45% and an average slowdown of 34.35% across the FlashInfer and FlashAttention 3 backends. The majority of this overhead comes from unoptimized batch-invariant kernels (matrix multiplication and attention), indicating significant room for performance improvement.
@@ -136,9 +136,16 @@ As illustrated in the figure, consider two input sequences, `seq_a` and `seq_b`,
The standard chunking strategy operates on a "best-effort" principle. In this example, it tries to build a `chunk_1` of 8,192 tokens by splitting the `b2` unit of `seq_b` into two smaller parts. This can cause inconsistent truncation points, since the length of `b2` after splitting depends on the length of `seq_a`. To address this, we adapted the chunking logic to **align the truncation point with an integer multiple of the split_kv_size**. This adjustment defers the processing of `b2` to a subsequent chunk, allowing it to be computed as a complete unit by the attention kernel.
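As a minimal sketch of this alignment rule (not the SGLang scheduler itself; the function and variable names are illustrative), the remaining chunk budget for a sequence is rounded down to a multiple of `split_kv_size`, so any unit that would only partially fit is deferred to the next chunk:

```python
def aligned_chunk_len(budget_tokens: int, already_scheduled: int, split_kv_size: int) -> int:
    """Return how many tokens of the current sequence to place in this prefill chunk.

    Instead of filling the chunk budget exactly ("best effort"), round the
    truncation point down to an integer multiple of split_kv_size so the
    attention kernel always sees complete split-KV units, independent of how
    many tokens other sequences contributed to the chunk.
    """
    remaining_budget = budget_tokens - already_scheduled
    if remaining_budget <= 0:
        return 0
    # Round down to the nearest multiple of the split-KV size; leftover tokens
    # are deferred and processed as a complete unit in the next chunk.
    return (remaining_budget // split_kv_size) * split_kv_size


# Example: with an 8,192-token chunk budget, 5,000 tokens already scheduled for
# seq_a, and split_kv_size = 2,048, seq_b contributes 2,048 tokens now (not 3,192).
print(aligned_chunk_len(budget_tokens=8192, already_scheduled=5000, split_kv_size=2048))  # -> 2048
```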
### Attention Backends
The attention kernel is a key part of determinism. Each attention backend required different modifications to satisfy its usage requirements (a batch-invariance sketch follows the list below).
- For the FlashInfer backend, we use the `fixed_split_size` and `disable_kv_split` arguments from the [batch-invariant FA2 kernels](https://github.com/flashinfer-ai/flashinfer/pull/1675) to fix split sizes during kernel planning; truncation of chunked prefill is aligned to the prefill split size. ([PR link](https://github.com/sgl-project/sglang/pull/10645))
- For the FlashAttention 3 backend, the number of splits of the flash attention kernel is fixed to 1 to ensure determinism. ([PR link](https://github.com/sgl-project/sglang/pull/10651))
- For the Triton backend, we fix the split size for decoding and manually set the alignment size for chunked prefill. ([PR link](https://github.com/sgl-project/sglang/pull/10694))
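For intuition on why a fixed split-KV size yields batch invariance, here is a hedged NumPy sketch (not any backend's actual kernel): the KV cache is reduced in fixed-size splits merged in a fixed order via an online softmax, so the floating-point reduction order, and therefore the output, does not depend on batch size or kernel scheduling.

```python
import numpy as np

def attention_fixed_splits(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                           split_kv_size: int = 256) -> np.ndarray:
    """Single-query attention over a KV cache, reduced in fixed-size KV splits.

    Because the split size is a constant (not derived from batch size or GPU
    occupancy), partial results are produced and merged in the same order for
    a given request, so the floating-point reduction order, and hence the
    output, stays the same no matter what else is in the batch.
    """
    d = q.shape[-1]
    m = -np.inf            # running max of attention logits (stable softmax)
    s = 0.0                # running sum of exp(logit - m)
    acc = np.zeros(d)      # running weighted sum of value vectors
    for start in range(0, k.shape[0], split_kv_size):
        k_blk = k[start:start + split_kv_size]
        v_blk = v[start:start + split_kv_size]
        logits = k_blk @ q / np.sqrt(d)
        m_new = max(m, float(logits.max()))
        scale = np.exp(m - m_new)          # rescale previously merged splits
        p = np.exp(logits - m_new)
        s = s * scale + p.sum()
        acc = acc * scale + p @ v_blk
        m = m_new
    return acc / s

rng = np.random.default_rng(0)
q = rng.normal(size=64)
k, v = rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))
out = attention_fixed_splits(q, k, v, split_kv_size=256)  # same result for any batch composition
```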
### Reproducible Non-Greedy Sampling
To extend determinism beyond greedy decoding, we introduce a new sampling function: [multinomial_with_seed](https://github.com/sgl-project/sglang/blob/fb1e8acd2954b6267c73a199427976d89887ff0e/python/sglang/srt/layers/sampler.py#L263).
Instead of relying on `torch.multinomial`, which is inherently nondeterministic under batching, this operator perturbs logits with Gumbel noise generated from a **seeded hash function**. As a result, the same `(inputs, seed)` pair always yields the same sample, even when temperature > 0.
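The idea can be sketched as follows (a hedged illustration, not the actual `multinomial_with_seed` code; the hash constants and mixing steps are made up for the example): uniform noise is derived from a counter-based hash of the seed, the decoding step, and the token id, and the Gumbel-max trick turns sampling into a deterministic argmax.

```python
import numpy as np

def multinomial_with_seed_sketch(logits: np.ndarray, seed: int, step: int,
                                 temperature: float = 1.0) -> int:
    """Sample one token id from `logits`, reproducibly for a given (seed, step).

    Instead of a stateful RNG (whose draws can depend on batching), uniform
    noise comes from hashing (seed, step, token_id); the Gumbel-max trick then
    turns sampling from softmax(logits / temperature) into an argmax.
    """
    vocab = logits.shape[-1]
    # Counter-based hash (illustrative constants): mix token id, seed, and step.
    x = np.arange(vocab, dtype=np.uint64) * np.uint64(0x9E3779B97F4A7C15)
    x ^= np.uint64(seed) ^ (np.uint64(step) << np.uint64(32))
    x ^= x >> np.uint64(33)
    x *= np.uint64(0xFF51AFD7ED558CCD)   # wraps modulo 2**64, as intended
    x ^= x >> np.uint64(29)
    # Map the top 52 hash bits to a uniform value strictly inside (0, 1).
    u = ((x >> np.uint64(12)).astype(np.float64) + 0.5) / 2.0**52
    gumbel = -np.log(-np.log(u))
    return int(np.argmax(logits / temperature + gumbel))

# The same (inputs, seed) pair always yields the same sample, even at temperature > 0.
logits = np.array([2.0, 1.0, 0.5, -1.0])
assert multinomial_with_seed_sketch(logits, seed=123, step=7, temperature=0.7) == \
       multinomial_with_seed_sketch(logits, seed=123, step=7, temperature=0.7)
```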
@@ -159,7 +166,8 @@ Our future efforts will focus on enhancing deterministic inference by addressing
- **True On-Policy RL**: We plan to further integrate deterministic inference into reinforcement learning frameworks (e.g., [slime](https://github.com/THUDM/slime)) to enable reproducible sampling, with the ultimate goal of achieving true on-policy training.
- **Enhancing Radix Cache Functionality**: We will improve the radix tree to enable compatibility with a wider variety of attention kernels, moving beyond the current limitation to the FlashAttention 3 backend.
- **Tensor Parallelism**: TP1 and TP2 are deterministic because their floating-point addition order is consistent; larger TP setups require modifications to the reduction kernels to achieve determinism.
- **FlexAttention Integration**: Beyond the currently supported attention backends, we plan to extend deterministic inference support to FlexAttention in the future.
- A **roadmap** for deterministic inference features can be found in [this issue](https://github.com/sgl-project/sglang/issues/10278).
## Acknowledgement
We would like to extend our heartfelt gratitude to the following teams and collaborators: