
Commit 30d0b9b

Update deterministic blog (#203)
1 parent 8a240ac commit 30d0b9b

File tree

1 file changed: +12 -4 lines changed


blog/2025-09-22-sglang-deterministic.md

Lines changed: 12 additions & 4 deletions
@@ -26,7 +26,7 @@ Key enhancements include:
 - **Implementation of batch-invariant attention kernels** with fixed split-KV size. Multiple backends are supported, including FlashInfer, FlashAttention 3, and Triton.
 - **Full compatibility with common inference features**, such as chunked prefill, CUDA graph, and radix cache, all of which remain supported when deterministic inference is enabled.
 - **Expose a per-request seed** in sampling arguments, allowing users to enable deterministic inference even when temperature > 0.
-- **Better performance**: Compared to the **61.5%** slowdown reported in TML’s blog, SGLang achieves an average slowdown of only **34.35%** with the FlashInfer and FlashAttention 3 backends.
+- **Better performance**: Compared to the **61.5%** slowdown reported in TML’s blog, SGLang achieves an average slowdown of only **34.35%** with the FlashInfer and FlashAttention 3 backends. With CUDA graphs, a 2.8x speedup can be achieved compared to the minimal integration.


 ## Results
@@ -77,7 +77,7 @@ CUDA graphs can accelerate the inference process by consolidating multiple kerne

 We measured end-to-end latency for both non-deterministic and deterministic modes using three common RL rollout workloads (256 requests with varying input/output lengths).

-Deterministic inference is generally usable, with most slowdowns ranging from 25% to 45%. The majority of this overhead comes from unoptimized batch-invariant kernels (matrix multiplication and attention), indicating significant room for performance improvements
+Deterministic inference is generally usable, with most slowdowns ranging from 25% to 45% and an average slowdown of 34.35% for the FlashInfer and FlashAttention 3 backends. The majority of this overhead comes from unoptimized batch-invariant kernels (matrix multiplication and attention), indicating significant room for performance improvements.

 | Attention Backend | Mode | Input 1024 / Output 1024 | Input 4096 / Output 4096 | Input 8192 / Output 8192 |
 | --- | --- | --- | --- | --- |
@@ -136,9 +136,16 @@ As illustrated in the figure, consider two input sequences, `seq_a` and `seq_b`,

 The standard chunking strategy operates on a "best-effort" principle. In this example, the strategy tries to generate a `chunk_1` of 8,192 tokens by splitting the `b2` unit of `seq_b` into two smaller parts. This can cause inconsistent truncation points, since the length of `b2` after splitting depends on the length of `seq_a`. To address this, we adapted the chunking logic to **align the truncation point with an integer multiple of the split_kv_size**. This adjustment ensures that the processing of `b2` is deferred to a subsequent chunk, allowing it to be computed as a complete unit by the attention kernel.
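To make the alignment rule above concrete, here is a small illustrative helper; the names (`remaining_tokens`, `chunk_budget`, `split_kv_size`) are assumptions for this sketch and do not mirror SGLang's actual scheduler code.

```python
# Illustrative sketch: align a chunked-prefill truncation point to a multiple of
# split_kv_size. Hypothetical helper, not SGLang's scheduler logic.
def aligned_take(remaining_tokens: int, chunk_budget: int, split_kv_size: int) -> int:
    """How many tokens of the current sequence go into this prefill chunk."""
    take = min(remaining_tokens, chunk_budget)
    if take < remaining_tokens:
        # The sequence would be truncated mid-way: round the truncation point down
        # to a multiple of split_kv_size, so the split boundaries seen by the
        # attention kernel do not depend on how much budget co-batched sequences used.
        take = (take // split_kv_size) * split_kv_size
    return take

# Example: 6000 tokens of seq_b remain, 4500 tokens of chunk budget are left, and
# split_kv_size = 2048 -> contribute 4096 tokens now and defer the rest; if the
# aligned amount were 0, the whole unit would be deferred to the next chunk.
print(aligned_take(6000, 4500, 2048))  # 4096
```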

+### Attention Backends
+
+The attention kernel is an important part of determinism. We modified each attention backend in a different way to satisfy its usage requirements; a toy sketch of the fixed split-KV idea follows this list.
+- For the FlashInfer backend, we utilize the `fixed_split_size` and `disable_kv_split` arguments from the [batch-invariant FA2 kernels](https://github.com/flashinfer-ai/flashinfer/pull/1675) to fix split sizes during kernel planning. Truncation of chunked prefill is aligned to the prefill split size. ([PR link](https://github.com/sgl-project/sglang/pull/10645))
+- For the FlashAttention 3 backend, the number of splits of the flash attention kernel is fixed to 1 to ensure determinism. ([PR link](https://github.com/sgl-project/sglang/pull/10651))
+- For the Triton backend, we fix the split size for decoding and manually set the alignment size of chunked prefill. ([PR link](https://github.com/sgl-project/sglang/pull/10694))
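To make the role of a fixed split-KV size concrete, below is a toy, backend-agnostic sketch of split-KV attention for a single query. Because the KV length is reduced in fixed-size chunks combined in a fixed order, the floating-point accumulation pattern depends only on the sequence length, not on batch size. This is a conceptual illustration, not any backend's kernel.

```python
import numpy as np

def split_kv_attention(q, K, V, split_size=256):
    """Toy split-KV attention for one query; a fixed split_size keeps the
    reduction order independent of how many requests are batched together."""
    m_run, l_run = -np.inf, 0.0                   # running max and softmax normalizer
    acc = np.zeros(V.shape[1], dtype=np.float64)  # running weighted sum of values
    for start in range(0, K.shape[0], split_size):  # fixed chunking of the KV length
        scores = K[start:start + split_size] @ q / np.sqrt(q.shape[0])
        m_new = max(m_run, scores.max())
        scale_old = np.exp(m_run - m_new)         # rescale previous partial results
        p = np.exp(scores - m_new)
        acc = acc * scale_old + p @ V[start:start + split_size]
        l_run = l_run * scale_old + p.sum()
        m_run = m_new
    return acc / l_run

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=64), rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))
print(split_kv_attention(q, K, V)[:4])
```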
+

 ### Reproducible Non-Greedy Sampling
-To extend determinism beyond greedy decoding, we introduce a new sampling function: `multinomial_with_seed`.
+To extend determinism beyond greedy decoding, we introduce a new sampling function: [multinomial_with_seed](https://github.com/sgl-project/sglang/blob/fb1e8acd2954b6267c73a199427976d89887ff0e/python/sglang/srt/layers/sampler.py#L263).

 Instead of relying on `torch.multinomial`, which is inherently nondeterministic under batching, this operator perturbs logits with Gumbel noise generated from a **seeded hash function**. As a result, the same `(inputs, seed)` pair always yields the same sample, even when temperature > 0.
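As a rough picture of the seeded-Gumbel idea, here is a self-contained sketch; the integer hash below is made up for illustration, and the real `multinomial_with_seed` linked above differs in its details. Logits are assumed to be already temperature-scaled.

```python
import torch

def sample_with_seed(logits: torch.Tensor, seeds: torch.Tensor) -> torch.Tensor:
    """logits: [batch, vocab] (already divided by temperature); seeds: [batch]."""
    batch, vocab = logits.shape
    token_ids = torch.arange(vocab, device=logits.device)
    # Derive per-element uniforms from (seed, token_id) with a fixed toy hash, so
    # the noise depends only on the request's seed, never on batch composition.
    mixed = seeds[:, None].to(torch.int64) * 2654435761 + token_ids[None, :] * 40503
    mixed = torch.bitwise_xor(mixed, mixed >> 16)
    u = (mixed & 0xFFFFFF).double() / float(1 << 24)   # uniforms in [0, 1)
    u = u.clamp(min=1e-12, max=1.0 - 1e-12)
    gumbel = -torch.log(-torch.log(u))                  # Gumbel(0, 1) noise
    # Gumbel-max trick: argmax of (logits + Gumbel) samples from softmax(logits).
    return torch.argmax(logits.double() + gumbel, dim=-1)

logits = torch.randn(2, 16)
seeds = torch.tensor([7, 7])
print(sample_with_seed(logits, seeds))   # same (logits, seed) -> same token ids,
print(sample_with_seed(logits, seeds))   # regardless of how requests are batched
```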

@@ -159,7 +166,8 @@ Our future efforts will focus on enhancing deterministic inference by addressing
 - **True On-Policy RL**: We plan to further integrate deterministic inference into reinforcement learning frameworks (e.g., [slime](https://github.com/THUDM/slime)) to enable reproducible sampling, with the ultimate goal of achieving true on-policy training.
 - **Enhancing Radix Cache Functionality**: We will improve the radix tree to enable compatibility with a wider variety of attention kernels, moving beyond the current limitation to the FlashAttention 3 backend.
 - **Tensor Parallelism**: TP1 and TP2 are deterministic due to a consistent floating-point addition order; larger TP setups require modifications to reduction kernels for determinism.
-- A roadmap for deterministic inference features can be found in [this issue](https://github.com/sgl-project/sglang/issues/10278).
+- **FlexAttention Integration**: Besides the currently supported attention backends, we plan to extend our support of deterministic inference to FlexAttention in the future.
+- A **roadmap** for deterministic inference features can be found in [this issue](https://github.com/sgl-project/sglang/issues/10278).

 ## Acknowledgement
 We would like to extend our heartfelt gratitude to the following teams and collaborators:
