blog/2025-09-22-sglang-deterministic.md (12 additions, 4 deletions)
@@ -26,7 +26,7 @@ Key enhancements include:
- **Implementation of batch-invariant attention kernels** with a fixed split-KV size. Multiple backends are supported, including FlashInfer, FlashAttention 3, and Triton.
- **Full compatibility with common inference features**, such as chunked prefill, CUDA graph, and radix cache, all of which remain supported when deterministic inference is enabled.
- **Exposure of a per-request seed** in the sampling arguments, allowing users to enable deterministic inference even when temperature > 0 (a usage sketch follows this list).
- **Better performance**: Compared to the **61.5%** slowdown reported in TML's blog, SGLang achieves an average slowdown of only **34.35%** with the FlashInfer and FlashAttention 3 backends. With CUDA graphs, a 2.8x speedup can be achieved compared to the minimal integration.
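To make the per-request seed concrete, here is a minimal, hedged sketch of sending the same seeded request twice to a locally running SGLang server. The `/generate` endpoint and the `text`/`sampling_params` payload follow SGLang's native HTTP API, but the exact name of the seed field (`seed` below) and the port are assumptions for illustration; check the server's sampling parameters for the real name.

```python
import requests

# Minimal sketch: issue the same seeded, non-greedy request twice.
payload = {
    "text": "Explain batch invariance in one sentence.",
    "sampling_params": {
        "temperature": 0.7,   # non-greedy sampling
        "max_new_tokens": 64,
        "seed": 42,           # assumed field name for the per-request seed
    },
}

resp_a = requests.post("http://localhost:30000/generate", json=payload).json()
resp_b = requests.post("http://localhost:30000/generate", json=payload).json()

# With deterministic inference enabled, the same (inputs, seed) pair should
# yield identical completions even at temperature > 0.
assert resp_a["text"] == resp_b["text"]
```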
## Results
@@ -77,7 +77,7 @@ CUDA graphs can accelerate the inference process by consolidating multiple kerne
We measured end-to-end latency for both non-deterministic and deterministic modes using three common RL rollout workloads (256 requests with varying input/output lengths).
Deterministic inference is generally usable, with most slowdowns ranging from 25% to 45% and an average slowdown of 34.35% across the FlashInfer and FlashAttention 3 backends. The majority of this overhead comes from unoptimized batch-invariant kernels (matrix multiplication and attention), indicating significant room for performance improvement.
@@ -136,9 +136,16 @@ As illustrated in the figure, consider two input sequences, `seq_a` and `seq_b`,
The standard chunking strategy operates on a "best-effort" principle. In this example, it tries to build a `chunk_1` of 8,192 tokens by splitting the `b2` unit of `seq_b` into two smaller parts. This can cause inconsistent truncation points, since the length of `b2` after splitting depends on the length of `seq_a`. To address this, we adapted the chunking logic to **align the truncation point with an integer multiple of the split_kv_size**. This adjustment defers the processing of `b2` to a subsequent chunk, allowing it to be computed as a complete unit by the attention kernel.
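As a minimal sketch of this alignment rule (not the SGLang scheduler itself; the function and variable names are illustrative), the remaining chunk budget for a sequence is rounded down to a multiple of `split_kv_size`, so any unit that would only partially fit is deferred to the next chunk:

```python
def aligned_chunk_len(budget_tokens: int, already_scheduled: int, split_kv_size: int) -> int:
    """Return how many tokens of the current sequence to place in this prefill chunk.

    Instead of filling the chunk budget exactly ("best effort"), round the
    truncation point down to an integer multiple of split_kv_size so the
    attention kernel always sees complete split-KV units, independent of how
    many tokens other sequences contributed to the chunk.
    """
    remaining_budget = budget_tokens - already_scheduled
    if remaining_budget <= 0:
        return 0
    # Round down to the nearest multiple of the split-KV size; leftover tokens
    # are deferred and processed as a complete unit in the next chunk.
    return (remaining_budget // split_kv_size) * split_kv_size


# Example: with an 8,192-token chunk budget, 5,000 tokens already scheduled for
# seq_a, and split_kv_size = 2,048, seq_b contributes 2,048 tokens now (not 3,192).
print(aligned_chunk_len(budget_tokens=8192, already_scheduled=5000, split_kv_size=2048))  # -> 2048
```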
### Attention Backends
The attention kernel is a key part of determinism. Each attention backend required different modifications to satisfy its usage requirements (a batch-invariance sketch follows the list below).
- For the FlashInfer backend, we use the `fixed_split_size` and `disable_kv_split` arguments from the [batch-invariant FA2 kernels](https://github.com/flashinfer-ai/flashinfer/pull/1675) to fix split sizes during kernel planning; truncation of chunked prefill is aligned to the prefill split size. ([PR link](https://github.com/sgl-project/sglang/pull/10645))
- For the FlashAttention 3 backend, the number of splits of the flash attention kernel is fixed to 1 to ensure determinism. ([PR link](https://github.com/sgl-project/sglang/pull/10651))
- For the Triton backend, we fix the split size for decoding and manually set the alignment size for chunked prefill. ([PR link](https://github.com/sgl-project/sglang/pull/10694))
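For intuition on why a fixed split-KV size yields batch invariance, here is a hedged NumPy sketch (not any backend's actual kernel): the KV cache is reduced in fixed-size splits merged in a fixed order via an online softmax, so the floating-point reduction order, and therefore the output, does not depend on batch size or kernel scheduling.

```python
import numpy as np

def attention_fixed_splits(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                           split_kv_size: int = 256) -> np.ndarray:
    """Single-query attention over a KV cache, reduced in fixed-size KV splits.

    Because the split size is a constant (not derived from batch size or GPU
    occupancy), partial results are produced and merged in the same order for
    a given request, so the floating-point reduction order, and hence the
    output, stays the same no matter what else is in the batch.
    """
    d = q.shape[-1]
    m = -np.inf            # running max of attention logits (stable softmax)
    s = 0.0                # running sum of exp(logit - m)
    acc = np.zeros(d)      # running weighted sum of value vectors
    for start in range(0, k.shape[0], split_kv_size):
        k_blk = k[start:start + split_kv_size]
        v_blk = v[start:start + split_kv_size]
        logits = k_blk @ q / np.sqrt(d)
        m_new = max(m, float(logits.max()))
        scale = np.exp(m - m_new)          # rescale previously merged splits
        p = np.exp(logits - m_new)
        s = s * scale + p.sum()
        acc = acc * scale + p @ v_blk
        m = m_new
    return acc / s

rng = np.random.default_rng(0)
q = rng.normal(size=64)
k, v = rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))
out = attention_fixed_splits(q, k, v, split_kv_size=256)  # same result for any batch composition
```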
### Reproducible Non-Greedy Sampling
To extend determinism beyond greedy decoding, we introduce a new sampling function: [multinomial_with_seed](https://github.com/sgl-project/sglang/blob/fb1e8acd2954b6267c73a199427976d89887ff0e/python/sglang/srt/layers/sampler.py#L263).
Instead of relying on `torch.multinomial`, which is inherently nondeterministic under batching, this operator perturbs logits with Gumbel noise generated from a **seeded hash function**. As a result, the same `(inputs, seed)` pair always yields the same sample, even when temperature > 0.
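The idea can be sketched as follows (a hedged illustration, not the actual `multinomial_with_seed` code; the hash constants and mixing steps are made up for the example): uniform noise is derived from a counter-based hash of the seed, the decoding step, and the token id, and the Gumbel-max trick turns sampling into a deterministic argmax.

```python
import numpy as np

def multinomial_with_seed_sketch(logits: np.ndarray, seed: int, step: int,
                                 temperature: float = 1.0) -> int:
    """Sample one token id from `logits`, reproducibly for a given (seed, step).

    Instead of a stateful RNG (whose draws can depend on batching), uniform
    noise comes from hashing (seed, step, token_id); the Gumbel-max trick then
    turns sampling from softmax(logits / temperature) into an argmax.
    """
    vocab = logits.shape[-1]
    # Counter-based hash (illustrative constants): mix token id, seed, and step.
    x = np.arange(vocab, dtype=np.uint64) * np.uint64(0x9E3779B97F4A7C15)
    x ^= np.uint64(seed) ^ (np.uint64(step) << np.uint64(32))
    x ^= x >> np.uint64(33)
    x *= np.uint64(0xFF51AFD7ED558CCD)   # wraps modulo 2**64, as intended
    x ^= x >> np.uint64(29)
    # Map the top 52 hash bits to a uniform value strictly inside (0, 1).
    u = ((x >> np.uint64(12)).astype(np.float64) + 0.5) / 2.0**52
    gumbel = -np.log(-np.log(u))
    return int(np.argmax(logits / temperature + gumbel))

# The same (inputs, seed) pair always yields the same sample, even at temperature > 0.
logits = np.array([2.0, 1.0, 0.5, -1.0])
assert multinomial_with_seed_sketch(logits, seed=123, step=7, temperature=0.7) == \
       multinomial_with_seed_sketch(logits, seed=123, step=7, temperature=0.7)
```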
@@ -159,7 +166,8 @@ Our future efforts will focus on enhancing deterministic inference by addressing
- **True On-Policy RL**: We plan to further integrate deterministic inference into reinforcement learning frameworks (e.g., [slime](https://github.com/THUDM/slime)) to enable reproducible sampling, with the ultimate goal of achieving true on-policy training.
- **Enhancing Radix Cache Functionality**: We will improve the radix tree to enable compatibility with a wider variety of attention kernels, moving beyond the current limitation to the FlashAttention 3 backend.
- **Tensor Parallelism**: TP1 and TP2 are deterministic because their floating-point addition order is consistent; larger TP setups require modifications to the reduction kernels to achieve determinism.
- **FlexAttention Integration**: Beyond the currently supported attention backends, we plan to extend deterministic inference support to FlexAttention in the future.
- A **roadmap** for deterministic inference features can be found in [this issue](https://github.com/sgl-project/sglang/issues/10278).
## Acknowledgement
We would like to extend our heartfelt gratitude to the following teams and collaborators: