---
title: "SGLang Day 0 Support for DeepSeek-V3.2 with Sparse Attention"
author: "The SGLang Team"
date: "September 29, 2025"
previewImg: /images/blog/deepseek_v32/ds_x_sgl_v2_2.png
---
We are excited to announce that **SGLang supports DeepSeek-V3.2 on Day 0**! According to the DeepSeek [tech report](https://github.com/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/DeepSeek_V3_2.pdf), DeepSeek-V3.2 equips DeepSeek-V3.1-Terminus with [DeepSeek Sparse Attention (DSA)](https://arxiv.org/pdf/2502.11089) through continued training. With DSA, a fine-grained sparse attention mechanism powered by a lightning indexer, DeepSeek-V3.2 achieves significant efficiency improvements in both training and inference, especially in long-context scenarios. For more details about upcoming features, please check our [Roadmap](https://github.com/sgl-project/sglang/issues/11060).

## Installation and QuickStart

To get started, simply pull the container and launch SGLang as follows:

```bash
docker pull lmsysorg/sglang:dsv32

python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention
```
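
Once the server is up, you can send requests through SGLang's OpenAI-compatible API. A minimal sketch, assuming the server is reachable on the default port 30000 on localhost:

```python
# Minimal client sketch against SGLang's OpenAI-compatible endpoint.
# Assumes the server launched above is listening on localhost:30000 (SGLang's default port).
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "deepseek-ai/DeepSeek-V3.2-Exp",
        "messages": [{"role": "user", "content": "Summarize DeepSeek Sparse Attention in one sentence."}],
        "max_tokens": 128,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```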

For AMD (MI350X):

```bash
docker pull lmsysorg/sglang:dsv32-rocm

SGLANG_NSA_KV_CACHE_STORE_FP8=false SGLANG_NSA_USE_REAL_INDEXER=true SGLANG_NSA_USE_TILELANG_PREFILL=true python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2-Exp --disable-cuda-graph --tp 8 --mem-fraction-static 0.85 --page-size 64 --nsa-prefill "tilelang" --nsa-decode "tilelang"
```

For NPU:

```bash
# NPU A2
docker pull lmsysorg/sglang:dsv32-a2
# NPU A3
docker pull lmsysorg/sglang:dsv32-a3

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2-Exp --trust-remote-code --attention-backend ascend --mem-fraction-static 0.85 --chunked-prefill-size 32768 --disable-radix-cache --tp-size 16 --quantization w8a8_int8
```

## Description

### DeepSeek Sparse Attention: Long-Context Efficiency Unlocked

At the heart of DeepSeek-V3.2 is **DeepSeek Sparse Attention (DSA)**, a fine-grained sparse attention mechanism that redefines long-context efficiency.

Instead of performing quadratic full attention over all tokens, DSA introduces:

* **Lightning Indexer** – an ultra-light FP8 scorer that identifies the most relevant tokens for each query.
* **Top-k Token Selection** – focuses computation on only the most impactful key-value entries.

This design reduces the complexity of core attention from **O(L^2) to O(Lk)**, delivering dramatic improvements in both training and inference efficiency at context lengths up to **128K**, with negligible loss of model quality.
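
To make the mechanism concrete, here is a minimal sketch of the indexer-plus-top-k idea for a single query. The names, shapes, and the plain dot-product scorer are illustrative simplifications, not SGLang's or DeepSeek's actual kernels:

```python
import torch

# Illustrative sketch of DSA-style top-k sparse attention for a single query.
# Real kernels (lightning indexer, FlashMLA, FlashAttention-3 sparse) operate
# on paged FP8 caches; everything here is a hypothetical simplification.
L, d, k = 8192, 128, 2048            # context length, head dim, top-k budget

q = torch.randn(d)                   # current query vector
keys = torch.randn(L, d)             # cached attention keys
values = torch.randn(L, d)           # cached attention values
index_keys = torch.randn(L, d)       # lightweight indexer keys (FP8 in practice)

# 1) Lightning-indexer-style scoring: a cheap relevance score per cached token.
scores = index_keys @ q

# 2) Top-k token selection: keep only the k most relevant positions.
topk = scores.topk(k).indices        # token-level indices into the cache

# 3) Core attention attends to just k tokens instead of all L,
#    i.e. O(Lk) total work across the sequence rather than O(L^2).
attn = torch.softmax((keys[topk] @ q) / d ** 0.5, dim=-1)
out = attn @ values[topk]
```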

To support this breakthrough, SGLang implements and integrates:

* **Lightning Indexer Support** – with a dedicated `key&key_scale` cache in the memory pool for ultra-fast token scoring.
* **Native Sparse Attention (NSA) Backend** – a new backend purpose-built for sparse workloads, featuring:
  * **FlashMLA** (DeepSeek's optimized MLA decoding kernel)
  * **FlashAttention-3 Sparse** (adapted for compatibility and maximum kernel reuse)
* **Additional work** – supporting different page sizes within one attention backend (see the sketch below):
  * The indexer `key&key_scale` cache requires page size = 64 (from the kernels provided by DeepSeek)
  * The token-level sparse forward operator requires page size = 1
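
As a rough illustration of why the two granularities coexist, the sketch below shows a paged indexer cache (64 tokens per page) producing token-level top-k indices for the sparse forward step. The layout and names are hypothetical, not SGLang's actual memory-pool code:

```python
import torch

# Hypothetical two-granularity layout: the indexer key cache is paged at 64
# tokens per page, while the sparse attention consumes per-token indices.
PAGE, L, D, K = 64, 4096, 128, 512

paged_index_keys = torch.randn(L // PAGE, PAGE, D)   # indexer cache, page size = 64
flat_index_keys = paged_index_keys.reshape(L, D)     # view at token granularity

q = torch.randn(D)
scores = flat_index_keys @ q                         # indexer scoring over all tokens
topk_tokens = scores.topk(K).indices                 # token-level (page size = 1) indices

# The sparse forward operator gathers KV entries by these per-token indices,
# while the indexer cache itself stays paged at 64 tokens per page.
page_ids, offsets = topk_tokens // PAGE, topk_tokens % PAGE
```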

Together, these innovations enable DeepSeek-V3.2-Exp to deliver **GPU-optimized sparse attention** and **dynamic cache management**, cutting memory overhead while scaling seamlessly to 128K contexts.

The result is a runtime that preserves state-of-the-art reasoning quality while **dramatically lowering inference costs**, making long-context LLM deployment not only possible but practical at scale.

## Future Work

Future work is tracked [here](https://github.com/sgl-project/sglang/issues/11060). More specifically, we plan to add:

* **Multi-token prediction (MTP)** support: MTP will speed up decoding, especially when the batch size is small.
* **FP8 KV Cache**: Compared to the traditional BF16 KV cache, FP8 almost doubles the number of tokens that fit in the KV cache and halves the memory-access pressure of attention kernels, making it possible to serve longer contexts and more requests, faster (see the back-of-envelope sketch after this list).
* **TileLang** support: TileLang kernels are useful for flexible kernel development.
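
For intuition on the FP8 KV cache point, here is a rough back-of-envelope calculation. The per-layer and per-token numbers are illustrative placeholders, not the exact model configuration:

```python
# Back-of-envelope: KV cache footprint per token, BF16 vs FP8.
# Illustrative MLA-style numbers (placeholders, not the exact model config):
layers = 61              # assumed number of transformer layers
per_layer_elems = 576    # assumed cached elements per token per layer

bf16_bytes = layers * per_layer_elems * 2   # BF16 = 2 bytes per element
fp8_bytes = layers * per_layer_elems * 1    # FP8 = 1 byte per element (ignoring scale factors)

print(f"BF16: {bf16_bytes / 1024:.1f} KiB/token, FP8: {fp8_bytes / 1024:.1f} KiB/token")
# Halving bytes per token means roughly 2x more tokens fit in the same KV cache
# budget, and attention kernels read about half as many bytes per decode step.
```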

## Acknowledgments

We sincerely thank the DeepSeek team for their outstanding contributions to open model research, which have greatly benefited the open-source community, as well as for their highly efficient kernels that are now integrated into the SGLang inference engine.

From the SGLang community, we thank Tom Chen, Ziyi Xu, Liangsheng Yin, Biao He, Baizhou Zhang, Henry Xiao, Hubert Lu, Wun-guo Huang, and Zhengda Qin for their contributions to DeepSeek-V3.2-Exp support.

We also thank NVIDIA, AMD, and Nebius Cloud for sponsoring the GPU machines used in the development of this work.