Commit 916d0fd

PierreLeGuen and claude authored
perf: optimize Qwen3.5-122B SGLang serving for H200 (#5)
* perf: optimize Qwen3.5-122B SGLang serving for H200

  Add performance flags to improve throughput and memory efficiency:
  - FP8 KV cache quantization for ~2x concurrent requests
  - Chunked prefill to prevent OOM on 128K context
  - FlashInfer attention backend
  - Aggressive scheduling for better batching

  Made-with: Cursor

* chore: increase context length to 262K (native max)

  Made-with: Cursor

* fix: use H200 default chunked-prefill-size of 8192

  4096 was below SGLang's auto-detected default for H200 GPUs (<160GB),
  which unnecessarily limited prefill throughput.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 8e390a9 commit 916d0fd

1 file changed: +6 -2 lines changed

Qwen3.5-122B.yaml

Lines changed: 6 additions & 2 deletions
@@ -51,8 +51,12 @@ x-sglang-qwen35-122b-common: &sglang-qwen35-122b-common
   sglang serve
   --model-path Qwen/Qwen3.5-122B-A10B
   --tp 4
-  --mem-fraction-static 0.80
-  --context-length 131072
+  --mem-fraction-static 0.88
+  --context-length 262144
+  --kv-cache-dtype fp8_e4m3
+  --chunked-prefill-size 8192
+  --attention-backend flashinfer
+  --schedule-conservativeness 0.5
   --reasoning-parser qwen3
   --tool-call-parser qwen3_coder
   --log-requests-level 0
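
The "~2x concurrent requests" figure in the commit message can be sanity-checked with back-of-the-envelope arithmetic: halving the bytes per KV element roughly doubles how many tokens fit in a fixed KV budget. The sketch below is only an illustration. The layer count, KV head count, head dimension, and memory budget are hypothetical placeholders, not the real Qwen3.5-122B-A10B configuration, and it assumes the unquantized cache stores 2-byte elements (bf16/fp16) and ignores any FP8 scaling-factor overhead.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache for one token: one K and one V vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(budget_bytes: int, bytes_per_token: int) -> int:
    """How many tokens fit in a fixed KV cache budget."""
    return budget_bytes // bytes_per_token


# Hypothetical values for illustration only (NOT the real model config).
NUM_LAYERS = 48
NUM_KV_HEADS = 4
HEAD_DIM = 128
KV_BUDGET_BYTES = 40 * 1024**3  # assumed memory left for KV cache per GPU

bf16_per_token = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 2)
fp8_per_token = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 1)

bf16_tokens = max_cached_tokens(KV_BUDGET_BYTES, bf16_per_token)
fp8_tokens = max_cached_tokens(KV_BUDGET_BYTES, fp8_per_token)

print(f"bf16: {bf16_per_token} B/token, {bf16_tokens} tokens")
print(f"fp8:  {fp8_per_token} B/token, {fp8_tokens} tokens")
print(f"capacity ratio: {fp8_tokens / bf16_tokens:.2f}x")
```

The ratio does not depend on the placeholder values, since they cancel out. The absolute token counts do, so they should be recomputed from the model's actual config before being used for capacity planning.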
