Commit 916d0fd
perf: optimize Qwen3.5-122B SGLang serving for H200 (#5)
* perf: optimize Qwen3.5-122B SGLang serving for H200
Add performance flags to improve throughput and memory efficiency:
- FP8 KV cache quantization for ~2x concurrent requests
- Chunked prefill to prevent OOM on 128K context
- FlashInfer attention backend
- Aggressive scheduling for better batching
Made-with: Cursor
* chore: increase context length to 262K (native max)
Made-with: Cursor
* fix: use H200 default chunked-prefill-size of 8192
4096 was below the default that SGLang auto-detects for H200 GPUs
(memory <160GB), and it unnecessarily limited prefill throughput.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 8e390a9 commit 916d0fd
1 file changed, +6 -2 lines changed: original lines 54-55 replaced by new lines 54-59.
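The changed lines themselves are not shown here, so the following is only a minimal sketch of how the options named in the commit message might map onto an SGLang launch command. It is not the actual diff. The model path is a placeholder, and the FP8 variant (`fp8_e5m2`), the tensor-parallel size, the scheduling value `0.8`, and the reading of "262K" as 262144 tokens are all assumptions. The helper `build_launch_command` is hypothetical.

```python
import shlex


def build_launch_command(
    model_path,
    context_length=262144,          # assumption: "262K" read as 262144 tokens
    chunked_prefill_size=8192,      # value named in the commit message
    kv_cache_dtype="fp8_e5m2",      # assumption: the commit only says "FP8"
    attention_backend="flashinfer",
    schedule_conservativeness=0.8,  # assumption: lower values schedule more aggressively
    tp_size=8,                      # assumption: not stated in the commit
):
    """Return the SGLang server launch command as an argument list."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp-size", str(tp_size),
        "--context-length", str(context_length),
        "--kv-cache-dtype", kv_cache_dtype,
        "--chunked-prefill-size", str(chunked_prefill_size),
        "--attention-backend", attention_backend,
        "--schedule-conservativeness", str(schedule_conservativeness),
    ]


if __name__ == "__main__":
    # "path/to/model" is a placeholder, not the real checkpoint name.
    print(shlex.join(build_launch_command("path/to/model")))
```

Building the command as a list keeps each flag and its value adjacent, which makes a later one-value change (such as 4096 to 8192) easy to review.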
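The "~2x concurrent requests" estimate for FP8 KV cache follows from simple arithmetic: each cached element shrinks from 2 bytes (BF16) to 1 byte, so the same memory budget holds about twice as many tokens. The sketch below illustrates this for a standard attention layout. The layer count, KV head count, head dimension, and memory budget are hypothetical and are not taken from the model's configuration.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache for one token: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(budget_bytes, num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Number of whole tokens whose KV entries fit in the given memory budget."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem)
    return budget_bytes // per_token


if __name__ == "__main__":
    # Hypothetical dimensions, chosen only to illustrate the ratio.
    dims = dict(num_layers=48, num_kv_heads=8, head_dim=128)
    budget = 40 * 1024**3  # hypothetical 40 GiB KV cache budget

    bf16_tokens = max_cached_tokens(budget, bytes_per_elem=2, **dims)
    fp8_tokens = max_cached_tokens(budget, bytes_per_elem=1, **dims)
    print(f"BF16: {bf16_tokens} tokens, FP8: {fp8_tokens} tokens")
```

The ratio is independent of the chosen dimensions, since only the per-element size changes. In practice the gain can be slightly below 2x if scaling factors or other metadata are stored alongside the quantized cache.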