
perf: optimize Qwen3.5-122B SGLang serving for H200#5

Merged
PierreLeGuen merged 3 commits into main from perf/qwen35-sglang-optimizations
Mar 6, 2026

Conversation

@PierreLeGuen
Contributor

Summary

  • Enable FP8 KV cache quantization (fp8_e4m3) to halve KV cache memory per token and allow roughly 2x concurrent requests
  • Add chunked prefill (4096 tokens) to prevent OOM on 128K context prompts
  • Set FlashInfer attention backend for optimized kernels on H200
  • Increase mem-fraction-static to 0.88 and tune scheduler for better throughput
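The flags above map onto SGLang's standard server arguments. A sketch of the launch command implied by the summary bullets follows; the model path and port are placeholders, not taken from the PR:

```shell
# Hypothetical launch command assembled from the PR summary; verify flag
# values against the actual config in this repo before deploying.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-122B \
  --kv-cache-dtype fp8_e4m3 \
  --chunked-prefill-size 4096 \
  --attention-backend flashinfer \
  --mem-fraction-static 0.88 \
  --port 30000
# --kv-cache-dtype fp8_e4m3     : stores KV cache in FP8, roughly halving its memory
# --chunked-prefill-size 4096   : splits long prefills into chunks to avoid OOM at 128K context
# --attention-backend flashinfer: selects FlashInfer attention kernels
# --mem-fraction-static 0.88    : fraction of GPU memory reserved for weights + KV pool
```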

Test plan

  • Deploy to a single H200 node and verify model loads without OOM
  • Run load test comparing throughput vs previous config
  • Validate 128K context prompts complete without memory errors
  • Monitor /metrics endpoint for KV cache utilization improvements

Made with Cursor

PierreLeGuen and others added 3 commits March 6, 2026 14:03
Add performance flags to improve throughput and memory efficiency:
- FP8 KV cache quantization for ~2x concurrent requests
- Chunked prefill to prevent OOM on 128K context
- FlashInfer attention backend
- Aggressive scheduling for better batching

Made-with: Cursor
The chunked-prefill size of 4096 was below SGLang's auto-detected default for
H200-class GPUs (<160GB), which unnecessarily limited prefill throughput.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@PierreLeGuen PierreLeGuen merged commit 916d0fd into main Mar 6, 2026
2 checks passed
@PierreLeGuen PierreLeGuen deleted the perf/qwen35-sglang-optimizations branch March 6, 2026 22:42