
perf: optimize Qwen3.5-122B SGLang serving for H200#5

Merged
PierreLeGuen merged 3 commits into main from perf/qwen35-sglang-optimizations
Mar 6, 2026

Conversation

@PierreLeGuen
Contributor

Summary

  • Enable FP8 KV cache quantization (fp8_e4m3) to halve KV cache memory per token and allow roughly 2x concurrent requests
  • Add chunked prefill (4096 tokens) to prevent OOM on 128K context prompts
  • Set FlashInfer attention backend for optimized kernels on H200
  • Increase mem-fraction-static to 0.88 and tune scheduler for better throughput
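The flags above map onto SGLang's standard server arguments. A sketch of the launch command implied by the summary bullets follows; the model path and port are placeholders, not taken from the PR:

```shell
# Hypothetical launch command assembled from the PR summary; verify flag
# values against the actual config in this repo before deploying.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-122B \
  --kv-cache-dtype fp8_e4m3 \
  --chunked-prefill-size 4096 \
  --attention-backend flashinfer \
  --mem-fraction-static 0.88 \
  --port 30000
# --kv-cache-dtype fp8_e4m3     : stores KV cache in FP8, roughly halving its memory
# --chunked-prefill-size 4096   : splits long prefills into chunks to avoid OOM at 128K context
# --attention-backend flashinfer: selects FlashInfer attention kernels
# --mem-fraction-static 0.88    : fraction of GPU memory reserved for weights + KV pool
```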

Test plan

  • Deploy to a single H200 node and verify model loads without OOM
  • Run load test comparing throughput vs previous config
  • Validate 128K context prompts complete without memory errors
  • Monitor /metrics endpoint for KV cache utilization improvements

Made with Cursor

PierreLeGuen and others added 3 commits March 6, 2026 14:03
Add performance flags to improve throughput and memory efficiency:
- FP8 KV cache quantization for ~2x concurrent requests
- Chunked prefill to prevent OOM on 128K context
- FlashInfer attention backend
- Aggressive scheduling for better batching

Made-with: Cursor
The chunked-prefill size of 4096 was below SGLang's auto-detected default for
H200-class GPUs (<160GB), which unnecessarily limited prefill throughput.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@PierreLeGuen PierreLeGuen merged commit 916d0fd into main Mar 6, 2026
2 checks passed
@PierreLeGuen PierreLeGuen deleted the perf/qwen35-sglang-optimizations branch March 6, 2026 22:42