- Qwen3.5-397B Benchmarks
- Kimi K2.5 Benchmarks
- GLM-5 Benchmarks
- MiniMax-M2.5 Benchmarks
- Cross-Model Comparison
- Wattage-Performance Scaling
- NCCL AllReduce Benchmarks
- P2P Interconnect Benchmarks
- Benchmark Tools
## Qwen3.5-397B Benchmarks

All numbers are decode tok/s unless noted.
| GPUs | Quant | Engine | MTP | Decode tok/s | Notes |
|---|---|---|---|---|---|
| 4x | AWQ-INT4 | SGLang | MTP=5 | 152 | QuantTrio, best quality+speed |
| 4x | NVFP4 | SGLang | MTP=5 | 132 | lukealonso |
| 4x | NVFP4 | SGLang | No | 42-51 | kcramp, Ixtrix |
| 4x | NVFP4 | SGLang | Yes (3-step) | 85 | Festr (unstable) |
| 4x | NVFP4 | vLLM | No | 70-86 | Multiple users, stable |
| 4x | NVFP4 | vLLM | MTP=2 | 130 | malaiwah |
| 4x | NVFP4 | vLLM | MTP=5 | 150-250 | Festr, orangezed (peaks in code gen) |
| 8x | NVFP4 | SGLang | Yes | 350 | luke (EP=8, switches, heavily patched) |
| 8x | FP8 | SGLang | Yes | 75-125 | CyySky |
| 8x | FP8 | vLLM | MTP=2 | -- | Expected similar to SGLang |
MTP=2 vs no MTP (vLLM, nvidia NVFP4, 4x RTX 6000 Pro):
| Concurrency | No MTP (tok/s) | MTP=2 (tok/s) | Improvement |
|---|---|---|---|
| 1 | 85.8 | 130.0 | +51.5% |
| 2 | 137.1 | 212.7 | +55.1% |
| 5 | 234.2 | 358.6 | +53.1% |
| 10 | 334.3 | 573.5 | +71.6% |
| 20 | 491.5 | 744.1 | +51.4% |
| 32 | 605.9 | 922.6 | +52.3% |
Peak throughput: 1127.1 tok/s at 50 users with 1K context.
MTP acceptance stats: 89.2% acceptance rate, mean acceptance length 2.73-3.69 tokens.
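For reference, a minimal vLLM launch sketch for an MTP=2 run like the one above; the model ID is a placeholder and the speculative-config method string varies by vLLM version, so treat this as an assumption rather than the exact command used:

```bash
# Sketch only: 4-GPU tensor parallel, 2 speculative (MTP) draft tokens.
# "nvidia/Qwen3.5-397B-NVFP4" is a placeholder model ID; the "method"
# value depends on the vLLM version and model architecture.
vllm serve nvidia/Qwen3.5-397B-NVFP4 \
  --tensor-parallel-size 4 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'
```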
Does MTP speculative decoding affect accuracy? Setup: SGLang 0.5.9, 8x RTX PRO 6000 Blackwell, TP8, GPQA Diamond (198 questions), 8 repeats, temperature=0.0, thinking enabled, simple-evals framework.
| Run | lukealonso MTP | lukealonso No MTP | nvidia MTP | nvidia No MTP |
|---|---|---|---|---|
| 1 | 0.889 | 0.864 | 0.859 | 0.864 |
| 2 | 0.879 | 0.874 | 0.904 | 0.859 |
| 3 | 0.869 | 0.894 | 0.859 | 0.869 |
| 4 | 0.884 | 0.869 | 0.884 | 0.864 |
| 5 | 0.904 | 0.889 | 0.879 | 0.848 |
| 6 | 0.884 | 0.864 | 0.859 | 0.864 |
| 7 | 0.879 | 0.874 | 0.874 | 0.869 |
| 8 | 0.874 | 0.874 | 0.879 | 0.889 |
| Mean | 0.8826 | 0.8750 | 0.8744 | 0.8655 |
| Wall time | ~1h 29m | ~1h 48m | ~1h 43m | ~2h 15m |
MTP impact: lukealonso +0.76pp (p=0.18, not significant), nvidia +0.89pp. MTP delivers 18-24% faster inference with no measurable quality loss.
Checkpoint impact: lukealonso consistently outperforms nvidia (+0.82-0.95pp on GPQA, +5pp on GSM8K without thinking). Consistent with KLD results (0.035 vs 0.109).
Full analysis with GSM8K and Hard Math results: mtp-quality-evaluation.md.
Decode speed by context length:

| Context | Decode tok/s |
|---|---|
| 10K | 77 |
| 100K | 75 |
| 300K | 73 |
| 500K | 67 |
| 900K | 56 |
CARVE quant vs NVIDIA reference NVFP4 by context length:

| Context | CARVE tok/s | NVIDIA REF tok/s | Winner |
|---|---|---|---|
| 10K | 76.9 | 92.3 | REF +20% |
| 50K | 75.5 | 91.3 | REF +21% |
| 100K | 74.8 | 73.6 | ~tied |
| 200K | 74.3 | 95.5 | REF +29% |
| 300K | 73.3 | 43.8 | CARVE +67% |
| 400K | 67.9 | 42.3 | CARVE +61% |
| 500K | 67.0 | 42.2 | CARVE +59% |
Key finding: CARVE model maintains much better performance at >300K context than the NVIDIA reference NVFP4.
67 tok/s single stream at 256K context.
Throughput scaling (vLLM, MTP=2, nvidia NVFP4, 4x RTX 6000 Pro):

- Peak throughput: 1127.1 tok/s (50 users @ 1K context)
- Best efficiency: 120.0 tok/s/user (1 user @ 1K context)
- Lowest latency: 12.30 s (1 user @ 1K context)

At 32 concurrent requests:

- Avg generation throughput: 1287.2 tok/s
- Spec decoding: mean acceptance length 2.82
- Accepted throughput: 830.21 tok/s
## Kimi K2.5 Benchmarks

All on 8x RTX 6000 Pro unless noted.
| System | Engine | KV Cache | DCP | Decode tok/s (0K ctx) | Notes |
|---|---|---|---|---|---|
| luke (switches) | SGLang | BF16 | -- | 101 | INT4, EP=8, custom AR, overclocked GDDR7 |
| CyySky | SGLang | BF16 | -- | 90 | INT4, 232K context |
| Festr Turin | vLLM | BF16 | 1 | 90 | INT4, FA2, best single batch on vLLM |
| Festr Turin | vLLM | FP8 | 1 | 79 | INT4, Triton MLA |
| Festr Turin | vLLM | FP8 | 8 | 65 | INT4, Triton MLA, 3.6M context |
| Grimulkan (switches) | vLLM | FP8 | 8 | 62 | INT4, normal NCCL |
| nvidia checkpoint | SGLang | FP8 | -- | 53-55 | NVFP4, ~450K context |
| Festr Turin | vLLM | FP8 | 8 (no P2P) | 44 | INT4, NCCL_P2P_DISABLE=1 |
| orangezed | vLLM | FP8 | 8 | 32-35 | INT4, Genoa 5-ch DIMM, 2x xGMI |
vLLM, INT4, FP8 KV, DCP=8, 8x RTX 6000 Pro:
| System | 0K Context | 100K Context | 200K Context |
|---|---|---|---|
| Festr Turin (P2P) | 65 tok/s | 36 tok/s | 27 tok/s |
| Festr Turin (no P2P) | 44 tok/s | 29 tok/s | 23 tok/s |
| Festr Genoa | ~32 tok/s | ~32 tok/s | -- |
| Grimulkan (switches) | 62 tok/s | 32 tok/s | 21 tok/s |
| orangezed (Genoa) | 32-35 tok/s | 30-35 tok/s* | 19-20 tok/s |
*orangezed initially reported 8.6-10.2 tok/s at 100K, but this was wall-clock time including prefill. Actual decode throughput from vLLM stats was 30-35 tok/s.
Without DCP at 150K context: 6-7 tok/s (unusable). With DCP=8: 28-35 tok/s.
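A hedged launch sketch for the DCP=8 runs above; the model ID is a placeholder and --decode-context-parallel-size requires a vLLM build with DCP support for MLA models:

```bash
# Sketch: TP=8 with decode context parallelism over all 8 ranks and
# FP8 KV cache, which multiplies usable KV capacity at long context.
# Model ID is a placeholder; verify DCP support in your vLLM version.
vllm serve moonshotai/Kimi-K2.5-INT4 \
  --tensor-parallel-size 8 \
  --decode-context-parallel-size 8 \
  --kv-cache-dtype fp8
```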
Attention backend comparison (vLLM, 8x RTX 6000 Pro):
| TP | DCP | KV Cache | KV Cache Space | Triton MLA tok/s | FA2 tok/s | XQA tok/s |
|---|---|---|---|---|---|---|
| 8 | 1 | FP8 | 380K tok | 79 | N/A | N/A |
| 8 | 8 | FP8 | 3M tok | 68 | N/A | N/A |
| 8 | 1 | BF16 | 190K tok | 78 | 90 | WIP |
| 8 | 8 | BF16 | 1.5M tok | 67 | 72 | N/A |
Total KV cache capacity by configuration:

| Config | Total KV Cache Tokens |
|---|---|
| FP8 KV, DCP=1 | ~449,600 |
| FP8 KV, DCP=8 | ~3,621,504 |
| BF16 KV, DCP=1 | ~190,000 |
| BF16 KV, DCP=8 | ~1,500,000 |
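The backend column in the table above is typically selected via vLLM's attention-backend environment variable; a sketch, assuming the backend name string matches your build:

```bash
# Force the Triton MLA backend (the FP8-KV-capable path in these tests).
# Backend name strings differ across vLLM versions, and <model> is a
# placeholder; check your build before relying on this.
VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve <model> \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8
```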
Festr, 100 concurrent requests at 40K context each:
- 900 tok/s total with vLLM, FP8 KV, DCP=8, TP=8
P2P vs no-P2P at high concurrency (MiniMax M2.5 test as proxy):
- P2P enabled: 5,000 tok/s
- P2P disabled: 10,000 tok/s

At low concurrency, P2P gives faster per-token decode; at high concurrency, routing through host DRAM (P2P disabled) wins on aggregate throughput.
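Toggling P2P for an A/B test of this effect is a single NCCL environment variable (model placeholder below):

```bash
# Disable CUDA P2P so NCCL stages transfers through host DRAM;
# slower per token at low concurrency, faster in aggregate at high.
NCCL_P2P_DISABLE=1 vllm serve <model> --tensor-parallel-size 8
```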
## GLM-5 Benchmarks

All on 8x RTX 6000 Pro, SGLang, NVFP4.
| Configuration | 0K Context | 15K Context | 100K Context | 200K Context |
|---|---|---|---|---|
| NVFP4 no MTP (early, luke) | ~50 | -- | -- | -- |
| NVFP4 no MTP (Festr/JTazz) | 35-44 | 30 | -- | -- |
| NVFP4 + MTP (EAGLE) | 70-105 | -- | 60-80 | -- |
| NVFP4 + MTP (latest, Festr) | ~100 | -- | 60-80 | ~50 |
| NVFP4 + MTP (orangezed) | 97.2 | -- | -- | -- |
MTP acceptance stats:

- Accept rate: 0.55-0.94 (varies by context)
- Accept length: 2.19-2.80 tokens
- Speed improvement: roughly 2x over non-MTP baseline
3 running requests with MTP: 133-135 tok/s generation throughput.
Per-GPU memory breakdown (8 GPUs, TP8, NVFP4 + MTP):
| Component | Size |
|---|---|
| Weights (NVFP4) | 57.06 GB |
| KV Cache (bf16) | 29.32 GB |
| Total allocated | ~86.38 GB |
| Available | 7.43-7.53 GB |
KV cache capacity with --mem-fraction-static 0.92: 314,304 tokens total, context_len 202,752.
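A minimal SGLang launch sketch consistent with the settings referenced in this section (EAGLE-style MTP, TP8, --mem-fraction-static 0.92); the model path and speculative step/draft counts are assumptions, not the exact flags behind these numbers:

```bash
# Sketch: 8-GPU TP with EAGLE speculative decoding (MTP head).
# Model path and speculative counts below are placeholders.
python -m sglang.launch_server \
  --model-path zai-org/GLM-5-NVFP4 \
  --tp 8 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.92
```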
Startup time:

| Phase | Duration |
|---|---|
| Model load (multithread, 8 threads) | ~36 sec |
| CUDA graph capture | ~208 sec |
| Total | ~7-8 min |
## MiniMax-M2.5 Benchmarks

| GPUs | Quant | Engine | Decode tok/s | Notes |
|---|---|---|---|---|
| 1× | REAP NVFP4 | SGLang | ~70 | luke, pruned 139B model |
| 2× | NVFP4 | SGLang | 85-89 | Festr, destroyed |
| 2× | NVFP4 | vLLM | ~85 | Festr, TP2 |
| 2× | AWQ | vLLM | ~114 (low ctx) | Marky, faster at low context |
| 2× | AWQ | vLLM | ~50 (130K+ ctx) | Marky, slower at high context |
| 4× | FP8 | SGLang | ~71 | Ixtrix, defaults |
| 4× | FP8 | vLLM | ~81 (20K ctx) | chisleu |
| 8× | FP8 (EP) | SGLang | ~86 | CyySky, tuned MoE kernels |
Serving metrics under load:

| Metric | Value |
|---|---|
| Output throughput (64 concurrent) | 930 tok/s |
| Peak output throughput | 1551 tok/s |
| Mean TTFT | 340 ms |
| Mean TPOT | 56 ms |
At high concurrency with a 500W power limit, NVFP4 on 2× GPUs nearly matches FP8 on 4× GPUs at 300W: a strong value proposition for 2-GPU builds.
| Concurrency | 300W tok/s | 500W tok/s | Improvement |
|---|---|---|---|
| 64 | 1206 | 1558 | +29% |
| 32 | -- | -- | ~25% |
| 16 | -- | -- | ~16% |
| 4 or below | -- | -- | ~0% |
## Cross-Model Comparison

| Model | GPUs | Quant | Engine | MTP | Best tok/s |
|---|---|---|---|---|---|
| Qwen3.5-397B | 4x | AWQ-INT4 | SGLang | MTP=5 | 152 |
| Qwen3.5-397B | 4x | NVFP4 | SGLang | MTP=5 | 132 |
| Qwen3.5-397B | 4x | NVFP4 | vLLM | MTP=2 | 130 |
| Qwen3.5-397B | 8x | NVFP4 | SGLang | Yes | 350 |
| Kimi K2.5 | 8x | INT4 | SGLang | No | 101 |
| Kimi K2.5 | 8x | INT4 | vLLM | No | 90 |
| GLM-5 | 8x | NVFP4 | SGLang | MTP | ~100 |
| MiniMax-M2.5 | 2x | NVFP4 | SGLang | No | 85-89 |
| MiniMax-M2.5 | 2x | AWQ | vLLM | No | 114 (low ctx) |
| MiniMax-M2.5 | 4x | FP8 | SGLang | No | 71 |
Model fit by GPU count:

| GPUs | NVFP4 Models | FP8 Models |
|---|---|---|
| 1x 96GB | Qwen3.5-27B, MiniMax-M2.5-REAP NVFP4 | -- |
| 2x 96GB | MiniMax-M2.5 NVFP4, Qwen3.5-122B NVFP4 | -- |
| 4x 96GB | Qwen3.5-397B NVFP4, GLM-4.7 NVFP4 | MiniMax-M2.5 FP8 |
| 6x 96GB | GLM-5 NVFP4 (TP2 PP3) | -- |
| 8x 96GB | All current models | GLM-4.7 FP8, Qwen3.5-397B FP8, Kimi K2.5 INT4 |
| 16x 96GB | All models with massive KV cache | GLM-5 FP8 |
## Wattage-Performance Scaling

Based on the wattage-performance benchmarks at https://shihanqu.github.io/Blackwell-Wattage-Performance/:
- 500W vs 600W: Nearly identical performance.
- 300W vs 500W: 4% loss at single-user, up to 30% loss at 64 concurrent users.
- 400W to 300W: Significant performance drop at high concurrency.
- 300W: Almost no penalty at 4 concurrent users or below.
MaxQ (300W) vs Workstation (600W): ~20% faster prefill on WS, similar decode speed (VRAM/PCIe limited).
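Power limits for these tests are set per GPU with nvidia-smi, e.g.:

```bash
# Cap all GPUs at 300 W (MaxQ-like operating point).
sudo nvidia-smi -pl 300
# Restore the 600 W Workstation default.
sudo nvidia-smi -pl 600
```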
## NCCL AllReduce Benchmarks

| System | Config | Avg Bus BW (GB/s) |
|---|---|---|
| luke (8x MaxQ, 3 switches) | NCCL_MIN_NCHANNELS=8 | 41.1 |
| Grimulkan (8x, 4 switches) | NCCL_MIN_NCHANNELS=8 | ~39.4 |
| Festr (8x Server, dual Turin) | NCCL_MIN_NCHANNELS=8 | 37.6 |
| Festr (8x Server, dual Turin) | Default | 22.2 |
| Message Size | Without XML | With XML | Speedup |
|---|---|---|---|
| 32 KB | 48.16 us | 26.20 us | 1.84x |
| 64 KB | 48.69 us | 25.59 us | 1.90x |
| 128 KB | 51.56 us | 32.09 us | 1.61x |
| 256 KB | 56.48 us | 37.26 us | 1.52x |
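Assuming the XML above refers to a custom NCCL topology file, it is injected via NCCL_TOPO_FILE; a sketch using the nccl-tests binary shown under Benchmark Tools (the XML path is a placeholder):

```bash
# Point NCCL at a hand-written topology description and re-run the
# small-message sweep; /etc/nccl/topo.xml is a placeholder path.
NCCL_TOPO_FILE=/etc/nccl/topo.xml NCCL_MIN_NCHANNELS=8 \
  ./all_reduce_perf -b 32K -e 256K -f 2 -g 8
```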
## P2P Interconnect Benchmarks

| Metric | Value |
|---|---|
| P2P unidirectional write bandwidth | ~55-56 GB/s |
| P2P bidirectional write bandwidth | ~111 GB/s |
| P2P enabled latency (same switch/NUMA) | 0.36-0.45 us |
| P2P disabled latency | ~14 us |
| System | PCIe Link Score | Dense Interconnect Score | Effective Latency |
|---|---|---|---|
| luke (switches) | 0.86 (54.3 GB/s) | 0.44 (191.8/434.7 GB/s) | 6.79 us |
| Festr Turin (dual CPU) | 0.84 (52.7 GB/s) | 0.41 (173.1/421.3 GB/s) | 6.03 us |
| Grimulkan (switches) | 0.86 (53.9 GB/s) | 0.38 (164.3/431.2 GB/s) | 7.04 us |
Custom allreduce vs NCCL by message size:

| Size | Custom (us) | NCCL (us) | Winner |
|---|---|---|---|
| 256 B | 7.5 | 24.6 | Custom 3.3x |
| 1 KB | 7.5 | 24.1 | Custom 3.2x |
| 8 KB | 9.2 | 24.2 | Custom 2.6x |
| 32 KB | 16.5 | 24.5 | Custom 1.5x |
| 64 KB | 25.9 | 24.1 | NCCL 1.1x |
| 256 KB | 73.6 | 28.0 | NCCL 2.6x |
Custom allreduce is optimized for PCIe switch topologies. On dual-CPU systems without switches, it is slower than default NCCL.
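vLLM ships its own PCIe P2P custom allreduce with an analogous trade-off (not the same implementation as the one benchmarked above); on topologies where it loses to NCCL, it can be turned off:

```bash
# Fall back to plain NCCL allreduce on dual-CPU, switchless topologies.
vllm serve <model> --tensor-parallel-size 8 --disable-custom-all-reduce
```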
## Benchmark Tools

- URL: https://github.com/shihanqu/vllm-benchmark-suite
- Setup:

```bash
git clone https://github.com/notaDestroyer/vllm-benchmark-suite.git
cd vllm-benchmark-suite
uv venv --python 3.12
source .venv/bin/activate
uv pip install -r requirements.txt
uv pip install transformers torch
# Edit vllm_benchmark_suitev2.py and change API_BASE_URL
python vllm_benchmark_suitev2.py
```
- Model name must be the full HuggingFace name
- `HF_HUB_OFFLINE=1` helps avoid tokenizer download issues
- Guide: https://github.com/nvjullin/sglang/blob/update-benchmark-doc/docs/developer_guide/bench_serving.md
- Built-in benchmarking via `sglang.bench_one_batch_server`
- URL: https://pinchbench.com/
- Repo: https://github.com/pinchbench/skill
- Requires OpenClaw CLI for task execution and LLM-judge grading
- URL: https://github.com/EleutherAI/lm-evaluation-harness
- Standard eval suite (MMLU-Pro, GPQA, IFEval, etc.)
```bash
# Located at /usr/src/nccl-tests (in NVIDIA containers)
NCCL_P2P_LEVEL=SYS NCCL_NET_GDR_LEVEL=SYS ./all_reduce_perf -b 32M -g 8 -c 0
NCCL_NET_GDR_LEVEL=SYS NCCL_MIN_NCHANNELS=8 ./all_reduce_perf -b 8M -e 2G -f 2 -g 8 -n 50
```

- URL: https://github.com/lukealonso/p2pmark
- Commands:

```bash
./p2pmark              # bandwidth and topology
./p2pmark --latency    # P2P latency
./p2pmark --allreduce  # custom vs NCCL allreduce comparison
```
- URL: https://github.com/voipmonitor/amd-epyc-gpu-fabric-monitor
- Real-time monitoring of AMD EPYC GPU fabric transfers
- URL: https://shihanqu.github.io/Blackwell-Wattage-Performance/
- Tests MiniMax-M2.5 NVFP4 at various power limits and concurrency levels
Recommended eval benchmarks:

| Benchmark | Use Case | Notes |
|---|---|---|
| MMLU-Pro | Knowledge testing | Use temp=0.01 |
| GPQA | Long-context reasoning | Traces can reach 64K tokens |
| AIME 2025 | Math reasoning | Requires nemo-skill install |
| WikiText perplexity | Quant quality assessment | Test across context lengths |
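For the table above, a hedged lm-evaluation-harness invocation against a local OpenAI-compatible endpoint; the base URL, model name, and task name are placeholders to verify against your harness version:

```bash
# MMLU-Pro at near-greedy temperature against a local server.
lm_eval --model local-completions \
  --model_args base_url=http://localhost:8000/v1/completions,model=<model> \
  --tasks mmlu_pro \
  --gen_kwargs temperature=0.01
```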