Speculative decoding is the single largest throughput optimization for single-user inference on RTX PRO 6000 Blackwell systems, providing 50-100% speedup at low concurrency.
- Multi-Token Prediction (MTP) Explained
- MTP Configuration
- MTP=2 as the Sweet Spot
- EAGLE Speculative Decoding
- Acceptance Rate Data
- Throughput Improvements
- Known Issues and Limitations
- Model-Specific Notes
MTP is a speculative decoding technique in which the model drafts several tokens ahead and then verifies them. Instead of generating one token per forward pass, the model drafts N candidate tokens and verifies them in a single pass, accepting the longest run of drafted tokens that matches what it would have generated autoregressively.
How it works:
- The main model generates token T
- MTP heads (lightweight layers trained alongside the model) draft tokens T+1, T+2, ..., T+N
- The main model verifies all drafted tokens in a single forward pass
- Accepted tokens are emitted; rejected tokens are discarded and generation continues from the last accepted token
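The steps above can be sketched as a toy draft-then-verify loop. This is a minimal illustration under stated assumptions, not how vLLM or SGLang implement it: `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names, verification is greedy (exact match), and the target is called once per position for readability where a real engine would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List


def speculative_step(
    context: List[int],
    draft_fn: Callable[[List[int], int], List[int]],
    target_fn: Callable[[List[int]], int],
    num_speculative_tokens: int,
) -> List[int]:
    """Return the tokens emitted by one draft-then-verify step (greedy)."""
    drafted = draft_fn(context, num_speculative_tokens)
    emitted: List[int] = []
    for token in drafted:
        # What the main model would have produced at this position.
        expected = target_fn(context + emitted)
        if token != expected:
            # First mismatch: keep the main model's token, drop later drafts.
            emitted.append(expected)
            return emitted
        emitted.append(token)
    # Every draft was accepted, so the verify pass also yields one extra token.
    emitted.append(target_fn(context + emitted))
    return emitted


# Toy main model (assumption): the next token is the previous token plus one.
def toy_target(tokens: List[int]) -> int:
    return tokens[-1] + 1


# Toy draft head (assumption): correct for two positions, wrong afterwards.
def toy_draft(tokens: List[int], n: int) -> List[int]:
    return [tokens[-1] + i + 1 if i < 2 else -1 for i in range(n)]


result = speculative_step([10], toy_draft, toy_target, num_speculative_tokens=3)
print(result)
```

In this toy run the first two drafts are accepted and the third is rejected, so the step emits `[11, 12, 13]`: two accepted drafts plus the main model's own token at the first mismatch.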
Key advantage: MTP heads are part of the model itself (not a separate draft model), so there is no additional VRAM overhead for a draft model -- only the MTP layer weights (~19 GB for GLM-5's layer 78 in BF16).
Models with native MTP support:
- Qwen3.5 (all sizes)
- GLM-5
- GLM-4.7
Models WITHOUT MTP:
- Kimi K2.5 (uses EAGLE3 instead)
- MiniMax-M2.5
# Standard MTP (recommended for most models):
--speculative-config '{"method":"mtp","num_speculative_tokens":2}'
# Qwen3.5-specific (newer vLLM versions):
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
# Higher token count (requires PR #35615 for >1, unstable at >3):
--speculative-config '{"method":"mtp","num_speculative_tokens":5}'

# Environment variable (MANDATORY):
export SGLANG_ENABLE_SPEC_V2=True
# Launch flags:
--speculative-algorithm NEXTN
--speculative-num-steps 3
--speculative-eagle-topk 1
--speculative-num-draft-tokens 4

CRITICAL: SGLANG_ENABLE_SPEC_V2=True is mandatory. Without it, SGLang silently converts NEXTN to EAGLE and tries to load the full model a second time as a draft model, causing instant OOM (e.g., 57 GB x 2 = 114 GB per GPU on a 96 GB card for GLM-5).
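The memory arithmetic behind this failure can be checked in a few lines. This is a rough sketch: `fits_on_gpu` is a hypothetical helper, not part of SGLang, and it compares weight size against VRAM only, using the GLM-5 figures quoted above.

```python
def fits_on_gpu(weights_gb_per_gpu: float, copies: int, vram_gb: float) -> bool:
    """True if `copies` copies of the per-GPU weight shard fit in VRAM.

    Ignores KV cache, activations, and CUDA graphs, so a True result is
    necessary but not sufficient for a successful launch.
    """
    return weights_gb_per_gpu * copies <= vram_gb


# Figures from the GLM-5 example: 57 GB of weights per GPU, 96 GB cards.
one_copy = fits_on_gpu(57, copies=1, vram_gb=96)    # model loaded once
two_copies = fits_on_gpu(57, copies=2, vram_gb=96)  # model loaded again as a draft
print(one_copy, two_copies)
```

With one copy the weights fit; with the second copy they need 114 GB and do not, which is why the fallback fails at load time rather than later under load.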
export SGLANG_ENABLE_SPEC_V2=1
--speculative-algorithm EAGLE
--speculative-num-steps 3
--speculative-eagle-topk 1
--speculative-num-draft-tokens 4

Community testing consistently shows MTP=2 as the optimal setting for the NVIDIA NVFP4 checkpoint on vLLM:
| MTP Tokens | Stability | Performance | Recommendation |
|---|---|---|---|
| 0 (disabled) | Stable | Baseline | Safe fallback |
| 1 | Stable | ~30% improvement | Conservative |
| 2 | Stable | ~50-55% improvement | Recommended |
| 3 | Mostly stable | ~55-60% improvement | Risky |
| 5 | Unstable at long context | ~70%+ improvement | Not recommended for production |
| Position | Acceptance Rate |
|---|---|
| T+1 | ~85% |
| T+2 | ~67% |
| T+3 | ~51% |
| T+4 | ~36% |
| T+5 | ~31% |
The diminishing returns past position 2-3, combined with increased instability, make MTP=2 the practical sweet spot.
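The diminishing returns can be made concrete with a back-of-the-envelope estimate of tokens emitted per verification step. This is a sketch under one loud assumption: each per-position rate is treated as the unconditional probability that the draft at that position is accepted (so it already accounts for earlier rejections). If the rates were measured as conditional probabilities, the sum below would overestimate. `expected_tokens_per_step` is a hypothetical helper name.

```python
# Per-position acceptance rates from the table above (T+1 .. T+5).
ACCEPTANCE = [0.85, 0.67, 0.51, 0.36, 0.31]


def expected_tokens_per_step(rates, num_speculative_tokens):
    """One token from the verify pass plus the expected number of accepted drafts.

    Assumption: rates[k] is the unconditional probability that the draft at
    position k+1 is accepted.
    """
    return 1.0 + sum(rates[:num_speculative_tokens])


for n in range(1, len(ACCEPTANCE) + 1):
    total = expected_tokens_per_step(ACCEPTANCE, n)
    marginal = ACCEPTANCE[n - 1]
    print(f"MTP={n}: ~{total:.2f} tokens/step (+{marginal:.2f} from position {n})")
```

Under that assumption MTP=2 yields roughly 2.52 tokens per step and MTP=5 roughly 3.70. Positions 3 to 5 together add about as much as positions 1 and 2 alone contribute beyond the first, while each extra position adds drafting and verification work, so tokens per step is an upper bound on the speedup rather than the speedup itself.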
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) uses a separate lightweight draft model rather than built-in MTP heads.
Kimi K2.5 does not have native MTP, so EAGLE3 is the speculative decoding path:
- Draft model: AQ-MedAI/Kimi-K2-Instruct-eagle3
- SGLang PR: sgl-project/sglang#19689
- vLLM PR: vllm-project/vllm#35966
- Current status: Grimulkan reported speculative decoding was "strictly worse than no speculation" for Kimi K2.5 in all vLLM configurations tried. Potential for 30% improvement once EAGLE3 + FA2 is working.
boo's optimized EAGLE config for Qwen3.5-397B NVFP4:
SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
--model-path nvidia/Qwen3.5-397B-A17B-NVFP4 \
--tp 4 \
--trust-remote-code \
--attention-backend triton \
--moe-runner-backend cutlass \
--kv-cache-dtype fp8_e4m3 \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--quantization modelopt_fp4 \
--mamba-scheduler-strategy extra_buffer \
--fp8-gemm-backend triton \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 16}'

Result: ~130 TPS with MTP=3 on Qwen3.5 NVFP4.
| Metric | Value |
|---|---|
| Draft acceptance rate | 89.2% |
| Tokens drafted | 165,550 |
| Tokens accepted | 147,689 |
| Mean acceptance length | 2.73-3.69 tokens |
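The draft acceptance rate in this table is the ratio of accepted to drafted tokens. A minimal sketch that reproduces it from the two counters, with `acceptance_rate` as a hypothetical helper name rather than an engine API:

```python
def acceptance_rate(accepted: int, drafted: int) -> float:
    """Fraction of drafted tokens that the main model accepted."""
    if drafted == 0:
        return 0.0  # nothing drafted yet, avoid dividing by zero
    return accepted / drafted


# Counters copied from the table above.
rate = acceptance_rate(accepted=147_689, drafted=165_550)
print(f"{rate:.1%}")
```

The same calculation applies to any pair of drafted and accepted counters an engine exposes in its logs or metrics.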
| Metric | Value |
|---|---|
| Accept rate | 0.55-0.94 (varies by context) |
| Accept length | 2.19-2.80 tokens |
| Speed improvement | ~2x over non-MTP baseline |
| Position | Acceptance Rate |
|---|---|
| 1 | ~0.85 |
| 2 | ~0.67 |
| 3 | ~0.51 |
| 4 | ~0.36 |
| 5 | ~0.31 |
| Concurrent Users | No MTP (tok/s) | MTP=2 (tok/s) | Improvement |
|---|---|---|---|
| 1 | 85.8 | 130.0 | +51.5% |
| 2 | 137.1 | 212.7 | +55.1% |
| 5 | 234.2 | 358.6 | +53.1% |
| 10 | 334.3 | 573.5 | +71.6% |
| 20 | 491.5 | 744.1 | +51.4% |
| 32 | 605.9 | 922.6 | +52.3% |
Peak throughput with MTP=2: 1,127 tok/s at 50 concurrent users (1K context each).
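The improvement column follows directly from the two throughput columns. The sketch below recomputes it and also derives a per-user figure, which assumes the tok/s columns are aggregate across all concurrent users; `improvement_pct` is a hypothetical helper name.

```python
# (concurrent users, baseline tok/s, MTP=2 tok/s), copied from the table above.
ROWS = [
    (1, 85.8, 130.0),
    (2, 137.1, 212.7),
    (5, 234.2, 358.6),
    (10, 334.3, 573.5),
    (20, 491.5, 744.1),
    (32, 605.9, 922.6),
]


def improvement_pct(baseline: float, with_mtp: float) -> float:
    """Relative throughput gain of the MTP run over the baseline, in percent."""
    return (with_mtp / baseline - 1.0) * 100.0


for users, baseline, with_mtp in ROWS:
    gain = improvement_pct(baseline, with_mtp)
    # Assumption: tok/s is aggregate, so divide by users for a per-user rate.
    per_user = with_mtp / users
    print(f"{users:>2} users: {gain:+.1f}% ({per_user:.1f} tok/s per user with MTP=2)")
```

The recomputed gains match the table to one decimal place.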
| Engine | MTP Config | tok/s | Notes |
|---|---|---|---|
| vLLM | None | 70-86 | Multiple users, stable |
| vLLM | MTP=2 | 130 | malaiwah |
| vLLM | MTP=5 | 150-250 | Unstable at long context |
| SGLang | None | 42-51 | kcramp, Ixtrix |
| SGLang | MTP=3 (EAGLE) | 85 | Unstable |
| SGLang | MTP=3 (NEXTN) | ~130 | boo's config |
| Configuration | 0 Context | 100K Context | 200K Context |
|---|---|---|---|
| NVFP4, no MTP | 35-44 tok/s | -- | -- |
| NVFP4 + MTP (NEXTN) | 97-105 tok/s | 60-80 tok/s | ~50 tok/s |
- 3 running requests with MTP: 133-135 tok/s total generation throughput
MTP with num_speculative_tokens > 3 causes illegal memory access errors, especially at long context.
Error:
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Workaround: Keep num_speculative_tokens at 2 (stable) or 3 (mostly stable). PR #35615 partially fixes this.
When MTP is enabled with tool_choice='required', the model often outputs XML tool calls instead of JSON:
<tool_call>
<function>ask_wiki</function>
<parameters>{"question": "..."}</parameters>
</tool_call>

Impact: 50-70% of tool calls fail.
Fix: PR #35936. Using tool_choice='auto' can handle both XML and JSON formats.
Key insight: "if thinking is false, even with mtp there is no problem" -- the combination of thinking mode + MTP causes the most tool call issues.
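Until the fix is available, a client can tolerate both formats on its side. The following is a hedged sketch of such a fallback parser, not what PR #35936 does: `parse_tool_call` is a hypothetical helper, the JSON shape (`name` and `arguments` keys) is an assumption, and the regular expression covers only the XML shape shown above.

```python
import json
import re
from typing import Optional

# Matches the XML-style tool call shown above (assumed to be the only variant).
_XML_CALL = re.compile(
    r"<tool_call>\s*<function>(?P<name>.*?)</function>\s*"
    r"<parameters>(?P<args>.*?)</parameters>\s*</tool_call>",
    re.DOTALL,
)


def parse_tool_call(text: str) -> Optional[dict]:
    """Return {"name": ..., "arguments": {...}}, or None if nothing parses."""
    text = text.strip()
    # 1) Try the JSON form first (assumed keys: "name", "arguments").
    try:
        obj = json.loads(text)
        if isinstance(obj, dict) and "name" in obj:
            return {"name": obj["name"], "arguments": obj.get("arguments", {})}
    except json.JSONDecodeError:
        pass
    # 2) Fall back to the XML form.
    match = _XML_CALL.search(text)
    if match is None:
        return None
    try:
        arguments = json.loads(match.group("args"))
    except json.JSONDecodeError:
        return None
    return {"name": match.group("name").strip(), "arguments": arguments}


xml_output = (
    "<tool_call>\n<function>ask_wiki</function>\n"
    '<parameters>{"question": "What is MTP?"}</parameters>\n</tool_call>'
)
print(parse_tool_call(xml_output))
```

Returning `None` instead of raising lets the caller decide whether to retry the request or surface the raw text.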
Without SGLANG_ENABLE_SPEC_V2=True, SGLang converts NEXTN to EAGLE and attempts to load the model twice, causing OOM.
MTP heads were trained on censored content. Using MTP with abliterated (CARVE) models produces inconsistent behavior.
EAGLE V2 speculative decoding crashes when a request hits the radix cache prefix. A NaN in the logits, caused by a flashinfer CUTLASS race condition, propagates to the verify step.
Error:
eagle_worker_v2.py:510 _zero_fill_draft_kv_for_cached_prefix
torch.AcceleratorError: CUDA error: device-side assert triggered
Fix: SGLang PR #19897. Root cause is the flashinfer CUTLASS race condition (flashinfer#2708).
vLLM PR #35581 fixes the MTP fused kernel, providing a ~6% throughput boost. The fix is a one-line change that can be applied with a sed one-liner.
Qwen3.5:
- MTP=2 is the production-safe setting on vLLM
- MTP=5 works with PR #35615 but crashes at long context
- SGLang uses the NEXTN algorithm with --speculative-num-steps 3
- --speculative-draft-model-quantization unquant may be needed for some checkpoints
GLM-5:
- MTP layer is layer 78, kept in BF16 precision (~19 GB)
- Use festr2/GLM-5-NVFP4-MTP (includes MTP weights)
- lukealonso/GLM-5-NVFP4 does NOT include MTP weights
- --moe-runner-backend cutlass is fastest for MTP on SM120
- FP8 MTP is possible but decreases accept rate; only useful for high-throughput batch scenarios
Kimi K2.5:
- No native MTP support
- EAGLE3 is the speculative decoding path (in development)
- Draft model: AQ-MedAI/Kimi-K2-Instruct-eagle3
- Currently, speculative decoding is "strictly worse" in vLLM for this model
MiniMax-M2.5:
- No MTP or speculative decoding support
- Single-stream speed is 74-100 tok/s without speculation
| Model | Engine | Spec Method | Flags | Expected Speedup |
|---|---|---|---|---|
| Qwen3.5 | vLLM | MTP | '{"method":"mtp","num_speculative_tokens":2}' | +50-55% |
| Qwen3.5 | SGLang | NEXTN | --speculative-algorithm NEXTN --speculative-num-steps 3 | +50-100% |
| GLM-5 | SGLang | NEXTN | --speculative-algorithm NEXTN --speculative-num-steps 3 | ~2x |
| Kimi K2.5 | SGLang | EAGLE3 | --speculative-algorithm EAGLE (+ draft model) | WIP |
| Kimi K2.5 | vLLM | EAGLE3 | PR #35966 | WIP |
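The vLLM rows pass their settings as a JSON string, which is easy to misquote when launch commands are assembled by a script. A small sketch that builds the flag programmatically; `vllm_speculative_flag` is a hypothetical helper, and only the `method` and `num_speculative_tokens` keys shown in this document are used.

```python
import json
import shlex


def vllm_speculative_flag(method: str, num_speculative_tokens: int) -> str:
    """Build a shell-safe --speculative-config argument for a vLLM launch command."""
    config = {"method": method, "num_speculative_tokens": num_speculative_tokens}
    # Compact separators keep the JSON free of spaces; shlex.quote adds the
    # single quotes a POSIX shell needs around it.
    payload = json.dumps(config, separators=(",", ":"))
    return "--speculative-config " + shlex.quote(payload)


flag = vllm_speculative_flag("mtp", 2)
print(flag)
```

For `("mtp", 2)` this produces the same string as the first row of the table.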