Does MTP (Multi-Token Prediction) speculative decoding affect model accuracy? We ran GPQA Diamond and GSM8K with and without MTP across two NVFP4 checkpoints on two inference engines (SGLang and vLLM) to find out.
- MTP does not degrade quality: no statistically significant difference on either engine (Welch t-test, p>0.05 for all pairs).
- MTP makes inference 18-24% faster on SGLang, a free speedup.
- The inference engine can matter: nvidia NVFP4 on vLLM (88.53%) scores comparably to lukealonso NVFP4 on SGLang (88.28%), though the difference is not significant.
- The lukealonso checkpoint outperforms the nvidia checkpoint on SGLang across all benchmarks.
Test Environment
SGLang
| Parameter | Value |
|---|---|
| GPU | 8x NVIDIA RTX PRO 6000 Blackwell Server Edition (98GB each) |
For 8-repeat tests, each repeat was run sequentially (not in parallel) to avoid overloading the server.

Known warning (NVFP4 only):

```
DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0.
This might cause accuracy degradation on Blackwell.
```
GPQA Diamond Results
SGLang Results
Per-Run Scores
| Run | lukealonso MTP | lukealonso No MTP | nvidia MTP | nvidia No MTP |
|---|---|---|---|---|
| 1 | 88.9 | 86.4 | 85.9 | 86.4 |
| 2 | 87.9 | 87.4 | 90.4 | 85.9 |
| 3 | 86.9 | 89.4 | 85.9 | 86.9 |
| 4 | 88.4 | 86.9 | 88.4 | 86.4 |
| 5 | 90.4 | 88.9 | 87.9 | 84.8 |
| 6 | 88.4 | 86.4 | 85.9 | 86.4 |
| 7 | 87.9 | 87.4 | 87.4 | 86.9 |
| 8 | 87.4 | 87.4 | 87.9 | 88.9 |
Aggregate Statistics (SGLang)
| Metric | lukealonso MTP | lukealonso No MTP | nvidia MTP | nvidia No MTP |
|---|---|---|---|---|
| Mean | 88.28% | 87.53% | 87.46% | 86.58% |
| Std | 1.06 | 1.09 | 1.57 | 1.15 |
| Min | 86.9 | 86.4 | 85.9 | 84.8 |
| Max | 90.4 | 89.4 | 90.4 | 88.9 |
| Wall time | ~1h 29m | ~1h 48m | ~1h 43m | ~2h 15m |
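The 18-24% speedup headline follows directly from the wall-time row of the aggregate table. A quick check (times hard-coded from the table, rounded to whole minutes):

```python
# Wall times in minutes, taken from the SGLang aggregate table.
runs = {
    "lukealonso": {"mtp": 89, "no_mtp": 108},   # ~1h 29m vs ~1h 48m
    "nvidia":     {"mtp": 103, "no_mtp": 135},  # ~1h 43m vs ~2h 15m
}

for name, t in runs.items():
    saved = 1 - t["mtp"] / t["no_mtp"]  # fraction of wall time saved by MTP
    print(f"{name}: {saved:.1%} less wall time with MTP")
```

This yields roughly 17.6% for lukealonso and 23.7% for nvidia, i.e. the 18-24% range quoted above (measured as time saved relative to the no-MTP run).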
MTP Impact (SGLang)
| Checkpoint | MTP ON | MTP OFF | Delta | t-stat | p-value |
|---|---|---|---|---|---|
| lukealonso | 88.28% | 87.53% | +0.75pp | 1.41 | >0.05 (ns) |
| nvidia | 87.46% | 86.58% | +0.88pp | 1.28 | >0.05 (ns) |
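The Welch statistics above can be reproduced from the per-run scores. A minimal stdlib-only sketch for the lukealonso pair (small rounding differences from the table's t-stat are expected):

```python
from statistics import mean, variance

# Per-run GPQA scores copied from the SGLang per-run table.
lukealonso_mtp    = [88.9, 87.9, 86.9, 88.4, 90.4, 88.4, 87.9, 87.4]
lukealonso_no_mtp = [86.4, 87.4, 89.4, 86.9, 88.9, 86.4, 87.4, 87.4]

def welch_t(a, b):
    """Welch's t statistic for two samples with unequal variances."""
    va, vb = variance(a), variance(b)  # sample variances (n-1 denominator)
    return (mean(a) - mean(b)) / ((va / len(a) + vb / len(b)) ** 0.5)

t = welch_t(lukealonso_mtp, lukealonso_no_mtp)
print(f"delta = {mean(lukealonso_mtp) - mean(lukealonso_no_mtp):+.2f}pp, t = {t:.2f}")
```

With 8 samples per arm, a t value this small is well inside run-to-run noise, which is what the ">0.05 (ns)" column reports.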
vLLM Results
Per-Run Scores
| Run | nvidia MTP | nvidia No MTP |
|---|---|---|
| 1 | 91.9 | 87.4 |
| 2 | 87.9 | 86.4 |
| 3 | 89.4 | 86.9 |
| 4 | 85.9 | 86.9 |
| 5 | 86.4 | 85.9 |
| 6 | 88.4 | 85.9 |
| 7 | 89.9 | 89.4 |
| 8 | 88.4 | 86.4 |
Aggregate Statistics (vLLM)
| Metric | nvidia MTP | nvidia No MTP |
|---|---|---|
| Mean | 88.53% | 86.90% |
| Std | 1.92 | 1.13 |
| Min | 85.9 | 85.9 |
| Max | 91.9 | 89.4 |
MTP Impact (vLLM)
| Checkpoint | MTP ON | MTP OFF | Delta | t-stat | p-value |
|---|---|---|---|---|---|
| nvidia | 88.53% | 86.90% | +1.62pp | 2.06 | >0.05 (ns) |
The +1.62pp delta on vLLM is larger than SGLang's +0.88pp for the same checkpoint, but still not statistically significant (p≈0.06, just above the α=0.05 threshold). The wider spread is partly driven by one high outlier (91.9%) in the MTP ON runs.
Cross-Engine Comparison
| Configuration | Engine | GPQA Mean | Std |
|---|---|---|---|
| nvidia NVFP4 + MTP | vLLM | 88.53% | 1.92 |
| lukealonso NVFP4 + MTP | SGLang | 88.28% | 1.06 |
| lukealonso NVFP4, no MTP | SGLang | 87.53% | 1.09 |
| nvidia NVFP4 + MTP | SGLang | 87.46% | 1.57 |
| nvidia NVFP4, no MTP | vLLM | 86.90% | 1.13 |
| nvidia NVFP4, no MTP | SGLang | 86.58% | 1.15 |
Statistical significance (Welch t-test, all pairs):
| Comparison | Delta | t-stat | Significant? |
|---|---|---|---|
| nvidia vLLM+MTP vs nvidia SGLang+MTP | +1.06pp | 1.21 | No (p>0.05) |
| nvidia vLLM+MTP vs lukealonso SGLang+MTP | +0.25pp | 0.29 | No (p>0.05) |
| nvidia vLLM-MTP vs nvidia SGLang-MTP | +0.31pp | 0.61 | No (p>0.05) |
| nvidia vLLM+MTP vs nvidia vLLM-MTP | +1.62pp | 2.06 | No (p>0.05) |
| lukealonso SGLang+MTP vs nvidia SGLang+MTP | +0.81pp | 1.21 | No (p>0.05) |
No pair reaches statistical significance. All configurations produce GPQA scores in the 86-89% range, and with 8 repeats the test lacks power to distinguish differences below ~2pp.
GSM8K Results
With thinking mode
| Model | Engine | MTP | Score | Config |
|---|---|---|---|---|
| lukealonso NVFP4 | SGLang | ON | 99.0% | 200 examples, max-tokens 16000 |
| nvidia NVFP4 | vLLM | OFF | 98.5% | 200 examples, max-tokens 16000 |
| nvidia NVFP4 | SGLang | ON | 97.5% | 200 examples, max-tokens 16000 |
nvidia on vLLM without MTP (98.5%) outperforms nvidia on SGLang with MTP (97.5%), again suggesting the inference engine matters.
Without thinking (5-shot, SGLang only)
| Model | Score | Config |
|---|---|---|
| lukealonso | 44% | 200 examples, max-tokens 2048 |
| nvidia | 39% | 200 examples, max-tokens 2048 |
Without chain-of-thought reasoning, the quantization quality gap is much more pronounced (+5pp).
Hard Math Test
19 custom math questions, no thinking mode. SGLang only (vLLM server crashed before Hard Math could run).
1. MTP does not degrade quality
Neither SGLang nor vLLM shows a statistically significant quality loss with MTP enabled. The verification step in speculative decoding is designed to preserve the target model's outputs, and MTP also cut SGLang wall time by 18-24%, making it effectively a free speedup.
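The fidelity claim follows from how verification works. A toy greedy-decoding sketch (stand-in hash functions for the models, not the SGLang/vLLM implementation) shows why the emitted tokens are identical with or without a draft head: the target model recomputes its own token at every position, and drafted tokens only decide how many positions can be committed per step.

```python
def target_next(ctx):
    # Stand-in for the target model's greedy argmax over the context.
    return (sum(ctx) * 31 + 7) % 100

def draft_next(ctx):
    # Stand-in for the cheaper MTP draft head: right on most contexts,
    # deliberately wrong (returns 0) on the rest.
    guess = (sum(ctx) * 31 + 7) % 100
    return guess if sum(ctx) % 3 else 0

def generate_speculative(first, n, k=4):
    ctx = [first]
    while len(ctx) < n + 1:
        # 1) Draft k tokens autoregressively with the cheap head.
        draft = [draft_next(ctx)]
        for _ in range(k - 1):
            draft.append(draft_next(ctx + draft))
        # 2) Verify: always commit the target's own greedy token; a drafted
        #    token only survives if it matches, so output never changes.
        for tok in draft:
            expected = target_next(ctx)
            ctx.append(expected)
            if tok != expected:
                break  # reject the rest of the draft, redraft from here
    return ctx[:n + 1]

def generate_target_only(first, n):
    ctx = [first]
    while len(ctx) < n + 1:
        ctx.append(target_next(ctx))
    return ctx

assert generate_speculative(3, 30) == generate_target_only(3, 30)
```

A worse draft head only lowers the acceptance rate (less speedup); it cannot change what is generated. The same argument extends to sampling, where rejection sampling in the verifier preserves the target model's output distribution.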
2. Inference engine choice can matter as much as quantization
nvidia NVFP4 on vLLM (88.53%) scores comparably to lukealonso NVFP4 on SGLang (88.28%) on GPQA. nvidia on vLLM GSM8K (98.5%) also outperforms nvidia on SGLang (97.5%). The differences are not statistically significant, but the trend suggests engine-level differences in numerics, scheduling, or attention implementation may influence results.
3. lukealonso outperforms nvidia on SGLang
On SGLang, lukealonso/Qwen3.5-397B-A17B-NVFP4 consistently outperforms nvidia/Qwen3.5-397B-A17B-NVFP4 across all benchmarks (+0.8pp to +5.3pp). The advantage is especially pronounced without thinking mode. This aligns with KLD measurements (0.035 vs 0.109, see KLD evaluation) and community reports (vLLM Issue #36094).