blog/2025-07-17-mtp.md
Lines changed: 3 additions & 4 deletions
@@ -28,7 +28,6 @@ MTP works by dividing the generation into two stages:
 -**Drafting:** The lightweight draft model predicts one or more short sequence candidate(s) of n tokens in a single fast pass. Here we use one sequence candidate as an example.
 (1) *“Today is a sunny”* is the current prefix produced by the target model.
 (2) *“day” is first generated by the target model's extend/prefill stage.*
-
 (3) *“and” is the first draft token generated by the draft model's extend/prefill stage.*
 (4) *“it’s so hot” are the three extra draft tokens generated by the draft model decoding iterations; In the example case, n=4 for “and it’s so hot”.*
 
@@ -76,7 +75,7 @@ The small-scale deployment configuration was selected based on production requir
 In this scenario, we deploy two decoding nodes across a total of 16 H200 GPUs, running 2 concurrent requests per rank with input sequence length of 65,536 tokens and output sequence length of 4,096 tokens. As baseline, we tested the case with no MTP and no overlap scheduling, and the system achieves an output throughput of 51 tokens/sec per rank. Using overlap scheduling alone, a feature introduced in SGLang v0.4, we achieved 60.4 tokens/sec per rank, meeting the production threshold without the need for MTP.
 When MTP is enabled, the system significantly surpasses this benchmark:
 * With a 3-token MTP window and topk=1, the system achieves a throughput of 81.5 tokens/sec per rank, with an average acceptance length of 2.18 tokens.
-*With a 4-token MTP window and topk=1, throughput increases to 82.0 tokens/sec per rank, with an average acceptance length of 2.44 tokens.
+*With a 4-token MTP window and topk=1, throughput increases to 82.0 tokens/sec per rank, with an average acceptance length of 2.44 tokens.
@@ -117,9 +116,9 @@ You can monitor acceptance rates in logs to fine-tune this parameter over time.
 
 We would like to express our heartfelt gratitude to the following teams and collaborators. In particular, we extend our sincere thanks to the NVIDIA DGX Cloud team for providing powerful GPUs and for their exceptional support in ensuring operational excellence:
 
-**Eigen AI Team** - Jinglei Cheng, Jiaqi Gu, Yipin Guo, Di Jin, Uill Liu, Zhijian Liu, Zilin Shen, Ryan Hanrui Wang, Wei-Chen Wang, Junyao Zhang and many others.
+**Eigen AI Team** - Jinglei Cheng, Yipin Guo, Zilin Shen, Ryan Hanrui Wang, Wei-Chen Wang, Junyao Zhang and many others.
 
-**SGLang Team and Community** - Kavio Yu, Qiaolin Yu, Boxin Zhang, Shangming Cai, Jinfu Deng, Yineng Zhang and many others.
+**SGLang Team and Community** - Kavio Yu, Qiaolin Yu, Boxin Zhang, Shangming Cai, Jinfu Deng, Jiaqi Gu, Di Jin, Uill Liu, Yineng Zhang and many others.
 
 **xAI Team** - Sehoon Kim, Ying Sheng, Lianmin Zheng, Sangbin Cho, Hanming Lu, Byron Hsu, Pranjal Shankhdhar, Cheng Wan and many others.