Commit 8a75c04

new mtp update (#165)
1 parent c4b20ed commit 8a75c04

1 file changed: +3 -4 lines changed

blog/2025-07-17-mtp.md

Lines changed: 3 additions & 4 deletions
@@ -28,7 +28,6 @@ MTP works by dividing the generation into two stages:
 -**Drafting:** The lightweight draft model predicts one or more short sequence candidate(s) of n tokens in a single fast pass. Here we use one sequence candidate as an example.
 (1) *“Today is a sunny”* is the current prefix produced by the target model.
 (2) *“day” is first generated by the target model's extend/prefill stage.*
-
 (3) *“and” is the first draft token generated by the draft model's extend/prefill stage.*
 (4) *“it’s so hot” are the three extra draft tokens generated by the draft model decoding iterations; In the example case, n=4 for “and it’s so hot”.*

@@ -76,7 +75,7 @@ The small-scale deployment configuration was selected based on production requir
 In this scenario, we deploy two decoding nodes across a total of 16 H200 GPUs, running 2 concurrent requests per rank with input sequence length of 65,536 tokens and output sequence length of 4,096 tokens. As baseline, we tested the case with no MTP and no overlap scheduling, and the system achieves an output throughput of 51 tokens/sec per rank. Using overlap scheduling alone, a feature introduced in SGLang v0.4, we achieved 60.4 tokens/sec per rank, meeting the production threshold without the need for MTP.
 When MTP is enabled, the system significantly surpasses this benchmark:
 * With a 3-token MTP window and topk=1, the system achieves a throughput of 81.5 tokens/sec per rank, with an average acceptance length of 2.18 tokens.
-*With a 4-token MTP window and topk=1, throughput increases to 82.0 tokens/sec per rank, with an average acceptance length of 2.44 tokens.
+* With a 4-token MTP window and topk=1, throughput increases to 82.0 tokens/sec per rank, with an average acceptance length of 2.44 tokens.

 ![Small-scale throughput graph](/images/blog/mtp/small_scale_throughput_hr.png)

@@ -117,9 +116,9 @@ You can monitor acceptance rates in logs to fine-tune this parameter over time.

 We would like to express our heartfelt gratitude to the following teams and collaborators. In particular, we extend our sincere thanks to the NVIDIA DGX Cloud team for providing powerful GPUs and for their exceptional support in ensuring operational excellence:

-**Eigen AI Team** - Jinglei Cheng, Jiaqi Gu, Yipin Guo, Di Jin, Uill Liu, Zhijian Liu, Zilin Shen, Ryan Hanrui Wang, Wei-Chen Wang, Junyao Zhang and many others.
+**Eigen AI Team** - Jinglei Cheng, Yipin Guo, Zilin Shen, Ryan Hanrui Wang, Wei-Chen Wang, Junyao Zhang and many others.

-**SGLang Team and Community** - Kavio Yu, Qiaolin Yu, Boxin Zhang, Shangming Cai, Jinfu Deng, Yineng Zhang and many others.
+**SGLang Team and Community** - Kavio Yu, Qiaolin Yu, Boxin Zhang, Shangming Cai, Jinfu Deng, Jiaqi Gu, Di Jin, Uill Liu, Yineng Zhang and many others.

 **xAI Team** - Sehoon Kim, Ying Sheng, Lianmin Zheng, Sangbin Cho, Hanming Lu, Byron Hsu, Pranjal Shankhdhar, Cheng Wan and many others.

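Review note: the context lines of the second hunk quote four throughput figures for the small-scale scenario (51, 60.4, 81.5 and 82.0 tokens/sec per rank). The short Python sketch below only works out the relative gains those figures imply. The names `BASELINE`, `CONFIGS` and `speedup` are illustrative and do not come from the post or from SGLang.

```python
# Throughput figures (tokens/sec per rank) as quoted in the post's
# small-scale scenario. Names below are illustrative, not from SGLang.
BASELINE = 51.0  # no MTP, no overlap scheduling

CONFIGS = {
    "overlap scheduling only": 60.4,
    "MTP, 3-token window, topk=1": 81.5,
    "MTP, 4-token window, topk=1": 82.0,
}


def speedup(throughput: float, baseline: float = BASELINE) -> float:
    """Return the throughput ratio relative to the baseline."""
    return throughput / baseline


if __name__ == "__main__":
    for name, tput in CONFIGS.items():
        print(f"{name}: {speedup(tput):.2f}x over baseline")
```

Taken at face value, the two MTP configurations come out at roughly 1.6x the baseline, and moving from a 3-token to a 4-token window adds only about 0.5 tokens/sec per rank.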