Description
Hi, thanks for your amazing work.
I am experiencing significant performance degradation when using speculative decoding (SD) on H20. I observed a drop in throughput in the sglang metric log and a noticeable increase in forward time (I have not yet rerun the experiments with the base model, but that should only affect the response length and accept rate). Below is a description of the situation I encountered:
- The machine has H20 GPUs and runs the Qwen2.5-7B-Instruct model.
Experiment 1:
- TP=2, strategy fixed at 2_2_2.

| Batchsize | forward_time_w_sd | forward_time_wo_sd |
| --- | --- | --- |
| 1 | 0.00845 s | 0.0039 s |
| 6 | 0.00885 s | 0.0045 s |
| 12 | 0.00899 s | 0.0049 s |

- It was observed that after enabling SD, the forward time became abnormally long, and the increased forward time overshadowed the benefit from the accept rate.
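To make the trade-off concrete, here is a rough per-token cost estimate based on the batchsize=1 numbers above (the mean accept length is a hypothetical placeholder, not a measured value):

```python
# Back-of-the-envelope per-output-token latency with and without SD.
# Forward times are the measured batchsize=1 values from the table above;
# the mean accept length is a made-up placeholder.
forward_time_wo_sd = 0.0039    # s per decode step, SD disabled
forward_time_w_sd = 0.00845    # s per decode step, SD enabled
mean_accept_length = 1.8       # hypothetical tokens accepted per step

cost_wo_sd = forward_time_wo_sd                      # 1 token per step
cost_w_sd = forward_time_w_sd / mean_accept_length   # tokens amortized per step

print(f"per-token cost without SD: {cost_wo_sd * 1000:.2f} ms")  # ~3.90 ms
print(f"per-token cost with SD:    {cost_w_sd * 1000:.2f} ms")   # ~4.69 ms
# With these numbers SD is still slower; the mean accept length would need
# to exceed 0.00845 / 0.0039 ~= 2.2 tokens per step just to break even.
```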
Experiment 2:
- TP=2, strategy fixed at 1_1_1, meaning no additional verify tokens are added.

| Batchsize | draft_time | verify_time | forward_draft_extend_after_decode | total_forward_time |
| --- | --- | --- | --- | --- |
| 4 | 0.00034 s | 0.00526 s | 0.00089 s | 0.00649 s |

- It can be observed that even without adding extra verify tokens (i.e., the target model only processes 4 tokens), the verify time is still greater than the wo_sd time for batchsize=6 (0.0045 s).
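For reference, the total forward time here is simply the sum of the three SD stages in the table, and verify dominates it:

```python
# Decomposition of the SD step time for batchsize=4 (values from the table above).
draft_time = 0.00034          # draft model forward
verify_time = 0.00526         # target model verify
draft_extend_time = 0.00089   # draft model extend after decode

total = draft_time + verify_time + draft_extend_time
print(f"total SD step time: {total:.5f} s")        # ~0.00649 s
print(f"verify share: {verify_time / total:.0%}")  # ~81%
# Even though verify dominates, it alone (5.26 ms) already exceeds the plain
# decode forward time without SD (4.5 ms at batchsize=6).
```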
I suspect the cause of this issue is that when SD is enabled, the overlap schedule cannot be activated, so CPU overhead starts to affect performance. Therefore, I also ran experiments with the overlap schedule disabled.
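For context, this is roughly how I configure the runs; a minimal sketch using the offline Engine API, where the kwarg names follow sglang's ServerArgs in my local version (they may differ in yours), I assume an EAGLE-style draft, and the draft model path is a placeholder:

```python
import sglang as sgl

# Sketch of the SD configuration used in the experiments above.
# Kwarg names mirror sglang's ServerArgs; draft model path is a placeholder.
engine = sgl.Engine(
    model_path="Qwen/Qwen2.5-7B-Instruct",
    speculative_algorithm="EAGLE",                        # assumed draft algorithm
    speculative_draft_model_path="/path/to/draft-model",  # placeholder
    speculative_num_steps=2,          # the "2_2_2" strategy from Experiment 1
    speculative_eagle_topk=2,
    speculative_num_draft_tokens=2,
    tp_size=2,
    disable_overlap_schedule=True,    # Experiment 3 disables the overlap schedule
)
```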
Experiment 3:
- TP=1, strategy fixed at 1_1_1, overlap schedule disabled.

The table below shows the target model forward time.

| Batchsize | w_sd_wo_overlap | wo_sd_wo_overlap |
| --- | --- | --- |
| 4 | 0.00613 s | 0.0063 s |
- The times are similar, which supports the conclusion that enabling SD prevents the overlap schedule from being activated, and this is what causes the performance degradation.
Additionally, I found that enabling SD introduces other overheads, such as the draft model forward, verify, and the draft model extend after decode, which add to the overall step time. For example, with batchsize=4 under the above settings, the total forward time is 0.00869 s, meaning that overheads other than the target model forward also occupy a significant portion of the forward time.
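From the numbers above, the share of the step spent outside the target model forward is roughly:

```python
# Share of the SD step spent outside the target model forward
# (batchsize=4, Experiment 3 settings; values quoted above).
total_forward_time = 0.00869    # s, full SD step
target_forward_time = 0.00613   # s, target model forward only (w_sd_wo_overlap)

other_overhead = total_forward_time - target_forward_time
print(f"non-target overhead: {other_overhead * 1000:.2f} ms "
      f"({other_overhead / total_forward_time:.0%} of the step)")
# ~2.56 ms, i.e. roughly 29% of the step goes to the draft forward,
# verify bookkeeping, and draft extend after decode.
```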
Conclusion: On H20, enabling SD introduces excessive CPU overhead, leading to substantial performance degradation.
Do you have any plans to support SD with overlap schedule in the future?