Significant Performance Degradation with Speculative Decoding on H20 GPUs #7

@zzhbrr

Description

Hi, thanks for your amazing work.

I am experiencing significant performance degradation when using speculative decoding (SD) on H20 GPUs. In the sglang metric log I observed a drop in throughput and a noticeable increase in forward time (I have not yet rerun the experiments with the base model, but that should only affect the response length and accept rate). Below is a description of the situation I encountered:

  • The machine has H20 GPUs and runs the Qwen2.5-7B-Instruct model.

  • Experiment 1:

    • TP=2, Strategy fixed at 2_2_2.
    • Batch size   forward_time_w_sd   forward_time_wo_sd
      1            0.00845 s           0.0039 s
      6            0.00885 s           0.0045 s
      12           0.00899 s           0.0049 s
    • After enabling SD, the forward time becomes abnormally long, and the extra forward time outweighs the benefit from the accept rate (a sketch of how such per-step forward times can be measured is included after this list).
  • Experiment 2:

    • TP=2, Strategy fixed at 1_1_1, meaning no additional verify tokens are added.
    • Batch size   draft_time   verify_time   forward_draft_extend_after_decode   total_forward_time
      4            0.00034 s    0.00526 s     0.00089 s                           0.00649 s
    • Even without extra verify tokens (i.e., the target model processes only 4 tokens), the verify time is still longer than the wo_sd forward time at batch size 6 (0.0045 s).
  • I suspect the cause is that when SD is enabled, the overlap schedule cannot be activated, so CPU overhead is no longer hidden and degrades performance. To check this, I ran experiments with the overlap schedule disabled.

  • Experiment 3:

    • TP=1, strategy fixed at 1_1_1, overlap schedule disabled (see the launch sketch after this list).

    • The table below shows the target model forward time.

      • Batch size   w_sd_wo_overlap   wo_sd_wo_overlap
        4            0.00613 s         0.0063 s
    • The times are similar, which supports the conclusion that enabling SD prevents the overlap schedule from being activated, and this causes the performance degradation.

    • Additionally, enabling SD introduces other overheads, such as the draft model forward, verification, and the draft model extend after decode, which further extend the overall time. For example, with batch size 4 under the above settings, the total forward time is 0.00869 s, so overheads other than the target model forward also occupy a significant portion of the forward time.

  • Conclusion: On H20, enabling SD introduces excessive CPU overhead, leading to substantial performance degradation.

  • Do you have any plans to support SD together with the overlap schedule in the future?
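
The per-step numbers above come from sglang's metric log. For reference, here is a minimal, hedged sketch of how such a per-step forward time can be measured around an arbitrary forward call; `model` and `batch` are hypothetical placeholders, not sglang internals, and the point is only that CUDA synchronization is needed so the wall-clock time reflects completed GPU work rather than just kernel launches.

```python
import time

import torch


def timed_forward(model, batch, n_warmup: int = 5, n_iters: int = 50) -> float:
    """Average wall-clock time of one forward step.

    `model` and `batch` are hypothetical placeholders, not sglang internals.
    """
    # Warm-up iterations exclude compilation and allocator effects.
    for _ in range(n_warmup):
        model(batch)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(n_iters):
        model(batch)
    # Wait for all queued kernels so the clock reflects finished GPU work.
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters
```

Measured this way, any CPU-side gap between steps that is not hidden behind GPU execution (for example when the overlap schedule is off) shows up directly in the per-step time.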
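
For concreteness, below is a hedged sketch of a launch configuration for the Experiment 3 setting (TP=1, strategy 1_1_1, overlap schedule disabled). The mapping of the x_y_z strategy onto sglang's speculative-decoding flags, the choice of the EAGLE algorithm, and the draft model path are assumptions for illustration, and exact flag names may differ between sglang versions.

```python
import subprocess

# Hypothetical launch for the SD run in Experiment 3. The x_y_z strategy is assumed
# to map to (--speculative-num-steps, --speculative-eagle-topk,
# --speculative-num-draft-tokens); the draft model path is a placeholder.
cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "Qwen/Qwen2.5-7B-Instruct",
    "--tp-size", "1",
    "--speculative-algorithm", "EAGLE",
    "--speculative-draft-model-path", "<draft-model-path>",  # placeholder
    "--speculative-num-steps", "1",
    "--speculative-eagle-topk", "1",
    "--speculative-num-draft-tokens", "1",
    # Keep the overlap schedule off so the non-SD baseline (same command without
    # the speculative flags) is directly comparable to the SD run.
    "--disable-overlap-schedule",
]
subprocess.run(cmd, check=True)
```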
