Significant Performance Degradation with Speculative Decoding on H20 GPUs #7

@zzhbrr

Description

Hi, thanks for your amazing work.

I am experiencing significant performance degradation when using speculative decoding (SD) on H20 GPUs. In the sglang metric log I observed a drop in throughput and a noticeable increase in forward time (I have not yet rerun the experiments with the base model, but that should only affect the response length and accept rate). Below is a description of the situation I encountered:

  • The machine has H20 GPUs and runs the Qwen2.5-7B-Instruct model.

  • Experiment 1:

    • TP=2, Strategy fixed at 2_2_2.
    • Batch size   forward_time_w_sd   forward_time_wo_sd
      1            0.00845 s           0.0039 s
      6            0.00885 s           0.0045 s
      12           0.00899 s           0.0049 s
    • After enabling SD, the forward time becomes abnormally long, and the extra forward time outweighs the benefit from the accept rate (a sketch of how such per-step forward times can be measured is included after this list).
  • Experiment 2:

    • TP=2, Strategy fixed at 1_1_1, meaning no additional verify tokens are added.
    • Batch size   draft_time   verify_time   forward_draft_extend_after_decode   total_forward_time
      4            0.00034 s    0.00526 s     0.00089 s                           0.00649 s
    • Even without extra verify tokens (i.e., the target model processes only 4 tokens), the verify time is still longer than the wo_sd forward time at batch size 6 (0.0045 s).
  • I suspect the cause is that when SD is enabled, the overlap schedule cannot be activated, so CPU overhead is no longer hidden and degrades performance. To check this, I ran experiments with the overlap schedule disabled.

  • Experiment 3:

    • TP=1, strategy fixed at 1_1_1, overlap schedule disabled (see the launch sketch after this list).

    • The table below shows the target model forward time.

      • Batch size   w_sd_wo_overlap   wo_sd_wo_overlap
        4            0.00613 s         0.0063 s
    • The times are similar, which supports the conclusion that enabling SD prevents the overlap schedule from being activated, and this causes the performance degradation.

    • Additionally, enabling SD introduces other overheads, such as the draft model forward, verification, and the draft model extend after decode, which further extend the overall time. For example, with batch size 4 under the above settings, the total forward time is 0.00869 s, so overheads other than the target model forward also occupy a significant portion of the forward time.

  • Conclusion: On H20, enabling SD introduces excessive CPU overhead, leading to substantial performance degradation.

  • Do you have any plans to support SD together with the overlap schedule in the future?
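
The per-step numbers above come from sglang's metric log. For reference, here is a minimal, hedged sketch of how such a per-step forward time can be measured around an arbitrary forward call; `model` and `batch` are hypothetical placeholders, not sglang internals, and the point is only that CUDA synchronization is needed so the wall-clock time reflects completed GPU work rather than just kernel launches.

```python
import time

import torch


def timed_forward(model, batch, n_warmup: int = 5, n_iters: int = 50) -> float:
    """Average wall-clock time of one forward step.

    `model` and `batch` are hypothetical placeholders, not sglang internals.
    """
    # Warm-up iterations exclude compilation and allocator effects.
    for _ in range(n_warmup):
        model(batch)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(n_iters):
        model(batch)
    # Wait for all queued kernels so the clock reflects finished GPU work.
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters
```

Measured this way, any CPU-side gap between steps that is not hidden behind GPU execution (for example when the overlap schedule is off) shows up directly in the per-step time.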
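
For concreteness, below is a hedged sketch of a launch configuration for the Experiment 3 setting (TP=1, strategy 1_1_1, overlap schedule disabled). The mapping of the x_y_z strategy onto sglang's speculative-decoding flags, the choice of the EAGLE algorithm, and the draft model path are assumptions for illustration, and exact flag names may differ between sglang versions.

```python
import subprocess

# Hypothetical launch for the SD run in Experiment 3. The x_y_z strategy is assumed
# to map to (--speculative-num-steps, --speculative-eagle-topk,
# --speculative-num-draft-tokens); the draft model path is a placeholder.
cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "Qwen/Qwen2.5-7B-Instruct",
    "--tp-size", "1",
    "--speculative-algorithm", "EAGLE",
    "--speculative-draft-model-path", "<draft-model-path>",  # placeholder
    "--speculative-num-steps", "1",
    "--speculative-eagle-topk", "1",
    "--speculative-num-draft-tokens", "1",
    # Keep the overlap schedule off so the non-SD baseline (same command without
    # the speculative flags) is directly comparable to the SD run.
    "--disable-overlap-schedule",
]
subprocess.run(cmd, check=True)
```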
