
[Question] Contiguous Grouped Gemm for CUDA Graph #235

@yuhyao

Description

Hi DeepSeek team,

I noticed that DeepEP provides an optional num_worst_tokens argument for CUDA Graph usage. Based on that, I assume the contiguous Grouped GEMM is also intended to support CUDA Graph. However, there seems to be no input parameter for passing the actual m of the current workload. While we can set m_indices to -1 for the padded tokens, performance degrades significantly because is_computation_valid only skips the compute blocks, whereas the TMA loads are still issued for those padded regions.
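
For concreteness, the setup I'm describing looks roughly like the sketch below (the tensor names, sizes, and token counts are illustrative assumptions on my side, not DeepGEMM's actual API):

```python
import torch

# Minimal sketch of the padding scheme, assuming every step is padded to a fixed
# worst case so the captured CUDA graph sees static shapes, and the padded tail
# of m_indices is marked with -1.

num_groups = 4            # experts routed to on this rank (illustrative)
num_worst_tokens = 4096   # static upper bound used for CUDA-graph capture
hidden = 7168

# Actual per-expert token counts for this step (dynamic, sums to < num_worst_tokens).
tokens_per_group = torch.tensor([1024, 512, 2048, 256], device='cuda')
m_actual = int(tokens_per_group.sum())

# Contiguous activation buffer padded to the worst case.
lhs = torch.empty(num_worst_tokens, hidden, device='cuda', dtype=torch.bfloat16)

# m_indices: group id for each valid row, -1 for the padded tail.
m_indices = torch.full((num_worst_tokens,), -1, device='cuda', dtype=torch.int32)
m_indices[:m_actual] = torch.repeat_interleave(
    torch.arange(num_groups, device='cuda', dtype=torch.int32), tokens_per_group)

# The padded rows (m_indices == -1) are the ones I would like the kernel to skip
# entirely, TMA loads included, rather than only the math guarded by
# is_computation_valid.
```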

Did I misunderstand something here, or is there a plan to address this limitation?

Thanks!
