[Question] Contiguous Grouped Gemm for CUDA Graph

Hi DeepSeek team,

I noticed that DeepEP provides an optional `num_worst_tokens` argument for CUDA Graph usage. Based on that, I assume the contiguous Grouped GEMM is also intended to support CUDA Graph. However, there seems to be no input parameter for passing the actual (valid) `m` of the workload. We can set `m_indices` to -1 for the padded tokens, but performance still degrades significantly: `is_computation_valid` only skips the compute for those blocks, while the TMA loads are still issued for the padded regions.
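To make the setup concrete, here is a minimal pure-Python sketch of the layout I have in mind. It does not call DeepGEMM or DeepEP; the helper names (`build_m_indices`, `count_padding_blocks`), the 128-row alignment, and the rule that each expert's rows are padded up to that alignment are my assumptions for illustration, not the library's API.

```python
# Hypothetical illustration only: plain Python, no DeepGEMM/DeepEP calls.
ALIGN_M = 128  # assumed per-group M alignment / M block size


def ceil_to(x: int, align: int) -> int:
    """Round x up to the next multiple of align."""
    return (x + align - 1) // align * align


def build_m_indices(tokens_per_expert, m_worst, align=ALIGN_M):
    """Build m_indices for a fixed-size buffer of m_worst rows.

    Each expert's rows are padded with -1 up to a multiple of `align`,
    and the tail of the buffer (up to m_worst) is filled with -1.
    Returns (m_indices, m_valid), where m_valid is the number of rows
    actually covered by expert groups (including alignment padding).
    """
    m_indices = []
    for expert, n in enumerate(tokens_per_expert):
        m_indices.extend([expert] * n)
        m_indices.extend([-1] * (ceil_to(n, align) - n))
    m_valid = len(m_indices)
    if m_valid > m_worst:
        raise ValueError("workload exceeds the worst-case buffer size")
    m_indices.extend([-1] * (m_worst - m_valid))
    return m_indices, m_valid


def count_padding_blocks(m_indices, block_m=ALIGN_M):
    """Count M blocks that contain only -1 rows (pure padding)."""
    blocks = (m_indices[i:i + block_m] for i in range(0, len(m_indices), block_m))
    return sum(1 for block in blocks if all(v == -1 for v in block))


# Two experts with 100 and 300 tokens in a 1024-row worst-case buffer.
m_indices, m_valid = build_m_indices([100, 300], m_worst=1024)
print(m_valid)                          # 512
print(count_padding_blocks(m_indices))  # 4
```

In this example half of the M blocks are pure padding. As far as I can tell, a kernel that only skips the compute for such blocks would still load them, whereas knowing `m_valid` up front would let it stop after the fourth block.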

Did I misunderstand something here, or is there a plan to address this limitation?

Thanks!

[Question] Contiguous Grouped Gemm for CUDA Graph #235