Hi DeepSeek team,
I noticed that DeepEP provides an optional num_worst_tokens argument for CUDA Graph usage. Based on that, I assume the contiguous Grouped GEMM is also intended to support CUDA Graph. However, there seems to be no input parameter for passing the valid m of the actual workload. We can set m_indices to -1 for padded tokens, but performance degrades significantly: is_computation_valid only skips the compute blocks, while the TMA loads are still issued for the padded regions.
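To make the setup concrete, here is a minimal pure-Python sketch of the padded layout I mean. It does not call DeepEP or the GEMM kernel. The helper names build_m_indices and count_blocks, the alignment of 128 rows, and the example token counts are my own assumptions for illustration only.

```python
def build_m_indices(tokens_per_expert, num_worst_tokens, alignment=128):
    """Build a fixed-length per-row expert index list.

    Rows owned by expert e hold e; every padding row holds -1.
    The alignment value is an assumption, not taken from the library.
    """
    m_indices = []
    for expert, count in enumerate(tokens_per_expert):
        # Round each expert's row count up to a multiple of `alignment`.
        padded = -(-count // alignment) * alignment
        m_indices += [expert] * count + [-1] * (padded - count)
    if len(m_indices) > num_worst_tokens:
        raise ValueError("worst-case buffer is too small for this workload")
    # Tail padding keeps the shape static across CUDA Graph replays.
    m_indices += [-1] * (num_worst_tokens - len(m_indices))
    return m_indices


def count_blocks(m_indices, block_m=128):
    """Return (blocks containing real rows, blocks that are pure padding)."""
    blocks = [m_indices[i:i + block_m] for i in range(0, len(m_indices), block_m)]
    valid = sum(any(x >= 0 for x in block) for block in blocks)
    return valid, len(blocks) - valid


# Hypothetical workload: two experts with 100 and 30 tokens, buffer of 1024 rows.
m_indices = build_m_indices([100, 30], num_worst_tokens=1024)
valid_blocks, padded_blocks = count_blocks(m_indices)
print(valid_blocks, padded_blocks)
```

In this example only 2 of the 8 row blocks contain real tokens. If my reading is right, the remaining 6 blocks skip the math but still pay for their loads, which is the overhead a valid m input could avoid.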
Did I misunderstand something here, or is there a plan to address this limitation?
Thanks!