Skip to content

Conversation

@petercad
Copy link

Currently make_block_2d_copy_C always selects a block 2D store, but we really need separate APIs for C (loads) and D (stores). The default value type can also be different, in case of MMA atoms with different C/D types.

This PR introduces C/D APIs. Some APIs are common between C/D, and are named make_block_2d_copy_CD to avoid duplication.

@tdeng5 tdeng5 added the urgent PR requires a urgent attention (for release or blocking another PR) label Oct 22, 2025
rolandschulz pushed a commit that referenced this pull request Oct 22, 2025
Modify 00_bmg_gemm to include new mma and copy atoms
(#477).
00_bmg_gemm combines two parts: mma and epilogue. To add new atom
changes, we need to update both parts since they currently use old
atoms. As starting we will:

> Keep CollectiveEpilogue unchanged for now
> Only modify CollectiveMma first

Old Atom:

Problem Size: 5120x4096x4096x1
Cutlass GEMM Performance:     [96.448]TFlop/s  (1.7813)ms

New Atom:

Problem Size: 5120x4096x4096x1
Cutlass GEMM Performance: [97.259]TFlop/s (1.7664)ms

Also depend on new copy_c/copy_d apis for load/store
#572

---------

Co-authored-by: Anamika Chatterjee <[email protected]>
@rolandschulz rolandschulz enabled auto-merge (squash) October 22, 2025 23:08
Copy link

@sanchitintel sanchitintel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#573 reported some performance regressions when this API is used instead of manually selecting copy atoms (or compared to the legacy API) .

But performance in examples/cute/tutorial/xe_gemm.cpp example wasn't adversely affected by these changes, so the performance degradation observed in #573 is likely to be unrelated to this PR (except if the perf degradation is due to loading the canonical C matrix in #573, but if beta is 0, would the compiler still load C? Not sure about that part).

@rolandschulz rolandschulz merged commit 33cae04 into main Oct 23, 2025
2 of 11 checks passed
@sanchitintel
Copy link

#540's commit is newer than the base commit of this PR's branch. #540 had some changes that conflict with this one, so CI is failing.

rolandschulz pushed a commit that referenced this pull request Oct 23, 2025
#540 landed first.
#572 had an older base commit than #540's commit when it was merged.
Resolving conflict
@sanchitintel
Copy link

sanchitintel commented Oct 23, 2025

In retrospect, my reasoning about make_block_2d_copy_C or make_block_2d_copy_D not affecting epilogue performance is incorrect - examples/cute/tutorial/xe_gemm.hpp performs worse than examples/00_bmg_gemm/00_bmg_gemm.hpp, which, in turn, uses legacy copy atoms to load C (not sure it's loaded when beta is 0, though) & D. So, I shouldn't have drawn that conclusion.

Although make_block_2d_copy_C or make_block_2d_copy_D may need some tuning, that seems orthogonal to this PR.

EDIT: Some other dtypes perform well with float output, so maybe these APIs don't need further tuning? Not sure

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

urgent PR requires a urgent attention (for release or blocking another PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants