[CuTe] [Xe] Separate make_block_2d_copy_{C,D} APIs for loads/stores #572
Conversation
Modify 00_bmg_gemm to include new mma and copy atoms (#477).

00_bmg_gemm combines two parts: mma and epilogue. To add the new atom changes, we need to update both parts, since they currently use the old atoms. As a starting point, we will:
> Keep CollectiveEpilogue unchanged for now
> Only modify CollectiveMma first

Old Atom:
Problem Size: 5120x4096x4096x1
Cutlass GEMM Performance: [96.448]TFlop/s (1.7813)ms

New Atom:
Problem Size: 5120x4096x4096x1
Cutlass GEMM Performance: [97.259]TFlop/s (1.7664)ms

Also depends on the new copy_c/copy_d APIs for load/store (#572).

---------

Co-authored-by: Anamika Chatterjee <[email protected]>
#573 reported some performance regressions when this API is used instead of manually selecting copy atoms (or compared to the legacy API).
However, performance in the examples/cute/tutorial/xe_gemm.cpp example wasn't adversely affected by these changes, so the degradation observed in #573 is likely unrelated to this PR. The one exception would be if the degradation is due to loading the canonical C matrix in #573. But if beta is 0, would the compiler still load C? Not sure about that part.
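The open question about beta can be made concrete with a small host-side model. This is only a sketch under stated assumptions: `CountingSource`, `epilogue_unguarded`, and `epilogue_guarded` are hypothetical names invented here, not part of CUTLASS or CuTe, and the model says nothing about what the SYCL compiler does in #573. It only illustrates that, when beta is a runtime value, the load of C happens unless the code guards it explicitly (under IEEE-754 rules `0 * NaN` is NaN, so without fast-math style flags the multiply by C generally cannot be dropped on its own).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper that counts how many elements of C are read.
struct CountingSource {
    const std::vector<float>& data;
    std::size_t loads = 0;
    float load(std::size_t i) { ++loads; return data[i]; }
};

// D = alpha * acc + beta * C, with C always loaded.
inline void epilogue_unguarded(std::vector<float>& d, const std::vector<float>& acc,
                               CountingSource& c, float alpha, float beta) {
    for (std::size_t i = 0; i < d.size(); ++i) {
        d[i] = alpha * acc[i] + beta * c.load(i);
    }
}

// Same epilogue, but the C load is skipped when beta is exactly zero.
inline void epilogue_guarded(std::vector<float>& d, const std::vector<float>& acc,
                             CountingSource& c, float alpha, float beta) {
    const bool needs_c = (beta != 0.0f);  // runtime check, hoisted out of the loop
    for (std::size_t i = 0; i < d.size(); ++i) {
        d[i] = needs_c ? alpha * acc[i] + beta * c.load(i) : alpha * acc[i];
    }
}
```

With beta set to 0, the unguarded version still reads every element of C, while the guarded version reads none. Whether the kernel in #573 has such a guard would need to be checked in its epilogue.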
In retrospect, my reasoning about [...] Although [...] EDIT: Some other dtypes perform well with float output, so maybe these APIs don't need further tuning? Not sure.
Currently `make_block_2d_copy_C` always selects a block 2D store, but we really need separate APIs for C (loads) and D (stores). The default value type can also be different, in the case of MMA atoms with different C/D types.

This PR introduces C/D APIs. Some APIs are common between C/D, and are named `make_block_2d_copy_CD` to avoid duplication.
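The split described above can be sketched with a toy model. Everything in this block is a hypothetical stand-in written for illustration (`Load2D`, `Store2D`, `ToyMma`, `TiledCopy2D`, `make_copy_C`, `make_copy_D`, `make_copy_CD`); none of it is the CuTe API, and the real signatures in this PR may differ. The sketch only shows the shape of the design: a C factory that defaults to a load operation and the atom's C type, a D factory that defaults to a store operation and the atom's D type, and one shared helper that both forward to.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical operation tags standing in for block 2D load/store copy ops.
struct Load2D  { static constexpr bool is_store = false; };
struct Store2D { static constexpr bool is_store = true;  };

// Toy MMA atom whose C (input) and D (output) value types differ.
struct ToyMma {
    using ValTypeC = float;
    using ValTypeD = short;  // stand-in for a narrower output type
};

// Toy result type recording the chosen operation and value type.
template <class Op, class Value>
struct TiledCopy2D {
    using Operation = Op;
    using ValueType = Value;
    int rows;
    int cols;
};

// Shared helper: the logic common to C and D lives here once.
template <class Value, class Op, class Mma>
constexpr TiledCopy2D<Op, Value> make_copy_CD(Op, Mma, int rows, int cols) {
    return {rows, cols};
}

// C is read, so default to a load and to the atom's C value type.
template <class Mma>
constexpr auto make_copy_C(Mma mma, int rows, int cols) {
    return make_copy_CD<typename Mma::ValTypeC>(Load2D{}, mma, rows, cols);
}

// D is written, so default to a store and to the atom's D value type.
template <class Mma>
constexpr auto make_copy_D(Mma mma, int rows, int cols) {
    return make_copy_CD<typename Mma::ValTypeD>(Store2D{}, mma, rows, cols);
}
```

In this model, a caller that needs a non-default operation or value type would call the shared helper directly, which is one plausible reason to expose a common CD entry point alongside the two convenience wrappers.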