-
Notifications
You must be signed in to change notification settings - Fork 76
Improve GEMM perf when one matrix is transposed #2347
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
third_party/intel/lib/TritonIntelGPUTransforms/MaterializeBlockPointer.cpp
Outdated
Show resolved
Hide resolved
|
@alexbaden |
Follows the Triton llvm debug syntax. Allows you to dump various parameters when running `triton-opt` with `-debug`. cc #2347
Per the Triton slack, `order` is unused on architecture below Hopper. But more importantly, order provides information that stride already has. In fact, order can be completely different from stride (i.e. wrong) and we still generate correct code. I think it is better to use the stride assuming the logic I added here makes sense. Note this depends on #2348, I'd like to land the debug logging separately, so we have it even if we decide to modify this approach. It was very useful in debugging this problem. cc #2347
63cb029 to
835e8a0
Compare
whitneywhtsang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preference is to remove RewriteTensorPointer pass after #2181 is landed.
Would that work for your test case?
|
Yes, this change is only to keep rewrite tensor pointer from removing the blocked load. |
835e8a0 to
4705bde
Compare
|
Merging #2181 resolved the fused attention issue because the transposed load is lowered to a gather load instead of failing to lower to 2D blocked load. |
|
I think that removing the If we land PR #2359 this PR can be rebased presumably and the LIT test kept. |
|
@chengjunlu @whitneywhtsang I modified the RewriteTensorPointer pass to remove
main: matrix performance is unchanged |
third_party/intel/lib/TritonIntelGPUTransforms/RewriteTensorPointer.cpp
Outdated
Show resolved
Hide resolved
third_party/intel/lib/TritonIntelGPUTransforms/RewriteTensorPointer.cpp
Outdated
Show resolved
Hide resolved
whitneywhtsang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking that we could check the blockio attribute of the users of the MakeTensorPtrOp, that way we don't need to maintain a copy of the same condition checks in RewriteTensorPointer. WDYT?
Interesting - I can try it. But let's make it a separate PR. I want to show some progress on the original transpose issue, and I am not 100% confident that I understand the logic well enough to handle all cases, so it will need some testing/review. |
Sure, let's do that in a separate PR. |
The 2D block load/store does not work when one of the input matrices to a
tt.dotis transposed inside the Triton kernel using thestrideparameter. In the user example, the block pointer is transposed using stride but theorderparameter is left unchanged. This results inmaterialize-block-pointerbeing unable to detect that ablock_ioattributecolumn-majorshould be added to the matrix. Even if this attribute were added,rewrite-tensor-pointerwould remove the block pointer because column major was not supported.This PR adds support for detecting
column-majorbased onstrideinstead oforderand also brings the same logic torewrite-tensor-pointerto allow for the column major load to be preserved and eventually lowered to a 2D block load. With this, transpose matrix performance is more inline with the non-transposed version:I know we have plans to remove
rewrite-tensor-pointereventually, but column major support did not appear difficult, and the performance is nearly 3x better.I am planning to PR each commit individually starting with the debug logging, but wanted to open this umbrella PR to show how the entire pipeline fits together.
Close #1795