[TritonIntelGPUToLLVM] Detect sub-group transpose convert_layout cases
#2511
Conversation
…linear layout

Detect sub-group transpose cases as those in which warp and lane dimensions get swapped and no transfer within block-groups is needed. Use sub-group write operations to store the contents in local memory and vector operations to read the data back. These will be translated to non-transposed stores and transposed loads, respectively. As data is moved within sub-groups, no barriers are needed.

For now, handle only the case of a single `sub_group_size^2` block being transposed. This may be split in the future by performing `N*M` iterations for matrices of size `N*sub_group_size x M*sub_group_size`.

Signed-off-by: victor-eds <[email protected]>
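To make the detection condition concrete, here is a minimal standalone C++ sketch. It is illustrative only (the `Owner` struct and the hard-coded 16-lane layouts are hypothetical, not the PR's LinearLayout code): it maps every element of a 16x16 tile to its (register, lane, warp) owner under the source and destination layouts and checks that only the register and lane roles swap while the warp assignment is unchanged, which is exactly the situation where the conversion stays inside a sub-group and needs no barrier.

```cpp
#include <cstdio>

// Illustrative model of the detection condition, not the PR's actual code.
struct Owner {
  int reg, lane, warp;
};

// Source layout: each lane owns one column
// (sizePerThread=[16,1], threadsPerWarp=[1,16], one warp).
Owner srcOwner(int row, int col) { return {row, col, 0}; }

// Destination layout: each lane owns one row
// (sizePerThread=[1,16], threadsPerWarp=[16,1], one warp).
Owner dstOwner(int row, int col) { return {col, row, 0}; }

int main() {
  bool isSubGroupTranspose = true;
  for (int r = 0; r < 16; ++r)
    for (int c = 0; c < 16; ++c) {
      Owner s = srcOwner(r, c);
      Owner d = dstOwner(r, c);
      isSubGroupTranspose &= s.warp == d.warp &&  // no cross-warp movement
                             s.reg == d.lane &&   // register index becomes lane index
                             s.lane == d.reg;     // lane index becomes register index
    }
  printf("sub-group transpose detected: %s\n",
         isSubGroupTranspose ? "yes" : "no");
  return 0;
}
```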
@hwnam831 can you please check that my linear layout knowledge is on point?

Nam has finished his internship, and he is back to school, so he may not be able to check this PR.

Part of #2266.
LGTM.

whitneywhtsang left a comment:

Is this change upstreamable?
I don't think so. Not sure if the performance benefit translates to NVIDIA too.

Performance evaluation is ongoing.
Let's use the following code to compare the generated ASM:

```mlir
#blocked = #triton_gpu.blocked<{sizePerThread = [16, 1], threadsPerWarp = [1, 16], warpsPerCTA = [1, 1], order = [0, 1]}>
#blocked1 = #triton_gpu.blocked<{sizePerThread = [1, 16], threadsPerWarp = [16, 1], warpsPerCTA = [1, 1], order = [0, 1]}>
module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 1 : i32, triton_gpu.target = "xpu", "triton_gpu.threads-per-warp" = 16 : i32} {
  tt.func @test_f16(%arg0: tensor<16x16xf16, #blocked>, %arg1: tensor<16x16x!tt.ptr<f16>, #blocked1>) {
    %0 = triton_gpu.convert_layout %arg0 : tensor<16x16xf16, #blocked> -> tensor<16x16xf16, #blocked1>
    tt.store %arg1, %0 : tensor<16x16x!tt.ptr<f16>, #blocked1>
    tt.return
  }
}
```

Before this patch, we did not detect that these were within-warp transposes, so we just generated basic stores and loads to SLM, whereas the optimized case leverages more efficient instructions. This also translates to less pointer arithmetic and fewer ALU instructions overall. Note that the arithmetic instruction count should be way lower (I don't have an estimate), but the base case reuses the indices computed for SLM access for UGM access. Also, as we know data won't be shared across warps, we simply do not need synchronization.

Update: the optimized transpose code fits in a single screenshot.
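For intuition on why no synchronization is needed, here is a minimal scalar C++ sketch of the data movement (a hypothetical model, not the generated code or the actual sub-group intrinsics): iteration `i` of a sub-group block write stores one value per lane at consecutive local-memory addresses, and each lane then reads a contiguous chunk back, so data only ever moves between lanes of the same sub-group.

```cpp
#include <array>
#include <cstdio>

// Hypothetical model of a 16-lane sub-group transposing a 16x16 tile through
// a scratch "SLM" buffer; not the code emitted by the pass.
constexpr int kSubGroupSize = 16;

int main() {
  // regs[lane][i] models element i held by work-item `lane`.
  // Initially lane L owns column L of the tile (value = row * 16 + col).
  std::array<std::array<int, kSubGroupSize>, kSubGroupSize> regs{};
  for (int lane = 0; lane < kSubGroupSize; ++lane)
    for (int i = 0; i < kSubGroupSize; ++i)
      regs[lane][i] = i * kSubGroupSize + lane;

  // Step 1: sub-group block writes. Iteration i stores one value per lane at
  // consecutive addresses, so row i of the tile becomes contiguous in SLM.
  std::array<int, kSubGroupSize * kSubGroupSize> slm{};
  for (int i = 0; i < kSubGroupSize; ++i)
    for (int lane = 0; lane < kSubGroupSize; ++lane)
      slm[i * kSubGroupSize + lane] = regs[lane][i];

  // Step 2: each lane reads back a contiguous 16-element chunk (a vector
  // load); lane L now owns row L. All traffic stays within the sub-group, so
  // no barrier separates the two steps.
  for (int lane = 0; lane < kSubGroupSize; ++lane)
    for (int i = 0; i < kSubGroupSize; ++i)
      regs[lane][i] = slm[lane * kSubGroupSize + i];

  printf("lane 3 now holds row 3, starting at %d\n", regs[3][0]); // prints 48
  return 0;
}
```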
An interesting thing is that the assembly shows the store is transposed and the load is not transposed, while the comments in the Triton code say there is a non-transposed store and a transposed load. Any background for this?

Yes: bad comments :) I will update them later. Thanks for pointing that out!

Detect sub-group transpose cases as those in which warp and lane dimensions in the resulting linear layouts get swapped and no transfer within block-groups is needed. Use sub-group write operations to store the contents in local memory and vector operations to read the data back. These will be translated to non-transposed stores and transposed loads, respectively. As data is moved within sub-groups, no barriers are needed.

For now, handle only the case of a single `sub_group_size^2` block being transposed.

Instead of using the `triton_gen` operation directly, we could be using a higher-level operation, thus moving legalization to a common place, as we are also legalizing this same operation in other places in our codebase.

This may be split in the future by performing `N*M` iterations for matrices of size `N*sub_group_size x M*sub_group_size`.
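As a rough illustration of that future split (a hypothetical sketch of the tiling structure, not planned code), an `(N*sub_group_size) x (M*sub_group_size)` transpose decomposes into `N*M` independent `sub_group_size^2` tile transposes, each of which is the single-tile case handled by this PR:

```cpp
#include <cassert>
#include <vector>

constexpr int kSg = 16; // sub-group size

// Transpose one kSg x kSg tile of a rows x cols matrix: element (r, c) of the
// source tile lands at (c, r) of the destination tile, and the tile
// coordinates themselves are swapped in the transposed (cols x rows) output.
void transposeTile(const std::vector<int> &src, std::vector<int> &dst,
                   int rows, int cols, int tileRow, int tileCol) {
  for (int r = 0; r < kSg; ++r)
    for (int c = 0; c < kSg; ++c)
      dst[(tileCol * kSg + c) * rows + (tileRow * kSg + r)] =
          src[(tileRow * kSg + r) * cols + (tileCol * kSg + c)];
}

int main() {
  const int N = 2, M = 3; // matrix is (N*kSg) x (M*kSg)
  const int rows = N * kSg, cols = M * kSg;
  std::vector<int> src(rows * cols), dst(rows * cols);
  for (int i = 0; i < rows * cols; ++i)
    src[i] = i;

  // N*M iterations, one single sub_group_size^2 block transpose each.
  for (int tr = 0; tr < N; ++tr)
    for (int tc = 0; tc < M; ++tc)
      transposeTile(src, dst, rows, cols, tr, tc);

  // dst is the transpose of src: dst[c * rows + r] == src[r * cols + c].
  assert(dst[5 * rows + 1] == src[1 * cols + 5]);
  return 0;
}
```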