Make isExpensiveLoadOrStore consider blocked pointer loads and stores #2570
Conversation
Signed-off-by: Tiotto, Ettore <[email protected]>
// CHECK: %[[VAL_41:.*]]:3 = scf.for %{{.*}} = %{{.*}} to %{{.*}} step %{{.*}} iter_args(%{{.*}} = %{{.*}}, %{{.*}} = %[[VAL_36]], %{{.*}} = %[[VAL_40]]) -> (tensor<64x256xf32, #[[DPAS]]>, !tt.ptr<tensor<64x32xf16, #triton_gpu.dot_op<{opIdx = 0, parent = #[[DPAS]], kWidth = 2}>>>, !tt.ptr<tensor<32x256xf16, #triton_gpu.dot_op<{opIdx = 1, parent = #[[DPAS]], kWidth = 2}>>>) : i32 {
// CHECK: %[[VAL_46:.*]] = tt.load %{{.*}} {boundaryCheck = array<i32: 0, 1>} : !tt.ptr<tensor<64x32xf16, #triton_gpu.dot_op<{opIdx = 0, parent = #[[DPAS]], kWidth = 2}>>>
// CHECK: %[[VAL_47:.*]] = tt.load %{{.*}} {boundaryCheck = array<i32: 0, 1>} : !tt.ptr<tensor<32x256xf16, #triton_gpu.dot_op<{opIdx = 1, parent = #[[DPAS]], kWidth = 2}>>>
// CHECK: %[[VAL_46:.*]] = tt.load %{{.*}} {boundaryCheck = array<i32: 0, 1>, triton_intel_gpu.block_io = "row_major"} : !tt.ptr<tensor<64x32xf16, #triton_gpu.dot_op<{opIdx = 0, parent = #[[DPAS]], kWidth = 2}>>>
Note: adding triton_intel_gpu.block_io is consistent with our optimization pipeline (in our pipeline this attribute is added before the second invocation of RemoveLayoutConversion).
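For context, here is a minimal sketch of how a pass such as MaterializeBlockPointer could tag a load with this attribute. The helper name markBlockIO and its surrounding code are illustrative assumptions, not this PR's actual diff; only the attribute name and its "row_major"/"column_major" values come from the test checks above.

```cpp
#include "mlir/IR/BuiltinAttributes.h"
#include "triton/Dialect/Triton/IR/Dialect.h"

using namespace mlir;

// Hypothetical helper: annotate a tt.load with the block-IO hint that later
// lowering can use to emit 2D block load instructions.
static void markBlockIO(triton::LoadOp loadOp, bool isRowMajor) {
  MLIRContext *ctx = loadOp.getContext();
  loadOp->setAttr("triton_intel_gpu.block_io",
                  StringAttr::get(ctx, isRowMajor ? "row_major"
                                                  : "column_major"));
}
```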
%1 = tt.get_program_id y : i32
// CHECK: %[[VAL_0:.*]] = tt.make_tensor_ptr {{.*}} : <tensor<256x32xbf16, #triton_gpu.dot_op<{opIdx = 0, parent = #[[$DPAS]], kWidth = 2}>>>
// CHECK: %[[VAL_1:.*]] = tt.make_tensor_ptr {{.*}} : <tensor<32x256xbf16, #triton_gpu.dot_op<{opIdx = 1, parent = #[[$DPAS]], kWidth = 2}>>>
// CHECK: %[[VAL_0:.*]] = tt.make_tensor_ptr {{.*}} : <tensor<256x32xbf16, {{.*}}>>
The actual layout is not important in these tests.
@chengjunlu can you do a code review please?

LGTM.
The isExpensiveLoadOrStore function (third_party/intel/lib/TritonIntelGPUTransforms/Utility.cpp) fails to consider block pointers and consequently always returns false for load (and store) operations that use a block pointer. In turn, this causes the RemoveLayoutConversion pass to never consider loads using block pointers as anchor operations.

This PR changes isExpensiveLoadOrStore so that block pointer loads are properly recognized. The RemoveLayoutConversion pass is then able to consider those loads as anchor operations and preserve their layout.
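A minimal sketch of the recognition logic, assuming standard MLIR/Triton C++ APIs; the helper name isBlockPointerLoadOrStore is illustrative and not necessarily the PR's exact code:

```cpp
#include "mlir/IR/Operation.h"
#include "triton/Dialect/Triton/IR/Dialect.h"

using namespace mlir;

// Hypothetical helper: returns true for loads/stores whose pointer operand is
// a Triton block pointer, so callers can treat them as layout anchors.
static bool isBlockPointerLoadOrStore(Operation *op) {
  Value ptr;
  if (auto loadOp = dyn_cast<triton::LoadOp>(op))
    ptr = loadOp.getPtr();
  else if (auto storeOp = dyn_cast<triton::StoreOp>(op))
    ptr = storeOp.getPtr();
  else
    return false;
  // A block pointer is !tt.ptr<tensor<...>>, i.e. a pointer whose pointee is
  // a ranked tensor, as produced by tt.make_tensor_ptr.
  auto ptrTy = dyn_cast<triton::PointerType>(ptr.getType());
  return ptrTy && isa<RankedTensorType>(ptrTy.getPointeeType());
}
```

With a check like this in place, isExpensiveLoadOrStore can return true for block pointer accesses, which is what lets RemoveLayoutConversion keep their layouts anchored.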
Because RemoveLayoutConversion is invoked at several points in the optimization pipeline, the change in third_party/intel/lib/TritonIntelGPUTransforms/Utility.cpp alone causes performance degradation in a couple of GEMM-like benchmarks, specifically when operand A of tl.dot is transposed and when the input of tl.dot is first fed into an exponential. These two performance degradations have been fixed by enhancing the MaterializeBlockPointer and MatmulLoopPipeline optimizations so that they can retrieve the dot layout of block pointer loads transitively from their users (in those benchmarks, the blocked layout of block pointer loads is transitively converted to a dot layout).
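A minimal sketch of that transitive lookup, under the assumption that the dot layout is reached through triton_gpu.convert_layout users; the helper name findDotOperandEncoding is hypothetical:

```cpp
#include "mlir/IR/Value.h"
#include "triton/Dialect/TritonGPU/IR/Dialect.h"

using namespace mlir;

// Hypothetical helper: returns the DotOperandEncodingAttr the given value is
// eventually converted to, or nullptr if no (transitive) user converts it to
// a dot operand layout.
static triton::gpu::DotOperandEncodingAttr findDotOperandEncoding(Value val) {
  for (Operation *user : val.getUsers()) {
    if (auto cvt = dyn_cast<triton::gpu::ConvertLayoutOp>(user)) {
      auto tensorTy = cast<RankedTensorType>(cvt.getType());
      if (auto dotEnc = dyn_cast<triton::gpu::DotOperandEncodingAttr>(
              tensorTy.getEncoding()))
        return dotEnc;
      // Keep walking through intermediate layout conversions.
      if (auto enc = findDotOperandEncoding(cvt.getResult()))
        return enc;
    }
  }
  return nullptr;
}
```

Recursing through intermediate convert_layout ops matters for the benchmarks described above, where the load's blocked layout only reaches a dot layout after one or more conversions.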