@etiotto etiotto commented Oct 24, 2024

The isExpensiveLoadOrStore function (third_party/intel/lib/TritonIntelGPUTransforms/Utility.cpp) fails to consider block pointers and consequently always returns false for load (and store) operations that use a block pointer.
In turn, this causes the RemoveLayoutConversion pass to never consider loads using block pointers as anchor operations.

This PR changes isExpensiveLoadOrStore so that block pointer loads can be properly recognized. The RemoveLayoutConversion pass is then able to consider those loads as anchor operations and preserve their layout.
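
As a rough illustration of the behaviour change described above, here is a minimal toy model in plain C++. It is not the code in Utility.cpp: the types `PtrKind` and `ToyMemOp`, both function names, and the element-count heuristic are hypothetical stand-ins, and treating every block pointer access as expensive is an assumption made for the sketch.

```cpp
#include <cstdint>

// Hypothetical classification of the pointer operand of a load/store.
enum class PtrKind { Scalar, TensorOfPointers, BlockPointer };

// Hypothetical stand-in for a load/store operation; not an MLIR class.
struct ToyMemOp {
  PtrKind ptrKind;
  std::int64_t numElements; // elements accessed by the operation
  std::int64_t numThreads;  // threads cooperating on the access
};

// Models the reported problem: only tensors of pointers are inspected,
// so a block pointer access falls through and is reported as cheap.
bool isExpensiveBefore(const ToyMemOp &op) {
  if (op.ptrKind == PtrKind::TensorOfPointers)
    return op.numElements >= op.numThreads; // placeholder heuristic
  return false;
}

// Models the fix: block pointer accesses are recognized explicitly.
// Returning true unconditionally is an assumption of this sketch.
bool isExpensiveAfter(const ToyMemOp &op) {
  if (op.ptrKind == PtrKind::BlockPointer)
    return true;
  return isExpensiveBefore(op);
}
```

In this model a 64x32 block pointer load is reported as cheap by `isExpensiveBefore` and as expensive by `isExpensiveAfter`, which is the condition a layout-propagation pass would need in order to treat it as an anchor.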

Because RemoveLayoutConversion is invoked at several points in the optimization pipeline, the change in third_party/intel/lib/TritonIntelGPUTransforms/Utility.cpp alone causes performance degradation in a couple of GEMM-like benchmarks, specifically when operand A of tl.dot is transposed and when the input of tl.dot is first fed into an exponential.

These two performance degradations have been fixed by enhancing the MaterializeBlockPointer and MatmulLoopPipeline optimizations so that they can retrieve the dot layout of a block pointer load transitively from its users (in those benchmarks the blocked layout of block pointer loads is transitively converted to a dot layout).
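
The idea of retrieving a dot layout transitively from the users of a load can be sketched with a small toy graph. This is an illustrative model only, not the logic in MaterializeBlockPointer or MatmulLoopPipeline: `ToyOp`, its fields, and `findDotLayoutFromUsers` are hypothetical names, and a dot-operand layout is reduced here to its operand index.

```cpp
#include <optional>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for an IR operation; not an MLIR class.
struct ToyOp {
  std::string name;                 // e.g. "tt.load", "math.exp"
  std::optional<int> dotOpIdx;      // set when the result has a dot-operand layout
  std::vector<const ToyOp *> users; // operations consuming the result
};

// Depth-first walk starting at `op` and continuing through its users.
// Returns the operand index of the first dot-operand layout reached,
// or std::nullopt when no such layout is reachable.
std::optional<int>
findDotLayoutFromUsers(const ToyOp &op,
                       std::unordered_set<const ToyOp *> &visited) {
  if (!visited.insert(&op).second)
    return std::nullopt; // already explored on another path
  if (op.dotOpIdx)
    return op.dotOpIdx;
  for (const ToyOp *user : op.users)
    if (auto idx = findDotLayoutFromUsers(*user, visited))
      return idx;
  return std::nullopt;
}

// Convenience overload that owns the visited set.
std::optional<int> findDotLayoutFromUsers(const ToyOp &op) {
  std::unordered_set<const ToyOp *> visited;
  return findDotLayoutFromUsers(op, visited);
}
```

For a chain such as load, exponential, layout conversion, dot, the walk starts at the load with a blocked layout and stops at the conversion whose result carries the dot-operand layout. A load whose only user is a store yields no layout.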

etiotto added 28 commits October 9, 2024 16:43
Signed-off-by: Tiotto, Ettore <[email protected]>
@etiotto etiotto self-assigned this Oct 24, 2024
// CHECK: %[[VAL_41:.*]]:3 = scf.for %{{.*}} = %{{.*}} to %{{.*}} step %{{.*}} iter_args(%{{.*}} = %{{.*}}, %{{.*}} = %[[VAL_36]], %{{.*}} = %[[VAL_40]]) -> (tensor<64x256xf32, #[[DPAS]]>, !tt.ptr<tensor<64x32xf16, #triton_gpu.dot_op<{opIdx = 0, parent = #[[DPAS]], kWidth = 2}>>>, !tt.ptr<tensor<32x256xf16, #triton_gpu.dot_op<{opIdx = 1, parent = #[[DPAS]], kWidth = 2}>>>) : i32 {
// CHECK: %[[VAL_46:.*]] = tt.load %{{.*}} {boundaryCheck = array<i32: 0, 1>} : !tt.ptr<tensor<64x32xf16, #triton_gpu.dot_op<{opIdx = 0, parent = #[[DPAS]], kWidth = 2}>>>
// CHECK: %[[VAL_47:.*]] = tt.load %{{.*}} {boundaryCheck = array<i32: 0, 1>} : !tt.ptr<tensor<32x256xf16, #triton_gpu.dot_op<{opIdx = 1, parent = #[[DPAS]], kWidth = 2}>>>
// CHECK: %[[VAL_46:.*]] = tt.load %{{.*}} {boundaryCheck = array<i32: 0, 1>, triton_intel_gpu.block_io = "row_major"} : !tt.ptr<tensor<64x32xf16, #triton_gpu.dot_op<{opIdx = 0, parent = #[[DPAS]], kWidth = 2}>>>

Note: adding triton_intel_gpu.block_io is consistent with our optimization pipeline, where this attribute is added before the second invocation of RemoveLayoutConversion.

%1 = tt.get_program_id y : i32
// CHECK: %[[VAL_0:.*]] = tt.make_tensor_ptr {{.*}} : <tensor<256x32xbf16, #triton_gpu.dot_op<{opIdx = 0, parent = #[[$DPAS]], kWidth = 2}>>>
// CHECK: %[[VAL_1:.*]] = tt.make_tensor_ptr {{.*}} : <tensor<32x256xbf16, #triton_gpu.dot_op<{opIdx = 1, parent = #[[$DPAS]], kWidth = 2}>>>
// CHECK: %[[VAL_0:.*]] = tt.make_tensor_ptr {{.*}} : <tensor<256x32xbf16, {{.*}}>>

The actual layout is not important in these tests.

@etiotto etiotto marked this pull request as ready for review October 28, 2024 20:17
@etiotto etiotto requested a review from leonling-ll October 28, 2024 20:17
@etiotto etiotto linked an issue Oct 28, 2024 that may be closed by this pull request
@etiotto etiotto requested review from a team and removed request for jopperm October 30, 2024 21:13

etiotto commented Oct 30, 2024

@chengjunlu can you do a code review please ?

@etiotto etiotto requested a review from Dewei-Wang-sh October 30, 2024 21:18
@chengjunlu

@chengjunlu can you do a code review please ?

LGTM.

@victor-eds victor-eds left a comment

LGTM

@etiotto etiotto merged commit 1dbef57 into main Oct 31, 2024
4 checks passed
@etiotto etiotto deleted the etiottoremove_layout_conv branch October 31, 2024 13:17

Development

Successfully merging this pull request may close these issues.

Make isExpensiveLoadOrStore consider blocked pointers load and stores

4 participants