Make `isExpensiveLoadOrStore` consider blocked pointers load and stores #2570

etiotto · 2024-10-24T20:50:36Z

The isExpensiveLoadOrStore function (third_party/intel/lib/TritonIntelGPUTransforms/Utility.cpp) fails to consider block pointers and consequently always returns false for load (and store) operations that use a block pointer.
In turn, this causes the RemoveLayoutConversion pass to never consider loads using block pointers as anchor operations.

This PR changes isExpensiveLoadOrStore so that block pointer loads are properly recognized. The RemoveLayoutConversion pass is then able to consider those loads as anchor operations and preserve their layout.
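
As a rough illustration of the behavior change described above, here is a toy Python model. It is not the C++ code in Utility.cpp: the `PtrOperand` class, both function names, and the size heuristic with its `threads` default are hypothetical and exist only to show why a check that ignores block pointers can never mark such a load as an anchor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PtrOperand:
    """Hypothetical stand-in for the pointer operand of a load or store."""
    is_block_pointer: bool  # True when the operand is a pointer to a tensor
    num_elements: int       # number of elements the access touches


def is_expensive_before(ptr: PtrOperand, threads: int = 128) -> bool:
    # Assumed old behavior: block pointers are not considered at all,
    # so every block pointer access is reported as "not expensive".
    if ptr.is_block_pointer:
        return False
    # Assumed heuristic for tensor-of-pointers accesses (illustrative only).
    return ptr.num_elements >= threads


def is_expensive_after(ptr: PtrOperand, threads: int = 128) -> bool:
    # Assumed new behavior: block pointer accesses go through the same
    # (illustrative) size heuristic instead of being rejected up front.
    return ptr.num_elements >= threads


block_load = PtrOperand(is_block_pointer=True, num_elements=64 * 32)
plain_load = PtrOperand(is_block_pointer=False, num_elements=64 * 32)

print(is_expensive_before(block_load), is_expensive_after(block_load))
```

In this model only the block pointer case changes its answer; the tensor-of-pointers case is unaffected, which mirrors the intent stated in the description.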

Because RemoveLayoutConversion is invoked at several points in the optimization pipeline, the change in third_party/intel/lib/TritonIntelGPUTransforms/Utility.cpp alone causes performance degradation in a couple of GEMM-like benchmarks, specifically when operand A of tl.dot is transposed and when the input of tl.dot is first fed into an exponential.

These two performance degradations have been fixed by enhancing the MaterializeBlockPointer and MatmulLoopPipeline optimizations so that they can retrieve the dot layout of a block pointer load transitively from its users (in those benchmarks the blocked layout of block pointer loads is transitively converted to a dot layout).
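
The idea of retrieving a dot layout "transitively from its users" can be sketched with a small def-use graph. This is a simplified Python model under stated assumptions, not the code added to MaterializeBlockPointer or MatmulLoopPipeline: the `Op` class, the `find_dot_layout` helper, and the string layout names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Op:
    """Hypothetical operation: a name, a result layout, and its users."""
    name: str
    layout: str
    users: List["Op"] = field(default_factory=list)


def find_dot_layout(op: Op, seen: Optional[Set[int]] = None) -> Optional[str]:
    """Depth-first walk over users; return the first dot layout reached."""
    seen = set() if seen is None else seen
    if id(op) in seen:  # guard against revisiting shared users
        return None
    seen.add(id(op))
    for user in op.users:
        if user.layout.startswith("dot_op"):
            return user.layout
        found = find_dot_layout(user, seen)
        if found is not None:
            return found
    return None


# load (blocked) -> exp (blocked) -> convert_layout (dot_op) -> dot
dot = Op("tt.dot", "dpas")
cvt = Op("convert_layout", "dot_op<opIdx=0>", [dot])
exp = Op("math.exp", "blocked", [cvt])
load = Op("tt.load", "blocked", [exp])

# A load whose only user is a store never reaches a dot layout.
other_load = Op("tt.load", "blocked", [Op("tt.store", "blocked")])

print(find_dot_layout(load), find_dot_layout(other_load))
```

The first chain resembles the gemm-preop-exp case from the description, where the loaded value passes through an exponential before its layout is converted for the dot.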

Signed-off-by: Tiotto, Ettore <[email protected]>

etiotto · 2024-10-28T19:34:28Z

All Triton benchmarks are on par: http://benchmarks.glados.intel.com/d/1pXX4hUSz/microbenchmarks?orgId=1&var-tag=ci%7Cetiotto2&var-bench=attn&var-bench=gemm&var-bench=gemm-preop-exp&var-bench=gemm-postop-gelu&var-bench=gemm-postop-addmatrix&var-bench=gemm-streamk&var-bench=gemm-splitk&var-bench=gemm-bt&var-bench=gemm-at&var-device=All&var-compiler=triton&var-backend=All&var-baseline_backend=triton-ci-XPU%201550&var-target_backend=triton-ci-XPU%201550

Signed-off-by: Tiotto, Ettore <[email protected]>

etiotto · 2024-10-28T20:00:07Z

test/TritonIntelGPU/backward_combine_dpas_dot_layout.mlir

    // CHECK: %[[VAL_41:.*]]:3 = scf.for %{{.*}} = %{{.*}} to %{{.*}} step %{{.*}} iter_args(%{{.*}} = %{{.*}}, %{{.*}} = %[[VAL_36]], %{{.*}} = %[[VAL_40]]) -> (tensor<64x256xf32, #[[DPAS]]>, !tt.ptr<tensor<64x32xf16, #triton_gpu.dot_op<{opIdx = 0, parent = #[[DPAS]], kWidth = 2}>>>, !tt.ptr<tensor<32x256xf16, #triton_gpu.dot_op<{opIdx = 1, parent = #[[DPAS]], kWidth = 2}>>>)  : i32 {
-    // CHECK: %[[VAL_46:.*]] = tt.load %{{.*}} {boundaryCheck = array<i32: 0, 1>} : !tt.ptr<tensor<64x32xf16, #triton_gpu.dot_op<{opIdx = 0, parent = #[[DPAS]], kWidth = 2}>>>
-    // CHECK: %[[VAL_47:.*]] = tt.load %{{.*}} {boundaryCheck = array<i32: 0, 1>} : !tt.ptr<tensor<32x256xf16, #triton_gpu.dot_op<{opIdx = 1, parent = #[[DPAS]], kWidth = 2}>>>
+    // CHECK: %[[VAL_46:.*]] = tt.load %{{.*}} {boundaryCheck = array<i32: 0, 1>, triton_intel_gpu.block_io = "row_major"} : !tt.ptr<tensor<64x32xf16, #triton_gpu.dot_op<{opIdx = 0, parent = #[[DPAS]], kWidth = 2}>>>


Note: adding the triton_intel_gpu.block_io attribute to the test is consistent with our optimization pipeline, where the attribute is added before the second invocation of RemoveLayoutConversion.

etiotto · 2024-10-28T20:01:26Z

test/TritonIntelGPU/combine.mlir

    %1 = tt.get_program_id y : i32
-    // CHECK:   %[[VAL_0:.*]] = tt.make_tensor_ptr {{.*}} : <tensor<256x32xbf16, #triton_gpu.dot_op<{opIdx = 0, parent = #[[$DPAS]], kWidth = 2}>>>
-    // CHECK:   %[[VAL_1:.*]] = tt.make_tensor_ptr {{.*}} : <tensor<32x256xbf16, #triton_gpu.dot_op<{opIdx = 1, parent = #[[$DPAS]], kWidth = 2}>>>
+    // CHECK:   %[[VAL_0:.*]] = tt.make_tensor_ptr {{.*}} : <tensor<256x32xbf16, {{.*}}>>


The actual layout is not important in these tests.
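
The relaxed CHECK line works because FileCheck treats text inside `{{...}}` as a regular expression and everything else as literal text. The Python sketch below approximates that matching to show that the new pattern accepts any layout. The `check_to_regex` helper and both sample IR lines are hypothetical, and FileCheck features such as `[[VAR:...]]` captures and whitespace canonicalization are left out.

```python
import re


def check_to_regex(check: str) -> "re.Pattern[str]":
    """Approximate a CHECK pattern: {{...}} is a regex, the rest is literal."""
    # With one capture group, re.split puts literals at even indices
    # and the captured regex bodies at odd indices.
    parts = re.split(r"\{\{(.*?)\}\}", check)
    pieces = [p if i % 2 else re.escape(p) for i, p in enumerate(parts)]
    return re.compile("".join(pieces))


pattern = check_to_regex(
    "tt.make_tensor_ptr {{.*}} : <tensor<256x32xbf16, {{.*}}>>"
)

# Hypothetical IR lines that differ only in the layout attribute.
dot_line = (
    "%0 = tt.make_tensor_ptr %arg0, [%c0] : <tensor<256x32xbf16, "
    "#triton_gpu.dot_op<{opIdx = 0, parent = #dpas, kWidth = 2}>>>"
)
blocked_line = "%0 = tt.make_tensor_ptr %arg0, [%c0] : <tensor<256x32xbf16, #blocked>>"
other_shape = "%0 = tt.make_tensor_ptr %arg0, [%c0] : <tensor<64x32xf16, #blocked>>"

for line in (dot_line, blocked_line, other_shape):
    print(bool(pattern.search(line)))
```

The shape and element type are still checked literally, so only the layout portion of the type is left unconstrained.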

etiotto · 2024-10-30T21:18:07Z

@chengjunlu can you do a code review please ?

chengjunlu · 2024-10-31T02:36:56Z

> @chengjunlu can you do a code review please ?

LGTM.

victor-eds

LGTM

etiotto added 28 commits October 9, 2024 16:43

Improve axis analysis to handle tt.make_tensor_ptr

c7fe682

Signed-off-by: Tiotto, Ettore <[email protected]>

Merge branch 'main' into etiotto/axis_analysis_make_tensor_ptr

ad3888f

Merge branch 'main' into etiotto/axis_analysis_make_tensor_ptr

a7a9b06

Merge branch 'main' into etiotto/axis_analysis_make_tensor_ptr

6bddd5f

Merge branch 'main' into etiotto/axis_analysis_make_tensor_ptr

4ad4f1a

WIP: Coalescing for block ptrs

4dc1cf1

Signed-off-by: Tiotto, Ettore <[email protected]>

Fix pre_commit

fa53ced

Signed-off-by: Tiotto, Ettore <[email protected]>

Merge branch 'main' into etiotto/coalesce_for_block_ptr

049ddb8

Merge branch 'main' into etiotto/coalesce_for_block_ptr

041e2da

Fix functional problem and add lit test

5a6cf81

Signed-off-by: Tiotto, Ettore <[email protected]>

Fix pre_commit

2546665

Signed-off-by: Tiotto, Ettore <[email protected]>

Reenable rewrite tensor ptr

4d5dc49

Signed-off-by: Tiotto, Ettore <[email protected]>

Fix test_core regression

c3fdbba

Signed-off-by: Tiotto, Ettore <[email protected]>

Fix tutorial assertion

d9de8e7

Signed-off-by: Tiotto, Ettore <[email protected]>

Refactor

949256e

Signed-off-by: Tiotto, Ettore <[email protected]>

Cleanup

754ec70

Signed-off-by: Tiotto, Ettore <[email protected]>

Cleanup

469407b

Signed-off-by: Tiotto, Ettore <[email protected]>

Extend axis info analysis to more block ptrs

9f4f98d

Signed-off-by: Tiotto, Ettore <[email protected]>

Merge branch 'main' into etiotto/coalesce_for_block_ptr

a40844b

Address code review comments

bb9b4c3

Signed-off-by: Tiotto, Ettore <[email protected]>

Remove unrelated change

8d9a158

Signed-off-by: Tiotto, Ettore <[email protected]>

Remove unrelated change

6529f04

Signed-off-by: Tiotto, Ettore <[email protected]>

Remove unrelated change

0aa334b

Signed-off-by: Tiotto, Ettore <[email protected]>

Fix pre_commit

547d6fa

Signed-off-by: Tiotto, Ettore <[email protected]>

Merge branch 'main' into etiotto/coalesce_for_block_ptr

6566f6c

Address code review comments

2f97c1a

Signed-off-by: Tiotto, Ettore <[email protected]>

Fix pre_commit

95f5832

Signed-off-by: Tiotto, Ettore <[email protected]>

Merge branch 'main' into etiottoremove_layout_conv

0887245

etiotto self-assigned this Oct 24, 2024

Make isExpensiveLoadOrStore consider blocked pointers load and stores

3636bef

Signed-off-by: Tiotto, Ettore <[email protected]>

etiotto added 6 commits October 25, 2024 14:45

Make isExpensiveLoadOrStore consider blocked pointers load and stores

db2193e

Signed-off-by: Tiotto, Ettore <[email protected]>

Merge branch 'main' into etiottoremove_layout_conv

eeda8e9

MaterializeBlockPointer fix for GEMM with 1st operand transposed

7c9a0f9

Signed-off-by: Tiotto, Ettore <[email protected]>

MaterializeBlockPointer fix for GEMM with 1st operand transposed

cbc630b

Signed-off-by: Tiotto, Ettore <[email protected]>

Fix unit tests

0215a16

Signed-off-by: Tiotto, Ettore <[email protected]>

Fix performance regression for gemm-preop-exp

ae3d625

Signed-off-by: Tiotto, Ettore <[email protected]>

etiotto requested review from chengjunlu, jopperm and whitneywhtsang October 28, 2024 19:57

Reduce PR footprint

22b7ec9

Signed-off-by: Tiotto, Ettore <[email protected]>

etiotto commented Oct 28, 2024

etiotto marked this pull request as ready for review October 28, 2024 20:17

etiotto requested a review from leonling-ll October 28, 2024 20:17

etiotto linked an issue Oct 28, 2024 that may be closed by this pull request

Make isExpensiveLoadOrStore consider blocked pointers load and stores #2581

Closed

etiotto requested review from a team and removed request for jopperm October 30, 2024 21:13

etiotto requested a review from Dewei-Wang-sh October 30, 2024 21:18

chengjunlu approved these changes Oct 31, 2024

victor-eds approved these changes Oct 31, 2024

etiotto merged commit 1dbef57 into main Oct 31, 2024
4 checks passed

etiotto deleted the etiottoremove_layout_conv branch October 31, 2024 13:17
