[optimize-dot-operands]: Fuse load and trans operations - part 3 #4537
Signed-off-by: Tiotto, Ettore <[email protected]>
Depends on: #4468
Pull Request Overview
This PR enhances dot operand optimization by fusing load and transpose operations in separate loops when the def-use chains originate from a make_tensor_ptr operation, and by refactoring cleanup routines.
- Added a new optimization pass (optimize_dot_operands) in multiple backend components.
- Introduced a new eraseOperations utility and refactored fusion logic in OptimizeDotOperands.cpp.
- Updated test cases to validate proper fusion and non-fused behavior.
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
File | Description |
---|---|
third_party/intel/triton_xpu.cc | Added optimize_dot_operands pass registration. |
third_party/intel/lib/Utils/Utility.cpp | Added a new eraseOperations function for cleanup operations. |
third_party/intel/lib/TritonIntelGPUTransforms/OptimizeDotOperands.cpp | Refactored fusion logic and propagation routines to support optimized chaining. |
third_party/intel/lib/Dialect/Triton/Transforms/TensorDescToBlockPointer.cpp | Removed redundant finalize() in favor of using eraseOperations. |
third_party/intel/include/Utils/Utility.h | Declared the new eraseOperations function. |
third_party/intel/backend/compiler.py | Registered the new optimize_dot_operands pass in the compiler backend. |
test/TritonIntelGPU/dot-operands.mlir | Updated test cases to reflect changes in fusion behavior and new pass functionality. |
Comments suppressed due to low confidence (2)
third_party/intel/lib/TritonIntelGPUTransforms/OptimizeDotOperands.cpp:161
- [nitpick] The singleUsersInChain function is quite complex; consider refactoring the logic or adding more inline comments to improve readability and maintainability.
// Determine whether all operations in the def-use chain from \p start to
third_party/intel/lib/TritonIntelGPUTransforms/OptimizeDotOperands.cpp:112
- [nitpick] Consider renaming the lambda 'usedByDotOp' to a more descriptive name such as 'isChainedToDotOp' to clarify its purpose.
auto usedByDotOp = [](tt::TransOp transOp) {
```diff
@@ -68,7 +71,10 @@ def _attn_fwd_inner(acc, l_i, m_i, q,  #
    for start_n in tl.range(lo, hi, BLOCK_N, warp_specialize=warp_specialize):
        start_n = tl.multiple_of(start_n, BLOCK_N)
        # -- compute qk ----
        k = desc_k.load([0, offsetk_y])
        if dtype == tl.float8e5:
```
For fp16 we undo the source code changes we made, so the code is now back to the original. For FP8 we keep the source code changes until we can issue DPAS instructions for them (after packing 2 fp8 elements into a fp16).
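The "2 fp8 elements into a fp16" packing mentioned above can be sketched conceptually in NumPy; this is an illustration only, not the backend code. `uint8` and `uint16` stand in for the fp8 and fp16 storage types, and the packing is a pure bit reinterpretation:

```python
import numpy as np

# Conceptual sketch: DPAS consumes 16-bit lanes, so two 8-bit fp8 values can
# be carried in one fp16-sized lane by bitwise packing.
fp8_vals = np.arange(8, dtype=np.uint8)   # 8 fp8 elements (raw bytes)
packed = fp8_vals.view(np.uint16)         # 4 lanes, 2 fp8 elements per lane

# The reinterpretation changes no bytes and round-trips losslessly.
assert packed.nbytes == fp8_vals.nbytes
assert np.array_equal(packed.view(np.uint8), fp8_vals)
```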
```diff
@@ -80,15 +79,6 @@ module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 32 : i32} {
    %c1_i64 = arith.constant 1 : i64
    %c1024_i64 = arith.constant 1024 : i64
    %cst = arith.constant dense<0.000000e+00> : tensor<256x256xf32, #mma>
    %0 = tt.get_program_id x : i32
```
Just making the test simpler here
ping @whitneywhtsang, @chengjunlu, @LiyangLingIntel any comments?
```cpp
// Prune candidate chains containing load/trans operations that cannot be
// safely fused.
prune(chains);
```
Do you think it is worth first pruning rootToChains down to chains that contain at least one candidate? That way we won't clone the chain if there will be no candidates.
Another thought is to have a flag to indicate whether we want to clone for a particular root in rootToChains: if there is no candidate, or all chains are candidates, then there is no need to clone.
Not sure I fully understand the question. We collect def-use chains terminated at a TransOp only if that operation is a candidate (therefore not all TransOps are collected into a chain to start with). Then, if there is only one chain, no cloning is necessary. If there are 2 or more chains, we clone the root operation only if it is the start operation of more than one chain. After that, we prune chains if we detect that operations in the "middle" of the chain have more than one user. The remaining chains are the final candidates for fusion.
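The collect/clone/prune flow described above can be sketched with a small Python model. Everything here is illustrative (the `Op` class, `prune`, and `root_needs_clone` are hypothetical stand-ins, not the pass's actual C++ API): a chain survives only if every operation strictly between the root and the terminating transpose has exactly one user, and the shared root would be cloned only when it starts more than one surviving chain.

```python
class Op:
    """One operation in a def-use chain (illustrative, not the real IR type)."""
    def __init__(self, name):
        self.name = name
        self.users = []

    def add_user(self, op):
        self.users.append(op)
        return op

def prune(chains):
    # Keep a chain only if every "middle" op (between root and the
    # terminating transpose) has exactly one user.
    return [c for c in chains if all(len(op.users) == 1 for op in c[1:-1])]

def root_needs_clone(chains):
    # Clone the root only when it starts more than one surviving chain.
    return sum(1 for c in chains if c[0] is chains[0][0]) > 1

# Two chains sharing one make_tensor_ptr root; load_b also feeds a side use,
# so its chain must be pruned.
root = Op("make_tensor_ptr")
load_a = root.add_user(Op("load_a"))
trans_a = load_a.add_user(Op("trans_a"))
load_b = root.add_user(Op("load_b"))
trans_b = load_b.add_user(Op("trans_b"))
load_b.add_user(Op("side_use"))

chains = prune([[root, load_a, trans_a], [root, load_b, trans_b]])
```

With only one chain surviving, no cloning is needed; had both survived, the shared root would be cloned so each chain owns its start operation.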
After discussing offline I understand the suggestion. We agreed to improve the implementation in the next PR.
Thanks @whitneywhtsang for the prompt review!
Addresses first round of comments. I still have some comments to work on @whitneywhtsang.
Should be fine in this PR as it is limited to one user per operation in the chain, but in general we need to be careful that there can be more than one chain with the same start and the same end.
Enhance the transformation to allow multiple load+transpose fusion opportunities in separate for loops when the def-use chains corresponding to the opportunities originate at the same make_tensor_ptr operation.
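The shape the enhanced pass targets can be illustrated with plain NumPy (a hypothetical stand-in for the Triton IR, not the actual kernel code): two independent loops each load a block from the same shared base tensor, transpose it, and feed a dot. The fusion folds each transpose into its load instead of materializing the transposed block.

```python
import numpy as np

# Shared root tensor, playing the role of a single make_tensor_ptr result.
base = np.arange(64, dtype=np.float32).reshape(8, 8)
q = np.ones((4, 4), dtype=np.float32)

acc = []
for row in (0, 4):                    # two separate loops/fusion opportunities
    b_block = base[row:row + 4, 0:4]  # load from the shared root
    acc.append(q @ b_block.T)         # transpose feeding a dot
    # fused form: read base[row:row+4, 0:4] column-major directly,
    # eliminating the standalone transpose in each loop
```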