
Conversation

chengjunlu
Contributor

Layout propagation across the scf.for op in RemoveLayout is not implemented well in these respects:

  1. There is no cost-model analysis of using different layouts for the operations (i.e., choosing different tiling patterns for Triton ops); it relies only on ad-hoc anchors.
  2. Ops with multiple results are not handled well.
  3. Ops with nested basic blocks are not handled well.
  4. Remove layout does not support propagating the layout through scf.for ops.

With these limitations, the scf.for operation is the efficiency bottleneck after the remove-layout pass.
This is not an issue on NVIDIA GPUs, because the NVIDIA backend turns the layout-convert operations into async copies (cp.async) in the software pipeliner.

But it is an issue for Intel GPUs: we rely on remove layout to get a simple program with fewer convert-layout operations.

Plan to enhance remove layout to address these limitations:

  1. Refactor the remove-layout implementation to properly support ops with multiple results and nested basic blocks.
  2. Support propagating the layout through scf.for ops on demand.
  3. Add a cost-model analysis pass to obtain the costs of the different tiling patterns across the kernel program.

This is a PR for CI.

@chengjunlu chengjunlu linked an issue Jun 18, 2025 that may be closed by this pull request
@chengjunlu chengjunlu force-pushed the chengjun/enhance_remove_layout branch from 486ed4a to f42bd66 Compare June 18, 2025 07:16
…d values with different layout in scf.for.

Signed-off-by: Lu,Chengjun <[email protected]>
@etiotto etiotto marked this pull request as draft June 18, 2025 18:06
@etiotto etiotto requested a review from Copilot August 21, 2025 17:12
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR enhances the remove layout implementation to better handle layout propagation across scf.for operations, addressing limitations that create performance bottlenecks on Intel GPU. The changes focus on reducing duplicated layout conversion operations by improving support for multi-result operations and nested basic blocks.

  • Adds support for propagating layouts through scf.for operations with a new includeForOp parameter
  • Refactors mappedValues to handle multiple attribute mappings per value instead of single mappings
  • Includes debug output and unreachable code handling for scf.for operations

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

  • Utility.h: Adds includeForOp parameter to getConvertBackwardSlice function signature
  • Utility.cpp: Implements scf.for layout propagation logic with early return check and debug output
  • RemoveLayoutConversions.cpp: Updates data structures to support multiple encodings per value and enables scf.for processing


return failure();

continue;
}
return failure();

Copilot AI Aug 21, 2025


This return statement makes the code below unreachable. The logic for handling initOperand and yieldOperand (lines 243-253) will never execute, which appears to be the main implementation for scf.for support.

Suggested change
return failure();

Copilot uses AI. Check for mistakes.

llvm::outs() << "johnlu getBackward slice check scf.for initOperand: "
<< initOperand->get() << "\n";
llvm::outs() << "johnlu getBackward slice check scf.for yieldOperand: "
<< yieldOperand.get() << "\n";

Copilot AI Aug 21, 2025


Debug output should not be committed to production code. Consider using LLVM_DEBUG macro or removing these debug statements before merging.

Suggested change
<< yieldOperand.get() << "\n";
LLVM_DEBUG(llvm::dbgs() << "johnlu getBackward slice check scf.for initOperand: "
<< initOperand->get() << "\n");
LLVM_DEBUG(llvm::dbgs() << "johnlu getBackward slice check scf.for yieldOperand: "
<< yieldOperand.get() << "\n");


llvm::outs() << "johnlu getBackward slice check scf.for initOperand: "
<< initOperand->get() << "\n";
llvm::outs() << "johnlu getBackward slice check scf.for yieldOperand: "
<< yieldOperand.get() << "\n";

Copilot AI Aug 21, 2025


Debug output should not be committed to production code. Consider using LLVM_DEBUG macro or removing these debug statements before merging.

Suggested change
<< yieldOperand.get() << "\n";
LLVM_DEBUG(llvm::dbgs() << "johnlu getBackward slice check scf.for initOperand: "
<< initOperand->get() << "\n");
LLVM_DEBUG(llvm::dbgs() << "johnlu getBackward slice check scf.for yieldOperand: "
<< yieldOperand.get() << "\n");


@@ -1045,6 +1058,7 @@ void LayoutRematerialization::rewriteSlice(SetVector<Value> &slice,
deadOps.push_back(forOp.getOperation());
Block &loopBody = *newForOp.getBody();
for (auto m : argMapping) {
mapping.map(newForOp.getResult(m.first), newForOp.getResult(m.second));

Copilot AI Aug 21, 2025


This line appears to map a result to itself when m.first equals m.second, which could be problematic. The mapping logic should ensure proper relationships between old and new ForOp results.



Successfully merging this pull request may close these issues.

[FlexAttn] [BACKWARD] Improve the remove layout pass for flex attn backward kernel
[BACKEND] Enhance the remove layout for Intel GPU