
Conversation

chengjunlu
Contributor

Layout propagation across the scf.for op in RemoveLayout is not implemented well in these respects:

  1. There is no cost-model analysis of using different layouts for the operations (i.e., choosing different tiling patterns for Triton ops); it relies only on ad-hoc anchors.
  2. Ops with multiple results are not handled well.
  3. Ops with nested basic blocks are not handled well.
  4. Remove layout does not support propagating the layout through scf.for ops.

With these limitations, the scf.for operation is the efficiency bottleneck after the remove-layout pass.
This is not an issue on NVIDIA GPUs, because the NVIDIA backend turns the layout-convert operations into async copies (cp.async) in the software pipeliner.

But it is an issue for Intel GPUs: we rely on remove layout to get a simple program with fewer convert-layout operations.

Plan to enhance remove layout to address these limitations:

  1. Refactor the remove-layout implementation to properly support ops with multiple results and nested basic blocks.
  2. Support propagating the layout through scf.for ops on demand.
  3. Add a cost-model analysis pass to obtain the costs of the different tiling patterns across the kernel program.

This is a PR for CI.

@chengjunlu chengjunlu linked an issue Jun 18, 2025 that may be closed by this pull request
@chengjunlu chengjunlu force-pushed the chengjun/enhance_remove_layout branch from 486ed4a to f42bd66 Compare June 18, 2025 07:16
…d values with different layout in scf.for.

Signed-off-by: Lu,Chengjun <[email protected]>
@etiotto etiotto marked this pull request as draft June 18, 2025 18:06
@etiotto etiotto requested a review from Copilot August 21, 2025 17:12
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR enhances the remove layout implementation to better handle layout propagation across scf.for operations, addressing limitations that create performance bottlenecks on Intel GPU. The changes focus on reducing duplicated layout conversion operations by improving support for multi-result operations and nested basic blocks.

  • Adds support for propagating layouts through scf.for operations with a new includeForOp parameter
  • Refactors mappedValues to handle multiple attribute mappings per value instead of single mappings
  • Includes debug output and unreachable code handling for scf.for operations

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

  • Utility.h: Adds includeForOp parameter to getConvertBackwardSlice function signature
  • Utility.cpp: Implements scf.for layout propagation logic with early return check and debug output
  • RemoveLayoutConversions.cpp: Updates data structures to support multiple encodings per value and enables scf.for processing


return failure();

continue;
}
return failure();

Copilot AI Aug 21, 2025


This return statement makes the code below unreachable. The logic for handling initOperand and yieldOperand (lines 243-253) will never execute, which appears to be the main implementation for scf.for support.

Suggested change
return failure();

Copilot uses AI. Check for mistakes.

llvm::outs() << "johnlu getBackward slice check scf.for initOperand: "
<< initOperand->get() << "\n";
llvm::outs() << "johnlu getBackward slice check scf.for yieldOperand: "
<< yieldOperand.get() << "\n";

Copilot AI Aug 21, 2025


Debug output should not be committed to production code. Consider using LLVM_DEBUG macro or removing these debug statements before merging.

Suggested change
<< yieldOperand.get() << "\n";
LLVM_DEBUG(llvm::dbgs() << "johnlu getBackward slice check scf.for initOperand: "
<< initOperand->get() << "\n");
LLVM_DEBUG(llvm::dbgs() << "johnlu getBackward slice check scf.for yieldOperand: "
<< yieldOperand.get() << "\n");


llvm::outs() << "johnlu getBackward slice check scf.for initOperand: "
<< initOperand->get() << "\n";
llvm::outs() << "johnlu getBackward slice check scf.for yieldOperand: "
<< yieldOperand.get() << "\n";

Copilot AI Aug 21, 2025


Debug output should not be committed to production code. Consider using LLVM_DEBUG macro or removing these debug statements before merging.

Suggested change
<< yieldOperand.get() << "\n";
LLVM_DEBUG(llvm::dbgs() << "johnlu getBackward slice check scf.for initOperand: "
<< initOperand->get() << "\n");
LLVM_DEBUG(llvm::dbgs() << "johnlu getBackward slice check scf.for yieldOperand: "
<< yieldOperand.get() << "\n");


@@ -1045,6 +1058,7 @@ void LayoutRematerialization::rewriteSlice(SetVector<Value> &slice,
deadOps.push_back(forOp.getOperation());
Block &loopBody = *newForOp.getBody();
for (auto m : argMapping) {
mapping.map(newForOp.getResult(m.first), newForOp.getResult(m.second));

Copilot AI Aug 21, 2025


This line appears to map a result to itself when m.first equals m.second, which could be problematic. The mapping logic should ensure proper relationships between old and new ForOp results.



Successfully merging this pull request may close these issues.

[FlexAttn] [BACKWARD] Improve the remove layout pass for flex attn backward kernel
[BACKEND] Enhance the remove layout for Intel GPU