Conversation

@alefimov-amd (Contributor) commented Oct 16, 2025

This PR transfers address computation from offsets in buffer loads to base pointers, which reduces the amount of required computations and lowers register pressure.

This PR transfers address computation from offsets in buffer loads to base pointers,
which reduces the amount of required computations and lowers register pressure.
@binarman binarman force-pushed the buffer_load_base_opt branch from 82cef71 to 18bdacd Compare October 28, 2025 22:03
@zhanglx13 (Collaborator) left a comment:

In general lgtm. Just a few minor issues.


LogicalResult matchAndRewrite(scf::ForOp forOp,
PatternRewriter &rewriter) const override {
LDBG("Analyzing ForOp for for offset pointer optimization: " << forOp);
Collaborator:

Typo: duplicated "for for" in the debug message.

// baseIncrement is a scalar which will be added to base pointer after
// optimization offsetInitialized is a value of offset on first loop iteration
// incrementOp is an operation that advances offset tensor
struct LoadData {
Collaborator:

"load is a target buffer load" should also include buffer_load_to_local.
You may want to manually format the multi-line comment for baseIncrement.
"offsetInitialized" should be "offsetInitializer".

It's better to name this struct "LoadInfo", just like what we have in the pipeliner.

Contributor (author):

The current implementation covers all buffer ops, so I ended up with BufferOpInfo.

}
Value offsetInitializer = forOp.getInitArgs()[offsetOperandNo];
LoadData data = {loadOp, advanceStep, Value(), offsetInitializer,
incrementOp};
Collaborator:

Why not call createScalarIncrements here to fill in baseIncrement? That way, after this analyzeLoad function, you have all the info needed for this load.

Contributor (author):

I want to fully split analysis and transformation. I am afraid of a situation where the optimization creates some operations, but a later heuristic decides to skip them.

There is no such heuristic at the moment, but I cannot be sure we will not need one.

// Gather buffer loads which could be optimized
SmallVector<LoadData> loads;
collectLoads<triton::amdgpu::BufferLoadOp>(loads, forOp);
collectLoads<triton::amdgpu::BufferLoadToLocalOp>(loads, forOp);
Collaborator:

Can we create an interface for all these buffer ops, also including buffer_atomic_rmw and cas, so we don't have to go over the whole for loop again and again?

Contributor (author):

I included this change in this PR and created a separate one here: #8600.

This PR introduces a common interface for buffer address operands.
@zhanglx13 (Collaborator) left a comment:

LGTM. Good to go after fixing the naming issues for the interface.
