-
Notifications
You must be signed in to change notification settings - Fork 2.4k
[AMD] Optimize address increments for buffer loads in loops #8464
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This PR transfers address computation from offsets in buffer loads to base pointers, which reuces amount of required computations and lowers register pressure.
82cef71 to
18bdacd
Compare
18bdacd to
226f8ef
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general lgtm. Just a few minor issues.
|
|
||
| LogicalResult matchAndRewrite(scf::ForOp forOp, | ||
| PatternRewriter &rewriter) const override { | ||
| LDBG("Analyzing ForOp for for offset pointer optimization: " << forOp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for for
| // baseIncrement is a scalar which will be added to base pointer after | ||
| // optimization offsetInitialized is a value of offset on first loop iteration | ||
| // incrementOp is an operation that advances offset tensor | ||
| struct LoadData { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"load is a target buffer load" should also include buffer_load_to_local.
You may want to manually format multi-line comment for baseIncrement.
"offsetInitialized" should be "offsetInitializer"
It's better to name this struct "LoadInfo". Just like what we have in the pipeliner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current implementation covers all buffer ops, so I ended up with BufferOpInfo
| } | ||
| Value offsetInitializer = forOp.getInitArgs()[offsetOperandNo]; | ||
| LoadData data = {loadOp, advanceStep, Value(), offsetInitializer, | ||
| incrementOp}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not call createScalarIncrements here to fill in baseIncrement? So after this analyzeLoad function, you have all info needed for this load.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to fully split analysis and transformation. I am afraid of situation when optimization creates some operations, but later heuristic decides to skip.
There are no such heuristic at the moment, but I can not be sure we will not need it.
| // Gather buffer loads which could be optimized | ||
| SmallVector<LoadData> loads; | ||
| collectLoads<triton::amdgpu::BufferLoadOp>(loads, forOp); | ||
| collectLoads<triton::amdgpu::BufferLoadToLocalOp>(loads, forOp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we create an interface for all these buffer ops, also including buffer_atomic_rmw and cas? So we don't have to go over the whole for loop again and again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Included this change in this PR and created separate here: #8600
This PR introduces a common interface for buffer address operands.
8fcf1d2 to
c2eacf4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
good to go after fixing the naming issues for the interface.
This PR transfers address computation from offsets in buffer loads to base pointers, which reuses amount of required computations and lowers register pressure.