[MLIR][XeGPU] Add support for cross-subgroup reduction from wg to sg #170936
Conversation
🐧 Linux x64 Test Results
✅ The build succeeded and all tests passed.
cc @akroviakov
auto reductionDims = llvm::to_vector(op.getReductionDims());
if (reductionDims.size() != 1)
  return rewriter.notifyMatchFailure(
      op, "Only single dimension reduction is supported");
What prevents 2D reductions here?
It's one of the requirements for the XeGPU canonical form; that pass should ensure only single-dimension reductions reach here.
But then we face a problem. If there is a 2D test case, we have to rewrite it as two 1D reductions first. From what I see, this pattern naturally supports intra-sg reduction and then handles the cross-sg results.
If we were to consider the 2D case, the pattern already has most of the components for the hardcoded logic: do the intra-sg reduction and then the cross-sg one via SLM. We do not care how "2D" is represented at lower levels.
When we go lower and start to actually care how the sg-local 2D reduction is executed, we have to do two 1D reductions. We decide on the order based on the layout (we first reduce the dimension that does not require shuffles, if any).
However, if we are forced to split the 2D reduction into two 1D reductions at the wg level, we lose the ability to reason about the better order, because we do not require a lane layout at the WG level and cannot use it when splitting.
Please correct me if I missed something.
The restriction/requirement is driven by the implementation, not by users. So if our implementation can be improved to lift the restriction, we should try.
I agree with @akroviakov. We should handle multiple dims here, but for now this is fine.
xegpu::StoreMatrixOp::create(rewriter, loc, storeData, memDesc.getResult(),
                             storeOffsets2D, /*layout=*/nullptr);
...
gpu::BarrierOp::create(rewriter, loc);
To sync the producer and consumer subgroups on the data, both a barrier and a fence are needed.
int64_t reductionDim = reductionDims[0];
bool needsCrossSubgroupReduction = (sgLayout[reductionDim] > 1);
...
// If no cross-subgroup reduction needed, add accumulator and return
The code could use some helper functions so the main function becomes shorter.
charithaintc left a comment
Overall direction looks good. Will do another review once comments are addressed.
/// so that reduction is local to subgroup & no cross-subgroup communication is
/// needed.
/// TODO: Add cases to handle more general situations which require SLM access.
// This pattern transforms vector.multi_dim_reduction ops to work at subgroup
Please add a summary of your algorithm here.
auto accs = adaptor.getAcc();
...
SmallVector<Value> expandedAccs;
if (accs.size() == 1 && sources.size() > 1) {
what is this case?
int64_t totalResultElements = localElements;
for (size_t i = 0; i < sgLayout.size(); ++i) {
  if (!llvm::is_contained(reductionDims, static_cast<int64_t>(i)))
    totalResultElements *= sgLayout[i];
}
This can be simplified with computeProduct and a divide by the reduction dim's size.
...
auto loadOp = xegpu::LoadMatrixOp::create(
    rewriter, loc, loadType2D, memDesc.getResult(), loadOffsets2D,
    /*layout=*/nullptr);
We need a barrier here as well, to make sure everyone finishes loading the values?