
Conversation

@nbpatel
Contributor

nbpatel commented Dec 5, 2025

No description provided.

@github-actions
github-actions bot commented Dec 5, 2025

🐧 Linux x64 Test Results

  • 7193 tests passed
  • 596 tests skipped

✅ The build succeeded and all tests passed.

@Garra1980

cc @akroviakov

auto reductionDims = llvm::to_vector(op.getReductionDims());
if (reductionDims.size() != 1)
  return rewriter.notifyMatchFailure(
      op, "Only single dimension reduction is supported");
Contributor

What prevents 2D reductions here?

Contributor Author

It's one of the requirements for XeGPU canonical form; that pass should ensure that only single-dim reductions reach this point.

Contributor
@akroviakov Dec 8, 2025

But then we face a problem. If there is a 2D test case, we have to rewrite it as two 1D reductions first. From what I see, this pattern naturally supports intra-sg reduction and then handles the cross-sg results.

If we were to consider the 2D case, the pattern already has most of the components for the hardcoded logic: do the intra-sg reduction and then go cross-sg via SLM. At this level we do not care how "2D" is represented lower down.

When we go lower and start to actually care how the sg-local 2D reduction is executed, we have to do two 1D reductions. We decide on the order based on the layout (we first reduce the dimension that does not require shuffles, if any).

However, if we are forced to split the 2D reduction into two 1D reductions at the WG level, we lose the ability to reason about the better order, because we do not require a lane layout at the WG level and cannot use it when splitting.

Please correct me if I missed something.

Contributor

The restriction/requirement is driven by the implementation, not by users. So if our implementation can be improved to lift the restriction, we should try.

Contributor

I agree with @akroviakov. We should handle multiple dims here, but for now this is fine.

xegpu::StoreMatrixOp::create(rewriter, loc, storeData, memDesc.getResult(),
                             storeOffsets2D, /*layout=*/nullptr);

gpu::BarrierOp::create(rewriter, loc);
Contributor

To synchronize producer and consumer subgroups on the data, both a barrier and a fence are needed.
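
A minimal sketch of what this could look like, offered as an assumption rather than the dialect's confirmed API: the builder arguments and enum spellings below (MemoryScope::SLM, FenceScope::Workgroup) are placeholders and must be matched against the actual xegpu.fence op definition.

xegpu::StoreMatrixOp::create(rewriter, loc, storeData, memDesc.getResult(),
                             storeOffsets2D, /*layout=*/nullptr);
// Assumed builder: flush the SLM store so other subgroups observe the
// partial results, then synchronize before anyone loads them back.
xegpu::FenceOp::create(rewriter, loc, xegpu::MemoryScope::SLM,
                       xegpu::FenceScope::Workgroup);
gpu::BarrierOp::create(rewriter, loc);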

int64_t reductionDim = reductionDims[0];
bool needsCrossSubgroupReduction = (sgLayout[reductionDim] > 1);

// If no cross-subgroup reduction needed, add accumulator and return
Contributor

The code could use some helper functions so the main function becomes shorter.

@charithaintc self-requested a review December 9, 2025 23:06
Contributor
@charithaintc left a comment

Overall direction looks good. Will do another review once comments are addressed.

/// so that reduction is local to subgroup & no cross-subgroup communication is
/// needed.
/// TODO: Add cases to handle more general situations which require SLM access.
// This pattern transforms vector.multi_dim_reduction ops to work at subgroup
Contributor

Please add a summary of your algorithm here.

auto accs = adaptor.getAcc();

SmallVector<Value> expandedAccs;
if (accs.size() == 1 && sources.size() > 1) {
Contributor

What is this case?

Comment on lines +1258 to +1262
int64_t totalResultElements = localElements;
for (size_t i = 0; i < sgLayout.size(); ++i) {
  if (!llvm::is_contained(reductionDims, static_cast<int64_t>(i)))
    totalResultElements *= sgLayout[i];
}
Contributor

Can be simplified with computeProduct, then dividing by the reduction-dim sizes.
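
A sketch of the suggested simplification, using computeProduct from mlir/Dialect/Utils/IndexingUtils.h and the variable names from the snippet above:

// Product over all sg-layout dims, then divide back out the extent of
// each reduced dimension.
int64_t totalResultElements = localElements * computeProduct(sgLayout);
for (int64_t d : reductionDims)
  totalResultElements /= sgLayout[d];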


auto loadOp = xegpu::LoadMatrixOp::create(
    rewriter, loc, loadType2D, memDesc.getResult(), loadOffsets2D,
    /*layout=*/nullptr);
Contributor
@charithaintc Dec 9, 2025

We need a barrier here as well, to make sure everyone finishes loading the values?
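
If so, a sketch of the placement, reusing the builders from the snippet above:

auto loadOp = xegpu::LoadMatrixOp::create(
    rewriter, loc, loadType2D, memDesc.getResult(), loadOffsets2D,
    /*layout=*/nullptr);
// Synchronize after the load so no subgroup overwrites the SLM tile
// while another subgroup is still reading it.
gpu::BarrierOp::create(rewriter, loc);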

@charithaintc
Contributor

charithaintc commented Dec 11, 2025

I thought about this again. I think it's not complicated to support reduction over multiple dims. We can use progressive lowering inside the pattern itself.

  • if size(red_dims) > 1, rewrite the reduction into multiple reductions, each over a single dim, and return.
  • if size(red_dims) == 1, apply the currently implemented logic.

All of this can be done in the same pattern, in my view; a sketch of the first step follows.
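
In IR terms (types borrowed from the 64x128 example in the discussion below; value names illustrative), the first bullet would rewrite

%r = vector.multi_reduction <add>, %src, %acc [0, 1] : vector<64x128xf32> to f32

into two single-dim reductions, where %zero is the identity for <add> and the original accumulator is applied only once, on the last step:

%partial = vector.multi_reduction <add>, %src, %zero [1] : vector<64x128xf32> to vector<64xf32>
%r = vector.multi_reduction <add>, %partial, %acc [0] : vector<64xf32> to f32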

cc @nbpatel , @Jianhui-Li , @akroviakov

@akroviakov
Contributor

if size(red_dims) > 1, rewrite the reduction into multiple reductions where each reduction is over single dims

if size(red_dims) == 1, apply the currently implemented logic

Yes, this is currently the default option. What I am trying to understand is why it is the WG level's task to decide how the SG optimizes/decomposes sg-local reductions, when at the WG level we do not even require lane layouts?

Given the wg-level reduction

vector.multi_reduction <add>, %load, %cst  [0, 1] : vector<64x128xf32> to f32

we could distribute to sg as follows:

vector.multi_reduction <add>, %load, %cst  [0, 1] : vector<32x32xf32> to f32 // sg-local

memref.alloca ... // allocate SLM
// barrier, handle SG-local results

Then the subgroup level decides how best to handle the sg-local 2D reduction.
The SG level can choose some HW-specific 2D reduction, or decompose it into multiple lane-level (1D vector) reductions.
For example, the sg-distribution input after rewrite/optimization might look like this:

%partial = vector.multi_reduction <add>, %load, %cst [1] : vector<32x32xf32> to vector<32xf32> // intra-lane
vector.multi_reduction <add>, %partial, %cst [0] : vector<32xf32> to f32 // cross-lane: generates shuffles based on layout

memref.alloca ... // allocate SLM
// barrier, handle SG-local results

The WG level only decides how to merge SG results (i.e., the f32 in the example above).
If we allowed the lane level to have 2D vectors, the SG level would only decide how to merge lane-level results, and the lane level would decide how best to handle its local 2D vector. But we do not allow that, so we decompose at the SG level, once we have a lane layout to reason about the decomposition.

This looks like a simple hierarchical merge, where each level cares only about how best to handle the result of the level below, and the lowest level decides how best to produce that result. I do not see why we should cross the hierarchy boundaries and, given only WG-level metadata, decide how the SG level should best handle its reduction. Am I missing something?

@nbpatel
Contributor Author

nbpatel commented Dec 11, 2025

  • if size(red_dims) > 1, rewrite the reduction into multiple reductions where each reduction is over single dims and return.

Why do we need to rewrite it as multiple reduction instructions doing one reduced dim at a time? Can't wg-to-sg do the multi-dim reduce locally first and then do the cross-subgroup reduction using SLM?
