[MLIR][XeGPU] Add support for cross-subgroup reduction from wg to sg #170936
Conversation
🐧 Linux x64 Test Results
✅ The build succeeded and all tests passed.
cc @akroviakov
auto reductionDims = llvm::to_vector(op.getReductionDims());
if (reductionDims.size() != 1)
  return rewriter.notifyMatchFailure(
      op, "Only single dimension reduction is supported");
What prevents 2D reductions here?
It's one of the requirements for the XeGPU canonical form; that pass should ensure only single-dimension reductions reach here.
But then we face a problem. If there is a 2D test case, we have to rewrite it as two 1D reductions first. From what I see, this pattern naturally supports intra-sg reduction and then handles the cross-sg results.
If we were to consider the 2D case, the pattern already has most of the components for the hardcoded logic: do the intra-sg reduction and then the cross-sg one via SLM. We do not care how "2D" is represented at lower levels.
When we go lower and start to actually care how the sg-local 2D reduction is executed, we have to do two 1D reductions. We decide on the order based on the layout (we first reduce the dimension that does not require shuffles, if any).
However, if we are forced to split the 2D reduction into two 1D reductions at the wg level, we lose the ability to reason about the better order, because we do not require a lane layout at the WG level and cannot use it when splitting.
Please correct me if I missed something.
The restriction/requirement is driven by the implementation, not by users. So if our implementation can be improved to lift the restriction, we should try.
I agree with @akroviakov. We should handle multiple dims here, but for now this is fine.
xegpu::StoreMatrixOp::create(rewriter, loc, storeData, memDesc.getResult(),
                             storeOffsets2D, /*layout=*/nullptr);
...
gpu::BarrierOp::create(rewriter, loc);
To sync the producer and consumer subgroups on the data, both a barrier and a fence are needed.
int64_t reductionDim = reductionDims[0];
bool needsCrossSubgroupReduction = (sgLayout[reductionDim] > 1);
...
// If no cross-subgroup reduction needed, add accumulator and return
The code could use some helper functions so the main function becomes shorter.
charithaintc left a comment
Overall direction looks good. Will do another review once comments are addressed.
/// so that reduction is local to subgroup & no cross-subgroup communication is
/// needed.
/// TODO: Add cases to handle more general situations which require SLM access.
// This pattern transforms vector.multi_dim_reduction ops to work at subgroup
Please add a summary of your algorithm here.
auto accs = adaptor.getAcc();
...
SmallVector<Value> expandedAccs;
if (accs.size() == 1 && sources.size() > 1) {
what is this case?
int64_t totalResultElements = localElements;
for (size_t i = 0; i < sgLayout.size(); ++i) {
  if (!llvm::is_contained(reductionDims, static_cast<int64_t>(i)))
    totalResultElements *= sgLayout[i];
}
This can be simplified with computeProduct and a divide by the reduction dim's size.
...
auto loadOp = xegpu::LoadMatrixOp::create(
    rewriter, loc, loadType2D, memDesc.getResult(), loadOffsets2D,
    /*layout=*/nullptr);
We need a barrier here as well, to make sure everyone finishes loading the values?