Conversation

@lezcano
Contributor

@lezcano lezcano commented Nov 25, 2025

The semantics here are that it's the user's/compiler's responsibility to
add the relevant synchronisation if they reuse the same shmem buffer,
but otherwise the compiler will do so.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines 1144 to 1147
if (auto blockArg = dyn_cast<BlockArgument>(cur)) {
auto yield = cast<scf::YieldOp>(blockArg.getOwner()->getTerminator());
cur = yield.getOperand(blockArg.getArgNumber() - 1);
} else {

P1: Loop-carried memdesc lookup underflows on non-for regions

The new findShmemAlloc assumes every memdesc BlockArgument is preceded by an induction variable and pulls the defining value with yield.getOperand(blockArg.getArgNumber() - 1). In scf.while/other regions whose block arguments are one-to-one with the yielded values, getArgNumber() can be 0, so subtracting 1 dereferences a negative index and trips an assertion/UB when membar analysis touches a memdesc carried through such a loop. This is a regression from the previous implementation (in TritonGPU/Transforms/Utility.cpp) which only subtracted one for scf.for blocks and otherwise handled block args directly.
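
For context, a minimal sketch of the guard the finding describes (names mirror the excerpt above; it assumes the enclosing region terminates with an scf.yield, and is illustrative rather than the PR's code):

if (auto blockArg = dyn_cast<BlockArgument>(cur)) {
  Block *block = blockArg.getOwner();
  auto yield = cast<scf::YieldOp>(block->getTerminator());
  // In scf.for the leading block argument is the induction variable, so
  // iter_arg i corresponds to yielded operand i - 1. In regions whose
  // block arguments map one-to-one to the yielded values (e.g. the after
  // region of scf.while), no offset is needed.
  unsigned idx = blockArg.getArgNumber();
  if (isa<scf::ForOp>(block->getParentOp()))
    idx -= 1;
  cur = yield.getOperand(idx);
}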

Collaborator

@ThomasRaoux ThomasRaoux left a comment

I wonder if we could separate it from membar in order to decouple the logic

auto writesUFDS = BlockInfo::UFDS(numCTAs);
return {readsUFDS, writesUFDS};
}
} else if (auto tma =
Collaborator

might be worth having an interface so that we don't need to add every op explicitly in membar

Contributor Author

I can do it for tma and ops that take a memdesc explicitly (so they have effects), but for the other ops (just convert_layout and reduce) I have to do it manually, I think. I can do it manually for both of them in one go, though.
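
A hypothetical sketch of what that split could look like (not the PR's code; the helper name and the use of Triton's SharedMemory effect resource are assumptions):

static bool touchesSharedMemory(Operation *op) {
  // Ops that take a memdesc explicitly declare memory effects on shared
  // memory, so a generic effects query covers them.
  if (auto iface = dyn_cast<MemoryEffectOpInterface>(op)) {
    SmallVector<MemoryEffects::EffectInstance> effects;
    iface.getEffects(effects);
    for (auto &effect : effects)
      if (effect.getResource() == triton::gpu::SharedMemory::get())
        return true;
  }
  // convert_layout and reduce stage data through shared memory without a
  // memdesc operand, so they have to be special-cased by hand.
  return isa<triton::gpu::ConvertLayoutOp, triton::ReduceOp>(op);
}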

// Skip if filtered or both ops touch the same explicit shared
// allocation (same local_alloc).
return !((filter && filter(lhsOp, rhsOp)) ||
(joined.isDistributed() && haveSameAlloc(lhsOp, rhsOp)));
Collaborator

Why do we have to check haveSameAlloc here? We should know that those intersect already.
I don't think it is safe to assume that we can always track back the alloc.

Contributor Author

Changed the alloc tracking to a backwardSlice. I'd hope this then gives us a pessimistic but correct analysis.

The context as to why we need to do this is in the OP.
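
A rough sketch of what backward-slice based alloc tracking could look like, using MLIR's getBackwardSlice (illustrative; findPossibleAllocs and the exact signature, which varies across MLIR versions, are assumptions, not the PR's code):

static SetVector<Operation *> findPossibleAllocs(Value memdesc) {
  SetVector<Operation *> slice;
  // Block arguments have no defining op; leaving the slice empty forces
  // callers to conservatively assume the buffers may alias.
  if (Operation *def = memdesc.getDefiningOp())
    getBackwardSlice(def, &slice);
  // Keep only the explicit shared-memory allocations found in the slice.
  SetVector<Operation *> allocs;
  for (Operation *op : slice)
    if (isa<triton::gpu::LocalAllocOp>(op))
      allocs.insert(op);
  return allocs;
}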

Contributor

@Jokeren Jokeren left a comment

OK, I roughly understand the ideas here. Will wait for @lezcano to ping me once he considers it ready for review.

OpBuilder::InsertionGuard g(*builder);
auto barrierOp = triton::gpu::LocalBarrierOp::create(*builder, op->getLoc());
if (ctaClasses.isDistributed()) {
// TODO Insert a finer barrier when there is more than one CTA class
Contributor

What does "finer" barrier mean here? Can you clarify?

Contributor Author

The idea is that using mbarriers we can often do better than synchronising all the CTAs.

struct BlockInfo {
using IntervalMapT = std::map<Interval<size_t>, std::set<Operation *>>;
// UFDS to represent cross-CTA reads/writes
struct UFDS {
Contributor

Might be better to make the name more explicit so that the struct name self-explains. e.g., CrossCTAUnionFindSet.

Contributor Author

Going with CTA_UFDS to keep things tight. I will expand the name in the comment above the class, though.
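
For readers unfamiliar with the acronym, a UFDS (union-find disjoint set) over CTA ids boils down to something like this minimal, self-contained sketch (generic illustration, not the PR's BlockInfo::UFDS):

#include <numeric>
#include <utility>
#include <vector>

struct CTA_UFDS {
  std::vector<int> parent, size;
  explicit CTA_UFDS(int numCTAs) : parent(numCTAs), size(numCTAs, 1) {
    std::iota(parent.begin(), parent.end(), 0); // each CTA starts alone
  }
  int find(int x) {
    while (parent[x] != x)
      x = parent[x] = parent[parent[x]]; // path halving
    return x;
  }
  // Record that two CTAs touch the same buffer and hence belong to the
  // same synchronisation class.
  void unite(int a, int b) {
    a = find(a);
    b = find(b);
    if (a == b)
      return;
    if (size[a] < size[b])
      std::swap(a, b); // union by size
    parent[b] = a;
    size[a] += size[b];
  }
};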

@lezcano
Contributor Author

lezcano commented Nov 26, 2025

I wonder if we could separate it from membar in order to decouple the logic

If we decouple it, we are going to generate strictly worse code. The issue is that we'd first run the membar pass and put a bar.sync between each pair of ops that need CGA synchronisation, and then in the next pass we would do a CGA sync, which is rather wasteful.

@lezcano lezcano marked this pull request as draft November 30, 2025 09:27