Conversation

@lezcano
Contributor

@lezcano lezcano commented Nov 25, 2025

The semantics here are that it's the user's/compiler's responsibility to
add the relevant synchronisation if they reuse the same shmem buffer,
but otherwise the compiler will do so.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines 1144 to 1147
if (auto blockArg = dyn_cast<BlockArgument>(cur)) {
auto yield = cast<scf::YieldOp>(blockArg.getOwner()->getTerminator());
cur = yield.getOperand(blockArg.getArgNumber() - 1);
} else {

P1: Loop-carried memdesc lookup underflows on non-for regions

The new findShmemAlloc assumes every memdesc BlockArgument is preceded by an induction variable and pulls the defining value with yield.getOperand(blockArg.getArgNumber() - 1). In scf.while/other regions whose block arguments are one-to-one with the yielded values, getArgNumber() can be 0, so subtracting 1 dereferences a negative index and trips an assertion/UB when membar analysis touches a memdesc carried through such a loop. This is a regression from the previous implementation (in TritonGPU/Transforms/Utility.cpp) which only subtracted one for scf.for blocks and otherwise handled block args directly.
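
For context, a minimal sketch of the guard the finding describes (names mirror the excerpt above; it assumes the enclosing region terminates with an scf.yield, and is illustrative rather than the PR's code):

if (auto blockArg = dyn_cast<BlockArgument>(cur)) {
  Block *block = blockArg.getOwner();
  auto yield = cast<scf::YieldOp>(block->getTerminator());
  // In scf.for the leading block argument is the induction variable, so
  // iter_arg i corresponds to yielded operand i - 1. In regions whose
  // block arguments map one-to-one to the yielded values (e.g. the after
  // region of scf.while), no offset is needed.
  unsigned idx = blockArg.getArgNumber();
  if (isa<scf::ForOp>(block->getParentOp()))
    idx -= 1;
  cur = yield.getOperand(idx);
}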

Collaborator

@ThomasRaoux ThomasRaoux left a comment

I wonder if we could separate it from membar in order to decouple the logic

auto writesUFDS = BlockInfo::UFDS(numCTAs);
return {readsUFDS, writesUFDS};
}
} else if (auto tma =
Collaborator

might be worth having an interface so that we don't need to add every op explicitly in membar

Contributor Author

I can do it for tma and ops that take a memdesc explicitly (so they have effects), but for the other ops (just convert_layout and reduce) I have to do it manually, I think. I can do it manually for both of them in one go, though.
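
A hypothetical sketch of what that split could look like (not the PR's code; the helper name and the use of Triton's SharedMemory effect resource are assumptions):

static bool touchesSharedMemory(Operation *op) {
  // Ops that take a memdesc explicitly declare memory effects on shared
  // memory, so a generic effects query covers them.
  if (auto iface = dyn_cast<MemoryEffectOpInterface>(op)) {
    SmallVector<MemoryEffects::EffectInstance> effects;
    iface.getEffects(effects);
    for (auto &effect : effects)
      if (effect.getResource() == triton::gpu::SharedMemory::get())
        return true;
  }
  // convert_layout and reduce stage data through shared memory without a
  // memdesc operand, so they have to be special-cased by hand.
  return isa<triton::gpu::ConvertLayoutOp, triton::ReduceOp>(op);
}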

// Skip if filtered or both ops touch the same explicit shared
// allocation (same local_alloc).
return !((filter && filter(lhsOp, rhsOp)) ||
(joined.isDistributed() && haveSameAlloc(lhsOp, rhsOp)));
Collaborator

Why do we have to check haveSameAlloc here? We should know that those intersect already.
I don't think it is safe to assume that we can always track back the alloc.

Contributor Author

Changed the alloc tracking to a backwardSlice. I'd hope this then gives us a pessimistic but correct analysis.

The context as to why we need to do this is in the OP.
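
A rough sketch of what backward-slice based alloc tracking could look like, using MLIR's getBackwardSlice (illustrative; findPossibleAllocs and the exact signature, which varies across MLIR versions, are assumptions, not the PR's code):

static SetVector<Operation *> findPossibleAllocs(Value memdesc) {
  SetVector<Operation *> slice;
  // Block arguments have no defining op; leaving the slice empty forces
  // callers to conservatively assume the buffers may alias.
  if (Operation *def = memdesc.getDefiningOp())
    getBackwardSlice(def, &slice);
  // Keep only the explicit shared-memory allocations found in the slice.
  SetVector<Operation *> allocs;
  for (Operation *op : slice)
    if (isa<triton::gpu::LocalAllocOp>(op))
      allocs.insert(op);
  return allocs;
}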

Contributor

@Jokeren Jokeren left a comment

OK, I roughly understand the ideas here. Will wait for @lezcano to ping me once he considers it ready for review.

OpBuilder::InsertionGuard g(*builder);
auto barrierOp = triton::gpu::LocalBarrierOp::create(*builder, op->getLoc());
if (ctaClasses.isDistributed()) {
// TODO Insert a finer barrier when there is more than one CTA class
Contributor

What does "finer" barrier mean here? Can you clarify?

Contributor Author

The idea is that using mbarriers we can often do better than synchronising all the CTAs.

struct BlockInfo {
using IntervalMapT = std::map<Interval<size_t>, std::set<Operation *>>;
// UFDS to represent cross-CTA reads/writes
struct UFDS {
Contributor

Might be better to make the name more explicit so that the struct name self-explains. e.g., CrossCTAUnionFindSet.

Contributor Author

Going with CTA_UFDS to keep things tight. I will expand the name in the comment above the class, though.
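
For readers unfamiliar with the acronym, a UFDS (union-find disjoint set) over CTA ids boils down to something like this minimal, self-contained sketch (generic illustration, not the PR's BlockInfo::UFDS):

#include <numeric>
#include <utility>
#include <vector>

struct CTA_UFDS {
  std::vector<int> parent, size;
  explicit CTA_UFDS(int numCTAs) : parent(numCTAs), size(numCTAs, 1) {
    std::iota(parent.begin(), parent.end(), 0); // each CTA starts alone
  }
  int find(int x) {
    while (parent[x] != x)
      x = parent[x] = parent[parent[x]]; // path halving
    return x;
  }
  // Record that two CTAs touch the same buffer and hence belong to the
  // same synchronisation class.
  void unite(int a, int b) {
    a = find(a);
    b = find(b);
    if (a == b)
      return;
    if (size[a] < size[b])
      std::swap(a, b); // union by size
    parent[b] = a;
    size[a] += size[b];
  }
};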

@lezcano
Contributor Author

lezcano commented Nov 26, 2025

I wonder if we could separate it from membar in order to decouple the logic

If we decouple it, we are going to generate strictly worse code. The issue is that we'd first run the membar pass and put a bar.sync between each pair of ops that need CGA synchronisation, and then in the next pass we would do a CGA sync, which is rather wasteful.

@lezcano lezcano marked this pull request as draft November 30, 2025 09:27