Conversation

@lezcano
Contributor
@lezcano lezcano commented Oct 20, 2025

This PR relands #8386.
It depends on #8492 to avoid regressing in some workloads.

@lezcano lezcano changed the title [LAYOUTS] Generate distributed layouts for tcgen05.ld/st generically (#8421) [RELAND][LAYOUTS] Generate distributed layouts for tcgen05.ld/st generically (#8421) Oct 20, 2025
@lezcano lezcano requested a review from ThomasRaoux October 20, 2025 22:17
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


    ttng::TritonNvidiaGPUDialect, gluon::GluonDialect>();
    MLIRContext context(MLIRContext::Threading::DISABLED);
    context.appendDialectRegistry(registry);
    context.loadAllAvailableDialects();
Contributor
Can you add an overload of getDistributedLayoutForTmemLdSt that doesn't require MemDesc and CTALayoutAttr so we don't need to load the dialects?

Context creation alone is cheap (30 us) but once we load the dialects as well it goes up to 380 us.

Contributor

Hrm actually I guess we would still need the dialects for the layout attrs either way.

Perhaps we could have a global "default context" that is used here?

Contributor Author
This function is going to be called at most 3 times, even in very complex kernels. I'd say we leave it as-is and, if this ever turns out to be an issue, we look into how to fix it?

Collaborator

Do you need to load the dialect if all we care about is string attributes?

Contributor Author

We could implement what Peter suggested in the first comment. Let's do it in a follow-up, though.

lezcano and others added 10 commits October 29, 2025 13:10
[LAYOUTS] Generate distributed layouts for tcgen05.ld/st generically (#8421)

We move the previous handwritten logic that inferred good distributed layouts for TMEM layouts to a generic implementation. This proves to be more robust than the previous one, as the lit tests show: we now achieve full vectorisation in many cases where we previously did not.

Writing this generically also lets us add support for the two remaining tcgen05.ld/st instructions that were missing.

We align the verifier and the lowering so that verification now errors out if and only if we would not be able to lower the given layout.

We also expose this function in gluon and remove the duplicated logic that gluon had in favour of the generic one.

**There is just one semantic change** (the rest is generalisation / strengthening): we now generate distributed layouts that yield fully vectorised load/store instructions (i.e. a single load/store instruction moves all the registers). As the `_blackwell.mlir` cases show, this did not use to be the case. That said, in some cases the previous heuristics allowed `tt.split`ting the tensor along the second dimension, which may no longer be possible. If one wants to play this splitting game, one needs to tweak the layout slightly, as we do when calling `32x32b_splitn` in the test `test_tmem_subslice_block_m_64`. This change is in line with the rest of the heuristics, where we get full vectorisation by default and expose other layouts for other use cases, like `splitLongM`.
@lezcano lezcano merged commit b620136 into main Oct 29, 2025
9 checks passed
@lezcano lezcano deleted the reland_ldst branch October 29, 2025 14:40