[RELAND][LAYOUTS] Generate distributed layouts for tcgen05.ld/st generically (#8421)
#8495
Conversation
```cpp
    ttng::TritonNvidiaGPUDialect, gluon::GluonDialect>();

MLIRContext context(MLIRContext::Threading::DISABLED);
context.appendDialectRegistry(registry);
context.loadAllAvailableDialects();
```
Can you add an overload of `getDistributedLayoutForTmemLdSt` that doesn't require `MemDesc` and `CTALayoutAttr`, so we don't need to load the dialects?
Context creation alone is cheap (~30 µs), but once we load the dialects as well it goes up to ~380 µs.
Hrm actually I guess we would still need the dialects for the layout attrs either way.
Perhaps we could have a global "default context" that is used here?
This function is going to be called at most three times, even in very complex kernels. I'd say we leave it as-is, and if we ever find that this is an issue, we can look at how to fix it?
Do you need to load the dialects if all we care about are string attributes?
We could implement what Peter suggested in the first comment. Let's do it in a follow-up, though.
We move the previously handwritten logic that infers a good distributed layout for TMEM layouts into a generic implementation. This proves more robust than the previous one: as the lit tests show, we now get full vectorisation in many cases where we didn't before. Writing this generically also lets us add support for the two remaining tcgen05.ld/st instructions that were missing. We align the verifier and the lowering so that verification now errors out if and only if we would not be able to lower the given layout. We also expose this function in Gluon and remove the duplicated logic that Gluon had, in favour of the generic implementation.

**There is just one semantic change** (the rest is generalisation/strengthening): we now generate distributed layouts that produce fully vectorised load/store instructions (i.e. a single load/store instruction moves all the registers). As the `_blackwell.mlir` cases show, this was not previously the case. That said, in some cases the previous heuristics allowed `tt.split` of the tensor along the second dimension, while now this may not be possible. If one wants to perform this splitting, one needs to modify the layout slightly, as we do when calling `32x32b_splitn` in the test `test_tmem_subslice_block_m_64`. This change is in line with the rest of the heuristics, where we default to full vectorisation and expose other layouts, like `splitLongM`, for other use cases.
This PR relands #8386.
It depends on #8492 to avoid regressing in some workloads.
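As a rough illustration of the vectorisation point in the description (this is not Triton's actual implementation, and the register counts are hypothetical): the number of tcgen05.ld/st messages needed for a thread's registers is a ceiling division, and a "fully vectorised" layout is one where that count is 1.

```python
def num_messages(regs_per_thread: int, regs_per_message: int) -> int:
    """Illustrative only: how many ld/st messages are needed if each
    message can move at most `regs_per_message` 32-bit registers."""
    # Ceiling division: any leftover registers still cost a full message.
    return -(-regs_per_thread // regs_per_message)

# A layout matched to the full tile needs a single instruction...
assert num_messages(128, 128) == 1
# ...while a narrower message shape multiplies the instruction count.
assert num_messages(128, 16) == 8
```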