@anmyachev
Contributor

This PR changes the Triton base from e7fb841 to 6294db5 (Nov 21).

Pass rate: _

Blocked on #5582

ThomasRaoux and others added 11 commits December 1, 2025 10:59
This commit modifies the denorm behavior for precise sqrt: switching
from FTZ (Flush To Zero) to denorm preservation.
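
To illustrate the difference between the two denorm modes, here is a minimal Python model of a float32 square root with and without flush-to-zero. It is only a sketch: the helper names (`to_f32`, `flush_denorm`, `sqrt_ftz`, `sqrt_denorm`) are hypothetical and not part of Triton, and it assumes that FTZ replaces subnormal inputs and results with a sign-preserving zero.

```python
import math
import struct

# Smallest positive normal float32 value; anything smaller (and nonzero) is subnormal.
FLT_MIN_NORMAL = 2.0 ** -126


def to_f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def flush_denorm(x: float) -> float:
    """Model FTZ: replace a subnormal float32 value with a zero of the same sign."""
    if x != 0.0 and abs(x) < FLT_MIN_NORMAL:
        return math.copysign(0.0, x)
    return x


def sqrt_ftz(x: float) -> float:
    """Assumed FTZ behavior: subnormal inputs and results are flushed to zero."""
    return flush_denorm(to_f32(math.sqrt(flush_denorm(to_f32(x)))))


def sqrt_denorm(x: float) -> float:
    """Denorm-preserving behavior: subnormal inputs are used as they are."""
    return to_f32(math.sqrt(to_f32(x)))


x = 2.0 ** -140                       # a subnormal float32 input
print(sqrt_ftz(x))                    # input is flushed, so the result is zero
print(sqrt_denorm(x) == 2.0 ** -70)   # the subnormal input contributes its true value
```

With FTZ the subnormal input is lost before the square root is taken, whereas denorm preservation returns the mathematically expected result for the same input.
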
This change fixes a compiler crash in the AMDGPUPipeline pass. When a
LoadOp and an AddfOp sit between 2 dots in a loop, the LoadOp is not
streamable, and the pass crashed by erasing a LoadOp that still had
uses.

The solution is to replace `loadToInfo` with `loadToStreamOps`, so that
only LoadOps that were converted to stream ops are erased.

This PR enables buffer atomics on RDNA4 for supported data types.

Fixes the Proton tests broken by the upgrade to ROCm 7 and implements
CircularStoreOp for gmem.

<!---
The core Triton is a small number of people, and we receive many PRs (thank
you!).  To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the following
tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]` with
`[x]` to indicate you have done them.
-->

- [x] I am not making a trivial change, such as fixing a typo in a comment.

- [ ] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [ ] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
    including the "tests should be minimal" section. (Usually running Python code
    and using the instructions it generates is not minimal.)

---------

Co-authored-by: danial javady <[email protected]>
Currently we limit WMMA v3's kWidth to {2, 8, 16}, which matches the
hardware view for all possible WMMA instructions. In the case of
wmma_scaled, we assume kWidth is always 16. But in an attention kernel,
we can use kWidth = 8, which removes the layout conversion between the 2
dots. This does not match the hardware view of contiguous elements along
the k dimension, but we still get correct results as long as the kWidth
of the 2 operands is the same. This PR removes the kWidth check for
WMMA v3 and makes kWidth mandatory, same as MFMA.
…788)

Broadcasts in the `block` dimensions are not redundant, so we should not
mask them. This way each CTA has its own copy in shared memory. Note
that the multicast mask will be set in such cases to load the data
efficiently.

We currently force initialisation of operands that have not yet been
visited with `setToEntryState`. This means that the order in which
values are visited can change the results of the analysis.

This can be a source of bugs. For example, the lowering for
`AsyncCopyGlobalToLocalOp` validates that the load addresses permit
sufficient vectorisation. However, this relies on the analysis actually
recovering the same information it had when the async copy was created.
Otherwise, we crash during lowering. I have an actual repro for this, but
it has been very difficult to minimise it enough to make it suitable for
a lit test:
https://gist.github.com/neildhar/7eea6a312afa39d1cc83dc12627c2ba3

Populating the operands in this way also means that we have to handle
control flow like `ForOp` and `IfOp` explicitly in `setToEntryState`,
because we may attempt to populate their results when we visit their
users.

Instead, when we encounter an operation whose operands have not yet been
encountered, skip over the operation entirely. We can revisit it once
the operands have actually been visited. This improves the quality of
the analysis, and leaves the handling of control flow to the dataflow
framework.

This reland adds handling for the case where the dataflow analysis fails
to initialise a particular value (likely because it is determined to be
dead).
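
The "skip and revisit" strategy described above can be sketched with a toy worklist analysis in Python. This is not the MLIR dataflow framework or Triton's analysis: the function `analyze`, the `(result, operands)` encoding of operations, and the gcd-based transfer function (a stand-in for divisibility information) are all assumptions made for illustration.

```python
import math
from collections import deque
from functools import reduce


def analyze(ops, entry_states, transfer):
    """Toy forward analysis over (result, operands) pairs given in any order.

    An op whose operands have no state yet is skipped rather than
    pessimistically initialised; it is revisited once an operand gets a state.
    """
    state = dict(entry_states)

    # Map each value to the ops that use it, so they can be re-queued.
    users = {}
    for op in ops:
        for operand in op[1]:
            users.setdefault(operand, []).append(op)

    worklist = deque(ops)
    while worklist:
        result, operands = worklist.popleft()
        if any(operand not in state for operand in operands):
            continue  # operands not visited yet: skip, do not guess
        new_state = transfer([state[operand] for operand in operands])
        if state.get(result) != new_state:
            state[result] = new_state
            worklist.extend(users.get(result, []))
    return state


def gcd_transfer(operand_states):
    """Hypothetical transfer function: known divisor of a sum of operands."""
    return reduce(math.gcd, operand_states)


# "c" is listed before its operand "b" is defined; "d" depends on a value
# that never receives a state, so it stays uninitialised.
ops = [("c", ["a", "b"]), ("b", ["x", "x"]), ("d", ["never_defined"])]
state = analyze(ops, {"a": 16, "x": 8}, gcd_transfer)
print(state["c"])       # 8, regardless of the visiting order
print("d" in state)     # False: callers must handle the missing value
```

In this sketch the result for `c` no longer depends on whether `b` was visited first, and the uninitialised `d` mirrors the case the reland handles, where the analysis never produces a state for a value.
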
Any mxfp, where natively supported, requires using the persistent matmul
kernel. In these cases, do not use heuristics to resolve `is_persistent`.
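
As a rough sketch of that rule, the resolution of `is_persistent` could look like the following. Only the name `is_persistent` comes from the commit message; the function `resolve_is_persistent`, the `uses_native_mxfp` flag, and the `heuristic` callback are hypothetical, and the sketch assumes that native mxfp overrides any other setting.

```python
from typing import Callable, Optional


def resolve_is_persistent(
    is_persistent: Optional[bool],
    uses_native_mxfp: bool,
    heuristic: Callable[[], bool],
) -> bool:
    """Decide whether to use the persistent matmul kernel (illustrative only)."""
    if uses_native_mxfp:
        # Assumption: native mxfp requires the persistent kernel, so the
        # heuristic is never consulted in this case.
        return True
    if is_persistent is not None:
        # An explicit choice is respected when mxfp does not force the answer.
        return is_persistent
    return heuristic()


print(resolve_is_persistent(None, True, lambda: False))   # True: heuristic bypassed
print(resolve_is_persistent(None, False, lambda: False))  # False: heuristic decides
```
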