Merge OpenAI Triton commit 29009f1
#5587
Merged
This commit modifies the denorm behavior for precise sqrt: switching from FTZ (Flush To Zero) to denorm preservation.
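To illustrate the difference (a minimal sketch in plain Python, not the actual Triton/LLVM code path): under denorm preservation, the square root of a subnormal input is a perfectly representable normal number, while FTZ flushes the subnormal input to zero first and loses the result entirely.

```python
import math

denorm = 5e-324  # smallest positive subnormal double

# Denorm preservation: sqrt of a subnormal is a well-defined normal number
preserved = math.sqrt(denorm)
print(preserved > 0.0)  # True

# FTZ: the subnormal input is flushed to 0.0 before the sqrt runs
flushed_input = 0.0
print(math.sqrt(flushed_input))  # 0.0
```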
This change addresses an issue where, when a LoadOp and an AddfOp sit between two dots in a loop, the LoadOp is not streamable in the AMDGPUPipeline pass. In that case the compiler would crash by erasing a LoadOp that still had uses. The fix replaces `loadToInfo` with `loadToStreamOps`, so that only LoadOps that were actually converted to stream ops are erased.
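A toy model of the fix (the names `pipeline_loads` and `streamable` are illustrative, not the actual pass code): track only the loads that were converted to stream ops, and erase exactly those, so a non-streamable load that still has uses survives untouched.

```python
# Hypothetical sketch: loadToStreamOps records only the converted loads,
# so erasure never touches a load that still has live uses.
def pipeline_loads(loads, streamable):
    load_to_stream_ops = {}  # replaces the old loadToInfo map
    for ld in loads:
        if streamable(ld):
            load_to_stream_ops[ld] = f"stream({ld})"
    # only loads that were actually converted get erased; the rest remain
    remaining = [ld for ld in loads if ld not in load_to_stream_ops]
    return load_to_stream_ops, remaining

converted, kept = pipeline_loads(["loadA", "loadB"], lambda ld: ld == "loadA")
print(kept)  # ['loadB'] -- the non-streamable load is left intact
```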
This PR enables buffer atomic on RDNA4 for supported data types.
Fixes upgrade to rocm7 breaking proton tests, alongside implementing CircularStoreOp for gmem.

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [ ] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [ ] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.
- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

Co-authored-by: danial javady <[email protected]>
Currently we limit WMMA v3's kWidth to be {2, 8, 16}, which matches the
hardware view for all possible WMMA instructions. In the case of
wmma_scaled, we assume kWidth is always 16. But in an attention kernel,
we can use kWidth = 8, which removes the layout convert between the two
dots. This does not match the hardware view of contiguous elements along
the k dimension, but we still get correct results as long as the kWidth
for the two operands is the same. This PR removes the kWidth check for
WMMA v3 and makes matching operand kWidth mandatory, same as MFMA.
…788) Broadcasts in the `block` dimensions are not redundant, so we should not mask them. This way each CTA has its own copy in shared memory; note that the multicast mask will be set in such cases to load the data efficiently.
We currently force initialisation of operands that have not yet been visited with `setToEntryState`. This means that the order in which values are visited can change the results of the analysis, which can be a source of bugs. For example, the lowering for `AsyncCopyGlobalToLocalOp` validates that the load addresses permit sufficient vectorisation; however, this relies on the analysis recovering the same information it had when the async copy was created. Otherwise, we crash during lowering. I have an actual repro for this, but it has been very difficult to minimise it enough to make it suitable for a lit test: https://gist.github.com/neildhar/7eea6a312afa39d1cc83dc12627c2ba3

Populating the operands in this way also means that we have to handle control flow like `ForOp` and `IfOp` explicitly in `setToEntryState`, because we may attempt to populate their results when we visit their users. Instead, when we encounter an operation whose operands have not yet been encountered, skip over the operation entirely; we can revisit it once the operands have actually been visited. This improves the quality of the analysis and leaves the handling of control flow to the dataflow framework.

This reland adds handling for the case where the dataflow analysis fails to initialise a particular value (likely because it is determined to be dead).
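The skip-and-revisit idea can be sketched with a toy fixpoint loop (this is an illustrative model, not the MLIR dataflow framework): instead of forcing unvisited operands to an entry state, defer the op and process it once its operands have been visited, so the result no longer depends on the initial visit order. The sketch assumes an acyclic use-def graph.

```python
from collections import deque

def analyze(ops, operands):
    """Toy order-independent analysis over an acyclic op graph.

    ops: op names in an arbitrary initial visit order.
    operands: op -> list of operand ops (empty for sources).
    """
    state = {}
    worklist = deque(ops)
    while worklist:
        op = worklist.popleft()
        if any(d not in state for d in operands[op]):
            worklist.append(op)  # operands not visited yet: defer, don't guess
            continue
        # toy transfer function: depth from the source ops
        state[op] = 1 + max((state[d] for d in operands[op]), default=0)
    return state

graph = {"a": [], "b": ["a"], "c": ["a", "b"]}
# same result regardless of the order in which ops are first visited
print(analyze(["c", "b", "a"], graph) == analyze(["a", "b", "c"], graph))  # True
```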
Any mxfp case that is natively supported requires using the persistent matmul kernel. In these cases, do not use heuristics to resolve `is_persistent`.
While poking around in this code, I noticed this optimization only supports tensors. This PR generalizes it to work on scalars as well.
Use FCmp + Select + FMul instead of llvm.copysign.f32. This avoids some perf regressions.
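A scalar sketch of the FCmp + Select + FMul pattern (illustrative Python; the PR emits the corresponding LLVM instructions). Note this version keys the sign off an ordered `>= 0.0` compare, so `-0.0` is treated as positive, unlike a bit-exact `llvm.copysign.f32`.

```python
def copysign_via_select(x, y):
    # FCmp (y >= 0.0) + Select (+1.0 or -1.0) ...
    sign = 1.0 if y >= 0.0 else -1.0
    # ... then FMul applies the chosen sign to the magnitude
    return abs(x) * sign

print(copysign_via_select(3.0, -2.0))  # -3.0
print(copysign_via_select(-3.0, 2.0))  # 3.0
```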
…798) This PR adds the `MemWaitOpTrait` trait, which is used to identify all wait instructions operating on memory. This allows the membar analysis to treat wait operations from third-party dialects the same way as native ones, and removes a workaround from the AMDGPU backend.
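The trait-based dispatch can be sketched as follows (a hypothetical model: the trait name comes from the PR, but the helper functions here are illustrative, not the actual membar analysis code). Any op carrying the trait counts as a memory wait, whether it comes from the native dialect or a third-party one.

```python
# Hypothetical sketch of trait-based dispatch in a membar-style analysis.
MEM_WAIT_TRAIT = "MemWaitOpTrait"

def is_memory_wait(op_traits):
    # one uniform check replaces per-dialect special cases
    return MEM_WAIT_TRAIT in op_traits

# a third-party wait op no longer needs a backend-specific workaround
print(is_memory_wait({"MemWaitOpTrait", "SomeThirdPartyTrait"}))  # True
print(is_memory_wait({"PureOpTrait"}))                            # False
```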
A default constructed `AxisInfo` passed as an operand to `AxisInfo::join` will always result in `join` returning the other operand. This means that we can call `join` unconditionally even when there is no existing entry in the map. This collapses the three separate map lookups (the check, the join, and the population) to just a single one.
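This identity-element property can be modelled with a toy lattice (a sketch, not the actual `AxisInfo` implementation: `None` stands in for the default-constructed value, and `min` for the real join). Because join with the identity returns the other operand, the map update no longer needs a separate existence check before joining.

```python
def join(a, b):
    # None plays the role of a default-constructed AxisInfo:
    # joining with it always yields the other operand
    if a is None:
        return b
    if b is None:
        return a
    return min(a, b)  # illustrative join, e.g. meeting known divisibility

def update(table, key, info):
    # unconditional join via .get: no separate check/join/populate steps
    table[key] = join(table.get(key), info)

table = {}
update(table, "x", 8)
print(table["x"])  # 8 (join with the identity keeps the new info)
update(table, "x", 4)
print(table["x"])  # 4
```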
* patches the workaround for the loop scheduler by using the stage/cluster from the previous tmem access op in the partition to set the stage/cluster for the put.exit op and, if needed, for the follow-up put.enter op
…xt of printing. (#8682)
…is (#8758)" This reverts commit 31281bc.
whitneywhtsang approved these changes on Dec 1, 2025.
This PR changes the Triton base from e7fb841 to 29009f1 (Nov 22).
Pass rate: 95.42%