[TLX] Add tlx.threadfence() and tlx.threadfence_system()#1011
Open
Conversation
Contributor
Thanks for working on this! Can we give the APIs a more general name? Since we already have tlx.fence_async_shared().
Replaces tlx.threadfence(), tlx.threadfence_system(), and tlx.fence_async_shared() with a single tlx.fence(scope) API, where scope is a required argument: "gpu", "sys", or "async_shared". fence_async_shared is kept as a deprecated shim for backward compatibility. Authored with Claude.
htyu (Contributor) reviewed on Mar 2, 2026:
Looks good overall. Can you please update the README?
include/triton/Dialect/TritonNvidiaGPU/IR/TritonNvidiaGPUOps.td
Renames the MLIR op from ttng.threadfence to ttng.fence to match the Python API rename done in the previous commit. Adds tlx.fence() documentation to the README. Authored with Claude.
Triton TBE backward uses a multi-CTA cooperative pattern where multiple thread blocks atomically accumulate partial gradients, then the last block applies the optimizer update. This requires a GPU-scope memory fence between gradient writes and counter decrement — equivalent to CUDA's __threadfence(). TLX previously had no GPU/system-scope fence.
The new ops lower through TTNG_ThreadfenceOp to LLVM::FenceOp with the appropriate syncscope, which the NVPTX backend emits as fence.acq_rel.gpu / fence.acq_rel.sys.
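The cooperative pattern above can be pictured with a CPU-thread analogy. This is a sketch, not the Triton kernel: each "block" adds its partial gradient and decrements a shared arrival counter, and the last block to arrive applies the optimizer update. On the GPU, a gpu-scope fence (tlx.fence("gpu"), i.e. CUDA's __threadfence()) must sit between the gradient write and the counter decrement so the last block is guaranteed to observe every partial; here a lock provides that ordering instead. All names and values below are illustrative.

```python
import threading

NUM_BLOCKS = 8
LR = 0.125                    # learning rate chosen so the float math stays exact

grad = [0.0]                  # shared partial-gradient accumulator
counter = [NUM_BLOCKS]        # "blocks" still in flight
weight = [1.0]                # parameter updated by the last block
lock = threading.Lock()

def block(partial):
    with lock:                # models atomicAdd plus the required ordering
        grad[0] += partial    # accumulate this block's partial gradient
        counter[0] -= 1       # on a GPU, the fence goes right before this line
        if counter[0] == 0:   # last block to arrive applies the optimizer update
            weight[0] -= LR * grad[0]

threads = [threading.Thread(target=block, args=(1.0,)) for _ in range(NUM_BLOCKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# weight[0] == 1.0 - 0.125 * 8.0 == 0.0
```

Without the fence (here, without the lock's ordering), the last block could decrement the counter before its gradient write is visible, and the update would read a stale accumulator.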
Combines the new APIs and the existing tlx.fence_async_shared() into a single tlx.fence(scope) API, where scope is a required argument: "gpu", "sys", or "async_shared".
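The shape of the unified API can be sketched in plain Python. This is illustrative only, not the actual TLX frontend code: the _VALID_SCOPES tuple, the returned string standing in for real IR emission, and the warning text are all assumptions modeling the scope validation and the deprecated shim described in this PR.

```python
import warnings

# Hypothetical sketch of the unified fence API; the real implementation
# lives in Triton's TLX frontend and emits the ttng.fence MLIR op.
_VALID_SCOPES = ("gpu", "sys", "async_shared")

def fence(scope):
    """Unified memory fence; scope is required: "gpu", "sys", or "async_shared"."""
    if scope not in _VALID_SCOPES:
        raise ValueError(
            f"invalid fence scope {scope!r}; expected one of {_VALID_SCOPES}")
    return f"ttng.fence({scope})"  # stand-in for emitting the MLIR op

def fence_async_shared():
    """Deprecated shim kept for backward compatibility."""
    warnings.warn(
        "fence_async_shared() is deprecated; use fence('async_shared')",
        DeprecationWarning, stacklevel=2)
    return fence("async_shared")
```

Making scope a required argument (rather than defaulting to "gpu") keeps every call site explicit about which visibility level it is paying for, since a sys-scope fence is considerably more expensive than a gpu-scope one.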
Test plan:
Internal use case D94839329