[TLX] Add tlx.threadfence() and tlx.threadfence_system()#1011
Open
Conversation
Contributor
Thanks for working on this! Can we give the APIs a more general name? Since we already have tlx.fence_async_shared().
Replaces tlx.threadfence(), tlx.threadfence_system(), and tlx.fence_async_shared() with a single tlx.fence(scope) API, where scope is a required argument: "gpu", "sys", or "async_shared". fence_async_shared is kept as a deprecated shim for backward compatibility. Authored with Claude.
htyu (Contributor) reviewed on Mar 2, 2026:
Looks good overall. Can you please update the README?
include/triton/Dialect/TritonNvidiaGPU/IR/TritonNvidiaGPUOps.td
Renames the MLIR op from ttng.threadfence to ttng.fence to match the Python API rename done in the previous commit. Adds tlx.fence() documentation to the README. Authored with Claude.
Triton TBE backward uses a multi-CTA cooperative pattern where multiple thread blocks atomically accumulate partial gradients, then the last block applies the optimizer update. This requires a GPU-scope memory fence between gradient writes and counter decrement — equivalent to CUDA's __threadfence(). TLX previously had no GPU/system-scope fence.
The new ops lower through TTNG_ThreadfenceOp to LLVM::FenceOp with the appropriate syncscope, which the NVPTX backend emits as fence.acq_rel.gpu / fence.acq_rel.sys.
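The cooperative pattern above can be pictured with a CPU-thread analogy. This is a sketch, not the Triton kernel: each "block" adds its partial gradient and decrements a shared arrival counter, and the last block to arrive applies the optimizer update. On the GPU, a gpu-scope fence (tlx.fence("gpu"), i.e. CUDA's __threadfence()) must sit between the gradient write and the counter decrement so the last block is guaranteed to observe every partial; here a lock provides that ordering instead. All names and values below are illustrative.

```python
import threading

NUM_BLOCKS = 8
LR = 0.125                    # learning rate chosen so the float math stays exact

grad = [0.0]                  # shared partial-gradient accumulator
counter = [NUM_BLOCKS]        # "blocks" still in flight
weight = [1.0]                # parameter updated by the last block
lock = threading.Lock()

def block(partial):
    with lock:                # models atomicAdd plus the required ordering
        grad[0] += partial    # accumulate this block's partial gradient
        counter[0] -= 1       # on a GPU, the fence goes right before this line
        if counter[0] == 0:   # last block to arrive applies the optimizer update
            weight[0] -= LR * grad[0]

threads = [threading.Thread(target=block, args=(1.0,)) for _ in range(NUM_BLOCKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# weight[0] == 1.0 - 0.125 * 8.0 == 0.0
```

Without the fence (here, without the lock's ordering), the last block could decrement the counter before its gradient write is visible, and the update would read a stale accumulator.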
Combines the new APIs and the existing tlx.fence_async_shared() into a single tlx.fence(scope) API, where scope is a required argument: "gpu", "sys", or "async_shared".
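The shape of the unified API can be sketched in plain Python. This is illustrative only, not the actual TLX frontend code: the _VALID_SCOPES tuple, the returned string standing in for real IR emission, and the warning text are all assumptions modeling the scope validation and the deprecated shim described in this PR.

```python
import warnings

# Hypothetical sketch of the unified fence API; the real implementation
# lives in Triton's TLX frontend and emits the ttng.fence MLIR op.
_VALID_SCOPES = ("gpu", "sys", "async_shared")

def fence(scope):
    """Unified memory fence; scope is required: "gpu", "sys", or "async_shared"."""
    if scope not in _VALID_SCOPES:
        raise ValueError(
            f"invalid fence scope {scope!r}; expected one of {_VALID_SCOPES}")
    return f"ttng.fence({scope})"  # stand-in for emitting the MLIR op

def fence_async_shared():
    """Deprecated shim kept for backward compatibility."""
    warnings.warn(
        "fence_async_shared() is deprecated; use fence('async_shared')",
        DeprecationWarning, stacklevel=2)
    return fence("async_shared")
```

Making scope a required argument (rather than defaulting to "gpu") keeps every call site explicit about which visibility level it is paying for, since a sys-scope fence is considerably more expensive than a gpu-scope one.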
Test plan:
Internal use case D94839329