
[TLX] Add tlx.threadfence() and tlx.threadfence_system()#1011

Open
dshi7 wants to merge 7 commits into main from daohang/tlx.threadfence

Conversation


@dshi7 dshi7 commented Mar 1, 2026

Triton TBE backward uses a multi-CTA cooperative pattern where multiple thread blocks atomically accumulate partial gradients, then the last block applies the optimizer update. This requires a GPU-scope memory fence between gradient writes and counter decrement — equivalent to CUDA's __threadfence(). TLX previously had no GPU/system-scope fence.

The new ops lower through TTNG_ThreadfenceOp to LLVM::FenceOp with the appropriate syncscope, which the NVPTX backend emits as fence.acq_rel.gpu / fence.acq_rel.sys.
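Roughly, the lowering produces an LLVM fence whose scope selects the PTX instruction; the exact syncscope strings below are an assumption for illustration, not taken from the patch:

```llvm
; Sketch only: syncscope names are illustrative
fence syncscope("device") acq_rel   ; GPU scope  -> fence.acq_rel.gpu
fence acq_rel                       ; system scope -> fence.acq_rel.sys
```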

The new APIs are combined with the existing tlx.fence_async_shared() into a single tlx.fence(scope) API, where scope is a required argument: "gpu", "sys", or "async_shared".
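A minimal Python sketch of how a unified fence(scope) entry point might validate its required scope argument (the names `_SCOPES` and `fence` and the returned string are hypothetical, not the actual TLX frontend):

```python
# Hypothetical sketch of a unified fence(scope) frontend; names and the
# returned string are illustrative, not the actual TLX source.
_SCOPES = {"gpu", "sys", "async_shared"}

def fence(scope):
    """Emit a memory fence at the given scope.

    scope is required and must be one of "gpu", "sys", or "async_shared".
    """
    if scope not in _SCOPES:
        raise ValueError(
            f"fence scope must be one of {sorted(_SCOPES)}, got {scope!r}"
        )
    return f"fence<{scope}>"  # stand-in for emitting the real IR op

print(fence("gpu"))  # prints fence<gpu>
```

Making scope a required argument (rather than defaulting to one scope) forces callers to state which visibility they need, which matters since a "gpu" fence is not a substitute for "sys" or "async_shared".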

Test plan:

  pytest python/test/unit/language/test_tlx.py::test_fence_gpu
  pytest python/test/unit/language/test_tlx.py::test_fence_sys
  pytest python/test/unit/language/test_tlx.py::test_descriptor_load
  pytest python/test/unit/language/test_tlx.py::test_descriptor_load_l2_cache_hint
  pytest python/test/unit/language/test_tlx.py::test_descriptor_store_l2_cache_hint
  pytest python/test/unit/language/test_tlx.py::test_descriptor_store_reduce
  pytest python/test/unit/language/test_tlx.py::test_descriptor_load_multicast
  pytest python/test/unit/language/test_tlx.py::test_dummy_layout_function_inlining
  pytest third_party/tlx/tutorials/testing/test_correctness.py

Internal use case D94839329

[TLX] Add tlx.threadfence() and tlx.threadfence_system() for GPU/system-scope memory fences


Authored with Claude.
@meta-cla meta-cla bot added the CLA Signed label Mar 1, 2026
@dshi7 dshi7 changed the title [TLX] Add tlx.threadfence() and tlx.threadfence_system() for GPU/syst… [TLX] Add tlx.threadfence() and tlx.threadfence_system() Mar 1, 2026

meta-codesync bot commented Mar 1, 2026

@dshi7 has imported this pull request. If you are a Meta employee, you can view this in D94837956.

@dshi7 dshi7 requested a review from htyu March 2, 2026 15:07

htyu commented Mar 2, 2026

Thanks for working on this! Can we give the APIs a more general name? Since we already have tlx.fence_async_shared, maybe a generic name like fence that covers both global and shared?

dshi7 added 4 commits March 2, 2026 14:02
Replaces tlx.threadfence(), tlx.threadfence_system(), and
tlx.fence_async_shared() with a single tlx.fence(scope) API where scope
is a required argument: "gpu", "sys", or "async_shared".
fence_async_shared is kept as a deprecated shim for backward compat.

Authored with Claude.

@htyu htyu left a comment


Looks good overall. Can you please update the README?

dshi7 added 2 commits March 2, 2026 17:10
Renames the MLIR op from ttng.threadfence to ttng.fence to match the
Python API rename done in the previous commit. Adds tlx.fence()
documentation to the README.

Authored with Claude.

@htyu htyu left a comment


LGTM!
