[BugFix] : Add missing __syncthreads() after AtomicAdd and replace address_ of with access_ptr by Dayuxiaoshui · Pull Request #1581 · tile-ai/tilelang

Dayuxiaoshui · 2025-12-31T07:24:14Z

This commit fixes two issues:

Issue [BUG] Missing __syncthreads() after AtomicAdd in generated CUDA kernel #1257: Add missing __syncthreads() after AtomicAdd operations
- Added ThreadSyncAfterAtomicInserter class in thread_storage_sync.cc
- Automatically inserts __syncthreads() after AtomicAdd on shared memory
- Integrated into TileLangThreadSync() pass
- Fixes synchronization issue in generated CUDA kernels
Issue [Feature Request] Replace all address_of with access_ptr #1423: Replace address_of with access_ptr (tvm_access_ptr)
- Updated atomic_add.cc to use MakeAccessPtrFromRegion
- Updated atomicadd_vectorize.cc to use MakeAccessPtrFromRegion
- Provides richer semantic information for analysis
- Maintains backward compatibility

Changes:

src/op/atomic_add.cc: Replace address_of with access_ptr in AtomicAdd operations
src/transform/atomicadd_vectorize.cc: Replace address_of with access_ptr in vectorization
src/transform/thread_storage_sync.cc: Add ThreadSyncAfterAtomicInserter for syncthreads insertion

Verification:

Compilation: Success
Runtime: Verified with test case
Generated CUDA code: Confirmed __syncthreads() is present after AtomicAdd

Summary by CodeRabbit

Bug Fixes
- Automatically insert thread synchronization barriers after atomic operations targeting shared memory to ensure correct visibility and ordering.
Refactor
- Standardized buffer argument handling to use region-based access pointers across atomic, vectorized, and transfer paths, improving consistency while preserving public interfaces and remaining compatible with legacy pointer forms.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

…of with access_ptr This commit fixes two issues: 1. Issue tile-ai#1257: Add missing __syncthreads() after AtomicAdd operations - Added ThreadSyncAfterAtomicInserter class in thread_storage_sync.cc - Automatically inserts __syncthreads() after AtomicAdd on shared memory - Integrated into TileLangThreadSync() pass - Fixes synchronization issue in generated CUDA kernels 2. Issue tile-ai#1423: Replace address_of with access_ptr (tvm_access_ptr) - Updated atomic_add.cc to use MakeAccessPtrFromRegion - Updated atomicadd_vectorize.cc to use MakeAccessPtrFromRegion - Provides richer semantic information for analysis - Maintains backward compatibility Changes: - src/op/atomic_add.cc: Replace address_of with access_ptr in AtomicAdd operations - src/transform/atomicadd_vectorize.cc: Replace address_of with access_ptr in vectorization - src/transform/thread_storage_sync.cc: Add ThreadSyncAfterAtomicInserter for syncthreads insertion Verification: - Compilation: Success - Runtime: Verified with test case - Generated CUDA code: Confirmed __syncthreads() is present after AtomicAdd

coderabbitai · 2025-12-31T07:24:24Z

📝 Walkthrough

Walkthrough

Replaces direct buffer address_of usage with region-derived access pointers (MakeAccessPtrFromRegion) in atomic-add lowering and vectorization; adds a pass that injects a tvm_storage_sync barrier after shared-memory atomic-adds, supporting both tvm_access_ptr and legacy address_of forms.

Changes

Cohort / File(s)	Change Summary
Atomic Add Lowering `src/op/atomic_add.cc`	Replace `tvm_address_of` usage with `BufferRegion` → `MakeAccessPtrFromRegion` for dst/src pointers in SIMT loop and TMA lowering; use explicit access types (read=1, write=2) and pass access_ptrs to tma_store.
Atomic Add Vectorization `src/transform/atomicadd_vectorize.cc`	Add `utils.h` include; convert BufferLoad/address_of first-argument handling to access pointers via `MakeAccessPtrFromRegion` across vectorized (x4/x2), scalar, and forwarding paths; forward access_ptrs instead of address_of.
Thread Storage Synchronization `src/transform/thread_storage_sync.cc`	Add `ThreadSyncAfterAtomicInserter` and integrate into TileLangThreadSync to insert `tvm_storage_sync` after AtomicAdd family calls when the target buffer is in shared storage; detects both `tvm_access_ptr` and legacy `address_of`+`BufferLoad` forms.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Vectorizer as AtomicAdd_Vectorize
  participant Lower as AtomicAdd_Lower
  participant Mem as SharedMemory
  participant Sync as ThreadSyncInserter

  Note over Vectorizer,Lower: New path: build BufferRegion → MakeAccessPtrFromRegion
  Vectorizer->>Lower: emit atomic call with access_ptr (dst/src)
  Lower->>Mem: perform atomic_add (tma_store / atomic op) using access_ptr
  Mem-->>Lower: atomic completed
  Lower->>Sync: Evaluate for insertion
  Sync->>Sync: detect AtomicAdd on shared buffer (access_ptr or address_of)
  Sync->>Mem: insert tvm_storage_sync (shared scope)
  Note over Sync,Mem: Barrier ensures visibility after atomics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

[Refactor] Phase out the primitives folder since its design has been merged into tileop #1429: Touches same atomic pointer/address handling and lowering/vectorization paths.
[Feature]: Add test for atomicadd auto vectorize and remove useless code #1019: Overlaps in atomic-add lowering and vectorization changes for buffer pointer handling.
[Bugfix] Implement classic arena algorithm for shmem merge and WAW conflict detection #1146: Related modifications to thread-sync insertion logic around atomic operations.

Suggested reviewers

LeiWang1999
Rachmanino

Poem

🐰 I hopped through regions, found a ptr so spry,

Replaced old addresses with access on the fly,
When shared bits add, I call for a sync,
Threads line up tidy, not a race to the brink,
I nibble code crumbs and leave a safer sky.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the two main changes in the PR: adding __syncthreads() after AtomicAdd and replacing address_of with access_ptr across three modified files.

✨ Finishing touches

📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2025-12-31T07:24:24Z

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

src/op/atomic_add.cc (1)

276-282: Remove unused variable dst_load.

The dst_load variable created on line 276 is never used. It appears to be leftover from a previous approach.
🔎 Proposed fix
-  BufferLoad dst_load = BufferLoad(dst, dst_indices);
   Array<Range> dst_ranges;
   for (const PrimExpr &index : dst_indices) {
     dst_ranges.push_back(Range::FromMinExtent(index, 1));
   }
   BufferRegion dst_region = BufferRegion(dst, dst_ranges);
   PrimExpr dst_ptr = MakeAccessPtrFromRegion(dst_region, 2); // 2 = write access

src/transform/atomicadd_vectorize.cc (1)

283-293: Simplify redundant conditional branches.

Both branches of the if/else statement perform the same action (new_args.push_back(node->args[0])), making the condition check unnecessary.

🔎 Proposed fix

         } else if (const auto *call = node->args[0].as<CallNode>()) {
-          // If it's already an address_of or access_ptr, forward it; otherwise, keep original.
-          if (call->op.same_as(builtin::address_of()) || 
-              call->op.same_as(builtin::tvm_access_ptr())) {
-            new_args.push_back(node->args[0]);
-          } else {
-            new_args.push_back(node->args[0]);
-          }
+          // Forward the call as-is (address_of, access_ptr, or other)
+          new_args.push_back(node->args[0]);
         } else {

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 53ea96c and 2cdedac.

📒 Files selected for processing (3)

src/op/atomic_add.cc
src/transform/atomicadd_vectorize.cc
src/transform/thread_storage_sync.cc

🧰 Additional context used

🧬 Code graph analysis (1)

src/op/atomic_add.cc (1)

src/op/utils.cc (2)

MakeAccessPtrFromRegion (55-93)

MakeAccessPtrFromRegion (55-56)

🔇 Additional comments (6)

src/op/atomic_add.cc (1)

389-406: LGTM!

The refactoring to use MakeAccessPtrFromRegion for both source and destination buffers in the TMA path is correct. Access modes are properly set (1 for read, 2 for write).

src/transform/atomicadd_vectorize.cc (2)

7-7: LGTM!

Necessary include for MakeAccessPtrFromRegion.

237-262: LGTM!

The conversion from address_of to MakeAccessPtrFromRegion is correctly implemented for all vector sizes (4, 2, and scalar). Access modes are properly set.

src/transform/thread_storage_sync.cc (3)

499-506: LGTM!

The class documentation and constructor are well-structured for inserting thread synchronization after atomic operations on shared memory.

507-557: Potential performance concern: sync inserted after every atomic operation.

This inserter adds a __syncthreads() after every individual AtomicAdd on shared memory. If there are multiple consecutive atomic operations, this will insert multiple barriers which may be excessive. Consider whether batching syncs or deferring to the existing ThreadSyncPlanner analysis would be more efficient.

However, for correctness, this conservative approach ensures proper visibility ordering.

893-896: LGTM!

The integration correctly applies ThreadSyncAfterAtomicInserter only for shared memory scope, and the ordering in the pass pipeline is appropriate.

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2cdedac and 294fbae.

📒 Files selected for processing (3)

src/op/atomic_add.cc
src/transform/atomicadd_vectorize.cc
src/transform/thread_storage_sync.cc

🚧 Files skipped from review as they are similar to previous changes (1)

src/op/atomic_add.cc

🧰 Additional context used

🧬 Code graph analysis (1)

src/transform/thread_storage_sync.cc (2)

src/transform/multi_version_buffer_rewriter.cc (2)

call (443-473)

call (443-444)

tilelang/language/tir/op.py (2)

call_extern (173-195)

tvm_storage_sync (534-547)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Test for Python 3.12 with Metal (on macos-latest)

🔇 Additional comments (6)

src/transform/atomicadd_vectorize.cc (4)

252-265: LGTM! Pointer usage is consistent.

The vectorized and scalar paths correctly use the converted dst_ptr and value_ptr (or value_node in scalar case) instead of the legacy address_of approach, maintaining consistency with the access_ptr-based representation.

278-286: LGTM! Non-vectorized conversion is correct.

The BufferLoad-to-access_ptr conversion in the non-vectorized path uses the correct write access flag (2) and extent of 1, which is appropriate for scalar operations.

288-295: LGTM! Backward compatibility maintained.

The forwarding logic correctly handles both address_of and tvm_access_ptr, ensuring backward compatibility during the transition to the new pointer representation.

237-251: Verify extent for vectorized accesses.

The code creates Range::FromMinExtent(index, 1) for each index, resulting in an extent of 1 in the final tvm_access_ptr. However, when vector_size_ is 2 or 4, the actual memory access spans multiple elements. The extent in the last dimension should likely be vector_size_ instead of 1 to accurately represent the accessed memory region for downstream analysis passes that rely on tvm_access_ptr extent information.

src/transform/thread_storage_sync.cc (2)

499-506: LGTM! Class structure follows established patterns.

The ThreadSyncAfterAtomicInserter class follows the same design pattern as ThreadSyncAfterWaitQueueInserter, maintaining consistency within the module.

898-901: LGTM! Integration is correctly scoped and positioned.

The ThreadSyncAfterAtomicInserter is correctly applied only for shared memory scope and positioned after the main ThreadSyncInserter, ensuring proper ordering of synchronization insertion passes.

src/transform/thread_storage_sync.cc

…branches, use empty()

Dayuxiaoshui · 2025-12-31T16:04:40Z

@LeiWang1999 We have successfully fixed issue #1257 by inserting __syncthreads() synchronization after AtomicAdd operations on shared memory. The fix has been verified: it correctly inserts sync for shared memory AtomicAdd operations and does not insert sync for global memory AtomicAdd operations. There are 5 failing tests in CI related to backward pass, but all AtomicAdd operations in these tests target global memory (dQ, dV, dK), which should not be affected by our fix. The basic fix has been verified as correct, and the failing tests are likely caused by other factors that require further investigation.

coderabbitai bot reviewed Dec 31, 2025

View reviewed changes

style: Apply clang-format fixes from pre-commit

294fbae

coderabbitai bot reviewed Dec 31, 2025

View reviewed changes

src/transform/thread_storage_sync.cc Show resolved Hide resolved

fix: Address code review comments - remove unused variable, simplify …

b370d87

…branches, use empty()

Dayuxiaoshui changed the title ~~fix: Add missing __syncthreads() after AtomicAdd and replace address_ of with access_ptr~~ [BugFix] : Add missing __syncthreads() after AtomicAdd and replace address_ of with access_ptr Dec 31, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BugFix] : Add missing __syncthreads() after AtomicAdd and replace address_ of with access_ptr#1581

[BugFix] : Add missing __syncthreads() after AtomicAdd and replace address_ of with access_ptr#1581
Dayuxiaoshui wants to merge 3 commits intotile-ai:mainfrom
Dayuxiaoshui:main

Dayuxiaoshui commented Dec 31, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Dec 31, 2025 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

github-actions bot commented Dec 31, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Dayuxiaoshui commented Dec 31, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Dayuxiaoshui commented Dec 31, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Dec 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Pre-merge checks and finishing touches

Uh oh!

github-actions bot commented Dec 31, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Dayuxiaoshui commented Dec 31, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Dayuxiaoshui commented Dec 31, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 31, 2025 •

edited

Loading