[BugFix] : Add missing __syncthreads() after AtomicAdd and replace address_ of with access_ptr#1581
[BugFix] : Add missing __syncthreads() after AtomicAdd and replace address_ of with access_ptr#1581Dayuxiaoshui wants to merge 3 commits intotile-ai:mainfrom
Conversation
…of with access_ptr This commit fixes two issues: 1. Issue tile-ai#1257: Add missing __syncthreads() after AtomicAdd operations - Added ThreadSyncAfterAtomicInserter class in thread_storage_sync.cc - Automatically inserts __syncthreads() after AtomicAdd on shared memory - Integrated into TileLangThreadSync() pass - Fixes synchronization issue in generated CUDA kernels 2. Issue tile-ai#1423: Replace address_of with access_ptr (tvm_access_ptr) - Updated atomic_add.cc to use MakeAccessPtrFromRegion - Updated atomicadd_vectorize.cc to use MakeAccessPtrFromRegion - Provides richer semantic information for analysis - Maintains backward compatibility Changes: - src/op/atomic_add.cc: Replace address_of with access_ptr in AtomicAdd operations - src/transform/atomicadd_vectorize.cc: Replace address_of with access_ptr in vectorization - src/transform/thread_storage_sync.cc: Add ThreadSyncAfterAtomicInserter for syncthreads insertion Verification: - Compilation: Success - Runtime: Verified with test case - Generated CUDA code: Confirmed __syncthreads() is present after AtomicAdd
📝 WalkthroughWalkthroughReplaces direct buffer address_of usage with region-derived access pointers (MakeAccessPtrFromRegion) in atomic-add lowering and vectorization; adds a pass that injects a tvm_storage_sync barrier after shared-memory atomic-adds, supporting both tvm_access_ptr and legacy address_of forms. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Vectorizer as AtomicAdd_Vectorize
participant Lower as AtomicAdd_Lower
participant Mem as SharedMemory
participant Sync as ThreadSyncInserter
Note over Vectorizer,Lower: New path: build BufferRegion → MakeAccessPtrFromRegion
Vectorizer->>Lower: emit atomic call with access_ptr (dst/src)
Lower->>Mem: perform atomic_add (tma_store / atomic op) using access_ptr
Mem-->>Lower: atomic completed
Lower->>Sync: Evaluate for insertion
Sync->>Sync: detect AtomicAdd on shared buffer (access_ptr or address_of)
Sync->>Mem: insert tvm_storage_sync (shared scope)
Note over Sync,Mem: Barrier ensures visibility after atomics
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/op/atomic_add.cc (1)
276-282: Remove unused variabledst_load.The
dst_loadvariable created on line 276 is never used. It appears to be leftover from a previous approach.🔎 Proposed fix
- BufferLoad dst_load = BufferLoad(dst, dst_indices); Array<Range> dst_ranges; for (const PrimExpr &index : dst_indices) { dst_ranges.push_back(Range::FromMinExtent(index, 1)); } BufferRegion dst_region = BufferRegion(dst, dst_ranges); PrimExpr dst_ptr = MakeAccessPtrFromRegion(dst_region, 2); // 2 = write accesssrc/transform/atomicadd_vectorize.cc (1)
283-293: Simplify redundant conditional branches.Both branches of the if/else statement perform the same action (
new_args.push_back(node->args[0])), making the condition check unnecessary.🔎 Proposed fix
} else if (const auto *call = node->args[0].as<CallNode>()) { - // If it's already an address_of or access_ptr, forward it; otherwise, keep original. - if (call->op.same_as(builtin::address_of()) || - call->op.same_as(builtin::tvm_access_ptr())) { - new_args.push_back(node->args[0]); - } else { - new_args.push_back(node->args[0]); - } + // Forward the call as-is (address_of, access_ptr, or other) + new_args.push_back(node->args[0]); } else {
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/op/atomic_add.ccsrc/transform/atomicadd_vectorize.ccsrc/transform/thread_storage_sync.cc
🧰 Additional context used
🧬 Code graph analysis (1)
src/op/atomic_add.cc (1)
src/op/utils.cc (2)
MakeAccessPtrFromRegion(55-93)MakeAccessPtrFromRegion(55-56)
🔇 Additional comments (6)
src/op/atomic_add.cc (1)
389-406: LGTM!The refactoring to use
MakeAccessPtrFromRegionfor both source and destination buffers in the TMA path is correct. Access modes are properly set (1 for read, 2 for write).src/transform/atomicadd_vectorize.cc (2)
7-7: LGTM!Necessary include for
MakeAccessPtrFromRegion.
237-262: LGTM!The conversion from
address_oftoMakeAccessPtrFromRegionis correctly implemented for all vector sizes (4, 2, and scalar). Access modes are properly set.src/transform/thread_storage_sync.cc (3)
499-506: LGTM!The class documentation and constructor are well-structured for inserting thread synchronization after atomic operations on shared memory.
507-557: Potential performance concern: sync inserted after every atomic operation.This inserter adds a
__syncthreads()after every individualAtomicAddon shared memory. If there are multiple consecutive atomic operations, this will insert multiple barriers which may be excessive. Consider whether batching syncs or deferring to the existingThreadSyncPlanneranalysis would be more efficient.However, for correctness, this conservative approach ensures proper visibility ordering.
893-896: LGTM!The integration correctly applies
ThreadSyncAfterAtomicInserteronly for shared memory scope, and the ordering in the pass pipeline is appropriate.
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/op/atomic_add.ccsrc/transform/atomicadd_vectorize.ccsrc/transform/thread_storage_sync.cc
🚧 Files skipped from review as they are similar to previous changes (1)
- src/op/atomic_add.cc
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/thread_storage_sync.cc (2)
src/transform/multi_version_buffer_rewriter.cc (2)
call(443-473)call(443-444)tilelang/language/tir/op.py (2)
call_extern(173-195)tvm_storage_sync(534-547)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (6)
src/transform/atomicadd_vectorize.cc (4)
252-265: LGTM! Pointer usage is consistent.The vectorized and scalar paths correctly use the converted
dst_ptrandvalue_ptr(orvalue_nodein scalar case) instead of the legacyaddress_ofapproach, maintaining consistency with the access_ptr-based representation.
278-286: LGTM! Non-vectorized conversion is correct.The BufferLoad-to-access_ptr conversion in the non-vectorized path uses the correct write access flag (2) and extent of 1, which is appropriate for scalar operations.
288-295: LGTM! Backward compatibility maintained.The forwarding logic correctly handles both
address_ofandtvm_access_ptr, ensuring backward compatibility during the transition to the new pointer representation.
237-251: Verify extent for vectorized accesses.The code creates
Range::FromMinExtent(index, 1)for each index, resulting in an extent of 1 in the finaltvm_access_ptr. However, whenvector_size_is 2 or 4, the actual memory access spans multiple elements. The extent in the last dimension should likely bevector_size_instead of 1 to accurately represent the accessed memory region for downstream analysis passes that rely ontvm_access_ptrextent information.src/transform/thread_storage_sync.cc (2)
499-506: LGTM! Class structure follows established patterns.The
ThreadSyncAfterAtomicInserterclass follows the same design pattern asThreadSyncAfterWaitQueueInserter, maintaining consistency within the module.
898-901: LGTM! Integration is correctly scoped and positioned.The
ThreadSyncAfterAtomicInserteris correctly applied only for shared memory scope and positioned after the mainThreadSyncInserter, ensuring proper ordering of synchronization insertion passes.
…branches, use empty()
|
@LeiWang1999 We have successfully fixed issue #1257 by inserting __syncthreads() synchronization after AtomicAdd operations on shared memory. The fix has been verified: it correctly inserts sync for shared memory AtomicAdd operations and does not insert sync for global memory AtomicAdd operations. There are 5 failing tests in CI related to backward pass, but all AtomicAdd operations in these tests target global memory (dQ, dV, dK), which should not be affected by our fix. The basic fix has been verified as correct, and the failing tests are likely caused by other factors that require further investigation. |
This commit fixes two issues:
Issue [BUG] Missing
__syncthreads()afterAtomicAddin generated CUDA kernel #1257: Add missing __syncthreads() after AtomicAdd operationsIssue [Feature Request] Replace all
address_ofwithaccess_ptr#1423: Replace address_of with access_ptr (tvm_access_ptr)Changes:
Verification:
Summary by CodeRabbit
Bug Fixes
Refactor
✏️ Tip: You can customize this high-level summary in your review settings.