Commit c23e300

[Tutorial] Add barrier before atomic in layernorm backward (triton-lang#6307)
The layernorm backward pass fails on the HIP backend: it produces different `db` and `dw` outputs across runs. The root cause is that `atomic_xchg` does not synchronize the threads of a workgroup before the workgroup's leading thread releases the lock. Specifically, the layernorm backward kernel uses a lock to build a critical section that serializes multiple workgroups, so that each workgroup in turn can accumulate into the same buffer. Because `atomic_xchg` does not wait for the other threads of the workgroup, the leading thread can release the lock while those threads are still adding to the buffer, which corrupts the result.

An initial fix made the atomic op call `__syncthreads` whenever its `value` input is a scalar, so that all threads in the workgroup were synchronized before the leading thread performed the atomic operation. However, per the discussion in triton-lang#4504, atomic ops are not supposed to provide this kind of intra-workgroup synchronization, so instead we add an explicit barrier to the kernel of the backward pass. (This is the same approach used to fix the unit test `test_core.py::test_atomic_cas`.)
1 parent 646d063 commit c23e300
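
To make the race concrete, below is a minimal sketch of the lock-based accumulation pattern the commit message describes, with the explicit barrier placed before the lock release. This is not the tutorial kernel itself; the pointer names (`Lock`, `Count`, `Buf`, `Vals`) and the 1-D layout are illustrative only.

import triton
import triton.language as tl


@triton.jit
def _locked_accumulate(Lock, Count, Buf, Vals, N, BLOCK_SIZE: tl.constexpr):
    # Illustrative kernel: every workgroup adds its values into one shared buffer.
    offs = tl.arange(0, BLOCK_SIZE)
    mask = offs < N
    vals = tl.load(Vals + offs, mask=mask)
    # Spin until this workgroup acquires the lock (enters the critical section).
    while tl.atomic_cas(Lock, 0, 1) == 1:
        pass
    count = tl.load(Count)
    if count == 0:
        # The first workgroup initializes the buffer instead of accumulating.
        tl.atomic_xchg(Count, 1)
    else:
        vals += tl.load(Buf + offs, mask=mask)
    tl.store(Buf + offs, vals, mask=mask)
    # Without this barrier, some threads may still be storing into Buf when the
    # leading thread releases the lock below, which is the race observed on HIP.
    tl.debug_barrier()
    # Release the lock so the next workgroup can enter the critical section.
    tl.atomic_xchg(Lock, 0)

The fix does not change the locking scheme; it only guarantees that every thread's store to the shared buffer has completed before the lock is handed to the next workgroup.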

File tree

1 file changed: +5 -0 lines changed

python/tutorials/05-layer-norm.py

Lines changed: 5 additions & 0 deletions
@@ -185,6 +185,11 @@ def _layer_norm_bwd_dx_fused(DX, # pointer to the input gradient
             partial_db += tl.load(DB, mask=mask)
         tl.store(DW, partial_dw, mask=mask)
         tl.store(DB, partial_db, mask=mask)
+
+        # need a barrier to ensure all threads finished before
+        # releasing the lock
+        tl.debug_barrier()
+
         # Release the lock
         tl.atomic_xchg(Lock, 0)
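
A quick way to check the effect of this change is to run the backward pass twice on identical inputs and compare the accumulated gradients; before the fix, `dw`/`db` could differ between runs on HIP. The helper below is only a sketch: it assumes `layer_norm` is the tutorial's autograd wrapper (`layer_norm = LayerNorm.apply` in 05-layer-norm.py) taking `(x, normalized_shape, weight, bias, eps)`, and the shapes in the commented usage are illustrative.

import torch


def grads_once(layer_norm, x, w_shape, weight, bias, eps=1e-5):
    # Run one forward/backward pass and return detached copies of the gradients.
    x = x.clone().requires_grad_(True)
    weight = weight.clone().requires_grad_(True)
    bias = bias.clone().requires_grad_(True)
    y = layer_norm(x, w_shape, weight, bias, eps)
    y.backward(torch.ones_like(y))
    return x.grad.clone(), weight.grad.clone(), bias.grad.clone()


# Hypothetical usage, with `layer_norm` imported from the tutorial:
# M, N = 64, 2048
# x = torch.randn((M, N), dtype=torch.float16, device="cuda")
# weight = torch.rand((N,), dtype=torch.float16, device="cuda")
# bias = torch.rand((N,), dtype=torch.float16, device="cuda")
# _, dw1, db1 = grads_once(layer_norm, x, (N,), weight, bias)
# _, dw2, db2 = grads_once(layer_norm, x, (N,), weight, bias)
# assert torch.equal(dw1, dw2) and torch.equal(db1, db2)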
