Commit c23e300

[Tutorial] Add barrier before atomic in layernorm backward (triton-lang#6307)
The layernorm backward pass fails on the HIP backend: it produces different `db` and `dw` outputs across runs. The root cause is that `atomic_xchg` does not synchronize the threads of a workgroup before the workgroup's leading thread releases the lock. Specifically, the layernorm backward kernel uses a lock to build a critical section that serializes multiple workgroups, so that each workgroup in turn can accumulate into the same buffer. Because `atomic_xchg` does not wait for the other threads of the workgroup, the leading thread can release the lock while those threads are still adding to the buffer, which corrupts the result.

An initial fix made the atomic op call `__syncthreads` whenever its `value` input is a scalar, so that all threads in the workgroup were synchronized before the leading thread performed the atomic operation. However, per the discussion in triton-lang#4504, atomic ops are not supposed to provide this kind of intra-workgroup synchronization, so instead we add an explicit barrier to the kernel of the backward pass. (This is the same approach used to fix the unit test `test_core.py::test_atomic_cas`.)
1 parent 646d063 commit c23e300
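
To make the race concrete, below is a minimal sketch of the lock-based accumulation pattern the commit message describes, with the explicit barrier placed before the lock release. This is not the tutorial kernel itself; the pointer names (`Lock`, `Count`, `Buf`, `Vals`) and the 1-D layout are illustrative only.

import triton
import triton.language as tl


@triton.jit
def _locked_accumulate(Lock, Count, Buf, Vals, N, BLOCK_SIZE: tl.constexpr):
    # Illustrative kernel: every workgroup adds its values into one shared buffer.
    offs = tl.arange(0, BLOCK_SIZE)
    mask = offs < N
    vals = tl.load(Vals + offs, mask=mask)
    # Spin until this workgroup acquires the lock (enters the critical section).
    while tl.atomic_cas(Lock, 0, 1) == 1:
        pass
    count = tl.load(Count)
    if count == 0:
        # The first workgroup initializes the buffer instead of accumulating.
        tl.atomic_xchg(Count, 1)
    else:
        vals += tl.load(Buf + offs, mask=mask)
    tl.store(Buf + offs, vals, mask=mask)
    # Without this barrier, some threads may still be storing into Buf when the
    # leading thread releases the lock below, which is the race observed on HIP.
    tl.debug_barrier()
    # Release the lock so the next workgroup can enter the critical section.
    tl.atomic_xchg(Lock, 0)

The fix does not change the locking scheme; it only guarantees that every thread's store to the shared buffer has completed before the lock is handed to the next workgroup.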

File tree

1 file changed: +5 -0 lines changed

python/tutorials/05-layer-norm.py

Lines changed: 5 additions & 0 deletions
@@ -185,6 +185,11 @@ def _layer_norm_bwd_dx_fused(DX, # pointer to the input gradient
             partial_db += tl.load(DB, mask=mask)
         tl.store(DW, partial_dw, mask=mask)
         tl.store(DB, partial_db, mask=mask)
+
+        # need a barrier to ensure all threads finished before
+        # releasing the lock
+        tl.debug_barrier()
+
         # Release the lock
         tl.atomic_xchg(Lock, 0)
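
A quick way to check the effect of this change is to run the backward pass twice on identical inputs and compare the accumulated gradients; before the fix, `dw`/`db` could differ between runs on HIP. The helper below is only a sketch: it assumes `layer_norm` is the tutorial's autograd wrapper (`layer_norm = LayerNorm.apply` in 05-layer-norm.py) taking `(x, normalized_shape, weight, bias, eps)`, and the shapes in the commented usage are illustrative.

import torch


def grads_once(layer_norm, x, w_shape, weight, bias, eps=1e-5):
    # Run one forward/backward pass and return detached copies of the gradients.
    x = x.clone().requires_grad_(True)
    weight = weight.clone().requires_grad_(True)
    bias = bias.clone().requires_grad_(True)
    y = layer_norm(x, w_shape, weight, bias, eps)
    y.backward(torch.ones_like(y))
    return x.grad.clone(), weight.grad.clone(), bias.grad.clone()


# Hypothetical usage, with `layer_norm` imported from the tutorial:
# M, N = 64, 2048
# x = torch.randn((M, N), dtype=torch.float16, device="cuda")
# weight = torch.rand((N,), dtype=torch.float16, device="cuda")
# bias = torch.rand((N,), dtype=torch.float16, device="cuda")
# _, dw1, db1 = grads_once(layer_norm, x, (N,), weight, bias)
# _, dw2, db2 = grads_once(layer_norm, x, (N,), weight, bias)
# assert torch.equal(dw1, dw2) and torch.equal(db1, db2)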
