Skip to content

Commit 1070cd5

Browse files
[AMD] Fix nondeterministic atomic tests failure on RDNA (triton-lang#8633)
This PR fixes nondeterministic results in Triton's atomic RMW unit tests on RDNA. The issue was caused by an unnecessary non-offset LDS store of the atomic return ("old") that every lane executed to the same LDS address, creating a race without any functional use. %16 = atomicrmw add ptr addrspace(1) %dst, i64 %val syncscope("agent") acq_rel store i64 %16, ptr addrspace(3) @global_smem ; **<-- non-offset LDS write** . . . %8 = (lane_id & 63) << 3 %19 = getelementptr i8, ptr addrspace(3) @global_smem, i32 %8 store <1 x i64> %21, ptr addrspace(3) %19 %24 = load i64, ptr addrspace(3) %19 There is no read from @global_smem at offset 0 after the first store. Thus, the zero-offset store is dead, but every lane still writes it, causing write-write contention. For lane 0, the same location later used for the real value. **Mismatched elements: 1 / 64 (1.56%)** Max absolute difference among violations: 1.87127712 Max relative difference among violations: 1.52464041 **ACTUAL:** array([**-0.643921**, -0.221742, 1.451132, -2.439096, 1.336105, 0.748346, 0.725985, -0.604978, 2.061625, 1.040246, -0.511721, -0.546404, -0.163719, 0.279938, -0.730161, -0.21177 , -0.323359, 1.263584,... **DESIRED:** array([ **1.227356**, -0.221742, 1.451132, -2.439096, 1.336105, 0.748346, 0.725985, -0.604978, 2.061625, 1.040246, -0.511721, -0.546404, -0.163719, 0.279938, -0.730161, -0.21177 , -0.323359, 1.263584,... ============================== 1 failed in 3.73s =============================== A non-offset LDS store (e.g., writing to LDS[0]) only makes sense in a true scalar scenario when You deliberately perform one atomic per wave (single output). **Fix** We removed/guarded the non-offset LDS write so it is not emitted when it is unsafe or unnecessary: No non-offset LDS staging for tensor/multi-output atomics. No LDS staging when the atomic return is unused.
1 parent b78e1de commit 1070cd5

File tree

2 files changed

+3
-2
lines changed

2 files changed

+3
-2
lines changed

python/test/unit/language/test_core.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1404,7 +1404,7 @@ def kernel(X):
14041404
@pytest.mark.interpreter
14051405
@pytest.mark.parametrize("shape, axis, num_ctas, dtype_x_str, check_return_val",
14061406
[(shape, axis, num_ctas, dtype_x_str, check_return_val)
1407-
for shape in [(2, 2), (2, 8), (8, 2), (8, 8), (32, 32), (64, 64)]
1407+
for shape in [(2, 2), (2, 8), (8, 2), (8, 8), (32, 32), (64, 64), (128, 128)]
14081408
for axis in [0, 1]
14091409
for num_ctas in num_ctas_list
14101410
for dtype_x_str in ['bfloat16', 'float16', 'float32', 'uint64', 'int64', 'float64']

third_party/amd/lib/TritonAMDGPUToLLVM/LoadStoreOpToLLVM.cpp

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1852,8 +1852,9 @@ struct AtomicRMWOpConversion
18521852
emitRedundantThreadPredicate(freeVarMasks, rewriter, loc, targetInfo);
18531853
auto tid = getThreadId(rewriter, loc);
18541854

1855+
bool needLdsStaging = !tensorTy && !opResult.use_empty();
18551856
std::optional<Value> atomicSharedMemBase =
1856-
op->hasAttr("allocation.offset")
1857+
op->hasAttr("allocation.offset") && needLdsStaging
18571858
? std::optional<Value>(getSharedMemoryBase(
18581859
loc, rewriter, targetInfo, op.getOperation()))
18591860
: std::nullopt;

0 commit comments

Comments
 (0)