[AMDGPU] Insert waitcnt for non-global fence release in GFX12 #159282

ritter-x2a · 2025-09-17T08:21:35Z

A fence release could be followed by a barrier, so it should wait for
the relevant memory accesses to complete, even if it is mmra-limited to
LDS. So far, that would be skipped for non-global fence releases.

Fixes SWDEV-554932.

ritter-x2a · 2025-09-17T08:21:53Z

This stack of pull requests is managed by Graphite. Learn more about stacking.

llvmbot · 2025-09-17T08:24:21Z

@llvm/pr-subscribers-backend-amdgpu

Author: Fabian Ritter (ritter-x2a)

Changes

A fence release could be followed by a barrier, so it should wait for
the relevant memory accesses to complete, even if it is mmra-limited to
LDS. So far, that would be skipped for non-global fence releases.

Fixes SWDEV-554932.

Full diff: https://github.com/llvm/llvm-project/pull/159282.diff

3 Files Affected:

(modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+1-2)
(added) llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers-mmra.ll (+122)
(modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll (+51)

diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index c501ebba0c7ed..beeddc58be193 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -2522,8 +2522,7 @@ bool SIGfx12CacheControl::insertRelease(MachineBasicBlock::iterator &MI,
   // sequentially consistent, and no other thread can access scratch
   // memory.
 
-  // Other address spaces do not have a cache.
-  if ((AddrSpace & SIAtomicAddrSpace::GLOBAL) == SIAtomicAddrSpace::NONE)
+  if (AddrSpace == SIAtomicAddrSpace::SCRATCH)
     return false;
 
   if (Pos == Position::AFTER)
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers-mmra.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers-mmra.ll
new file mode 100644
index 0000000000000..1e6dc4e06ef4d
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers-mmra.ll
@@ -0,0 +1,122 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1010 < %s | FileCheck --check-prefixes=GFX10-WGP %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 < %s | FileCheck --check-prefixes=GFX11-WGP %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1200 < %s | FileCheck --check-prefixes=GFX12-WGP %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1250 < %s | FileCheck --check-prefixes=GFX1250 %s
+
+
+define float @test_barrier_workgroup_local_mmra(ptr addrspace(3) noundef %x, ptr addrspace(3) noundef %y, float %val) {
+; GFX10-WGP-LABEL: test_barrier_workgroup_local_mmra:
+; GFX10-WGP:       ; %bb.0:
+; GFX10-WGP-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-WGP-NEXT:    ds_write_b32 v0, v2
+; GFX10-WGP-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-WGP-NEXT:    s_barrier
+; GFX10-WGP-NEXT:    ds_read_b32 v0, v1
+; GFX10-WGP-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-WGP-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX11-WGP-LABEL: test_barrier_workgroup_local_mmra:
+; GFX11-WGP:       ; %bb.0:
+; GFX11-WGP-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-WGP-NEXT:    ds_store_b32 v0, v2
+; GFX11-WGP-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-WGP-NEXT:    s_barrier
+; GFX11-WGP-NEXT:    ds_load_b32 v0, v1
+; GFX11-WGP-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-WGP-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX12-WGP-LABEL: test_barrier_workgroup_local_mmra:
+; GFX12-WGP:       ; %bb.0:
+; GFX12-WGP-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-WGP-NEXT:    s_wait_expcnt 0x0
+; GFX12-WGP-NEXT:    s_wait_samplecnt 0x0
+; GFX12-WGP-NEXT:    s_wait_bvhcnt 0x0
+; GFX12-WGP-NEXT:    s_wait_kmcnt 0x0
+; GFX12-WGP-NEXT:    ds_store_b32 v0, v2
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
+; GFX12-WGP-NEXT:    s_barrier_signal -1
+; GFX12-WGP-NEXT:    s_barrier_wait -1
+; GFX12-WGP-NEXT:    ds_load_b32 v0, v1
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
+; GFX12-WGP-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1250-LABEL: test_barrier_workgroup_local_mmra:
+; GFX1250:       ; %bb.0:
+; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    ds_store_b32 v0, v2
+; GFX1250-NEXT:    s_wait_dscnt 0x0
+; GFX1250-NEXT:    s_barrier_signal -1
+; GFX1250-NEXT:    s_barrier_wait -1
+; GFX1250-NEXT:    ds_load_b32 v0, v1
+; GFX1250-NEXT:    s_wait_dscnt 0x0
+; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
+  store float %val, ptr addrspace(3) %x
+  fence syncscope("workgroup") release, !mmra !0
+  tail call void @llvm.amdgcn.s.barrier()
+  fence syncscope("workgroup") acquire, !mmra !0
+  %ret = load float, ptr addrspace(3) %y
+  ret float %ret
+}
+
+define float @test_barrier_workgroup_global_mmra(ptr addrspace(1) noundef %x, ptr addrspace(1) noundef %y, float %val) {
+; GFX10-WGP-LABEL: test_barrier_workgroup_global_mmra:
+; GFX10-WGP:       ; %bb.0:
+; GFX10-WGP-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-WGP-NEXT:    global_store_dword v[0:1], v4, off
+; GFX10-WGP-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX10-WGP-NEXT:    s_barrier
+; GFX10-WGP-NEXT:    buffer_gl0_inv
+; GFX10-WGP-NEXT:    global_load_dword v0, v[2:3], off
+; GFX10-WGP-NEXT:    s_waitcnt vmcnt(0)
+; GFX10-WGP-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX11-WGP-LABEL: test_barrier_workgroup_global_mmra:
+; GFX11-WGP:       ; %bb.0:
+; GFX11-WGP-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-WGP-NEXT:    global_store_b32 v[0:1], v4, off
+; GFX11-WGP-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX11-WGP-NEXT:    s_barrier
+; GFX11-WGP-NEXT:    buffer_gl0_inv
+; GFX11-WGP-NEXT:    global_load_b32 v0, v[2:3], off
+; GFX11-WGP-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-WGP-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX12-WGP-LABEL: test_barrier_workgroup_global_mmra:
+; GFX12-WGP:       ; %bb.0:
+; GFX12-WGP-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-WGP-NEXT:    s_wait_expcnt 0x0
+; GFX12-WGP-NEXT:    s_wait_samplecnt 0x0
+; GFX12-WGP-NEXT:    s_wait_bvhcnt 0x0
+; GFX12-WGP-NEXT:    s_wait_kmcnt 0x0
+; GFX12-WGP-NEXT:    global_store_b32 v[0:1], v4, off
+; GFX12-WGP-NEXT:    s_wait_storecnt 0x0
+; GFX12-WGP-NEXT:    s_barrier_signal -1
+; GFX12-WGP-NEXT:    s_barrier_wait -1
+; GFX12-WGP-NEXT:    global_inv scope:SCOPE_SE
+; GFX12-WGP-NEXT:    global_load_b32 v0, v[2:3], off
+; GFX12-WGP-NEXT:    s_wait_loadcnt 0x0
+; GFX12-WGP-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1250-LABEL: test_barrier_workgroup_global_mmra:
+; GFX1250:       ; %bb.0:
+; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    global_store_b32 v[0:1], v4, off
+; GFX1250-NEXT:    s_wait_storecnt 0x0
+; GFX1250-NEXT:    s_barrier_signal -1
+; GFX1250-NEXT:    s_barrier_wait -1
+; GFX1250-NEXT:    global_load_b32 v0, v[2:3], off
+; GFX1250-NEXT:    s_wait_loadcnt 0x0
+; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
+  store float %val, ptr addrspace(1) %x
+  fence syncscope("workgroup") release, !mmra !1
+  tail call void @llvm.amdgcn.s.barrier()
+  fence syncscope("workgroup") acquire, !mmra !1
+  %ret = load float, ptr addrspace(1) %y
+  ret float %ret
+}
+
+!0 = !{!"amdgpu-synchronize-as", !"local"}
+!1 = !{!"amdgpu-synchronize-as", !"global"}
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll
index cc42428e1aa06..682c1f6519959 100644
--- a/llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll
@@ -143,14 +143,17 @@ define amdgpu_kernel void @workgroup_release_fence() {
 ;
 ; GFX12-WGP-LABEL: workgroup_release_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: workgroup_release_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_dscnt 0x0
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: workgroup_release_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    s_wait_dscnt 0x0
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("workgroup") release, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -213,14 +216,17 @@ define amdgpu_kernel void @workgroup_acq_rel_fence() {
 ;
 ; GFX12-WGP-LABEL: workgroup_acq_rel_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: workgroup_acq_rel_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_dscnt 0x0
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: workgroup_acq_rel_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    s_wait_dscnt 0x0
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("workgroup") acq_rel, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -283,14 +289,17 @@ define amdgpu_kernel void @workgroup_seq_cst_fence() {
 ;
 ; GFX12-WGP-LABEL: workgroup_seq_cst_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: workgroup_seq_cst_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_dscnt 0x0
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: workgroup_seq_cst_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    s_wait_dscnt 0x0
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("workgroup") seq_cst, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -670,14 +679,18 @@ define amdgpu_kernel void @agent_release_fence() {
 ;
 ; GFX12-WGP-LABEL: agent_release_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: agent_release_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_dscnt 0x0
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: agent_release_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_DEV
+; GFX1250-NEXT:    s_wait_dscnt 0x0
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("agent") release, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -740,14 +753,18 @@ define amdgpu_kernel void @agent_acq_rel_fence() {
 ;
 ; GFX12-WGP-LABEL: agent_acq_rel_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: agent_acq_rel_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_dscnt 0x0
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: agent_acq_rel_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_DEV
+; GFX1250-NEXT:    s_wait_dscnt 0x0
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("agent") acq_rel, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -810,14 +827,18 @@ define amdgpu_kernel void @agent_seq_cst_fence() {
 ;
 ; GFX12-WGP-LABEL: agent_seq_cst_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: agent_seq_cst_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_dscnt 0x0
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: agent_seq_cst_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_DEV
+; GFX1250-NEXT:    s_wait_dscnt 0x0
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("agent") seq_cst, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -940,6 +961,7 @@ define amdgpu_kernel void @agent_one_as_release_fence() {
 ;
 ; GFX1250-LABEL: agent_one_as_release_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_DEV
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("agent-one-as") release, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -1001,6 +1023,7 @@ define amdgpu_kernel void @agent_one_as_acq_rel_fence() {
 ;
 ; GFX1250-LABEL: agent_one_as_acq_rel_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_DEV
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("agent-one-as") acq_rel, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -1062,6 +1085,7 @@ define amdgpu_kernel void @agent_one_as_seq_cst_fence() {
 ;
 ; GFX1250-LABEL: agent_one_as_seq_cst_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_DEV
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("agent-one-as") seq_cst, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -1197,14 +1221,20 @@ define amdgpu_kernel void @system_release_fence() {
 ;
 ; GFX12-WGP-LABEL: system_release_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    global_wb scope:SCOPE_SYS
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: system_release_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    global_wb scope:SCOPE_SYS
+; GFX12-CU-NEXT:    s_wait_dscnt 0x0
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: system_release_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_SYS
+; GFX1250-NEXT:    s_wait_dscnt 0x0
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence release, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -1267,14 +1297,20 @@ define amdgpu_kernel void @system_acq_rel_fence() {
 ;
 ; GFX12-WGP-LABEL: system_acq_rel_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    global_wb scope:SCOPE_SYS
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: system_acq_rel_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    global_wb scope:SCOPE_SYS
+; GFX12-CU-NEXT:    s_wait_dscnt 0x0
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: system_acq_rel_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_SYS
+; GFX1250-NEXT:    s_wait_dscnt 0x0
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence acq_rel, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -1337,14 +1373,20 @@ define amdgpu_kernel void @system_seq_cst_fence() {
 ;
 ; GFX12-WGP-LABEL: system_seq_cst_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    global_wb scope:SCOPE_SYS
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: system_seq_cst_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    global_wb scope:SCOPE_SYS
+; GFX12-CU-NEXT:    s_wait_dscnt 0x0
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: system_seq_cst_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_SYS
+; GFX1250-NEXT:    s_wait_dscnt 0x0
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence seq_cst, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -1459,14 +1501,17 @@ define amdgpu_kernel void @system_one_as_release_fence() {
 ;
 ; GFX12-WGP-LABEL: system_one_as_release_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: system_one_as_release_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: system_one_as_release_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("one-as") release, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -1520,14 +1565,17 @@ define amdgpu_kernel void @system_one_as_acq_rel_fence() {
 ;
 ; GFX12-WGP-LABEL: system_one_as_acq_rel_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: system_one_as_acq_rel_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: system_one_as_acq_rel_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("one-as") acq_rel, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -1581,14 +1629,17 @@ define amdgpu_kernel void @system_one_as_seq_cst_fence() {
 ;
 ; GFX12-WGP-LABEL: system_one_as_seq_cst_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: system_one_as_seq_cst_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: system_one_as_seq_cst_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("one-as") seq_cst, !mmra !{!"amdgpu-synchronize-as", !"local"}

Pierre-vh

Can you also check if other CacheControl classes have a similar issue, and if the acquire equivalent has a similar issue too (first check if you can reproduce a similar issue in a test case) ?

MMRAs touch the OrderingAddrSpace, so anything that checks for it too eagerly could be fooled unfortunately

Pierre-vh · 2025-09-17T08:30:59Z

llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp

-  if ((AddrSpace & SIAtomicAddrSpace::GLOBAL) == SIAtomicAddrSpace::NONE)
+  if (AddrSpace == SIAtomicAddrSpace::SCRATCH)
    return false;



I think a better fix is to move the code surrounding the switch below into something like if((AddrSpace & SIAtomicAddrSpace::GLOBAL) != SIAtomicAddrSpace::NONE || ((AddrSpace & SIAtomicAddrSpace::SCRATCH) != SIAtomicAddrSpace::NONE) && ST.hasGloballyAddressableScratch())

Basically always go to the InsertWait section, just skip the wb/inv insertion if we're not touching the global AS.
That way we don't get those global_wb inserted for the local fences in the test below

The original code had an early return on a negative condition which matched the one in insertAcquire(). Why change it and move all the code inside a positive condition? That seems to have introduced a lot of whitespace noise due to reindentation.

Here, in insertRelease, waitcnts are inserted after this global_wb insertion code; the early return therefore also (wrongly) affected the waitcnt insertion.
In insertAcquire, the early return only guards the global_inv insertion while necessary waitcnts are inserted outside of insertAcquire, before it is called.
We could introduce a bool canAffectGlobalAS(SIAtomicAddrSpace) helper that encapsulates this condition to make it more recognizable that one case is negated.

I see now ... I had not noticed the insertWait() at the end. canAffectGlobalAS() seems like a good idea. And maybe we can also have insertWB(), so that it's clear what is happening at a release? In either case, I think it's safer to fix that early return as a separate commit so that any breakages can be investigated separately.

And maybe we can also have insertWB(), so that it's clear what is happening at a release?

Feel free to overrule me here, but I don't think we gain much from introducing insertWB. I find that logic only used in three insertRelease implementations (gfx90a, gfx940, and gfx12), where it does something slightly different in each case, so there isn't much duplication to get rid off.
We could turn the insertRelease implementations for gfx90a, gfx940, and gfx12 into insertWB(...); insertWait(...); (for the others it seems to be just insertWait(...);), but that would only move the WB code from SIGfx*CacheControl::insertRelease to the new SIGfx*CacheControl::insertWB.

@Pierre-vh

I think a better fix is to move the code surrounding the switch below into something like if((AddrSpace & SIAtomicAddrSpace::GLOBAL) != SIAtomicAddrSpace::NONE || ((AddrSpace & SIAtomicAddrSpace::SCRATCH) != SIAtomicAddrSpace::NONE) && ST.hasGloballyAddressableScratch())

It looks like you recently ensured with #154710 that atomic scratch instructions are replaced by flat instructions in case of GloballyAddressableScratch, so that part of the suggested condition should never trigger, right?
So, wouldn't this be better handled with an assert(!ST.hasGloballyAddressableScratch() || ((AS & SIAtomicAddrSpace::GLOBAL) != SIAtomicAddrSpace::NONE) || (AS & SIAtomicAddrSpace::SCRATCH) == SIAtomicAddrSpace::NONE)?

I've separated (and adjusted, to reflect my above comment about the GloballyAddressableScratch part being unnecessary) the part that concerned GloballyAddressableScratch into #160129.

llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll

kuhar

Avoid internal ticket URLs in PR descriptions: 6773269

arsenm · 2025-09-19T01:22:26Z

6773269

The policy was designed to specifically allow this

ritter-x2a · 2025-09-19T13:12:32Z

@Pierre-vh

check if other CacheControl classes have a similar issue

No, the early return for globals was only present for gfx12, the previous generations always called insertWait, so they don't have the issue.

if the acquire equivalent has a similar issue too

Also no (mostly), the waitcnt for that is also inserted unconditionally, even before insertAcquire is called.
However, we currently wouldn't emit a GLOBAL_INV for scratch fences with GloballyAddressableScratch, that seems wrong to me. The updated PR changes that as well.

llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp

A fence release could be followed by a barrier, so it should wait for the relevant memory accesses to complete, even if it is mmra-limited to LDS. So far, that would be skipped for non-global fence releases. Fixes SWDEV-554932.

...and treat release/acquire on GloballyAddressableScratch as global. This also fixes an issue where an early return between an iterator increment/decrement pair in SIGfx12CacheControl::insertRelease could have lead to unexpected insertion locations.

ritter-x2a · 2025-09-23T09:50:56Z

Merge activity

Sep 23, 9:50 AM UTC: A user started a stack merge that includes this pull request via Graphite.
Sep 23, 9:52 AM UTC: @ritter-x2a merged this pull request with Graphite.

ritter-x2a added the backend:AMDGPU label Sep 17, 2025 — with Graphite App

ritter-x2a requested review from Pierre-vh, krzysz00 and ssahasra September 17, 2025 08:24

ritter-x2a marked this pull request as ready for review September 17, 2025 08:24

Pierre-vh reviewed Sep 17, 2025

View reviewed changes

krzysz00 reviewed Sep 17, 2025

View reviewed changes

llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll Outdated Show resolved Hide resolved

krzysz00 mentioned this pull request Sep 18, 2025

[mlir][AMDGPU] Use LDS-only MMRA fences for lds_barrier #157919

Merged

kuhar reviewed Sep 18, 2025

View reviewed changes

ritter-x2a force-pushed the users/ritter-x2a/09-17-_amdgpu_insert_waitcnt_for_non-global_fence_release_in_gfx12 branch from 6297305 to 1a6bab8 Compare September 19, 2025 13:10

ritter-x2a force-pushed the users/ritter-x2a/09-17-_amdgpu_insert_waitcnt_for_non-global_fence_release_in_gfx12 branch from 1a6bab8 to 2d3f86d Compare September 22, 2025 15:28

ritter-x2a mentioned this pull request Sep 22, 2025

[AMDGPU] SIMemoryLegalizer: Factor out check if memory operations can affect the global AS #160129

Merged

ritter-x2a commented Sep 22, 2025

View reviewed changes

llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp Show resolved Hide resolved

ritter-x2a added 4 commits September 23, 2025 04:09

Remove changes related to GloballyAddressableScratch from this PR

4fdf715

Properly report changes in SIGfx12CacheControl::insertRelease

3b7305a

ritter-x2a force-pushed the users/ritter-x2a/09-17-_amdgpu_insert_waitcnt_for_non-global_fence_release_in_gfx12 branch from 2d3f86d to 3b7305a Compare September 23, 2025 08:54

Pierre-vh approved these changes Sep 23, 2025

View reviewed changes

ritter-x2a merged commit 3f8c7e9 into main Sep 23, 2025
9 checks passed

ritter-x2a deleted the users/ritter-x2a/09-17-_amdgpu_insert_waitcnt_for_non-global_fence_release_in_gfx12 branch September 23, 2025 09:52

Uh oh!

[AMDGPU] Insert waitcnt for non-global fence release in GFX12 #159282

[AMDGPU] Insert waitcnt for non-global fence release in GFX12 #159282

Uh oh!

Conversation

ritter-x2a commented Sep 17, 2025

Uh oh!

ritter-x2a commented Sep 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Sep 17, 2025

Uh oh!

Pierre-vh left a comment

Choose a reason for hiding this comment

Uh oh!

Pierre-vh Sep 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ritter-x2a Sep 19, 2025

Choose a reason for hiding this comment

Uh oh!

ssahasra Sep 22, 2025

Choose a reason for hiding this comment

Uh oh!

ritter-x2a Sep 22, 2025

Choose a reason for hiding this comment

Uh oh!

ssahasra Sep 22, 2025

Choose a reason for hiding this comment

Uh oh!

ritter-x2a Sep 22, 2025

Choose a reason for hiding this comment

Uh oh!

ritter-x2a Sep 22, 2025

Choose a reason for hiding this comment

Uh oh!

ritter-x2a Sep 22, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kuhar left a comment

Choose a reason for hiding this comment

Uh oh!

arsenm commented Sep 19, 2025

Uh oh!

ritter-x2a commented Sep 19, 2025

Uh oh!

Uh oh!

ritter-x2a commented Sep 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merge activity

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

ritter-x2a commented Sep 17, 2025 •

edited

Loading

Pierre-vh Sep 17, 2025 •

edited

Loading

ritter-x2a commented Sep 23, 2025 •

edited

Loading