Skip to content

Conversation

@ritter-x2a
Copy link
Member

A fence release could be followed by a barrier, so it should wait for
the relevant memory accesses to complete, even if it is mmra-limited to
LDS. So far, that would be skipped for non-global fence releases.

Fixes SWDEV-554932.

Copy link
Member Author

ritter-x2a commented Sep 17, 2025

@llvmbot
Copy link
Member

llvmbot commented Sep 17, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Fabian Ritter (ritter-x2a)

Changes

A fence release could be followed by a barrier, so it should wait for
the relevant memory accesses to complete, even if it is mmra-limited to
LDS. So far, that would be skipped for non-global fence releases.

Fixes SWDEV-554932.


Full diff: https://github.com/llvm/llvm-project/pull/159282.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+1-2)
  • (added) llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers-mmra.ll (+122)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll (+51)
diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index c501ebba0c7ed..beeddc58be193 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -2522,8 +2522,7 @@ bool SIGfx12CacheControl::insertRelease(MachineBasicBlock::iterator &MI,
   // sequentially consistent, and no other thread can access scratch
   // memory.
 
-  // Other address spaces do not have a cache.
-  if ((AddrSpace & SIAtomicAddrSpace::GLOBAL) == SIAtomicAddrSpace::NONE)
+  if (AddrSpace == SIAtomicAddrSpace::SCRATCH)
     return false;
 
   if (Pos == Position::AFTER)
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers-mmra.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers-mmra.ll
new file mode 100644
index 0000000000000..1e6dc4e06ef4d
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers-mmra.ll
@@ -0,0 +1,122 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1010 < %s | FileCheck --check-prefixes=GFX10-WGP %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 < %s | FileCheck --check-prefixes=GFX11-WGP %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1200 < %s | FileCheck --check-prefixes=GFX12-WGP %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1250 < %s | FileCheck --check-prefixes=GFX1250 %s
+
+
+define float @test_barrier_workgroup_local_mmra(ptr addrspace(3) noundef %x, ptr addrspace(3) noundef %y, float %val) {
+; GFX10-WGP-LABEL: test_barrier_workgroup_local_mmra:
+; GFX10-WGP:       ; %bb.0:
+; GFX10-WGP-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-WGP-NEXT:    ds_write_b32 v0, v2
+; GFX10-WGP-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-WGP-NEXT:    s_barrier
+; GFX10-WGP-NEXT:    ds_read_b32 v0, v1
+; GFX10-WGP-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-WGP-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX11-WGP-LABEL: test_barrier_workgroup_local_mmra:
+; GFX11-WGP:       ; %bb.0:
+; GFX11-WGP-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-WGP-NEXT:    ds_store_b32 v0, v2
+; GFX11-WGP-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-WGP-NEXT:    s_barrier
+; GFX11-WGP-NEXT:    ds_load_b32 v0, v1
+; GFX11-WGP-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-WGP-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX12-WGP-LABEL: test_barrier_workgroup_local_mmra:
+; GFX12-WGP:       ; %bb.0:
+; GFX12-WGP-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-WGP-NEXT:    s_wait_expcnt 0x0
+; GFX12-WGP-NEXT:    s_wait_samplecnt 0x0
+; GFX12-WGP-NEXT:    s_wait_bvhcnt 0x0
+; GFX12-WGP-NEXT:    s_wait_kmcnt 0x0
+; GFX12-WGP-NEXT:    ds_store_b32 v0, v2
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
+; GFX12-WGP-NEXT:    s_barrier_signal -1
+; GFX12-WGP-NEXT:    s_barrier_wait -1
+; GFX12-WGP-NEXT:    ds_load_b32 v0, v1
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
+; GFX12-WGP-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1250-LABEL: test_barrier_workgroup_local_mmra:
+; GFX1250:       ; %bb.0:
+; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    ds_store_b32 v0, v2
+; GFX1250-NEXT:    s_wait_dscnt 0x0
+; GFX1250-NEXT:    s_barrier_signal -1
+; GFX1250-NEXT:    s_barrier_wait -1
+; GFX1250-NEXT:    ds_load_b32 v0, v1
+; GFX1250-NEXT:    s_wait_dscnt 0x0
+; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
+  store float %val, ptr addrspace(3) %x
+  fence syncscope("workgroup") release, !mmra !0
+  tail call void @llvm.amdgcn.s.barrier()
+  fence syncscope("workgroup") acquire, !mmra !0
+  %ret = load float, ptr addrspace(3) %y
+  ret float %ret
+}
+
+define float @test_barrier_workgroup_global_mmra(ptr addrspace(1) noundef %x, ptr addrspace(1) noundef %y, float %val) {
+; GFX10-WGP-LABEL: test_barrier_workgroup_global_mmra:
+; GFX10-WGP:       ; %bb.0:
+; GFX10-WGP-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-WGP-NEXT:    global_store_dword v[0:1], v4, off
+; GFX10-WGP-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX10-WGP-NEXT:    s_barrier
+; GFX10-WGP-NEXT:    buffer_gl0_inv
+; GFX10-WGP-NEXT:    global_load_dword v0, v[2:3], off
+; GFX10-WGP-NEXT:    s_waitcnt vmcnt(0)
+; GFX10-WGP-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX11-WGP-LABEL: test_barrier_workgroup_global_mmra:
+; GFX11-WGP:       ; %bb.0:
+; GFX11-WGP-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-WGP-NEXT:    global_store_b32 v[0:1], v4, off
+; GFX11-WGP-NEXT:    s_waitcnt_vscnt null, 0x0
+; GFX11-WGP-NEXT:    s_barrier
+; GFX11-WGP-NEXT:    buffer_gl0_inv
+; GFX11-WGP-NEXT:    global_load_b32 v0, v[2:3], off
+; GFX11-WGP-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-WGP-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX12-WGP-LABEL: test_barrier_workgroup_global_mmra:
+; GFX12-WGP:       ; %bb.0:
+; GFX12-WGP-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12-WGP-NEXT:    s_wait_expcnt 0x0
+; GFX12-WGP-NEXT:    s_wait_samplecnt 0x0
+; GFX12-WGP-NEXT:    s_wait_bvhcnt 0x0
+; GFX12-WGP-NEXT:    s_wait_kmcnt 0x0
+; GFX12-WGP-NEXT:    global_store_b32 v[0:1], v4, off
+; GFX12-WGP-NEXT:    s_wait_storecnt 0x0
+; GFX12-WGP-NEXT:    s_barrier_signal -1
+; GFX12-WGP-NEXT:    s_barrier_wait -1
+; GFX12-WGP-NEXT:    global_inv scope:SCOPE_SE
+; GFX12-WGP-NEXT:    global_load_b32 v0, v[2:3], off
+; GFX12-WGP-NEXT:    s_wait_loadcnt 0x0
+; GFX12-WGP-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1250-LABEL: test_barrier_workgroup_global_mmra:
+; GFX1250:       ; %bb.0:
+; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    global_store_b32 v[0:1], v4, off
+; GFX1250-NEXT:    s_wait_storecnt 0x0
+; GFX1250-NEXT:    s_barrier_signal -1
+; GFX1250-NEXT:    s_barrier_wait -1
+; GFX1250-NEXT:    global_load_b32 v0, v[2:3], off
+; GFX1250-NEXT:    s_wait_loadcnt 0x0
+; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
+  store float %val, ptr addrspace(1) %x
+  fence syncscope("workgroup") release, !mmra !1
+  tail call void @llvm.amdgcn.s.barrier()
+  fence syncscope("workgroup") acquire, !mmra !1
+  %ret = load float, ptr addrspace(1) %y
+  ret float %ret
+}
+
+!0 = !{!"amdgpu-synchronize-as", !"local"}
+!1 = !{!"amdgpu-synchronize-as", !"global"}
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll
index cc42428e1aa06..682c1f6519959 100644
--- a/llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll
@@ -143,14 +143,17 @@ define amdgpu_kernel void @workgroup_release_fence() {
 ;
 ; GFX12-WGP-LABEL: workgroup_release_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: workgroup_release_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_dscnt 0x0
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: workgroup_release_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    s_wait_dscnt 0x0
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("workgroup") release, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -213,14 +216,17 @@ define amdgpu_kernel void @workgroup_acq_rel_fence() {
 ;
 ; GFX12-WGP-LABEL: workgroup_acq_rel_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: workgroup_acq_rel_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_dscnt 0x0
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: workgroup_acq_rel_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    s_wait_dscnt 0x0
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("workgroup") acq_rel, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -283,14 +289,17 @@ define amdgpu_kernel void @workgroup_seq_cst_fence() {
 ;
 ; GFX12-WGP-LABEL: workgroup_seq_cst_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: workgroup_seq_cst_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_dscnt 0x0
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: workgroup_seq_cst_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    s_wait_dscnt 0x0
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("workgroup") seq_cst, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -670,14 +679,18 @@ define amdgpu_kernel void @agent_release_fence() {
 ;
 ; GFX12-WGP-LABEL: agent_release_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: agent_release_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_dscnt 0x0
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: agent_release_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_DEV
+; GFX1250-NEXT:    s_wait_dscnt 0x0
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("agent") release, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -740,14 +753,18 @@ define amdgpu_kernel void @agent_acq_rel_fence() {
 ;
 ; GFX12-WGP-LABEL: agent_acq_rel_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: agent_acq_rel_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_dscnt 0x0
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: agent_acq_rel_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_DEV
+; GFX1250-NEXT:    s_wait_dscnt 0x0
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("agent") acq_rel, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -810,14 +827,18 @@ define amdgpu_kernel void @agent_seq_cst_fence() {
 ;
 ; GFX12-WGP-LABEL: agent_seq_cst_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: agent_seq_cst_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    s_wait_dscnt 0x0
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: agent_seq_cst_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_DEV
+; GFX1250-NEXT:    s_wait_dscnt 0x0
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("agent") seq_cst, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -940,6 +961,7 @@ define amdgpu_kernel void @agent_one_as_release_fence() {
 ;
 ; GFX1250-LABEL: agent_one_as_release_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_DEV
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("agent-one-as") release, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -1001,6 +1023,7 @@ define amdgpu_kernel void @agent_one_as_acq_rel_fence() {
 ;
 ; GFX1250-LABEL: agent_one_as_acq_rel_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_DEV
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("agent-one-as") acq_rel, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -1062,6 +1085,7 @@ define amdgpu_kernel void @agent_one_as_seq_cst_fence() {
 ;
 ; GFX1250-LABEL: agent_one_as_seq_cst_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_DEV
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("agent-one-as") seq_cst, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -1197,14 +1221,20 @@ define amdgpu_kernel void @system_release_fence() {
 ;
 ; GFX12-WGP-LABEL: system_release_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    global_wb scope:SCOPE_SYS
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: system_release_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    global_wb scope:SCOPE_SYS
+; GFX12-CU-NEXT:    s_wait_dscnt 0x0
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: system_release_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_SYS
+; GFX1250-NEXT:    s_wait_dscnt 0x0
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence release, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -1267,14 +1297,20 @@ define amdgpu_kernel void @system_acq_rel_fence() {
 ;
 ; GFX12-WGP-LABEL: system_acq_rel_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    global_wb scope:SCOPE_SYS
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: system_acq_rel_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    global_wb scope:SCOPE_SYS
+; GFX12-CU-NEXT:    s_wait_dscnt 0x0
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: system_acq_rel_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_SYS
+; GFX1250-NEXT:    s_wait_dscnt 0x0
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence acq_rel, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -1337,14 +1373,20 @@ define amdgpu_kernel void @system_seq_cst_fence() {
 ;
 ; GFX12-WGP-LABEL: system_seq_cst_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    global_wb scope:SCOPE_SYS
+; GFX12-WGP-NEXT:    s_wait_dscnt 0x0
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: system_seq_cst_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    global_wb scope:SCOPE_SYS
+; GFX12-CU-NEXT:    s_wait_dscnt 0x0
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: system_seq_cst_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_SYS
+; GFX1250-NEXT:    s_wait_dscnt 0x0
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence seq_cst, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -1459,14 +1501,17 @@ define amdgpu_kernel void @system_one_as_release_fence() {
 ;
 ; GFX12-WGP-LABEL: system_one_as_release_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: system_one_as_release_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: system_one_as_release_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("one-as") release, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -1520,14 +1565,17 @@ define amdgpu_kernel void @system_one_as_acq_rel_fence() {
 ;
 ; GFX12-WGP-LABEL: system_one_as_acq_rel_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: system_one_as_acq_rel_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: system_one_as_acq_rel_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("one-as") acq_rel, !mmra !{!"amdgpu-synchronize-as", !"local"}
@@ -1581,14 +1629,17 @@ define amdgpu_kernel void @system_one_as_seq_cst_fence() {
 ;
 ; GFX12-WGP-LABEL: system_one_as_seq_cst_fence:
 ; GFX12-WGP:       ; %bb.0: ; %entry
+; GFX12-WGP-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX12-WGP-NEXT:    s_endpgm
 ;
 ; GFX12-CU-LABEL: system_one_as_seq_cst_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
+; GFX12-CU-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX12-CU-NEXT:    s_endpgm
 ;
 ; GFX1250-LABEL: system_one_as_seq_cst_fence:
 ; GFX1250:       ; %bb.0: ; %entry
+; GFX1250-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_endpgm
 entry:
   fence syncscope("one-as") seq_cst, !mmra !{!"amdgpu-synchronize-as", !"local"}

@ritter-x2a ritter-x2a marked this pull request as ready for review September 17, 2025 08:24
Copy link
Contributor

@Pierre-vh Pierre-vh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also check if other CacheControl classes have a similar issue, and if the acquire equivalent has a similar issue too (first check if you can reproduce a similar issue in a test case) ?

MMRAs touch the OrderingAddrSpace, so anything that checks for it too eagerly could be fooled unfortunately

if ((AddrSpace & SIAtomicAddrSpace::GLOBAL) == SIAtomicAddrSpace::NONE)
if (AddrSpace == SIAtomicAddrSpace::SCRATCH)
return false;

Copy link
Contributor

@Pierre-vh Pierre-vh Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better fix is to move the code surrounding the switch below into something like if((AddrSpace & SIAtomicAddrSpace::GLOBAL) != SIAtomicAddrSpace::NONE || ((AddrSpace & SIAtomicAddrSpace::SCRATCH) != SIAtomicAddrSpace::NONE) && ST.hasGloballyAddressableScratch())

Basically always go to the InsertWait section, just skip the wb/inv insertion if we're not touching the global AS.
That way we don't get those global_wb inserted for the local fences in the test below

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original code had an early return on a negative condition which matched the one in insertAcquire(). Why change it and move all the code inside a positive condition? That seems to have introduced a lot of whitespace noise due to reindentation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, in insertRelease, waitcnts are inserted after this global_wb insertion code; the early return therefore also (wrongly) affected the waitcnt insertion.
In insertAcquire, the early return only guards the global_inv insertion while necessary waitcnts are inserted outside of insertAcquire, before it is called.
We could introduce a bool canAffectGlobalAS(SIAtomicAddrSpace) helper that encapsulates this condition to make it more recognizable that one case is negated.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now ... I had not noticed the insertWait() at the end. canAffectGlobalAS() seems like a good idea. And maybe we can also have insertWB(), so that it's clear what is happening at a release? In either case, I think it's safer to fix that early return as a separate commit so that any breakages can be investigated separately.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And maybe we can also have insertWB(), so that it's clear what is happening at a release?

Feel free to overrule me here, but I don't think we gain much from introducing insertWB. I find that logic only used in three insertRelease implementations (gfx90a, gfx940, and gfx12), where it does something slightly different in each case, so there isn't much duplication to get rid off.
We could turn the insertRelease implementations for gfx90a, gfx940, and gfx12 into insertWB(...); insertWait(...); (for the others it seems to be just insertWait(...);), but that would only move the WB code from SIGfx*CacheControl::insertRelease to the new SIGfx*CacheControl::insertWB.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Pierre-vh

I think a better fix is to move the code surrounding the switch below into something like if((AddrSpace & SIAtomicAddrSpace::GLOBAL) != SIAtomicAddrSpace::NONE || ((AddrSpace & SIAtomicAddrSpace::SCRATCH) != SIAtomicAddrSpace::NONE) && ST.hasGloballyAddressableScratch())

It looks like you recently ensured with #154710 that atomic scratch instructions are replaced by flat instructions in case of GloballyAddressableScratch, so that part of the suggested condition should never trigger, right?
So, wouldn't this be better handled with an assert(!ST.hasGloballyAddressableScratch() || ((AS & SIAtomicAddrSpace::GLOBAL) != SIAtomicAddrSpace::NONE) || (AS & SIAtomicAddrSpace::SCRATCH) == SIAtomicAddrSpace::NONE)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've separated (and adjusted, to reflect my above comment about the GloballyAddressableScratch part being unnecessary) the part that concerned GloballyAddressableScratch into #160129.

Copy link
Member

@kuhar kuhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid internal ticket URLs in PR descriptions: 6773269

@arsenm
Copy link
Contributor

arsenm commented Sep 19, 2025

6773269

The policy was designed to specifically allow this

@ritter-x2a ritter-x2a force-pushed the users/ritter-x2a/09-17-_amdgpu_insert_waitcnt_for_non-global_fence_release_in_gfx12 branch from 6297305 to 1a6bab8 Compare September 19, 2025 13:10
Copy link
Member Author

@Pierre-vh

check if other CacheControl classes have a similar issue

No, the early return for globals was only present for gfx12, the previous generations always called insertWait, so they don't have the issue.

if the acquire equivalent has a similar issue too

Also no (mostly), the waitcnt for that is also inserted unconditionally, even before insertAcquire is called.
However, we currently wouldn't emit a GLOBAL_INV for scratch fences with GloballyAddressableScratch, that seems wrong to me. The updated PR changes that as well.

@ritter-x2a ritter-x2a force-pushed the users/ritter-x2a/09-17-_amdgpu_insert_waitcnt_for_non-global_fence_release_in_gfx12 branch from 1a6bab8 to 2d3f86d Compare September 22, 2025 15:28
A fence release could be followed by a barrier, so it should wait for
the relevant memory accesses to complete, even if it is mmra-limited to
LDS. So far, that would be skipped for non-global fence releases.

Fixes SWDEV-554932.
...and treat release/acquire on GloballyAddressableScratch as global.

This also fixes an issue where an early return between an iterator
increment/decrement pair in SIGfx12CacheControl::insertRelease could have lead
to unexpected insertion locations.
@ritter-x2a ritter-x2a force-pushed the users/ritter-x2a/09-17-_amdgpu_insert_waitcnt_for_non-global_fence_release_in_gfx12 branch from 2d3f86d to 3b7305a Compare September 23, 2025 08:54
Copy link
Member Author

ritter-x2a commented Sep 23, 2025

Merge activity

  • Sep 23, 9:50 AM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Sep 23, 9:52 AM UTC: @ritter-x2a merged this pull request with Graphite.

@ritter-x2a ritter-x2a merged commit 3f8c7e9 into main Sep 23, 2025
9 checks passed
@ritter-x2a ritter-x2a deleted the users/ritter-x2a/09-17-_amdgpu_insert_waitcnt_for_non-global_fence_release_in_gfx12 branch September 23, 2025 09:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants