
Conversation

krzysz00
Contributor

@krzysz00 krzysz00 commented Aug 12, 2025

The primary purpose of this commit is to enable marking loads to LDS (global.load.lds, buffer.*.load.lds) volatile (using bit 31 of the aux operand, as with normal buffer loads) and to ensure that their !nontemporal annotations translate to appropriate settings of the cache control bits.

However, in the process of implementing this feature, we also fixed:

  • Incorrect handling of buffer loads to LDS in GlobalISel
  • Volatile handling for buffer operations in SIMemoryLegalizer: previously, the mapping of address spaces caused volatile on buffer loads to be silently dropped on at least gfx10.

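For reference, a minimal IR sketch of the new capability (the function name is a placeholder; the call mirrors the llvm.amdgcn.load.to.lds.ll tests added below). Setting bit 31 (0x80000000) of the aux operand requests a volatile copy to LDS:

declare void @llvm.amdgcn.load.to.lds.p1(ptr addrspace(1), ptr addrspace(3), i32, i32, i32)

define amdgpu_ps void @volatile_copy_to_lds(ptr addrspace(1) %gptr, ptr addrspace(3) inreg %lptr) {
  ; aux = 2147483648 (bit 31): treat this 4-byte global->LDS copy as volatile
  call void @llvm.amdgcn.load.to.lds.p1(ptr addrspace(1) %gptr, ptr addrspace(3) %lptr, i32 4, i32 0, i32 2147483648)
  ret void
}
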
@llvmbot
Member

llvmbot commented Aug 12, 2025

@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-amdgpu

Author: Krzysztof Drewniak (krzysz00)

Changes

The primary purpose of this commit is to enable marking loads to LDS (global.load.lds, buffer.*.load.lds) volatile (using bit 31 of the aux operand, as with normal buffer loads) and to ensure that their !nontemporal annotations translate to appropriate settings of the cache control bits.

However, in the process of implementing this feature, we also fixed:

  • Incorrect handling of buffer loads to LDS in GlobalISel
  • Volatile handling for buffer operations in SIMemoryLegalizer: previously, the mapping of address spaces caused volatile on buffer loads to be silently dropped on at least gfx10.

Patch is 39.61 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/153244.diff

10 Files Affected:

  • (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+2-1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+14-4)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+12-3)
  • (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+89-15)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-contents-legalization.ll (+6-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll (+169)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.buffer.load.lds.ll (+40)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.buffer.load.ll (+6-2)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-lastuse-metadata.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-nontemporal-metadata.ll (+20-10)
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index 90cfd8cedd51b..70741d920f593 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -2783,7 +2783,8 @@ class AMDGPUGlobalLoadLDS :
      llvm_i32_ty,                       // imm offset (applied to both global and LDS address)
      llvm_i32_ty],                      // auxiliary data (imm, cachepolicy (bit 0 = sc0,
                                         //                                   bit 1 = sc1,
-                                        //                                   bit 4 = scc))
+                                        //                                   bit 4 = scc,
+                                        //                                   bit 31 = volatile (compiler implemented)))
     [IntrWillReturn, NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>,
      ImmArg<ArgIndex<2>>, ImmArg<ArgIndex<3>>, ImmArg<ArgIndex<4>>, IntrNoCallback, IntrNoFree],
      "", [SDNPMemOperand]>;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index b7fd131e76056..014efa52d4628 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -3481,10 +3481,14 @@ bool AMDGPUInstructionSelector::selectBufferLoadLds(MachineInstr &MI) const {
           : 0); // swz
 
   MachineMemOperand *LoadMMO = *MI.memoperands_begin();
+  // Don't set the offset value here because the pointer points to the base of
+  // the buffer.
   MachinePointerInfo LoadPtrI = LoadMMO->getPointerInfo();
-  LoadPtrI.Offset = MI.getOperand(6 + OpOffset).getImm();
+
   MachinePointerInfo StorePtrI = LoadPtrI;
-  StorePtrI.V = nullptr;
+  LoadPtrI.V = PoisonValue::get(PointerType::get(MF->getFunction().getContext(),
+                                                 AMDGPUAS::BUFFER_RESOURCE));
+  LoadPtrI.AddrSpace = AMDGPUAS::BUFFER_RESOURCE;
   StorePtrI.AddrSpace = AMDGPUAS::LOCAL_ADDRESS;
 
   auto F = LoadMMO->getFlags() &
@@ -3662,13 +3666,19 @@ bool AMDGPUInstructionSelector::selectGlobalLoadLds(MachineInstr &MI) const{
   if (isSGPR(Addr))
     MIB.addReg(VOffset);
 
-  MIB.add(MI.getOperand(4))  // offset
-     .add(MI.getOperand(5)); // cpol
+  MIB.add(MI.getOperand(4)); // offset
+
+  bool IsGFX12Plus = AMDGPU::isGFX12Plus(*Subtarget);
+  unsigned Aux = MI.getOperand(5).getImm();
+  MIB.addImm(Aux & (IsGFX12Plus ? AMDGPU::CPol::ALL
+                                : AMDGPU::CPol::ALL_pregfx12)); // cpol
 
   MachineMemOperand *LoadMMO = *MI.memoperands_begin();
   MachinePointerInfo LoadPtrI = LoadMMO->getPointerInfo();
   LoadPtrI.Offset = MI.getOperand(4).getImm();
   MachinePointerInfo StorePtrI = LoadPtrI;
+  LoadPtrI.V = PoisonValue::get(PointerType::get(MF->getFunction().getContext(),
+                                                 AMDGPUAS::GLOBAL_ADDRESS));
   LoadPtrI.AddrSpace = AMDGPUAS::GLOBAL_ADDRESS;
   StorePtrI.AddrSpace = AMDGPUAS::LOCAL_ADDRESS;
   auto F = LoadMMO->getFlags() &
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index e866bd47e267d..5c74b9a20ec0b 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -1577,6 +1577,9 @@ bool SITargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
     Info.memVT = EVT::getIntegerVT(CI.getContext(), Width * 8);
     Info.ptrVal = CI.getArgOperand(1);
     Info.flags |= MachineMemOperand::MOLoad | MachineMemOperand::MOStore;
+    auto *Aux = cast<ConstantInt>(CI.getArgOperand(CI.arg_size() - 1));
+    if (Aux->getZExtValue() & AMDGPU::CPol::VOLATILE)
+      Info.flags |= MachineMemOperand::MOVolatile;
     return true;
   }
   case Intrinsic::amdgcn_ds_bvh_stack_rtn:
@@ -10706,8 +10709,8 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
 
     MachinePointerInfo StorePtrI = LoadPtrI;
     LoadPtrI.V = PoisonValue::get(
-        PointerType::get(*DAG.getContext(), AMDGPUAS::GLOBAL_ADDRESS));
-    LoadPtrI.AddrSpace = AMDGPUAS::GLOBAL_ADDRESS;
+        PointerType::get(*DAG.getContext(), AMDGPUAS::BUFFER_RESOURCE));
+    LoadPtrI.AddrSpace = AMDGPUAS::BUFFER_RESOURCE;
     StorePtrI.AddrSpace = AMDGPUAS::LOCAL_ADDRESS;
 
     auto F = LoadMMO->getFlags() &
@@ -10794,7 +10797,13 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
     }
 
     Ops.push_back(Op.getOperand(5));  // Offset
-    Ops.push_back(Op.getOperand(6));  // CPol
+
+    bool IsGFX12Plus = AMDGPU::isGFX12Plus(*Subtarget);
+    unsigned Aux = Op.getConstantOperandVal(6);
+    Ops.push_back(DAG.getTargetConstant(
+        Aux & (IsGFX12Plus ? AMDGPU::CPol::ALL : AMDGPU::CPol::ALL_pregfx12),
+        DL, MVT::i32)); // CPol
+
     Ops.push_back(M0Val.getValue(0)); // Chain
     Ops.push_back(M0Val.getValue(1)); // Glue
 
diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index 53f554eccb1fb..f9a09f0e1a23c 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -25,6 +25,7 @@
 #include "llvm/IR/DiagnosticInfo.h"
 #include "llvm/IR/MemoryModelRelaxationAnnotations.h"
 #include "llvm/IR/PassManager.h"
+#include "llvm/Support/AMDGPUAddrSpace.h"
 #include "llvm/Support/AtomicOrdering.h"
 #include "llvm/TargetParser/TargetParser.h"
 
@@ -263,6 +264,12 @@ class SIMemOpAccess final {
   /// rmw operation, "std::nullopt" otherwise.
   std::optional<SIMemOpInfo>
   getAtomicCmpxchgOrRmwInfo(const MachineBasicBlock::iterator &MI) const;
+
+  /// \returns DMA to LDS info if \p MI is as a direct-to/from-LDS load/store,
+  /// along with an indication of whether this is a load or store. If it is not
+  /// a direct-to-LDS operation, returns std::nullopt.
+  std::optional<std::tuple<SIMemOpInfo, SIMemOp>>
+  getLdsLoadStoreInfo(const MachineBasicBlock::iterator &MI) const;
 };
 
 class SICacheControl {
@@ -662,6 +669,10 @@ class SIMemoryLegalizer final {
   /// instructions are added/deleted or \p MI is modified, false otherwise.
   bool expandAtomicCmpxchgOrRmw(const SIMemOpInfo &MOI,
                                 MachineBasicBlock::iterator &MI);
+  /// Expands LDS load/store operation \p MI. Returns true if instructions are
+  /// added/deleted or \p MI is modified, false otherwise.
+  bool expandLdsLoadStore(const SIMemOpInfo &MOI, SIMemOp OpKind,
+                          MachineBasicBlock::iterator &MI);
 
 public:
   SIMemoryLegalizer(const MachineModuleInfo &MMI) : MMI(MMI) {};
@@ -786,6 +797,9 @@ SIAtomicAddrSpace SIMemOpAccess::toSIAtomicAddrSpace(unsigned AS) const {
     return SIAtomicAddrSpace::SCRATCH;
   if (AS == AMDGPUAS::REGION_ADDRESS)
     return SIAtomicAddrSpace::GDS;
+  if (AS == AMDGPUAS::BUFFER_FAT_POINTER || AS == AMDGPUAS::BUFFER_RESOURCE ||
+      AS == AMDGPUAS::BUFFER_STRIDED_POINTER)
+    return SIAtomicAddrSpace::GLOBAL;
 
   return SIAtomicAddrSpace::OTHER;
 }
@@ -937,6 +951,43 @@ std::optional<SIMemOpInfo> SIMemOpAccess::getAtomicCmpxchgOrRmwInfo(
   return constructFromMIWithMMO(MI);
 }
 
+std::optional<std::tuple<SIMemOpInfo, SIMemOp>>
+SIMemOpAccess::getLdsLoadStoreInfo(
+    const MachineBasicBlock::iterator &MI) const {
+  assert(MI->getDesc().TSFlags & SIInstrFlags::maybeAtomic);
+
+  if (!(MI->mayLoad() && MI->mayStore()))
+    return std::nullopt;
+
+  // An LDS DMA will have exactly two memory operands.
+  if (MI->getNumMemOperands() != 2)
+    return std::nullopt;
+
+  bool HasLDS = false;
+  bool HasNonLDS = false;
+  SIMemOp OpKind = SIMemOp::LOAD;
+  for (const auto &MMO : MI->memoperands()) {
+    unsigned AS = MMO->getAddrSpace();
+    HasLDS |= AS == AMDGPUAS::LOCAL_ADDRESS;
+    if (AS != AMDGPUAS::LOCAL_ADDRESS) {
+      HasNonLDS |= true;
+      if (!HasLDS) {
+        // If the pointer to LDS was in the first memop, this is a store
+        // from that pointer.
+        OpKind = SIMemOp::STORE;
+      }
+    }
+  }
+  if (!HasLDS || !HasNonLDS) {
+    return std::nullopt;
+  }
+
+  if (auto MOI = constructFromMIWithMMO(MI)) {
+    return std::make_tuple(*MOI, OpKind);
+  }
+  return std::nullopt;
+}
+
 SICacheControl::SICacheControl(const GCNSubtarget &ST) : ST(ST) {
   TII = ST.getInstrInfo();
   IV = getIsaVersion(ST.getCPU());
@@ -1039,8 +1090,8 @@ bool SIGfx6CacheControl::enableVolatileAndOrNonTemporal(
     bool IsVolatile, bool IsNonTemporal, bool IsLastUse = false) const {
   // Only handle load and store, not atomic read-modify-write insructions. The
   // latter use glc to indicate if the atomic returns a result and so must not
-  // be used for cache control.
-  assert(MI->mayLoad() ^ MI->mayStore());
+  // be used for cache control. There used to be a load ^ store assert here,
+  // but it was removed to allow handling direct-to-LDS copies.
 
   // Only update load and store, not LLVM IR atomic read-modify-write
   // instructions. The latter are always marked as volatile so cannot sensibly
@@ -1401,8 +1452,8 @@ bool SIGfx90ACacheControl::enableVolatileAndOrNonTemporal(
     bool IsVolatile, bool IsNonTemporal, bool IsLastUse = false) const {
   // Only handle load and store, not atomic read-modify-write insructions. The
   // latter use glc to indicate if the atomic returns a result and so must not
-  // be used for cache control.
-  assert(MI->mayLoad() ^ MI->mayStore());
+  // be used for cache control. There used to be a load ^ store assert here,
+  // but it was removed to allow handling direct-to-LDS copies.
 
   // Only update load and store, not LLVM IR atomic read-modify-write
   // instructions. The latter are always marked as volatile so cannot sensibly
@@ -1703,8 +1754,8 @@ bool SIGfx940CacheControl::enableVolatileAndOrNonTemporal(
     bool IsVolatile, bool IsNonTemporal, bool IsLastUse = false) const {
   // Only handle load and store, not atomic read-modify-write insructions. The
   // latter use glc to indicate if the atomic returns a result and so must not
-  // be used for cache control.
-  assert(MI->mayLoad() ^ MI->mayStore());
+  // be used for cache control. There used to be a load ^ store assert here,
+  // but it was removed to allow handling direct-to-LDS copies.
 
   // Only update load and store, not LLVM IR atomic read-modify-write
   // instructions. The latter are always marked as volatile so cannot sensibly
@@ -1936,8 +1987,8 @@ bool SIGfx10CacheControl::enableVolatileAndOrNonTemporal(
 
   // Only handle load and store, not atomic read-modify-write insructions. The
   // latter use glc to indicate if the atomic returns a result and so must not
-  // be used for cache control.
-  assert(MI->mayLoad() ^ MI->mayStore());
+  // be used for cache control. There used to be a load ^ store assert here,
+  // but it was removed to allow handling direct-to-LDS copies.
 
   // Only update load and store, not LLVM IR atomic read-modify-write
   // instructions. The latter are always marked as volatile so cannot sensibly
@@ -2216,8 +2267,9 @@ bool SIGfx11CacheControl::enableVolatileAndOrNonTemporal(
 
   // Only handle load and store, not atomic read-modify-write insructions. The
   // latter use glc to indicate if the atomic returns a result and so must not
-  // be used for cache control.
-  assert(MI->mayLoad() ^ MI->mayStore());
+  // be used for cache control. There used to be a load ^ store assert here,
+  // but it was removed to allow handling direct-to-LDS
+  // copies.assert(MI->mayLoad() ^ MI->mayStore());
 
   // Only update load and store, not LLVM IR atomic read-modify-write
   // instructions. The latter are always marked as volatile so cannot sensibly
@@ -2535,7 +2587,8 @@ bool SIGfx12CacheControl::enableVolatileAndOrNonTemporal(
     bool IsVolatile, bool IsNonTemporal, bool IsLastUse = false) const {
 
   // Only handle load and store, not atomic read-modify-write instructions.
-  assert(MI->mayLoad() ^ MI->mayStore());
+  // There used to be a load ^ store assert here, but it was removed to
+  // allow handling direct-to-LDS copies.
 
   // Only update load and store, not LLVM IR atomic read-modify-write
   // instructions. The latter are always marked as volatile so cannot sensibly
@@ -2818,6 +2871,23 @@ bool SIMemoryLegalizer::expandAtomicCmpxchgOrRmw(const SIMemOpInfo &MOI,
   return Changed;
 }
 
+bool SIMemoryLegalizer::expandLdsLoadStore(const SIMemOpInfo &MOI,
+                                           SIMemOp OpKind,
+                                           MachineBasicBlock::iterator &MI) {
+  assert(MI->mayLoad() && MI->mayStore());
+
+  bool Changed = false;
+
+  // Handle volatile and/or nontemporal markers on direct-to-LDS loads and
+  // stores. The operation is treated as a volatile/nontemporal store
+  // to its second argument.
+  Changed |= CC->enableVolatileAndOrNonTemporal(
+      MI, MOI.getInstrAddrSpace(), OpKind, MOI.isVolatile(),
+      MOI.isNonTemporal(), MOI.isLastUse());
+
+  return Changed;
+}
+
 bool SIMemoryLegalizerLegacy::runOnMachineFunction(MachineFunction &MF) {
   const MachineModuleInfo &MMI =
       getAnalysis<MachineModuleInfoWrapperPass>().getMMI();
@@ -2863,14 +2933,18 @@ bool SIMemoryLegalizer::run(MachineFunction &MF) {
       if (!(MI->getDesc().TSFlags & SIInstrFlags::maybeAtomic))
         continue;
 
-      if (const auto &MOI = MOA.getLoadInfo(MI))
+      if (const auto &MOI = MOA.getLoadInfo(MI)) {
         Changed |= expandLoad(*MOI, MI);
-      else if (const auto &MOI = MOA.getStoreInfo(MI)) {
+      } else if (const auto &MOI = MOA.getStoreInfo(MI)) {
         Changed |= expandStore(*MOI, MI);
-      } else if (const auto &MOI = MOA.getAtomicFenceInfo(MI))
+      } else if (const auto &MOIAndOpKind = MOA.getLdsLoadStoreInfo(MI)) {
+        const auto &[MOI, OpKind] = *MOIAndOpKind;
+        Changed |= expandLdsLoadStore(MOI, OpKind, MI);
+      } else if (const auto &MOI = MOA.getAtomicFenceInfo(MI)) {
         Changed |= expandAtomicFence(*MOI, MI);
-      else if (const auto &MOI = MOA.getAtomicCmpxchgOrRmwInfo(MI))
+      } else if (const auto &MOI = MOA.getAtomicCmpxchgOrRmwInfo(MI)) {
         Changed |= expandAtomicCmpxchgOrRmw(*MOI, MI);
+      }
     }
   }
 
diff --git a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-contents-legalization.ll b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-contents-legalization.ll
index 53b2542cf9a7e..142290a39f8f4 100644
--- a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-contents-legalization.ll
+++ b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-contents-legalization.ll
@@ -3611,10 +3611,10 @@ define <6 x i8> @volatile_load_v6i8(ptr addrspace(8) inreg %buf) {
 ; SDAG:       ; %bb.0:
 ; SDAG-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; SDAG-NEXT:    buffer_load_dword v0, off, s[16:19], 0 glc
+; SDAG-NEXT:    s_waitcnt vmcnt(0)
 ; SDAG-NEXT:    buffer_load_ushort v6, off, s[16:19], 0 offset:4 glc
-; SDAG-NEXT:    s_waitcnt vmcnt(1)
-; SDAG-NEXT:    v_lshrrev_b32_e32 v7, 8, v0
 ; SDAG-NEXT:    s_waitcnt vmcnt(0)
+; SDAG-NEXT:    v_lshrrev_b32_e32 v7, 8, v0
 ; SDAG-NEXT:    v_and_b32_e32 v1, 0xffff, v6
 ; SDAG-NEXT:    v_lshrrev_b64 v[3:4], 24, v[0:1]
 ; SDAG-NEXT:    v_lshrrev_b32_e32 v2, 16, v0
@@ -3627,12 +3627,12 @@ define <6 x i8> @volatile_load_v6i8(ptr addrspace(8) inreg %buf) {
 ; GISEL:       ; %bb.0:
 ; GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GISEL-NEXT:    buffer_load_dword v0, off, s[16:19], 0 glc
+; GISEL-NEXT:    s_waitcnt vmcnt(0)
 ; GISEL-NEXT:    buffer_load_ushort v4, off, s[16:19], 0 offset:4 glc
-; GISEL-NEXT:    s_waitcnt vmcnt(1)
+; GISEL-NEXT:    s_waitcnt vmcnt(0)
 ; GISEL-NEXT:    v_lshrrev_b32_e32 v1, 8, v0
 ; GISEL-NEXT:    v_lshrrev_b32_e32 v2, 16, v0
 ; GISEL-NEXT:    v_lshrrev_b32_e32 v3, 24, v0
-; GISEL-NEXT:    s_waitcnt vmcnt(0)
 ; GISEL-NEXT:    v_lshrrev_b32_e32 v5, 8, v4
 ; GISEL-NEXT:    s_setpc_b64 s[30:31]
   %p = addrspacecast ptr addrspace(8) %buf to ptr addrspace(7)
@@ -3652,6 +3652,7 @@ define void @volatile_store_v6i8(<6 x i8> %data, ptr addrspace(8) inreg %buf) {
 ; SDAG-NEXT:    v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
 ; SDAG-NEXT:    v_or_b32_sdwa v4, v4, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; SDAG-NEXT:    buffer_store_dword v0, off, s[16:19], 0
+; SDAG-NEXT:    s_waitcnt vmcnt(0)
 ; SDAG-NEXT:    buffer_store_short v4, off, s[16:19], 0 offset:4
 ; SDAG-NEXT:    s_waitcnt vmcnt(0)
 ; SDAG-NEXT:    s_setpc_b64 s[30:31]
@@ -3671,6 +3672,7 @@ define void @volatile_store_v6i8(<6 x i8> %data, ptr addrspace(8) inreg %buf) {
 ; GISEL-NEXT:    v_lshl_or_b32 v0, v1, 16, v0
 ; GISEL-NEXT:    v_or_b32_sdwa v2, v4, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
 ; GISEL-NEXT:    buffer_store_dword v0, off, s[16:19], 0
+; GISEL-NEXT:    s_waitcnt vmcnt(0)
 ; GISEL-NEXT:    buffer_store_short v2, off, s[16:19], 0 offset:4
 ; GISEL-NEXT:    s_waitcnt vmcnt(0)
 ; GISEL-NEXT:    s_setpc_b64 s[30:31]
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll
index 5d03dfb56c8cc..26aa71193c726 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll
@@ -218,3 +218,172 @@ main_body:
   ret void
 }
 
+define amdgpu_ps void @global_load_lds_dword_volatile(ptr addrspace(1) nocapture %gptr, ptr addrspace(3) inreg %lptr) {
+; GFX90A-LABEL: global_load_lds_dword_volatile:
+; GFX90A:       ; %bb.0: ; %main_body
+; GFX90A-NEXT:    s_mov_b32 m0, s0
+; GFX90A-NEXT:    s_nop 0
+; GFX90A-NEXT:    global_load_dword v[0:1], off lds
+; GFX90A-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-NEXT:    global_load_dword v[0:1], off offset:256 lds
+; GFX90A-NEXT:    global_load_dword v[0:1], off offset:512 lds
+; GFX90A-NEXT:    s_endpgm
+;
+; GFX942-LABEL: global_load_lds_dword_volatile:
+; GFX942:       ; %bb.0: ; %main_body
+; GFX942-NEXT:    s_mov_b32 m0, s0
+; GFX942-NEXT:    s_nop 0
+; GFX942-NEXT:    global_load_lds_dword v[0:1], off sc0 sc1
+; GFX942-NEXT:    s_waitcnt vmcnt(0)
+; GFX942-NEXT:    global_load_lds_dword v[0:1], off offset:256
+; GFX942-NEXT:    global_load_lds_dword v[0:1], off offset:512
+; GFX942-NEXT:    s_endpgm
+;
+; GFX10-LABEL: global_load_lds_dword_volatile:
+; GFX10:       ; %bb.0: ; %main_body
+; GFX10-NEXT:    s_mov_b32 m0, s0
+; GFX10-NEXT:    global_load_dword v[0:1], off lds
+; GFX10-NEXT:    global_load_dword v[0:1], off offset:256 lds
+; GFX10-NEXT:    global_load_dword v[0:1], off offset:512 lds
+; GFX10-NEXT:    s_endpgm
+;
+; GFX942-GISEL-LABEL: global_load_lds_dword_volatile:
+; GFX942-GISEL:       ; %bb.0: ; %main_body
+; GFX942-GISEL-NEXT:    s_mov_b32 m0, s0
+; GFX942-GISEL-NEXT:    s_nop 0
+; GFX942-GISEL-NEXT:    global_load_lds_dword v[0:1], off sc0 sc1
+; GFX942-GISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX942-GISEL-NEXT:    global_load_lds_dword v[0:1], off offset:256
+; GFX942-GISEL-NEXT:    global_load_lds_dword v[0:1], off offset:512
+; GFX942-GISEL-NEXT:    s_endpgm
+main_body:
+  call void @llvm.amdgcn.load.to.lds.p1(ptr addrspace(1) %gptr, ptr addrspace(3) %lptr, i32 4, i32 0, i32 2147483648)
+  call void @llvm.amdgcn.load.to.lds.p1(ptr addrspace(1) %gptr, ptr addrspace(3) %lptr, i32 4, i32 256, i32 0)
+  call void @llvm.amdgcn.load.to.lds.p1(ptr addrspace(1) %gptr, ptr addrspace(3) %lptr, i32 4, i32 512, i32 0)
+  ret void
+}
+
+define amdgpu_ps void @buffer_load_lds_dword_volatile(ptr addrspace(7) nocapture inreg %gptr, i32 %off, ptr addrspace(3) inreg %lptr) {
+; GFX90A-LABEL: buffer_load_lds_dword_volatile:
+; GFX90A:       ; %bb.0: ; %main_body
+; GFX90A-NEXT:    v_add_u32_e32 v0, s4, v0
+; GFX90A-NEXT:    s_mov_b32 m0, s5
+; GFX90A-NEXT:    s_nop 0
+; GFX90A-NEXT:    buffer_load_dword v0, s[0:3], 0 offen lds
+; GFX90A-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-NEXT:    buffer_load_dword v0, s[0:3], 0 offen offset:256 lds
+; GFX90A-NEXT:    buffer_load_dword v0, s[0:3], 0 offen offset:512 lds
+; GFX90A-NEXT:    s_endpgm
+;
+; GFX942-LABEL: buffer_load_lds_dword_volatile:
+; GFX942:       ; %bb.0: ; %main_body
+; GFX942-NEXT:    v_add_u32_e32 v0, s4, v0
+; GFX942-NEXT:    s_mov_b32 m0, s5
+; GFX942-NEXT:    s_nop 0
+; GFX942-NEXT:    buffer_load_dword v0, s[0:3], 0 offen sc0 sc1 lds
+; GFX942-NEXT:    s_waitcnt vmcnt(0)
+; GFX942-NEXT:    buffer_load_dword v0, s[0:3], 0 offen offset:256 lds
+; GFX942-NEXT:    buffer_load_dword v0, s[0:3], 0 offen offset:512 lds
+; GFX942-NEXT: ...
[truncated]

 //                                   bit 1 = sc1,
-//                                   bit 4 = scc))
+//                                   bit 4 = scc,
+//                                   bit 31 = volatile (compiler implemented)))
Contributor

We ought to get this in the AMDGPUUsage documentation for intrinsics (and we really should get tablegenerated docs for this)

Contributor Author

Yeah ... should making a table of aux bits be in this PR?

Co-authored-by: Matt Arsenault <[email protected]>
Contributor

@Pierre-vh Pierre-vh left a comment

This is too generic; it needs to be properly confined to the set of instructions you'd like to target, which I assume are the GFX9-style DMA instructions.

It'll also need an update to AMDGPUUsage to document the effects on the memory model.

Lastly, did you verify that the additional bits enabled on the instructions do what you want? As these behave quite differently, I wouldn't be surprised if there is a catch somewhere.


github-actions bot commented Aug 20, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@krzysz00 krzysz00 requested review from Pierre-vh, arsenm and ro-i August 20, 2025 15:23
@krzysz00
Contributor Author

We're now confined to the DMAs, and I'm pretty sure the cache bits do what we want them to do, no?

This is coming from the fact that empirically, setting such cache bits is needed for performance, and I'm trying to avoid a big architecture-specific switch-case on Aux

Contributor

@Pierre-vh Pierre-vh left a comment

Memory legalizer looks a lot better, thanks.
Can you please add a memory legalizer test in the same style as the others? A big test covering all (or at least most of) the cases exhaustively on all supported architectures.

@krzysz00
Contributor Author

krzysz00 commented Sep 3, 2025

Re testing, is the issue that I missed some cases or that they're in the wrong spot or both?

@Pierre-vh Pierre-vh requested a review from ssahasra September 5, 2025 09:40
@ssahasra
Collaborator

ssahasra commented Sep 5, 2025

This is coming from the fact that empirically, setting such cache bits is needed for performance, and I'm trying to avoid a big architecture-specific switch-case on Aux

What is the actual assessment of the performance gain? Is it because the operations are truly non-temporal, or is it because it helps to bypass the cache? These two are not the same thing.

@krzysz00
Contributor Author

krzysz00 commented Sep 5, 2025

@ssahasra

Per @qedawkins

so I hacked in non-temporal and it definitely helps for my case. 0.08 -> 0.073 ms

This was, I believe, setting the NT and SC0 bits on a gfx950 (or a gfx942).

(Further discussion revealed that the SC0 bit probably wasn't needed).

This patch came out of the higher-level-abstractions notion that we're trying to get into LLVM: setting the nontemporal bit increases the performance of some direct-to-LDS loads (because you don't need to cache them, in roughly the way you wouldn't need to cache a nontemporal load). The alternative would be to go up into MLIR and basically add a copy of the nontemporal-enablement logic to the code that generates the intrinsic call.

@krzysz00
Contributor Author

krzysz00 commented Sep 8, 2025

@Pierre-vh Is there still a reason for this hold?

Collaborator

@ssahasra ssahasra left a comment

LGTM, but I don't claim to be an expert in the actual mechanics of how information flows from metadata to backend bits. Please wait for @Pierre-vh to review.

@Pierre-vh
Contributor

It's missing the memory legalizer tests. I'd like to see a test that follows the same pattern as the other memory legalizer tests (memory-legalizer-...).

  • One check line for all archs
  • Test a reasonable mix of DMA operations with the desired flags.

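For illustration, a minimal sketch of what such a memory-legalizer-style test could look like (the RUN lines, check prefixes, and function names here are hypothetical, not the ones ultimately committed; checks would be autogenerated):

; NOTE: Assertions would be autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a -verify-machineinstrs < %s | FileCheck -check-prefix=GFX90A %s
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1010 -verify-machineinstrs < %s | FileCheck -check-prefix=GFX10 %s

declare void @llvm.amdgcn.load.to.lds.p1(ptr addrspace(1), ptr addrspace(3), i32, i32, i32)

define amdgpu_kernel void @global_to_lds_volatile(ptr addrspace(1) %gptr, ptr addrspace(3) %lptr) {
  ; Bit 31 of the aux operand marks this 4-byte global->LDS copy volatile.
  call void @llvm.amdgcn.load.to.lds.p1(ptr addrspace(1) %gptr, ptr addrspace(3) %lptr, i32 4, i32 0, i32 2147483648)
  ret void
}

define amdgpu_kernel void @global_to_lds_nontemporal(ptr addrspace(1) %gptr, ptr addrspace(3) %lptr) {
  ; !nontemporal on the call should translate to the appropriate cache-control bits.
  call void @llvm.amdgcn.load.to.lds.p1(ptr addrspace(1) %gptr, ptr addrspace(3) %lptr, i32 4, i32 0, i32 0), !nontemporal !0
  ret void
}

!0 = !{i32 1}
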
@krzysz00 krzysz00 requested a review from Pierre-vh September 18, 2025 00:31
@krzysz00
Contributor Author

Have gone and added the test

Co-authored-by: Matt Arsenault <[email protected]>
@krzysz00 krzysz00 requested a review from arsenm September 18, 2025 00:32
Contributor

@arsenm arsenm left a comment

lgtm with test nit

@krzysz00
Contributor Author

@Pierre-vh ping

@krzysz00
Contributor Author

@Pierre-vh ping

Contributor

@Pierre-vh Pierre-vh left a comment

Sorry, got lost in the noise
