Conversation

@choikwa
Contributor

@choikwa choikwa commented May 20, 2025

…lustered

This patch relaxes the same-base-pointer requirement for memory-op clustering by testing only for an identical address space. In testing, clustering memory ops with different base pointers has been observed to improve performance. In particular, Babelstream's dot_kernel(double) performed up to 15% better with clustered memory loads that have different base pointers. Internal CQE testing showed no significant regressions.

RFC

@llvmbot
Member

llvmbot commented May 20, 2025

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: choikwa (choikwa)

Changes

…lustered

This patch relaxes same base pointer requirement for memory ops clustering by only testing for identical addrspace. In testing, it has been observed that clustering memory ops with different base pointers can improve performance. In particular, Babelstream dot_kernel(double) performed up to 15% better with clustered memory loads with different base pointers. Internal CQE testing did not show significant regressions.

RFC


Patch is 2.67 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/140674.diff

107 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+33-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll (+114-114)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/implicit-kernarg-backend-usage-global-isel.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll (+6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wmma_32.ll (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wmma_64.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/localizer.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/wmma-gfx12-w32-swmmac-index_key.ll (+10-16)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/wmma-gfx12-w64-swmmac-index_key.ll (+11)
  • (modified) llvm/test/CodeGen/AMDGPU/add.v2i16.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll (+25-25)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll (+8927-9038)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll (+204-210)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll (+1293-1303)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-cc.ll (+58-54)
  • (modified) llvm/test/CodeGen/AMDGPU/array-ptr-calc-i32.ll (+1-2)
  • (modified) llvm/test/CodeGen/AMDGPU/attributor-flatscratchinit-undefined-behavior2.ll (+86-78)
  • (modified) llvm/test/CodeGen/AMDGPU/bf16.ll (+43-44)
  • (modified) llvm/test/CodeGen/AMDGPU/call-argument-types.ll (+8-3)
  • (modified) llvm/test/CodeGen/AMDGPU/chain-hi-to-lo.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/clamp-modifier.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/clamp.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/cluster_stores.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/constant-address-space-32bit.ll (+841-136)
  • (modified) llvm/test/CodeGen/AMDGPU/copy-to-reg-scc-clobber.ll (+14-13)
  • (modified) llvm/test/CodeGen/AMDGPU/ctpop16.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/divergence-driven-buildvector.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/ds_read2.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/fcmp.f16.ll (+56)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f16.ll (+30-11)
  • (modified) llvm/test/CodeGen/AMDGPU/fma-combine.ll (+10-11)
  • (modified) llvm/test/CodeGen/AMDGPU/fmed3.ll (+3)
  • (modified) llvm/test/CodeGen/AMDGPU/fmul.f16.ll (+47-45)
  • (modified) llvm/test/CodeGen/AMDGPU/frem.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/fsub.f16.ll (+15-14)
  • (modified) llvm/test/CodeGen/AMDGPU/function-args-inreg.ll (+42-42)
  • (modified) llvm/test/CodeGen/AMDGPU/function-args.ll (+76-82)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll (+60-56)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll (+114-98)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll (+60-42)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll (+60-42)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll (+114-98)
  • (modified) llvm/test/CodeGen/AMDGPU/group-image-instructions.ll (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/idot2.ll (+348-350)
  • (modified) llvm/test/CodeGen/AMDGPU/idot4s.ll (+346-351)
  • (modified) llvm/test/CodeGen/AMDGPU/idot4u.ll (+610-618)
  • (modified) llvm/test/CodeGen/AMDGPU/idot8s.ll (+117-121)
  • (modified) llvm/test/CodeGen/AMDGPU/idot8u.ll (+98-105)
  • (modified) llvm/test/CodeGen/AMDGPU/implicit-kernarg-backend-usage.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-call-known-callees.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_elt.v2i16.ll (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/issue130120-eliminate-frame-index.ll (+7-6)
  • (modified) llvm/test/CodeGen/AMDGPU/lds-frame-extern.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bvh8_intersect_ray.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dead.ll (+10-1)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dual_intersect_ray.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.bf16.bf16.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.f16.f16.ll (+8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.f32.bf16.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fmad.ftz.ll (+1-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.intersect_ray.ll (+10)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.lds.kernel.id.ll (+8-7)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.scale.f32.32x32x64.f8f6f4.ll (+17-17)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.tfe.ll (+20)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.softwqm.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.load.tfe.ll (+20)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.waitcnt.out.order.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wmma_32.ll (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wmma_64.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.fma.f16.ll (+44-44)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.fmuladd.f16.ll (+31-21)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.maximum.f64.ll (+39-39)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.maxnum.f16.ll (+80-86)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.minimum.f64.ll (+39-39)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.minnum.f16.ll (+80-86)
  • (modified) llvm/test/CodeGen/AMDGPU/load-select-ptr.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/max.i16.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/min.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/mixed-vmem-types.ll (+6-1)
  • (modified) llvm/test/CodeGen/AMDGPU/mul.ll (+68-62)
  • (modified) llvm/test/CodeGen/AMDGPU/or.ll (+14-14)
  • (modified) llvm/test/CodeGen/AMDGPU/permute_i8.ll (+101-51)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll (+70-53)
  • (modified) llvm/test/CodeGen/AMDGPU/reassoc-mul-add-1-to-mad.ll (+59-61)
  • (modified) llvm/test/CodeGen/AMDGPU/rotl.ll (+8-5)
  • (modified) llvm/test/CodeGen/AMDGPU/rotr.ll (+8-5)
  • (modified) llvm/test/CodeGen/AMDGPU/sdwa-commute.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/select.f16.ll (+155-155)
  • (modified) llvm/test/CodeGen/AMDGPU/sitofp.f16.ll (+32-30)
  • (modified) llvm/test/CodeGen/AMDGPU/sub.ll (+19-19)
  • (modified) llvm/test/CodeGen/AMDGPU/sub.v2i16.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/uitofp.f16.ll (+32-30)
  • (modified) llvm/test/CodeGen/AMDGPU/v_madak_f16.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/vector-reduce-fadd.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/vector-reduce-fmul.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll (+191-16)
  • (modified) llvm/test/CodeGen/AMDGPU/vselect.ll (+25-29)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma-gfx12-w32-swmmac-index_key.ll (+10-16)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma-gfx12-w64-swmmac-index_key.ll (+11)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma_multiple_32.ll (+22-44)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma_multiple_64.ll (+22)
  • (modified) llvm/test/CodeGen/AMDGPU/wqm.ll (+42-53)
  • (modified) llvm/test/CodeGen/AMDGPU/xor.ll (+32-32)
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 85276bd24bcf4..8b19ab35bc822 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -47,6 +47,12 @@ namespace llvm::AMDGPU {
 #include "AMDGPUGenSearchableTables.inc"
 } // namespace llvm::AMDGPU
 
+static cl::opt<bool> DisableDiffBasePtrMemClustering(
+  "amdgpu-disable-diff-baseptr-mem-clustering",
+  cl::desc("Disable clustering memory ops with different base pointers"),
+  cl::init(false),
+  cl::Hidden);
+
 // Must be at least 4 to be able to branch over minimum unconditional branch
 // code. This is only for making it possible to write reasonably small tests for
 // long branches.
@@ -522,6 +528,22 @@ bool SIInstrInfo::getMemOperandsWithOffsetWidth(
   return false;
 }
 
+static bool memOpsHaveSameAddrspace(const MachineInstr &MI1,
+                                  ArrayRef<const MachineOperand *> BaseOps1,
+                                  const MachineInstr &MI2,
+                                  ArrayRef<const MachineOperand *> BaseOps2) {
+  // If base is identical, assume identical addrspace
+  if (BaseOps1.front()->isIdenticalTo(*BaseOps2.front()))
+    return true;
+
+  if (!MI1.hasOneMemOperand() || !MI2.hasOneMemOperand())
+    return false;
+
+  auto *MO1 = *MI1.memoperands_begin();
+  auto *MO2 = *MI2.memoperands_begin();
+  return MO1->getAddrSpace() == MO2->getAddrSpace();
+}
+
 static bool memOpsHaveSameBasePtr(const MachineInstr &MI1,
                                   ArrayRef<const MachineOperand *> BaseOps1,
                                   const MachineInstr &MI2,
@@ -559,14 +581,21 @@ bool SIInstrInfo::shouldClusterMemOps(ArrayRef<const MachineOperand *> BaseOps1,
                                       int64_t Offset2, bool OffsetIsScalable2,
                                       unsigned ClusterSize,
                                       unsigned NumBytes) const {
-  // If the mem ops (to be clustered) do not have the same base ptr, then they
-  // should not be clustered
   unsigned MaxMemoryClusterDWords = DefaultMemoryClusterDWordsLimit;
   if (!BaseOps1.empty() && !BaseOps2.empty()) {
     const MachineInstr &FirstLdSt = *BaseOps1.front()->getParent();
     const MachineInstr &SecondLdSt = *BaseOps2.front()->getParent();
-    if (!memOpsHaveSameBasePtr(FirstLdSt, BaseOps1, SecondLdSt, BaseOps2))
-      return false;
+    
+    if (!DisableDiffBasePtrMemClustering) {
+      // Only consider memory ops from same addrspace for clustering
+      if (!memOpsHaveSameAddrspace(FirstLdSt, BaseOps1, SecondLdSt, BaseOps2))
+        return false;
+    } else {
+      // If the mem ops (to be clustered) do not have the same base ptr, then they
+      // should not be clustered
+      if (!memOpsHaveSameBasePtr(FirstLdSt, BaseOps1, SecondLdSt, BaseOps2))
+        return false;
+    }
 
     const SIMachineFunctionInfo *MFI =
         FirstLdSt.getMF()->getInfo<SIMachineFunctionInfo>();
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll
index 27b93872b9f1d..f562d958529d1 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll
@@ -8,31 +8,31 @@ define void @add_v3i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addrs
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX8-NEXT:    v_add_u32_e32 v6, vcc, 2, v0
 ; GFX8-NEXT:    v_addc_u32_e32 v7, vcc, 0, v1, vcc
-; GFX8-NEXT:    flat_load_ushort v8, v[0:1]
-; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 4, v0
-; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; GFX8-NEXT:    flat_load_ushort v9, v[6:7]
-; GFX8-NEXT:    flat_load_ushort v10, v[0:1]
-; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 2, v2
+; GFX8-NEXT:    v_add_u32_e32 v8, vcc, 4, v0
+; GFX8-NEXT:    v_addc_u32_e32 v9, vcc, 0, v1, vcc
+; GFX8-NEXT:    flat_load_ushort v10, v[8:9]
+; GFX8-NEXT:    v_add_u32_e32 v8, vcc, 2, v2
+; GFX8-NEXT:    v_addc_u32_e32 v9, vcc, 0, v3, vcc
+; GFX8-NEXT:    flat_load_ushort v11, v[0:1]
+; GFX8-NEXT:    flat_load_ushort v12, v[2:3]
+; GFX8-NEXT:    flat_load_ushort v8, v[8:9]
+; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 4, v2
 ; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v3, vcc
-; GFX8-NEXT:    v_add_u32_e32 v6, vcc, 4, v2
-; GFX8-NEXT:    v_addc_u32_e32 v7, vcc, 0, v3, vcc
-; GFX8-NEXT:    flat_load_ushort v11, v[2:3]
-; GFX8-NEXT:    flat_load_ushort v12, v[0:1]
 ; GFX8-NEXT:    flat_load_ushort v6, v[6:7]
+; GFX8-NEXT:    flat_load_ushort v7, v[0:1]
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 2, v4
 ; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v5, vcc
 ; GFX8-NEXT:    v_add_u32_e32 v2, vcc, 4, v4
 ; GFX8-NEXT:    v_addc_u32_e32 v3, vcc, 0, v5, vcc
-; GFX8-NEXT:    s_waitcnt vmcnt(2)
-; GFX8-NEXT:    v_add_u16_e32 v7, v8, v11
+; GFX8-NEXT:    s_waitcnt vmcnt(3)
+; GFX8-NEXT:    v_add_u16_e32 v9, v11, v12
 ; GFX8-NEXT:    s_waitcnt vmcnt(1)
-; GFX8-NEXT:    v_add_u16_e32 v8, v9, v12
+; GFX8-NEXT:    v_add_u16_e32 v6, v6, v8
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
-; GFX8-NEXT:    v_add_u16_e32 v6, v10, v6
-; GFX8-NEXT:    flat_store_short v[4:5], v7
-; GFX8-NEXT:    flat_store_short v[0:1], v8
-; GFX8-NEXT:    flat_store_short v[2:3], v6
+; GFX8-NEXT:    v_add_u16_e32 v7, v10, v7
+; GFX8-NEXT:    flat_store_short v[4:5], v9
+; GFX8-NEXT:    flat_store_short v[0:1], v6
+; GFX8-NEXT:    flat_store_short v[2:3], v7
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -153,28 +153,28 @@ define void @add_v5i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addrs
 ; GFX8-NEXT:    v_addc_u32_e32 v7, vcc, 0, v1, vcc
 ; GFX8-NEXT:    v_add_u32_e32 v8, vcc, 4, v0
 ; GFX8-NEXT:    v_addc_u32_e32 v9, vcc, 0, v1, vcc
-; GFX8-NEXT:    v_add_u32_e32 v10, vcc, 6, v0
-; GFX8-NEXT:    v_addc_u32_e32 v11, vcc, 0, v1, vcc
-; GFX8-NEXT:    flat_load_ushort v12, v[0:1]
-; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 8, v0
-; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; GFX8-NEXT:    flat_load_ushort v13, v[6:7]
-; GFX8-NEXT:    flat_load_ushort v14, v[8:9]
-; GFX8-NEXT:    flat_load_ushort v15, v[10:11]
-; GFX8-NEXT:    flat_load_ushort v16, v[0:1]
-; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 2, v2
-; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v3, vcc
-; GFX8-NEXT:    v_add_u32_e32 v6, vcc, 4, v2
+; GFX8-NEXT:    flat_load_ushort v12, v[6:7]
+; GFX8-NEXT:    flat_load_ushort v13, v[8:9]
+; GFX8-NEXT:    v_add_u32_e32 v6, vcc, 6, v0
+; GFX8-NEXT:    v_addc_u32_e32 v7, vcc, 0, v1, vcc
+; GFX8-NEXT:    v_add_u32_e32 v8, vcc, 8, v0
+; GFX8-NEXT:    v_addc_u32_e32 v9, vcc, 0, v1, vcc
+; GFX8-NEXT:    flat_load_ushort v14, v[6:7]
+; GFX8-NEXT:    flat_load_ushort v15, v[8:9]
+; GFX8-NEXT:    v_add_u32_e32 v6, vcc, 2, v2
 ; GFX8-NEXT:    v_addc_u32_e32 v7, vcc, 0, v3, vcc
-; GFX8-NEXT:    v_add_u32_e32 v8, vcc, 6, v2
+; GFX8-NEXT:    v_add_u32_e32 v8, vcc, 4, v2
 ; GFX8-NEXT:    v_addc_u32_e32 v9, vcc, 0, v3, vcc
-; GFX8-NEXT:    v_add_u32_e32 v10, vcc, 8, v2
+; GFX8-NEXT:    v_add_u32_e32 v10, vcc, 6, v2
 ; GFX8-NEXT:    v_addc_u32_e32 v11, vcc, 0, v3, vcc
+; GFX8-NEXT:    flat_load_ushort v16, v[0:1]
 ; GFX8-NEXT:    flat_load_ushort v17, v[2:3]
-; GFX8-NEXT:    flat_load_ushort v18, v[0:1]
-; GFX8-NEXT:    flat_load_ushort v19, v[6:7]
-; GFX8-NEXT:    flat_load_ushort v20, v[8:9]
+; GFX8-NEXT:    flat_load_ushort v18, v[6:7]
+; GFX8-NEXT:    flat_load_ushort v19, v[8:9]
 ; GFX8-NEXT:    flat_load_ushort v10, v[10:11]
+; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 8, v2
+; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v3, vcc
+; GFX8-NEXT:    flat_load_ushort v11, v[0:1]
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 2, v4
 ; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v5, vcc
 ; GFX8-NEXT:    v_add_u32_e32 v2, vcc, 4, v4
@@ -184,20 +184,20 @@ define void @add_v5i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addrs
 ; GFX8-NEXT:    v_add_u32_e32 v8, vcc, 8, v4
 ; GFX8-NEXT:    v_addc_u32_e32 v9, vcc, 0, v5, vcc
 ; GFX8-NEXT:    s_waitcnt vmcnt(4)
-; GFX8-NEXT:    v_add_u16_e32 v11, v12, v17
+; GFX8-NEXT:    v_add_u16_e32 v16, v16, v17
 ; GFX8-NEXT:    s_waitcnt vmcnt(3)
-; GFX8-NEXT:    v_add_u16_e32 v12, v13, v18
+; GFX8-NEXT:    v_add_u16_e32 v12, v12, v18
 ; GFX8-NEXT:    s_waitcnt vmcnt(2)
-; GFX8-NEXT:    v_add_u16_e32 v13, v14, v19
+; GFX8-NEXT:    v_add_u16_e32 v13, v13, v19
 ; GFX8-NEXT:    s_waitcnt vmcnt(1)
-; GFX8-NEXT:    v_add_u16_e32 v14, v15, v20
+; GFX8-NEXT:    v_add_u16_e32 v10, v14, v10
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
-; GFX8-NEXT:    v_add_u16_e32 v10, v16, v10
-; GFX8-NEXT:    flat_store_short v[4:5], v11
+; GFX8-NEXT:    v_add_u16_e32 v11, v15, v11
+; GFX8-NEXT:    flat_store_short v[4:5], v16
 ; GFX8-NEXT:    flat_store_short v[0:1], v12
 ; GFX8-NEXT:    flat_store_short v[2:3], v13
-; GFX8-NEXT:    flat_store_short v[6:7], v14
-; GFX8-NEXT:    flat_store_short v[8:9], v10
+; GFX8-NEXT:    flat_store_short v[6:7], v10
+; GFX8-NEXT:    flat_store_short v[8:9], v11
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -513,25 +513,25 @@ define void @add_v9i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addrs
 ; GFX8-NEXT:    flat_load_dwordx4 v[10:13], v[2:3]
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 16, v0
 ; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; GFX8-NEXT:    flat_load_ushort v14, v[0:1]
-; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 16, v2
-; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v3, vcc
+; GFX8-NEXT:    v_add_u32_e32 v2, vcc, 16, v2
+; GFX8-NEXT:    v_addc_u32_e32 v3, vcc, 0, v3, vcc
 ; GFX8-NEXT:    flat_load_ushort v0, v[0:1]
+; GFX8-NEXT:    flat_load_ushort v1, v[2:3]
 ; GFX8-NEXT:    s_waitcnt vmcnt(2)
-; GFX8-NEXT:    v_add_u16_e32 v1, v6, v10
-; GFX8-NEXT:    v_add_u16_sdwa v2, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT:    v_add_u16_e32 v3, v7, v11
-; GFX8-NEXT:    v_add_u16_sdwa v10, v7, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT:    v_add_u16_e32 v11, v8, v12
+; GFX8-NEXT:    v_add_u16_e32 v2, v6, v10
+; GFX8-NEXT:    v_add_u16_sdwa v3, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    v_add_u16_e32 v10, v7, v11
+; GFX8-NEXT:    v_add_u16_sdwa v11, v7, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    v_add_u16_e32 v14, v8, v12
 ; GFX8-NEXT:    v_add_u16_sdwa v8, v8, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX8-NEXT:    v_add_u16_e32 v12, v9, v13
 ; GFX8-NEXT:    v_add_u16_sdwa v9, v9, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX8-NEXT:    v_add_u32_e32 v6, vcc, 16, v4
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
-; GFX8-NEXT:    v_add_u16_e32 v13, v14, v0
-; GFX8-NEXT:    v_or_b32_e32 v0, v1, v2
-; GFX8-NEXT:    v_or_b32_e32 v1, v3, v10
-; GFX8-NEXT:    v_or_b32_e32 v2, v11, v8
+; GFX8-NEXT:    v_add_u16_e32 v13, v0, v1
+; GFX8-NEXT:    v_or_b32_e32 v0, v2, v3
+; GFX8-NEXT:    v_or_b32_e32 v1, v10, v11
+; GFX8-NEXT:    v_or_b32_e32 v2, v14, v8
 ; GFX8-NEXT:    v_or_b32_e32 v3, v12, v9
 ; GFX8-NEXT:    v_addc_u32_e32 v7, vcc, 0, v5, vcc
 ; GFX8-NEXT:    flat_store_dwordx4 v[4:5], v[0:3]
@@ -604,10 +604,10 @@ define void @add_v10i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addr
 ; GFX8-NEXT:    flat_load_dwordx4 v[10:13], v[2:3]
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 16, v0
 ; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v1, vcc
+; GFX8-NEXT:    v_add_u32_e32 v2, vcc, 16, v2
+; GFX8-NEXT:    v_addc_u32_e32 v3, vcc, 0, v3, vcc
 ; GFX8-NEXT:    flat_load_dword v14, v[0:1]
-; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 16, v2
-; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v3, vcc
-; GFX8-NEXT:    flat_load_dword v15, v[0:1]
+; GFX8-NEXT:    flat_load_dword v15, v[2:3]
 ; GFX8-NEXT:    s_waitcnt vmcnt(2)
 ; GFX8-NEXT:    v_add_u16_e32 v0, v6, v10
 ; GFX8-NEXT:    v_add_u16_sdwa v1, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
@@ -663,53 +663,53 @@ define void @add_v11i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addr
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX8-NEXT:    flat_load_dwordx4 v[6:9], v[0:1]
 ; GFX8-NEXT:    flat_load_dwordx4 v[10:13], v[2:3]
-; GFX8-NEXT:    v_add_u32_e32 v14, vcc, 16, v2
-; GFX8-NEXT:    v_addc_u32_e32 v15, vcc, 0, v3, vcc
-; GFX8-NEXT:    v_add_u32_e32 v16, vcc, 18, v2
-; GFX8-NEXT:    v_addc_u32_e32 v17, vcc, 0, v3, vcc
-; GFX8-NEXT:    v_add_u32_e32 v2, vcc, 20, v2
-; GFX8-NEXT:    v_addc_u32_e32 v3, vcc, 0, v3, vcc
-; GFX8-NEXT:    flat_load_ushort v14, v[14:15]
-; GFX8-NEXT:    flat_load_ushort v15, v[16:17]
-; GFX8-NEXT:    flat_load_ushort v16, v[2:3]
-; GFX8-NEXT:    v_add_u32_e32 v2, vcc, 16, v0
-; GFX8-NEXT:    v_addc_u32_e32 v3, vcc, 0, v1, vcc
-; GFX8-NEXT:    s_waitcnt vmcnt(3)
-; GFX8-NEXT:    v_add_u16_e32 v17, v6, v10
-; GFX8-NEXT:    v_add_u16_sdwa v10, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT:    v_add_u32_e32 v6, vcc, 18, v0
-; GFX8-NEXT:    v_add_u16_e32 v18, v7, v11
-; GFX8-NEXT:    v_add_u16_sdwa v11, v7, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    s_waitcnt vmcnt(0)
+; GFX8-NEXT:    v_add_u16_e32 v14, v6, v10
+; GFX8-NEXT:    v_add_u16_sdwa v15, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    v_add_u32_e32 v6, vcc, 16, v0
+; GFX8-NEXT:    v_add_u16_e32 v16, v7, v11
+; GFX8-NEXT:    v_add_u16_sdwa v17, v7, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX8-NEXT:    v_addc_u32_e32 v7, vcc, 0, v1, vcc
+; GFX8-NEXT:    v_add_u16_e32 v18, v8, v12
+; GFX8-NEXT:    v_add_u16_sdwa v12, v8, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    v_add_u32_e32 v8, vcc, 18, v0
+; GFX8-NEXT:    v_add_u16_e32 v19, v9, v13
+; GFX8-NEXT:    v_add_u16_sdwa v13, v9, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    v_addc_u32_e32 v9, vcc, 0, v1, vcc
+; GFX8-NEXT:    v_add_u32_e32 v10, vcc, 16, v2
+; GFX8-NEXT:    v_addc_u32_e32 v11, vcc, 0, v3, vcc
+; GFX8-NEXT:    flat_load_ushort v20, v[6:7]
+; GFX8-NEXT:    flat_load_ushort v21, v[8:9]
+; GFX8-NEXT:    v_add_u32_e32 v6, vcc, 18, v2
+; GFX8-NEXT:    v_addc_u32_e32 v7, vcc, 0, v3, vcc
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 20, v0
-; GFX8-NEXT:    flat_load_ushort v2, v[2:3]
-; GFX8-NEXT:    flat_load_ushort v3, v[6:7]
 ; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; GFX8-NEXT:    flat_load_ushort v21, v[0:1]
+; GFX8-NEXT:    v_add_u32_e32 v2, vcc, 20, v2
+; GFX8-NEXT:    flat_load_ushort v10, v[10:11]
+; GFX8-NEXT:    flat_load_ushort v11, v[6:7]
+; GFX8-NEXT:    v_addc_u32_e32 v3, vcc, 0, v3, vcc
+; GFX8-NEXT:    flat_load_ushort v22, v[0:1]
+; GFX8-NEXT:    flat_load_ushort v2, v[2:3]
 ; GFX8-NEXT:    v_add_u32_e32 v6, vcc, 16, v4
 ; GFX8-NEXT:    v_addc_u32_e32 v7, vcc, 0, v5, vcc
-; GFX8-NEXT:    v_add_u16_e32 v19, v8, v12
-; GFX8-NEXT:    v_add_u16_sdwa v12, v8, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX8-NEXT:    v_add_u32_e32 v8, vcc, 18, v4
-; GFX8-NEXT:    v_add_u16_e32 v20, v9, v13
-; GFX8-NEXT:    v_add_u16_sdwa v13, v9, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX8-NEXT:    v_addc_u32_e32 v9, vcc, 0, v5, vcc
-; GFX8-NEXT:    v_or_b32_e32 v0, v17, v10
-; GFX8-NEXT:    v_or_b32_e32 v1, v18, v11
+; GFX8-NEXT:    v_or_b32_e32 v0, v14, v15
+; GFX8-NEXT:    v_or_b32_e32 v1, v16, v17
+; GFX8-NEXT:    v_or_b32_e32 v3, v19, v13
+; GFX8-NEXT:    s_waitcnt vmcnt(3)
+; GFX8-NEXT:    v_add_u16_e32 v20, v20, v10
 ; GFX8-NEXT:    v_add_u32_e32 v10, vcc, 20, v4
-; GFX8-NEXT:    v_addc_u32_e32 v11, vcc, 0, v5, vcc
 ; GFX8-NEXT:    s_waitcnt vmcnt(2)
-; GFX8-NEXT:    v_add_u16_e32 v14, v2, v14
-; GFX8-NEXT:    s_waitcnt vmcnt(1)
-; GFX8-NEXT:    v_add_u16_e32 v15, v3, v15
-; GFX8-NEXT:    v_or_b32_e32 v2, v19, v12
-; GFX8-NEXT:    v_or_b32_e32 v3, v20, v13
+; GFX8-NEXT:    v_add_u16_e32 v21, v21, v11
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
-; GFX8-NEXT:    v_add_u16_e32 v16, v21, v16
+; GFX8-NEXT:    v_add_u16_e32 v14, v22, v2
+; GFX8-NEXT:    v_or_b32_e32 v2, v18, v12
+; GFX8-NEXT:    v_addc_u32_e32 v11, vcc, 0, v5, vcc
 ; GFX8-NEXT:    flat_store_dwordx4 v[4:5], v[0:3]
-; GFX8-NEXT:    flat_store_short v[6:7], v14
-; GFX8-NEXT:    flat_store_short v[8:9], v15
-; GFX8-NEXT:    flat_store_short v[10:11], v16
+; GFX8-NEXT:    flat_store_short v[6:7], v20
+; GFX8-NEXT:    flat_store_short v[8:9], v21
+; GFX8-NEXT:    flat_store_short v[10:11], v14
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -794,34 +794,34 @@ define void @add_v12i16(ptr addrspace(1) %ptra, ptr addrspace(1) %ptrb, ptr addr
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX8-NEXT:    flat_load_dwordx4 v[6:9], v[0:1]
 ; GFX8-NEXT:    flat_load_dwordx4 v[10:13], v[2:3]
-; GFX8-NEXT:    v_add_u32_e32 v2, vcc, 16, v2
-; GFX8-NEXT:    v_addc_u32_e32 v3, vcc, 0, v3, vcc
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 16, v0
 ; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; GFX8-NEXT:    flat_load_dwordx2 v[14:15], v[2:3]
-; GFX8-NEXT:    s_waitcnt vmcnt(1)
-; GFX8-NEXT:    v_add_u16_e32 v2, v6, v10
-; GFX8-NEXT:    v_add_u16_sdwa v3, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT:    v_add_u16_e32 v10, v7, v11
+; GFX8-NEXT:    v_add_u32_e32 v2, vcc, 16, v2
+; GFX8-NEXT:    v_addc_u32_e32 v3, vcc, 0, v3, vcc
+; GFX8-NEXT:    s_waitcnt vmcnt(0)
+; GFX8-NEXT:    v_add_u16_e32 v14, v6, v10
+; GFX8-NEXT:    v_add_u16_sdwa v10, v6, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    v_add_u16_e32 v15, v7, v11
 ; GFX8-NEXT:    v_add_u16_sdwa v11, v7, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT:    flat_load_dwordx2 v[6:7], v[0:1]
 ; GFX8-NEXT:    v_add_u16_e32 v16, v8, v12
-; GFX8-NEXT:    v_add_u16_sdwa v8, v8, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT:    v_add_u16_e32 v12, v9, v13
-; GFX8-NEXT:    v_add_u16_sdwa v9, v9, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT:    v_or_b32_e32 v0, v2, v3
-; GFX8-NEXT:    v_or_b32_e32 v1, v10, v11
-; GFX8-NEXT:    v_or_b32_e32 v2, v16, v8
-; GFX8-NEXT:    v_or_b32_e32 v3, v12, v9
+; GFX8-NEXT:    v_add_u16_sdwa v12, v8, v12 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    v_add_u16_e32 v17, v9, v13
+; GFX8-NEXT:    v_add_u16_sdwa v13, v9, v13 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    flat_load_dwordx2 v[6:7], v[0:1]
+; GFX8-NEXT:    flat_load_dwordx2 v[8:9], v[2:3]
+; GFX8-NEXT:    v_or_b32_e32 v0, v14, v10
+; GFX8-NEXT:    v_or_b32_e32 v1, v15, v11
+; GFX8-NEXT:    v_or_b32_e32 v2, v16, v12
+; GFX8-NEXT:    v_or_b32_e32 v3, v17, v13
 ; GFX8-NEXT:    flat_store_dwordx4 v[4:5], v[0:3]
 ; GFX8-NEXT:    s_waitcnt vmcnt(1)
-; GFX8-NEXT:    v_add_u16_e32 v8, v6, v14
-; GFX8-NEXT:    v_add_u16_sdwa v6, v6, v14 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
-; GFX8-NEXT:    v_add_u16_e32 v9, v7, v15
-; GFX8-NEXT:    v_add_u16_sdwa v7, v7, v15 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    v_add_u16_e32 v10, v6, v8
+; GFX8-NEXT:    v_add_u16_sdwa v6, v6, v8 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
+; GFX8-NEXT:    v_add_u16_e32 v8, v7, v9
+; GFX8-NEXT:    v_add_u16_sdwa v7, v7, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 16, v4
-; GFX8-NEXT:    v_or_b32_e32 v6, v8, v6
-; GFX8-NEXT:    v_or_b32_e32 v7, v9, v7
+; GFX8-NEXT:    v_or_b32_e32 v6, v10, v6
+; GFX8-NEXT:    v_or_b32_e32 v7, v8, v7
 ; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, 0, v5, vcc
 ; GFX8-NEXT:    flat_store_dwordx2 v[0:1], v[6:7]
 ; GFX8-NEXT:    s_waitcnt vmcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/implicit-kernarg-backend-usage-global-isel.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/implicit-kernarg-backend-usage-global-isel.ll
index 86766e2904619..89f896a2b1656 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/implicit-kernarg-backend-usage-global-isel.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/implicit-kernarg-backend-usage-global-isel.ll
@@ -288,16 +288,16 @@ define amdgpu_kernel void...
[truncated]

@github-actions

github-actions bot commented May 20, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@choikwa
Contributor Author

choikwa commented May 20, 2025

The main motivation for this change came from inspecting MISched logs and observing a degradation when two loads from different arrays in dot_product were not placed adjacently. shouldClusterMemOps was the main determinant rejecting the clustering of two loads whose base pointers differed; otherwise the scheduler relied only on tie-breaking heuristics to decide whether loads should be placed together, which is not deterministic.

Section 3.1.8 of the Shader Programming Guide, on "soft" memory clauses, also notes that back-to-back requests are much more efficient for the cache.

@shiltian shiltian changed the title [AMDGPU][MISched] Allow memory ops of different base pointers to be c… [AMDGPU][MISched] Allow memory ops of different base pointers to be clustered May 20, 2025
@github-actions

github-actions bot commented May 20, 2025

⚠️ undef deprecator found issues in your code. ⚠️

You can test this locally with the following command:
git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' 'HEAD~1' HEAD llvm/test/CodeGen/AMDGPU/test-enable-diffbase-clustering-flag.ll llvm/lib/Target/AMDGPU/SIInstrInfo.cpp llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wmma_32.ll llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wmma_64.ll llvm/test/CodeGen/AMDGPU/GlobalISel/localizer.ll llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll llvm/test/CodeGen/AMDGPU/GlobalISel/wmma-gfx12-w32-swmmac-index_key.ll llvm/test/CodeGen/AMDGPU/GlobalISel/wmma-gfx12-w64-swmmac-index_key.ll llvm/test/CodeGen/AMDGPU/add.v2i16.ll llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-cc.ll llvm/test/CodeGen/AMDGPU/array-ptr-calc-i32.ll llvm/test/CodeGen/AMDGPU/attributor-flatscratchinit-undefined-behavior2.ll llvm/test/CodeGen/AMDGPU/bf16.ll llvm/test/CodeGen/AMDGPU/branch-folding-implicit-def-subreg.ll llvm/test/CodeGen/AMDGPU/call-argument-types.ll llvm/test/CodeGen/AMDGPU/chain-hi-to-lo.ll llvm/test/CodeGen/AMDGPU/clamp-modifier.ll llvm/test/CodeGen/AMDGPU/clamp.ll llvm/test/CodeGen/AMDGPU/cluster_stores.ll llvm/test/CodeGen/AMDGPU/constant-address-space-32bit.ll llvm/test/CodeGen/AMDGPU/copy-to-reg-scc-clobber.ll llvm/test/CodeGen/AMDGPU/ctpop16.ll llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll llvm/test/CodeGen/AMDGPU/divergence-driven-buildvector.ll llvm/test/CodeGen/AMDGPU/ds_read2.ll llvm/test/CodeGen/AMDGPU/fcmp.f16.ll llvm/test/CodeGen/AMDGPU/fcopysign.f16.ll llvm/test/CodeGen/AMDGPU/fma-combine.ll llvm/test/CodeGen/AMDGPU/fmed3.ll llvm/test/CodeGen/AMDGPU/fmul.f16.ll llvm/test/CodeGen/AMDGPU/frem.ll llvm/test/CodeGen/AMDGPU/fsub.f16.ll llvm/test/CodeGen/AMDGPU/function-args-inreg.ll llvm/test/CodeGen/AMDGPU/function-args.ll 
llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll llvm/test/CodeGen/AMDGPU/group-image-instructions.ll llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll llvm/test/CodeGen/AMDGPU/idot2.ll llvm/test/CodeGen/AMDGPU/idot4s.ll llvm/test/CodeGen/AMDGPU/idot4u.ll llvm/test/CodeGen/AMDGPU/idot8s.ll llvm/test/CodeGen/AMDGPU/idot8u.ll llvm/test/CodeGen/AMDGPU/indirect-call-known-callees.ll llvm/test/CodeGen/AMDGPU/insert_vector_elt.v2i16.ll llvm/test/CodeGen/AMDGPU/issue130120-eliminate-frame-index.ll llvm/test/CodeGen/AMDGPU/kernel-args.ll llvm/test/CodeGen/AMDGPU/lds-frame-extern.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bvh8_intersect_ray.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dead.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dual_intersect_ray.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.bf16.bf16.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.f16.f16.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.f32.bf16.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fmad.ftz.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.intersect_ray.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.lds.kernel.id.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.scale.f32.32x32x64.f8f6f4.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.tfe.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.softwqm.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.load.tfe.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.waitcnt.out.order.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wmma_32.ll llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wmma_64.ll llvm/test/CodeGen/AMDGPU/llvm.fma.f16.ll llvm/test/CodeGen/AMDGPU/llvm.fmuladd.f16.ll llvm/test/CodeGen/AMDGPU/llvm.maximum.f64.ll llvm/test/CodeGen/AMDGPU/llvm.maxnum.f16.ll llvm/test/CodeGen/AMDGPU/llvm.minimum.f64.ll 
llvm/test/CodeGen/AMDGPU/llvm.minnum.f16.ll llvm/test/CodeGen/AMDGPU/load-select-ptr.ll llvm/test/CodeGen/AMDGPU/max.i16.ll llvm/test/CodeGen/AMDGPU/min.ll llvm/test/CodeGen/AMDGPU/mixed-vmem-types.ll llvm/test/CodeGen/AMDGPU/mul.ll llvm/test/CodeGen/AMDGPU/or.ll llvm/test/CodeGen/AMDGPU/permute_i8.ll llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll llvm/test/CodeGen/AMDGPU/reassoc-mul-add-1-to-mad.ll llvm/test/CodeGen/AMDGPU/rotl.ll llvm/test/CodeGen/AMDGPU/rotr.ll llvm/test/CodeGen/AMDGPU/sdwa-commute.ll llvm/test/CodeGen/AMDGPU/select.f16.ll llvm/test/CodeGen/AMDGPU/sitofp.f16.ll llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll llvm/test/CodeGen/AMDGPU/sub.ll llvm/test/CodeGen/AMDGPU/sub.v2i16.ll llvm/test/CodeGen/AMDGPU/uitofp.f16.ll llvm/test/CodeGen/AMDGPU/v_madak_f16.ll llvm/test/CodeGen/AMDGPU/vector-reduce-fadd.ll llvm/test/CodeGen/AMDGPU/vector-reduce-fmul.ll llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll llvm/test/CodeGen/AMDGPU/vselect.ll llvm/test/CodeGen/AMDGPU/wmma-gfx12-w32-swmmac-index_key.ll llvm/test/CodeGen/AMDGPU/wmma-gfx12-w64-swmmac-index_key.ll llvm/test/CodeGen/AMDGPU/wmma_multiple_32.ll llvm/test/CodeGen/AMDGPU/wmma_multiple_64.ll llvm/test/CodeGen/AMDGPU/wqm.ll llvm/test/CodeGen/AMDGPU/xor.ll

The following files introduce new uses of undef:

  • llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. You should use poison values for placeholders instead.
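For instance, a placeholder vector base for building up a value lane by lane should be written with poison rather than undef (hypothetical snippet, not taken from this patch's tests):

```llvm
; Building a <4 x i32> where unwritten lanes are placeholders.
; poison is the preferred placeholder; undef is deprecated here.
%v0 = insertelement <4 x i32> poison, i32 %x, i32 0
```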

In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

```llvm
define void @fn() {
  ...
  br i1 undef, ...
}
```

Please use the following instead:

```llvm
define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}
```

Please refer to the Undefined Behavior Manual for more information.

@choikwa choikwa requested a review from kerbowa May 20, 2025 20:02
jayfoad (Contributor) commented May 21, 2025

Have you done any other benchmarking on this patch? It seems like it could have a big effect on performance, both good and bad.

arsenm (Contributor) left a comment:

A MIR test exercising the flag would be good.

choikwa (Contributor, Author) commented May 21, 2025

> Have you done any other benchmarking on this patch? It seems like it could have a big effect on performance, both good and bad.

I ran the ROCmValidation suite but didn't observe a significant perf delta.

choikwa (Contributor, Author) commented Jun 2, 2025

Ping

kerbowa (Member) commented Jun 24, 2025

Unless someone objects or more testing is needed on the graphics side, I think we should go ahead with this change, since you haven't found any major perf regression in initial testing. Cluster edges being "weak" means that the scheduler always has the option to ignore them. In practice it will result in performance changes, but since we have both an experimental and a theoretical basis for the change, I think we should move forward with the patch.
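To make the before/after behavior concrete, here is a toy model of the relaxed check (hypothetical code, not LLVM's actual `shouldClusterMemOps` signature or the patch itself): the old predicate required identical base operands, while the relaxed one only requires a matching address space.

```cpp
#include <cassert>

// Toy model of a memory operation: a base register plus the address
// space its access targets.
struct MemOp {
  int BaseReg;
  unsigned AddrSpace;
};

// Old behavior: two ops may only be clustered when they share the
// same base pointer.
bool shouldClusterBefore(const MemOp &A, const MemOp &B) {
  return A.BaseReg == B.BaseReg;
}

// Relaxed behavior: clustering is allowed whenever both accesses
// target the same address space, even with different base pointers.
bool shouldClusterAfter(const MemOp &A, const MemOp &B) {
  return A.AddrSpace == B.AddrSpace;
}
```

Under this model, two global loads with different base pointers become cluster candidates after the patch; the scheduler can still drop the weak edge when it hurts.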

jayfoad (Contributor) commented Jun 24, 2025

> In practice it will result in performance changes but since we have both an experimental and theoretical basis for the change I think that we should move forward with the patch.

What's the theoretical basis? The RFC thread ends with a suggestion "to understand why and where that 10% ~ 15% came from first".

choikwa (Contributor, Author) commented Jun 24, 2025

> > In practice it will result in performance changes but since we have both an experimental and theoretical basis for the change I think that we should move forward with the patch.
>
> What's the theoretical basis? The RFC thread ends with a suggestion "to understand why and where that 10% ~ 15% came from first".

As suggested in the RFC, I can provide profiling results for more empirical data. I'm currently working through some problems with rocprofv3 + ATT in another issue, but once that is resolved, I think we'll be able to get a better look at the HW level.

choikwa (Contributor, Author) commented Jul 21, 2025

I looked into obtaining profiling data for this issue, but due to constraints (the issue is only observed on MI200, plus rocprofv3 limitations), I am unable to provide per-instruction stall reasons. I could only extract aggregate HW counter stats, which I've attached. It would be nice to have profiling on MI300, but I don't have a testcase that exhibits this improvement.

| HW Counter | Adjacent MemoryInsts | Interleaved MemoryInsts | Diff% |
|---|---|---|---|
| arch_vgpr | 16 | 12 | -33% |
| accum_vgpr | 0 | 4 | 100% |
| VALUInsts | 1589.688 | 1427.688 | -11% |
| VALUBusy | 7.33375 | 5.791669 | -27% |
| WriteSize | 4.925987 | 6.588816 | 25% |
| SALUInsts | 287.1875 | 318.1875 | 10% |
| SALUBusy | 1.307518 | 1.28516 | -2% |
| L2CacheHit | 9.222125 | 7.20795 | -28% |
| MemUnitBusy | 79.31983 | 73.18036 | -8% |
| MemUnitStalled | 0.059394 | 0.046405 | -28% |
| AverageNs | 518102 | 587858 | 12% |

