Conversation

@PrasoonMishra
Contributor

This patch implements an optimization to partition MUBUF load/store offsets into vector and scalar components for better scheduling and reduced VGPR pressure.

Transform buffer operations where voffset = add(uniform, divergent) by moving the uniform part to soffset and keeping the divergent part in voffset.

Before:
  v_add_u32 v1, v0, sN
  buffer_{load,store}_T v*, v1, s[bufDesc:bufDesc+3] offen

After:
  buffer_{load,store}_T v*, v0, s[bufDesc:bufDesc+3], sN offen

The optimization currently applies to raw buffer loads/stores when soffset is initially zero.
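At the IR level, where AMDGPUCodeGenPrepare runs, the rewrite is roughly the following (a sketch with hypothetical value names, using the same intrinsic signature as the tests):

Before:
  %sum = add i32 %divergent, %uniform
  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)

After:
  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %divergent, i32 %uniform, i32 0)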

Test coverage is provided by buffer-offset-to-soffset-loads and buffer-offset-to-soffset-stores, which include comprehensive validation across i8, i16, i32, vector (v2/v4), and float variants, including positive and negative cases.

@llvmbot
Member

llvmbot commented Nov 23, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Prasoon Mishra (PrasoonMishra)

Changes

This patch implements an optimization to partition MUBUF load/store offsets into vector and scalar components for better scheduling and reduced VGPR pressure.

Transform buffer operations where voffset = add(uniform, divergent) by moving the uniform part to soffset and keeping the divergent part in voffset.

Before:
  v_add_u32 v1, v0, sN
  buffer_{load,store}_T v*, v1, s[bufDesc:bufDesc+3] offen

After:
  buffer_{load,store}_T v*, v0, s[bufDesc:bufDesc+3], sN offen

The optimization currently applies to raw buffer loads/stores when soffset is initially zero.

Test coverage is provided by buffer-offset-to-soffset-loads and buffer-offset-to-soffset-stores, which include comprehensive validation across i8, i16, i32, vector (v2/v4), and float variants, including positive and negative cases.


Patch is 45.24 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/169230.diff

5 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp (+90)
  • (added) llvm/test/CodeGen/AMDGPU/buffer-offset-to-soffset-loads.ll (+457)
  • (added) llvm/test/CodeGen/AMDGPU/buffer-offset-to-soffset-stores.ll (+461)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.ll (+10-6)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.buffer.load.ll (+8-4)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
index 8e35ba77d69aa..9ff2e2a4bdf82 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
@@ -260,6 +260,7 @@ class AMDGPUCodeGenPrepareImpl
   bool visitIntrinsicInst(IntrinsicInst &I);
   bool visitFMinLike(IntrinsicInst &I);
   bool visitSqrt(IntrinsicInst &I);
+  bool visitBufferIntrinsic(IntrinsicInst &I);
   bool run();
 };
 
@@ -1910,6 +1911,15 @@ bool AMDGPUCodeGenPrepareImpl::visitIntrinsicInst(IntrinsicInst &I) {
     return visitFMinLike(I);
   case Intrinsic::sqrt:
     return visitSqrt(I);
+  case Intrinsic::amdgcn_raw_buffer_load:
+  case Intrinsic::amdgcn_raw_buffer_load_format:
+  case Intrinsic::amdgcn_raw_buffer_store:
+  case Intrinsic::amdgcn_raw_buffer_store_format:
+  case Intrinsic::amdgcn_raw_ptr_buffer_load:
+  case Intrinsic::amdgcn_raw_ptr_buffer_load_format:
+  case Intrinsic::amdgcn_raw_ptr_buffer_store:
+  case Intrinsic::amdgcn_raw_ptr_buffer_store_format:
+    return visitBufferIntrinsic(I);
   default:
     return false;
   }
@@ -2046,6 +2056,86 @@ bool AMDGPUCodeGenPrepareImpl::visitSqrt(IntrinsicInst &Sqrt) {
   return true;
 }
 
+/// Sink uniform addends in buffer address calculations into soffset.
+///
+/// Transforms buffer loads/stores with voffset = add(uniform, divergent)
+/// into voffset = divergent, soffset = uniform for better address coalescing
+/// Only applies to raw buffer operations with soffset initially zero.
+bool AMDGPUCodeGenPrepareImpl::visitBufferIntrinsic(IntrinsicInst &I) {
+  Intrinsic::ID IID = I.getIntrinsicID();
+  bool IsLoad = (IID == Intrinsic::amdgcn_raw_buffer_load ||
+                 IID == Intrinsic::amdgcn_raw_buffer_load_format ||
+                 IID == Intrinsic::amdgcn_raw_ptr_buffer_load ||
+                 IID == Intrinsic::amdgcn_raw_ptr_buffer_load_format);
+  bool IsStore = (IID == Intrinsic::amdgcn_raw_buffer_store ||
+                  IID == Intrinsic::amdgcn_raw_buffer_store_format ||
+                  IID == Intrinsic::amdgcn_raw_ptr_buffer_store ||
+                  IID == Intrinsic::amdgcn_raw_ptr_buffer_store_format);
+
+  if (!IsLoad && !IsStore)
+    return false;
+
+  // Buffer intrinsic operand layout (same for vector and pointer descriptor):
+  // Load:  (rsrc, voffset, soffset, cachepolicy)
+  // Store: (vdata, rsrc, voffset, soffset, cachepolicy)
+  const unsigned VOffsetIdx = IsStore ? 2 : 1;
+  const unsigned SOffsetIdx = IsStore ? 3 : 2;
+
+  Value *VOffset = I.getArgOperand(VOffsetIdx);
+  Value *SOffset = I.getArgOperand(SOffsetIdx);
+
+  // Only optimize when soffset is currently zero
+  if (!match(SOffset, m_Zero()))
+    return false;
+
+  // Pattern match: voffset = add(uniform, divergent)
+  Value *LHS, *RHS;
+  if (!match(VOffset, m_Add(m_Value(LHS), m_Value(RHS))))
+    return false;
+
+  bool LHSUniform = UA.isUniform(LHS);
+  bool RHSUniform = UA.isUniform(RHS);
+
+  // Need exactly one uniform and one divergent operand.
+  // TODO: Handle the case where both are uniform.
+  if (LHSUniform == RHSUniform)
+    return false;
+
+  Value *UniformAddend = LHSUniform ? LHS : RHS;
+  Value *DivergentAddend = LHSUniform ? RHS : LHS;
+
+  // Skip if the uniform addend is a non-negative constant that fits in the
+  // 12-bit immediate offset field. The backend will fold it into the immediate
+  // field, which avoids consuming an soffset operand.
+  // Negative or large constants must use soffset.
+  if (auto *CI = dyn_cast<ConstantInt>(UniformAddend)) {
+    int64_t Offset = CI->getSExtValue();
+    if (Offset >= 0 && Offset <= 4095)
+      return false;
+  }
+
+  LLVM_DEBUG(dbgs() << "AMDGPUCodeGenPrepare: Sinking uniform addend into "
+                       "soffset for buffer "
+                    << (IsStore ? "store" : "load") << ": " << I << '\n');
+
+  // Clone the instruction and insert it before the old instruction
+  CallInst *NewCall = cast<CallInst>(I.clone());
+  NewCall->insertBefore(I.getIterator());
+
+  // Update voffset and soffset operands
+  NewCall->setArgOperand(VOffsetIdx, DivergentAddend);
+  NewCall->setArgOperand(SOffsetIdx, UniformAddend);
+
+  // Replace and erase the old instruction
+  if (IsLoad) {
+    NewCall->takeName(&I);
+    I.replaceAllUsesWith(NewCall);
+  }
+  I.eraseFromParent();
+
+  return true;
+}
+
 bool AMDGPUCodeGenPrepare::runOnFunction(Function &F) {
   if (skipFunction(F))
     return false;
diff --git a/llvm/test/CodeGen/AMDGPU/buffer-offset-to-soffset-loads.ll b/llvm/test/CodeGen/AMDGPU/buffer-offset-to-soffset-loads.ll
new file mode 100644
index 0000000000000..c520062c5a01e
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/buffer-offset-to-soffset-loads.ll
@@ -0,0 +1,457 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -global-isel=0 < %s | FileCheck -check-prefixes=CHECK %s
+
+; Test comprehensive patterns for ADD(divergent, uniform) optimization in buffer loads
+
+; Basic workitem.id.x + uniform
+define amdgpu_kernel void @test_basic_workitem_uniform(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_basic_workitem_uniform:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Reversed operands (uniform + divergent)
+define amdgpu_kernel void @test_reversed_operands(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_reversed_operands:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %soffset, %voffset  ; Reversed: uniform + divergent
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Multiple buffer loads with same pattern
+define amdgpu_kernel void @test_multiple_loads(ptr addrspace(1) %output, i32 %soffset1, i32 %soffset2) {
+; CHECK-LABEL: test_multiple_loads:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v1, v0, s[4:7], s2 offen
+; CHECK-NEXT:    buffer_load_dword v2, v0, s[4:7], s3 offen
+; CHECK-NEXT:    v_mov_b32_e32 v0, 0
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    v_add_u32_e32 v1, v1, v2
+; CHECK-NEXT:    global_store_dword v0, v1, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+
+  %sum1 = add i32 %voffset, %soffset1
+  %val1 = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum1, i32 0, i32 0)
+
+  %sum2 = add i32 %voffset, %soffset2
+  %val2 = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum2, i32 0, i32 0)
+
+  %result = add i32 %val1, %val2
+  store i32 %result, ptr addrspace(1) %output
+  ret void
+}
+
+; Different buffer load variants - byte load
+define amdgpu_kernel void @test_buffer_load_byte(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_buffer_load_byte:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_ubyte v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  %ext = zext i8 %val to i32
+  store i32 %ext, ptr addrspace(1) %output
+  ret void
+}
+
+; Different buffer load variants - short load
+define amdgpu_kernel void @test_buffer_load_short(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_buffer_load_short:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_ushort v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call i16 @llvm.amdgcn.raw.buffer.load.i16(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  %ext = zext i16 %val to i32
+  store i32 %ext, ptr addrspace(1) %output
+  ret void
+}
+
+; Vector loads - v2i32
+define amdgpu_kernel void @test_buffer_load_v2i32(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_buffer_load_v2i32:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v2, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dwordx2 v[0:1], v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dwordx2 v2, v[0:1], s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call <2 x i32> @llvm.amdgcn.raw.buffer.load.v2i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store <2 x i32> %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Vector loads - v4i32
+define amdgpu_kernel void @test_buffer_load_v4i32(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_buffer_load_v4i32:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v4, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dwordx4 v[0:3], v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call <4 x i32> @llvm.amdgcn.raw.buffer.load.v4i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store <4 x i32> %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Float loads
+define amdgpu_kernel void @test_buffer_load_float(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_buffer_load_float:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call float @llvm.amdgcn.raw.buffer.load.f32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store float %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Complex divergent expression + uniform
+define amdgpu_kernel void @test_complex_divergent(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_complex_divergent:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    v_add_u32_e32 v0, v0, v1
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %tid_x = call i32 @llvm.amdgcn.workitem.id.x()
+  %tid_y = call i32 @llvm.amdgcn.workitem.id.y()
+  %divergent = add i32 %tid_x, %tid_y  ; Still divergent
+  %sum = add i32 %divergent, %soffset  ; divergent + uniform
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Should NOT optimize - both operands divergent
+define amdgpu_kernel void @test_both_divergent(ptr addrspace(1) %output) {
+; CHECK-LABEL: test_both_divergent:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    v_add_u32_e32 v0, v0, v1
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], 0 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %tid_x = call i32 @llvm.amdgcn.workitem.id.x()
+  %tid_y = call i32 @llvm.amdgcn.workitem.id.y()
+  %sum = add i32 %tid_x, %tid_y
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Should NOT optimize - both operands uniform
+define amdgpu_kernel void @test_both_uniform(ptr addrspace(1) %output, i32 %soffset1, i32 %soffset2) {
+; CHECK-LABEL: test_both_uniform:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_add_i32 s2, s2, s3
+; CHECK-NEXT:    v_mov_b32_e32 v0, s2
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[4:7], 0 offen
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %sum = add i32 %soffset1, %soffset2
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Nested in control flow
+define amdgpu_kernel void @test_control_flow(ptr addrspace(1) %output, i32 %soffset, i32 %condition) {
+; CHECK-LABEL: test_control_flow:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_cmp_lg_u32 s3, 0
+; CHECK-NEXT:    s_cbranch_scc0 .LBB11_4
+; CHECK-NEXT:  ; %bb.1: ; %else
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    global_store_dword v1, v1, s[0:1]
+; CHECK-NEXT:    s_cbranch_execnz .LBB11_3
+; CHECK-NEXT:  .LBB11_2: ; %then
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[4:7], s2 offen
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:  .LBB11_3: ; %end
+; CHECK-NEXT:    s_endpgm
+; CHECK-NEXT:  .LBB11_4:
+; CHECK-NEXT:    s_branch .LBB11_2
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %cmp = icmp eq i32 %condition, 0
+  br i1 %cmp, label %then, label %else
+
+then:
+  %sum = add i32 %voffset, %soffset
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  br label %end
+
+else:
+  store i32 0, ptr addrspace(1) %output
+  br label %end
+
+end:
+  ret void
+}
+
+; Multiple uses of the ADD result - should still optimize buffer load
+define amdgpu_kernel void @test_multiple_uses(ptr addrspace(1) %output1, ptr addrspace(1) %output2, i32 %soffset) {
+; CHECK-LABEL: test_multiple_uses:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x34
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v2, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v1, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; CHECK-NEXT:    v_add_u32_e32 v0, s6, v0
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v2, v1, s[0:1]
+; CHECK-NEXT:    global_store_dword v2, v0, s[2:3]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output1
+  store i32 %sum, ptr addrspace(1) %output2
+  ret void
+}
+
+; Chain of operations - workitem.id -> mul -> add -> buffer_load
+define amdgpu_kernel void @test_operation_chain(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_operation_chain:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    v_mul_u32_u24_e32 v0, 4, v0
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %tid = call i32 @llvm.amdgcn.workitem.id.x()
+  %scaled = mul i32 %tid, 4  ; Still divergent
+  %sum = add i32 %scaled, %soffset  ; divergent + uniform
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Should NOT optimize - Buffer load with non-zero soffset field already
+define amdgpu_kernel void @test_existing_soffset(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_existing_soffset:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    v_add_u32_e32 v0, s6, v0
+; CHECK-NEXT:    s_movk_i32 s6, 0x64
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 100, i32 0)  ; Non-zero soffset
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Should NOT optimize - Structured buffer loads
+define amdgpu_kernel void @test_struct_buffer_load(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_struct_buffer_load:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NE...
[truncated]

Contributor

@shiltian shiltian left a comment


Is this pass the right place to do this?

@PrasoonMishra PrasoonMishra force-pushed the amdgpu-mubuf-soffset-sinking branch from c4872b0 to 8d78ffd on November 24, 2025 at 04:47
@PrasoonMishra
Contributor Author

PrasoonMishra commented Nov 24, 2025

Is this pass the right place to do this?

Yes. AMDGPUCodeGenPrepare seems like a better place for this than SelectionDAG: when I initially attempted it in SelectionDAG's PreprocessISelDAG, @arsenm and @krzysz00 suggested moving it to the IR level (see PR #160939). This location also makes it easier to implement the next optimization, which promotes a buffer load to a scalar buffer load when both voffset and soffset are uniform. I plan to add that in a follow-up PR.
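For concreteness, a rough sketch of the input shape that follow-up would target (hypothetical names; the promotion would also need its own legality checks):

  ; current form: the whole offset is uniform
  %off = add i32 %uniform_a, %uniform_b
  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %off, i32 0, i32 0)
  ; possible scalar form after the follow-up
  %sval = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 %off, i32 0)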

Comment on lines +2088 to +2089
if (!match(SOffset, m_Zero()))
return false;
Contributor

How does this code evolve when we want to handle soffsets that are not zero, or more complex voffsets like (((non_uniform + uniform_a) + uniform_b) + 8)?

In the latter case, 8 is kept as-is so it can later be sunk into the constant part. uniform_a and uniform_b are kept there too, even though it would have been possible to move at least uniform_b to the soffset.

I understand that as a first step we can put this transformation here, but I would not be surprised if it eventually gets its own pass (or a pass that does "uniform reassociation" more generally).
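For reference, the nested case above would look roughly like this in IR (hypothetical names). With the current single-level match, only the outermost add is inspected; its uniform addend is the constant 8, which fits the immediate field, so the transform bails and uniform_a and uniform_b stay in voffset:

  %t0  = add i32 %non_uniform, %uniform_a
  %t1  = add i32 %t0, %uniform_b
  %sum = add i32 %t1, 8
  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)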

Contributor Author

@PrasoonMishra PrasoonMishra Nov 24, 2025

You're correct. I intentionally targeted the simplest case (soffset = 0, single-level add) as a starting point. I wanted to evolve it incrementally here, as the current pattern is simple enough that a separate pass felt premature.
That said, if you feel it would benefit from its own pass from the start, I'm happy to refactor it out now. What do you think?

@jayfoad
Contributor

jayfoad commented Nov 24, 2025

The claim here is that for raw buffer operations there is no semantic reason to separate out the voffset/soffset/immediate parts of the offset. So a more comprehensive implementation would be to canonicalize them as early as possible with a single offset operand which is the sum of the three offsets provided by the programmer. Then instruction selection can split that back into divergent/uniform/constant parts like we already do for lots of addressing modes.
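A sketch of that canonical form (hypothetical names): the programmer-provided offsets are summed into a single value, soffset is folded to zero, and the divergent/uniform/constant split is left to instruction selection:

  ; as written: separate voffset and soffset
  %val  = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %voffset, i32 %soffset, i32 0)
  ; canonical form: one combined offset operand
  %off  = add i32 %voffset, %soffset
  %cval = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %off, i32 0, i32 0)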

This patch implements an optimization to partition MUBUF load/store offsets
into vector and scalar components for better address coalescing and reduced
VGPR pressure.

Transform buffer operations where voffset = add(uniform, divergent) by
moving the uniform part to soffset and keeping the divergent part in voffset.

Before:
  v_add_u32 v1, v0, sN
  buffer_{load,store}_T v*, v1, s[bufDesc:bufDesc+3] offen

After:
  buffer_{load,store}_T v*, v0, s[bufDesc:bufDesc+3], sN offen

The optimization currently applies to raw buffer loads/stores when soffset is
initially zero.

Tests include comprehensive validation of both buffer loads and stores
across the supported variants (i8, i16, i32, vectors, floats), with
positive and negative test cases.
@PrasoonMishra PrasoonMishra force-pushed the amdgpu-mubuf-soffset-sinking branch from 8d78ffd to 244905c on November 24, 2025 at 11:53
@PrasoonMishra
Contributor Author

The claim here is that for raw buffer operations there is no semantic reason to separate out the voffset/soffset/immediate parts of the offset. So a more comprehensive implementation would be to canonicalize them as early as possible with a single offset operand which is the sum of the three offsets provided by the programmer. Then instruction selection can split that back into divergent/uniform/constant parts like we already do for lots of addressing modes.

I agree that semantically raw buffer addressing is just the sum of voffset + soffset + imm. However, we keep them separate to enable later optimizations like promoting to SMEM when everything is uniform. This makes late splitting tricky because it must preserve uniformity:

// Original
voffset = (divergent_tid * stride) + uniform_val1;
soffset = uniform_val2;
// Backend knows soffset is uniform, so it can promote to s_buffer_load if voffset == 0

// After merging:
merged_offset = (divergent_tid * stride) + uniform_val1 + uniform_val2;
// Problem: merged_offset is divergent, hence cannot promote even though most is uniform.

A merged offset would also need to carry alignment info for SMEM promotion (see #138975). Without that, merging would prevent s_buffer_load promotion.

@jayfoad
Contributor

jayfoad commented Nov 24, 2025

I agree that semantically raw buffer addressing is just the sum of voffset + soffset + imm. However, we keep them separate to enable later optimizations like promoting to SMEM when everything is uniform. This makes late splitting tricky because it must preserve uniformity

No, it is simple. You can only promote to SMEM if the entire address is uniform.

// Original
voffset = (divergent_tid * stride) + uniform_val1;
soffset = uniform_val2;
// Backend knows soffset is uniform can promote to s_buffer_load if voffset == 0

I don't understand the "if voffset == 0" condition, since this example has a non-zero voffset.

// After merging:
merged_offset = (divergent_tid * stride) + uniform_val1 + uniform_val2;
// Problem: merged_offset is divergent, hence cannot promote even though most is uniform.

The part that came from voffset is still non-zero and divergent, so it would be incorrect to promote to SMEM.

Comment on lines +2075 to +2076
if (!IsLoad && !IsStore)
return false;
Contributor

Suggested change
if (!IsLoad && !IsStore)
return false;

This is unreachable

Comment on lines +2066 to +2069
bool IsLoad = (IID == Intrinsic::amdgcn_raw_buffer_load ||
IID == Intrinsic::amdgcn_raw_buffer_load_format ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_load ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_load_format);
Contributor

Suggested change
bool IsLoad = (IID == Intrinsic::amdgcn_raw_buffer_load ||
IID == Intrinsic::amdgcn_raw_buffer_load_format ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_load ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_load_format);
bool IsLoad = IID == Intrinsic::amdgcn_raw_buffer_load ||
IID == Intrinsic::amdgcn_raw_buffer_load_format ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_load ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_load_format;

Comment on lines +2070 to +2073
bool IsStore = (IID == Intrinsic::amdgcn_raw_buffer_store ||
IID == Intrinsic::amdgcn_raw_buffer_store_format ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_store ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_store_format);
Contributor

Suggested change
bool IsStore = (IID == Intrinsic::amdgcn_raw_buffer_store ||
IID == Intrinsic::amdgcn_raw_buffer_store_format ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_store ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_store_format);
bool IsStore = IID == Intrinsic::amdgcn_raw_buffer_store ||
IID == Intrinsic::amdgcn_raw_buffer_store_format ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_store ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_store_format;

Comment on lines +2112 to +2113
int64_t Offset = CI->getSExtValue();
if (Offset >= 0 && Offset <= 4095)
Contributor

This shouldn't be hardcoded, and doesn't this vary per subtarget? Check isLegalAddressingMode?


LLVM_DEBUG(dbgs() << "AMDGPUCodeGenPrepare: Sinking uniform addend into "
"soffset for buffer "
<< (IsStore ? "store" : "load") << ": " << I << '\n');
Contributor

Suggested change
<< (IsStore ? "store" : "load") << ": " << I << '\n');
<< ": " << I << '\n');

If you're printing the whole instruction, adding the extra word doesn't help.

@@ -0,0 +1,457 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -global-isel=0 < %s | FileCheck -check-prefixes=CHECK %s
Contributor

Suggested change
; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -global-isel=0 < %s | FileCheck -check-prefixes=CHECK %s
; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck %s

You should only explicitly disable global-isel if you are also testing global-isel.

; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; CHECK-NEXT: global_store_dword v1, v0, s[0:1]
; CHECK-NEXT: s_endpgm
%desc = call <4 x i32> asm "", "=s"()
Contributor

I'd prefer to test in non-kernel functions and rely on inreg arguments for SGPRs. That avoids a lot of prolog noise.
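For illustration, a hypothetical test shape (not taken from the patch) that keeps the uniform inputs in SGPRs without the kernarg-loading prolog:

  define float @nonkernel_inreg(<4 x i32> inreg %desc, i32 inreg %soffset) {
    %voffset = call i32 @llvm.amdgcn.workitem.id.x()
    %sum = add i32 %voffset, %soffset
    %val = call float @llvm.amdgcn.raw.buffer.load.f32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
    ret float %val
  }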

@krzysz00
Contributor

(Noting that I have seen this PR and should be reminded to read it after Thanksgiving)

Also, re voffset/soffset/..., I'm just going to write out what I think might be a subtlety here around alignment.

let v0 = ...(align 4)
let s0 = ...(align 4)
let v1 = v0 + s0 + 4

I0: buffer_load_dword rsrc, (voffset = v1) 
<=>
I1: buffer_load_dword rsrc, (vOffset = v0, sOffset = s0, immOffset = 4)

;;; However?

let v2 = v0 + 3
let s1 = s0 + 1
I2: buffer_load_dword rsrc (voffset = v2, soffset = s1)
<?=>
I3: buffer_load_dword rsrc, (vOffset = v0, soffset = s0, immOffset = 4)

These are the bounds checking expressions for those four load instructions (on GCN) (as I understand them currently)

I0: (v0 + s0 + 4) +[unsigned,saturating] 4 >[unsigned] R
I1: v0 +[?] +[usat] 4 >[u] R -[unsigned,negative => 0] s0
I2: (v0 + 3) +[usat] 4 >[u] R -[u,clamping] (s0 + 1)
I3: v0 +[?] +[usat] 4 >[u] R -[u,clamping] s0

(I don't know and don't want to test right this second what the overflow behavior of VOFFSET + IMMOFFSET is, hence the +[?] syntax there - I suspect it's something that allows the existing VOFFSET => VOFFSET + IMMOFFSET transformation to take place ... unless it isn't)

I suspect this'll require a bit of testing just to make sure, because some of this shuffling around the inequality results in alive2 complaining once you try and leave the realm of pure integers.

(This also raises the possibility that going from voffset = v0 + 4 to voffset = v0, immOffset = 4 isn't correct, but ... it is, so I suspect I've got the problem modeled subtly wrong)
