Conversation

@PrasoonMishra
Contributor

This patch implements an optimization to partition MUBUF load/store offsets into vector and scalar components for better scheduling and reduced VGPR pressure.

Transform buffer operations where voffset = add(uniform, divergent) by moving the uniform part to soffset and keeping the divergent part in voffset.

Before:
  v_add_u32 v1, v0, sN
  buffer_{load,store}_T v*, v1, s[bufDesc:bufDesc+3] offen

After:
  buffer_{load,store}_T v*, v0, s[bufDesc:bufDesc+3], sN offen

The optimization currently applies to raw buffer loads/stores when soffset is initially zero.
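At the IR level, where AMDGPUCodeGenPrepare runs, the rewrite is roughly the following (a sketch with hypothetical value names, using the same intrinsic signature as the tests):

Before:
  %sum = add i32 %divergent, %uniform
  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)

After:
  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %divergent, i32 %uniform, i32 0)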

Test coverage is provided by buffer-offset-to-soffset-loads and buffer-offset-to-soffset-stores, which include comprehensive validation across i8, i16, i32, vector (v2/v4), and float variants, including positive and negative cases.

@llvmbot
Member

llvmbot commented Nov 23, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Prasoon Mishra (PrasoonMishra)

Changes

This patch implements an optimization to partition MUBUF load/store offsets into vector and scalar components for better scheduling and reduced VGPR pressure.

Transform buffer operations where voffset = add(uniform, divergent) by moving the uniform part to soffset and keeping the divergent part in voffset.

Before:
  v_add_u32 v1, v0, sN
  buffer_{load,store}_T v*, v1, s[bufDesc:bufDesc+3] offen

After:
  buffer_{load,store}_T v*, v0, s[bufDesc:bufDesc+3], sN offen

The optimization currently applies to raw buffer loads/stores when soffset is initially zero.

Test coverage is provided by buffer-offset-to-soffset-loads and buffer-offset-to-soffset-stores, which include comprehensive validation across i8, i16, i32, vector (v2/v4), and float variants, including positive and negative cases.


Patch is 45.24 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/169230.diff

5 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp (+90)
  • (added) llvm/test/CodeGen/AMDGPU/buffer-offset-to-soffset-loads.ll (+457)
  • (added) llvm/test/CodeGen/AMDGPU/buffer-offset-to-soffset-stores.ll (+461)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.ll (+10-6)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.buffer.load.ll (+8-4)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
index 8e35ba77d69aa..9ff2e2a4bdf82 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
@@ -260,6 +260,7 @@ class AMDGPUCodeGenPrepareImpl
   bool visitIntrinsicInst(IntrinsicInst &I);
   bool visitFMinLike(IntrinsicInst &I);
   bool visitSqrt(IntrinsicInst &I);
+  bool visitBufferIntrinsic(IntrinsicInst &I);
   bool run();
 };
 
@@ -1910,6 +1911,15 @@ bool AMDGPUCodeGenPrepareImpl::visitIntrinsicInst(IntrinsicInst &I) {
     return visitFMinLike(I);
   case Intrinsic::sqrt:
     return visitSqrt(I);
+  case Intrinsic::amdgcn_raw_buffer_load:
+  case Intrinsic::amdgcn_raw_buffer_load_format:
+  case Intrinsic::amdgcn_raw_buffer_store:
+  case Intrinsic::amdgcn_raw_buffer_store_format:
+  case Intrinsic::amdgcn_raw_ptr_buffer_load:
+  case Intrinsic::amdgcn_raw_ptr_buffer_load_format:
+  case Intrinsic::amdgcn_raw_ptr_buffer_store:
+  case Intrinsic::amdgcn_raw_ptr_buffer_store_format:
+    return visitBufferIntrinsic(I);
   default:
     return false;
   }
@@ -2046,6 +2056,86 @@ bool AMDGPUCodeGenPrepareImpl::visitSqrt(IntrinsicInst &Sqrt) {
   return true;
 }
 
+/// Sink uniform addends in buffer address calculations into soffset.
+///
+/// Transforms buffer loads/stores with voffset = add(uniform, divergent)
+/// into voffset = divergent, soffset = uniform for better address coalescing
+/// Only applies to raw buffer operations with soffset initially zero.
+bool AMDGPUCodeGenPrepareImpl::visitBufferIntrinsic(IntrinsicInst &I) {
+  Intrinsic::ID IID = I.getIntrinsicID();
+  bool IsLoad = (IID == Intrinsic::amdgcn_raw_buffer_load ||
+                 IID == Intrinsic::amdgcn_raw_buffer_load_format ||
+                 IID == Intrinsic::amdgcn_raw_ptr_buffer_load ||
+                 IID == Intrinsic::amdgcn_raw_ptr_buffer_load_format);
+  bool IsStore = (IID == Intrinsic::amdgcn_raw_buffer_store ||
+                  IID == Intrinsic::amdgcn_raw_buffer_store_format ||
+                  IID == Intrinsic::amdgcn_raw_ptr_buffer_store ||
+                  IID == Intrinsic::amdgcn_raw_ptr_buffer_store_format);
+
+  if (!IsLoad && !IsStore)
+    return false;
+
+  // Buffer intrinsic operand layout (same for vector and pointer descriptor):
+  // Load:  (rsrc, voffset, soffset, cachepolicy)
+  // Store: (vdata, rsrc, voffset, soffset, cachepolicy)
+  const unsigned VOffsetIdx = IsStore ? 2 : 1;
+  const unsigned SOffsetIdx = IsStore ? 3 : 2;
+
+  Value *VOffset = I.getArgOperand(VOffsetIdx);
+  Value *SOffset = I.getArgOperand(SOffsetIdx);
+
+  // Only optimize when soffset is currently zero
+  if (!match(SOffset, m_Zero()))
+    return false;
+
+  // Pattern match: voffset = add(uniform, divergent)
+  Value *LHS, *RHS;
+  if (!match(VOffset, m_Add(m_Value(LHS), m_Value(RHS))))
+    return false;
+
+  bool LHSUniform = UA.isUniform(LHS);
+  bool RHSUniform = UA.isUniform(RHS);
+
+  // Need exactly one uniform and one divergent operand.
+  // TODO: Handle the case where both are uniform.
+  if (LHSUniform == RHSUniform)
+    return false;
+
+  Value *UniformAddend = LHSUniform ? LHS : RHS;
+  Value *DivergentAddend = LHSUniform ? RHS : LHS;
+
+  // Skip if the uniform addend is a non-negative constant that fits in the
+  // 12-bit immediate offset field. The backend will fold it into the immediate
+  // field, which avoids consuming an soffset operand.
+  // Negative or large constants must use soffset.
+  if (auto *CI = dyn_cast<ConstantInt>(UniformAddend)) {
+    int64_t Offset = CI->getSExtValue();
+    if (Offset >= 0 && Offset <= 4095)
+      return false;
+  }
+
+  LLVM_DEBUG(dbgs() << "AMDGPUCodeGenPrepare: Sinking uniform addend into "
+                       "soffset for buffer "
+                    << (IsStore ? "store" : "load") << ": " << I << '\n');
+
+  // Clone the instruction and insert it before the old instruction
+  CallInst *NewCall = cast<CallInst>(I.clone());
+  NewCall->insertBefore(I.getIterator());
+
+  // Update voffset and soffset operands
+  NewCall->setArgOperand(VOffsetIdx, DivergentAddend);
+  NewCall->setArgOperand(SOffsetIdx, UniformAddend);
+
+  // Replace and erase the old instruction
+  if (IsLoad) {
+    NewCall->takeName(&I);
+    I.replaceAllUsesWith(NewCall);
+  }
+  I.eraseFromParent();
+
+  return true;
+}
+
 bool AMDGPUCodeGenPrepare::runOnFunction(Function &F) {
   if (skipFunction(F))
     return false;
diff --git a/llvm/test/CodeGen/AMDGPU/buffer-offset-to-soffset-loads.ll b/llvm/test/CodeGen/AMDGPU/buffer-offset-to-soffset-loads.ll
new file mode 100644
index 0000000000000..c520062c5a01e
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/buffer-offset-to-soffset-loads.ll
@@ -0,0 +1,457 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -global-isel=0 < %s | FileCheck -check-prefixes=CHECK %s
+
+; Test comprehensive patterns for ADD(divergent, uniform) optimization in buffer loads
+
+; Basic workitem.id.x + uniform
+define amdgpu_kernel void @test_basic_workitem_uniform(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_basic_workitem_uniform:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Reversed operands (uniform + divergent)
+define amdgpu_kernel void @test_reversed_operands(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_reversed_operands:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %soffset, %voffset  ; Reversed: uniform + divergent
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Multiple buffer loads with same pattern
+define amdgpu_kernel void @test_multiple_loads(ptr addrspace(1) %output, i32 %soffset1, i32 %soffset2) {
+; CHECK-LABEL: test_multiple_loads:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v1, v0, s[4:7], s2 offen
+; CHECK-NEXT:    buffer_load_dword v2, v0, s[4:7], s3 offen
+; CHECK-NEXT:    v_mov_b32_e32 v0, 0
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    v_add_u32_e32 v1, v1, v2
+; CHECK-NEXT:    global_store_dword v0, v1, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+
+  %sum1 = add i32 %voffset, %soffset1
+  %val1 = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum1, i32 0, i32 0)
+
+  %sum2 = add i32 %voffset, %soffset2
+  %val2 = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum2, i32 0, i32 0)
+
+  %result = add i32 %val1, %val2
+  store i32 %result, ptr addrspace(1) %output
+  ret void
+}
+
+; Different buffer load variants - byte load
+define amdgpu_kernel void @test_buffer_load_byte(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_buffer_load_byte:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_ubyte v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  %ext = zext i8 %val to i32
+  store i32 %ext, ptr addrspace(1) %output
+  ret void
+}
+
+; Different buffer load variants - short load
+define amdgpu_kernel void @test_buffer_load_short(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_buffer_load_short:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_ushort v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call i16 @llvm.amdgcn.raw.buffer.load.i16(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  %ext = zext i16 %val to i32
+  store i32 %ext, ptr addrspace(1) %output
+  ret void
+}
+
+; Vector loads - v2i32
+define amdgpu_kernel void @test_buffer_load_v2i32(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_buffer_load_v2i32:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v2, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dwordx2 v[0:1], v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dwordx2 v2, v[0:1], s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call <2 x i32> @llvm.amdgcn.raw.buffer.load.v2i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store <2 x i32> %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Vector loads - v4i32
+define amdgpu_kernel void @test_buffer_load_v4i32(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_buffer_load_v4i32:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v4, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dwordx4 v[0:3], v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call <4 x i32> @llvm.amdgcn.raw.buffer.load.v4i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store <4 x i32> %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Float loads
+define amdgpu_kernel void @test_buffer_load_float(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_buffer_load_float:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call float @llvm.amdgcn.raw.buffer.load.f32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store float %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Complex divergent expression + uniform
+define amdgpu_kernel void @test_complex_divergent(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_complex_divergent:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    v_add_u32_e32 v0, v0, v1
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %tid_x = call i32 @llvm.amdgcn.workitem.id.x()
+  %tid_y = call i32 @llvm.amdgcn.workitem.id.y()
+  %divergent = add i32 %tid_x, %tid_y  ; Still divergent
+  %sum = add i32 %divergent, %soffset  ; divergent + uniform
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Should NOT optimize - both operands divergent
+define amdgpu_kernel void @test_both_divergent(ptr addrspace(1) %output) {
+; CHECK-LABEL: test_both_divergent:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    v_add_u32_e32 v0, v0, v1
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], 0 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %tid_x = call i32 @llvm.amdgcn.workitem.id.x()
+  %tid_y = call i32 @llvm.amdgcn.workitem.id.y()
+  %sum = add i32 %tid_x, %tid_y
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Should NOT optimize - both operands uniform
+define amdgpu_kernel void @test_both_uniform(ptr addrspace(1) %output, i32 %soffset1, i32 %soffset2) {
+; CHECK-LABEL: test_both_uniform:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_add_i32 s2, s2, s3
+; CHECK-NEXT:    v_mov_b32_e32 v0, s2
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[4:7], 0 offen
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %sum = add i32 %soffset1, %soffset2
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Nested in control flow
+define amdgpu_kernel void @test_control_flow(ptr addrspace(1) %output, i32 %soffset, i32 %condition) {
+; CHECK-LABEL: test_control_flow:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_cmp_lg_u32 s3, 0
+; CHECK-NEXT:    s_cbranch_scc0 .LBB11_4
+; CHECK-NEXT:  ; %bb.1: ; %else
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    global_store_dword v1, v1, s[0:1]
+; CHECK-NEXT:    s_cbranch_execnz .LBB11_3
+; CHECK-NEXT:  .LBB11_2: ; %then
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[4:7], s2 offen
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:  .LBB11_3: ; %end
+; CHECK-NEXT:    s_endpgm
+; CHECK-NEXT:  .LBB11_4:
+; CHECK-NEXT:    s_branch .LBB11_2
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %cmp = icmp eq i32 %condition, 0
+  br i1 %cmp, label %then, label %else
+
+then:
+  %sum = add i32 %voffset, %soffset
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  br label %end
+
+else:
+  store i32 0, ptr addrspace(1) %output
+  br label %end
+
+end:
+  ret void
+}
+
+; Multiple uses of the ADD result - should still optimize buffer load
+define amdgpu_kernel void @test_multiple_uses(ptr addrspace(1) %output1, ptr addrspace(1) %output2, i32 %soffset) {
+; CHECK-LABEL: test_multiple_uses:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x34
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v2, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v1, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; CHECK-NEXT:    v_add_u32_e32 v0, s6, v0
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v2, v1, s[0:1]
+; CHECK-NEXT:    global_store_dword v2, v0, s[2:3]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output1
+  store i32 %sum, ptr addrspace(1) %output2
+  ret void
+}
+
+; Chain of operations - workitem.id -> mul -> add -> buffer_load
+define amdgpu_kernel void @test_operation_chain(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_operation_chain:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    v_mul_u32_u24_e32 v0, 4, v0
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %tid = call i32 @llvm.amdgcn.workitem.id.x()
+  %scaled = mul i32 %tid, 4  ; Still divergent
+  %sum = add i32 %scaled, %soffset  ; divergent + uniform
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Should NOT optimize - Buffer load with non-zero soffset field already
+define amdgpu_kernel void @test_existing_soffset(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_existing_soffset:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    v_add_u32_e32 v0, s6, v0
+; CHECK-NEXT:    s_movk_i32 s6, 0x64
+; CHECK-NEXT:    buffer_load_dword v0, v0, s[0:3], s6 offen
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    global_store_dword v1, v0, s[0:1]
+; CHECK-NEXT:    s_endpgm
+  %desc = call <4 x i32> asm "", "=s"()
+  %voffset = call i32 @llvm.amdgcn.workitem.id.x()
+  %sum = add i32 %voffset, %soffset
+  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 100, i32 0)  ; Non-zero soffset
+  store i32 %val, ptr addrspace(1) %output
+  ret void
+}
+
+; Should NOT optimize - Structured buffer loads
+define amdgpu_kernel void @test_struct_buffer_load(ptr addrspace(1) %output, i32 %soffset) {
+; CHECK-LABEL: test_struct_buffer_load:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_load_dword s6, s[4:5], 0x2c
+; CHECK-NEXT:    ;;#ASMSTART
+; CHECK-NEXT:    ;;#ASMEND
+; CHECK-NEXT:    v_mov_b32_e32 v1, 0
+; CHECK-NE...
[truncated]

Contributor

@shiltian shiltian left a comment


Is this pass the right place to do this?

@PrasoonMishra PrasoonMishra force-pushed the amdgpu-mubuf-soffset-sinking branch from c4872b0 to 8d78ffd on November 24, 2025 at 04:47
@PrasoonMishra
Contributor Author

PrasoonMishra commented Nov 24, 2025

Is this pass the right place to do this?

Yes. AMDGPUCodeGenPrepare seems like a better place for this than SelectionDAG: when I initially attempted it in SelectionDAG's PreprocessISelDAG, @arsenm and @krzysz00 suggested moving it to the IR level (see PR #160939). This location also makes it easier to implement the next optimization, which promotes a buffer load to a scalar buffer load when both voffset and soffset are uniform. I plan to add that in a follow-up PR.
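For concreteness, a rough sketch of the input shape that follow-up would target (hypothetical names; the promotion would also need its own legality checks):

  ; current form: the whole offset is uniform
  %off = add i32 %uniform_a, %uniform_b
  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %off, i32 0, i32 0)
  ; possible scalar form after the follow-up
  %sval = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 %off, i32 0)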

Comment on lines +2088 to +2089
if (!match(SOffset, m_Zero()))
return false;
Contributor

How does this code evolve when we want to handle soffsets that are not zero, or more complex voffsets like (((non_uniform + uniform_a) + uniform_b) + 8)?

In the latter case, 8 is kept as-is so it can later be sunk into the constant part. uniform_a and uniform_b are kept there too, even though it would have been possible to move at least uniform_b to the soffset.

I understand that as a first step we can put this transformation here, but I would not be surprised if it eventually gets its own pass (or a pass that does "uniform reassociation" more generally).
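For reference, the nested case above would look roughly like this in IR (hypothetical names). With the current single-level match, only the outermost add is inspected; its uniform addend is the constant 8, which fits the immediate field, so the transform bails and uniform_a and uniform_b stay in voffset:

  %t0  = add i32 %non_uniform, %uniform_a
  %t1  = add i32 %t0, %uniform_b
  %sum = add i32 %t1, 8
  %val = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)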

Contributor Author

@PrasoonMishra PrasoonMishra Nov 24, 2025

You're correct. I intentionally targeted the simplest case (soffset = 0, single-level add) as a starting point. I wanted to evolve it incrementally here, as the current pattern is simple enough that a separate pass felt premature.
That said, if you feel it would benefit from its own pass from the start, I'm happy to refactor it out now. What do you think?

@jayfoad
Contributor

jayfoad commented Nov 24, 2025

The claim here is that for raw buffer operations there is no semantic reason to separate out the voffset/soffset/immediate parts of the offset. So a more comprehensive implementation would be to canonicalize them as early as possible with a single offset operand which is the sum of the three offsets provided by the programmer. Then instruction selection can split that back into divergent/uniform/constant parts like we already do for lots of addressing modes.
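A sketch of that canonical form (hypothetical names): the programmer-provided offsets are summed into a single value, soffset is folded to zero, and the divergent/uniform/constant split is left to instruction selection:

  ; as written: separate voffset and soffset
  %val  = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %voffset, i32 %soffset, i32 0)
  ; canonical form: one combined offset operand
  %off  = add i32 %voffset, %soffset
  %cval = call i32 @llvm.amdgcn.raw.buffer.load.i32(<4 x i32> %desc, i32 %off, i32 0, i32 0)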

This patch implements an optimization to partition MUBUF load/store offsets
into vector and scalar components for better address coalescing and reduced
VGPR pressure.

Transform buffer operations where voffset = add(uniform, divergent) by
moving the uniform part to soffset and keeping the divergent part in voffset.

Before:
  v_add_u32 v1, v0, sN
  buffer_{load,store}_T v*, v1, s[bufDesc:bufDesc+3] offen

After:
  buffer_{load,store}_T v*, v0, s[bufDesc:bufDesc+3], sN offen

The optimization currently applies to raw buffer loads/stores when soffset is
initially zero.

Tests include comprehensive validation of both buffer loads and stores
across the supported variants (i8, i16, i32, vectors, floats), with
positive and negative test cases.
@PrasoonMishra PrasoonMishra force-pushed the amdgpu-mubuf-soffset-sinking branch from 8d78ffd to 244905c on November 24, 2025 at 11:53
@PrasoonMishra
Contributor Author

The claim here is that for raw buffer operations there is no semantic reason to separate out the voffset/soffset/immediate parts of the offset. So a more comprehensive implementation would be to canonicalize them as early as possible with a single offset operand which is the sum of the three offsets provided by the programmer. Then instruction selection can split that back into divergent/uniform/constant parts like we already do for lots of addressing modes.

I agree that semantically raw buffer addressing is just the sum of voffset + soffset + imm. However, we keep them separate to enable later optimizations like promoting to SMEM when everything is uniform. This makes late splitting tricky because it must preserve uniformity:

// Original
voffset = (divergent_tid * stride) + uniform_val1;
soffset = uniform_val2;
// Backend knows soffset is uniform, so it can promote to s_buffer_load if voffset == 0

// After merging:
merged_offset = (divergent_tid * stride) + uniform_val1 + uniform_val2;
// Problem: merged_offset is divergent, hence cannot promote even though most is uniform.

A merged offset would also need to carry alignment info for SMEM promotion (see #138975). Without that, merging would prevent s_buffer_load promotion.

@jayfoad
Contributor

jayfoad commented Nov 24, 2025

I agree that semantically raw buffer addressing is just the sum of voffset + soffset + imm. However, we keep them separate to enable later optimizations like promoting to SMEM when everything is uniform. This makes late splitting tricky because it must preserve uniformity

No, it is simple. You can only promote to SMEM if the entire address is uniform.

// Original
voffset = (divergent_tid * stride) + uniform_val1;
soffset = uniform_val2;
// Backend knows soffset is uniform can promote to s_buffer_load if voffset == 0

I don't understand the "if voffset == 0" condition, since this example has a non-zero voffset.

// After merging:
merged_offset = (divergent_tid * stride) + uniform_val1 + uniform_val2;
// Problem: merged_offset is divergent, hence cannot promote even though most is uniform.

The part that came from voffset is still non-zero and divergent, so it would be incorrect to promote to SMEM.

Comment on lines +2075 to +2076
if (!IsLoad && !IsStore)
return false;
Contributor

Suggested change
if (!IsLoad && !IsStore)
return false;

This is unreachable

Comment on lines +2066 to +2069
bool IsLoad = (IID == Intrinsic::amdgcn_raw_buffer_load ||
IID == Intrinsic::amdgcn_raw_buffer_load_format ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_load ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_load_format);
Contributor

Suggested change
bool IsLoad = (IID == Intrinsic::amdgcn_raw_buffer_load ||
IID == Intrinsic::amdgcn_raw_buffer_load_format ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_load ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_load_format);
bool IsLoad = IID == Intrinsic::amdgcn_raw_buffer_load ||
IID == Intrinsic::amdgcn_raw_buffer_load_format ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_load ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_load_format;

Comment on lines +2070 to +2073
bool IsStore = (IID == Intrinsic::amdgcn_raw_buffer_store ||
IID == Intrinsic::amdgcn_raw_buffer_store_format ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_store ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_store_format);
Contributor

Suggested change
bool IsStore = (IID == Intrinsic::amdgcn_raw_buffer_store ||
IID == Intrinsic::amdgcn_raw_buffer_store_format ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_store ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_store_format);
bool IsStore = IID == Intrinsic::amdgcn_raw_buffer_store ||
IID == Intrinsic::amdgcn_raw_buffer_store_format ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_store ||
IID == Intrinsic::amdgcn_raw_ptr_buffer_store_format;

Comment on lines +2112 to +2113
int64_t Offset = CI->getSExtValue();
if (Offset >= 0 && Offset <= 4095)
Contributor

This shouldn't be hardcoded, and doesn't this vary per subtarget? Check isLegalAddressingMode?


LLVM_DEBUG(dbgs() << "AMDGPUCodeGenPrepare: Sinking uniform addend into "
"soffset for buffer "
<< (IsStore ? "store" : "load") << ": " << I << '\n');
Contributor

Suggested change
<< (IsStore ? "store" : "load") << ": " << I << '\n');
<< ": " << I << '\n');

If you're printing the whole instruction, adding the extra word doesn't help.

@@ -0,0 +1,457 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -global-isel=0 < %s | FileCheck -check-prefixes=CHECK %s
Contributor

Suggested change
; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -global-isel=0 < %s | FileCheck -check-prefixes=CHECK %s
; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck %s

You should only explicitly disable global-isel if you are also testing global-isel.

; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; CHECK-NEXT: global_store_dword v1, v0, s[0:1]
; CHECK-NEXT: s_endpgm
%desc = call <4 x i32> asm "", "=s"()
Contributor

I'd prefer to test in non-kernel functions and rely on inreg arguments for SGPRs. That avoids a lot of prolog noise.
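For illustration, a hypothetical test shape (not taken from the patch) that keeps the uniform inputs in SGPRs without the kernarg-loading prolog:

  define float @nonkernel_inreg(<4 x i32> inreg %desc, i32 inreg %soffset) {
    %voffset = call i32 @llvm.amdgcn.workitem.id.x()
    %sum = add i32 %voffset, %soffset
    %val = call float @llvm.amdgcn.raw.buffer.load.f32(<4 x i32> %desc, i32 %sum, i32 0, i32 0)
    ret float %val
  }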

@krzysz00
Contributor

(Noting that I have seen this PR and should be reminded to read it after Thanksgiving)

Also, re voffset/soffset/..., I'm just going to write out what I think might be a subtlety here around alignment.

let v0 = ...(align 4)
let s0 = ...(align 4)
let v1 = v0 + s0 + 4

I0: buffer_load_dword rsrc, (voffset = v1) 
<=>
I1: buffer_load_dword rsrc, (vOffset = v0, sOffset = s0, immOffset = 4)

;;; However?

let v2 = v0 + 3
let s1 = s0 + 1
I2: buffer_load_dword rsrc (voffset = v2, soffset = s1)
<?=>
I3: buffer_load_dword rsrc, (vOffset = v0, soffset = s0, immOffset = 4)

These are the bounds checking expressions for those four load instructions (on GCN) (as I understand them currently)

I0: (v0 + s0 + 4) +[unsigned,saturating] 4 >[unsigned] R
I1: v0 +[?] +[usat] 4 >[u] R -[unsigned,negative => 0] s0
I2: (v0 + 3) +[usat] 4 >[u] R -[u,clamping] (s0 + 1)
I3: v0 +[?] +[usat] 4 >[u] R -[u,clamping] s0

(I don't know and don't want to test right this second what the overflow behavior of VOFFSET + IMMOFFSET is, hence the +[?] syntax there - I suspect it's something that allows the existing VOFFSET => VOFFSET + IMMOFFSET transformation to take place ... unless it isn't)

I suspect this'll require a bit of testing just to make sure, because some of this shuffling around the inequality results in alive2 complaining once you try and leave the realm of pure integers.

(This also raises the possibility that going from voffset = v0 + 4 to voffset = v0, immOffset = 4 isn't correct, but ... it is, so I suspect I've got the problem modeled subtly wrong)
