Conversation

@jmmartinez
Contributor

On gfx12, the s_buffer_load_(i/u)(8/16) instructions have a hardware bug that is triggered when:

  • the stride is not a multiple of 4, or
  • the stride is 0 and the num-records is not a multiple of 4

At the moment, these instructions are only generated for PAL.
But in that case, it is guaranteed that the buffer's stride/num-records are
aligned to 4.

This patch restricts the emission of scalar subword loads to PAL, where
the bug would never be triggered, and avoids them in HSA (where the bug
could be triggered, but the intrinsic is not used).

Solves SWDEV-498239
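
For reference, the trigger condition above can be summarized as a predicate over the buffer descriptor fields. This is a minimal sketch for illustration only; the function name and the Stride/NumRecords parameters are assumptions, not part of the patch.

  #include <cstdint>

  // Returns true when a gfx12 s_buffer_load_(i/u)(8/16) would hit the
  // hardware bug for a buffer with the given stride and num-records.
  static bool gfx12SubwordSBufferLoadIsBroken(uint32_t Stride,
                                              uint32_t NumRecords) {
    if (Stride != 0)
      return Stride % 4 != 0;   // stride is not a multiple of 4
    return NumRecords % 4 != 0; // stride is 0, num-records not a multiple of 4
  }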

@jmmartinez jmmartinez requested review from arsenm and jayfoad November 25, 2024 16:40
@jmmartinez jmmartinez self-assigned this Nov 25, 2024
@llvmbot
Member

llvmbot commented Nov 25, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Juan Manuel Martinez Caamaño (jmmartinez)

Changes

On gfx12, the s_buffer_load_(i/u)(8/16) instructions have a hardware bug that is triggered when:

  • the stride is not a multiple of 4, or
  • the stride is 0 and the num-records is not a multiple of 4

At the moment, these instructions are only generated for PAL.
But in that case, it is guaranteed that the buffer's stride/num-records are
aligned to 4.

This patch restricts the emission of scalar subword loads to PAL, where
the bug would never be triggered, and avoids them in HSA (where the bug
could be triggered, but the intrinsic is not used).

Solves SWDEV-498239


Patch is 30.82 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/117576.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp (+29-9)
  • (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.h (+15)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+43-38)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx12_scalar_subword_loads.ll (+246-124)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index 9bf1f281c32a09..bf60ae32b46108 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -6803,8 +6803,36 @@ bool AMDGPULegalizerInfo::legalizeSBufferLoad(LegalizerHelper &Helper,
   unsigned Size = Ty.getSizeInBits();
   MachineFunction &MF = B.getMF();
   unsigned Opc = 0;
+
+  const unsigned MemSize = (Size + 7) / 8;
+  const Align MemAlign = B.getDataLayout().getABITypeAlign(
+      getTypeForLLT(Ty, MF.getFunction().getContext()));
+
+  // FIXME: When intrinsic definition is fixed, this should have an MMO already.
+  MachineMemOperand *MMO = MF.getMachineMemOperand(
+      MachinePointerInfo(),
+      MachineMemOperand::MOLoad | MachineMemOperand::MODereferenceable |
+          MachineMemOperand::MOInvariant,
+      MemSize, MemAlign);
+
   if (Size < 32 && ST.hasScalarSubwordLoads()) {
     assert(Size == 8 || Size == 16);
+    if (!ST.hasScalarSubwordBufferLoads()) {
+      // fallback to S_BUFFER_LOAD_UBYTE/USHORT
+      MI.getOperand(1).setIntrinsicID(Intrinsic::amdgcn_raw_buffer_load);
+
+      Register ZeroReg =
+          B.getMRI()->createGenericVirtualRegister(LLT::scalar(32));
+      B.buildConstant(ZeroReg, 0);
+
+      MI.insert(MI.operands_begin() + 4,
+                {MachineOperand::CreateReg(ZeroReg, false)});
+
+      MI.addMemOperand(MF, MMO);
+      Observer.changedInstr(MI);
+      return true;
+    }
+
     Opc = Size == 8 ? AMDGPU::G_AMDGPU_S_BUFFER_LOAD_UBYTE
                     : AMDGPU::G_AMDGPU_S_BUFFER_LOAD_USHORT;
     // The 8-bit and 16-bit scalar buffer load instructions have 32-bit
@@ -6834,16 +6862,8 @@ bool AMDGPULegalizerInfo::legalizeSBufferLoad(LegalizerHelper &Helper,
   MI.setDesc(B.getTII().get(Opc));
   MI.removeOperand(1); // Remove intrinsic ID
 
-  // FIXME: When intrinsic definition is fixed, this should have an MMO already.
-  const unsigned MemSize = (Size + 7) / 8;
-  const Align MemAlign = B.getDataLayout().getABITypeAlign(
-      getTypeForLLT(Ty, MF.getFunction().getContext()));
-  MachineMemOperand *MMO = MF.getMachineMemOperand(
-      MachinePointerInfo(),
-      MachineMemOperand::MOLoad | MachineMemOperand::MODereferenceable |
-          MachineMemOperand::MOInvariant,
-      MemSize, MemAlign);
   MI.addMemOperand(MF, MMO);
+
   if (Dst != OrigDst) {
     MI.getOperand(0).setReg(Dst);
     B.setInsertPt(B.getMBB(), ++B.getInsertPt());
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index 18219174b16b1e..15d67d478465d6 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -460,6 +460,21 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
 
   bool hasScalarSubwordLoads() const { return getGeneration() >= GFX12; }
 
+  bool hasScalarSubwordBufferLoads() const {
+    Generation Gen = getGeneration();
+
+    // On gfx12, s_buffer_load_(i/u)(8/16) have a hw-bug that is triggered when:
+    // * the stride is not a multiple of 4, or
+    // * the stride is 0 and the num-records is not a multiple of 4
+    // At the moment, llvm.amdgcn.s.buffer.loads instruction are only generated
+    // for PAL by LLPC. In this case, it is guaranteed that the buffers
+    // stride/num-records are aligned to 4. In the HSA/Mesa case, we simply
+    // avoid these instructions.
+    if (Gen == GFX12)
+      return isAmdPalOS();
+    return hasScalarSubwordLoads();
+  }
+
   TrapHandlerAbi getTrapHandlerAbi() const {
     return isAmdHsaOS() ? TrapHandlerAbi::AMDHSA : TrapHandlerAbi::NONE;
   }
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index f3b5e6985e8e0d..5b9bcfe8e39628 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -6430,7 +6430,7 @@ void SITargetLowering::ReplaceNodeResults(SDNode *N,
               MachineMemOperand::MOInvariant,
           VT.getStoreSize(), Alignment);
       SDValue LoadVal;
-      if (!Offset->isDivergent()) {
+      if (!Offset->isDivergent() && Subtarget->hasScalarSubwordBufferLoads()) {
         SDValue Ops[] = {Rsrc, // source register
                          Offset, CachePolicy};
         SDValue BufferLoad =
@@ -8359,52 +8359,57 @@ SDValue SITargetLowering::lowerSBuffer(EVT VT, SDLoc DL, SDValue Rsrc,
           MachineMemOperand::MOInvariant,
       VT.getStoreSize(), Alignment);
 
-  if (!Offset->isDivergent()) {
-    SDValue Ops[] = {Rsrc, Offset, CachePolicy};
-
-    // Lower llvm.amdgcn.s.buffer.load.{i16, u16} intrinsics. Initially, the
-    // s_buffer_load_u16 instruction is emitted for both signed and unsigned
-    // loads. Later, DAG combiner tries to combine s_buffer_load_u16 with sext
-    // and generates s_buffer_load_i16 (performSignExtendInRegCombine).
-    if (VT == MVT::i16 && Subtarget->hasScalarSubwordLoads()) {
-      SDValue BufferLoad =
-          DAG.getMemIntrinsicNode(AMDGPUISD::SBUFFER_LOAD_USHORT, DL,
-                                  DAG.getVTList(MVT::i32), Ops, VT, MMO);
+  // We have a divergent offset. Emit a MUBUF buffer load instead. We can
+  // assume that the buffer is unswizzled.
+  SDValue BufferLoadOps[] = {
+      DAG.getEntryNode(),                    // Chain
+      Rsrc,                                  // rsrc
+      DAG.getConstant(0, DL, MVT::i32),      // vindex
+      {},                                    // voffset
+      {},                                    // soffset
+      {},                                    // offset
+      CachePolicy,                           // cachepolicy
+      DAG.getTargetConstant(0, DL, MVT::i1), // idxen
+  };
+
+  if (VT == MVT::i16 && Subtarget->hasScalarSubwordLoads()) {
+    if (!Offset->isDivergent() && Subtarget->hasScalarSubwordBufferLoads()) {
+      // Lower llvm.amdgcn.s.buffer.load.{i16, u16} intrinsics. Initially, the
+      // s_buffer_load_u16 instruction is emitted for both signed and unsigned
+      // loads. Later, DAG combiner tries to combine s_buffer_load_u16 with sext
+      // and generates s_buffer_load_i16 (performSignExtendInRegCombine).
+      SDValue SBufferLoadOps[] = {Rsrc, Offset, CachePolicy};
+      SDValue BufferLoad = DAG.getMemIntrinsicNode(
+          AMDGPUISD::SBUFFER_LOAD_USHORT, DL, DAG.getVTList(MVT::i32),
+          SBufferLoadOps, VT, MMO);
       return DAG.getNode(ISD::TRUNCATE, DL, VT, BufferLoad);
     }
 
+    // If s_buffer_load_u16/u8 is not supported by the platform (gfx12 when we
+    // cannot ensure the buffer's num-records/stride is not properly aligned)
+    // lower to a buffer_load_u8/u16
+    setBufferOffsets(Offset, DAG, &BufferLoadOps[3], Align(4));
+    return handleByteShortBufferLoads(DAG, VT, DL, BufferLoadOps, MMO);
+  }
+
+  if (!Offset->isDivergent()) {
+    SDValue SBufferLoadOps[] = {Rsrc, Offset, CachePolicy};
+
     // Widen vec3 load to vec4.
     if (VT.isVector() && VT.getVectorNumElements() == 3 &&
         !Subtarget->hasScalarDwordx3Loads()) {
       EVT WidenedVT =
           EVT::getVectorVT(*DAG.getContext(), VT.getVectorElementType(), 4);
       auto WidenedOp = DAG.getMemIntrinsicNode(
-          AMDGPUISD::SBUFFER_LOAD, DL, DAG.getVTList(WidenedVT), Ops, WidenedVT,
-          MF.getMachineMemOperand(MMO, 0, WidenedVT.getStoreSize()));
+          AMDGPUISD::SBUFFER_LOAD, DL, DAG.getVTList(WidenedVT), SBufferLoadOps,
+          WidenedVT, MF.getMachineMemOperand(MMO, 0, WidenedVT.getStoreSize()));
       auto Subvector = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, VT, WidenedOp,
                                    DAG.getVectorIdxConstant(0, DL));
       return Subvector;
     }
 
     return DAG.getMemIntrinsicNode(AMDGPUISD::SBUFFER_LOAD, DL,
-                                   DAG.getVTList(VT), Ops, VT, MMO);
-  }
-
-  // We have a divergent offset. Emit a MUBUF buffer load instead. We can
-  // assume that the buffer is unswizzled.
-  SDValue Ops[] = {
-      DAG.getEntryNode(),                    // Chain
-      Rsrc,                                  // rsrc
-      DAG.getConstant(0, DL, MVT::i32),      // vindex
-      {},                                    // voffset
-      {},                                    // soffset
-      {},                                    // offset
-      CachePolicy,                           // cachepolicy
-      DAG.getTargetConstant(0, DL, MVT::i1), // idxen
-  };
-  if (VT == MVT::i16 && Subtarget->hasScalarSubwordLoads()) {
-    setBufferOffsets(Offset, DAG, &Ops[3], Align(4));
-    return handleByteShortBufferLoads(DAG, VT, DL, Ops, MMO);
+                                   DAG.getVTList(VT), SBufferLoadOps, VT, MMO);
   }
 
   SmallVector<SDValue, 4> Loads;
@@ -8423,14 +8428,14 @@ SDValue SITargetLowering::lowerSBuffer(EVT VT, SDLoc DL, SDValue Rsrc,
 
   // Use the alignment to ensure that the required offsets will fit into the
   // immediate offsets.
-  setBufferOffsets(Offset, DAG, &Ops[3],
+  setBufferOffsets(Offset, DAG, &BufferLoadOps[3],
                    NumLoads > 1 ? Align(16 * NumLoads) : Align(4));
 
-  uint64_t InstOffset = Ops[5]->getAsZExtVal();
+  uint64_t InstOffset = BufferLoadOps[5]->getAsZExtVal();
   for (unsigned i = 0; i < NumLoads; ++i) {
-    Ops[5] = DAG.getTargetConstant(InstOffset + 16 * i, DL, MVT::i32);
-    Loads.push_back(getMemIntrinsicNode(AMDGPUISD::BUFFER_LOAD, DL, VTList, Ops,
-                                        LoadVT, MMO, DAG));
+    BufferLoadOps[5] = DAG.getTargetConstant(InstOffset + 16 * i, DL, MVT::i32);
+    Loads.push_back(getMemIntrinsicNode(AMDGPUISD::BUFFER_LOAD, DL, VTList,
+                                        BufferLoadOps, LoadVT, MMO, DAG));
   }
 
   if (NumElts == 8 || NumElts == 16)
@@ -12672,7 +12677,7 @@ SITargetLowering::performSignExtendInRegCombine(SDNode *N,
         VTSign->getVT() == MVT::i8) ||
        (Src.getOpcode() == AMDGPUISD::SBUFFER_LOAD_USHORT &&
         VTSign->getVT() == MVT::i16))) {
-    assert(Subtarget->hasScalarSubwordLoads() &&
+    assert(Subtarget->hasScalarSubwordBufferLoads() &&
            "s_buffer_load_{u8, i8} are supported "
            "in GFX12 (or newer) architectures.");
     EVT VT = Src.getValueType();
diff --git a/llvm/test/CodeGen/AMDGPU/gfx12_scalar_subword_loads.ll b/llvm/test/CodeGen/AMDGPU/gfx12_scalar_subword_loads.ll
index 020c9dc130bb2a..ad89b1f91143b4 100644
--- a/llvm/test/CodeGen/AMDGPU/gfx12_scalar_subword_loads.ll
+++ b/llvm/test/CodeGen/AMDGPU/gfx12_scalar_subword_loads.ll
@@ -1,6 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1200 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,DAG %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1200 -verify-machineinstrs -global-isel=1 < %s | FileCheck -check-prefixes=GCN,GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1200 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,DAG,DEFAULT %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1200 -verify-machineinstrs -global-isel=1 < %s | FileCheck -check-prefixes=GCN,GISEL,DEFAULT %s
+; RUN: llc -mtriple=amdgcn--amdpal -mcpu=gfx1200 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,DAG,PAL %s
+; RUN: llc -mtriple=amdgcn--amdpal -mcpu=gfx1200 -verify-machineinstrs -global-isel=1 < %s | FileCheck -check-prefixes=GCN,GISEL,PAL,PAL-GISEL %s
 
 define amdgpu_ps void @test_s_load_i8(ptr addrspace(4) inreg %in, ptr addrspace(1) %out) {
 ; GCN-LABEL: test_s_load_i8:
@@ -419,13 +421,20 @@ define amdgpu_ps void @test_s_load_u16_divergent(ptr addrspace(4) inreg %in, i32
 }
 
 define amdgpu_ps void @s_buffer_load_byte_imm_offset(<4 x i32> inreg %src, ptr addrspace(1) nocapture %out) {
-; GCN-LABEL: s_buffer_load_byte_imm_offset:
-; GCN:       ; %bb.0: ; %main_body
-; GCN-NEXT:    s_buffer_load_i8 s0, s[0:3], 0x4
-; GCN-NEXT:    s_wait_kmcnt 0x0
-; GCN-NEXT:    v_mov_b32_e32 v2, s0
-; GCN-NEXT:    global_store_b32 v[0:1], v2, off
-; GCN-NEXT:    s_endpgm
+; DEFAULT-LABEL: s_buffer_load_byte_imm_offset:
+; DEFAULT:       ; %bb.0: ; %main_body
+; DEFAULT-NEXT:    buffer_load_i8 v2, off, s[0:3], null offset:4
+; DEFAULT-NEXT:    s_wait_loadcnt 0x0
+; DEFAULT-NEXT:    global_store_b32 v[0:1], v2, off
+; DEFAULT-NEXT:    s_endpgm
+;
+; PAL-LABEL: s_buffer_load_byte_imm_offset:
+; PAL:       ; %bb.0: ; %main_body
+; PAL-NEXT:    s_buffer_load_i8 s0, s[0:3], 0x4
+; PAL-NEXT:    s_wait_kmcnt 0x0
+; PAL-NEXT:    v_mov_b32_e32 v2, s0
+; PAL-NEXT:    global_store_b32 v[0:1], v2, off
+; PAL-NEXT:    s_endpgm
 main_body:
   %ld = call i8 @llvm.amdgcn.s.buffer.load.i8(<4 x i32> %src, i32 4, i32 0)
   %sext = sext i8 %ld to i32
@@ -434,13 +443,21 @@ main_body:
 }
 
 define amdgpu_ps void @s_buffer_load_byte_sgpr(<4 x i32> inreg %src, ptr addrspace(1) nocapture %out, i32 inreg %offset) {
-; GCN-LABEL: s_buffer_load_byte_sgpr:
-; GCN:       ; %bb.0: ; %main_body
-; GCN-NEXT:    s_buffer_load_i8 s0, s[0:3], s4 offset:0x0
-; GCN-NEXT:    s_wait_kmcnt 0x0
-; GCN-NEXT:    v_mov_b32_e32 v2, s0
-; GCN-NEXT:    global_store_b32 v[0:1], v2, off
-; GCN-NEXT:    s_endpgm
+; DEFAULT-LABEL: s_buffer_load_byte_sgpr:
+; DEFAULT:       ; %bb.0: ; %main_body
+; DEFAULT-NEXT:    v_mov_b32_e32 v2, s4
+; DEFAULT-NEXT:    buffer_load_i8 v2, v2, s[0:3], null offen
+; DEFAULT-NEXT:    s_wait_loadcnt 0x0
+; DEFAULT-NEXT:    global_store_b32 v[0:1], v2, off
+; DEFAULT-NEXT:    s_endpgm
+;
+; PAL-LABEL: s_buffer_load_byte_sgpr:
+; PAL:       ; %bb.0: ; %main_body
+; PAL-NEXT:    s_buffer_load_i8 s0, s[0:3], s4 offset:0x0
+; PAL-NEXT:    s_wait_kmcnt 0x0
+; PAL-NEXT:    v_mov_b32_e32 v2, s0
+; PAL-NEXT:    global_store_b32 v[0:1], v2, off
+; PAL-NEXT:    s_endpgm
 main_body:
   %ld = call i8 @llvm.amdgcn.s.buffer.load.i8(<4 x i32> %src, i32 %offset, i32 0)
   %sext = sext i8 %ld to i32
@@ -449,13 +466,21 @@ main_body:
 }
 
 define amdgpu_ps void @s_buffer_load_byte_sgpr_or_imm_offset(<4 x i32> inreg %src, ptr addrspace(1) nocapture %out, i32 inreg %in) {
-; GCN-LABEL: s_buffer_load_byte_sgpr_or_imm_offset:
-; GCN:       ; %bb.0: ; %main_body
-; GCN-NEXT:    s_buffer_load_i8 s0, s[0:3], s4 offset:0x64
-; GCN-NEXT:    s_wait_kmcnt 0x0
-; GCN-NEXT:    v_mov_b32_e32 v2, s0
-; GCN-NEXT:    global_store_b32 v[0:1], v2, off
-; GCN-NEXT:    s_endpgm
+; DEFAULT-LABEL: s_buffer_load_byte_sgpr_or_imm_offset:
+; DEFAULT:       ; %bb.0: ; %main_body
+; DEFAULT-NEXT:    v_mov_b32_e32 v2, s4
+; DEFAULT-NEXT:    buffer_load_i8 v2, v2, s[0:3], null offen offset:100
+; DEFAULT-NEXT:    s_wait_loadcnt 0x0
+; DEFAULT-NEXT:    global_store_b32 v[0:1], v2, off
+; DEFAULT-NEXT:    s_endpgm
+;
+; PAL-LABEL: s_buffer_load_byte_sgpr_or_imm_offset:
+; PAL:       ; %bb.0: ; %main_body
+; PAL-NEXT:    s_buffer_load_i8 s0, s[0:3], s4 offset:0x64
+; PAL-NEXT:    s_wait_kmcnt 0x0
+; PAL-NEXT:    v_mov_b32_e32 v2, s0
+; PAL-NEXT:    global_store_b32 v[0:1], v2, off
+; PAL-NEXT:    s_endpgm
 main_body:
   %off = add nuw nsw i32 %in, 100
   %ld = call i8 @llvm.amdgcn.s.buffer.load.i8(<4 x i32> %src, i32 %off, i32 0)
@@ -472,12 +497,19 @@ define amdgpu_ps void @s_buffer_load_byte_sgpr_or_imm_offset_divergent(<4 x i32>
 ; DAG-NEXT:    global_store_b32 v[0:1], v2, off
 ; DAG-NEXT:    s_endpgm
 ;
-; GISEL-LABEL: s_buffer_load_byte_sgpr_or_imm_offset_divergent:
-; GISEL:       ; %bb.0: ; %main_body
-; GISEL-NEXT:    buffer_load_b32 v2, v2, s[0:3], null offen
-; GISEL-NEXT:    s_wait_loadcnt 0x0
-; GISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GISEL-NEXT:    s_endpgm
+; DEFAULT-LABEL: s_buffer_load_byte_sgpr_or_imm_offset_divergent:
+; DEFAULT:       ; %bb.0: ; %main_body
+; DEFAULT-NEXT:    buffer_load_i8 v2, v2, s[0:3], null offen
+; DEFAULT-NEXT:    s_wait_loadcnt 0x0
+; DEFAULT-NEXT:    global_store_b32 v[0:1], v2, off
+; DEFAULT-NEXT:    s_endpgm
+;
+; PAL-GISEL-LABEL: s_buffer_load_byte_sgpr_or_imm_offset_divergent:
+; PAL-GISEL:       ; %bb.0: ; %main_body
+; PAL-GISEL-NEXT:    buffer_load_b32 v2, v2, s[0:3], null offen
+; PAL-GISEL-NEXT:    s_wait_loadcnt 0x0
+; PAL-GISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; PAL-GISEL-NEXT:    s_endpgm
 main_body:
   %ld = call i8 @llvm.amdgcn.s.buffer.load.i8(<4 x i32> %src, i32 %offset, i32 0)
   %sext = sext i8 %ld to i32
@@ -486,15 +518,22 @@ main_body:
 }
 
 define amdgpu_ps void @s_buffer_load_ubyte_imm_offset(<4 x i32> inreg %src, ptr addrspace(1) nocapture %out) {
-; GCN-LABEL: s_buffer_load_ubyte_imm_offset:
-; GCN:       ; %bb.0: ; %main_body
-; GCN-NEXT:    s_buffer_load_u8 s0, s[0:3], 0x4
-; GCN-NEXT:    s_wait_kmcnt 0x0
-; GCN-NEXT:    s_and_b32 s0, s0, 0xff
-; GCN-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GCN-NEXT:    v_mov_b32_e32 v2, s0
-; GCN-NEXT:    global_store_b32 v[0:1], v2, off
-; GCN-NEXT:    s_endpgm
+; DEFAULT-LABEL: s_buffer_load_ubyte_imm_offset:
+; DEFAULT:       ; %bb.0: ; %main_body
+; DEFAULT-NEXT:    buffer_load_u8 v2, off, s[0:3], null offset:4
+; DEFAULT-NEXT:    s_wait_loadcnt 0x0
+; DEFAULT-NEXT:    global_store_b32 v[0:1], v2, off
+; DEFAULT-NEXT:    s_endpgm
+;
+; PAL-LABEL: s_buffer_load_ubyte_imm_offset:
+; PAL:       ; %bb.0: ; %main_body
+; PAL-NEXT:    s_buffer_load_u8 s0, s[0:3], 0x4
+; PAL-NEXT:    s_wait_kmcnt 0x0
+; PAL-NEXT:    s_and_b32 s0, s0, 0xff
+; PAL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; PAL-NEXT:    v_mov_b32_e32 v2, s0
+; PAL-NEXT:    global_store_b32 v[0:1], v2, off
+; PAL-NEXT:    s_endpgm
 main_body:
   %ld = call i8 @llvm.amdgcn.s.buffer.load.u8(<4 x i32> %src, i32 4, i32 0)
   %zext = zext i8 %ld to i32
@@ -503,15 +542,23 @@ main_body:
 }
 
 define amdgpu_ps void @s_buffer_load_ubyte_sgpr(<4 x i32> inreg %src, ptr addrspace(1) nocapture %out, i32 inreg %offset) {
-; GCN-LABEL: s_buffer_load_ubyte_sgpr:
-; GCN:       ; %bb.0: ; %main_body
-; GCN-NEXT:    s_buffer_load_u8 s0, s[0:3], s4 offset:0x0
-; GCN-NEXT:    s_wait_kmcnt 0x0
-; GCN-NEXT:    s_and_b32 s0, s0, 0xff
-; GCN-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GCN-NEXT:    v_mov_b32_e32 v2, s0
-; GCN-NEXT:    global_store_b32 v[0:1], v2, off
-; GCN-NEXT:    s_endpgm
+; DEFAULT-LABEL: s_buffer_load_ubyte_sgpr:
+; DEFAULT:       ; %bb.0: ; %main_body
+; DEFAULT-NEXT:    v_mov_b32_e32 v2, s4
+; DEFAULT-NEXT:    buffer_load_u8 v2, v2, s[0:3], null offen
+; DEFAULT-NEXT:    s_wait_loadcnt 0x0
+; DEFAULT-NEXT:    global_store_b32 v[0:1], v2, off
+; DEFAULT-NEXT:    s_endpgm
+;
+; PAL-LABEL: s_buffer_load_ubyte_sgpr:
+; PAL:       ; %bb.0: ; %main_body
+; PAL-NEXT:    s_buffer_load_u8 s0, s[0:3], s4 offset:0x0
+; PAL-NEXT:    s_wait_kmcnt 0x0
+; PAL-NEXT:    s_and_b32 s0, s0, 0xff
+; PAL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; PAL-NEXT:    v_mov_b32_e32 v2, s0
+; PAL-NEXT:    global_store_b32 v[0:1], v2, off
+; PAL-NEXT:    s_endpgm
 main_body:
   %ld = call i8 @llvm.amdgcn.s.buffer.load.u8(<4 x i32> %src, i32 %offset, i32 0)
   %zext = zext i8 %ld to i32
@@ -520,15 +567,23 @@ main_body:
 }
 
 define amdgpu_ps void @s_buffer_load_ubyte_sgpr_or_imm_offset(<4 x i32> inreg %src, ptr addrspace(1) nocapture %out, i32 inreg %in) {
-; GCN-LABEL: s_buffer_load_ubyte_sgpr_or_imm_offset:
-; GCN:       ; %bb.0: ; %main_body
-; GCN-NEXT:    s_buffer_load_u8 s0, s[0:3], s4 offset:0x64
-; GCN-NEXT:    s_wait_kmcnt 0x0
-; GCN-NEXT:    s_and_b32 s0, s0, 0xff
-; GCN-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GCN-NEXT:    v_mov_b32_e32 v2, s0
-; GCN-NEXT:    global_store_b32 v[0:1], v2, off
-; GCN-NEXT:    s_endpgm
+; DEFAULT-LABEL: s_buffer_load_ubyte_sgpr_or_imm_offset:
+; DEFAULT:       ; %bb.0: ; %main_body
+; DEFAULT-NEXT:    v_mov_b32_e32 v2, s4
+; DEFAULT-NEXT:    buffer_load_u8 v2, v2, s[0:3], null offen offset:100
+; DEFAULT-NEXT:    s_wait_loadcnt 0x0
+; DEFAULT-NEXT:    global_store_b32 v[0:1], v2, off
+; DEFAULT-NEXT:    s_endpgm
+;
+; PAL-LABEL: s_buffer_load_ubyte_sgpr_or_imm_offset:
+; PAL:       ; %bb.0: ; %main_body
+; PAL-NEXT:    s_buffer_load_u8 s0, s[0:3], s4 offset:0x64
+; PAL-NEXT:    s_wait_kmcnt 0x0
+; PAL-NEXT:    s_and_b32 s0, s0, 0xff
+; PAL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; PAL-NEXT:    v_mov_b32_e32 v2, s0
+; PAL-NEXT:    global_store_b32 v[0:1], v2, off
+; PAL-NEXT:    s_endpgm
 main_body:
   %off = add nuw nsw i32 %in, 100
   %ld = call i8 @llvm.amdgcn.s.buffer.load.u8(<4 x i32> %src, i32 %off, i32 0)
@@ -545,13 +600,20 @@ define amdgpu_ps void @s_buffer_load_ubyte_sgpr_or_imm_offset_divergent(<4 x i32
 ; DAG-NEXT:    global_store_b32 v[0:1], v2, off
 ; DAG-NEXT:    s_endpgm
 ;
-; GISEL-LABEL: s_buffer_load_ubyte_sgpr_o...
[truncated]

@arsenm
Contributor

arsenm commented Nov 25, 2024

But in that case, it is guaranteed that the buffer's stride/num-records are aligned to 4.

How is this guaranteed? This also doesn't seem like a property that should be baked into the platform

@jmmartinez
Contributor Author

But in that case, it is guaranteed that the buffer's stride/num-records are aligned to 4.

How is this guaranteed? This also doesn't seem like a property that should be baked into the platform

Thanks for asking this; I have never had to deal with an issue like this one before.

In the ticket, the proposed workaround consists of 3 parts:
A) The driver always rounds buffer sizes to a multiple of 4 bytes (this applies to Vulkan)
B) The buffer is a strided (structured) buffer whose stride is known to be a multiple of 4 bytes (this can apply to DX12 structured buffers)
C) Fall back to using buffer_load_[iu]{8,16}

So I assume that the Vulkan and DX drivers would conform to this.

I'm not happy with using the platform for this, but I wasn't sure if adding another option / another function attribute was the way to go.
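
As an illustration of part (A) above, the driver-side rounding amounts to aligning the buffer size up to the next 4-byte boundary. A minimal sketch under that assumption; the helper name is hypothetical and not taken from any driver code.

  #include <cstdint>

  // Round a buffer size up to the next multiple of 4 so that num-records
  // stays 4-byte aligned and the subword s_buffer_load bug cannot trigger.
  static uint32_t roundBufferSizeToDword(uint32_t SizeInBytes) {
    return (SizeInBytes + 3u) & ~3u;
  }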

…g BUFFER_LOAD_XXX

In some tests, code generation diverged between GlobalISel and SelectionDAG.

For example, this intrinsic call

    call i16 @llvm.amdgcn.s.buffer.load.u16(<4 x i32> %src, i32 %offset, i32 0)

would be lowered into these two cases:
* buffer_load_u16 v2, v2, s[0:3], null offen
* buffer_load_b32 v2, v2, s[0:3], null offen

This patch fixes this issue.
@arsenm
Contributor

arsenm commented Nov 26, 2024

A) The driver always rounds buffer sizes to a multiple of 4 bytes (this applies to Vulkan)

User code can also synthesize buffers wherever they like

@shiltian
Contributor

My $0.02 is that this should be considered solely a HW bug (rather than a combination of platform and HW), and the backend should simply stop emitting the corresponding code, regardless of the platform.

@jmmartinez
Contributor Author

A) The driver always rounds buffer sizes to a multiple of 4 bytes (this applies to Vulkan)

User code can also synthesize buffers wherever they like

When you say "user code", do you mean users writing custom shaders, or users of OCL/HIP?

And I guess there is no way of distinguishing one kind of buffer from the other. So in the general case this workaround would never work (unless the users commit to aligning their buffers, which I guess they won't).

@arsenm
Contributor

arsenm commented Nov 26, 2024

The backend should only reason about IR, and we have an intrinsic to produce a buffer descriptor. You could also always produce the buffer descriptor as vector elements.

@jmmartinez
Contributor Author

In that case, I propose not emitting the scalar subword s_buffer_loads for gfx12 by default (on all platforms), unless a -gfx12-force-subscalar-s-buffer-loads option is passed to the backend (as a workaround for SWDEV-498239, and useful for us for testing).

What do you think?
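
A minimal sketch of what such an escape hatch could look like, assuming an llvm::cl::opt wired into the subtarget check. The option name comes from the comment above; the wiring itself is hypothetical and not part of this patch.

  #include "llvm/Support/CommandLine.h"

  using namespace llvm;

  // Off by default: only emit gfx12 scalar subword s_buffer_loads when the
  // user explicitly opts back in (e.g. for testing the SWDEV-498239 path).
  static cl::opt<bool> ForceGfx12SubwordSBufferLoads(
      "gfx12-force-subscalar-s-buffer-loads",
      cl::desc("Emit s_buffer_load_(i/u)(8/16) on gfx12 despite the "
               "stride/num-records alignment hardware bug"),
      cl::init(false), cl::Hidden);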

On gfx12, the s_buffer_load_(i/u)(8/16) instructions have a hardware bug that is triggered when:
* the stride is not a multiple of 4, or
* the stride is 0 and the num-records is not a multiple of 4

For Vulkan and DX, it is guaranteed that the buffer's stride/num-records are
aligned to 4.

This patch prevents the emission of scalar subword loads unless an
option forcing it is passed to the backend.

Solves SWDEV-498239
Collaborator

@nhaehnle nhaehnle left a comment


I fundamentally disagree with this change. We should not do anything here.

llvm.amdgcn.s.buffer.load is part of the family of intrinsics that is specifically designed to expose ISA instructions to whatever sits on top of LLVM. It is the user of the intrinsic who has to know what they're doing. This has always been the case in the past, e.g. the user must be aware that the intrinsic cannot be used with swizzled buffer descriptors.

This change just adds a whole bunch of complexity to the backend while making the design worse, and I already told you as much offline, so it's disappointing to see this here.

The only exception to this rule is that if there is logic in LLVM that creates an 8/16-bit s.buffer.load without the frontend explicitly requesting one -- such code may have to be fixed. But all the changes I'm seeing are of the form "refuse to create an s_buffer_load_[iu]{8,16} instruction even though the frontend has explicitly requested the s.buffer.load intrinsic", and that's just wrong.

@nhaehnle
Collaborator

Also, we should not add more differences between PAL and other environments. If anything, we should strive to reduce them.

@arsenm
Contributor

arsenm commented Dec 3, 2024

llvm.amdgcn.s.buffer.load is part of the family of intrinsics that is specifically designed to expose ISA instructions to whatever sits on top of LLVM. It is the user of the intrinsic who has to know what they're doing. This has always been the case in the past, e.g. the user must be aware that the intrinsic cannot be used with swizzled buffer descriptors.

We should document this better, then. The bugs don't always make it into the public ISA docs, and they are not mentioned in the (limited) intrinsic documentation either.

