Conversation

@jmmartinez
Contributor

On gfx12, the s_buffer_load_(i/u)(8/16) instructions have a hardware bug that is triggered when:

  • the stride is not a multiple of 4, or
  • the stride is 0 and the num-records is not a multiple of 4

At the moment, these instructions are only generated for PAL.
But in that case, it is guaranteed that the buffer's stride/num-records are
aligned to 4.

This patch restricts the emission of scalar subword loads to PAL, where
the bug would never be triggered, and avoids them in HSA (where the bug
could be triggered, but the intrinsic is not used).

Solves SWDEV-498239
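
For reference, the trigger condition above can be summarized as a predicate over the buffer descriptor fields. This is a minimal sketch for illustration only; the function name and the Stride/NumRecords parameters are assumptions, not part of the patch.

  #include <cstdint>

  // Returns true when a gfx12 s_buffer_load_(i/u)(8/16) would hit the
  // hardware bug for a buffer with the given stride and num-records.
  static bool gfx12SubwordSBufferLoadIsBroken(uint32_t Stride,
                                              uint32_t NumRecords) {
    if (Stride != 0)
      return Stride % 4 != 0;   // stride is not a multiple of 4
    return NumRecords % 4 != 0; // stride is 0, num-records not a multiple of 4
  }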

@jmmartinez jmmartinez requested review from arsenm and jayfoad November 25, 2024 16:40
@jmmartinez jmmartinez self-assigned this Nov 25, 2024
@llvmbot
Member

llvmbot commented Nov 25, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Juan Manuel Martinez Caamaño (jmmartinez)

Changes

On gfx12, the s_buffer_load_(i/u)(8/16) instructions have a hardware bug that is triggered when:

  • the stride is not a multiple of 4, or
  • the stride is 0 and the num-records is not a multiple of 4

At the moment, these instructions are only generated for PAL.
But in that case, it is guaranteed that the buffer's stride/num-records are
aligned to 4.

This patch restricts the emission of scalar subword loads to PAL, where
the bug would never be triggered, and avoids them in HSA (where the bug
could be triggered, but the intrinsic is not used).

Solves SWDEV-498239


Patch is 30.82 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/117576.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp (+29-9)
  • (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.h (+15)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+43-38)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx12_scalar_subword_loads.ll (+246-124)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index 9bf1f281c32a09..bf60ae32b46108 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -6803,8 +6803,36 @@ bool AMDGPULegalizerInfo::legalizeSBufferLoad(LegalizerHelper &Helper,
   unsigned Size = Ty.getSizeInBits();
   MachineFunction &MF = B.getMF();
   unsigned Opc = 0;
+
+  const unsigned MemSize = (Size + 7) / 8;
+  const Align MemAlign = B.getDataLayout().getABITypeAlign(
+      getTypeForLLT(Ty, MF.getFunction().getContext()));
+
+  // FIXME: When intrinsic definition is fixed, this should have an MMO already.
+  MachineMemOperand *MMO = MF.getMachineMemOperand(
+      MachinePointerInfo(),
+      MachineMemOperand::MOLoad | MachineMemOperand::MODereferenceable |
+          MachineMemOperand::MOInvariant,
+      MemSize, MemAlign);
+
   if (Size < 32 && ST.hasScalarSubwordLoads()) {
     assert(Size == 8 || Size == 16);
+    if (!ST.hasScalarSubwordBufferLoads()) {
+      // fallback to S_BUFFER_LOAD_UBYTE/USHORT
+      MI.getOperand(1).setIntrinsicID(Intrinsic::amdgcn_raw_buffer_load);
+
+      Register ZeroReg =
+          B.getMRI()->createGenericVirtualRegister(LLT::scalar(32));
+      B.buildConstant(ZeroReg, 0);
+
+      MI.insert(MI.operands_begin() + 4,
+                {MachineOperand::CreateReg(ZeroReg, false)});
+
+      MI.addMemOperand(MF, MMO);
+      Observer.changedInstr(MI);
+      return true;
+    }
+
     Opc = Size == 8 ? AMDGPU::G_AMDGPU_S_BUFFER_LOAD_UBYTE
                     : AMDGPU::G_AMDGPU_S_BUFFER_LOAD_USHORT;
     // The 8-bit and 16-bit scalar buffer load instructions have 32-bit
@@ -6834,16 +6862,8 @@ bool AMDGPULegalizerInfo::legalizeSBufferLoad(LegalizerHelper &Helper,
   MI.setDesc(B.getTII().get(Opc));
   MI.removeOperand(1); // Remove intrinsic ID
 
-  // FIXME: When intrinsic definition is fixed, this should have an MMO already.
-  const unsigned MemSize = (Size + 7) / 8;
-  const Align MemAlign = B.getDataLayout().getABITypeAlign(
-      getTypeForLLT(Ty, MF.getFunction().getContext()));
-  MachineMemOperand *MMO = MF.getMachineMemOperand(
-      MachinePointerInfo(),
-      MachineMemOperand::MOLoad | MachineMemOperand::MODereferenceable |
-          MachineMemOperand::MOInvariant,
-      MemSize, MemAlign);
   MI.addMemOperand(MF, MMO);
+
   if (Dst != OrigDst) {
     MI.getOperand(0).setReg(Dst);
     B.setInsertPt(B.getMBB(), ++B.getInsertPt());
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index 18219174b16b1e..15d67d478465d6 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -460,6 +460,21 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
 
   bool hasScalarSubwordLoads() const { return getGeneration() >= GFX12; }
 
+  bool hasScalarSubwordBufferLoads() const {
+    Generation Gen = getGeneration();
+
+    // On gfx12, s_buffer_load_(i/u)(8/16) have a hw-bug that is triggered when:
+    // * the stride is not a multiple of 4, or
+    // * the stride is 0 and the num-records is not a multiple of 4
+    // At the moment, llvm.amdgcn.s.buffer.loads instruction are only generated
+    // for PAL by LLPC. In this case, it is guaranteed that the buffers
+    // stride/num-records are aligned to 4. In the HSA/Mesa case, we simply
+    // avoid these instructions.
+    if (Gen == GFX12)
+      return isAmdPalOS();
+    return hasScalarSubwordLoads();
+  }
+
   TrapHandlerAbi getTrapHandlerAbi() const {
     return isAmdHsaOS() ? TrapHandlerAbi::AMDHSA : TrapHandlerAbi::NONE;
   }
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index f3b5e6985e8e0d..5b9bcfe8e39628 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -6430,7 +6430,7 @@ void SITargetLowering::ReplaceNodeResults(SDNode *N,
               MachineMemOperand::MOInvariant,
           VT.getStoreSize(), Alignment);
       SDValue LoadVal;
-      if (!Offset->isDivergent()) {
+      if (!Offset->isDivergent() && Subtarget->hasScalarSubwordBufferLoads()) {
         SDValue Ops[] = {Rsrc, // source register
                          Offset, CachePolicy};
         SDValue BufferLoad =
@@ -8359,52 +8359,57 @@ SDValue SITargetLowering::lowerSBuffer(EVT VT, SDLoc DL, SDValue Rsrc,
           MachineMemOperand::MOInvariant,
       VT.getStoreSize(), Alignment);
 
-  if (!Offset->isDivergent()) {
-    SDValue Ops[] = {Rsrc, Offset, CachePolicy};
-
-    // Lower llvm.amdgcn.s.buffer.load.{i16, u16} intrinsics. Initially, the
-    // s_buffer_load_u16 instruction is emitted for both signed and unsigned
-    // loads. Later, DAG combiner tries to combine s_buffer_load_u16 with sext
-    // and generates s_buffer_load_i16 (performSignExtendInRegCombine).
-    if (VT == MVT::i16 && Subtarget->hasScalarSubwordLoads()) {
-      SDValue BufferLoad =
-          DAG.getMemIntrinsicNode(AMDGPUISD::SBUFFER_LOAD_USHORT, DL,
-                                  DAG.getVTList(MVT::i32), Ops, VT, MMO);
+  // We have a divergent offset. Emit a MUBUF buffer load instead. We can
+  // assume that the buffer is unswizzled.
+  SDValue BufferLoadOps[] = {
+      DAG.getEntryNode(),                    // Chain
+      Rsrc,                                  // rsrc
+      DAG.getConstant(0, DL, MVT::i32),      // vindex
+      {},                                    // voffset
+      {},                                    // soffset
+      {},                                    // offset
+      CachePolicy,                           // cachepolicy
+      DAG.getTargetConstant(0, DL, MVT::i1), // idxen
+  };
+
+  if (VT == MVT::i16 && Subtarget->hasScalarSubwordLoads()) {
+    if (!Offset->isDivergent() && Subtarget->hasScalarSubwordBufferLoads()) {
+      // Lower llvm.amdgcn.s.buffer.load.{i16, u16} intrinsics. Initially, the
+      // s_buffer_load_u16 instruction is emitted for both signed and unsigned
+      // loads. Later, DAG combiner tries to combine s_buffer_load_u16 with sext
+      // and generates s_buffer_load_i16 (performSignExtendInRegCombine).
+      SDValue SBufferLoadOps[] = {Rsrc, Offset, CachePolicy};
+      SDValue BufferLoad = DAG.getMemIntrinsicNode(
+          AMDGPUISD::SBUFFER_LOAD_USHORT, DL, DAG.getVTList(MVT::i32),
+          SBufferLoadOps, VT, MMO);
       return DAG.getNode(ISD::TRUNCATE, DL, VT, BufferLoad);
     }
 
+    // If s_buffer_load_u16/u8 is not supported by the platform (gfx12 when we
+    // cannot ensure the buffer's num-records/stride is not properly aligned)
+    // lower to a buffer_load_u8/u16
+    setBufferOffsets(Offset, DAG, &BufferLoadOps[3], Align(4));
+    return handleByteShortBufferLoads(DAG, VT, DL, BufferLoadOps, MMO);
+  }
+
+  if (!Offset->isDivergent()) {
+    SDValue SBufferLoadOps[] = {Rsrc, Offset, CachePolicy};
+
     // Widen vec3 load to vec4.
     if (VT.isVector() && VT.getVectorNumElements() == 3 &&
         !Subtarget->hasScalarDwordx3Loads()) {
       EVT WidenedVT =
           EVT::getVectorVT(*DAG.getContext(), VT.getVectorElementType(), 4);
       auto WidenedOp = DAG.getMemIntrinsicNode(
-          AMDGPUISD::SBUFFER_LOAD, DL, DAG.getVTList(WidenedVT), Ops, WidenedVT,
-          MF.getMachineMemOperand(MMO, 0, WidenedVT.getStoreSize()));
+          AMDGPUISD::SBUFFER_LOAD, DL, DAG.getVTList(WidenedVT), SBufferLoadOps,
+          WidenedVT, MF.getMachineMemOperand(MMO, 0, WidenedVT.getStoreSize()));
       auto Subvector = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, VT, WidenedOp,
                                    DAG.getVectorIdxConstant(0, DL));
       return Subvector;
     }
 
     return DAG.getMemIntrinsicNode(AMDGPUISD::SBUFFER_LOAD, DL,
-                                   DAG.getVTList(VT), Ops, VT, MMO);
-  }
-
-  // We have a divergent offset. Emit a MUBUF buffer load instead. We can
-  // assume that the buffer is unswizzled.
-  SDValue Ops[] = {
-      DAG.getEntryNode(),                    // Chain
-      Rsrc,                                  // rsrc
-      DAG.getConstant(0, DL, MVT::i32),      // vindex
-      {},                                    // voffset
-      {},                                    // soffset
-      {},                                    // offset
-      CachePolicy,                           // cachepolicy
-      DAG.getTargetConstant(0, DL, MVT::i1), // idxen
-  };
-  if (VT == MVT::i16 && Subtarget->hasScalarSubwordLoads()) {
-    setBufferOffsets(Offset, DAG, &Ops[3], Align(4));
-    return handleByteShortBufferLoads(DAG, VT, DL, Ops, MMO);
+                                   DAG.getVTList(VT), SBufferLoadOps, VT, MMO);
   }
 
   SmallVector<SDValue, 4> Loads;
@@ -8423,14 +8428,14 @@ SDValue SITargetLowering::lowerSBuffer(EVT VT, SDLoc DL, SDValue Rsrc,
 
   // Use the alignment to ensure that the required offsets will fit into the
   // immediate offsets.
-  setBufferOffsets(Offset, DAG, &Ops[3],
+  setBufferOffsets(Offset, DAG, &BufferLoadOps[3],
                    NumLoads > 1 ? Align(16 * NumLoads) : Align(4));
 
-  uint64_t InstOffset = Ops[5]->getAsZExtVal();
+  uint64_t InstOffset = BufferLoadOps[5]->getAsZExtVal();
   for (unsigned i = 0; i < NumLoads; ++i) {
-    Ops[5] = DAG.getTargetConstant(InstOffset + 16 * i, DL, MVT::i32);
-    Loads.push_back(getMemIntrinsicNode(AMDGPUISD::BUFFER_LOAD, DL, VTList, Ops,
-                                        LoadVT, MMO, DAG));
+    BufferLoadOps[5] = DAG.getTargetConstant(InstOffset + 16 * i, DL, MVT::i32);
+    Loads.push_back(getMemIntrinsicNode(AMDGPUISD::BUFFER_LOAD, DL, VTList,
+                                        BufferLoadOps, LoadVT, MMO, DAG));
   }
 
   if (NumElts == 8 || NumElts == 16)
@@ -12672,7 +12677,7 @@ SITargetLowering::performSignExtendInRegCombine(SDNode *N,
         VTSign->getVT() == MVT::i8) ||
        (Src.getOpcode() == AMDGPUISD::SBUFFER_LOAD_USHORT &&
         VTSign->getVT() == MVT::i16))) {
-    assert(Subtarget->hasScalarSubwordLoads() &&
+    assert(Subtarget->hasScalarSubwordBufferLoads() &&
            "s_buffer_load_{u8, i8} are supported "
            "in GFX12 (or newer) architectures.");
     EVT VT = Src.getValueType();
diff --git a/llvm/test/CodeGen/AMDGPU/gfx12_scalar_subword_loads.ll b/llvm/test/CodeGen/AMDGPU/gfx12_scalar_subword_loads.ll
index 020c9dc130bb2a..ad89b1f91143b4 100644
--- a/llvm/test/CodeGen/AMDGPU/gfx12_scalar_subword_loads.ll
+++ b/llvm/test/CodeGen/AMDGPU/gfx12_scalar_subword_loads.ll
@@ -1,6 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1200 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,DAG %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1200 -verify-machineinstrs -global-isel=1 < %s | FileCheck -check-prefixes=GCN,GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1200 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,DAG,DEFAULT %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1200 -verify-machineinstrs -global-isel=1 < %s | FileCheck -check-prefixes=GCN,GISEL,DEFAULT %s
+; RUN: llc -mtriple=amdgcn--amdpal -mcpu=gfx1200 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,DAG,PAL %s
+; RUN: llc -mtriple=amdgcn--amdpal -mcpu=gfx1200 -verify-machineinstrs -global-isel=1 < %s | FileCheck -check-prefixes=GCN,GISEL,PAL,PAL-GISEL %s
 
 define amdgpu_ps void @test_s_load_i8(ptr addrspace(4) inreg %in, ptr addrspace(1) %out) {
 ; GCN-LABEL: test_s_load_i8:
@@ -419,13 +421,20 @@ define amdgpu_ps void @test_s_load_u16_divergent(ptr addrspace(4) inreg %in, i32
 }
 
 define amdgpu_ps void @s_buffer_load_byte_imm_offset(<4 x i32> inreg %src, ptr addrspace(1) nocapture %out) {
-; GCN-LABEL: s_buffer_load_byte_imm_offset:
-; GCN:       ; %bb.0: ; %main_body
-; GCN-NEXT:    s_buffer_load_i8 s0, s[0:3], 0x4
-; GCN-NEXT:    s_wait_kmcnt 0x0
-; GCN-NEXT:    v_mov_b32_e32 v2, s0
-; GCN-NEXT:    global_store_b32 v[0:1], v2, off
-; GCN-NEXT:    s_endpgm
+; DEFAULT-LABEL: s_buffer_load_byte_imm_offset:
+; DEFAULT:       ; %bb.0: ; %main_body
+; DEFAULT-NEXT:    buffer_load_i8 v2, off, s[0:3], null offset:4
+; DEFAULT-NEXT:    s_wait_loadcnt 0x0
+; DEFAULT-NEXT:    global_store_b32 v[0:1], v2, off
+; DEFAULT-NEXT:    s_endpgm
+;
+; PAL-LABEL: s_buffer_load_byte_imm_offset:
+; PAL:       ; %bb.0: ; %main_body
+; PAL-NEXT:    s_buffer_load_i8 s0, s[0:3], 0x4
+; PAL-NEXT:    s_wait_kmcnt 0x0
+; PAL-NEXT:    v_mov_b32_e32 v2, s0
+; PAL-NEXT:    global_store_b32 v[0:1], v2, off
+; PAL-NEXT:    s_endpgm
 main_body:
   %ld = call i8 @llvm.amdgcn.s.buffer.load.i8(<4 x i32> %src, i32 4, i32 0)
   %sext = sext i8 %ld to i32
@@ -434,13 +443,21 @@ main_body:
 }
 
 define amdgpu_ps void @s_buffer_load_byte_sgpr(<4 x i32> inreg %src, ptr addrspace(1) nocapture %out, i32 inreg %offset) {
-; GCN-LABEL: s_buffer_load_byte_sgpr:
-; GCN:       ; %bb.0: ; %main_body
-; GCN-NEXT:    s_buffer_load_i8 s0, s[0:3], s4 offset:0x0
-; GCN-NEXT:    s_wait_kmcnt 0x0
-; GCN-NEXT:    v_mov_b32_e32 v2, s0
-; GCN-NEXT:    global_store_b32 v[0:1], v2, off
-; GCN-NEXT:    s_endpgm
+; DEFAULT-LABEL: s_buffer_load_byte_sgpr:
+; DEFAULT:       ; %bb.0: ; %main_body
+; DEFAULT-NEXT:    v_mov_b32_e32 v2, s4
+; DEFAULT-NEXT:    buffer_load_i8 v2, v2, s[0:3], null offen
+; DEFAULT-NEXT:    s_wait_loadcnt 0x0
+; DEFAULT-NEXT:    global_store_b32 v[0:1], v2, off
+; DEFAULT-NEXT:    s_endpgm
+;
+; PAL-LABEL: s_buffer_load_byte_sgpr:
+; PAL:       ; %bb.0: ; %main_body
+; PAL-NEXT:    s_buffer_load_i8 s0, s[0:3], s4 offset:0x0
+; PAL-NEXT:    s_wait_kmcnt 0x0
+; PAL-NEXT:    v_mov_b32_e32 v2, s0
+; PAL-NEXT:    global_store_b32 v[0:1], v2, off
+; PAL-NEXT:    s_endpgm
 main_body:
   %ld = call i8 @llvm.amdgcn.s.buffer.load.i8(<4 x i32> %src, i32 %offset, i32 0)
   %sext = sext i8 %ld to i32
@@ -449,13 +466,21 @@ main_body:
 }
 
 define amdgpu_ps void @s_buffer_load_byte_sgpr_or_imm_offset(<4 x i32> inreg %src, ptr addrspace(1) nocapture %out, i32 inreg %in) {
-; GCN-LABEL: s_buffer_load_byte_sgpr_or_imm_offset:
-; GCN:       ; %bb.0: ; %main_body
-; GCN-NEXT:    s_buffer_load_i8 s0, s[0:3], s4 offset:0x64
-; GCN-NEXT:    s_wait_kmcnt 0x0
-; GCN-NEXT:    v_mov_b32_e32 v2, s0
-; GCN-NEXT:    global_store_b32 v[0:1], v2, off
-; GCN-NEXT:    s_endpgm
+; DEFAULT-LABEL: s_buffer_load_byte_sgpr_or_imm_offset:
+; DEFAULT:       ; %bb.0: ; %main_body
+; DEFAULT-NEXT:    v_mov_b32_e32 v2, s4
+; DEFAULT-NEXT:    buffer_load_i8 v2, v2, s[0:3], null offen offset:100
+; DEFAULT-NEXT:    s_wait_loadcnt 0x0
+; DEFAULT-NEXT:    global_store_b32 v[0:1], v2, off
+; DEFAULT-NEXT:    s_endpgm
+;
+; PAL-LABEL: s_buffer_load_byte_sgpr_or_imm_offset:
+; PAL:       ; %bb.0: ; %main_body
+; PAL-NEXT:    s_buffer_load_i8 s0, s[0:3], s4 offset:0x64
+; PAL-NEXT:    s_wait_kmcnt 0x0
+; PAL-NEXT:    v_mov_b32_e32 v2, s0
+; PAL-NEXT:    global_store_b32 v[0:1], v2, off
+; PAL-NEXT:    s_endpgm
 main_body:
   %off = add nuw nsw i32 %in, 100
   %ld = call i8 @llvm.amdgcn.s.buffer.load.i8(<4 x i32> %src, i32 %off, i32 0)
@@ -472,12 +497,19 @@ define amdgpu_ps void @s_buffer_load_byte_sgpr_or_imm_offset_divergent(<4 x i32>
 ; DAG-NEXT:    global_store_b32 v[0:1], v2, off
 ; DAG-NEXT:    s_endpgm
 ;
-; GISEL-LABEL: s_buffer_load_byte_sgpr_or_imm_offset_divergent:
-; GISEL:       ; %bb.0: ; %main_body
-; GISEL-NEXT:    buffer_load_b32 v2, v2, s[0:3], null offen
-; GISEL-NEXT:    s_wait_loadcnt 0x0
-; GISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GISEL-NEXT:    s_endpgm
+; DEFAULT-LABEL: s_buffer_load_byte_sgpr_or_imm_offset_divergent:
+; DEFAULT:       ; %bb.0: ; %main_body
+; DEFAULT-NEXT:    buffer_load_i8 v2, v2, s[0:3], null offen
+; DEFAULT-NEXT:    s_wait_loadcnt 0x0
+; DEFAULT-NEXT:    global_store_b32 v[0:1], v2, off
+; DEFAULT-NEXT:    s_endpgm
+;
+; PAL-GISEL-LABEL: s_buffer_load_byte_sgpr_or_imm_offset_divergent:
+; PAL-GISEL:       ; %bb.0: ; %main_body
+; PAL-GISEL-NEXT:    buffer_load_b32 v2, v2, s[0:3], null offen
+; PAL-GISEL-NEXT:    s_wait_loadcnt 0x0
+; PAL-GISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; PAL-GISEL-NEXT:    s_endpgm
 main_body:
   %ld = call i8 @llvm.amdgcn.s.buffer.load.i8(<4 x i32> %src, i32 %offset, i32 0)
   %sext = sext i8 %ld to i32
@@ -486,15 +518,22 @@ main_body:
 }
 
 define amdgpu_ps void @s_buffer_load_ubyte_imm_offset(<4 x i32> inreg %src, ptr addrspace(1) nocapture %out) {
-; GCN-LABEL: s_buffer_load_ubyte_imm_offset:
-; GCN:       ; %bb.0: ; %main_body
-; GCN-NEXT:    s_buffer_load_u8 s0, s[0:3], 0x4
-; GCN-NEXT:    s_wait_kmcnt 0x0
-; GCN-NEXT:    s_and_b32 s0, s0, 0xff
-; GCN-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GCN-NEXT:    v_mov_b32_e32 v2, s0
-; GCN-NEXT:    global_store_b32 v[0:1], v2, off
-; GCN-NEXT:    s_endpgm
+; DEFAULT-LABEL: s_buffer_load_ubyte_imm_offset:
+; DEFAULT:       ; %bb.0: ; %main_body
+; DEFAULT-NEXT:    buffer_load_u8 v2, off, s[0:3], null offset:4
+; DEFAULT-NEXT:    s_wait_loadcnt 0x0
+; DEFAULT-NEXT:    global_store_b32 v[0:1], v2, off
+; DEFAULT-NEXT:    s_endpgm
+;
+; PAL-LABEL: s_buffer_load_ubyte_imm_offset:
+; PAL:       ; %bb.0: ; %main_body
+; PAL-NEXT:    s_buffer_load_u8 s0, s[0:3], 0x4
+; PAL-NEXT:    s_wait_kmcnt 0x0
+; PAL-NEXT:    s_and_b32 s0, s0, 0xff
+; PAL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; PAL-NEXT:    v_mov_b32_e32 v2, s0
+; PAL-NEXT:    global_store_b32 v[0:1], v2, off
+; PAL-NEXT:    s_endpgm
 main_body:
   %ld = call i8 @llvm.amdgcn.s.buffer.load.u8(<4 x i32> %src, i32 4, i32 0)
   %zext = zext i8 %ld to i32
@@ -503,15 +542,23 @@ main_body:
 }
 
 define amdgpu_ps void @s_buffer_load_ubyte_sgpr(<4 x i32> inreg %src, ptr addrspace(1) nocapture %out, i32 inreg %offset) {
-; GCN-LABEL: s_buffer_load_ubyte_sgpr:
-; GCN:       ; %bb.0: ; %main_body
-; GCN-NEXT:    s_buffer_load_u8 s0, s[0:3], s4 offset:0x0
-; GCN-NEXT:    s_wait_kmcnt 0x0
-; GCN-NEXT:    s_and_b32 s0, s0, 0xff
-; GCN-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GCN-NEXT:    v_mov_b32_e32 v2, s0
-; GCN-NEXT:    global_store_b32 v[0:1], v2, off
-; GCN-NEXT:    s_endpgm
+; DEFAULT-LABEL: s_buffer_load_ubyte_sgpr:
+; DEFAULT:       ; %bb.0: ; %main_body
+; DEFAULT-NEXT:    v_mov_b32_e32 v2, s4
+; DEFAULT-NEXT:    buffer_load_u8 v2, v2, s[0:3], null offen
+; DEFAULT-NEXT:    s_wait_loadcnt 0x0
+; DEFAULT-NEXT:    global_store_b32 v[0:1], v2, off
+; DEFAULT-NEXT:    s_endpgm
+;
+; PAL-LABEL: s_buffer_load_ubyte_sgpr:
+; PAL:       ; %bb.0: ; %main_body
+; PAL-NEXT:    s_buffer_load_u8 s0, s[0:3], s4 offset:0x0
+; PAL-NEXT:    s_wait_kmcnt 0x0
+; PAL-NEXT:    s_and_b32 s0, s0, 0xff
+; PAL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; PAL-NEXT:    v_mov_b32_e32 v2, s0
+; PAL-NEXT:    global_store_b32 v[0:1], v2, off
+; PAL-NEXT:    s_endpgm
 main_body:
   %ld = call i8 @llvm.amdgcn.s.buffer.load.u8(<4 x i32> %src, i32 %offset, i32 0)
   %zext = zext i8 %ld to i32
@@ -520,15 +567,23 @@ main_body:
 }
 
 define amdgpu_ps void @s_buffer_load_ubyte_sgpr_or_imm_offset(<4 x i32> inreg %src, ptr addrspace(1) nocapture %out, i32 inreg %in) {
-; GCN-LABEL: s_buffer_load_ubyte_sgpr_or_imm_offset:
-; GCN:       ; %bb.0: ; %main_body
-; GCN-NEXT:    s_buffer_load_u8 s0, s[0:3], s4 offset:0x64
-; GCN-NEXT:    s_wait_kmcnt 0x0
-; GCN-NEXT:    s_and_b32 s0, s0, 0xff
-; GCN-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GCN-NEXT:    v_mov_b32_e32 v2, s0
-; GCN-NEXT:    global_store_b32 v[0:1], v2, off
-; GCN-NEXT:    s_endpgm
+; DEFAULT-LABEL: s_buffer_load_ubyte_sgpr_or_imm_offset:
+; DEFAULT:       ; %bb.0: ; %main_body
+; DEFAULT-NEXT:    v_mov_b32_e32 v2, s4
+; DEFAULT-NEXT:    buffer_load_u8 v2, v2, s[0:3], null offen offset:100
+; DEFAULT-NEXT:    s_wait_loadcnt 0x0
+; DEFAULT-NEXT:    global_store_b32 v[0:1], v2, off
+; DEFAULT-NEXT:    s_endpgm
+;
+; PAL-LABEL: s_buffer_load_ubyte_sgpr_or_imm_offset:
+; PAL:       ; %bb.0: ; %main_body
+; PAL-NEXT:    s_buffer_load_u8 s0, s[0:3], s4 offset:0x64
+; PAL-NEXT:    s_wait_kmcnt 0x0
+; PAL-NEXT:    s_and_b32 s0, s0, 0xff
+; PAL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; PAL-NEXT:    v_mov_b32_e32 v2, s0
+; PAL-NEXT:    global_store_b32 v[0:1], v2, off
+; PAL-NEXT:    s_endpgm
 main_body:
   %off = add nuw nsw i32 %in, 100
   %ld = call i8 @llvm.amdgcn.s.buffer.load.u8(<4 x i32> %src, i32 %off, i32 0)
@@ -545,13 +600,20 @@ define amdgpu_ps void @s_buffer_load_ubyte_sgpr_or_imm_offset_divergent(<4 x i32
 ; DAG-NEXT:    global_store_b32 v[0:1], v2, off
 ; DAG-NEXT:    s_endpgm
 ;
-; GISEL-LABEL: s_buffer_load_ubyte_sgpr_o...
[truncated]

@arsenm
Contributor

arsenm commented Nov 25, 2024

But in that case, it is guaranteed that the buffer's stride/num-records are aligned to 4.

How is this guaranteed? This also doesn't seem like a property that should be baked into the platform

@jmmartinez
Contributor Author

But in that case, it is guaranteed that the buffer's stride/num-records are aligned to 4.

How is this guaranteed? This also doesn't seem like a property that should be baked into the platform

Thanks for asking this; I have never had to deal with an issue like this one before.

In the ticket, the proposed workaround consists of 3 parts:
A) The driver always rounds buffer sizes to a multiple of 4 bytes (this applies to Vulkan)
B) The buffer is a strided (structured) buffer whose stride is known to be a multiple of 4 bytes (this can apply to DX12 structured buffers)
C) Fall back to using buffer_load_[iu]{8,16}

So I assume that the Vulkan and DX drivers would conform to this.

I'm not happy with using the platform for this, but I wasn't sure if adding another option / another function attribute was the way to go.
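
As an illustration of part (A) above, the driver-side rounding amounts to aligning the buffer size up to the next 4-byte boundary. A minimal sketch under that assumption; the helper name is hypothetical and not taken from any driver code.

  #include <cstdint>

  // Round a buffer size up to the next multiple of 4 so that num-records
  // stays 4-byte aligned and the subword s_buffer_load bug cannot trigger.
  static uint32_t roundBufferSizeToDword(uint32_t SizeInBytes) {
    return (SizeInBytes + 3u) & ~3u;
  }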

…g BUFFER_LOAD_XXX

In some tests, code generation diverged between GlobalISel and SelectionDAG.

For example, this intrinsic call

    call i16 @llvm.amdgcn.s.buffer.load.u16(<4 x i32> %src, i32 %offset, i32 0)

would be lowered into these two cases:
* buffer_load_u16 v2, v2, s[0:3], null offen
* buffer_load_b32 v2, v2, s[0:3], null offen

This patch fixes this issue.
@arsenm
Contributor

arsenm commented Nov 26, 2024

A) The driver always rounds buffer sizes to a multiple of 4 bytes (this applies to Vulkan)

User code can also synthesize buffers wherever they like

@shiltian
Contributor

My $0.02 is that this should be considered solely a HW bug (rather than a combination of platform and HW), and the backend should simply stop emitting the corresponding code, regardless of the platform.

@jmmartinez
Contributor Author

A) The driver always rounds buffer sizes to a multiple of 4 bytes (this applies to Vulkan)

User code can also synthesize buffers wherever they like

When you say "user code", do you mean users writing custom shaders, or users of OCL/HIP?

And I guess there is no way of distinguishing one kind of buffer from the other. So in the general case this workaround would never work (unless the users commit to aligning their buffers, which I guess they won't).

@arsenm
Contributor

arsenm commented Nov 26, 2024

The backend should only reason about IR, and we have an intrinsic to produce a buffer descriptor. You could also always produce the buffer descriptor as vector elements.

@jmmartinez
Contributor Author

In that case, I propose not emitting the scalar subword s_buffer_loads for gfx12 by default (on all platforms), unless a -gfx12-force-subscalar-s-buffer-loads option is passed to the backend (as a workaround for SWDEV-498239, and useful for us for testing).

What do you think?
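
A minimal sketch of what such an escape hatch could look like, assuming an llvm::cl::opt wired into the subtarget check. The option name comes from the comment above; the wiring itself is hypothetical and not part of this patch.

  #include "llvm/Support/CommandLine.h"

  using namespace llvm;

  // Off by default: only emit gfx12 scalar subword s_buffer_loads when the
  // user explicitly opts back in (e.g. for testing the SWDEV-498239 path).
  static cl::opt<bool> ForceGfx12SubwordSBufferLoads(
      "gfx12-force-subscalar-s-buffer-loads",
      cl::desc("Emit s_buffer_load_(i/u)(8/16) on gfx12 despite the "
               "stride/num-records alignment hardware bug"),
      cl::init(false), cl::Hidden);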

On gfx12, the s_buffer_load_(i/u)(8/16) instructions have a hardware bug that is triggered when:
* the stride is not a multiple of 4, or
* the stride is 0 and the num-records is not a multiple of 4

For Vulkan and DX, it is guaranteed that the buffer's stride/num-records are
aligned to 4.

This patch prevents the emission of scalar subword loads unless an
option forcing it is passed to the backend.

Solves SWDEV-498239
Collaborator

@nhaehnle nhaehnle left a comment


I fundamentally disagree with this change. We should not do anything here.

llvm.amdgcn.s.buffer.load is part of the family of intrinsics that is specifically designed to expose ISA instructions to whatever sits on top of LLVM. It is the user of the intrinsic who has to know what they're doing. This has always been the case in the past, e.g. the user must be aware that the intrinsic cannot be used with swizzled buffer descriptors.

This change just adds a whole bunch of complexity to the backend while making the design worse, and I already told you as much offline, so it's disappointing to see this here.

The only exception to this rule is that if there is logic in LLVM that creates an 8/16-bit s.buffer.load without the frontend explicitly requesting one -- such code may have to be fixed. But all the changes I'm seeing are of the form "refuse to create an s_buffer_load_[iu]{8,16} instruction even though the frontend has explicitly requested the s.buffer.load intrinsic", and that's just wrong.

@nhaehnle
Collaborator

Also, we should not add more differences between PAL and other environments. If anything, we should strive to reduce them.

@arsenm
Contributor

arsenm commented Dec 3, 2024

llvm.amdgcn.s.buffer.load is part of the family of intrinsics that is specifically designed to expose ISA instructions to whatever sits on top of LLVM. It is the user of the intrinsic who has to know what they're doing. This has always been the case in the past, e.g. the user must be aware that the intrinsic cannot be used with swizzled buffer descriptors.

We should document this better, then. The bugs don't always make it into the public ISA docs, and they are not mentioned in the (limited) intrinsic documentation either.

