Conversation

@shiltian (Contributor) commented Nov 17, 2025

On some targets, a packed f32 instruction can only read 32 bits from a scalar operand (SGPR or literal), replicating those bits to both channels. In this case, we should not fold an immediate value unless it can be reproduced by replicating its lower 32 bits.

Fixes SWDEV-567139.


@llvmbot (Member) commented Nov 17, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Changes

On some targets, a packed f32 instruction can only read 32 bits from a scalar operand (SGPR or literal), replicating those bits to both channels. In this case, we should not fold an immediate value unless it can be reproduced by replicating its lower 32 bits.


Full diff: https://github.com/llvm/llvm-project/pull/168458.diff

5 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.td (+8)
  • (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.h (+6)
  • (modified) llvm/lib/Target/AMDGPU/SIFoldOperands.cpp (+41)
  • (added) llvm/test/CodeGen/AMDGPU/bug-pk-f32-imm-fold.mir (+64)
  • (modified) llvm/test/CodeGen/AMDGPU/packed-fp32.ll (+5-4)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index b008354cfd462..7b795b46885e6 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -1473,6 +1473,13 @@ def FeatureWaitsBeforeSystemScopeStores : SubtargetFeature<
   "Target requires waits for loads and atomics before system scope stores"
 >;
 
+def FeaturePKF32Insts : SubtargetFeature<"pk-f32-insts",
+  "HasPKF32Insts",
+  "true",
+  "Has packed F32 instructions that only read 32 bits from a scalar operand "
+  "(SGPR or literal) and replicate the bits to both channels."
+>;
+
 // Dummy feature used to disable assembler instructions.
 def FeatureDisable : SubtargetFeature<"",
   "FeatureDisable","true",
@@ -2145,6 +2152,7 @@ def FeatureISAVersion12_50 : FeatureSet<
    FeatureXNACK,
    FeatureClusters,
    FeatureD16Writes32BitVgpr,
+   FeaturePKF32Insts,
 ]>;
 
 def FeatureISAVersion12_51 : FeatureSet<
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index f377b8aaf1333..b0fa907cc3715 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -190,6 +190,7 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
   bool HasEmulatedSystemScopeAtomics = false;
   bool HasDefaultComponentBroadcast = false;
   bool HasXF32Insts = false;
+  bool HasPKF32Insts = false;
   /// The maximum number of instructions that may be placed within an S_CLAUSE,
   /// which is one greater than the maximum argument to S_CLAUSE. A value of 0
   /// indicates a lack of S_CLAUSE support.
@@ -1420,6 +1421,11 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
   /// \returns true if the target has instructions with xf32 format support.
   bool hasXF32Insts() const { return HasXF32Insts; }
 
+  /// \returns true if the target has packed f32 instructions that only read 32
+  /// bits from a scalar operand (SGPR or literal) and replicate the bits to
+  /// both channels.
+  bool hasPKF32Insts() const { return HasPKF32Insts; }
+
   bool hasBitOp3Insts() const { return HasBitOp3Insts; }
 
   bool hasPermlane16Swap() const { return HasPermlane16Swap; }
diff --git a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
index 18d90346d1d88..cf5e25bd9e0f8 100644
--- a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
@@ -89,6 +89,11 @@ struct FoldableDef {
 
   bool isImm() const { return Kind == MachineOperand::MO_Immediate; }
 
+  uint64_t getImm() const {
+    assert(isImm());
+    return ImmToFold;
+  }
+
   bool isFI() const {
     return Kind == MachineOperand::MO_FrameIndex;
   }
@@ -766,6 +771,34 @@ static void appendFoldCandidate(SmallVectorImpl<FoldCandidate> &FoldList,
                       FoldCandidate(MI, OpNo, FoldOp, Commuted, ShrinkOp));
 }
 
+// Returns true if the instruction is a packed f32 instruction that only reads
+// 32 bits from a scalar operand (SGPR or literal) and replicates the bits to
+// both channels.
+static bool isPKF32Instr(const GCNSubtarget *ST, MachineInstr *MI) {
+  if (!ST->hasPKF32Insts())
+    return false;
+  switch (MI->getOpcode()) {
+  case AMDGPU::V_PK_ADD_F32:
+  case AMDGPU::V_PK_MUL_F32:
+  case AMDGPU::V_PK_FMA_F32:
+    return true;
+  default:
+    return false;
+  }
+  llvm_unreachable("unknown instruction");
+}
+
+// Packed FP32 instructions only read 32 bits from a scalar operand (SGPR or
+// literal) and replicate the bits to both channels. Therefore, if the high
+// and low 32 bits of the immediate are not the same, we can't fold it.
+static bool checkImmOpForPKF32Instr(const FoldableDef &OpToFold) {
+  assert(OpToFold.isImm() && "Expected immediate operand");
+  uint64_t ImmVal = OpToFold.getImm();
+  uint32_t Lo = Lo_32(ImmVal);
+  uint32_t Hi = Hi_32(ImmVal);
+  return Lo == Hi;
+}
+
 bool SIFoldOperandsImpl::tryAddToFoldList(
     SmallVectorImpl<FoldCandidate> &FoldList, MachineInstr *MI, unsigned OpNo,
     const FoldableDef &OpToFold) const {
@@ -919,6 +952,12 @@ bool SIFoldOperandsImpl::tryAddToFoldList(
       return true;
   }
 
+  // Special case for PK_F32 instructions if we are trying to fold an imm to
+  // src0 or src1.
+  if (OpToFold.isImm() && isPKF32Instr(ST, MI) &&
+      !checkImmOpForPKF32Instr(OpToFold))
+    return false;
+
   appendFoldCandidate(FoldList, MI, OpNo, OpToFold);
   return true;
 }
@@ -1134,6 +1173,8 @@ bool SIFoldOperandsImpl::tryToFoldACImm(
     return false;
 
   if (OpToFold.isImm() && OpToFold.isOperandLegal(*TII, *UseMI, UseOpIdx)) {
+    if (isPKF32Instr(ST, UseMI) && !checkImmOpForPKF32Instr(OpToFold))
+      return false;
     appendFoldCandidate(FoldList, UseMI, UseOpIdx, OpToFold);
     return true;
   }
diff --git a/llvm/test/CodeGen/AMDGPU/bug-pk-f32-imm-fold.mir b/llvm/test/CodeGen/AMDGPU/bug-pk-f32-imm-fold.mir
new file mode 100644
index 0000000000000..a725fab5e40c1
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/bug-pk-f32-imm-fold.mir
@@ -0,0 +1,64 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=amdgcn-amd-hsa -mcpu=gfx1250 -run-pass=si-fold-operands -o - %s | FileCheck %s
+
+---
+name: pk_add_f32_imm_fold
+body: |
+  bb.0.entry:
+    liveins: $sgpr0_sgpr1
+
+    ; CHECK-LABEL: name: pk_add_f32_imm_fold
+    ; CHECK: liveins: $sgpr0_sgpr1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:vreg_64_align2 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[V_MOV_B:%[0-9]+]]:vreg_64_align2 = V_MOV_B64_PSEUDO 1065353216, implicit $exec
+    ; CHECK-NEXT: [[V_PK_ADD_F32_:%[0-9]+]]:vreg_64_align2 = V_PK_ADD_F32 11, [[DEF]], 8, [[V_MOV_B]], 0, 0, 0, 0, 0, implicit $mode, implicit $exec
+    ; CHECK-NEXT: S_ENDPGM 0
+    %0:vreg_64_align2 = IMPLICIT_DEF
+    %1:sreg_64 = S_MOV_B64 1065353216
+    %2:vreg_64_align2 = COPY killed %1
+    %3:vreg_64_align2 = V_PK_ADD_F32 11, %0, 8, %2, 0, 0, 0, 0, 0, implicit $mode, implicit $exec
+    S_ENDPGM 0
+...
+
+---
+name: pk_mul_f32_imm_fold
+body: |
+  bb.0.entry:
+    liveins: $sgpr0_sgpr1
+
+    ; CHECK-LABEL: name: pk_mul_f32_imm_fold
+    ; CHECK: liveins: $sgpr0_sgpr1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:vreg_64_align2 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[V_MOV_B:%[0-9]+]]:vreg_64_align2 = V_MOV_B64_PSEUDO 1065353216, implicit $exec
+    ; CHECK-NEXT: [[V_PK_MUL_F32_:%[0-9]+]]:vreg_64_align2 = V_PK_MUL_F32 11, [[DEF]], 8, [[V_MOV_B]], 0, 0, 0, 0, 0, implicit $mode, implicit $exec
+    ; CHECK-NEXT: S_ENDPGM 0
+    %0:vreg_64_align2 = IMPLICIT_DEF
+    %1:sreg_64 = S_MOV_B64 1065353216
+    %2:vreg_64_align2 = COPY killed %1
+    %3:vreg_64_align2 = V_PK_MUL_F32 11, %0, 8, %2, 0, 0, 0, 0, 0, implicit $mode, implicit $exec
+    S_ENDPGM 0
+...
+
+---
+name: pk_fma_f32_imm_fold
+body: |
+  bb.0.entry:
+    liveins: $sgpr0_sgpr1
+
+    ; CHECK-LABEL: name: pk_fma_f32_imm_fold
+    ; CHECK: liveins: $sgpr0_sgpr1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:vreg_64_align2 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF1:%[0-9]+]]:vreg_64_align2 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[V_MOV_B:%[0-9]+]]:vreg_64_align2 = V_MOV_B64_PSEUDO 1065353216, implicit $exec
+    ; CHECK-NEXT: [[V_PK_FMA_F32_:%[0-9]+]]:vreg_64_align2 = V_PK_FMA_F32 0, [[DEF]], 8, [[DEF1]], 11, [[V_MOV_B]], 0, 0, 0, 0, 0, implicit $mode, implicit $exec
+    ; CHECK-NEXT: S_ENDPGM 0
+    %0:vreg_64_align2 = IMPLICIT_DEF
+    %1:vreg_64_align2 = IMPLICIT_DEF
+    %2:sreg_64 = S_MOV_B64 1065353216
+    %3:vreg_64_align2 = COPY killed %2
+    %4:vreg_64_align2 = V_PK_FMA_F32 0, %0, 8, %1, 11, %3, 0, 0, 0, 0, 0, implicit $mode, implicit $exec
+    S_ENDPGM 0
+...
diff --git a/llvm/test/CodeGen/AMDGPU/packed-fp32.ll b/llvm/test/CodeGen/AMDGPU/packed-fp32.ll
index 1177474f5b4f5..6b45d31da0e95 100644
--- a/llvm/test/CodeGen/AMDGPU/packed-fp32.ll
+++ b/llvm/test/CodeGen/AMDGPU/packed-fp32.ll
@@ -732,12 +732,13 @@ define amdgpu_kernel void @fadd_v2_v_lit_hi0(ptr addrspace(1) %a) {
 ; GFX1250-SDAG-LABEL: fadd_v2_v_lit_hi0:
 ; GFX1250-SDAG:       ; %bb.0:
 ; GFX1250-SDAG-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1250-SDAG-NEXT:    v_and_b32_e32 v2, 0x3ff, v0
+; GFX1250-SDAG-NEXT:    v_and_b32_e32 v4, 0x3ff, v0
+; GFX1250-SDAG-NEXT:    v_mov_b64_e32 v[2:3], 0x3f800000
 ; GFX1250-SDAG-NEXT:    s_wait_kmcnt 0x0
-; GFX1250-SDAG-NEXT:    global_load_b64 v[0:1], v2, s[0:1] scale_offset
+; GFX1250-SDAG-NEXT:    global_load_b64 v[0:1], v4, s[0:1] scale_offset
 ; GFX1250-SDAG-NEXT:    s_wait_loadcnt 0x0
-; GFX1250-SDAG-NEXT:    v_pk_add_f32 v[0:1], v[0:1], 1.0
-; GFX1250-SDAG-NEXT:    global_store_b64 v2, v[0:1], s[0:1] scale_offset
+; GFX1250-SDAG-NEXT:    v_pk_add_f32 v[0:1], v[0:1], v[2:3]
+; GFX1250-SDAG-NEXT:    global_store_b64 v4, v[0:1], s[0:1] scale_offset
 ; GFX1250-SDAG-NEXT:    s_endpgm
 ;
 ; GFX1250-GISEL-LABEL: fadd_v2_v_lit_hi0:

if (!ST->hasPKF32Insts())
return false;
switch (MI->getOpcode()) {
case AMDGPU::V_PK_ADD_F32:
Contributor:

Should be able to get this from the operand type, not the opcode

Author (@shiltian):

But not all v2f32 instructions have this issue. Only those with OPF_PK_F32 flag.

Collaborator:

There are only these 3 of them anyway?

Collaborator:

If you pass an operand here, I believe only these 3 instructions will have OPERAND_REG_IMM_V2FP32 type.

Author (@shiltian):

but then we need to get MCInstrDesc, and then MCOperandInfo, but eventually what's the point?

Author (@shiltian):

And the logic still doesn't add up here. The operand of the instruction is not OPERAND_REG_IMM_V2FP32 type, because of the move here.

Collaborator:

It is in the MCInstrDesc regardless of an actual operand. The point is not to forget real instructions, just in case. I do not insist though.

Contributor:

That's not remotely difficult?

Author (@shiltian) commented Nov 18, 2025:

Say MI is the PK_FP32 instruction: %4:vreg_64_align2 = V_PK_FMA_F32 0, %0, 8, %1, 11, %3, 0, 0, 0, 0, 0, implicit $mode, implicit $exec. Then if we use MCOperandInfo, it'd be something like (assuming OpNo corresponds to %3):

const MCOperandInfo &OpDesc = MI->getDesc().operands()[OpNo];
if (ST->hasPKF32InstsReplicatingLow32BitsOfScalarInput() && OpDesc.OperandType == AMDGPU::OPERAND_REG_IMM_V2FP32)
  ...

Will OpDesc.OperandType be AMDGPU::OPERAND_REG_IMM_V2FP32 here? Also, will the condition help narrow down to the specific three instructions?

Collaborator:

Yes, it does not depend on the MachineInstr. It takes the info from MCInstrDesc.

@github-actions:

🐧 Linux x64 Test Results

  • 186262 tests passed
  • 4852 tests skipped

@shiltian shiltian requested review from arsenm and rampitec November 18, 2025 15:43
@shiltian shiltian force-pushed the users/shiltian/dont-fold-imm-op-if-not-correct branch from 5d07f7f to df56434 Compare November 18, 2025 19:37
@shiltian shiltian force-pushed the users/shiltian/dont-fold-imm-op-if-not-correct branch from d770308 to 69ff0dd Compare November 18, 2025 20:24
…from its lower 32-bit

On some targets, a packed f32 instruction can only read 32 bits from a scalar operand (SGPR or literal), replicating those bits to both channels. In this case, we should not fold an immediate value unless it can be reproduced by replicating its lower 32 bits.
@shiltian shiltian force-pushed the users/shiltian/dont-fold-imm-op-if-not-correct branch from 69ff0dd to bb03359 Compare November 18, 2025 20:28
Collaborator @rampitec left a comment:

LGTM

@shiltian shiltian merged commit 6665642 into main Nov 18, 2025
8 of 9 checks passed
@shiltian shiltian deleted the users/shiltian/dont-fold-imm-op-if-not-correct branch November 18, 2025 22:11
shiltian added a commit that referenced this pull request Nov 19, 2025
shiltian added a commit that referenced this pull request Nov 19, 2025