Conversation

@JoshHuttonCode
Contributor

@JoshHuttonCode JoshHuttonCode commented Jul 28, 2025

Allow the folding of copies with subreg operands if they come from a REG_SEQUENCE and the REG_SEQUENCE use was not a subreg. Added tests for various edge cases that came up during development.

This PR addresses the following issue: #125950

@github-actions

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository; in that case, you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "pinging" the PR with a comment containing "Ping". The common courtesy ping rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Jul 28, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: None (JoshHuttonCode)

Changes

When a REG_SEQUENCE has one or more VGPR operands and we are looping through them to insert V_READFIRSTLANE_B32, first check to see if the ultimate source of what is being copied is a virtual SGPR. If it is, just use the SGPR source directly.

This PR addresses the following issue: #125950
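
A condensed illustration of the intended effect, distilled from the v_readfirstlane_b32_omitted MIR test added below (virtual register names are shortened for readability; this is a sketch, not verbatim pass output):

    ; input to si-fix-sgpr-copies
    %v:vgpr_32  = COPY $vgpr0
    %s:sreg_32  = IMPLICIT_DEF
    %rs:sreg_64 = REG_SEQUENCE %v, %subreg.sub0, killed %s, %subreg.sub1
    %qm:sreg_64 = S_QUADMASK_B64 %rs

    ; output with this patch: only the genuine VGPR lane is read back;
    ; the lane whose ultimate source is the SGPR %s reuses %s directly
    %vc:vgpr_32  = COPY %s
    %vrs:vreg_64 = REG_SEQUENCE %v, %subreg.sub0, killed %vc, %subreg.sub1
    %rfl:sgpr_32 = V_READFIRSTLANE_B32 %vrs.sub0, implicit $exec
    %srs:sreg_64 = REG_SEQUENCE %rfl, %subreg.sub0, %s, %subreg.sub1
    %qm:sreg_64  = S_QUADMASK_B64 %srs

Without the patch, a V_READFIRSTLANE_B32 is emitted for both lanes, including the one that already originates from an SGPR.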


Full diff: https://github.com/llvm/llvm-project/pull/151033.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+98-6)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.h (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/constant-address-space-32bit.ll (+1-1)
  • (added) llvm/test/CodeGen/AMDGPU/optimize-away-v_readfirstlane_b32-on-sgpr-source.mir (+271)
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 2aa6b4e82f9d5..72d2d7300ec6d 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -6382,6 +6382,67 @@ void SIInstrInfo::legalizeOperandsVOP3(MachineRegisterInfo &MRI,
     legalizeOpWithMove(MI, VOP3Idx[2]);
 }
 
+// Recursively check to see if the ultimate source of a readfirstlane is SGPR.
+// If it is, readfirstlane can be omitted, and the source of the value can be
+// used directly.
+Register SIInstrInfo::checkIsSourceSGPR(const MachineOperand &MO,
+                                        const MachineRegisterInfo &MRI,
+                                        const SIRegisterInfo *TRI,
+                                        int MaxDepth) const {
+  if (MaxDepth == 0)
+    return Register();
+
+  Register PotentialSGPR = MO.getReg();
+
+  // While we could return a physical SGPR source, we would need to guarantee it
+  // has not been redefined.
+  if (PotentialSGPR.isPhysical())
+    return Register();
+
+  assert(MRI.hasOneDef(PotentialSGPR));
+
+  MachineInstr *MI = MRI.getVRegDef(PotentialSGPR);
+  auto MIOpc = MI->getOpcode();
+  const TargetRegisterClass *RC = MRI.getRegClass(PotentialSGPR);
+
+  if (RI.hasSGPRs(RC))
+    return PotentialSGPR;
+
+  switch (MIOpc) {
+  case AMDGPU::COPY: {
+    MachineOperand CopySource = MI->getOperand(1);
+    return checkIsSourceSGPR(CopySource, MRI, TRI, MaxDepth - 1);
+  }
+  case AMDGPU::REG_SEQUENCE: {
+    unsigned SubRegToFind = MO.getSubReg();
+    unsigned SubRegOperandIndex = 2;
+    unsigned CopySourceIndex = 0;
+
+    // Since subregs may be listed out of order, we need to
+    // loop over operands to find the subreg we are looking for.
+    while (SubRegOperandIndex < MI->getNumOperands()) {
+      assert(MI->isOperandSubregIdx(SubRegOperandIndex));
+
+      unsigned SubRegIndex = MI->getOperand(SubRegOperandIndex).getImm();
+      if (SubRegIndex == SubRegToFind) {
+        CopySourceIndex = SubRegOperandIndex - 1;
+        break;
+      }
+
+      SubRegOperandIndex += 2;
+    }
+
+    if (SubRegOperandIndex >= MI->getNumOperands())
+      return Register();
+
+    MachineOperand CopySource = MI->getOperand(CopySourceIndex);
+    return checkIsSourceSGPR(CopySource, MRI, TRI, MaxDepth - 1);
+  }
+  default:
+    return Register();
+  }
+}
+
 Register SIInstrInfo::readlaneVGPRToSGPR(
     Register SrcReg, MachineInstr &UseMI, MachineRegisterInfo &MRI,
     const TargetRegisterClass *DstRC /*=nullptr*/) const {
@@ -6410,12 +6471,43 @@ Register SIInstrInfo::readlaneVGPRToSGPR(
   }
 
   SmallVector<Register, 8> SRegs;
-  for (unsigned i = 0; i < SubRegs; ++i) {
-    Register SGPR = MRI.createVirtualRegister(&AMDGPU::SGPR_32RegClass);
-    BuildMI(*UseMI.getParent(), UseMI, UseMI.getDebugLoc(),
-            get(AMDGPU::V_READFIRSTLANE_B32), SGPR)
-        .addReg(SrcReg, 0, RI.getSubRegFromChannel(i));
-    SRegs.push_back(SGPR);
+  const MachineInstr *CopySourceInstr = MRI.getVRegDef(SrcReg);
+
+  if (CopySourceInstr->getOpcode() != AMDGPU::REG_SEQUENCE) {
+    for (unsigned i = 0; i < SubRegs; ++i) {
+      Register SGPR = MRI.createVirtualRegister(&AMDGPU::SGPR_32RegClass);
+      BuildMI(*UseMI.getParent(), UseMI, UseMI.getDebugLoc(),
+              get(AMDGPU::V_READFIRSTLANE_B32), SGPR)
+          .addReg(SrcReg, 0, RI.getSubRegFromChannel(i));
+      SRegs.push_back(SGPR);
+    }
+  } else {
+    SRegs.resize(SubRegs);
+    const SIRegisterInfo *TRI = ST.getRegisterInfo();
+
+    for (unsigned i = 0; i < SubRegs; ++i) {
+      unsigned SubregOperandIndex = 2 + 2 * i;
+      assert(SubregOperandIndex < CopySourceInstr->getNumOperands());
+      assert(CopySourceInstr->isOperandSubregIdx(SubregOperandIndex));
+
+      MachineOperand RegSeqSrcOperand =
+          CopySourceInstr->getOperand(SubregOperandIndex - 1);
+      Register SGPRSource = checkIsSourceSGPR(RegSeqSrcOperand, MRI, TRI);
+
+      unsigned SubRegIndex =
+          CopySourceInstr->getOperand(SubregOperandIndex).getImm();
+      unsigned SubRegChannel = RI.getChannelFromSubReg(SubRegIndex);
+
+      if (SGPRSource) {
+        SRegs[SubRegChannel] = SGPRSource;
+      } else {
+        Register SGPR = MRI.createVirtualRegister(&AMDGPU::SGPR_32RegClass);
+        BuildMI(*UseMI.getParent(), UseMI, UseMI.getDebugLoc(),
+                get(AMDGPU::V_READFIRSTLANE_B32), SGPR)
+            .addReg(SrcReg, 0, SubRegIndex);
+        SRegs[SubRegChannel] = SGPR;
+      }
+    }
   }
 
   MachineInstrBuilder MIB =
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index e042b59eb0f04..3e2c681805a3f 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -186,6 +186,10 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
 
   bool resultDependsOnExec(const MachineInstr &MI) const;
 
+  Register checkIsSourceSGPR(const MachineOperand &MO,
+                             const MachineRegisterInfo &MRI,
+                             const SIRegisterInfo *TRI, int MaxDepth = 6) const;
+
 protected:
   /// If the specific machine instruction is a instruction that moves/copies
   /// value from one register to another register return destination and source
diff --git a/llvm/test/CodeGen/AMDGPU/constant-address-space-32bit.ll b/llvm/test/CodeGen/AMDGPU/constant-address-space-32bit.ll
index 52ccfe8ba3bfb..384e1c5a4cc50 100644
--- a/llvm/test/CodeGen/AMDGPU/constant-address-space-32bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/constant-address-space-32bit.ll
@@ -300,8 +300,8 @@ define amdgpu_vs float @load_addr_no_fold(ptr addrspace(6) inreg noalias %p0) #0
 }
 
 ; GCN-LABEL: {{^}}vgpr_arg_src:
-; GCN: v_readfirstlane_b32 s[[READLANE:[0-9]+]], v0
 ; GCN: s_mov_b32 s[[ZERO:[0-9]+]]
+; GCN: v_readfirstlane_b32 s[[READLANE:[0-9]+]], v0
 ; GCN: s_load_dwordx4 s{{\[[0-9]+:[0-9]+\]}}, s[[[READLANE]]:[[ZERO]]]
 define amdgpu_vs float @vgpr_arg_src(ptr addrspace(6) %arg) {
 main_body:
diff --git a/llvm/test/CodeGen/AMDGPU/optimize-away-v_readfirstlane_b32-on-sgpr-source.mir b/llvm/test/CodeGen/AMDGPU/optimize-away-v_readfirstlane_b32-on-sgpr-source.mir
new file mode 100644
index 0000000000000..4eff26fccddd1
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/optimize-away-v_readfirstlane_b32-on-sgpr-source.mir
@@ -0,0 +1,271 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -march=amdgcn -mcpu=gfx900 -run-pass=si-fix-sgpr-copies -o -  %s | FileCheck %s
+---
+name: v_readfirstlane_b32_omitted
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0
+
+    ; CHECK-LABEL: name: v_readfirstlane_b32_omitted
+    ; CHECK: liveins: $vgpr0
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF1:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[DEF]]
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, killed [[COPY1]], %subreg.sub1
+    ; CHECK-NEXT: [[V_READFIRSTLANE_B32_:%[0-9]+]]:sgpr_32 = V_READFIRSTLANE_B32 [[REG_SEQUENCE]].sub0, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE1:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[V_READFIRSTLANE_B32_]], %subreg.sub0, [[DEF]], %subreg.sub1
+    ; CHECK-NEXT: [[S_QUADMASK_B64_:%[0-9]+]]:sreg_64 = S_QUADMASK_B64 [[REG_SEQUENCE1]]
+    ; CHECK-NEXT: S_ENDPGM 0
+    %0:vgpr_32 = COPY $vgpr0
+    %1:sreg_32 = IMPLICIT_DEF
+    %2:sreg_64 = REG_SEQUENCE %0, %subreg.sub0, killed %1, %subreg.sub1
+    %3:sreg_64 = S_QUADMASK_B64 %2
+    S_ENDPGM 0
+
+...
+
+---
+name: v_readfirstlane_b32_omitted_switched_subregs
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0
+
+    ; CHECK-LABEL: name: v_readfirstlane_b32_omitted_switched_subregs
+    ; CHECK: liveins: $vgpr0
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF1:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[DEF]]
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64 = REG_SEQUENCE [[COPY]], %subreg.sub1, killed [[COPY1]], %subreg.sub0
+    ; CHECK-NEXT: [[V_READFIRSTLANE_B32_:%[0-9]+]]:sgpr_32 = V_READFIRSTLANE_B32 [[REG_SEQUENCE]].sub1, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE1:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[DEF]], %subreg.sub0, [[V_READFIRSTLANE_B32_]], %subreg.sub1
+    ; CHECK-NEXT: [[S_QUADMASK_B64_:%[0-9]+]]:sreg_64 = S_QUADMASK_B64 killed [[REG_SEQUENCE1]]
+    ; CHECK-NEXT: S_ENDPGM 0
+    %0:vgpr_32 = COPY $vgpr0
+    %1:sreg_32 = IMPLICIT_DEF
+    %2:sreg_64 = REG_SEQUENCE %0, %subreg.sub1, killed %1, %subreg.sub0
+    %3:sreg_64 = S_QUADMASK_B64 killed %2
+    S_ENDPGM 0
+
+...
+
+---
+name: v_readfirstlane_b32_phys_vgpr_and_sgpr
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0, $sgpr0
+
+    ; CHECK-LABEL: name: v_readfirstlane_b32_phys_vgpr_and_sgpr
+    ; CHECK: liveins: $vgpr0, $sgpr0
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:sreg_32 = COPY $sgpr0
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:vgpr_32 = COPY [[COPY1]], implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, killed [[COPY2]], %subreg.sub1
+    ; CHECK-NEXT: [[V_READFIRSTLANE_B32_:%[0-9]+]]:sgpr_32 = V_READFIRSTLANE_B32 [[REG_SEQUENCE]].sub0, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE1:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[V_READFIRSTLANE_B32_]], %subreg.sub0, [[COPY1]], %subreg.sub1
+    ; CHECK-NEXT: [[S_QUADMASK_B64_:%[0-9]+]]:sreg_64 = S_QUADMASK_B64 [[REG_SEQUENCE1]]
+    ; CHECK-NEXT: S_ENDPGM 0
+    %0:vgpr_32 = COPY $vgpr0
+    %1:sreg_32 = COPY $sgpr0
+    %2:sreg_64 = REG_SEQUENCE %0, %subreg.sub0, killed %1, %subreg.sub1
+    %3:sreg_64 = S_QUADMASK_B64 %2
+    S_ENDPGM 0
+
+...
+
+---
+name: v_readfirstlane_b32_both_vgpr
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0, $vgpr1
+
+    ; CHECK-LABEL: name: v_readfirstlane_b32_both_vgpr
+    ; CHECK: liveins: $vgpr0, $vgpr1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr1
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF1:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF2:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
+    ; CHECK-NEXT: [[V_READFIRSTLANE_B32_:%[0-9]+]]:sgpr_32 = V_READFIRSTLANE_B32 [[REG_SEQUENCE]].sub0, implicit $exec
+    ; CHECK-NEXT: [[V_READFIRSTLANE_B32_1:%[0-9]+]]:sgpr_32 = V_READFIRSTLANE_B32 [[REG_SEQUENCE]].sub1, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE1:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[V_READFIRSTLANE_B32_]], %subreg.sub0, [[V_READFIRSTLANE_B32_1]], %subreg.sub1
+    ; CHECK-NEXT: [[S_QUADMASK_B64_:%[0-9]+]]:sreg_64 = S_QUADMASK_B64 [[REG_SEQUENCE1]]
+    ; CHECK-NEXT: S_ENDPGM 0
+    %0:vgpr_32 = COPY $vgpr0
+    %1:vgpr_32 = COPY $vgpr1
+    %2:sreg_32 = IMPLICIT_DEF
+    %3:sreg_64 = REG_SEQUENCE %0, %subreg.sub0, killed %1, %subreg.sub1
+    %4:sreg_64 = S_QUADMASK_B64 %3
+    S_ENDPGM 0
+
+...
+
+---
+name: v_readfirstlane_b32_both_sgpr
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $sgpr0, $sgpr1
+
+    ; CHECK-LABEL: name: v_readfirstlane_b32_both_sgpr
+    ; CHECK: liveins: $sgpr0, $sgpr1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:sgpr_32 = COPY $sgpr0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:sgpr_32 = COPY $sgpr1
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, killed [[COPY1]], %subreg.sub1
+    ; CHECK-NEXT: [[S_QUADMASK_B64_:%[0-9]+]]:sreg_64 = S_QUADMASK_B64 [[REG_SEQUENCE]]
+    ; CHECK-NEXT: S_ENDPGM 0
+    %0:sgpr_32 = COPY $sgpr0
+    %1:sgpr_32 = COPY $sgpr1
+    %2:sreg_32 = IMPLICIT_DEF
+    %3:sreg_64 = REG_SEQUENCE %0, %subreg.sub0, killed %1, %subreg.sub1
+    %4:sreg_64 = S_QUADMASK_B64 %3
+    S_ENDPGM 0
+
+...
+
+---
+name: v_readfirstlane_b32_recursive_look_through_copies_and_reg_sequence_for_vgpr
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0
+
+    ; CHECK-LABEL: name: v_readfirstlane_b32_recursive_look_through_copies_and_reg_sequence_for_vgpr
+    ; CHECK: liveins: $vgpr0
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[COPY]]
+    ; CHECK-NEXT: [[DEF1:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF2:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF3:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:vgpr_32 = COPY [[DEF]]
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_128 = REG_SEQUENCE killed [[COPY2]], %subreg.sub1, [[COPY1]], %subreg.sub0, [[COPY1]], %subreg.sub2, [[COPY1]], %subreg.sub3
+    ; CHECK-NEXT: [[REG_SEQUENCE1:%[0-9]+]]:vreg_64 = REG_SEQUENCE killed [[REG_SEQUENCE]].sub3, %subreg.sub1, [[REG_SEQUENCE]].sub1, %subreg.sub0
+    ; CHECK-NEXT: [[V_READFIRSTLANE_B32_:%[0-9]+]]:sgpr_32 = V_READFIRSTLANE_B32 [[REG_SEQUENCE1]].sub1, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE2:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[DEF]], %subreg.sub0, [[V_READFIRSTLANE_B32_]], %subreg.sub1
+    ; CHECK-NEXT: [[S_QUADMASK_B64_:%[0-9]+]]:sreg_64 = S_QUADMASK_B64 killed [[REG_SEQUENCE2]]
+    ; CHECK-NEXT: S_ENDPGM 0
+    %0:vgpr_32 = COPY $vgpr0
+    %1:sreg_32 = IMPLICIT_DEF
+    %2:vgpr_32 = COPY %0
+    %3:sgpr_128 = REG_SEQUENCE killed %1, %subreg.sub1, killed %2, %subreg.sub0, killed %2, %subreg.sub2, killed %2, %subreg.sub3
+    %4:sreg_64 = REG_SEQUENCE killed %3.sub3, %subreg.sub1, %3.sub1, %subreg.sub0
+    %5:sreg_64 = S_QUADMASK_B64 killed %4
+    S_ENDPGM 0
+
+...
+
+---
+name: v_readfirstlane_b32_recursive_look_through_copies_and_reg_sequence_for_sgpr
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $sgpr0
+
+    ; CHECK-LABEL: name: v_readfirstlane_b32_recursive_look_through_copies_and_reg_sequence_for_sgpr
+    ; CHECK: liveins: $sgpr0
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:sgpr_32 = COPY $sgpr0
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:sgpr_32 = COPY [[COPY]]
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:sgpr_128 = REG_SEQUENCE killed [[DEF]], %subreg.sub1, killed [[COPY1]], %subreg.sub0, killed [[COPY1]], %subreg.sub2, killed [[COPY1]], %subreg.sub3
+    ; CHECK-NEXT: [[REG_SEQUENCE1:%[0-9]+]]:sreg_64 = REG_SEQUENCE killed [[REG_SEQUENCE]].sub3, %subreg.sub1, [[REG_SEQUENCE]].sub1, %subreg.sub0
+    ; CHECK-NEXT: [[S_QUADMASK_B64_:%[0-9]+]]:sreg_64 = S_QUADMASK_B64 killed [[REG_SEQUENCE1]]
+    ; CHECK-NEXT: S_ENDPGM 0
+    %0:sgpr_32 = COPY $sgpr0
+    %1:sreg_32 = IMPLICIT_DEF
+    %2:sgpr_32 = COPY %0
+    %3:sgpr_128 = REG_SEQUENCE killed %1, %subreg.sub1, killed %2, %subreg.sub0, killed %2, %subreg.sub2, killed %2, %subreg.sub3
+    %4:sreg_64 = REG_SEQUENCE killed %3.sub3, %subreg.sub1, %3.sub1, %subreg.sub0
+    %5:sreg_64 = S_QUADMASK_B64 killed %4
+    S_ENDPGM 0
+
+...
+
+---
+name: v_readfirstlane_b32_undef_subreg
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0, $vgpr1, $vgpr2
+
+    ; CHECK-LABEL: name: v_readfirstlane_b32_undef_subreg
+    ; CHECK: liveins: $vgpr0, $vgpr1, $vgpr2
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr2
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr1
+    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF1:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF2:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF3:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:vgpr_32 = COPY [[DEF]]
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_128 = REG_SEQUENCE killed [[COPY3]], %subreg.sub1, [[COPY1]], %subreg.sub0, [[COPY]], %subreg.sub2
+    ; CHECK-NEXT: [[REG_SEQUENCE1:%[0-9]+]]:vreg_64 = REG_SEQUENCE killed [[REG_SEQUENCE]].sub3, %subreg.sub1, [[REG_SEQUENCE]].sub1, %subreg.sub0
+    ; CHECK-NEXT: [[V_READFIRSTLANE_B32_:%[0-9]+]]:sgpr_32 = V_READFIRSTLANE_B32 [[REG_SEQUENCE1]].sub1, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE2:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[DEF]], %subreg.sub0, [[V_READFIRSTLANE_B32_]], %subreg.sub1
+    ; CHECK-NEXT: [[S_QUADMASK_B64_:%[0-9]+]]:sreg_64 = S_QUADMASK_B64 killed [[REG_SEQUENCE2]]
+    ; CHECK-NEXT: S_ENDPGM 0
+    %2:vgpr_32 = COPY $vgpr2
+    %1:vgpr_32 = COPY $vgpr1
+    %0:vgpr_32 = COPY $vgpr0
+    %3:sreg_32 = IMPLICIT_DEF
+    %4:sreg_32 = COPY %1
+    %5:sgpr_128 = REG_SEQUENCE killed %3, %subreg.sub1, killed %1, %subreg.sub0, killed %2, %subreg.sub2
+    %6:sreg_64 = REG_SEQUENCE killed %5.sub3, %subreg.sub1, %5.sub1, %subreg.sub0
+    %7:sreg_64 = S_QUADMASK_B64 killed %6
+    S_ENDPGM 0
+
+...
+
+---
+name: v_readfirstlane_b32_depth_limit
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0
+
+    ; CHECK-LABEL: name: v_readfirstlane_b32_depth_limit
+    ; CHECK: liveins: $vgpr0
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[DEF]]
+    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:vgpr_32 = COPY [[COPY1]]
+    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:vgpr_32 = COPY [[COPY2]]
+    ; CHECK-NEXT: [[COPY4:%[0-9]+]]:vgpr_32 = COPY [[COPY3]]
+    ; CHECK-NEXT: [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[COPY4]]
+    ; CHECK-NEXT: [[COPY6:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
+    ; CHECK-NEXT: [[DEF1:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF2:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY6]], %subreg.sub1
+    ; CHECK-NEXT: [[V_READFIRSTLANE_B32_:%[0-9]+]]:sgpr_32 = V_READFIRSTLANE_B32 [[REG_SEQUENCE]].sub0, implicit $exec
+    ; CHECK-NEXT: [[V_READFIRSTLANE_B32_1:%[0-9]+]]:sgpr_32 = V_READFIRSTLANE_B32 [[REG_SEQUENCE]].sub1, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE1:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[V_READFIRSTLANE_B32_]], %subreg.sub0, [[V_READFIRSTLANE_B32_1]], %subreg.sub1
+    ; CHECK-NEXT: [[S_QUADMASK_B64_:%[0-9]+]]:sreg_64 = S_QUADMASK_B64 [[REG_SEQUENCE1]]
+    ; CHECK-NEXT: S_ENDPGM 0
+    %0:vgpr_32 = COPY $vgpr0
+    %1:sreg_32 = IMPLICIT_DEF
+    %2:vgpr_32 = COPY %1
+    %3:vgpr_32 = COPY %2
+    %4:vgpr_32 = COPY %3
+    %5:vgpr_32 = COPY %4
+    %6:vgpr_32 = COPY %5
+    %7:vgpr_32 = COPY %6
+    %8:sreg_64 = REG_SEQUENCE %0, %subreg.sub0, killed %7, %subreg.sub1
+    %9:sreg_64 = S_QUADMASK_B64 %8
+    S_ENDPGM 0

@jayfoad
Contributor

jayfoad commented Jul 29, 2025

Can you add a more end-to-end test, as an example of where this would help real code?

Normally this kind of thing is handled in SIFoldOperands:


Is there some reason we can't extend that to work for this case too?

Contributor

@arsenm arsenm left a comment

Needs IR tests, but my gut feeling is this is solving a problem you shouldn't need to be solving

@JoshHuttonCode
Contributor Author

If this does not need to be solved, then should #125950 be marked as resolved without this PR?

I can include the IR test as described in the issue.

@JoshHuttonCode
Contributor Author

I chose to handle this in the location I did because that is where the v_readfirstlane_b32 was being inserted, but SIFoldOperands does seem like it may be a better fit. I will investigate if this could be moved there.

@JoshHuttonCode JoshHuttonCode force-pushed the optimize-away-v_readfirstlane_b32 branch from a487d9d to 06ddf3c on September 19, 2025 22:34
@JoshHuttonCode JoshHuttonCode changed the title from "[AMDGPU] Optimize away v_readfirstlane_b32 on SGPR input" to "[AMDGPU] Allow folding of non-subregs through REG_SEQUENCE" on Sep 19, 2025
Comment on lines 715 to 721
const TargetRegisterClass *ConstrainRC =
TRI->findCommonRegClass(OpRC, Old.getSubReg(), NewRC, New->getSubReg());

const TargetRegisterClass *ConstrainRC = OpRC;
if (New->getSubReg())
ConstrainRC =
TRI->getMatchingSuperRegClass(NewRC, OpRC, New->getSubReg());

Contributor Author

Unless there is something I am missing, I do not see a reason why the ConstrainRC needed to be based on the subreg of the old register we are folding into.

Contributor

Can this be pre-committed as NFC?

@JoshHuttonCode
Contributor Author

After investigating further, I believe the issue with not folding v_readfirstlane_b32 comes from this case of folding through REG_SEQUENCE. This PR is now ready for review again.

Comment on lines 736 to 737
!New->getSubReg())
Old.setSubReg(AMDGPU::NoSubRegister);
Contributor

Can just change this to setSubReg(New->getSubReg())

Contributor Author

It seems like this will change the behavior of the 16-bit SGPR path. Is that right, and should it be a separate PR?

Contributor

I'm confused about the state of this hack. There are no 16-bit SGPRs, which is the problem. But sometimes there are, artificially?

ret i64 %qm
}

define amdgpu_ps void @test_quadmask_half_poison_i64(i32 %in, ptr %out) {
Contributor

Reference issue number

return;
}

if (!FoldingImmLike && OpToFold.isReg() && ST->needsAlignedVGPRs()) {
Contributor

This is too much special casing for an unrelated hack (really we should have an aligned 32-bit register class for this case)

If you use getRegClassConstraintEffectForVReg instead of checking the specific operand to compute the constraint, do you avoid the need for this?

Contributor

I don't think that will work actually. I was going to suggest checking if the operand is tied, but that hack is not actually tying the operand and nothing is really ensuring that the two operands will remain the same

Contributor

I think we could remove this hack properly now. We can now use RegClassByHwMode, and will then just need to adjust the selection patterns to pad with undef on the relevant targets

Contributor Author

Could you elaborate on your comment? Is your suggestion that I add an aligned 32-bit RC, and then use RegClassByHwMode to have the relevant DS_GWS instructions use that RC on the subtargets that require it? If we do not have a 32-bit aligned RC, I don't understand how this feature and padding during selection helps enforce the alignment at the time of folding.

Contributor

This would be a separate PR, since it's not really related to this.

Adding an even aligned 32-bit register class would be difficult. It would be easier to model this as a 64-bit operand, and continue doing the pad-with-undef it does today. The issue is the hardware treats this like a 64-bit value anyway, so we can model it as a 64-bit input.

It would require doing something like:


def GWS_SpecialCaseRC : RegClassByHwMode<
    [DefaultMode, AVAlign2LoadStoreMode, AlignedVGPRNoAGPRMode],
    [VGPR_32, VReg_64_Align2, VReg_64_Align2]>, SIRegisterClassLike<32, true, true>;

And using that instead of VGPR_32 for the special case operands in the GWS instruction definitions. Then we'd need separate patterns for the FeatureRequiresAlignedVGPRs and !FeatureRequiresAlignedVGPRs cases

Contributor Author

Due to the way that subregisters are handled when folding the source of a copy through REG_SEQUENCE, we only ever try to fold if the subregister operand in the REG_SEQUENCE exactly matches the subregister of the use of the REG_SEQUENCE.
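
A hypothetical sketch of that constraint (INST is a placeholder for an arbitrary user, not a real opcode):

    %rs:vreg_64 = REG_SEQUENCE %a, %subreg.sub0, %b, %subreg.sub1
    %x:vgpr_32  = INST %rs.sub0   ; the use's subreg (sub0) matches the %a entry, so folding %a is attempted
    %y:vgpr_32  = INST %rs.sub1   ; the use's subreg (sub1) matches the %b entry, so folding %b is attempted
    ; any other pairing of use subreg and REG_SEQUENCE entry is left alone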

To make sure I understand, are the high 32 bits of the 64-bit read required to be undef, or are they just discarded? If they are required to be undef, then I am not sure we should be folding into these instructions anyway. If they are not, then I think we would never fold the 32-bit operand into the 64-bit operand (unless we special-case the REG_SEQUENCE recursion for these instructions, along with the register class constraining).

Contributor

For the GWS cases, the high 32 bits are just discarded. The requirement is just that the register used is even-aligned, which currently we can only encode by using the 64-bit aligned register classes.

This case is far removed from something this pass should have to worry about. If the instruction were properly defined, you wouldn't have to worry about it. Ideally instructions accurately express their constraints and don't require special casing anywhere else

Contributor Author

It makes sense to me that this isn't something core for the pass to worry about; I just don't understand how RegClassByHwMode solves this issue. I did add a FIXME comment. There are two pieces we would need in order to fix it: RegClassByHwMode for these specific instructions, and access to a 32-bit aligned register class. Without a 32-bit aligned register class, I don't think we would ever actually perform the folding without some special casing anyway. As you mentioned, this PR is not for addressing this underlying issue. Do you think this needs to be addressed before this PR can be accepted?

Additionally, I borrowed this code from SIInstrInfo::verifyInstruction(). It may make sense to extract this code so it does not need to be redefined.

  // between the in-place mutations and adding to the fold list.
  foldOperand(OpToFold, RSUseMI, RSUseMI->getOperandNo(RSUse), FoldList,
-             CopiesToReplace);
+             CopiesToReplace, RegSeqSourceWasSubreg);
Contributor

I don't understand this new argument; it's too specific. This should remain in terms of operands and subregisters.

TRI->findCommonRegClass(OpRC, Old.getSubReg(), NewRC, New->getSubReg());

const TargetRegisterClass *ConstrainRC = OpRC;
if (New->getSubReg())
Contributor

Braces


arsenm pushed a commit that referenced this pull request Sep 26, 2025
…60743)

The RC of the folded operand does not need to be constrained based on
the RC of the current operand we are folding into.

The purpose of this PR is to facilitate this PR:
#151033
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Sep 26, 2025
…d (NFC) (#160743)

The RC of the folded operand does not need to be constrained based on
the RC of the current operand we are folding into.

The purpose of this PR is to facilitate this PR:
llvm/llvm-project#151033
mahesh-attarde pushed a commit to mahesh-attarde/llvm-project that referenced this pull request Oct 3, 2025
…vm#160743)

The RC of the folded operand does not need to be constrained based on
the RC of the current operand we are folding into.

The purpose of this PR is to facilitate this PR:
llvm#151033
@JoshHuttonCode JoshHuttonCode force-pushed the optimize-away-v_readfirstlane_b32 branch from 06ddf3c to 4275fe9 on November 6, 2025 23:23
Comment on lines -733 to -736
// Rework once the VS_16 register class is updated to include proper
// 16-bit SGPRs instead of 32-bit ones.
if (Old.getSubReg() == AMDGPU::lo16 && TRI->isSGPRReg(*MRI, New->getReg()))
Old.setSubReg(AMDGPU::NoSubRegister);
Contributor Author

@broxigarchen Is this previous piece of code necessary? Currently, and in #128929 when this if was introduced, this branch is not taken in any of the tests. None of the tests fold into a lo16.

Contributor

@broxigarchen broxigarchen Nov 10, 2025

I believe this is to handle a case like:

%1:vgpr32 = COPY %0:sreg32
%3:vgpr16 = inst_true16 %1.lo16:vgpr32

fold to

%3:vgpr16 = inst_true16 %0:sreg32

since inst_true16 can take both vgpr16 and sreg32 for 16-bit operands.

However, I ran a quick test and it seems this case is not added to the FoldCandidate list and thus not triggered. Maybe we should fix this? @Sisyph to help comment on this.

Contributor

I would have thought the line was exercised when it was added. Maybe the calling code flow changed some assumed precondition. I observed that you can add assert(Old.getSubReg() == AMDGPU::NoSubRegister) at this line. Is it expected that we never have subregisters on Old in this function?
I don't think the new version of the code in this PR is correct if Old did have a subregister (not sure about the old version).

@JoshHuttonCode JoshHuttonCode force-pushed the optimize-away-v_readfirstlane_b32 branch from 4275fe9 to 331902a on November 7, 2025 19:39
Comment on lines -740 to +737
-    Old.substVirtReg(New->getReg(), New->getSubReg(), *TRI);
+    Old.substVirtReg(New->getReg(), 0, *TRI);
Contributor Author

I did have to change this to not pass in any subreg in order to make Old.setSubReg(New->getSubReg()) work, if that is fine.
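
For reference, a sketch of how the two pieces combine after that change (assuming the surrounding fold-candidate update code in SIFoldOperands; not the verbatim patch):

    // Substitute the register without composing a subreg index, ...
    Old.substVirtReg(New->getReg(), 0, *TRI);
    // ... then take the folded operand's subreg verbatim, as suggested above.
    Old.setSubReg(New->getSubReg());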
