Conversation

@JoshHuttonCode
Contributor

@JoshHuttonCode JoshHuttonCode commented Jul 28, 2025

Allow the folding of copies with subreg operands if they come from a REG_SEQUENCE and the REG_SEQUENCE use was not a subreg. Added tests for various edge cases that came up during development.

This PR addresses the following issue: #125950

@github-actions

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository; in that case, you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "pinging" the PR with a comment containing "Ping". The common courtesy ping rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Jul 28, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: None (JoshHuttonCode)

Changes

When a REG_SEQUENCE has one or more VGPR operands and we are looping through them to insert V_READFIRSTLANE_B32, first check to see if the ultimate source of what is being copied is a virtual SGPR. If it is, just use the SGPR source directly.

This PR addresses the following issue: #125950
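
A condensed illustration of the intended effect, distilled from the v_readfirstlane_b32_omitted MIR test added below (virtual register names are shortened for readability; this is a sketch, not verbatim pass output):

    ; input to si-fix-sgpr-copies
    %v:vgpr_32  = COPY $vgpr0
    %s:sreg_32  = IMPLICIT_DEF
    %rs:sreg_64 = REG_SEQUENCE %v, %subreg.sub0, killed %s, %subreg.sub1
    %qm:sreg_64 = S_QUADMASK_B64 %rs

    ; output with this patch: only the genuine VGPR lane is read back;
    ; the lane whose ultimate source is the SGPR %s reuses %s directly
    %vc:vgpr_32  = COPY %s
    %vrs:vreg_64 = REG_SEQUENCE %v, %subreg.sub0, killed %vc, %subreg.sub1
    %rfl:sgpr_32 = V_READFIRSTLANE_B32 %vrs.sub0, implicit $exec
    %srs:sreg_64 = REG_SEQUENCE %rfl, %subreg.sub0, %s, %subreg.sub1
    %qm:sreg_64  = S_QUADMASK_B64 %srs

Without the patch, a V_READFIRSTLANE_B32 is emitted for both lanes, including the one that already originates from an SGPR.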


Full diff: https://github.com/llvm/llvm-project/pull/151033.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+98-6)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.h (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/constant-address-space-32bit.ll (+1-1)
  • (added) llvm/test/CodeGen/AMDGPU/optimize-away-v_readfirstlane_b32-on-sgpr-source.mir (+271)
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 2aa6b4e82f9d5..72d2d7300ec6d 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -6382,6 +6382,67 @@ void SIInstrInfo::legalizeOperandsVOP3(MachineRegisterInfo &MRI,
     legalizeOpWithMove(MI, VOP3Idx[2]);
 }
 
+// Recursively check to see if the ultimate source of a readfirstlane is SGPR.
+// If it is, readfirstlane can be omitted, and the source of the value can be
+// used directly.
+Register SIInstrInfo::checkIsSourceSGPR(const MachineOperand &MO,
+                                        const MachineRegisterInfo &MRI,
+                                        const SIRegisterInfo *TRI,
+                                        int MaxDepth) const {
+  if (MaxDepth == 0)
+    return Register();
+
+  Register PotentialSGPR = MO.getReg();
+
+  // While we could return a physical SGPR source, we would need to guarantee it
+  // has not been redefined.
+  if (PotentialSGPR.isPhysical())
+    return Register();
+
+  assert(MRI.hasOneDef(PotentialSGPR));
+
+  MachineInstr *MI = MRI.getVRegDef(PotentialSGPR);
+  auto MIOpc = MI->getOpcode();
+  const TargetRegisterClass *RC = MRI.getRegClass(PotentialSGPR);
+
+  if (RI.hasSGPRs(RC))
+    return PotentialSGPR;
+
+  switch (MIOpc) {
+  case AMDGPU::COPY: {
+    MachineOperand CopySource = MI->getOperand(1);
+    return checkIsSourceSGPR(CopySource, MRI, TRI, MaxDepth - 1);
+  }
+  case AMDGPU::REG_SEQUENCE: {
+    unsigned SubRegToFind = MO.getSubReg();
+    unsigned SubRegOperandIndex = 2;
+    unsigned CopySourceIndex = 0;
+
+    // Since subregs may be listed out of order, we need to
+    // loop over operands to find the subreg we are looking for.
+    while (SubRegOperandIndex < MI->getNumOperands()) {
+      assert(MI->isOperandSubregIdx(SubRegOperandIndex));
+
+      unsigned SubRegIndex = MI->getOperand(SubRegOperandIndex).getImm();
+      if (SubRegIndex == SubRegToFind) {
+        CopySourceIndex = SubRegOperandIndex - 1;
+        break;
+      }
+
+      SubRegOperandIndex += 2;
+    }
+
+    if (SubRegOperandIndex >= MI->getNumOperands())
+      return Register();
+
+    MachineOperand CopySource = MI->getOperand(CopySourceIndex);
+    return checkIsSourceSGPR(CopySource, MRI, TRI, MaxDepth - 1);
+  }
+  default:
+    return Register();
+  }
+}
+
 Register SIInstrInfo::readlaneVGPRToSGPR(
     Register SrcReg, MachineInstr &UseMI, MachineRegisterInfo &MRI,
     const TargetRegisterClass *DstRC /*=nullptr*/) const {
@@ -6410,12 +6471,43 @@ Register SIInstrInfo::readlaneVGPRToSGPR(
   }
 
   SmallVector<Register, 8> SRegs;
-  for (unsigned i = 0; i < SubRegs; ++i) {
-    Register SGPR = MRI.createVirtualRegister(&AMDGPU::SGPR_32RegClass);
-    BuildMI(*UseMI.getParent(), UseMI, UseMI.getDebugLoc(),
-            get(AMDGPU::V_READFIRSTLANE_B32), SGPR)
-        .addReg(SrcReg, 0, RI.getSubRegFromChannel(i));
-    SRegs.push_back(SGPR);
+  const MachineInstr *CopySourceInstr = MRI.getVRegDef(SrcReg);
+
+  if (CopySourceInstr->getOpcode() != AMDGPU::REG_SEQUENCE) {
+    for (unsigned i = 0; i < SubRegs; ++i) {
+      Register SGPR = MRI.createVirtualRegister(&AMDGPU::SGPR_32RegClass);
+      BuildMI(*UseMI.getParent(), UseMI, UseMI.getDebugLoc(),
+              get(AMDGPU::V_READFIRSTLANE_B32), SGPR)
+          .addReg(SrcReg, 0, RI.getSubRegFromChannel(i));
+      SRegs.push_back(SGPR);
+    }
+  } else {
+    SRegs.resize(SubRegs);
+    const SIRegisterInfo *TRI = ST.getRegisterInfo();
+
+    for (unsigned i = 0; i < SubRegs; ++i) {
+      unsigned SubregOperandIndex = 2 + 2 * i;
+      assert(SubregOperandIndex < CopySourceInstr->getNumOperands());
+      assert(CopySourceInstr->isOperandSubregIdx(SubregOperandIndex));
+
+      MachineOperand RegSeqSrcOperand =
+          CopySourceInstr->getOperand(SubregOperandIndex - 1);
+      Register SGPRSource = checkIsSourceSGPR(RegSeqSrcOperand, MRI, TRI);
+
+      unsigned SubRegIndex =
+          CopySourceInstr->getOperand(SubregOperandIndex).getImm();
+      unsigned SubRegChannel = RI.getChannelFromSubReg(SubRegIndex);
+
+      if (SGPRSource) {
+        SRegs[SubRegChannel] = SGPRSource;
+      } else {
+        Register SGPR = MRI.createVirtualRegister(&AMDGPU::SGPR_32RegClass);
+        BuildMI(*UseMI.getParent(), UseMI, UseMI.getDebugLoc(),
+                get(AMDGPU::V_READFIRSTLANE_B32), SGPR)
+            .addReg(SrcReg, 0, SubRegIndex);
+        SRegs[SubRegChannel] = SGPR;
+      }
+    }
   }
 
   MachineInstrBuilder MIB =
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index e042b59eb0f04..3e2c681805a3f 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -186,6 +186,10 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
 
   bool resultDependsOnExec(const MachineInstr &MI) const;
 
+  Register checkIsSourceSGPR(const MachineOperand &MO,
+                             const MachineRegisterInfo &MRI,
+                             const SIRegisterInfo *TRI, int MaxDepth = 6) const;
+
 protected:
   /// If the specific machine instruction is a instruction that moves/copies
   /// value from one register to another register return destination and source
diff --git a/llvm/test/CodeGen/AMDGPU/constant-address-space-32bit.ll b/llvm/test/CodeGen/AMDGPU/constant-address-space-32bit.ll
index 52ccfe8ba3bfb..384e1c5a4cc50 100644
--- a/llvm/test/CodeGen/AMDGPU/constant-address-space-32bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/constant-address-space-32bit.ll
@@ -300,8 +300,8 @@ define amdgpu_vs float @load_addr_no_fold(ptr addrspace(6) inreg noalias %p0) #0
 }
 
 ; GCN-LABEL: {{^}}vgpr_arg_src:
-; GCN: v_readfirstlane_b32 s[[READLANE:[0-9]+]], v0
 ; GCN: s_mov_b32 s[[ZERO:[0-9]+]]
+; GCN: v_readfirstlane_b32 s[[READLANE:[0-9]+]], v0
 ; GCN: s_load_dwordx4 s{{\[[0-9]+:[0-9]+\]}}, s[[[READLANE]]:[[ZERO]]]
 define amdgpu_vs float @vgpr_arg_src(ptr addrspace(6) %arg) {
 main_body:
diff --git a/llvm/test/CodeGen/AMDGPU/optimize-away-v_readfirstlane_b32-on-sgpr-source.mir b/llvm/test/CodeGen/AMDGPU/optimize-away-v_readfirstlane_b32-on-sgpr-source.mir
new file mode 100644
index 0000000000000..4eff26fccddd1
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/optimize-away-v_readfirstlane_b32-on-sgpr-source.mir
@@ -0,0 +1,271 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -march=amdgcn -mcpu=gfx900 -run-pass=si-fix-sgpr-copies -o -  %s | FileCheck %s
+---
+name: v_readfirstlane_b32_omitted
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0
+
+    ; CHECK-LABEL: name: v_readfirstlane_b32_omitted
+    ; CHECK: liveins: $vgpr0
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF1:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[DEF]]
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, killed [[COPY1]], %subreg.sub1
+    ; CHECK-NEXT: [[V_READFIRSTLANE_B32_:%[0-9]+]]:sgpr_32 = V_READFIRSTLANE_B32 [[REG_SEQUENCE]].sub0, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE1:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[V_READFIRSTLANE_B32_]], %subreg.sub0, [[DEF]], %subreg.sub1
+    ; CHECK-NEXT: [[S_QUADMASK_B64_:%[0-9]+]]:sreg_64 = S_QUADMASK_B64 [[REG_SEQUENCE1]]
+    ; CHECK-NEXT: S_ENDPGM 0
+    %0:vgpr_32 = COPY $vgpr0
+    %1:sreg_32 = IMPLICIT_DEF
+    %2:sreg_64 = REG_SEQUENCE %0, %subreg.sub0, killed %1, %subreg.sub1
+    %3:sreg_64 = S_QUADMASK_B64 %2
+    S_ENDPGM 0
+
+...
+
+---
+name: v_readfirstlane_b32_omitted_switched_subregs
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0
+
+    ; CHECK-LABEL: name: v_readfirstlane_b32_omitted_switched_subregs
+    ; CHECK: liveins: $vgpr0
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF1:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[DEF]]
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64 = REG_SEQUENCE [[COPY]], %subreg.sub1, killed [[COPY1]], %subreg.sub0
+    ; CHECK-NEXT: [[V_READFIRSTLANE_B32_:%[0-9]+]]:sgpr_32 = V_READFIRSTLANE_B32 [[REG_SEQUENCE]].sub1, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE1:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[DEF]], %subreg.sub0, [[V_READFIRSTLANE_B32_]], %subreg.sub1
+    ; CHECK-NEXT: [[S_QUADMASK_B64_:%[0-9]+]]:sreg_64 = S_QUADMASK_B64 killed [[REG_SEQUENCE1]]
+    ; CHECK-NEXT: S_ENDPGM 0
+    %0:vgpr_32 = COPY $vgpr0
+    %1:sreg_32 = IMPLICIT_DEF
+    %2:sreg_64 = REG_SEQUENCE %0, %subreg.sub1, killed %1, %subreg.sub0
+    %3:sreg_64 = S_QUADMASK_B64 killed %2
+    S_ENDPGM 0
+
+...
+
+---
+name: v_readfirstlane_b32_phys_vgpr_and_sgpr
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0, $sgpr0
+
+    ; CHECK-LABEL: name: v_readfirstlane_b32_phys_vgpr_and_sgpr
+    ; CHECK: liveins: $vgpr0, $sgpr0
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:sreg_32 = COPY $sgpr0
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:vgpr_32 = COPY [[COPY1]], implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, killed [[COPY2]], %subreg.sub1
+    ; CHECK-NEXT: [[V_READFIRSTLANE_B32_:%[0-9]+]]:sgpr_32 = V_READFIRSTLANE_B32 [[REG_SEQUENCE]].sub0, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE1:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[V_READFIRSTLANE_B32_]], %subreg.sub0, [[COPY1]], %subreg.sub1
+    ; CHECK-NEXT: [[S_QUADMASK_B64_:%[0-9]+]]:sreg_64 = S_QUADMASK_B64 [[REG_SEQUENCE1]]
+    ; CHECK-NEXT: S_ENDPGM 0
+    %0:vgpr_32 = COPY $vgpr0
+    %1:sreg_32 = COPY $sgpr0
+    %2:sreg_64 = REG_SEQUENCE %0, %subreg.sub0, killed %1, %subreg.sub1
+    %3:sreg_64 = S_QUADMASK_B64 %2
+    S_ENDPGM 0
+
+...
+
+---
+name: v_readfirstlane_b32_both_vgpr
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0, $vgpr1
+
+    ; CHECK-LABEL: name: v_readfirstlane_b32_both_vgpr
+    ; CHECK: liveins: $vgpr0, $vgpr1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr1
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF1:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF2:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
+    ; CHECK-NEXT: [[V_READFIRSTLANE_B32_:%[0-9]+]]:sgpr_32 = V_READFIRSTLANE_B32 [[REG_SEQUENCE]].sub0, implicit $exec
+    ; CHECK-NEXT: [[V_READFIRSTLANE_B32_1:%[0-9]+]]:sgpr_32 = V_READFIRSTLANE_B32 [[REG_SEQUENCE]].sub1, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE1:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[V_READFIRSTLANE_B32_]], %subreg.sub0, [[V_READFIRSTLANE_B32_1]], %subreg.sub1
+    ; CHECK-NEXT: [[S_QUADMASK_B64_:%[0-9]+]]:sreg_64 = S_QUADMASK_B64 [[REG_SEQUENCE1]]
+    ; CHECK-NEXT: S_ENDPGM 0
+    %0:vgpr_32 = COPY $vgpr0
+    %1:vgpr_32 = COPY $vgpr1
+    %2:sreg_32 = IMPLICIT_DEF
+    %3:sreg_64 = REG_SEQUENCE %0, %subreg.sub0, killed %1, %subreg.sub1
+    %4:sreg_64 = S_QUADMASK_B64 %3
+    S_ENDPGM 0
+
+...
+
+---
+name: v_readfirstlane_b32_both_sgpr
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $sgpr0, $sgpr1
+
+    ; CHECK-LABEL: name: v_readfirstlane_b32_both_sgpr
+    ; CHECK: liveins: $sgpr0, $sgpr1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:sgpr_32 = COPY $sgpr0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:sgpr_32 = COPY $sgpr1
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, killed [[COPY1]], %subreg.sub1
+    ; CHECK-NEXT: [[S_QUADMASK_B64_:%[0-9]+]]:sreg_64 = S_QUADMASK_B64 [[REG_SEQUENCE]]
+    ; CHECK-NEXT: S_ENDPGM 0
+    %0:sgpr_32 = COPY $sgpr0
+    %1:sgpr_32 = COPY $sgpr1
+    %2:sreg_32 = IMPLICIT_DEF
+    %3:sreg_64 = REG_SEQUENCE %0, %subreg.sub0, killed %1, %subreg.sub1
+    %4:sreg_64 = S_QUADMASK_B64 %3
+    S_ENDPGM 0
+
+...
+
+---
+name: v_readfirstlane_b32_recursive_look_through_copies_and_reg_sequence_for_vgpr
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0
+
+    ; CHECK-LABEL: name: v_readfirstlane_b32_recursive_look_through_copies_and_reg_sequence_for_vgpr
+    ; CHECK: liveins: $vgpr0
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[COPY]]
+    ; CHECK-NEXT: [[DEF1:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF2:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF3:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:vgpr_32 = COPY [[DEF]]
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_128 = REG_SEQUENCE killed [[COPY2]], %subreg.sub1, [[COPY1]], %subreg.sub0, [[COPY1]], %subreg.sub2, [[COPY1]], %subreg.sub3
+    ; CHECK-NEXT: [[REG_SEQUENCE1:%[0-9]+]]:vreg_64 = REG_SEQUENCE killed [[REG_SEQUENCE]].sub3, %subreg.sub1, [[REG_SEQUENCE]].sub1, %subreg.sub0
+    ; CHECK-NEXT: [[V_READFIRSTLANE_B32_:%[0-9]+]]:sgpr_32 = V_READFIRSTLANE_B32 [[REG_SEQUENCE1]].sub1, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE2:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[DEF]], %subreg.sub0, [[V_READFIRSTLANE_B32_]], %subreg.sub1
+    ; CHECK-NEXT: [[S_QUADMASK_B64_:%[0-9]+]]:sreg_64 = S_QUADMASK_B64 killed [[REG_SEQUENCE2]]
+    ; CHECK-NEXT: S_ENDPGM 0
+    %0:vgpr_32 = COPY $vgpr0
+    %1:sreg_32 = IMPLICIT_DEF
+    %2:vgpr_32 = COPY %0
+    %3:sgpr_128 = REG_SEQUENCE killed %1, %subreg.sub1, killed %2, %subreg.sub0, killed %2, %subreg.sub2, killed %2, %subreg.sub3
+    %4:sreg_64 = REG_SEQUENCE killed %3.sub3, %subreg.sub1, %3.sub1, %subreg.sub0
+    %5:sreg_64 = S_QUADMASK_B64 killed %4
+    S_ENDPGM 0
+
+...
+
+---
+name: v_readfirstlane_b32_recursive_look_through_copies_and_reg_sequence_for_sgpr
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $sgpr0
+
+    ; CHECK-LABEL: name: v_readfirstlane_b32_recursive_look_through_copies_and_reg_sequence_for_sgpr
+    ; CHECK: liveins: $sgpr0
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:sgpr_32 = COPY $sgpr0
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:sgpr_32 = COPY [[COPY]]
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:sgpr_128 = REG_SEQUENCE killed [[DEF]], %subreg.sub1, killed [[COPY1]], %subreg.sub0, killed [[COPY1]], %subreg.sub2, killed [[COPY1]], %subreg.sub3
+    ; CHECK-NEXT: [[REG_SEQUENCE1:%[0-9]+]]:sreg_64 = REG_SEQUENCE killed [[REG_SEQUENCE]].sub3, %subreg.sub1, [[REG_SEQUENCE]].sub1, %subreg.sub0
+    ; CHECK-NEXT: [[S_QUADMASK_B64_:%[0-9]+]]:sreg_64 = S_QUADMASK_B64 killed [[REG_SEQUENCE1]]
+    ; CHECK-NEXT: S_ENDPGM 0
+    %0:sgpr_32 = COPY $sgpr0
+    %1:sreg_32 = IMPLICIT_DEF
+    %2:sgpr_32 = COPY %0
+    %3:sgpr_128 = REG_SEQUENCE killed %1, %subreg.sub1, killed %2, %subreg.sub0, killed %2, %subreg.sub2, killed %2, %subreg.sub3
+    %4:sreg_64 = REG_SEQUENCE killed %3.sub3, %subreg.sub1, %3.sub1, %subreg.sub0
+    %5:sreg_64 = S_QUADMASK_B64 killed %4
+    S_ENDPGM 0
+
+...
+
+---
+name: v_readfirstlane_b32_undef_subreg
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0, $vgpr1, $vgpr2
+
+    ; CHECK-LABEL: name: v_readfirstlane_b32_undef_subreg
+    ; CHECK: liveins: $vgpr0, $vgpr1, $vgpr2
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr2
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr1
+    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF1:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF2:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF3:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:vgpr_32 = COPY [[DEF]]
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_128 = REG_SEQUENCE killed [[COPY3]], %subreg.sub1, [[COPY1]], %subreg.sub0, [[COPY]], %subreg.sub2
+    ; CHECK-NEXT: [[REG_SEQUENCE1:%[0-9]+]]:vreg_64 = REG_SEQUENCE killed [[REG_SEQUENCE]].sub3, %subreg.sub1, [[REG_SEQUENCE]].sub1, %subreg.sub0
+    ; CHECK-NEXT: [[V_READFIRSTLANE_B32_:%[0-9]+]]:sgpr_32 = V_READFIRSTLANE_B32 [[REG_SEQUENCE1]].sub1, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE2:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[DEF]], %subreg.sub0, [[V_READFIRSTLANE_B32_]], %subreg.sub1
+    ; CHECK-NEXT: [[S_QUADMASK_B64_:%[0-9]+]]:sreg_64 = S_QUADMASK_B64 killed [[REG_SEQUENCE2]]
+    ; CHECK-NEXT: S_ENDPGM 0
+    %2:vgpr_32 = COPY $vgpr2
+    %1:vgpr_32 = COPY $vgpr1
+    %0:vgpr_32 = COPY $vgpr0
+    %3:sreg_32 = IMPLICIT_DEF
+    %4:sreg_32 = COPY %1
+    %5:sgpr_128 = REG_SEQUENCE killed %3, %subreg.sub1, killed %1, %subreg.sub0, killed %2, %subreg.sub2
+    %6:sreg_64 = REG_SEQUENCE killed %5.sub3, %subreg.sub1, %5.sub1, %subreg.sub0
+    %7:sreg_64 = S_QUADMASK_B64 killed %6
+    S_ENDPGM 0
+
+...
+
+---
+name: v_readfirstlane_b32_depth_limit
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0
+
+    ; CHECK-LABEL: name: v_readfirstlane_b32_depth_limit
+    ; CHECK: liveins: $vgpr0
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:sreg_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[DEF]]
+    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:vgpr_32 = COPY [[COPY1]]
+    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:vgpr_32 = COPY [[COPY2]]
+    ; CHECK-NEXT: [[COPY4:%[0-9]+]]:vgpr_32 = COPY [[COPY3]]
+    ; CHECK-NEXT: [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[COPY4]]
+    ; CHECK-NEXT: [[COPY6:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
+    ; CHECK-NEXT: [[DEF1:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF2:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY6]], %subreg.sub1
+    ; CHECK-NEXT: [[V_READFIRSTLANE_B32_:%[0-9]+]]:sgpr_32 = V_READFIRSTLANE_B32 [[REG_SEQUENCE]].sub0, implicit $exec
+    ; CHECK-NEXT: [[V_READFIRSTLANE_B32_1:%[0-9]+]]:sgpr_32 = V_READFIRSTLANE_B32 [[REG_SEQUENCE]].sub1, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE1:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[V_READFIRSTLANE_B32_]], %subreg.sub0, [[V_READFIRSTLANE_B32_1]], %subreg.sub1
+    ; CHECK-NEXT: [[S_QUADMASK_B64_:%[0-9]+]]:sreg_64 = S_QUADMASK_B64 [[REG_SEQUENCE1]]
+    ; CHECK-NEXT: S_ENDPGM 0
+    %0:vgpr_32 = COPY $vgpr0
+    %1:sreg_32 = IMPLICIT_DEF
+    %2:vgpr_32 = COPY %1
+    %3:vgpr_32 = COPY %2
+    %4:vgpr_32 = COPY %3
+    %5:vgpr_32 = COPY %4
+    %6:vgpr_32 = COPY %5
+    %7:vgpr_32 = COPY %6
+    %8:sreg_64 = REG_SEQUENCE %0, %subreg.sub0, killed %7, %subreg.sub1
+    %9:sreg_64 = S_QUADMASK_B64 %8
+    S_ENDPGM 0

@jayfoad
Contributor

jayfoad commented Jul 29, 2025

Can you add a more end-to-end test, as an example of where this would help real code?

Normally this kind of thing is handled in SIFoldOperands:


Is there some reason we can't extend that to work for this case too?

Contributor

@arsenm arsenm left a comment

Needs IR tests, but my gut feeling is this is solving a problem you shouldn't need to be solving

@JoshHuttonCode
Contributor Author

If this does not need to be solved, then should #125950 be marked as resolved without this PR?

I can include the IR test as described in the issue.

@JoshHuttonCode
Contributor Author

I chose to handle this in the location I did because that is where the v_readfirstlane_b32 was being inserted, but SIFoldOperands does seem like it may be a better fit. I will investigate if this could be moved there.

@JoshHuttonCode JoshHuttonCode force-pushed the optimize-away-v_readfirstlane_b32 branch from a487d9d to 06ddf3c on September 19, 2025 22:34
@JoshHuttonCode JoshHuttonCode changed the title from "[AMDGPU] Optimize away v_readfirstlane_b32 on SGPR input" to "[AMDGPU] Allow folding of non-subregs through REG_SEQUENCE" on Sep 19, 2025
Comment on lines 715 to 721
const TargetRegisterClass *ConstrainRC =
TRI->findCommonRegClass(OpRC, Old.getSubReg(), NewRC, New->getSubReg());

const TargetRegisterClass *ConstrainRC = OpRC;
if (New->getSubReg())
ConstrainRC =
TRI->getMatchingSuperRegClass(NewRC, OpRC, New->getSubReg());

Contributor Author

Unless there is something I am missing, I do not see a reason why the ConstrainRC needed to be based on the subreg of the old register we are folding into.

Contributor

Can this be pre-committed as NFC?

@JoshHuttonCode
Contributor Author

After investigating further, I believe the issue with not folding v_readfirstlane_b32 comes from this case of folding through REG_SEQUENCE. This PR is now ready for review again.

Comment on lines 736 to 737
!New->getSubReg())
Old.setSubReg(AMDGPU::NoSubRegister);
Contributor

Can just change this to setSubReg(New->getSubReg())

Contributor Author

It seems like this will change the behavior of the 16-bit SGPR path. Is that right, and should it be a separate PR?

Contributor

I'm confused about the state of this hack. There are no 16-bit SGPRs, which is the problem. But sometimes there are, artificially?

ret i64 %qm
}

define amdgpu_ps void @test_quadmask_half_poison_i64(i32 %in, ptr %out) {
Contributor

Reference issue number

return;
}

if (!FoldingImmLike && OpToFold.isReg() && ST->needsAlignedVGPRs()) {
Contributor

This is too much special casing for an unrelated hack (really we should have an aligned 32-bit register class for this case)

If you use getRegClassConstraintEffectForVReg instead of checking the specific operand to compute the constraint, do you avoid the need for this?

Contributor

I don't think that will work actually. I was going to suggest checking if the operand is tied, but that hack is not actually tying the operand and nothing is really ensuring that the two operands will remain the same

Contributor

I think we could remove this hack properly now. We can now use RegClassByHwMode, and will then just need to adjust the selection patterns to pad with undef on the relevant targets

Contributor Author

Could you elaborate on your comment? Is your suggestion that I add an aligned 32-bit RC, and then use RegClassByHwMode to have the relevant DS_GWS instructions use that RC on the subtargets that require it? If we do not have a 32-bit aligned RC, I don't understand how this feature and padding during selection helps enforce the alignment at the time of folding.

Contributor

This would be a separate PR, since it's not really related to this.

Adding an even aligned 32-bit register class would be difficult. It would be easier to model this as a 64-bit operand, and continue doing the pad-with-undef it does today. The issue is the hardware treats this like a 64-bit value anyway, so we can model it as a 64-bit input.

It would require doing something like:


def GWS_SpecialCaseRC : RegClassByHwMode<
    [DefaultMode, AVAlign2LoadStoreMode, AlignedVGPRNoAGPRMode],
    [VGPR_32, VReg_64_Align2, VReg_64_Align2]>, SIRegisterClassLike<32, true, true>;

And using that instead of VGPR_32 for the special case operands in the GWS instruction definitions. Then we'd need separate patterns for the FeatureRequiresAlignedVGPRs and !FeatureRequiresAlignedVGPRs cases

Contributor Author

Due to the way that subregisters are handled when folding the source of a copy through REG_SEQUENCE, we only ever try to fold if the subregister operand in the REG_SEQUENCE exactly matches the subregister of the use of the REG_SEQUENCE.
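
A hypothetical sketch of that constraint (INST is a placeholder for an arbitrary user, not a real opcode):

    %rs:vreg_64 = REG_SEQUENCE %a, %subreg.sub0, %b, %subreg.sub1
    %x:vgpr_32  = INST %rs.sub0   ; the use's subreg (sub0) matches the %a entry, so folding %a is attempted
    %y:vgpr_32  = INST %rs.sub1   ; the use's subreg (sub1) matches the %b entry, so folding %b is attempted
    ; any other pairing of use subreg and REG_SEQUENCE entry is left alone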

To make sure I understand, are the high 32 bits of the 64-bit read required to be undef, or are they just discarded? If they are required to be undef, then I am not sure we should be folding into these instructions anyway. If they are not, then I think we would never fold the 32-bit operand into the 64-bit operand (unless we special-case the REG_SEQUENCE recursion for these instructions, along with the register class constraining).

Contributor

For the GWS cases, the high 32 bits are just discarded. The requirement is just that the register used is even-aligned, which currently we can only encode by using the 64-bit aligned register classes.

This case is far removed from something this pass should have to worry about. If the instruction were properly defined, you wouldn't have to worry about it. Ideally instructions accurately express their constraints and don't require special casing anywhere else

Contributor Author

It makes sense to me that this isn't something core for the pass to worry about; I just don't understand how RegClassByHwMode solves this issue. I did add a FIXME comment. There are two pieces we would need in order to fix it: RegClassByHwMode for these specific instructions, and access to a 32-bit aligned register class. Without a 32-bit aligned register class, I don't think we would ever actually perform the folding without some special casing anyway. As you mentioned, this PR is not for addressing this underlying issue. Do you think this needs to be addressed before this PR can be accepted?

Additionally, I borrowed this code from SIInstrInfo::verifyInstruction(). It may make sense to extract this code so it does not need to be redefined.

  // between the in-place mutations and adding to the fold list.
  foldOperand(OpToFold, RSUseMI, RSUseMI->getOperandNo(RSUse), FoldList,
-             CopiesToReplace);
+             CopiesToReplace, RegSeqSourceWasSubreg);
Contributor

I don't understand this new argument; it's too specific. This should remain in terms of operands and subregisters.

TRI->findCommonRegClass(OpRC, Old.getSubReg(), NewRC, New->getSubReg());

const TargetRegisterClass *ConstrainRC = OpRC;
if (New->getSubReg())
Contributor

Braces


arsenm pushed a commit that referenced this pull request Sep 26, 2025
…60743)

The RC of the folded operand does not need to be constrained based on
the RC of the current operand we are folding into.

The purpose of this PR is to facilitate this PR:
#151033
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Sep 26, 2025
…d (NFC) (#160743)

The RC of the folded operand does not need to be constrained based on
the RC of the current operand we are folding into.

The purpose of this PR is to facilitate this PR:
llvm/llvm-project#151033
mahesh-attarde pushed a commit to mahesh-attarde/llvm-project that referenced this pull request Oct 3, 2025
…vm#160743)

The RC of the folded operand does not need to be constrained based on
the RC of the current operand we are folding into.

The purpose of this PR is to facilitate this PR:
llvm#151033
@JoshHuttonCode JoshHuttonCode force-pushed the optimize-away-v_readfirstlane_b32 branch from 06ddf3c to 4275fe9 on November 6, 2025 23:23
Comment on lines -733 to -736
// Rework once the VS_16 register class is updated to include proper
// 16-bit SGPRs instead of 32-bit ones.
if (Old.getSubReg() == AMDGPU::lo16 && TRI->isSGPRReg(*MRI, New->getReg()))
Old.setSubReg(AMDGPU::NoSubRegister);
Contributor Author

@broxigarchen Is this previous piece of code necessary? Currently, and in #128929 when this if was introduced, this branch is not taken in any of the tests. None of the tests fold into a lo16.

Contributor

@broxigarchen broxigarchen Nov 10, 2025

I believe this is to handle a case like:

%1:vgpr32 = COPY %0:sreg32
%3:vgpr16 = inst_true16 %1.lo16:vgpr32

fold to

%3:vgpr16 = inst_true16 %0:sreg32

since inst_true16 can take both vgpr16 and sreg32 for 16-bit operands.

However, I ran a quick test and it seems this case is not added to the FoldCandidate list and thus not triggered. Maybe we should fix this? @Sisyph to help comment on this.

Contributor

I would have thought the line was exercised when it was added. Maybe the calling code flow changed some assumed precondition. I observed that you can add assert(Old.getSubReg() == AMDGPU::NoSubRegister) at this line. Is it expected that we never have subregisters on Old in this function?
I don't think the new version of the code in this PR is correct if Old did have a subregister (not sure about the old version).

@JoshHuttonCode JoshHuttonCode force-pushed the optimize-away-v_readfirstlane_b32 branch from 4275fe9 to 331902a on November 7, 2025 19:39
Comment on lines -740 to +737
-    Old.substVirtReg(New->getReg(), New->getSubReg(), *TRI);
+    Old.substVirtReg(New->getReg(), 0, *TRI);
Contributor Author

I did have to change this to not pass in any subreg in order to make Old.setSubReg(New->getSubReg()) work, if that is fine.
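
For reference, a sketch of how the two pieces combine after that change (assuming the surrounding fold-candidate update code in SIFoldOperands; not the verbatim patch):

    // Substitute the register without composing a subreg index, ...
    Old.substVirtReg(New->getReg(), 0, *TRI);
    // ... then take the folded operand's subreg verbatim, as suggested above.
    Old.setSubReg(New->getSubReg());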
