Conversation

akadutta
Contributor

@akadutta akadutta commented Aug 1, 2025

This patch unpacks packed instructions that cannot be co-issued with MFMAs into their unpacked equivalents, which can be co-issued with them. Only packed instructions that are covered by the MFMA's latency are unpacked; the rest are left packed.
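
For context, a condensed sketch of the selection logic from the patch (simplified; variable names are illustrative, and modifier handling, SGPR copies, and liveness updates are omitted):

  // Walk forward from an MFMA and mark packed instructions that still fall
  // inside the MFMA's latency window; anything past the window stays packed.
  auto SchedModel = TII->getSchedModel();
  const MCSchedClassDesc *SC = SchedModel.resolveSchedClass(&MFMA);
  const int NumMFMACycles = SchedModel.getWriteProcResBegin(SC)->ReleaseAtCycle;
  int Cycles = 0;
  for (auto It = std::next(MFMA.getIterator()); It != MBB.end(); ++It) {
    MachineInstr &MI = *It;
    Cycles += SchedModel.getWriteProcResBegin(
                  SchedModel.resolveSchedClass(&MI))->ReleaseAtCycle;
    if (Cycles > NumMFMACycles)
      break; // past the MFMA's shadow: leave the rest packed
    if (isUnpackingSupportedInstr(MI) && TII->isNeverCoissue(MI))
      InstrsToUnpack.insert(&MI); // V_PK_ADD_F32 / V_PK_MUL_F32 -> two e64 ops
  }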


github-actions bot commented Aug 1, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Aug 1, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Akash Dutta (akadutta)

Changes

This patch unpacks packed instructions that cannot be co-issued with MFMAs into their unpacked equivalents, which can be co-issued with them. Only packed instructions that are covered by the MFMA's latency are unpacked; the rest are left packed.


Patch is 27.51 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/151704.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp (+367-30)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+58-1)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.h (+1)
  • (added) llvm/test/CodeGen/AMDGPU/unpack-non-coissue-insts-post-scheduler.ll (+116)
diff --git a/llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp b/llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp
index 4deb2a9485e4d..0f7009a6ea394 100644
--- a/llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp
@@ -28,6 +28,12 @@
 /// and a VGPR_16. If we use the VGPR_16 that corresponds to the lo16 bits of
 /// the VGPR_32, the COPY can be completely eliminated.
 ///
+/// Additionally, this pass also unpacks packed instructions (V_PK_MUL_F32 and V_PK_ADD_F32) 
+/// adjacent to MFMAs such that they can be co-issued.
+/// This helps with overlapping MFMA and certain vector instructions in machine schedules
+/// and is expected to improve performance.
+/// Only those packed instructions are unpacked that are overlapped by the MFMA latency.
+/// Rest should remain untouched.
 //===----------------------------------------------------------------------===//
 
 #include "GCNPreRAOptimizations.h"
@@ -38,7 +44,13 @@
 #include "llvm/CodeGen/LiveIntervals.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include "llvm/InitializePasses.h"
-
+#include "llvm/ADT/DenseSet.h"
+#include "SIInstrInfo.h"
+#include "llvm/CodeGen/RegisterScavenging.h"
+#include "llvm/InitializePasses.h"
+#include "GCNSchedStrategy.h"
+#include "llvm/CodeGen/MachineInstr.h"
+#include "llvm/CodeGen/MachineScheduler.h"
 using namespace llvm;
 
 #define DEBUG_TYPE "amdgpu-pre-ra-optimizations"
@@ -53,6 +65,16 @@ class GCNPreRAOptimizationsImpl {
   LiveIntervals *LIS;
 
   bool processReg(Register Reg);
+  bool createListOfPackedInstr(MachineInstr &BeginMI, DenseSet<MachineInstr *> &instrsToUnpack);
+  bool isUnpackingSupportedInstr(MachineInstr &MI) const;
+  void insertMI(MachineInstr &I);
+  uint16_t mapToUnpackedOpcode(MachineInstr &I);
+  SmallVector<MachineInstr *, 2> copyToVregAndInsertMI(MachineInstr &I,
+                                                       unsigned SGPRSrcPos);
+  SmallVector<MachineInstr *, 2>
+  insertUnpackedMI(MachineInstr &I, MachineOperand &DstMO, MachineOperand &LoSrcMO1,
+                   MachineOperand &LoSrcMO2, MachineOperand &HiSrcMO1, MachineOperand &HiSrcMO2,
+                   bool isVreg_64);
 
 public:
   GCNPreRAOptimizationsImpl(LiveIntervals *LS) : LIS(LS) {}
@@ -225,6 +247,313 @@ bool GCNPreRAOptimizationsImpl::processReg(Register Reg) {
   return true;
 }
 
+bool GCNPreRAOptimizationsImpl::isUnpackingSupportedInstr(MachineInstr &MI) const {
+  unsigned Opcode = MI.getOpcode();
+  switch (Opcode) {
+    case AMDGPU::V_PK_ADD_F32:
+    case AMDGPU::V_PK_MUL_F32:
+      return true;
+
+    default:
+      return false;
+
+  }
+}
+
+uint16_t GCNPreRAOptimizationsImpl::mapToUnpackedOpcode(MachineInstr &I) {
+  unsigned Opcode = I.getOpcode();
+  // use 64 bit encoding to allow use of VOP3 instructions.
+  // VOP3 instructions allow VOP3P source modifiers to be translated to VOP3
+  // e32 instructions are VOP2 and don't allow source modifiers
+  switch (Opcode) {
+    case AMDGPU::V_PK_ADD_F32:
+      return AMDGPU::V_ADD_F32_e64;
+    case AMDGPU::V_PK_MUL_F32:
+      return AMDGPU::V_MUL_F32_e64;
+    default:
+      return std::numeric_limits<uint16_t>::max();
+
+  }
+}
+
+SmallVector<MachineInstr *, 2>
+GCNPreRAOptimizationsImpl::copyToVregAndInsertMI(MachineInstr &I,
+                                                   unsigned SGPRSrcPos) {
+  SmallVector<MachineInstr *, 2> MIList;
+
+  MachineBasicBlock &MBB = *I.getParent();
+  MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();
+  MachineFunction &MF = *MBB.getParent();
+  const DebugLoc &DL = I.getDebugLoc();
+
+  Register TmpReg = MRI.createVirtualRegister(&AMDGPU::VReg_64_Align2RegClass);
+  MachineInstr *CopySGPR1 =
+      BuildMI(MBB, I, DL, TII->get(AMDGPU::COPY))
+          .addDef(TmpReg, RegState::Undef)
+          .addReg(I.getOperand(SGPRSrcPos).getReg(), 0, AMDGPU::sub0);
+  unsigned SubIdx = TRI->composeSubRegIndices(
+      AMDGPU::sub0, CopySGPR1->getOperand(0).getSubReg());
+  CopySGPR1->getOperand(0).setReg(CopySGPR1->getOperand(0).getReg());
+  CopySGPR1->getOperand(0).setSubReg(SubIdx);
+  LIS->InsertMachineInstrInMaps(*CopySGPR1);
+  MIList.push_back(CopySGPR1);
+
+  MachineInstr *CopySGPR2 =
+      BuildMI(MBB, I, DL, TII->get(AMDGPU::COPY))
+          .addDef(TmpReg)
+          .addReg(I.getOperand(SGPRSrcPos).getReg(), 0, AMDGPU::sub1);
+  SubIdx = TRI->composeSubRegIndices(AMDGPU::sub1,
+                                     CopySGPR2->getOperand(0).getSubReg());
+  CopySGPR2->getOperand(0).setReg(CopySGPR2->getOperand(0).getReg());
+  CopySGPR2->getOperand(0).setSubReg(SubIdx);
+  LIS->InsertMachineInstrInMaps(*CopySGPR2);
+  MIList.push_back(CopySGPR2);
+  return MIList;
+}
+
+bool GCNPreRAOptimizationsImpl::createListOfPackedInstr(
+    MachineInstr &BeginMI, DenseSet<MachineInstr *> &instrsToUnpack) {
+  auto *BB = BeginMI.getParent();
+  auto *MF = BB->getParent();
+  int NumInst = 0;
+
+  auto E = BB->end();
+  auto schedModel = TII->getSchedModel();
+  const MCSchedClassDesc *schedClassDesc = schedModel.resolveSchedClass(&BeginMI);
+  const int NumMFMACycles = schedModel.getWriteProcResBegin(schedClassDesc)->ReleaseAtCycle;
+  int totalCyclesBetweenCandidates = 0;
+  for (auto I = std::next(BeginMI.getIterator()); I != E; ++I) {
+    MachineInstr &Instr = *I;
+    const MCSchedClassDesc *instrSchedClassDesc = schedModel.resolveSchedClass(&Instr);
+    totalCyclesBetweenCandidates += schedModel.getWriteProcResBegin(instrSchedClassDesc)->ReleaseAtCycle;
+    if (Instr.isMetaInstruction())
+      continue;
+
+    if (Instr.isTerminator())
+      return false;
+    
+    if (totalCyclesBetweenCandidates > NumMFMACycles)
+      return false;
+
+    if ((isUnpackingSupportedInstr(Instr)) && TII->isNeverCoissue(Instr)) {
+      totalCyclesBetweenCandidates += 1;
+      instrsToUnpack.insert(&Instr);
+    }
+  }
+  return true;
+}
+
+SmallVector<MachineInstr *, 2> GCNPreRAOptimizationsImpl::insertUnpackedMI(
+    MachineInstr &I, MachineOperand &DstMO, MachineOperand &LoSrcMO1, MachineOperand &LoSrcMO2,
+    MachineOperand &HiSrcMO1, MachineOperand &HiSrcMO2, bool isVreg_64) {
+
+  SmallVector<MachineInstr *, 2> MIList;
+  MachineBasicBlock &MBB = *I.getParent();
+  MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();
+  MachineFunction &MF = *MBB.getParent();
+  const DebugLoc &DL = I.getDebugLoc();
+  Register DstReg = DstMO.getReg();
+
+  unsigned SrcSubIdx1 =
+      TRI->composeSubRegIndices(LoSrcMO1.getSubReg(), AMDGPU::sub0);
+  unsigned SrcSubIdx2 =
+      TRI->composeSubRegIndices(LoSrcMO2.getSubReg(), AMDGPU::sub0);
+  unsigned DestSubIdx =
+      TRI->composeSubRegIndices(DstMO.getSubReg(), AMDGPU::sub0);
+
+  const MCInstrDesc instrDesc = I.getDesc();
+
+  int clampIdx = AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::clamp);
+  int64_t clampVal = I.getOperand(clampIdx).getImm();
+
+  int src0_modifiers_Idx = AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::src0_modifiers);
+  int src1_modifiers_Idx = AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::src1_modifiers);
+  unsigned src0_Mods = I.getOperand(src0_modifiers_Idx).getImm();
+  unsigned src1_Mods = I.getOperand(src1_modifiers_Idx).getImm();
+
+  //don't worry about abs values. Packed instructions (VOP3P) do not support them
+  unsigned Lo_src0_mods = 0;
+  unsigned Lo_src1_mods = 0;
+  uint16_t unpackedOpcode = mapToUnpackedOpcode(I);
+  MachineInstrBuilder Op0L_Op1L = BuildMI(MBB, I, DL, TII->get(unpackedOpcode));
+  Op0L_Op1L.addDef(DstReg, 0, DestSubIdx); //vdst
+  if (src0_Mods & SISrcMods::OP_SEL_0) {
+    if (src0_Mods & SISrcMods::NEG) {
+      Lo_src0_mods |= SISrcMods::NEG;
+    }
+    Op0L_Op1L.addImm(Lo_src0_mods); //src0_modifiers
+    unsigned Src0SubIdx = TRI->composeSubRegIndices(LoSrcMO1.getSubReg(), AMDGPU::sub1);
+    Op0L_Op1L.addReg(LoSrcMO1.getReg(), 0, Src0SubIdx); //src0
+  }
+  else {
+    Op0L_Op1L.addImm(Lo_src0_mods); //src0_modifiers
+    unsigned Src0SubIdx = TRI->composeSubRegIndices(LoSrcMO1.getSubReg(), AMDGPU::sub0);
+    Op0L_Op1L.addReg(LoSrcMO1.getReg(), 0, Src0SubIdx); //src0 //if op_sel == 0, select register 0 of reg:sub0_sub1
+  }
+
+  if (src1_Mods & SISrcMods::OP_SEL_0) {
+    if (src1_Mods & SISrcMods::NEG) {
+      Lo_src1_mods |= SISrcMods::NEG;
+    }
+    Op0L_Op1L.addImm(Lo_src1_mods); //src0_modifiers
+    unsigned Src1SubIdx = TRI->composeSubRegIndices(LoSrcMO2.getSubReg(), AMDGPU::sub1);
+    Op0L_Op1L.addReg(LoSrcMO2.getReg(), 0, Src1SubIdx); //src0
+  }
+  else {
+    Op0L_Op1L.addImm(Lo_src1_mods); //src0_modifiers
+    unsigned Src1SubIdx = TRI->composeSubRegIndices(LoSrcMO2.getSubReg(), AMDGPU::sub0);
+    Op0L_Op1L.addReg(LoSrcMO2.getReg(), 0, Src1SubIdx); //src0 //if op_sel_hi == 0, select register 0 of reg:sub0_sub1
+  }
+  Op0L_Op1L.addImm(clampVal); //clamp
+  //packed instructions do not support output modifiers. safe to assign them 0 for this use case
+  Op0L_Op1L.addImm(0); //omod
+
+  if (isVreg_64) {
+    Op0L_Op1L->getOperand(0).setIsUndef();
+  }
+  else {
+    if (I.getOperand(0).isUndef()) {
+      Op0L_Op1L->getOperand(0).setIsUndef();
+    }
+  }
+
+  LIS->InsertMachineInstrInMaps(*Op0L_Op1L);
+
+  SrcSubIdx1 =
+      TRI->composeSubRegIndices(LoSrcMO1.getSubReg(), AMDGPU::sub1);
+  SrcSubIdx2 =
+      TRI->composeSubRegIndices(LoSrcMO2.getSubReg(), AMDGPU::sub1);
+  DestSubIdx =
+      TRI->composeSubRegIndices(DstMO.getSubReg(), AMDGPU::sub1);
+
+  //don't worry about abs values. Packed instructions (VOP3P) do not support them
+  unsigned Hi_src0_mods = 0;
+  unsigned Hi_src1_mods = 0;
+
+  MachineInstrBuilder Op0H_Op1H = BuildMI(MBB, I, DL, TII->get(unpackedOpcode));
+  Op0H_Op1H.addDef(DstReg, 0, DestSubIdx); //vdst
+  if (src0_Mods & SISrcMods::OP_SEL_1) {
+    if (src0_Mods & SISrcMods::NEG_HI) {
+      Hi_src0_mods |= SISrcMods::NEG;
+    }
+    Op0H_Op1H.addImm(Hi_src0_mods); //src0_modifiers
+    unsigned Src0SubIdx = TRI->composeSubRegIndices(HiSrcMO1.getSubReg(), AMDGPU::sub1);
+    Op0H_Op1H.addReg(HiSrcMO1.getReg(), 0, Src0SubIdx); //src0
+  }
+  else {
+    Op0H_Op1H.addImm(Hi_src0_mods); //src0_modifiers
+    unsigned Src0SubIdx = TRI->composeSubRegIndices(HiSrcMO1.getSubReg(), AMDGPU::sub0);
+    Op0H_Op1H.addReg(HiSrcMO1.getReg(), 0, Src0SubIdx); //src0 //if op_sel_hi == 0, select register 0 of reg:sub0_sub1
+  }
+
+  if (src1_Mods & SISrcMods::OP_SEL_1) {
+    if (src1_Mods & SISrcMods::NEG_HI) {
+      Hi_src1_mods |= SISrcMods::NEG;
+    }
+    Op0H_Op1H.addImm(Hi_src1_mods); //src0_modifiers
+    unsigned Src1SubIdx = TRI->composeSubRegIndices(HiSrcMO2.getSubReg(), AMDGPU::sub1);
+    Op0H_Op1H.addReg(HiSrcMO2.getReg(), 0, Src1SubIdx); //src0
+  }
+  else {
+    Op0H_Op1H.addImm(Hi_src1_mods); //src0_modifiers
+    unsigned Src1SubIdx = TRI->composeSubRegIndices(HiSrcMO2.getSubReg(), AMDGPU::sub0);
+    Op0H_Op1H.addReg(HiSrcMO2.getReg(), 0, Src1SubIdx); //src0 //if op_sel_hi == 0, select register 0 of reg:sub0_sub1
+  }
+  Op0H_Op1H.addImm(clampVal); //clamp
+  //packed instructions do not support output modifiers. safe to assign them 0 for this use case
+  Op0H_Op1H.addImm(0); //omod
+  LIS->InsertMachineInstrInMaps(*Op0H_Op1H);
+
+  if (I.getFlag(MachineInstr::MIFlag::NoFPExcept)) {
+    Op0L_Op1L->setFlag(MachineInstr::MIFlag::NoFPExcept);
+    Op0H_Op1H->setFlag(MachineInstr::MIFlag::NoFPExcept);
+  }
+  LIS->RemoveMachineInstrFromMaps(I);
+  I.eraseFromParent();
+  LIS->removeInterval(DstReg);
+  LIS->createAndComputeVirtRegInterval(DstReg);
+  MIList.push_back(Op0L_Op1L);
+  MIList.push_back(Op0H_Op1H);
+  return MIList;
+}
+
+void GCNPreRAOptimizationsImpl::insertMI(MachineInstr &I) {
+  MachineBasicBlock &MBB = *I.getParent();
+  MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();
+  MachineFunction &MF = *MBB.getParent();
+
+  Register DstReg = I.getOperand(0).getReg();
+  Register SrcReg1 = I.getOperand(2).getReg();
+  Register SrcReg2 = I.getOperand(4).getReg();
+
+  MachineOperand &DstMO = I.getOperand(0);
+  MachineOperand &SrcMO1 = I.getOperand(2);
+  MachineOperand &SrcMO2 = I.getOperand(4);
+
+  MachineBasicBlock::iterator MII = I;
+  const DebugLoc &DL = I.getDebugLoc();
+  const TargetRegisterClass *DstRC = MRI.getRegClass(I.getOperand(0).getReg());
+  const TargetRegisterClass *Src0RC = MRI.getRegClass(I.getOperand(2).getReg());
+  const TargetRegisterClass *Src1RC = MRI.getRegClass(I.getOperand(4).getReg());
+  const TargetRegisterClass *Src0SubRC =
+      TRI->getSubRegisterClass(Src0RC, AMDGPU::sub0);
+  const TargetRegisterClass *SrcRC = TRI->getSubClassWithSubReg(Src0RC, 1);
+
+  if ((Src1RC->getID() == AMDGPU::SGPR_64RegClassID) ||
+      (Src0RC->getID() == AMDGPU::SGPR_64RegClassID)) {
+    if (Src1RC->getID() == AMDGPU::SGPR_64RegClassID) {
+      // try with sgpr32
+      SmallVector<MachineInstr *, 2> copyInstrs = copyToVregAndInsertMI(I, 4);
+      MachineInstr *CopySGPR1 = copyInstrs[0];
+      MachineInstr *CopySGPR2 = copyInstrs[1];
+
+      if (DstRC->getID() == AMDGPU::VReg_64_Align2RegClassID) {
+        SmallVector<MachineInstr *, 2> unpackedInstrs = insertUnpackedMI(
+            I, DstMO, SrcMO1, CopySGPR1->getOperand(0), SrcMO1,
+            CopySGPR2->getOperand(0), true);
+        unpackedInstrs[0]->addRegisterKilled(unpackedInstrs[0]->getOperand(2).getReg(), TRI);
+        unpackedInstrs[1]->addRegisterKilled(unpackedInstrs[1]->getOperand(2).getReg(), TRI);
+      } else {
+        SmallVector<MachineInstr *, 2> unpackedInstrs = insertUnpackedMI(
+            I, DstMO, SrcMO1, CopySGPR1->getOperand(0), SrcMO1,
+            CopySGPR2->getOperand(0), false);
+        unpackedInstrs[0]->addRegisterKilled(unpackedInstrs[0]->getOperand(2).getReg(), TRI);
+        unpackedInstrs[1]->addRegisterKilled(unpackedInstrs[1]->getOperand(2).getReg(), TRI);
+      }
+    }
+    else {
+      SmallVector<MachineInstr *, 2> copyInstrs = copyToVregAndInsertMI(I, 2);
+      MachineInstr *CopySGPR1 = copyInstrs[0];
+      MachineInstr *CopySGPR2 = copyInstrs[1];
+
+      if (DstRC->getID() == AMDGPU::VReg_64_Align2RegClassID) {
+        SmallVector<MachineInstr *, 2> unpackedInstrs = insertUnpackedMI(
+            I, DstMO, CopySGPR1->getOperand(0), SrcMO2, CopySGPR2->getOperand(0), SrcMO2, true);
+        unpackedInstrs[0]->addRegisterKilled(unpackedInstrs[0]->getOperand(1).getReg(), TRI);
+        unpackedInstrs[1]->addRegisterKilled(unpackedInstrs[1]->getOperand(1).getReg(), TRI);
+      } else {
+        SmallVector<MachineInstr *, 2> unpackedInstrs = insertUnpackedMI(
+            I, DstMO, CopySGPR1->getOperand(0), SrcMO2, CopySGPR2->getOperand(0), SrcMO2, false);
+        unpackedInstrs[0]->addRegisterKilled(unpackedInstrs[0]->getOperand(1).getReg(), TRI);
+        unpackedInstrs[1]->addRegisterKilled(unpackedInstrs[1]->getOperand(1).getReg(), TRI);
+      }
+    }
+    return;
+  }
+
+  if (DstRC->getID() == AMDGPU::VReg_512_Align2RegClassID) {
+    SmallVector<MachineInstr *, 2> unpackedInstrs = insertUnpackedMI(
+            I, DstMO, SrcMO1, SrcMO2, SrcMO1,
+            SrcMO2, false);
+  }
+  else if (DstRC->getID() == AMDGPU::VReg_64_Align2RegClassID) {
+    SmallVector<MachineInstr *, 2> unpackedInstrs = insertUnpackedMI(
+            I, DstMO, SrcMO1, SrcMO2, SrcMO1,
+            SrcMO2, true);
+  }
+  return;
+}
+
 bool GCNPreRAOptimizationsLegacy::runOnMachineFunction(MachineFunction &MF) {
   if (skipFunction(MF.getFunction()))
     return false;
@@ -260,38 +589,46 @@ bool GCNPreRAOptimizationsImpl::run(MachineFunction &MF) {
     Changed |= processReg(Reg);
   }
 
-  if (!ST.useRealTrue16Insts())
-    return Changed;
-
   // Add RA hints to improve True16 COPY elimination.
-  for (const MachineBasicBlock &MBB : MF) {
-    for (const MachineInstr &MI : MBB) {
-      if (MI.getOpcode() != AMDGPU::COPY)
-        continue;
-      Register Dst = MI.getOperand(0).getReg();
-      Register Src = MI.getOperand(1).getReg();
-      if (Dst.isVirtual() &&
-          MRI->getRegClass(Dst) == &AMDGPU::VGPR_16RegClass &&
-          Src.isPhysical() &&
-          TRI->getRegClassForReg(*MRI, Src) == &AMDGPU::VGPR_32RegClass)
-        MRI->setRegAllocationHint(Dst, 0, TRI->getSubReg(Src, AMDGPU::lo16));
-      if (Src.isVirtual() &&
-          MRI->getRegClass(Src) == &AMDGPU::VGPR_16RegClass &&
-          Dst.isPhysical() &&
-          TRI->getRegClassForReg(*MRI, Dst) == &AMDGPU::VGPR_32RegClass)
-        MRI->setRegAllocationHint(Src, 0, TRI->getSubReg(Dst, AMDGPU::lo16));
-      if (!Dst.isVirtual() || !Src.isVirtual())
-        continue;
-      if (MRI->getRegClass(Dst) == &AMDGPU::VGPR_32RegClass &&
-          MRI->getRegClass(Src) == &AMDGPU::VGPR_16RegClass) {
-        MRI->setRegAllocationHint(Dst, AMDGPURI::Size32, Src);
-        MRI->setRegAllocationHint(Src, AMDGPURI::Size16, Dst);
+  // Unpack packed instructions to overlap MFMAs. This allows the compiler to co-issue unpacked instructions with MFMA
+  for (MachineBasicBlock &MBB : MF) {
+    DenseSet<MachineInstr *> instrsToUnpack;
+    for (MachineInstr &MI : MBB) {
+      if (SIInstrInfo::isMFMA(MI)){
+        createListOfPackedInstr(MI, instrsToUnpack);
+      }
+      if (ST.useRealTrue16Insts()){
+        if (MI.getOpcode() != AMDGPU::COPY)
+          continue;
+        Register Dst = MI.getOperand(0).getReg();
+        Register Src = MI.getOperand(1).getReg();
+        if (Dst.isVirtual() &&
+            MRI->getRegClass(Dst) == &AMDGPU::VGPR_16RegClass &&
+            Src.isPhysical() &&
+            TRI->getRegClassForReg(*MRI, Src) == &AMDGPU::VGPR_32RegClass)
+          MRI->setRegAllocationHint(Dst, 0, TRI->getSubReg(Src, AMDGPU::lo16));
+        if (Src.isVirtual() &&
+            MRI->getRegClass(Src) == &AMDGPU::VGPR_16RegClass &&
+            Dst.isPhysical() &&
+            TRI->getRegClassForReg(*MRI, Dst) == &AMDGPU::VGPR_32RegClass)
+          MRI->setRegAllocationHint(Src, 0, TRI->getSubReg(Dst, AMDGPU::lo16));
+        if (!Dst.isVirtual() || !Src.isVirtual())
+          continue;
+        if (MRI->getRegClass(Dst) == &AMDGPU::VGPR_32RegClass &&
+            MRI->getRegClass(Src) == &AMDGPU::VGPR_16RegClass) {
+          MRI->setRegAllocationHint(Dst, AMDGPURI::Size32, Src);
+          MRI->setRegAllocationHint(Src, AMDGPURI::Size16, Dst);
+        }
+        if (MRI->getRegClass(Dst) == &AMDGPU::VGPR_16RegClass &&
+            MRI->getRegClass(Src) == &AMDGPU::VGPR_32RegClass)
+          MRI->setRegAllocationHint(Dst, AMDGPURI::Size16, Src);
       }
-      if (MRI->getRegClass(Dst) == &AMDGPU::VGPR_16RegClass &&
-          MRI->getRegClass(Src) == &AMDGPU::VGPR_32RegClass)
-        MRI->setRegAllocationHint(Dst, AMDGPURI::Size16, Src);
+    }
+    
+    if (!instrsToUnpack.empty()) {
+      for (MachineInstr *MI : instrsToUnpack) 
+        insertMI(*MI);
     }
   }
-
   return Changed;
 }
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index c2da937552240..5562ff590b71d 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -15,7 +15,6 @@
 #include "AMDGPU.h"
 #include "AMDGPUInstrInfo.h"
 #include "GCNHazardRecognizer.h"
-#include "GCNSubtarget.h"
 #include "SIMachineFunctionInfo.h"
 #include "Utils/AMDGPUBaseInfo.h"
 #include "llvm/Analysis/ValueTracking.h"
@@ -6173,6 +6172,64 @@ bool SIInstrInfo::isOperandLegal(const MachineInstr &MI, unsigned OpIdx,
   return isImmOperandLegal(MI, OpIdx, *MO);
 }
 
+bool SIInstrInfo::isNeverCoissue(MachineInstr &MI) const {
+  bool IsGFX950Only = ST.hasGFX950Insts();
+  if (!IsGFX950Only)
+    return false;
+
+  if (!isVALU(MI))
+    return false;
+
+  // V_COS, V_EXP, V_RCP, etc.
+  if (isTRANS(MI))
+    return true;
+
+  // DOT2, DOT2C, DOT4, etc.
+  if (isDOT(MI))
+    return true;
+
+  // MFMA, SMFMA
+  if (isMFMA(MI))
+    return true;
+
+  unsigned Opcode = MI.getOpcode();
+  switch (Opcode) {
+    case AMDGPU::V_CVT_PK_BF8_F32_e64:
+    case AMDGPU::V_CVT_PK_FP8_F32_e64:
+    case AMDGPU::V_MQSAD_PK_U16_U8_e64:
+    case AMDGPU::V_MQSAD_U32_U8_e64:
+    case AMDGPU::V_PK_ADD_F16:
+    case AMDGPU::V_PK_ADD_F32:
+    case AMDGPU::V_PK_ADD_I16:
+    case AMDGPU::V_PK_ADD_U16:
+    case AMDGPU::V_PK_ASHRREV_I16:
+    case AMDGPU::V_PK_FMA_F16:
+    case AMDGPU::V_PK_FMA_F32:
+    case AMDGPU::V_PK_FMAC_F16_e32:
+    case AMDGPU::V_PK_FMAC_F16_e64:
+    case AMDGPU::V_PK_LSHLREV_B16:
+    case AMDGPU::V_PK_LSHRREV_B16:
+    case AMDGPU::V_PK_MAD_I16:
+    case AMDGPU::V_PK_MAD_U16:
+    case AMDGPU...
[truncated]

@akadutta
Contributor Author

akadutta commented Aug 1, 2025

Requesting feedback on this patch. @hidekisaito @bcahoon @jrbyrnes @arsenm @ronlieb

@bcahoon bcahoon requested review from arsenm, jrbyrnes and kerbowa August 1, 2025 17:54
#include "llvm/InitializePasses.h"
#include "GCNSchedStrategy.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineScheduler.h"
Contributor

Include files should be sorted alphabetically by category. It is probably useful to run clang-format on your changes.

if (SIInstrInfo::isMFMA(MI)){
createListOfPackedInstr(MI, instrsToUnpack);
}
if (ST.useRealTrue16Insts()){
Contributor

Can you use an early exit here?

Contributor Author

There was an early exit there for True16. However, keeping it would mean duplicating the loop, since the scan over the machine function that detects the relevant packed instructions needs to run anyway. This way, it's just one block of code.

MRI->setRegAllocationHint(Dst, AMDGPURI::Size16, Src);
}

if (!instrsToUnpack.empty()) {
Contributor

Why do you create instrsToUnpack at the top of the loop, but do the insertion at the bottom?

Contributor Author

This avoids inserting into the BB while I'm iterating over it. If I perform multiple insertions while iterating over the BB, that breaks application compilation, since we are also deleting the packed instruction from the BB.

Contributor

Maybe save the instructions to be replaced in a data structure. Then, after the loop iterate over the saved instructions to replace them? I think that would avoid the problems with adding/removing while iterating.

Contributor Author

That's what I am doing right now, unless I misunderstood the comment. I'm using a DenseSet to hold the instructions.

DenseSet<MachineInstr *> instrsToUnpack;
for (MachineInstr &MI : MBB) {
if (SIInstrInfo::isMFMA(MI)){
createListOfPackedInstr(MI, instrsToUnpack);
Contributor

I'm not sure you really need to create a list of instructions to unpack by scanning for each MFMA. It may be possible to keep some small amount of state and replace the instructions with a single scan. I think you may just need totalCyclesBetweenCandidates?
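
For illustration, a rough sketch of that single-scan idea, keeping only a running cycle budget (names and details assumed; this is not code from the patch):

  int RemainingCycles = 0; // budget left under the most recent MFMA's latency
  for (MachineInstr &MI : MBB) {
    if (SIInstrInfo::isMFMA(MI)) {
      const MCSchedClassDesc *SC = SchedModel.resolveSchedClass(&MI);
      RemainingCycles = SchedModel.getWriteProcResBegin(SC)->ReleaseAtCycle;
      continue;
    }
    RemainingCycles -= SchedModel.getWriteProcResBegin(
                           SchedModel.resolveSchedClass(&MI))->ReleaseAtCycle;
    if (RemainingCycles > 0 && isUnpackingSupportedInstr(MI) &&
        TII->isNeverCoissue(MI))
      InstrsToUnpack.insert(&MI);
  }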

return MIList;
}

void GCNPreRAOptimizationsImpl::insertMI(MachineInstr &I) {
Contributor

Does this function replace the packed MI with the unpacked version? The name "insertMI" seems very generic.

@bcahoon
Contributor

bcahoon commented Aug 4, 2025

Some more tests are needed, especially MIR tests that can help with test coverage.

Comment on lines 6176 to 6178
bool IsGFX950Only = ST.hasGFX950Insts();
if (!IsGFX950Only)
return false;
Contributor

This shouldn't require target checks; ideally it would be derived from the used resources in the sched model.

Member

I'd like to suggest limiting the scope of this patch to MI300X and MI350 for now and doing local testing, with the promise of expanding this as suggested in a forthcoming patch.

@@ -0,0 +1,116 @@
; TODO: change variable names. Make test smaller if possible
Contributor

This needs to be much smaller, use named values, and have a run line.

This also requires MIR tests

int NumInst = 0;

auto E = BB->end();
auto schedModel = TII->getSchedModel();
Contributor

Capitalize the variables here and elsewhere.

int clampIdx = AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::clamp);
int64_t clampVal = I.getOperand(clampIdx).getImm();

int src0_modifiers_Idx = AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::src0_modifiers);
Contributor

Suggested change
int src0_modifiers_Idx = AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::src0_modifiers);
int Src0ModifierxIdx = AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::src0_modifiers);

I think the LLVM style guide suggests something like this.

Contributor Author

Thanks JP. I'll update the variable names

Comment on lines 381 to 383
if (src0_Mods & SISrcMods::NEG) {
Lo_src0_mods |= SISrcMods::NEG;
}
Contributor

Suggested change
if (src0_Mods & SISrcMods::NEG) {
Lo_src0_mods |= SISrcMods::NEG;
}
if (src0_Mods & SISrcMods::NEG)
Lo_src0_mods |= SISrcMods::NEG;

Single-statement ifs do not require curly braces. Here and elsewhere.

unsigned Lo_src0_mods = 0;
unsigned Lo_src1_mods = 0;
uint16_t unpackedOpcode = mapToUnpackedOpcode(I);
MachineInstrBuilder Op0L_Op1L = BuildMI(MBB, I, DL, TII->get(unpackedOpcode));
Contributor

Setting up Op0L_Op1L and Op0H_Op1H seems to be very similar / duplicated code. Can this be pulled out to a function and reused for both cases?

Contributor Author

I'll give it some more thought. As of now, Op0L_Op1L and Op0H_Op1H are designed to be separate instructions representing the lower and upper dwords/16 bits of the packed instruction, and as such they need to be separately inserted into the maps and register intervals.
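
For reference, one possible shape for a shared builder, parameterized on which half is being produced (hypothetical signature; operand and modifier handling elided):

  // Build either the low or the high half of the unpacked pair; the caller
  // still inserts each result into the slot-index maps and live intervals.
  MachineInstrBuilder createUnpackedHalf(MachineBasicBlock &MBB,
                                         MachineInstr &I, const DebugLoc &DL,
                                         uint16_t UnpackedOpc, bool IsHiBits) {
    unsigned Sub = IsHiBits ? AMDGPU::sub1 : AMDGPU::sub0;
    MachineOperand &DstMO = I.getOperand(0);
    MachineInstrBuilder MIB = BuildMI(MBB, I, DL, TII->get(UnpackedOpc));
    MIB.addDef(DstMO.getReg(), 0,
               TRI->composeSubRegIndices(DstMO.getSubReg(), Sub));
    // ... add src0/src1 with the matching op_sel/neg bits, then clamp and omod ...
    return MIB;
  }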


github-actions bot commented Aug 8, 2025

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff HEAD~1 HEAD --extensions cpp,h -- llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp llvm/lib/Target/AMDGPU/SIInstrInfo.cpp llvm/lib/Target/AMDGPU/SIInstrInfo.h
View the diff from clang-format here.
diff --git a/llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp b/llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp
index f56d73e99..6fa0efca3 100644
--- a/llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp
@@ -28,29 +28,28 @@
 /// and a VGPR_16. If we use the VGPR_16 that corresponds to the lo16 bits of
 /// the VGPR_32, the COPY can be completely eliminated.
 ///
-/// Additionally, this pass also unpacks packed instructions (V_PK_MUL_F32 and V_PK_ADD_F32) 
-/// adjacent to MFMAs such that they can be co-issued.
-/// This helps with overlapping MFMA and certain vector instructions in machine schedules
+/// Additionally, this pass also unpacks packed instructions (V_PK_MUL_F32 and
+/// V_PK_ADD_F32) adjacent to MFMAs such that they can be co-issued. This helps
+/// with overlapping MFMA and certain vector instructions in machine schedules
 /// and is expected to improve performance.
-/// Only those packed instructions are unpacked that are overlapped by the MFMA latency.
-/// Rest should remain untouched.
+/// Only those packed instructions are unpacked that are overlapped by the MFMA
+/// latency. Rest should remain untouched.
 //===----------------------------------------------------------------------===//
 
 #include "GCNPreRAOptimizations.h"
 #include "AMDGPU.h"
+#include "GCNSchedStrategy.h"
 #include "GCNSubtarget.h"
 #include "MCTargetDesc/AMDGPUMCTargetDesc.h"
+#include "SIInstrInfo.h"
 #include "SIRegisterInfo.h"
+#include "llvm/ADT/DenseSet.h"
 #include "llvm/CodeGen/LiveIntervals.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
-#include "llvm/InitializePasses.h"
-#include "llvm/ADT/DenseSet.h"
-#include "SIInstrInfo.h"
-#include "llvm/CodeGen/RegisterScavenging.h"
-#include "llvm/InitializePasses.h"
-#include "GCNSchedStrategy.h"
 #include "llvm/CodeGen/MachineInstr.h"
 #include "llvm/CodeGen/MachineScheduler.h"
+#include "llvm/CodeGen/RegisterScavenging.h"
+#include "llvm/InitializePasses.h"
 using namespace llvm;
 
 #define DEBUG_TYPE "amdgpu-pre-ra-optimizations"
@@ -65,15 +64,17 @@ private:
   LiveIntervals *LIS;
 
   bool processReg(Register Reg);
-  bool createListOfPackedInstr(MachineInstr &BeginMI, DenseSet<MachineInstr *> &instrsToUnpack);
+  bool createListOfPackedInstr(MachineInstr &BeginMI,
+                               DenseSet<MachineInstr *> &instrsToUnpack);
   bool isUnpackingSupportedInstr(MachineInstr &MI) const;
   void insertMI(MachineInstr &I);
   uint16_t mapToUnpackedOpcode(MachineInstr &I);
   SmallVector<MachineInstr *, 2> copyToVregAndInsertMI(MachineInstr &I,
                                                        unsigned SGPRSrcPos);
   SmallVector<MachineInstr *, 2>
-  insertUnpackedMI(MachineInstr &I, MachineOperand &DstMO, MachineOperand &LoSrcMO1,
-                   MachineOperand &LoSrcMO2, MachineOperand &HiSrcMO1, MachineOperand &HiSrcMO2,
+  insertUnpackedMI(MachineInstr &I, MachineOperand &DstMO,
+                   MachineOperand &LoSrcMO1, MachineOperand &LoSrcMO2,
+                   MachineOperand &HiSrcMO1, MachineOperand &HiSrcMO2,
                    bool isVreg_64);
 
 public:
@@ -247,16 +248,16 @@ bool GCNPreRAOptimizationsImpl::processReg(Register Reg) {
   return true;
 }
 
-bool GCNPreRAOptimizationsImpl::isUnpackingSupportedInstr(MachineInstr &MI) const {
+bool GCNPreRAOptimizationsImpl::isUnpackingSupportedInstr(
+    MachineInstr &MI) const {
   unsigned Opcode = MI.getOpcode();
   switch (Opcode) {
-    case AMDGPU::V_PK_ADD_F32:
-    case AMDGPU::V_PK_MUL_F32:
-      return true;
-
-    default:
-      return false;
+  case AMDGPU::V_PK_ADD_F32:
+  case AMDGPU::V_PK_MUL_F32:
+    return true;
 
+  default:
+    return false;
   }
 }
 
@@ -266,19 +267,18 @@ uint16_t GCNPreRAOptimizationsImpl::mapToUnpackedOpcode(MachineInstr &I) {
   // VOP3 instructions allow VOP3P source modifiers to be translated to VOP3
   // e32 instructions are VOP2 and don't allow source modifiers
   switch (Opcode) {
-    case AMDGPU::V_PK_ADD_F32:
-      return AMDGPU::V_ADD_F32_e64;
-    case AMDGPU::V_PK_MUL_F32:
-      return AMDGPU::V_MUL_F32_e64;
-    default:
-      return std::numeric_limits<uint16_t>::max();
-
+  case AMDGPU::V_PK_ADD_F32:
+    return AMDGPU::V_ADD_F32_e64;
+  case AMDGPU::V_PK_MUL_F32:
+    return AMDGPU::V_MUL_F32_e64;
+  default:
+    return std::numeric_limits<uint16_t>::max();
   }
 }
 
 SmallVector<MachineInstr *, 2>
 GCNPreRAOptimizationsImpl::copyToVregAndInsertMI(MachineInstr &I,
-                                                   unsigned SGPRSrcPos) {
+                                                 unsigned SGPRSrcPos) {
   SmallVector<MachineInstr *, 2> MIList;
 
   MachineBasicBlock &MBB = *I.getParent();
@@ -319,19 +319,23 @@ bool GCNPreRAOptimizationsImpl::createListOfPackedInstr(
 
   auto E = BB->end();
   auto SchedModel = TII->getSchedModel();
-  const MCSchedClassDesc *SchedClassDesc = SchedModel.resolveSchedClass(&BeginMI);
-  const int NumMFMACycles = SchedModel.getWriteProcResBegin(SchedClassDesc)->ReleaseAtCycle;
+  const MCSchedClassDesc *SchedClassDesc =
+      SchedModel.resolveSchedClass(&BeginMI);
+  const int NumMFMACycles =
+      SchedModel.getWriteProcResBegin(SchedClassDesc)->ReleaseAtCycle;
   int TotalCyclesBetweenCandidates = 0;
   for (auto I = std::next(BeginMI.getIterator()); I != E; ++I) {
     MachineInstr &Instr = *I;
-    const MCSchedClassDesc *instrSchedClassDesc = SchedModel.resolveSchedClass(&Instr);
-    TotalCyclesBetweenCandidates += SchedModel.getWriteProcResBegin(instrSchedClassDesc)->ReleaseAtCycle;
+    const MCSchedClassDesc *instrSchedClassDesc =
+        SchedModel.resolveSchedClass(&Instr);
+    TotalCyclesBetweenCandidates +=
+        SchedModel.getWriteProcResBegin(instrSchedClassDesc)->ReleaseAtCycle;
     if (Instr.isMetaInstruction())
       continue;
 
     if (Instr.isTerminator())
       return false;
-    
+
     if (TotalCyclesBetweenCandidates > NumMFMACycles)
       return false;
 
@@ -344,8 +348,9 @@ bool GCNPreRAOptimizationsImpl::createListOfPackedInstr(
 }
 
 SmallVector<MachineInstr *, 2> GCNPreRAOptimizationsImpl::insertUnpackedMI(
-    MachineInstr &I, MachineOperand &DstMO, MachineOperand &LoSrcMO1, MachineOperand &LoSrcMO2,
-    MachineOperand &HiSrcMO1, MachineOperand &HiSrcMO2, bool isVreg_64) {
+    MachineInstr &I, MachineOperand &DstMO, MachineOperand &LoSrcMO1,
+    MachineOperand &LoSrcMO2, MachineOperand &HiSrcMO1,
+    MachineOperand &HiSrcMO2, bool isVreg_64) {
 
   SmallVector<MachineInstr *, 2> MIList;
   MachineBasicBlock &MBB = *I.getParent();
@@ -363,103 +368,118 @@ SmallVector<MachineInstr *, 2> GCNPreRAOptimizationsImpl::insertUnpackedMI(
 
   const MCInstrDesc instrDesc = I.getDesc();
 
-  int clampIdx = AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::clamp);
+  int clampIdx =
+      AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::clamp);
   int64_t clampVal = I.getOperand(clampIdx).getImm();
 
-  int src0_modifiers_Idx = AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::src0_modifiers);
-  int src1_modifiers_Idx = AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::src1_modifiers);
+  int src0_modifiers_Idx =
+      AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::src0_modifiers);
+  int src1_modifiers_Idx =
+      AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::src1_modifiers);
   unsigned src0_Mods = I.getOperand(src0_modifiers_Idx).getImm();
   unsigned src1_Mods = I.getOperand(src1_modifiers_Idx).getImm();
 
-  //don't worry about abs values. Packed instructions (VOP3P) do not support them
+  // don't worry about abs values. Packed instructions (VOP3P) do not support
+  // them
   unsigned Lo_src0_mods = 0;
   unsigned Lo_src1_mods = 0;
   uint16_t unpackedOpcode = mapToUnpackedOpcode(I);
   MachineInstrBuilder Op0L_Op1L = BuildMI(MBB, I, DL, TII->get(unpackedOpcode));
-  Op0L_Op1L.addDef(DstReg, 0, DestSubIdx); //vdst
+  Op0L_Op1L.addDef(DstReg, 0, DestSubIdx); // vdst
   if (src0_Mods & SISrcMods::OP_SEL_0) {
     if (src0_Mods & SISrcMods::NEG) {
       Lo_src0_mods |= SISrcMods::NEG;
     }
-    Op0L_Op1L.addImm(Lo_src0_mods); //src0_modifiers
-    unsigned Src0SubIdx = TRI->composeSubRegIndices(LoSrcMO1.getSubReg(), AMDGPU::sub1);
-    Op0L_Op1L.addReg(LoSrcMO1.getReg(), 0, Src0SubIdx); //src0
-  }
-  else {
-    Op0L_Op1L.addImm(Lo_src0_mods); //src0_modifiers
-    unsigned Src0SubIdx = TRI->composeSubRegIndices(LoSrcMO1.getSubReg(), AMDGPU::sub0);
-    Op0L_Op1L.addReg(LoSrcMO1.getReg(), 0, Src0SubIdx); //src0 //if op_sel == 0, select register 0 of reg:sub0_sub1
+    Op0L_Op1L.addImm(Lo_src0_mods); // src0_modifiers
+    unsigned Src0SubIdx =
+        TRI->composeSubRegIndices(LoSrcMO1.getSubReg(), AMDGPU::sub1);
+    Op0L_Op1L.addReg(LoSrcMO1.getReg(), 0, Src0SubIdx); // src0
+  } else {
+    Op0L_Op1L.addImm(Lo_src0_mods); // src0_modifiers
+    unsigned Src0SubIdx =
+        TRI->composeSubRegIndices(LoSrcMO1.getSubReg(), AMDGPU::sub0);
+    Op0L_Op1L.addReg(LoSrcMO1.getReg(), 0,
+                     Src0SubIdx); // src0 //if op_sel == 0, select register 0 of
+                                  // reg:sub0_sub1
   }
 
   if (src1_Mods & SISrcMods::OP_SEL_0) {
     if (src1_Mods & SISrcMods::NEG) {
       Lo_src1_mods |= SISrcMods::NEG;
     }
-    Op0L_Op1L.addImm(Lo_src1_mods); //src0_modifiers
-    unsigned Src1SubIdx = TRI->composeSubRegIndices(LoSrcMO2.getSubReg(), AMDGPU::sub1);
-    Op0L_Op1L.addReg(LoSrcMO2.getReg(), 0, Src1SubIdx); //src0
+    Op0L_Op1L.addImm(Lo_src1_mods); // src0_modifiers
+    unsigned Src1SubIdx =
+        TRI->composeSubRegIndices(LoSrcMO2.getSubReg(), AMDGPU::sub1);
+    Op0L_Op1L.addReg(LoSrcMO2.getReg(), 0, Src1SubIdx); // src0
+  } else {
+    Op0L_Op1L.addImm(Lo_src1_mods); // src0_modifiers
+    unsigned Src1SubIdx =
+        TRI->composeSubRegIndices(LoSrcMO2.getSubReg(), AMDGPU::sub0);
+    Op0L_Op1L.addReg(LoSrcMO2.getReg(), 0,
+                     Src1SubIdx); // src0 //if op_sel_hi == 0, select register 0
+                                  // of reg:sub0_sub1
   }
-  else {
-    Op0L_Op1L.addImm(Lo_src1_mods); //src0_modifiers
-    unsigned Src1SubIdx = TRI->composeSubRegIndices(LoSrcMO2.getSubReg(), AMDGPU::sub0);
-    Op0L_Op1L.addReg(LoSrcMO2.getReg(), 0, Src1SubIdx); //src0 //if op_sel_hi == 0, select register 0 of reg:sub0_sub1
-  }
-  Op0L_Op1L.addImm(clampVal); //clamp
-  //packed instructions do not support output modifiers. safe to assign them 0 for this use case
-  Op0L_Op1L.addImm(0); //omod
+  Op0L_Op1L.addImm(clampVal); // clamp
+  // packed instructions do not support output modifiers. safe to assign them 0
+  // for this use case
+  Op0L_Op1L.addImm(0); // omod
 
   if (isVreg_64) {
     Op0L_Op1L->getOperand(0).setIsUndef();
-  }
-  else if (I.getOperand(0).isUndef()){
+  } else if (I.getOperand(0).isUndef()) {
     Op0L_Op1L->getOperand(0).setIsUndef();
   }
 
   LIS->InsertMachineInstrInMaps(*Op0L_Op1L);
 
-  SrcSubIdx1 =
-      TRI->composeSubRegIndices(LoSrcMO1.getSubReg(), AMDGPU::sub1);
-  SrcSubIdx2 =
-      TRI->composeSubRegIndices(LoSrcMO2.getSubReg(), AMDGPU::sub1);
-  DestSubIdx =
-      TRI->composeSubRegIndices(DstMO.getSubReg(), AMDGPU::sub1);
+  SrcSubIdx1 = TRI->composeSubRegIndices(LoSrcMO1.getSubReg(), AMDGPU::sub1);
+  SrcSubIdx2 = TRI->composeSubRegIndices(LoSrcMO2.getSubReg(), AMDGPU::sub1);
+  DestSubIdx = TRI->composeSubRegIndices(DstMO.getSubReg(), AMDGPU::sub1);
 
-  //don't worry about abs values. Packed instructions (VOP3P) do not support them
+  // don't worry about abs values. Packed instructions (VOP3P) do not support
+  // them
   unsigned Hi_src0_mods = 0;
   unsigned Hi_src1_mods = 0;
 
   MachineInstrBuilder Op0H_Op1H = BuildMI(MBB, I, DL, TII->get(unpackedOpcode));
-  Op0H_Op1H.addDef(DstReg, 0, DestSubIdx); //vdst
+  Op0H_Op1H.addDef(DstReg, 0, DestSubIdx); // vdst
   if (src0_Mods & SISrcMods::OP_SEL_1) {
     if (src0_Mods & SISrcMods::NEG_HI) {
       Hi_src0_mods |= SISrcMods::NEG;
     }
-    Op0H_Op1H.addImm(Hi_src0_mods); //src0_modifiers
-    unsigned Src0SubIdx = TRI->composeSubRegIndices(HiSrcMO1.getSubReg(), AMDGPU::sub1);
-    Op0H_Op1H.addReg(HiSrcMO1.getReg(), 0, Src0SubIdx); //src0
-  }
-  else {
-    Op0H_Op1H.addImm(Hi_src0_mods); //src0_modifiers
-    unsigned Src0SubIdx = TRI->composeSubRegIndices(HiSrcMO1.getSubReg(), AMDGPU::sub0);
-    Op0H_Op1H.addReg(HiSrcMO1.getReg(), 0, Src0SubIdx); //src0 //if op_sel_hi == 0, select register 0 of reg:sub0_sub1
+    Op0H_Op1H.addImm(Hi_src0_mods); // src0_modifiers
+    unsigned Src0SubIdx =
+        TRI->composeSubRegIndices(HiSrcMO1.getSubReg(), AMDGPU::sub1);
+    Op0H_Op1H.addReg(HiSrcMO1.getReg(), 0, Src0SubIdx); // src0
+  } else {
+    Op0H_Op1H.addImm(Hi_src0_mods); // src0_modifiers
+    unsigned Src0SubIdx =
+        TRI->composeSubRegIndices(HiSrcMO1.getSubReg(), AMDGPU::sub0);
+    Op0H_Op1H.addReg(HiSrcMO1.getReg(), 0,
+                     Src0SubIdx); // src0 //if op_sel_hi == 0, select register 0
+                                  // of reg:sub0_sub1
   }
 
   if (src1_Mods & SISrcMods::OP_SEL_1) {
     if (src1_Mods & SISrcMods::NEG_HI) {
       Hi_src1_mods |= SISrcMods::NEG;
     }
-    Op0H_Op1H.addImm(Hi_src1_mods); //src0_modifiers
-    unsigned Src1SubIdx = TRI->composeSubRegIndices(HiSrcMO2.getSubReg(), AMDGPU::sub1);
-    Op0H_Op1H.addReg(HiSrcMO2.getReg(), 0, Src1SubIdx); //src0
+    Op0H_Op1H.addImm(Hi_src1_mods); // src0_modifiers
+    unsigned Src1SubIdx =
+        TRI->composeSubRegIndices(HiSrcMO2.getSubReg(), AMDGPU::sub1);
+    Op0H_Op1H.addReg(HiSrcMO2.getReg(), 0, Src1SubIdx); // src0
+  } else {
+    Op0H_Op1H.addImm(Hi_src1_mods); // src0_modifiers
+    unsigned Src1SubIdx =
+        TRI->composeSubRegIndices(HiSrcMO2.getSubReg(), AMDGPU::sub0);
+    Op0H_Op1H.addReg(HiSrcMO2.getReg(), 0,
+                     Src1SubIdx); // src0 //if op_sel_hi == 0, select register 0
+                                  // of reg:sub0_sub1
   }
-  else {
-    Op0H_Op1H.addImm(Hi_src1_mods); //src0_modifiers
-    unsigned Src1SubIdx = TRI->composeSubRegIndices(HiSrcMO2.getSubReg(), AMDGPU::sub0);
-    Op0H_Op1H.addReg(HiSrcMO2.getReg(), 0, Src1SubIdx); //src0 //if op_sel_hi == 0, select register 0 of reg:sub0_sub1
-  }
-  Op0H_Op1H.addImm(clampVal); //clamp
-  //packed instructions do not support output modifiers. safe to assign them 0 for this use case
-  Op0H_Op1H.addImm(0); //omod
+  Op0H_Op1H.addImm(clampVal); // clamp
+  // packed instructions do not support output modifiers. safe to assign them 0
+  // for this use case
+  Op0H_Op1H.addImm(0); // omod
   LIS->InsertMachineInstrInMaps(*Op0H_Op1H);
 
   if (I.getFlag(MachineInstr::MIFlag::NoFPExcept)) {
@@ -504,30 +524,33 @@ void GCNPreRAOptimizationsImpl::insertMI(MachineInstr &I) {
     MachineInstr *CopySGPR2 = copyInstrs[1];
 
     bool isVReg64 = (DstRC->getID() == AMDGPU::VReg_64_Align2RegClassID);
-    SmallVector<MachineInstr *, 2> unpackedInstrs = insertUnpackedMI(
-        I, DstMO, SrcMO1, CopySGPR1->getOperand(0), SrcMO1,
-        CopySGPR2->getOperand(0), isVReg64);
-    unpackedInstrs[0]->addRegisterKilled(unpackedInstrs[0]->getOperand(2).getReg(), TRI);
-    unpackedInstrs[1]->addRegisterKilled(unpackedInstrs[1]->getOperand(2).getReg(), TRI);
+    SmallVector<MachineInstr *, 2> unpackedInstrs =
+        insertUnpackedMI(I, DstMO, SrcMO1, CopySGPR1->getOperand(0), SrcMO1,
+                         CopySGPR2->getOperand(0), isVReg64);
+    unpackedInstrs[0]->addRegisterKilled(
+        unpackedInstrs[0]->getOperand(2).getReg(), TRI);
+    unpackedInstrs[1]->addRegisterKilled(
+        unpackedInstrs[1]->getOperand(2).getReg(), TRI);
     return;
-  }
-  else if (Src0RC->getID() == AMDGPU::SGPR_64RegClassID) {
+  } else if (Src0RC->getID() == AMDGPU::SGPR_64RegClassID) {
     SmallVector<MachineInstr *, 2> copyInstrs = copyToVregAndInsertMI(I, 2);
     MachineInstr *CopySGPR1 = copyInstrs[0];
     MachineInstr *CopySGPR2 = copyInstrs[1];
 
     bool isVReg64 = (DstRC->getID() == AMDGPU::VReg_64_Align2RegClassID);
-    SmallVector<MachineInstr *, 2> unpackedInstrs = insertUnpackedMI(
-        I, DstMO, CopySGPR1->getOperand(0), SrcMO2, CopySGPR2->getOperand(0), SrcMO2, isVReg64);
-    unpackedInstrs[0]->addRegisterKilled(unpackedInstrs[0]->getOperand(1).getReg(), TRI);
-    unpackedInstrs[1]->addRegisterKilled(unpackedInstrs[1]->getOperand(1).getReg(), TRI);
+    SmallVector<MachineInstr *, 2> unpackedInstrs =
+        insertUnpackedMI(I, DstMO, CopySGPR1->getOperand(0), SrcMO2,
+                         CopySGPR2->getOperand(0), SrcMO2, isVReg64);
+    unpackedInstrs[0]->addRegisterKilled(
+        unpackedInstrs[0]->getOperand(1).getReg(), TRI);
+    unpackedInstrs[1]->addRegisterKilled(
+        unpackedInstrs[1]->getOperand(1).getReg(), TRI);
     return;
   }
 
   bool isVReg64 = (DstRC->getID() == AMDGPU::VReg_64_Align2RegClassID);
-  SmallVector<MachineInstr *, 2> unpackedInstrs = insertUnpackedMI(
-          I, DstMO, SrcMO1, SrcMO2, SrcMO1,
-          SrcMO2, isVReg64);
+  SmallVector<MachineInstr *, 2> unpackedInstrs =
+      insertUnpackedMI(I, DstMO, SrcMO1, SrcMO2, SrcMO1, SrcMO2, isVReg64);
   return;
 }
 
@@ -567,14 +590,15 @@ bool GCNPreRAOptimizationsImpl::run(MachineFunction &MF) {
   }
 
   // Add RA hints to improve True16 COPY elimination.
-  // Unpack packed instructions to overlap MFMAs. This allows the compiler to co-issue unpacked instructions with MFMA
+  // Unpack packed instructions to overlap MFMAs. This allows the compiler to
+  // co-issue unpacked instructions with MFMA
   for (MachineBasicBlock &MBB : MF) {
     DenseSet<MachineInstr *> instrsToUnpack;
     for (MachineInstr &MI : MBB) {
-      if (SIInstrInfo::isMFMA(MI)){
+      if (SIInstrInfo::isMFMA(MI)) {
         createListOfPackedInstr(MI, instrsToUnpack);
       }
-      if (ST.useRealTrue16Insts()){
+      if (ST.useRealTrue16Insts()) {
         if (MI.getOpcode() != AMDGPU::COPY)
           continue;
         Register Dst = MI.getOperand(0).getReg();
@@ -601,9 +625,9 @@ bool GCNPreRAOptimizationsImpl::run(MachineFunction &MF) {
           MRI->setRegAllocationHint(Dst, AMDGPURI::Size16, Src);
       }
     }
-    
+
     if (!instrsToUnpack.empty()) {
-      for (MachineInstr *MI : instrsToUnpack) 
+      for (MachineInstr *MI : instrsToUnpack)
         insertMI(*MI);
     }
   }
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 2a6720d48..7947069b2 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -6197,40 +6197,40 @@ bool SIInstrInfo::isNeverCoissue(MachineInstr &MI) const {
 
   unsigned Opcode = MI.getOpcode();
   switch (Opcode) {
-    case AMDGPU::V_CVT_PK_BF8_F32_e64:
-    case AMDGPU::V_CVT_PK_FP8_F32_e64:
-    case AMDGPU::V_MQSAD_PK_U16_U8_e64:
-    case AMDGPU::V_MQSAD_U32_U8_e64:
-    case AMDGPU::V_PK_ADD_F16:
-    case AMDGPU::V_PK_ADD_F32:
-    case AMDGPU::V_PK_ADD_I16:
-    case AMDGPU::V_PK_ADD_U16:
-    case AMDGPU::V_PK_ASHRREV_I16:
-    case AMDGPU::V_PK_FMA_F16:
-    case AMDGPU::V_PK_FMA_F32:
-    case AMDGPU::V_PK_FMAC_F16_e32:
-    case AMDGPU::V_PK_FMAC_F16_e64:
-    case AMDGPU::V_PK_LSHLREV_B16:
-    case AMDGPU::V_PK_LSHRREV_B16:
-    case AMDGPU::V_PK_MAD_I16:
-    case AMDGPU::V_PK_MAD_U16:
-    case AMDGPU::V_PK_MAX_F16:
-    case AMDGPU::V_PK_MAX_I16:
-    case AMDGPU::V_PK_MAX_U16:
-    case AMDGPU::V_PK_MIN_F16:
-    case AMDGPU::V_PK_MIN_I16:
-    case AMDGPU::V_PK_MIN_U16:
-    case AMDGPU::V_PK_MOV_B32:
-    case AMDGPU::V_PK_MUL_F16:
-    case AMDGPU::V_PK_MUL_F32:
-    case AMDGPU::V_PK_MUL_LO_U16:
-    case AMDGPU::V_PK_SUB_I16:
-    case AMDGPU::V_PK_SUB_U16:
-    case AMDGPU::V_QSAD_PK_U16_U8_e64:
-      return true;
-    default:
-      return false;
-    }
+  case AMDGPU::V_CVT_PK_BF8_F32_e64:
+  case AMDGPU::V_CVT_PK_FP8_F32_e64:
+  case AMDGPU::V_MQSAD_PK_U16_U8_e64:
+  case AMDGPU::V_MQSAD_U32_U8_e64:
+  case AMDGPU::V_PK_ADD_F16:
+  case AMDGPU::V_PK_ADD_F32:
+  case AMDGPU::V_PK_ADD_I16:
+  case AMDGPU::V_PK_ADD_U16:
+  case AMDGPU::V_PK_ASHRREV_I16:
+  case AMDGPU::V_PK_FMA_F16:
+  case AMDGPU::V_PK_FMA_F32:
+  case AMDGPU::V_PK_FMAC_F16_e32:
+  case AMDGPU::V_PK_FMAC_F16_e64:
+  case AMDGPU::V_PK_LSHLREV_B16:
+  case AMDGPU::V_PK_LSHRREV_B16:
+  case AMDGPU::V_PK_MAD_I16:
+  case AMDGPU::V_PK_MAD_U16:
+  case AMDGPU::V_PK_MAX_F16:
+  case AMDGPU::V_PK_MAX_I16:
+  case AMDGPU::V_PK_MAX_U16:
+  case AMDGPU::V_PK_MIN_F16:
+  case AMDGPU::V_PK_MIN_I16:
+  case AMDGPU::V_PK_MIN_U16:
+  case AMDGPU::V_PK_MOV_B32:
+  case AMDGPU::V_PK_MUL_F16:
+  case AMDGPU::V_PK_MUL_F32:
+  case AMDGPU::V_PK_MUL_LO_U16:
+  case AMDGPU::V_PK_SUB_I16:
+  case AMDGPU::V_PK_SUB_U16:
+  case AMDGPU::V_QSAD_PK_U16_U8_e64:
+    return true;
+  default:
+    return false;
+  }
 }
 
 void SIInstrInfo::legalizeOperandsVOP2(MachineRegisterInfo &MRI,

MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();
MachineFunction &MF = *MBB.getParent();

Register DstReg = I.getOperand(0).getReg();
Contributor

I.getOperand(0).getReg() -> DstMO.getReg()

MachineFunction &MF = *MBB.getParent();

Register DstReg = I.getOperand(0).getReg();
Register SrcReg1 = I.getOperand(2).getReg();
Contributor

Ditto.

MachineFunction &MF = *MBB.getParent();

Register DstReg = I.getOperand(0).getReg();
Register SrcReg1 = I.getOperand(2).getReg();
Contributor

Do we need to check if the operand is a register?


const DebugLoc &DL = I.getDebugLoc();
const TargetRegisterClass *DstRC = MRI.getRegClass(I.getOperand(0).getReg());
const TargetRegisterClass *Src0RC = MRI.getRegClass(I.getOperand(2).getReg());
Contributor

Ditto

const DebugLoc &DL = I.getDebugLoc();
const TargetRegisterClass *DstRC = MRI.getRegClass(I.getOperand(0).getReg());
const TargetRegisterClass *Src0RC = MRI.getRegClass(I.getOperand(2).getReg());
const TargetRegisterClass *Src1RC = MRI.getRegClass(I.getOperand(4).getReg());
Contributor

Ditto

const TargetRegisterClass *DstRC = MRI.getRegClass(I.getOperand(0).getReg());
const TargetRegisterClass *Src0RC = MRI.getRegClass(I.getOperand(2).getReg());
const TargetRegisterClass *Src1RC = MRI.getRegClass(I.getOperand(4).getReg());
const TargetRegisterClass *Src2RC = MRI.getRegClass(I.getOperand(6).getReg());
Contributor

Ditto

const TargetRegisterClass *Src1RC = MRI.getRegClass(I.getOperand(4).getReg());
const TargetRegisterClass *Src2RC = MRI.getRegClass(I.getOperand(6).getReg());

bool IsVReg64 = (DstRC->getID() == AMDGPU::VReg_64_Align2RegClassID);
Contributor

Can you use TRI->getRegSizeInBits(DstRC) == 64?
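
I.e., something like this sketch (note that getRegSizeInBits takes the class by reference):

  // Size-based check instead of naming a specific register class:
  bool IsVReg64 = TRI->getRegSizeInBits(*DstRC) == 64;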


LIS->InsertMachineInstrInMaps(*Op0L_Op1L);

SrcSubIdx1 = TRI->composeSubRegIndices(SrcMO1.getSubReg(), AMDGPU::sub1);
Contributor

Extract common code into helper

Contributor

@jrbyrnes jrbyrnes left a comment

I think ideally this would be done in a way such that we could rerun preRA scheduling after this and perhaps undo the unpack if it has made things worse. This may reduce the reliance on having an accurate profitability model.

} else
TotalCyclesBetweenCandidates += 1;

if (!(TotalCyclesBetweenCandidates > NumMFMACycles))
Contributor

A tie between the cycle counts should go to the packed version, I think.
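
In other words, the acceptance check could hypothetically use a strict comparison so that an exact tie keeps the packed form (sketch):

  // Unpack only when the candidate clearly fits under the MFMA's latency;
  // on a tie, prefer keeping the packed instruction.
  if (TotalCyclesBetweenCandidates < NumMFMACycles)
    InstrsToUnpack.insert(&Instr);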

if (SIInstrInfo::isMFMA(MI)) {
const MCSchedClassDesc *SchedClassDesc =
SchedModel.resolveSchedClass(&MI);
NumMFMACycles =
Contributor

Should we reduce this by the issue latency of the MFMA?

Contributor

ping

MachineInstr &Instr = *I;
const MCSchedClassDesc *InstrSchedClassDesc =
SchedModel.resolveSchedClass(&Instr);
TotalCyclesBetweenCandidates +=
Contributor

This cycle count modelling is inaccurate and doesn't properly account for observed latency between instructions. If there are dependencies in the instructions between the MFMA and unpack instruction, we must observe the full latency of the dependee.

For example, if we have a load and a use of the load between the MFMA and the unpack instruction, we will incur the full latency of the load between the MFMA and the unpack candidate.

Another interesting case is if the unpack candidate uses some value from the MFMA. In this case, the unpack candidate must wait for the MFMA regardless.

Contributor

Improving the cycle count modelling is not something that we should attempt to handle -- I view it more as a downside to the implementation choice.

To do proper cycle count modelling we would basically need to do ad-hoc reconstructions of the schedule DAG for the dependency modelling.

If we do believe that such modelling is needed, we should probably just move this unpacking into scheduling somewhere, and use the scheduler's mechanisms to do latency profitability checks.

LIS->InsertMachineInstrInMaps(*HiInput0_MI);
}

SubRegID = 0;
Contributor

Possible to extract common code?

void GCNPreRAOptimizationsImpl::processF16Unpacking(MachineInstr &I,
uint16_t AvailableBudget) {
MachineBasicBlock &MBB = *I.getParent();
MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();
Contributor

MRI is local to the class.

LIS->RemoveMachineInstrFromMaps(I);
I.eraseFromParent();
LIS->removeInterval(DstReg);
LIS->createAndComputeVirtRegInterval(DstReg);
Contributor

Are you also [re]computing the virtRegInterval for the use regs? These may shift slightly due to SlotIndex numbering. Also, in the case where we are creating new regs to hold temporary results for the 16-bit unpack, we'll need LiveIntervals for all of those.
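
A sketch of what that might look like (hypothetical helper; assumes every virtual register referenced by the new pair should be refreshed):

  // Recompute liveness for every virtual register the unpacked pair touches,
  // not just the destination.
  auto RecomputeLiveness = [&](MachineInstr *MI) {
    for (MachineOperand &MO : MI->operands()) {
      if (!MO.isReg() || !MO.getReg().isVirtual())
        continue;
      Register R = MO.getReg();
      if (LIS->hasInterval(R))
        LIS->removeInterval(R);
      LIS->createAndComputeVirtRegInterval(R);
    }
  };
  RecomputeLiveness(Op0L_Op1L); // low half
  RecomputeLiveness(Op0H_Op1H); // high half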

return;
}

void GCNPreRAOptimizationsImpl::processF16Unpacking(MachineInstr &I,
Contributor

What is AvailableBudget needed for?

MaskLo = RegAndLatency.first; // mask for lower 16 bits
RegAndLatency = BuildImm(16);
ShiftAmt = RegAndLatency.first; // mask for higher 16 bits
IsF16MaskSet = true;
Contributor

@jrbyrnes jrbyrnes Aug 25, 2025

So this just sets MaskLo to the masked Lo bits of the first unpacked F16 instruction? What if the next F16 unpack candidate uses a different register?

Can you add a test where we unpack multiple F16 instructions and they have different source registers? You may need multiple MFMAs. Also a test where only 1 operand is reused.

return true;
}

bool GCNPreRAOptimizationsImpl::isUnpackingSupportedInstr(
Contributor

We should allow for unpacking with immediate operands
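
A hypothetical shape for that, assuming an inline constant simply feeds both unpacked halves (untested sketch; SrcMO, NewMI, and Sub are placeholder names):

  // Immediate (inline-constant) sources have no subregister to select from;
  // forward the same value into each unpacked half.
  if (SrcMO.isImm())
    NewMI.addImm(SrcMO.getImm());
  else
    NewMI.addReg(SrcMO.getReg(), 0,
                 TRI->composeSubRegIndices(SrcMO.getSubReg(), Sub));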

default:
return false;
}
}
Contributor

ping

default:
return std::numeric_limits<uint16_t>::max();
}
}
Contributor

ping

return;
}

void GCNPreRAOptimizationsImpl::processFMAF32Unpacking(MachineInstr &I) {
Contributor

Do we really need a separate function for this? Seems like we're mainly changing some bool immediates in createUnpackedMI

return;

MachineInstrBuilder Op0L_Op1L =
createUnpackedMI(MBB, I, DL, UnpackedOpcode, false, false);
Contributor

@jrbyrnes jrbyrnes Sep 3, 2025

createUnpackedMI(MBB, I, DL, UnpackedOpcode, /*IsHiBits=*/false, /*IsFMA=*/false);

}
}
NewMI.addImm(ClampVal); // clamp
// packed instructions do not support output modifiers. safe to assign them 0
Contributor

ditto

NewMI.addReg(SrcMO1.getReg(), 0, Src0SubIdx);
}

if (Src1_Mods & NegModifier) {
Contributor

It looks like there is a lot of common code between the src0/src1/src2 handling; is it possible to extract this into a function?
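
For instance, a small helper could append one src_modifiers/src pair, mirroring the existing op_sel/neg handling (hypothetical; name and exact placement assumed):

  // Append one src_modifiers operand plus the matching source half. The op_sel
  // bit for this half decides whether sub0 or sub1 of the packed 64-bit source
  // is used, and the corresponding neg bit is translated to a VOP3 NEG.
  void addUnpackedSrc(MachineInstrBuilder &MIB, unsigned PackedMods,
                      const MachineOperand &SrcMO, bool IsHiBits) const {
    unsigned Mods = 0;
    unsigned Sub = AMDGPU::sub0;
    if (PackedMods & (IsHiBits ? SISrcMods::OP_SEL_1 : SISrcMods::OP_SEL_0)) {
      Sub = AMDGPU::sub1;
      if (PackedMods & (IsHiBits ? SISrcMods::NEG_HI : SISrcMods::NEG))
        Mods |= SISrcMods::NEG;
    }
    MIB.addImm(Mods);
    MIB.addReg(SrcMO.getReg(), 0,
               TRI->composeSubRegIndices(SrcMO.getSubReg(), Sub));
  }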

return;
}

MachineOperand &DstMO = I.getOperand(0);
Contributor

We aren't using any of these variables in this function scope except to pass them into insertUnpackedF32MI -- can we just create those variables from the instruction at that scope instead?

if (SIInstrInfo::isMFMA(MI)) {
const MCSchedClassDesc *SchedClassDesc =
SchedModel.resolveSchedClass(&MI);
NumMFMACycles =
Contributor

ping

@jrbyrnes
Contributor

jrbyrnes commented Sep 3, 2025

The PostRA scheduler can freely move around any of these instructions (unpacked or packed). Using an issue-cycle model between the PreRA and PostRA scheduling passes therefore corresponds only loosely with the actual profitability: we may unpack some instructions that get moved much later, or we may not unpack an instruction that is moved directly after an MFMA. The latter is just a missed optimization, whereas the former degrades performance.

Doing it between scheduling passes would make some sense if our SchedModel knew about non-co-issue latency; however, it doesn't. Without that, I think it makes the most sense to have this handling after the PostRA scheduler. Otherwise, we should use a flag, because this may cause regressions.
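
If a flag is the interim answer, a hidden cl::opt would be the usual mechanism; a sketch (the flag name below is made up for illustration):

  #include "llvm/Support/CommandLine.h"

  // Hypothetical off-by-default switch so the unpacking can be enabled for
  // experiments without risking regressions in default builds.
  static cl::opt<bool> EnablePackedUnpack(
      "amdgpu-unpack-packed-f32-for-coissue", cl::Hidden, cl::init(false),
      cl::desc("Unpack V_PK_ADD_F32/V_PK_MUL_F32 near MFMAs for co-issue"));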

@jrbyrnes
Contributor

Can this be abandoned?

@akadutta akadutta closed this Sep 20, 2025
@akadutta
Contributor Author

The corresponding change was merged through #157968. That patch moves the optimization to after post-RA scheduling. Closing this PR, as it is an older iteration of #157968.
