Conversation

akadutta
Contributor

@akadutta akadutta commented Aug 1, 2025

This patch unpacks packed instructions that cannot be co-issued with MFMAs into their unpacked equivalents, which can be co-issued with them. Only packed instructions that are covered by the MFMA's latency are unpacked; the rest are left packed.
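
For context, a condensed sketch of the selection logic from the patch (simplified; variable names are illustrative, and modifier handling, SGPR copies, and liveness updates are omitted):

  // Walk forward from an MFMA and mark packed instructions that still fall
  // inside the MFMA's latency window; anything past the window stays packed.
  auto SchedModel = TII->getSchedModel();
  const MCSchedClassDesc *SC = SchedModel.resolveSchedClass(&MFMA);
  const int NumMFMACycles = SchedModel.getWriteProcResBegin(SC)->ReleaseAtCycle;
  int Cycles = 0;
  for (auto It = std::next(MFMA.getIterator()); It != MBB.end(); ++It) {
    MachineInstr &MI = *It;
    Cycles += SchedModel.getWriteProcResBegin(
                  SchedModel.resolveSchedClass(&MI))->ReleaseAtCycle;
    if (Cycles > NumMFMACycles)
      break; // past the MFMA's shadow: leave the rest packed
    if (isUnpackingSupportedInstr(MI) && TII->isNeverCoissue(MI))
      InstrsToUnpack.insert(&MI); // V_PK_ADD_F32 / V_PK_MUL_F32 -> two e64 ops
  }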


github-actions bot commented Aug 1, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Aug 1, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Akash Dutta (akadutta)

Changes

This patch unpacks packed instructions that cannot be co-issued with MFMAs into their unpacked equivalents, which can be co-issued with them. Only packed instructions that are covered by the MFMA's latency are unpacked; the rest are left packed.


Patch is 27.51 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/151704.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp (+367-30)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+58-1)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.h (+1)
  • (added) llvm/test/CodeGen/AMDGPU/unpack-non-coissue-insts-post-scheduler.ll (+116)
diff --git a/llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp b/llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp
index 4deb2a9485e4d..0f7009a6ea394 100644
--- a/llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp
@@ -28,6 +28,12 @@
 /// and a VGPR_16. If we use the VGPR_16 that corresponds to the lo16 bits of
 /// the VGPR_32, the COPY can be completely eliminated.
 ///
+/// Additionally, this pass also unpacks packed instructions (V_PK_MUL_F32 and V_PK_ADD_F32) 
+/// adjacent to MFMAs such that they can be co-issued.
+/// This helps with overlapping MFMA and certain vector instructions in machine schedules
+/// and is expected to improve performance.
+/// Only those packed instructions are unpacked that are overlapped by the MFMA latency.
+/// Rest should remain untouched.
 //===----------------------------------------------------------------------===//
 
 #include "GCNPreRAOptimizations.h"
@@ -38,7 +44,13 @@
 #include "llvm/CodeGen/LiveIntervals.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include "llvm/InitializePasses.h"
-
+#include "llvm/ADT/DenseSet.h"
+#include "SIInstrInfo.h"
+#include "llvm/CodeGen/RegisterScavenging.h"
+#include "llvm/InitializePasses.h"
+#include "GCNSchedStrategy.h"
+#include "llvm/CodeGen/MachineInstr.h"
+#include "llvm/CodeGen/MachineScheduler.h"
 using namespace llvm;
 
 #define DEBUG_TYPE "amdgpu-pre-ra-optimizations"
@@ -53,6 +65,16 @@ class GCNPreRAOptimizationsImpl {
   LiveIntervals *LIS;
 
   bool processReg(Register Reg);
+  bool createListOfPackedInstr(MachineInstr &BeginMI, DenseSet<MachineInstr *> &instrsToUnpack);
+  bool isUnpackingSupportedInstr(MachineInstr &MI) const;
+  void insertMI(MachineInstr &I);
+  uint16_t mapToUnpackedOpcode(MachineInstr &I);
+  SmallVector<MachineInstr *, 2> copyToVregAndInsertMI(MachineInstr &I,
+                                                       unsigned SGPRSrcPos);
+  SmallVector<MachineInstr *, 2>
+  insertUnpackedMI(MachineInstr &I, MachineOperand &DstMO, MachineOperand &LoSrcMO1,
+                   MachineOperand &LoSrcMO2, MachineOperand &HiSrcMO1, MachineOperand &HiSrcMO2,
+                   bool isVreg_64);
 
 public:
   GCNPreRAOptimizationsImpl(LiveIntervals *LS) : LIS(LS) {}
@@ -225,6 +247,313 @@ bool GCNPreRAOptimizationsImpl::processReg(Register Reg) {
   return true;
 }
 
+bool GCNPreRAOptimizationsImpl::isUnpackingSupportedInstr(MachineInstr &MI) const {
+  unsigned Opcode = MI.getOpcode();
+  switch (Opcode) {
+    case AMDGPU::V_PK_ADD_F32:
+    case AMDGPU::V_PK_MUL_F32:
+      return true;
+
+    default:
+      return false;
+
+  }
+}
+
+uint16_t GCNPreRAOptimizationsImpl::mapToUnpackedOpcode(MachineInstr &I) {
+  unsigned Opcode = I.getOpcode();
+  // use 64 bit encoding to allow use of VOP3 instructions.
+  // VOP3 instructions allow VOP3P source modifiers to be translated to VOP3
+  // e32 instructions are VOP2 and don't allow source modifiers
+  switch (Opcode) {
+    case AMDGPU::V_PK_ADD_F32:
+      return AMDGPU::V_ADD_F32_e64;
+    case AMDGPU::V_PK_MUL_F32:
+      return AMDGPU::V_MUL_F32_e64;
+    default:
+      return std::numeric_limits<uint16_t>::max();
+
+  }
+}
+
+SmallVector<MachineInstr *, 2>
+GCNPreRAOptimizationsImpl::copyToVregAndInsertMI(MachineInstr &I,
+                                                   unsigned SGPRSrcPos) {
+  SmallVector<MachineInstr *, 2> MIList;
+
+  MachineBasicBlock &MBB = *I.getParent();
+  MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();
+  MachineFunction &MF = *MBB.getParent();
+  const DebugLoc &DL = I.getDebugLoc();
+
+  Register TmpReg = MRI.createVirtualRegister(&AMDGPU::VReg_64_Align2RegClass);
+  MachineInstr *CopySGPR1 =
+      BuildMI(MBB, I, DL, TII->get(AMDGPU::COPY))
+          .addDef(TmpReg, RegState::Undef)
+          .addReg(I.getOperand(SGPRSrcPos).getReg(), 0, AMDGPU::sub0);
+  unsigned SubIdx = TRI->composeSubRegIndices(
+      AMDGPU::sub0, CopySGPR1->getOperand(0).getSubReg());
+  CopySGPR1->getOperand(0).setReg(CopySGPR1->getOperand(0).getReg());
+  CopySGPR1->getOperand(0).setSubReg(SubIdx);
+  LIS->InsertMachineInstrInMaps(*CopySGPR1);
+  MIList.push_back(CopySGPR1);
+
+  MachineInstr *CopySGPR2 =
+      BuildMI(MBB, I, DL, TII->get(AMDGPU::COPY))
+          .addDef(TmpReg)
+          .addReg(I.getOperand(SGPRSrcPos).getReg(), 0, AMDGPU::sub1);
+  SubIdx = TRI->composeSubRegIndices(AMDGPU::sub1,
+                                     CopySGPR2->getOperand(0).getSubReg());
+  CopySGPR2->getOperand(0).setReg(CopySGPR2->getOperand(0).getReg());
+  CopySGPR2->getOperand(0).setSubReg(SubIdx);
+  LIS->InsertMachineInstrInMaps(*CopySGPR2);
+  MIList.push_back(CopySGPR2);
+  return MIList;
+}
+
+bool GCNPreRAOptimizationsImpl::createListOfPackedInstr(
+    MachineInstr &BeginMI, DenseSet<MachineInstr *> &instrsToUnpack) {
+  auto *BB = BeginMI.getParent();
+  auto *MF = BB->getParent();
+  int NumInst = 0;
+
+  auto E = BB->end();
+  auto schedModel = TII->getSchedModel();
+  const MCSchedClassDesc *schedClassDesc = schedModel.resolveSchedClass(&BeginMI);
+  const int NumMFMACycles = schedModel.getWriteProcResBegin(schedClassDesc)->ReleaseAtCycle;
+  int totalCyclesBetweenCandidates = 0;
+  for (auto I = std::next(BeginMI.getIterator()); I != E; ++I) {
+    MachineInstr &Instr = *I;
+    const MCSchedClassDesc *instrSchedClassDesc = schedModel.resolveSchedClass(&Instr);
+    totalCyclesBetweenCandidates += schedModel.getWriteProcResBegin(instrSchedClassDesc)->ReleaseAtCycle;
+    if (Instr.isMetaInstruction())
+      continue;
+
+    if (Instr.isTerminator())
+      return false;
+    
+    if (totalCyclesBetweenCandidates > NumMFMACycles)
+      return false;
+
+    if ((isUnpackingSupportedInstr(Instr)) && TII->isNeverCoissue(Instr)) {
+      totalCyclesBetweenCandidates += 1;
+      instrsToUnpack.insert(&Instr);
+    }
+  }
+  return true;
+}
+
+SmallVector<MachineInstr *, 2> GCNPreRAOptimizationsImpl::insertUnpackedMI(
+    MachineInstr &I, MachineOperand &DstMO, MachineOperand &LoSrcMO1, MachineOperand &LoSrcMO2,
+    MachineOperand &HiSrcMO1, MachineOperand &HiSrcMO2, bool isVreg_64) {
+
+  SmallVector<MachineInstr *, 2> MIList;
+  MachineBasicBlock &MBB = *I.getParent();
+  MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();
+  MachineFunction &MF = *MBB.getParent();
+  const DebugLoc &DL = I.getDebugLoc();
+  Register DstReg = DstMO.getReg();
+
+  unsigned SrcSubIdx1 =
+      TRI->composeSubRegIndices(LoSrcMO1.getSubReg(), AMDGPU::sub0);
+  unsigned SrcSubIdx2 =
+      TRI->composeSubRegIndices(LoSrcMO2.getSubReg(), AMDGPU::sub0);
+  unsigned DestSubIdx =
+      TRI->composeSubRegIndices(DstMO.getSubReg(), AMDGPU::sub0);
+
+  const MCInstrDesc instrDesc = I.getDesc();
+
+  int clampIdx = AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::clamp);
+  int64_t clampVal = I.getOperand(clampIdx).getImm();
+
+  int src0_modifiers_Idx = AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::src0_modifiers);
+  int src1_modifiers_Idx = AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::src1_modifiers);
+  unsigned src0_Mods = I.getOperand(src0_modifiers_Idx).getImm();
+  unsigned src1_Mods = I.getOperand(src1_modifiers_Idx).getImm();
+
+  //don't worry about abs values. Packed instructions (VOP3P) do not support them
+  unsigned Lo_src0_mods = 0;
+  unsigned Lo_src1_mods = 0;
+  uint16_t unpackedOpcode = mapToUnpackedOpcode(I);
+  MachineInstrBuilder Op0L_Op1L = BuildMI(MBB, I, DL, TII->get(unpackedOpcode));
+  Op0L_Op1L.addDef(DstReg, 0, DestSubIdx); //vdst
+  if (src0_Mods & SISrcMods::OP_SEL_0) {
+    if (src0_Mods & SISrcMods::NEG) {
+      Lo_src0_mods |= SISrcMods::NEG;
+    }
+    Op0L_Op1L.addImm(Lo_src0_mods); //src0_modifiers
+    unsigned Src0SubIdx = TRI->composeSubRegIndices(LoSrcMO1.getSubReg(), AMDGPU::sub1);
+    Op0L_Op1L.addReg(LoSrcMO1.getReg(), 0, Src0SubIdx); //src0
+  }
+  else {
+    Op0L_Op1L.addImm(Lo_src0_mods); //src0_modifiers
+    unsigned Src0SubIdx = TRI->composeSubRegIndices(LoSrcMO1.getSubReg(), AMDGPU::sub0);
+    Op0L_Op1L.addReg(LoSrcMO1.getReg(), 0, Src0SubIdx); //src0 //if op_sel == 0, select register 0 of reg:sub0_sub1
+  }
+
+  if (src1_Mods & SISrcMods::OP_SEL_0) {
+    if (src1_Mods & SISrcMods::NEG) {
+      Lo_src1_mods |= SISrcMods::NEG;
+    }
+    Op0L_Op1L.addImm(Lo_src1_mods); //src0_modifiers
+    unsigned Src1SubIdx = TRI->composeSubRegIndices(LoSrcMO2.getSubReg(), AMDGPU::sub1);
+    Op0L_Op1L.addReg(LoSrcMO2.getReg(), 0, Src1SubIdx); //src0
+  }
+  else {
+    Op0L_Op1L.addImm(Lo_src1_mods); //src0_modifiers
+    unsigned Src1SubIdx = TRI->composeSubRegIndices(LoSrcMO2.getSubReg(), AMDGPU::sub0);
+    Op0L_Op1L.addReg(LoSrcMO2.getReg(), 0, Src1SubIdx); //src0 //if op_sel_hi == 0, select register 0 of reg:sub0_sub1
+  }
+  Op0L_Op1L.addImm(clampVal); //clamp
+  //packed instructions do not support output modifiers. safe to assign them 0 for this use case
+  Op0L_Op1L.addImm(0); //omod
+
+  if (isVreg_64) {
+    Op0L_Op1L->getOperand(0).setIsUndef();
+  }
+  else {
+    if (I.getOperand(0).isUndef()) {
+      Op0L_Op1L->getOperand(0).setIsUndef();
+    }
+  }
+
+  LIS->InsertMachineInstrInMaps(*Op0L_Op1L);
+
+  SrcSubIdx1 =
+      TRI->composeSubRegIndices(LoSrcMO1.getSubReg(), AMDGPU::sub1);
+  SrcSubIdx2 =
+      TRI->composeSubRegIndices(LoSrcMO2.getSubReg(), AMDGPU::sub1);
+  DestSubIdx =
+      TRI->composeSubRegIndices(DstMO.getSubReg(), AMDGPU::sub1);
+
+  //don't worry about abs values. Packed instructions (VOP3P) do not support them
+  unsigned Hi_src0_mods = 0;
+  unsigned Hi_src1_mods = 0;
+
+  MachineInstrBuilder Op0H_Op1H = BuildMI(MBB, I, DL, TII->get(unpackedOpcode));
+  Op0H_Op1H.addDef(DstReg, 0, DestSubIdx); //vdst
+  if (src0_Mods & SISrcMods::OP_SEL_1) {
+    if (src0_Mods & SISrcMods::NEG_HI) {
+      Hi_src0_mods |= SISrcMods::NEG;
+    }
+    Op0H_Op1H.addImm(Hi_src0_mods); //src0_modifiers
+    unsigned Src0SubIdx = TRI->composeSubRegIndices(HiSrcMO1.getSubReg(), AMDGPU::sub1);
+    Op0H_Op1H.addReg(HiSrcMO1.getReg(), 0, Src0SubIdx); //src0
+  }
+  else {
+    Op0H_Op1H.addImm(Hi_src0_mods); //src0_modifiers
+    unsigned Src0SubIdx = TRI->composeSubRegIndices(HiSrcMO1.getSubReg(), AMDGPU::sub0);
+    Op0H_Op1H.addReg(HiSrcMO1.getReg(), 0, Src0SubIdx); //src0 //if op_sel_hi == 0, select register 0 of reg:sub0_sub1
+  }
+
+  if (src1_Mods & SISrcMods::OP_SEL_1) {
+    if (src1_Mods & SISrcMods::NEG_HI) {
+      Hi_src1_mods |= SISrcMods::NEG;
+    }
+    Op0H_Op1H.addImm(Hi_src1_mods); //src0_modifiers
+    unsigned Src1SubIdx = TRI->composeSubRegIndices(HiSrcMO2.getSubReg(), AMDGPU::sub1);
+    Op0H_Op1H.addReg(HiSrcMO2.getReg(), 0, Src1SubIdx); //src0
+  }
+  else {
+    Op0H_Op1H.addImm(Hi_src1_mods); //src0_modifiers
+    unsigned Src1SubIdx = TRI->composeSubRegIndices(HiSrcMO2.getSubReg(), AMDGPU::sub0);
+    Op0H_Op1H.addReg(HiSrcMO2.getReg(), 0, Src1SubIdx); //src0 //if op_sel_hi == 0, select register 0 of reg:sub0_sub1
+  }
+  Op0H_Op1H.addImm(clampVal); //clamp
+  //packed instructions do not support output modifiers. safe to assign them 0 for this use case
+  Op0H_Op1H.addImm(0); //omod
+  LIS->InsertMachineInstrInMaps(*Op0H_Op1H);
+
+  if (I.getFlag(MachineInstr::MIFlag::NoFPExcept)) {
+    Op0L_Op1L->setFlag(MachineInstr::MIFlag::NoFPExcept);
+    Op0H_Op1H->setFlag(MachineInstr::MIFlag::NoFPExcept);
+  }
+  LIS->RemoveMachineInstrFromMaps(I);
+  I.eraseFromParent();
+  LIS->removeInterval(DstReg);
+  LIS->createAndComputeVirtRegInterval(DstReg);
+  MIList.push_back(Op0L_Op1L);
+  MIList.push_back(Op0H_Op1H);
+  return MIList;
+}
+
+void GCNPreRAOptimizationsImpl::insertMI(MachineInstr &I) {
+  MachineBasicBlock &MBB = *I.getParent();
+  MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();
+  MachineFunction &MF = *MBB.getParent();
+
+  Register DstReg = I.getOperand(0).getReg();
+  Register SrcReg1 = I.getOperand(2).getReg();
+  Register SrcReg2 = I.getOperand(4).getReg();
+
+  MachineOperand &DstMO = I.getOperand(0);
+  MachineOperand &SrcMO1 = I.getOperand(2);
+  MachineOperand &SrcMO2 = I.getOperand(4);
+
+  MachineBasicBlock::iterator MII = I;
+  const DebugLoc &DL = I.getDebugLoc();
+  const TargetRegisterClass *DstRC = MRI.getRegClass(I.getOperand(0).getReg());
+  const TargetRegisterClass *Src0RC = MRI.getRegClass(I.getOperand(2).getReg());
+  const TargetRegisterClass *Src1RC = MRI.getRegClass(I.getOperand(4).getReg());
+  const TargetRegisterClass *Src0SubRC =
+      TRI->getSubRegisterClass(Src0RC, AMDGPU::sub0);
+  const TargetRegisterClass *SrcRC = TRI->getSubClassWithSubReg(Src0RC, 1);
+
+  if ((Src1RC->getID() == AMDGPU::SGPR_64RegClassID) ||
+      (Src0RC->getID() == AMDGPU::SGPR_64RegClassID)) {
+    if (Src1RC->getID() == AMDGPU::SGPR_64RegClassID) {
+      // try with sgpr32
+      SmallVector<MachineInstr *, 2> copyInstrs = copyToVregAndInsertMI(I, 4);
+      MachineInstr *CopySGPR1 = copyInstrs[0];
+      MachineInstr *CopySGPR2 = copyInstrs[1];
+
+      if (DstRC->getID() == AMDGPU::VReg_64_Align2RegClassID) {
+        SmallVector<MachineInstr *, 2> unpackedInstrs = insertUnpackedMI(
+            I, DstMO, SrcMO1, CopySGPR1->getOperand(0), SrcMO1,
+            CopySGPR2->getOperand(0), true);
+        unpackedInstrs[0]->addRegisterKilled(unpackedInstrs[0]->getOperand(2).getReg(), TRI);
+        unpackedInstrs[1]->addRegisterKilled(unpackedInstrs[1]->getOperand(2).getReg(), TRI);
+      } else {
+        SmallVector<MachineInstr *, 2> unpackedInstrs = insertUnpackedMI(
+            I, DstMO, SrcMO1, CopySGPR1->getOperand(0), SrcMO1,
+            CopySGPR2->getOperand(0), false);
+        unpackedInstrs[0]->addRegisterKilled(unpackedInstrs[0]->getOperand(2).getReg(), TRI);
+        unpackedInstrs[1]->addRegisterKilled(unpackedInstrs[1]->getOperand(2).getReg(), TRI);
+      }
+    }
+    else {
+      SmallVector<MachineInstr *, 2> copyInstrs = copyToVregAndInsertMI(I, 2);
+      MachineInstr *CopySGPR1 = copyInstrs[0];
+      MachineInstr *CopySGPR2 = copyInstrs[1];
+
+      if (DstRC->getID() == AMDGPU::VReg_64_Align2RegClassID) {
+        SmallVector<MachineInstr *, 2> unpackedInstrs = insertUnpackedMI(
+            I, DstMO, CopySGPR1->getOperand(0), SrcMO2, CopySGPR2->getOperand(0), SrcMO2, true);
+        unpackedInstrs[0]->addRegisterKilled(unpackedInstrs[0]->getOperand(1).getReg(), TRI);
+        unpackedInstrs[1]->addRegisterKilled(unpackedInstrs[1]->getOperand(1).getReg(), TRI);
+      } else {
+        SmallVector<MachineInstr *, 2> unpackedInstrs = insertUnpackedMI(
+            I, DstMO, CopySGPR1->getOperand(0), SrcMO2, CopySGPR2->getOperand(0), SrcMO2, false);
+        unpackedInstrs[0]->addRegisterKilled(unpackedInstrs[0]->getOperand(1).getReg(), TRI);
+        unpackedInstrs[1]->addRegisterKilled(unpackedInstrs[1]->getOperand(1).getReg(), TRI);
+      }
+    }
+    return;
+  }
+
+  if (DstRC->getID() == AMDGPU::VReg_512_Align2RegClassID) {
+    SmallVector<MachineInstr *, 2> unpackedInstrs = insertUnpackedMI(
+            I, DstMO, SrcMO1, SrcMO2, SrcMO1,
+            SrcMO2, false);
+  }
+  else if (DstRC->getID() == AMDGPU::VReg_64_Align2RegClassID) {
+    SmallVector<MachineInstr *, 2> unpackedInstrs = insertUnpackedMI(
+            I, DstMO, SrcMO1, SrcMO2, SrcMO1,
+            SrcMO2, true);
+  }
+  return;
+}
+
 bool GCNPreRAOptimizationsLegacy::runOnMachineFunction(MachineFunction &MF) {
   if (skipFunction(MF.getFunction()))
     return false;
@@ -260,38 +589,46 @@ bool GCNPreRAOptimizationsImpl::run(MachineFunction &MF) {
     Changed |= processReg(Reg);
   }
 
-  if (!ST.useRealTrue16Insts())
-    return Changed;
-
   // Add RA hints to improve True16 COPY elimination.
-  for (const MachineBasicBlock &MBB : MF) {
-    for (const MachineInstr &MI : MBB) {
-      if (MI.getOpcode() != AMDGPU::COPY)
-        continue;
-      Register Dst = MI.getOperand(0).getReg();
-      Register Src = MI.getOperand(1).getReg();
-      if (Dst.isVirtual() &&
-          MRI->getRegClass(Dst) == &AMDGPU::VGPR_16RegClass &&
-          Src.isPhysical() &&
-          TRI->getRegClassForReg(*MRI, Src) == &AMDGPU::VGPR_32RegClass)
-        MRI->setRegAllocationHint(Dst, 0, TRI->getSubReg(Src, AMDGPU::lo16));
-      if (Src.isVirtual() &&
-          MRI->getRegClass(Src) == &AMDGPU::VGPR_16RegClass &&
-          Dst.isPhysical() &&
-          TRI->getRegClassForReg(*MRI, Dst) == &AMDGPU::VGPR_32RegClass)
-        MRI->setRegAllocationHint(Src, 0, TRI->getSubReg(Dst, AMDGPU::lo16));
-      if (!Dst.isVirtual() || !Src.isVirtual())
-        continue;
-      if (MRI->getRegClass(Dst) == &AMDGPU::VGPR_32RegClass &&
-          MRI->getRegClass(Src) == &AMDGPU::VGPR_16RegClass) {
-        MRI->setRegAllocationHint(Dst, AMDGPURI::Size32, Src);
-        MRI->setRegAllocationHint(Src, AMDGPURI::Size16, Dst);
+  // Unpack packed instructions to overlap MFMAs. This allows the compiler to co-issue unpacked instructions with MFMA
+  for (MachineBasicBlock &MBB : MF) {
+    DenseSet<MachineInstr *> instrsToUnpack;
+    for (MachineInstr &MI : MBB) {
+      if (SIInstrInfo::isMFMA(MI)){
+        createListOfPackedInstr(MI, instrsToUnpack);
+      }
+      if (ST.useRealTrue16Insts()){
+        if (MI.getOpcode() != AMDGPU::COPY)
+          continue;
+        Register Dst = MI.getOperand(0).getReg();
+        Register Src = MI.getOperand(1).getReg();
+        if (Dst.isVirtual() &&
+            MRI->getRegClass(Dst) == &AMDGPU::VGPR_16RegClass &&
+            Src.isPhysical() &&
+            TRI->getRegClassForReg(*MRI, Src) == &AMDGPU::VGPR_32RegClass)
+          MRI->setRegAllocationHint(Dst, 0, TRI->getSubReg(Src, AMDGPU::lo16));
+        if (Src.isVirtual() &&
+            MRI->getRegClass(Src) == &AMDGPU::VGPR_16RegClass &&
+            Dst.isPhysical() &&
+            TRI->getRegClassForReg(*MRI, Dst) == &AMDGPU::VGPR_32RegClass)
+          MRI->setRegAllocationHint(Src, 0, TRI->getSubReg(Dst, AMDGPU::lo16));
+        if (!Dst.isVirtual() || !Src.isVirtual())
+          continue;
+        if (MRI->getRegClass(Dst) == &AMDGPU::VGPR_32RegClass &&
+            MRI->getRegClass(Src) == &AMDGPU::VGPR_16RegClass) {
+          MRI->setRegAllocationHint(Dst, AMDGPURI::Size32, Src);
+          MRI->setRegAllocationHint(Src, AMDGPURI::Size16, Dst);
+        }
+        if (MRI->getRegClass(Dst) == &AMDGPU::VGPR_16RegClass &&
+            MRI->getRegClass(Src) == &AMDGPU::VGPR_32RegClass)
+          MRI->setRegAllocationHint(Dst, AMDGPURI::Size16, Src);
       }
-      if (MRI->getRegClass(Dst) == &AMDGPU::VGPR_16RegClass &&
-          MRI->getRegClass(Src) == &AMDGPU::VGPR_32RegClass)
-        MRI->setRegAllocationHint(Dst, AMDGPURI::Size16, Src);
+    }
+    
+    if (!instrsToUnpack.empty()) {
+      for (MachineInstr *MI : instrsToUnpack) 
+        insertMI(*MI);
     }
   }
-
   return Changed;
 }
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index c2da937552240..5562ff590b71d 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -15,7 +15,6 @@
 #include "AMDGPU.h"
 #include "AMDGPUInstrInfo.h"
 #include "GCNHazardRecognizer.h"
-#include "GCNSubtarget.h"
 #include "SIMachineFunctionInfo.h"
 #include "Utils/AMDGPUBaseInfo.h"
 #include "llvm/Analysis/ValueTracking.h"
@@ -6173,6 +6172,64 @@ bool SIInstrInfo::isOperandLegal(const MachineInstr &MI, unsigned OpIdx,
   return isImmOperandLegal(MI, OpIdx, *MO);
 }
 
+bool SIInstrInfo::isNeverCoissue(MachineInstr &MI) const {
+  bool IsGFX950Only = ST.hasGFX950Insts();
+  if (!IsGFX950Only)
+    return false;
+
+  if (!isVALU(MI))
+    return false;
+
+  // V_COS, V_EXP, V_RCP, etc.
+  if (isTRANS(MI))
+    return true;
+
+  // DOT2, DOT2C, DOT4, etc.
+  if (isDOT(MI))
+    return true;
+
+  // MFMA, SMFMA
+  if (isMFMA(MI))
+    return true;
+
+  unsigned Opcode = MI.getOpcode();
+  switch (Opcode) {
+    case AMDGPU::V_CVT_PK_BF8_F32_e64:
+    case AMDGPU::V_CVT_PK_FP8_F32_e64:
+    case AMDGPU::V_MQSAD_PK_U16_U8_e64:
+    case AMDGPU::V_MQSAD_U32_U8_e64:
+    case AMDGPU::V_PK_ADD_F16:
+    case AMDGPU::V_PK_ADD_F32:
+    case AMDGPU::V_PK_ADD_I16:
+    case AMDGPU::V_PK_ADD_U16:
+    case AMDGPU::V_PK_ASHRREV_I16:
+    case AMDGPU::V_PK_FMA_F16:
+    case AMDGPU::V_PK_FMA_F32:
+    case AMDGPU::V_PK_FMAC_F16_e32:
+    case AMDGPU::V_PK_FMAC_F16_e64:
+    case AMDGPU::V_PK_LSHLREV_B16:
+    case AMDGPU::V_PK_LSHRREV_B16:
+    case AMDGPU::V_PK_MAD_I16:
+    case AMDGPU::V_PK_MAD_U16:
+    case AMDGPU...
[truncated]

@akadutta
Contributor Author

akadutta commented Aug 1, 2025

Requesting feedback on this patch. @hidekisaito @bcahoon @jrbyrnes @arsenm @ronlieb

@bcahoon bcahoon requested review from arsenm, jrbyrnes and kerbowa August 1, 2025 17:54
#include "llvm/InitializePasses.h"
#include "GCNSchedStrategy.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineScheduler.h"
Contributor

Include files should be sorted alphabetically by category. It is probably useful to run clang-format on your changes.

if (SIInstrInfo::isMFMA(MI)){
createListOfPackedInstr(MI, instrsToUnpack);
}
if (ST.useRealTrue16Insts()){
Contributor

Can you use an early exit here?

Contributor Author

There was an early exit there for True16. However, keeping it would mean duplicating the loop, since the scan over the machine function that detects the relevant packed instructions needs to run anyway. This way, it's just one block of code.

MRI->setRegAllocationHint(Dst, AMDGPURI::Size16, Src);
}

if (!instrsToUnpack.empty()) {
Contributor

Why do you create instrsToUnpack at the top of the loop, but do the insertion at the bottom?

Contributor Author

This avoids inserting into the BB while I'm iterating over it. If I perform multiple insertions while iterating over the BB, that breaks application compilation, since we are also deleting the packed instruction from the BB.

Contributor

Maybe save the instructions to be replaced in a data structure. Then, after the loop iterate over the saved instructions to replace them? I think that would avoid the problems with adding/removing while iterating.

Contributor Author

That's what I am doing right now, unless I misunderstood the comment. I'm using a DenseSet to hold the instructions.

DenseSet<MachineInstr *> instrsToUnpack;
for (MachineInstr &MI : MBB) {
if (SIInstrInfo::isMFMA(MI)){
createListOfPackedInstr(MI, instrsToUnpack);
Contributor

I'm not sure you really need to create a list of instructions to unpack by scanning for each MFMA. It may be possible to keep some small amount of state and replace the instructions with a single scan. I think you may just need totalCyclesBetweenCandidates?
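
For illustration, a rough sketch of that single-scan idea, keeping only a running cycle budget (names and details assumed; this is not code from the patch):

  int RemainingCycles = 0; // budget left under the most recent MFMA's latency
  for (MachineInstr &MI : MBB) {
    if (SIInstrInfo::isMFMA(MI)) {
      const MCSchedClassDesc *SC = SchedModel.resolveSchedClass(&MI);
      RemainingCycles = SchedModel.getWriteProcResBegin(SC)->ReleaseAtCycle;
      continue;
    }
    RemainingCycles -= SchedModel.getWriteProcResBegin(
                           SchedModel.resolveSchedClass(&MI))->ReleaseAtCycle;
    if (RemainingCycles > 0 && isUnpackingSupportedInstr(MI) &&
        TII->isNeverCoissue(MI))
      InstrsToUnpack.insert(&MI);
  }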

return MIList;
}

void GCNPreRAOptimizationsImpl::insertMI(MachineInstr &I) {
Contributor

Does this function replace the packed MI with the unpacked version? The name "insertMI" seems very generic.

@bcahoon
Contributor

bcahoon commented Aug 4, 2025

Some more tests are needed, especially MIR tests that can help with test coverage.

Comment on lines 6176 to 6178
bool IsGFX950Only = ST.hasGFX950Insts();
if (!IsGFX950Only)
return false;
Contributor

This shouldn't require target checks; ideally it would be derived from the used resources in the sched model.

Member

I'd like to suggest limiting the scope of this patch to MI300X and MI350 for now and doing local testing, with the promise of expanding this as suggested in a forthcoming patch.

@@ -0,0 +1,116 @@
; TODO: change variable names. Make test smaller if possible
Contributor

This needs to be much smaller, use named values, and have a run line.

This also requires MIR tests

int NumInst = 0;

auto E = BB->end();
auto schedModel = TII->getSchedModel();
Contributor

Capitalize the variables here and elsewhere.

int clampIdx = AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::clamp);
int64_t clampVal = I.getOperand(clampIdx).getImm();

int src0_modifiers_Idx = AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::src0_modifiers);
Contributor

Suggested change
int src0_modifiers_Idx = AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::src0_modifiers);
int Src0ModifierxIdx = AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::src0_modifiers);

I think the LLVM style guide suggests something like this.

Contributor Author

Thanks JP. I'll update the variable names

Comment on lines 381 to 383
if (src0_Mods & SISrcMods::NEG) {
Lo_src0_mods |= SISrcMods::NEG;
}
Contributor

Suggested change
if (src0_Mods & SISrcMods::NEG) {
Lo_src0_mods |= SISrcMods::NEG;
}
if (src0_Mods & SISrcMods::NEG)
Lo_src0_mods |= SISrcMods::NEG;

Single-statement ifs do not require curly braces. Here and elsewhere.

unsigned Lo_src0_mods = 0;
unsigned Lo_src1_mods = 0;
uint16_t unpackedOpcode = mapToUnpackedOpcode(I);
MachineInstrBuilder Op0L_Op1L = BuildMI(MBB, I, DL, TII->get(unpackedOpcode));
Contributor

Setting up Op0L_Op1L and Op0H_Op1H seems to be very similar / duplicated code. Can this be pulled out to a function and reused for both cases?

Contributor Author

I'll give it some more thought. As of now, Op0L_Op1L and Op0H_Op1H are designed to be separate instructions representing the lower and upper dwords/16 bits of the packed instruction, and as such they need to be separately inserted into the maps and register intervals.
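
For reference, one possible shape for a shared builder, parameterized on which half is being produced (hypothetical signature; operand and modifier handling elided):

  // Build either the low or the high half of the unpacked pair; the caller
  // still inserts each result into the slot-index maps and live intervals.
  MachineInstrBuilder createUnpackedHalf(MachineBasicBlock &MBB,
                                         MachineInstr &I, const DebugLoc &DL,
                                         uint16_t UnpackedOpc, bool IsHiBits) {
    unsigned Sub = IsHiBits ? AMDGPU::sub1 : AMDGPU::sub0;
    MachineOperand &DstMO = I.getOperand(0);
    MachineInstrBuilder MIB = BuildMI(MBB, I, DL, TII->get(UnpackedOpc));
    MIB.addDef(DstMO.getReg(), 0,
               TRI->composeSubRegIndices(DstMO.getSubReg(), Sub));
    // ... add src0/src1 with the matching op_sel/neg bits, then clamp and omod ...
    return MIB;
  }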


github-actions bot commented Aug 8, 2025

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff HEAD~1 HEAD --extensions cpp,h -- llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp llvm/lib/Target/AMDGPU/SIInstrInfo.cpp llvm/lib/Target/AMDGPU/SIInstrInfo.h
View the diff from clang-format here.
diff --git a/llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp b/llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp
index f56d73e99..6fa0efca3 100644
--- a/llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp
@@ -28,29 +28,28 @@
 /// and a VGPR_16. If we use the VGPR_16 that corresponds to the lo16 bits of
 /// the VGPR_32, the COPY can be completely eliminated.
 ///
-/// Additionally, this pass also unpacks packed instructions (V_PK_MUL_F32 and V_PK_ADD_F32) 
-/// adjacent to MFMAs such that they can be co-issued.
-/// This helps with overlapping MFMA and certain vector instructions in machine schedules
+/// Additionally, this pass also unpacks packed instructions (V_PK_MUL_F32 and
+/// V_PK_ADD_F32) adjacent to MFMAs such that they can be co-issued. This helps
+/// with overlapping MFMA and certain vector instructions in machine schedules
 /// and is expected to improve performance.
-/// Only those packed instructions are unpacked that are overlapped by the MFMA latency.
-/// Rest should remain untouched.
+/// Only those packed instructions are unpacked that are overlapped by the MFMA
+/// latency. Rest should remain untouched.
 //===----------------------------------------------------------------------===//
 
 #include "GCNPreRAOptimizations.h"
 #include "AMDGPU.h"
+#include "GCNSchedStrategy.h"
 #include "GCNSubtarget.h"
 #include "MCTargetDesc/AMDGPUMCTargetDesc.h"
+#include "SIInstrInfo.h"
 #include "SIRegisterInfo.h"
+#include "llvm/ADT/DenseSet.h"
 #include "llvm/CodeGen/LiveIntervals.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
-#include "llvm/InitializePasses.h"
-#include "llvm/ADT/DenseSet.h"
-#include "SIInstrInfo.h"
-#include "llvm/CodeGen/RegisterScavenging.h"
-#include "llvm/InitializePasses.h"
-#include "GCNSchedStrategy.h"
 #include "llvm/CodeGen/MachineInstr.h"
 #include "llvm/CodeGen/MachineScheduler.h"
+#include "llvm/CodeGen/RegisterScavenging.h"
+#include "llvm/InitializePasses.h"
 using namespace llvm;
 
 #define DEBUG_TYPE "amdgpu-pre-ra-optimizations"
@@ -65,15 +64,17 @@ private:
   LiveIntervals *LIS;
 
   bool processReg(Register Reg);
-  bool createListOfPackedInstr(MachineInstr &BeginMI, DenseSet<MachineInstr *> &instrsToUnpack);
+  bool createListOfPackedInstr(MachineInstr &BeginMI,
+                               DenseSet<MachineInstr *> &instrsToUnpack);
   bool isUnpackingSupportedInstr(MachineInstr &MI) const;
   void insertMI(MachineInstr &I);
   uint16_t mapToUnpackedOpcode(MachineInstr &I);
   SmallVector<MachineInstr *, 2> copyToVregAndInsertMI(MachineInstr &I,
                                                        unsigned SGPRSrcPos);
   SmallVector<MachineInstr *, 2>
-  insertUnpackedMI(MachineInstr &I, MachineOperand &DstMO, MachineOperand &LoSrcMO1,
-                   MachineOperand &LoSrcMO2, MachineOperand &HiSrcMO1, MachineOperand &HiSrcMO2,
+  insertUnpackedMI(MachineInstr &I, MachineOperand &DstMO,
+                   MachineOperand &LoSrcMO1, MachineOperand &LoSrcMO2,
+                   MachineOperand &HiSrcMO1, MachineOperand &HiSrcMO2,
                    bool isVreg_64);
 
 public:
@@ -247,16 +248,16 @@ bool GCNPreRAOptimizationsImpl::processReg(Register Reg) {
   return true;
 }
 
-bool GCNPreRAOptimizationsImpl::isUnpackingSupportedInstr(MachineInstr &MI) const {
+bool GCNPreRAOptimizationsImpl::isUnpackingSupportedInstr(
+    MachineInstr &MI) const {
   unsigned Opcode = MI.getOpcode();
   switch (Opcode) {
-    case AMDGPU::V_PK_ADD_F32:
-    case AMDGPU::V_PK_MUL_F32:
-      return true;
-
-    default:
-      return false;
+  case AMDGPU::V_PK_ADD_F32:
+  case AMDGPU::V_PK_MUL_F32:
+    return true;
 
+  default:
+    return false;
   }
 }
 
@@ -266,19 +267,18 @@ uint16_t GCNPreRAOptimizationsImpl::mapToUnpackedOpcode(MachineInstr &I) {
   // VOP3 instructions allow VOP3P source modifiers to be translated to VOP3
   // e32 instructions are VOP2 and don't allow source modifiers
   switch (Opcode) {
-    case AMDGPU::V_PK_ADD_F32:
-      return AMDGPU::V_ADD_F32_e64;
-    case AMDGPU::V_PK_MUL_F32:
-      return AMDGPU::V_MUL_F32_e64;
-    default:
-      return std::numeric_limits<uint16_t>::max();
-
+  case AMDGPU::V_PK_ADD_F32:
+    return AMDGPU::V_ADD_F32_e64;
+  case AMDGPU::V_PK_MUL_F32:
+    return AMDGPU::V_MUL_F32_e64;
+  default:
+    return std::numeric_limits<uint16_t>::max();
   }
 }
 
 SmallVector<MachineInstr *, 2>
 GCNPreRAOptimizationsImpl::copyToVregAndInsertMI(MachineInstr &I,
-                                                   unsigned SGPRSrcPos) {
+                                                 unsigned SGPRSrcPos) {
   SmallVector<MachineInstr *, 2> MIList;
 
   MachineBasicBlock &MBB = *I.getParent();
@@ -319,19 +319,23 @@ bool GCNPreRAOptimizationsImpl::createListOfPackedInstr(
 
   auto E = BB->end();
   auto SchedModel = TII->getSchedModel();
-  const MCSchedClassDesc *SchedClassDesc = SchedModel.resolveSchedClass(&BeginMI);
-  const int NumMFMACycles = SchedModel.getWriteProcResBegin(SchedClassDesc)->ReleaseAtCycle;
+  const MCSchedClassDesc *SchedClassDesc =
+      SchedModel.resolveSchedClass(&BeginMI);
+  const int NumMFMACycles =
+      SchedModel.getWriteProcResBegin(SchedClassDesc)->ReleaseAtCycle;
   int TotalCyclesBetweenCandidates = 0;
   for (auto I = std::next(BeginMI.getIterator()); I != E; ++I) {
     MachineInstr &Instr = *I;
-    const MCSchedClassDesc *instrSchedClassDesc = SchedModel.resolveSchedClass(&Instr);
-    TotalCyclesBetweenCandidates += SchedModel.getWriteProcResBegin(instrSchedClassDesc)->ReleaseAtCycle;
+    const MCSchedClassDesc *instrSchedClassDesc =
+        SchedModel.resolveSchedClass(&Instr);
+    TotalCyclesBetweenCandidates +=
+        SchedModel.getWriteProcResBegin(instrSchedClassDesc)->ReleaseAtCycle;
     if (Instr.isMetaInstruction())
       continue;
 
     if (Instr.isTerminator())
       return false;
-    
+
     if (TotalCyclesBetweenCandidates > NumMFMACycles)
       return false;
 
@@ -344,8 +348,9 @@ bool GCNPreRAOptimizationsImpl::createListOfPackedInstr(
 }
 
 SmallVector<MachineInstr *, 2> GCNPreRAOptimizationsImpl::insertUnpackedMI(
-    MachineInstr &I, MachineOperand &DstMO, MachineOperand &LoSrcMO1, MachineOperand &LoSrcMO2,
-    MachineOperand &HiSrcMO1, MachineOperand &HiSrcMO2, bool isVreg_64) {
+    MachineInstr &I, MachineOperand &DstMO, MachineOperand &LoSrcMO1,
+    MachineOperand &LoSrcMO2, MachineOperand &HiSrcMO1,
+    MachineOperand &HiSrcMO2, bool isVreg_64) {
 
   SmallVector<MachineInstr *, 2> MIList;
   MachineBasicBlock &MBB = *I.getParent();
@@ -363,103 +368,118 @@ SmallVector<MachineInstr *, 2> GCNPreRAOptimizationsImpl::insertUnpackedMI(
 
   const MCInstrDesc instrDesc = I.getDesc();
 
-  int clampIdx = AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::clamp);
+  int clampIdx =
+      AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::clamp);
   int64_t clampVal = I.getOperand(clampIdx).getImm();
 
-  int src0_modifiers_Idx = AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::src0_modifiers);
-  int src1_modifiers_Idx = AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::src1_modifiers);
+  int src0_modifiers_Idx =
+      AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::src0_modifiers);
+  int src1_modifiers_Idx =
+      AMDGPU::getNamedOperandIdx(I.getOpcode(), AMDGPU::OpName::src1_modifiers);
   unsigned src0_Mods = I.getOperand(src0_modifiers_Idx).getImm();
   unsigned src1_Mods = I.getOperand(src1_modifiers_Idx).getImm();
 
-  //don't worry about abs values. Packed instructions (VOP3P) do not support them
+  // don't worry about abs values. Packed instructions (VOP3P) do not support
+  // them
   unsigned Lo_src0_mods = 0;
   unsigned Lo_src1_mods = 0;
   uint16_t unpackedOpcode = mapToUnpackedOpcode(I);
   MachineInstrBuilder Op0L_Op1L = BuildMI(MBB, I, DL, TII->get(unpackedOpcode));
-  Op0L_Op1L.addDef(DstReg, 0, DestSubIdx); //vdst
+  Op0L_Op1L.addDef(DstReg, 0, DestSubIdx); // vdst
   if (src0_Mods & SISrcMods::OP_SEL_0) {
     if (src0_Mods & SISrcMods::NEG) {
       Lo_src0_mods |= SISrcMods::NEG;
     }
-    Op0L_Op1L.addImm(Lo_src0_mods); //src0_modifiers
-    unsigned Src0SubIdx = TRI->composeSubRegIndices(LoSrcMO1.getSubReg(), AMDGPU::sub1);
-    Op0L_Op1L.addReg(LoSrcMO1.getReg(), 0, Src0SubIdx); //src0
-  }
-  else {
-    Op0L_Op1L.addImm(Lo_src0_mods); //src0_modifiers
-    unsigned Src0SubIdx = TRI->composeSubRegIndices(LoSrcMO1.getSubReg(), AMDGPU::sub0);
-    Op0L_Op1L.addReg(LoSrcMO1.getReg(), 0, Src0SubIdx); //src0 //if op_sel == 0, select register 0 of reg:sub0_sub1
+    Op0L_Op1L.addImm(Lo_src0_mods); // src0_modifiers
+    unsigned Src0SubIdx =
+        TRI->composeSubRegIndices(LoSrcMO1.getSubReg(), AMDGPU::sub1);
+    Op0L_Op1L.addReg(LoSrcMO1.getReg(), 0, Src0SubIdx); // src0
+  } else {
+    Op0L_Op1L.addImm(Lo_src0_mods); // src0_modifiers
+    unsigned Src0SubIdx =
+        TRI->composeSubRegIndices(LoSrcMO1.getSubReg(), AMDGPU::sub0);
+    Op0L_Op1L.addReg(LoSrcMO1.getReg(), 0,
+                     Src0SubIdx); // src0 //if op_sel == 0, select register 0 of
+                                  // reg:sub0_sub1
   }
 
   if (src1_Mods & SISrcMods::OP_SEL_0) {
     if (src1_Mods & SISrcMods::NEG) {
       Lo_src1_mods |= SISrcMods::NEG;
     }
-    Op0L_Op1L.addImm(Lo_src1_mods); //src0_modifiers
-    unsigned Src1SubIdx = TRI->composeSubRegIndices(LoSrcMO2.getSubReg(), AMDGPU::sub1);
-    Op0L_Op1L.addReg(LoSrcMO2.getReg(), 0, Src1SubIdx); //src0
+    Op0L_Op1L.addImm(Lo_src1_mods); // src0_modifiers
+    unsigned Src1SubIdx =
+        TRI->composeSubRegIndices(LoSrcMO2.getSubReg(), AMDGPU::sub1);
+    Op0L_Op1L.addReg(LoSrcMO2.getReg(), 0, Src1SubIdx); // src0
+  } else {
+    Op0L_Op1L.addImm(Lo_src1_mods); // src0_modifiers
+    unsigned Src1SubIdx =
+        TRI->composeSubRegIndices(LoSrcMO2.getSubReg(), AMDGPU::sub0);
+    Op0L_Op1L.addReg(LoSrcMO2.getReg(), 0,
+                     Src1SubIdx); // src0 //if op_sel_hi == 0, select register 0
+                                  // of reg:sub0_sub1
   }
-  else {
-    Op0L_Op1L.addImm(Lo_src1_mods); //src0_modifiers
-    unsigned Src1SubIdx = TRI->composeSubRegIndices(LoSrcMO2.getSubReg(), AMDGPU::sub0);
-    Op0L_Op1L.addReg(LoSrcMO2.getReg(), 0, Src1SubIdx); //src0 //if op_sel_hi == 0, select register 0 of reg:sub0_sub1
-  }
-  Op0L_Op1L.addImm(clampVal); //clamp
-  //packed instructions do not support output modifiers. safe to assign them 0 for this use case
-  Op0L_Op1L.addImm(0); //omod
+  Op0L_Op1L.addImm(clampVal); // clamp
+  // packed instructions do not support output modifiers. safe to assign them 0
+  // for this use case
+  Op0L_Op1L.addImm(0); // omod
 
   if (isVreg_64) {
     Op0L_Op1L->getOperand(0).setIsUndef();
-  }
-  else if (I.getOperand(0).isUndef()){
+  } else if (I.getOperand(0).isUndef()) {
     Op0L_Op1L->getOperand(0).setIsUndef();
   }
 
   LIS->InsertMachineInstrInMaps(*Op0L_Op1L);
 
-  SrcSubIdx1 =
-      TRI->composeSubRegIndices(LoSrcMO1.getSubReg(), AMDGPU::sub1);
-  SrcSubIdx2 =
-      TRI->composeSubRegIndices(LoSrcMO2.getSubReg(), AMDGPU::sub1);
-  DestSubIdx =
-      TRI->composeSubRegIndices(DstMO.getSubReg(), AMDGPU::sub1);
+  SrcSubIdx1 = TRI->composeSubRegIndices(LoSrcMO1.getSubReg(), AMDGPU::sub1);
+  SrcSubIdx2 = TRI->composeSubRegIndices(LoSrcMO2.getSubReg(), AMDGPU::sub1);
+  DestSubIdx = TRI->composeSubRegIndices(DstMO.getSubReg(), AMDGPU::sub1);
 
-  //don't worry about abs values. Packed instructions (VOP3P) do not support them
+  // don't worry about abs values. Packed instructions (VOP3P) do not support
+  // them
   unsigned Hi_src0_mods = 0;
   unsigned Hi_src1_mods = 0;
 
   MachineInstrBuilder Op0H_Op1H = BuildMI(MBB, I, DL, TII->get(unpackedOpcode));
-  Op0H_Op1H.addDef(DstReg, 0, DestSubIdx); //vdst
+  Op0H_Op1H.addDef(DstReg, 0, DestSubIdx); // vdst
   if (src0_Mods & SISrcMods::OP_SEL_1) {
     if (src0_Mods & SISrcMods::NEG_HI) {
       Hi_src0_mods |= SISrcMods::NEG;
     }
-    Op0H_Op1H.addImm(Hi_src0_mods); //src0_modifiers
-    unsigned Src0SubIdx = TRI->composeSubRegIndices(HiSrcMO1.getSubReg(), AMDGPU::sub1);
-    Op0H_Op1H.addReg(HiSrcMO1.getReg(), 0, Src0SubIdx); //src0
-  }
-  else {
-    Op0H_Op1H.addImm(Hi_src0_mods); //src0_modifiers
-    unsigned Src0SubIdx = TRI->composeSubRegIndices(HiSrcMO1.getSubReg(), AMDGPU::sub0);
-    Op0H_Op1H.addReg(HiSrcMO1.getReg(), 0, Src0SubIdx); //src0 //if op_sel_hi == 0, select register 0 of reg:sub0_sub1
+    Op0H_Op1H.addImm(Hi_src0_mods); // src0_modifiers
+    unsigned Src0SubIdx =
+        TRI->composeSubRegIndices(HiSrcMO1.getSubReg(), AMDGPU::sub1);
+    Op0H_Op1H.addReg(HiSrcMO1.getReg(), 0, Src0SubIdx); // src0
+  } else {
+    Op0H_Op1H.addImm(Hi_src0_mods); // src0_modifiers
+    unsigned Src0SubIdx =
+        TRI->composeSubRegIndices(HiSrcMO1.getSubReg(), AMDGPU::sub0);
+    Op0H_Op1H.addReg(HiSrcMO1.getReg(), 0,
+                     Src0SubIdx); // src0 //if op_sel_hi == 0, select register 0
+                                  // of reg:sub0_sub1
   }
 
   if (src1_Mods & SISrcMods::OP_SEL_1) {
     if (src1_Mods & SISrcMods::NEG_HI) {
       Hi_src1_mods |= SISrcMods::NEG;
     }
-    Op0H_Op1H.addImm(Hi_src1_mods); //src0_modifiers
-    unsigned Src1SubIdx = TRI->composeSubRegIndices(HiSrcMO2.getSubReg(), AMDGPU::sub1);
-    Op0H_Op1H.addReg(HiSrcMO2.getReg(), 0, Src1SubIdx); //src0
+    Op0H_Op1H.addImm(Hi_src1_mods); // src0_modifiers
+    unsigned Src1SubIdx =
+        TRI->composeSubRegIndices(HiSrcMO2.getSubReg(), AMDGPU::sub1);
+    Op0H_Op1H.addReg(HiSrcMO2.getReg(), 0, Src1SubIdx); // src0
+  } else {
+    Op0H_Op1H.addImm(Hi_src1_mods); // src0_modifiers
+    unsigned Src1SubIdx =
+        TRI->composeSubRegIndices(HiSrcMO2.getSubReg(), AMDGPU::sub0);
+    Op0H_Op1H.addReg(HiSrcMO2.getReg(), 0,
+                     Src1SubIdx); // src0 //if op_sel_hi == 0, select register 0
+                                  // of reg:sub0_sub1
   }
-  else {
-    Op0H_Op1H.addImm(Hi_src1_mods); //src0_modifiers
-    unsigned Src1SubIdx = TRI->composeSubRegIndices(HiSrcMO2.getSubReg(), AMDGPU::sub0);
-    Op0H_Op1H.addReg(HiSrcMO2.getReg(), 0, Src1SubIdx); //src0 //if op_sel_hi == 0, select register 0 of reg:sub0_sub1
-  }
-  Op0H_Op1H.addImm(clampVal); //clamp
-  //packed instructions do not support output modifiers. safe to assign them 0 for this use case
-  Op0H_Op1H.addImm(0); //omod
+  Op0H_Op1H.addImm(clampVal); // clamp
+  // packed instructions do not support output modifiers. safe to assign them 0
+  // for this use case
+  Op0H_Op1H.addImm(0); // omod
   LIS->InsertMachineInstrInMaps(*Op0H_Op1H);
 
   if (I.getFlag(MachineInstr::MIFlag::NoFPExcept)) {
@@ -504,30 +524,33 @@ void GCNPreRAOptimizationsImpl::insertMI(MachineInstr &I) {
     MachineInstr *CopySGPR2 = copyInstrs[1];
 
     bool isVReg64 = (DstRC->getID() == AMDGPU::VReg_64_Align2RegClassID);
-    SmallVector<MachineInstr *, 2> unpackedInstrs = insertUnpackedMI(
-        I, DstMO, SrcMO1, CopySGPR1->getOperand(0), SrcMO1,
-        CopySGPR2->getOperand(0), isVReg64);
-    unpackedInstrs[0]->addRegisterKilled(unpackedInstrs[0]->getOperand(2).getReg(), TRI);
-    unpackedInstrs[1]->addRegisterKilled(unpackedInstrs[1]->getOperand(2).getReg(), TRI);
+    SmallVector<MachineInstr *, 2> unpackedInstrs =
+        insertUnpackedMI(I, DstMO, SrcMO1, CopySGPR1->getOperand(0), SrcMO1,
+                         CopySGPR2->getOperand(0), isVReg64);
+    unpackedInstrs[0]->addRegisterKilled(
+        unpackedInstrs[0]->getOperand(2).getReg(), TRI);
+    unpackedInstrs[1]->addRegisterKilled(
+        unpackedInstrs[1]->getOperand(2).getReg(), TRI);
     return;
-  }
-  else if (Src0RC->getID() == AMDGPU::SGPR_64RegClassID) {
+  } else if (Src0RC->getID() == AMDGPU::SGPR_64RegClassID) {
     SmallVector<MachineInstr *, 2> copyInstrs = copyToVregAndInsertMI(I, 2);
     MachineInstr *CopySGPR1 = copyInstrs[0];
     MachineInstr *CopySGPR2 = copyInstrs[1];
 
     bool isVReg64 = (DstRC->getID() == AMDGPU::VReg_64_Align2RegClassID);
-    SmallVector<MachineInstr *, 2> unpackedInstrs = insertUnpackedMI(
-        I, DstMO, CopySGPR1->getOperand(0), SrcMO2, CopySGPR2->getOperand(0), SrcMO2, isVReg64);
-    unpackedInstrs[0]->addRegisterKilled(unpackedInstrs[0]->getOperand(1).getReg(), TRI);
-    unpackedInstrs[1]->addRegisterKilled(unpackedInstrs[1]->getOperand(1).getReg(), TRI);
+    SmallVector<MachineInstr *, 2> unpackedInstrs =
+        insertUnpackedMI(I, DstMO, CopySGPR1->getOperand(0), SrcMO2,
+                         CopySGPR2->getOperand(0), SrcMO2, isVReg64);
+    unpackedInstrs[0]->addRegisterKilled(
+        unpackedInstrs[0]->getOperand(1).getReg(), TRI);
+    unpackedInstrs[1]->addRegisterKilled(
+        unpackedInstrs[1]->getOperand(1).getReg(), TRI);
     return;
   }
 
   bool isVReg64 = (DstRC->getID() == AMDGPU::VReg_64_Align2RegClassID);
-  SmallVector<MachineInstr *, 2> unpackedInstrs = insertUnpackedMI(
-          I, DstMO, SrcMO1, SrcMO2, SrcMO1,
-          SrcMO2, isVReg64);
+  SmallVector<MachineInstr *, 2> unpackedInstrs =
+      insertUnpackedMI(I, DstMO, SrcMO1, SrcMO2, SrcMO1, SrcMO2, isVReg64);
   return;
 }
 
@@ -567,14 +590,15 @@ bool GCNPreRAOptimizationsImpl::run(MachineFunction &MF) {
   }
 
   // Add RA hints to improve True16 COPY elimination.
-  // Unpack packed instructions to overlap MFMAs. This allows the compiler to co-issue unpacked instructions with MFMA
+  // Unpack packed instructions to overlap MFMAs. This allows the compiler to
+  // co-issue unpacked instructions with MFMA
   for (MachineBasicBlock &MBB : MF) {
     DenseSet<MachineInstr *> instrsToUnpack;
     for (MachineInstr &MI : MBB) {
-      if (SIInstrInfo::isMFMA(MI)){
+      if (SIInstrInfo::isMFMA(MI)) {
         createListOfPackedInstr(MI, instrsToUnpack);
       }
-      if (ST.useRealTrue16Insts()){
+      if (ST.useRealTrue16Insts()) {
         if (MI.getOpcode() != AMDGPU::COPY)
           continue;
         Register Dst = MI.getOperand(0).getReg();
@@ -601,9 +625,9 @@ bool GCNPreRAOptimizationsImpl::run(MachineFunction &MF) {
           MRI->setRegAllocationHint(Dst, AMDGPURI::Size16, Src);
       }
     }
-    
+
     if (!instrsToUnpack.empty()) {
-      for (MachineInstr *MI : instrsToUnpack) 
+      for (MachineInstr *MI : instrsToUnpack)
         insertMI(*MI);
     }
   }
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 2a6720d48..7947069b2 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -6197,40 +6197,40 @@ bool SIInstrInfo::isNeverCoissue(MachineInstr &MI) const {
 
   unsigned Opcode = MI.getOpcode();
   switch (Opcode) {
-    case AMDGPU::V_CVT_PK_BF8_F32_e64:
-    case AMDGPU::V_CVT_PK_FP8_F32_e64:
-    case AMDGPU::V_MQSAD_PK_U16_U8_e64:
-    case AMDGPU::V_MQSAD_U32_U8_e64:
-    case AMDGPU::V_PK_ADD_F16:
-    case AMDGPU::V_PK_ADD_F32:
-    case AMDGPU::V_PK_ADD_I16:
-    case AMDGPU::V_PK_ADD_U16:
-    case AMDGPU::V_PK_ASHRREV_I16:
-    case AMDGPU::V_PK_FMA_F16:
-    case AMDGPU::V_PK_FMA_F32:
-    case AMDGPU::V_PK_FMAC_F16_e32:
-    case AMDGPU::V_PK_FMAC_F16_e64:
-    case AMDGPU::V_PK_LSHLREV_B16:
-    case AMDGPU::V_PK_LSHRREV_B16:
-    case AMDGPU::V_PK_MAD_I16:
-    case AMDGPU::V_PK_MAD_U16:
-    case AMDGPU::V_PK_MAX_F16:
-    case AMDGPU::V_PK_MAX_I16:
-    case AMDGPU::V_PK_MAX_U16:
-    case AMDGPU::V_PK_MIN_F16:
-    case AMDGPU::V_PK_MIN_I16:
-    case AMDGPU::V_PK_MIN_U16:
-    case AMDGPU::V_PK_MOV_B32:
-    case AMDGPU::V_PK_MUL_F16:
-    case AMDGPU::V_PK_MUL_F32:
-    case AMDGPU::V_PK_MUL_LO_U16:
-    case AMDGPU::V_PK_SUB_I16:
-    case AMDGPU::V_PK_SUB_U16:
-    case AMDGPU::V_QSAD_PK_U16_U8_e64:
-      return true;
-    default:
-      return false;
-    }
+  case AMDGPU::V_CVT_PK_BF8_F32_e64:
+  case AMDGPU::V_CVT_PK_FP8_F32_e64:
+  case AMDGPU::V_MQSAD_PK_U16_U8_e64:
+  case AMDGPU::V_MQSAD_U32_U8_e64:
+  case AMDGPU::V_PK_ADD_F16:
+  case AMDGPU::V_PK_ADD_F32:
+  case AMDGPU::V_PK_ADD_I16:
+  case AMDGPU::V_PK_ADD_U16:
+  case AMDGPU::V_PK_ASHRREV_I16:
+  case AMDGPU::V_PK_FMA_F16:
+  case AMDGPU::V_PK_FMA_F32:
+  case AMDGPU::V_PK_FMAC_F16_e32:
+  case AMDGPU::V_PK_FMAC_F16_e64:
+  case AMDGPU::V_PK_LSHLREV_B16:
+  case AMDGPU::V_PK_LSHRREV_B16:
+  case AMDGPU::V_PK_MAD_I16:
+  case AMDGPU::V_PK_MAD_U16:
+  case AMDGPU::V_PK_MAX_F16:
+  case AMDGPU::V_PK_MAX_I16:
+  case AMDGPU::V_PK_MAX_U16:
+  case AMDGPU::V_PK_MIN_F16:
+  case AMDGPU::V_PK_MIN_I16:
+  case AMDGPU::V_PK_MIN_U16:
+  case AMDGPU::V_PK_MOV_B32:
+  case AMDGPU::V_PK_MUL_F16:
+  case AMDGPU::V_PK_MUL_F32:
+  case AMDGPU::V_PK_MUL_LO_U16:
+  case AMDGPU::V_PK_SUB_I16:
+  case AMDGPU::V_PK_SUB_U16:
+  case AMDGPU::V_QSAD_PK_U16_U8_e64:
+    return true;
+  default:
+    return false;
+  }
 }
 
 void SIInstrInfo::legalizeOperandsVOP2(MachineRegisterInfo &MRI,

MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();
MachineFunction &MF = *MBB.getParent();

Register DstReg = I.getOperand(0).getReg();
Contributor

I.getOperand(0).getReg() -> DstMO.getReg()

MachineFunction &MF = *MBB.getParent();

Register DstReg = I.getOperand(0).getReg();
Register SrcReg1 = I.getOperand(2).getReg();
Contributor

Ditto.

MachineFunction &MF = *MBB.getParent();

Register DstReg = I.getOperand(0).getReg();
Register SrcReg1 = I.getOperand(2).getReg();
Contributor

Do we need to check if the operand is a register?


const DebugLoc &DL = I.getDebugLoc();
const TargetRegisterClass *DstRC = MRI.getRegClass(I.getOperand(0).getReg());
const TargetRegisterClass *Src0RC = MRI.getRegClass(I.getOperand(2).getReg());
Contributor

Ditto

const DebugLoc &DL = I.getDebugLoc();
const TargetRegisterClass *DstRC = MRI.getRegClass(I.getOperand(0).getReg());
const TargetRegisterClass *Src0RC = MRI.getRegClass(I.getOperand(2).getReg());
const TargetRegisterClass *Src1RC = MRI.getRegClass(I.getOperand(4).getReg());
Contributor

Ditto

const TargetRegisterClass *DstRC = MRI.getRegClass(I.getOperand(0).getReg());
const TargetRegisterClass *Src0RC = MRI.getRegClass(I.getOperand(2).getReg());
const TargetRegisterClass *Src1RC = MRI.getRegClass(I.getOperand(4).getReg());
const TargetRegisterClass *Src2RC = MRI.getRegClass(I.getOperand(6).getReg());
Contributor

Ditto

const TargetRegisterClass *Src1RC = MRI.getRegClass(I.getOperand(4).getReg());
const TargetRegisterClass *Src2RC = MRI.getRegClass(I.getOperand(6).getReg());

bool IsVReg64 = (DstRC->getID() == AMDGPU::VReg_64_Align2RegClassID);
Contributor

Can you use TRI->getRegSizeInBits(DstRC) == 64?
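
I.e., something like this sketch (note that getRegSizeInBits takes the class by reference):

  // Size-based check instead of naming a specific register class:
  bool IsVReg64 = TRI->getRegSizeInBits(*DstRC) == 64;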


LIS->InsertMachineInstrInMaps(*Op0L_Op1L);

SrcSubIdx1 = TRI->composeSubRegIndices(SrcMO1.getSubReg(), AMDGPU::sub1);
Contributor

Extract common code into helper

Contributor

@jrbyrnes jrbyrnes left a comment

I think ideally this would be done in a way such that we could rerun preRA scheduling after this and perhaps undo the unpack if it has made things worse. This may reduce the reliance on having an accurate profitability model.

} else
TotalCyclesBetweenCandidates += 1;

if (!(TotalCyclesBetweenCandidates > NumMFMACycles))
Contributor

A tie between the cycle counts should go to the packed version, I think.
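
In other words, the acceptance check could hypothetically use a strict comparison so that an exact tie keeps the packed form (sketch):

  // Unpack only when the candidate clearly fits under the MFMA's latency;
  // on a tie, prefer keeping the packed instruction.
  if (TotalCyclesBetweenCandidates < NumMFMACycles)
    InstrsToUnpack.insert(&Instr);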

if (SIInstrInfo::isMFMA(MI)) {
const MCSchedClassDesc *SchedClassDesc =
SchedModel.resolveSchedClass(&MI);
NumMFMACycles =
Contributor

Should we reduce this by the issue latency of the MFMA?

Contributor

ping

MachineInstr &Instr = *I;
const MCSchedClassDesc *InstrSchedClassDesc =
SchedModel.resolveSchedClass(&Instr);
TotalCyclesBetweenCandidates +=
Contributor

This cycle count modelling is inaccurate and doesn't properly account for observed latency between instructions. If there are dependencies in the instructions between the MFMA and unpack instruction, we must observe the full latency of the dependee.

For example, if we have a load and a use of the load between the MFMA and the unpack instruction, we will incur the full latency of the load between the MFMA and the unpack candidate.

Another interesting case is if the unpack candidate uses some value from the MFMA. In this case, the unpack candidate must wait for the MFMA regardless.

Contributor

Improving the cycle count modelling is not something that we should attempt to handle -- I view it more as a downside to the implementation choice.

To do proper cycle count modelling we would basically need to do ad-hoc reconstructions of the schedule DAG for the dependency modelling.

If we do believe that such modelling is needed, we should probably just move this unpacking into scheduling somewhere, and use the scheduler's mechanisms to do latency profitability checks.

LIS->InsertMachineInstrInMaps(*HiInput0_MI);
}

SubRegID = 0;
Contributor

Possible to extract common code?

void GCNPreRAOptimizationsImpl::processF16Unpacking(MachineInstr &I,
uint16_t AvailableBudget) {
MachineBasicBlock &MBB = *I.getParent();
MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();
Contributor

MRI is local to the class.

LIS->RemoveMachineInstrFromMaps(I);
I.eraseFromParent();
LIS->removeInterval(DstReg);
LIS->createAndComputeVirtRegInterval(DstReg);
Contributor

Are you also [re]computing the virtRegInterval for the use regs? These may shift slightly due to SlotIndex numbering. Also, in the case where we are creating new regs to hold temporary results for the 16-bit unpack, we'll need LiveIntervals for all of those.
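
A sketch of what that might look like (hypothetical helper; assumes every virtual register referenced by the new pair should be refreshed):

  // Recompute liveness for every virtual register the unpacked pair touches,
  // not just the destination.
  auto RecomputeLiveness = [&](MachineInstr *MI) {
    for (MachineOperand &MO : MI->operands()) {
      if (!MO.isReg() || !MO.getReg().isVirtual())
        continue;
      Register R = MO.getReg();
      if (LIS->hasInterval(R))
        LIS->removeInterval(R);
      LIS->createAndComputeVirtRegInterval(R);
    }
  };
  RecomputeLiveness(Op0L_Op1L); // low half
  RecomputeLiveness(Op0H_Op1H); // high half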

return;
}

void GCNPreRAOptimizationsImpl::processF16Unpacking(MachineInstr &I,
Contributor

What is AvailableBudget needed for?

MaskLo = RegAndLatency.first; // mask for lower 16 bits
RegAndLatency = BuildImm(16);
ShiftAmt = RegAndLatency.first; // mask for higher 16 bits
IsF16MaskSet = true;
Contributor

@jrbyrnes jrbyrnes Aug 25, 2025

So this just sets MaskLo to the masked Lo bits of the first unpacked F16 instruction? What if the next F16 unpack candidate uses a different register?

Can you add a test where we unpack multiple F16 instructions and they have different source registers? You may need multiple MFMAs. Also a test where only 1 operand is reused.

return true;
}

bool GCNPreRAOptimizationsImpl::isUnpackingSupportedInstr(
Contributor

We should allow for unpacking with immediate operands
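
A hypothetical shape for that, assuming an inline constant simply feeds both unpacked halves (untested sketch; SrcMO, NewMI, and Sub are placeholder names):

  // Immediate (inline-constant) sources have no subregister to select from;
  // forward the same value into each unpacked half.
  if (SrcMO.isImm())
    NewMI.addImm(SrcMO.getImm());
  else
    NewMI.addReg(SrcMO.getReg(), 0,
                 TRI->composeSubRegIndices(SrcMO.getSubReg(), Sub));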

default:
return false;
}
}
Contributor

ping

default:
return std::numeric_limits<uint16_t>::max();
}
}
Contributor

ping

return;
}

void GCNPreRAOptimizationsImpl::processFMAF32Unpacking(MachineInstr &I) {
Contributor

Do we really need a separate function for this? Seems like we're mainly changing some bool immediates in createUnpackedMI

return;

MachineInstrBuilder Op0L_Op1L =
createUnpackedMI(MBB, I, DL, UnpackedOpcode, false, false);
Contributor

@jrbyrnes jrbyrnes Sep 3, 2025

createUnpackedMI(MBB, I, DL, UnpackedOpcode, /*IsHiBits=*/false, /*IsFMA=*/false);

}
}
NewMI.addImm(ClampVal); // clamp
// packed instructions do not support output modifiers. safe to assign them 0
Contributor

ditto

NewMI.addReg(SrcMO1.getReg(), 0, Src0SubIdx);
}

if (Src1_Mods & NegModifier) {
Contributor

It looks like there is a lot of common code between the src0/src1/src2 handling; is it possible to extract this into a function?
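
For instance, a small helper could append one src_modifiers/src pair, mirroring the existing op_sel/neg handling (hypothetical; name and exact placement assumed):

  // Append one src_modifiers operand plus the matching source half. The op_sel
  // bit for this half decides whether sub0 or sub1 of the packed 64-bit source
  // is used, and the corresponding neg bit is translated to a VOP3 NEG.
  void addUnpackedSrc(MachineInstrBuilder &MIB, unsigned PackedMods,
                      const MachineOperand &SrcMO, bool IsHiBits) const {
    unsigned Mods = 0;
    unsigned Sub = AMDGPU::sub0;
    if (PackedMods & (IsHiBits ? SISrcMods::OP_SEL_1 : SISrcMods::OP_SEL_0)) {
      Sub = AMDGPU::sub1;
      if (PackedMods & (IsHiBits ? SISrcMods::NEG_HI : SISrcMods::NEG))
        Mods |= SISrcMods::NEG;
    }
    MIB.addImm(Mods);
    MIB.addReg(SrcMO.getReg(), 0,
               TRI->composeSubRegIndices(SrcMO.getSubReg(), Sub));
  }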

return;
}

MachineOperand &DstMO = I.getOperand(0);
Contributor

We aren't using any of these variables in this function scope except to pass them into insertUnpackedF32MI -- can we just create those variables from the instruction at that scope instead?

if (SIInstrInfo::isMFMA(MI)) {
const MCSchedClassDesc *SchedClassDesc =
SchedModel.resolveSchedClass(&MI);
NumMFMACycles =
Contributor

ping

@jrbyrnes
Contributor

jrbyrnes commented Sep 3, 2025

The PostRA scheduler can freely move around any of these instructions (unpacked or packed). Using an issue-cycle model between the PreRA and PostRA scheduling passes therefore corresponds only loosely with the actual profitability: we may unpack some instructions that get moved much later, or we may not unpack an instruction that is moved directly after an MFMA. The latter is just a missed optimization, whereas the former degrades performance.

Doing it between scheduling passes would make some sense if our SchedModel knew about non-co-issue latency; however, it doesn't. Without that, I think it makes the most sense to have this handling after the PostRA scheduler. Otherwise, we should use a flag, because this may cause regressions.
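
If a flag is the interim answer, a hidden cl::opt would be the usual mechanism; a sketch (the flag name below is made up for illustration):

  #include "llvm/Support/CommandLine.h"

  // Hypothetical off-by-default switch so the unpacking can be enabled for
  // experiments without risking regressions in default builds.
  static cl::opt<bool> EnablePackedUnpack(
      "amdgpu-unpack-packed-f32-for-coissue", cl::Hidden, cl::init(false),
      cl::desc("Unpack V_PK_ADD_F32/V_PK_MUL_F32 near MFMAs for co-issue"));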

@jrbyrnes
Contributor

Can this be abandoned?

@akadutta akadutta closed this Sep 20, 2025
@akadutta
Contributor Author

The corresponding change was merged through #157968. That patch moves the optimization to after post-RA scheduling. Closing this PR, as it is an older iteration of #157968.
