[AMDGPU][Scheduler] Scoring system for rematerialization candidates #153092
base: main
Conversation
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-backend-amdgpu

Author: Lucas Ramirez (lucas-rami)

Changes

This is a significant refactoring of the scheduler's rematerialization stage, meant to improve rematerialization capabilities and lay strong foundations for future improvements. The core contribution is a scoring system to estimate the benefit of each rematerialization candidate. This score takes into account

The overall stage's flow is as follows. After setting its RP objective (to reach a specific occupancy, or to eliminate spilling), the stage collects all rematerialization candidates in the function. Among those candidates it identifies and rematerializes all those that are always beneficial to rematerialize, regardless of the stage's objective. Then it scores all other candidates and iteratively rematerializes them in decreasing score order until the RP objective is reached in all regions. As with the current implementation, all affected regions are then rescheduled, and rematerializations are conditionally rolled back if they end up not benefiting performance. Test churn in lightly affected tests is due to registers that are considered always beneficial to rematerialize now being rematerialized unconditionally. Many tests in

Patch is 449.27 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/153092.diff

32 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 254b75b784e75..7916e7acd36c1 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -28,10 +28,21 @@
#include "GCNRegPressure.h"
#include "SIMachineFunctionInfo.h"
#include "Utils/AMDGPUBaseInfo.h"
+#include "llvm/ADT/BitVector.h"
#include "llvm/ADT/STLExtras.h"
+#include "llvm/CodeGen/MachineBasicBlock.h"
+#include "llvm/CodeGen/MachineBlockFrequencyInfo.h"
+#include "llvm/CodeGen/MachineBranchProbabilityInfo.h"
#include "llvm/CodeGen/RegisterClassInfo.h"
#include "llvm/MC/LaneBitmask.h"
+#include "llvm/MC/MCInstrItineraries.h"
+#include "llvm/MC/MCSchedule.h"
+#include "llvm/MC/TargetRegistry.h"
#include "llvm/Support/ErrorHandling.h"
+#include "llvm/Support/raw_ostream.h"
+#include <deque>
+#include <limits>
+#include <string>
#define DEBUG_TYPE "machine-scheduler"
@@ -808,6 +819,8 @@ void GCNScheduleDAGMILive::schedule() {
GCNRegPressure
GCNScheduleDAGMILive::getRealRegPressure(unsigned RegionIdx) const {
+ if (Regions[RegionIdx].first == Regions[RegionIdx].second)
+ return llvm::getRegPressure(MRI, LiveIns[RegionIdx]);
GCNDownwardRPTracker RPTracker(*LIS);
RPTracker.advance(Regions[RegionIdx].first, Regions[RegionIdx].second,
&LiveIns[RegionIdx]);
@@ -1089,33 +1102,224 @@ bool ClusteredLowOccStage::initGCNSchedStage() {
#define REMAT_PREFIX "[PreRARemat] "
#define REMAT_DEBUG(X) LLVM_DEBUG(dbgs() << REMAT_PREFIX; X;)
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+void PreRARematStage::printTargetRegions(bool PrintAll) const {
+ if (PrintAll) {
+ for (auto [I, Target] : enumerate(RPTargets))
+ REMAT_DEBUG(dbgs() << " [" << I << "] " << Target << '\n');
+ return;
+ }
+ if (TargetRegions.none()) {
+ REMAT_DEBUG(dbgs() << "No target regions\n");
+ return;
+ }
+ REMAT_DEBUG(dbgs() << "Target regions:\n");
+ for (unsigned I : TargetRegions.set_bits())
+ REMAT_DEBUG(dbgs() << " [" << I << "] " << RPTargets[I] << '\n');
+}
+
+void PreRARematStage::RematReg::print(
+ const DenseMap<MachineInstr *, unsigned> &MIRegion) const {
+ REMAT_DEBUG(dbgs() << " [" << MIRegion.at(DefMI) << "] " << *DefMI);
+ REMAT_DEBUG(dbgs() << " -> used in [" << UseRegion << "] " << *UseMI);
+ const unsigned NumRegions = Live.size();
+ REMAT_DEBUG(dbgs() << " Guaranteed RP reduction in:");
+ for (unsigned I = 0; I < NumRegions; ++I) {
+ if (isBeneficialRegion(I))
+ dbgs() << " [" << I << "]";
+ }
+ dbgs() << '\n';
+ REMAT_DEBUG(dbgs() << " Possible RP reduction in:");
+ for (unsigned I = 0; I < NumRegions; ++I) {
+ if (isMaybeBeneficialRegion(I))
+ dbgs() << " [" << I << "]";
+ }
+ dbgs() << '\n';
+}
+
+#endif
+
bool PreRARematStage::initGCNSchedStage() {
// FIXME: This pass will invalidate cached BBLiveInMap and MBBLiveIns for
// regions inbetween the defs and region we sinked the def to. Will need to be
// fixed if there is another pass after this pass.
assert(!S.hasNextStage());
- if (!GCNSchedStage::initGCNSchedStage() || DAG.Regions.size() == 1)
+ if (!GCNSchedStage::initGCNSchedStage() || DAG.Regions.size() <= 1)
return false;
// Before performing any IR modification record the parent region of each MI
// and the parent MBB of each region.
const unsigned NumRegions = DAG.Regions.size();
- RegionBB.reserve(NumRegions);
for (unsigned I = 0; I < NumRegions; ++I) {
RegionBoundaries Region = DAG.Regions[I];
for (auto MI = Region.first; MI != Region.second; ++MI)
MIRegion.insert({&*MI, I});
- RegionBB.push_back(Region.first->getParent());
+ MachineBasicBlock *ParentMBB = Region.first->getParent();
+ if (Region.second != ParentMBB->end())
+ MIRegion.insert({&*Region.second, I});
+ RegionBB.push_back(ParentMBB);
+ }
+
+ setObjective();
+ REMAT_DEBUG({
+ dbgs() << "Analyzing ";
+ MF.getFunction().printAsOperand(dbgs(), false);
+ dbgs() << ": ";
+ if (TargetRegions.none()) {
+ dbgs() << "no objective to achieve, occupancy is maximal at "
+ << MFI.getMaxWavesPerEU() << '\n';
+ } else if (TargetOcc) {
+ dbgs() << "increase occupancy from " << *TargetOcc - 1 << '\n';
+ } else {
+ dbgs() << "reduce spilling (minimum target occupancy is "
+ << MFI.getMinWavesPerEU() << ")\n";
+ }
+ printTargetRegions(/*PrintAll=*/TargetRegions.none());
+ });
+
+ // Compute region frequencies. 0 encodes an unknown region frequency.
+ SmallVector<uint64_t> RegionFreq;
+ RegionFreq.reserve(NumRegions);
+ assert(DAG.MLI && "MLI not defined in DAG");
+ MachineBranchProbabilityInfo MBPI;
+ MachineBlockFrequencyInfo MBFI(MF, MBPI, *DAG.MLI);
+ uint64_t EntryFreq = MBFI.getEntryFreq().getFrequency();
+ if (EntryFreq) {
+ for (const MachineBasicBlock *MBB : RegionBB)
+ RegionFreq.push_back(MBFI.getBlockFreq(MBB).getFrequency() / EntryFreq);
+ } else {
+ RegionFreq.insert(RegionFreq.end(), RegionBB.size(), 0);
}
+ REMAT_DEBUG({
+ dbgs() << "Region frequencies:\n";
+ for (auto [I, Freq] : enumerate(RegionFreq)) {
+ dbgs() << REMAT_PREFIX << " [" << I << "] ";
+ if (Freq)
+ dbgs() << Freq;
+ else
+ dbgs() << "unknown ";
+ dbgs() << " | " << *DAG.Regions[I].first;
+ }
+ });
- if (!canIncreaseOccupancyOrReduceSpill())
+ if (!collectRematRegs(RegionFreq)) {
+ REMAT_DEBUG(dbgs() << "No rematerializable registers\n");
return false;
+ }
+
+ REMAT_DEBUG({
+ dbgs() << "Rematerializable registers:\n";
+ for (const RematReg &Remat : RematRegs)
+ Remat.print(MIRegion);
+ });
+
+ // Start by rematerializing always beneficial registers. These should never
+ // be rollbacked. All other rematerialization candidates get added to list of
+ // rematerializations that will be scored.
+ REMAT_DEBUG(dbgs() << "==== ALWAYS BENEFICIAL ====\n");
+ SmallVector<ScoredRemat> ScoredRemats;
+ BitVector RecomputeRP(NumRegions);
+ for (const RematReg &Remat : RematRegs) {
+ if (Remat.isAlwaysBeneficial()) {
+ REMAT_DEBUG(dbgs() << "[" << MIRegion[Remat.DefMI]
+ << "] REMAT (always) | " << *Remat.DefMI);
+ rematerialize(Remat, RecomputeRP);
+ } else {
+ ScoredRemats.emplace_back(&Remat, DAG.ST, *DAG.TII);
+ }
+ }
+ unsetSatisifedRPTargets(RescheduleRegions);
+
+#ifndef NDEBUG
+ printTargetRegions();
+ unsigned RoundNum = 0;
+#endif
+
+ // Rematerialize registers in successive rounds until all RP targets are
+ // satisifed or until we run out of rematerialization candidates.
+ while ((updateAndVerifyRPTargets(RecomputeRP) || TargetRegions.any()) &&
+ !ScoredRemats.empty()) {
+ // (Re-)Score and (re-)sort all remats in increasing score order.
+ for (ScoredRemat &Remat : ScoredRemats)
+ Remat.update(TargetRegions, RPTargets, RegionFreq, !TargetOcc);
+ stable_sort(ScoredRemats);
+
+ REMAT_DEBUG({
+ dbgs() << "==== ROUND " << RoundNum << " ====\n";
+ for (const ScoredRemat &SRemat : ScoredRemats) {
+ dbgs() << REMAT_PREFIX << "*" << SRemat.getScore() << "* | "
+ << *SRemat.Remat->DefMI;
+ }
+ });
+
+ RecomputeRP.reset();
+ int RematIdx = ScoredRemats.size() - 1;
+
+ // Rematerialize registers in decreasing score order until we estimate that
+ // all RP targets are satisfied or until rematerialization candidates are no
+ // longer useful to decrease RP.
+ for (; RematIdx >= 0 && TargetRegions.any(); --RematIdx) {
+ const RematReg &Remat = *ScoredRemats[RematIdx].Remat;
+ int Score = ScoredRemats[RematIdx].getScore();
+
+ // Stop when scores become negative. Since scores monotonically decrease
+ // as remats are performed, we know there is nothing useful left to do in
+ // such cases.
+ if (Score <= 0) {
+ REMAT_DEBUG(dbgs() << "Stop remats on non-positive score | "
+ << *Remat.DefMI);
+ RematIdx = -1;
+ break;
+ }
+
+ // When previous rematerializations in this round have already satisfied
+ // RP targets in all regions this rematerialization can impact, we have a
+ // good indication that our scores have diverged significantly from
+ // reality, in which case we interrupt this round and re-score. This also
+ // ensures that every rematerialization we perform is possibly impactful
+ // in at least one target region.
+ if (!Remat.intersectWithTarget(TargetRegions)) {
+ REMAT_DEBUG(dbgs() << "Stop round on stale score | " << *Remat.DefMI);
+ break;
+ }
+
+ REMAT_DEBUG(dbgs() << "[" << MIRegion[Remat.DefMI] << "] REMAT *" << Score
+ << "* | " << *Remat.DefMI);
+ MachineInstr *RematMI = rematerialize(Remat, RecomputeRP);
+ // Every rematerialization done with the objective of increasing occupancy
+ // increases latency. If we don't manage to increase occupancy, we want to
+ // roll them back.
+ if (TargetOcc)
+ Rollbackable.push_back({RematMI, &Remat});
+ unsetSatisifedRPTargets(Remat.Live);
+ }
+
+#ifndef NDEBUG
+ printTargetRegions();
+ ++RoundNum;
+#endif
+
+ // Peel off registers we already rematerialized from the vector's tail.
+ ScoredRemats.truncate(RematIdx + 1);
+ }
+ if (RescheduleRegions.none())
+ return false;
+
+ // Commit all pressure changes to the DAG and compute minimum achieved
+ // occupancy in impacted regions.
+ REMAT_DEBUG(dbgs() << "==== REMAT RESULTS ====\n");
+ unsigned DynamicVGPRBlockSize = MFI.getDynamicVGPRBlockSize();
+ AchievedOcc = MFI.getMaxWavesPerEU();
+ for (unsigned I : RescheduleRegions.set_bits()) {
+ const GCNRegPressure &RP = RPTargets[I].getCurrentRP();;
+ DAG.Pressure[I] = RP;
+ unsigned NewRegionOcc = RP.getOccupancy(ST, DynamicVGPRBlockSize);
+ AchievedOcc = std::min(AchievedOcc, NewRegionOcc);
+ REMAT_DEBUG(dbgs() << "[" << I << "] Achieved occupancy " << NewRegionOcc
+ << " (" << RPTargets[I] << ")\n");
+ }
- // Rematerialize identified instructions and update scheduler's state.
- rematerialize();
- if (GCNTrackers)
- DAG.RegionLiveOuts.buildLiveRegMap();
REMAT_DEBUG({
dbgs() << "Retrying function scheduling with new min. occupancy of "
<< AchievedOcc << " from rematerializing (original was "
@@ -1124,7 +1328,6 @@ bool PreRARematStage::initGCNSchedStage() {
dbgs() << ", target was " << *TargetOcc;
dbgs() << ")\n";
});
-
if (AchievedOcc > DAG.MinOccupancy) {
DAG.MinOccupancy = AchievedOcc;
SIMachineFunctionInfo &MFI = *MF.getInfo<SIMachineFunctionInfo>();
@@ -1151,6 +1354,10 @@ void UnclusteredHighRPStage::finalizeGCNSchedStage() {
}
bool GCNSchedStage::initGCNRegion() {
+ // Skip empty scheduling region.
+ if (DAG.begin() == DAG.end())
+ return false;
+
// Check whether this new region is also a new block.
if (DAG.RegionBegin->getParent() != CurrentMBB)
setupNewBlock();
@@ -1158,8 +1365,8 @@ bool GCNSchedStage::initGCNRegion() {
unsigned NumRegionInstrs = std::distance(DAG.begin(), DAG.end());
DAG.enterRegion(CurrentMBB, DAG.begin(), DAG.end(), NumRegionInstrs);
- // Skip empty scheduling regions (0 or 1 schedulable instructions).
- if (DAG.begin() == DAG.end() || DAG.begin() == std::prev(DAG.end()))
+ // Skip regions with 1 schedulable instruction.
+ if (DAG.begin() == std::prev(DAG.end()))
return false;
LLVM_DEBUG(dbgs() << "********** MI Scheduling **********\n");
@@ -1691,27 +1898,20 @@ bool PreRARematStage::allUsesAvailableAt(const MachineInstr *InstToRemat,
return true;
}
-bool PreRARematStage::canIncreaseOccupancyOrReduceSpill() {
+void PreRARematStage::setObjective() {
const Function &F = MF.getFunction();
- // Maps optimizable regions (i.e., regions at minimum and register-limited
- // occupancy, or regions with spilling) to the target RP we would like to
- // reach.
- DenseMap<unsigned, GCNRPTarget> OptRegions;
+ // Set up "spilling targets" for all regions.
unsigned MaxSGPRs = ST.getMaxNumSGPRs(F);
unsigned MaxVGPRs = ST.getMaxNumVGPRs(F);
- auto ResetTargetRegions = [&]() {
- OptRegions.clear();
- for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
- const GCNRegPressure &RP = DAG.Pressure[I];
- GCNRPTarget Target(MaxSGPRs, MaxVGPRs, MF, RP);
- if (!Target.satisfied())
- OptRegions.insert({I, Target});
- }
- };
+ for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
+ const GCNRegPressure &RP = DAG.Pressure[I];
+ GCNRPTarget &Target = RPTargets.emplace_back(MaxSGPRs, MaxVGPRs, MF, RP);
+ if (!Target.satisfied())
+ TargetRegions.set(I);
+ }
- ResetTargetRegions();
- if (!OptRegions.empty() || DAG.MinOccupancy >= MFI.getMaxWavesPerEU()) {
+ if (TargetRegions.any() || DAG.MinOccupancy >= MFI.getMaxWavesPerEU()) {
// In addition to register usage being above addressable limits, occupancy
// below the minimum is considered like "spilling" as well.
TargetOcc = std::nullopt;
@@ -1719,59 +1919,27 @@ bool PreRARematStage::canIncreaseOccupancyOrReduceSpill() {
// There is no spilling and room to improve occupancy; set up "increased
// occupancy targets" for all regions.
TargetOcc = DAG.MinOccupancy + 1;
- unsigned VGPRBlockSize =
- MF.getInfo<SIMachineFunctionInfo>()->getDynamicVGPRBlockSize();
+ const unsigned VGPRBlockSize = MFI.getDynamicVGPRBlockSize();
MaxSGPRs = ST.getMaxNumSGPRs(*TargetOcc, false);
MaxVGPRs = ST.getMaxNumVGPRs(*TargetOcc, VGPRBlockSize);
- ResetTargetRegions();
- }
- REMAT_DEBUG({
- dbgs() << "Analyzing ";
- MF.getFunction().printAsOperand(dbgs(), false);
- dbgs() << ": ";
- if (OptRegions.empty()) {
- dbgs() << "no objective to achieve, occupancy is maximal at "
- << MFI.getMaxWavesPerEU();
- } else if (!TargetOcc) {
- dbgs() << "reduce spilling (minimum target occupancy is "
- << MFI.getMinWavesPerEU() << ')';
- } else {
- dbgs() << "increase occupancy from " << DAG.MinOccupancy << " to "
- << TargetOcc;
- }
- dbgs() << '\n';
- for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
- if (auto OptIt = OptRegions.find(I); OptIt != OptRegions.end()) {
- dbgs() << REMAT_PREFIX << " [" << I << "] " << OptIt->getSecond()
- << '\n';
- }
+ for (auto [I, Target] : enumerate(RPTargets)) {
+ Target.setTarget(MaxSGPRs, MaxVGPRs);
+ if (!Target.satisfied())
+ TargetRegions.set(I);
}
- });
- if (OptRegions.empty())
- return false;
+ }
+}
- // Accounts for a reduction in RP in an optimizable region. Returns whether we
- // estimate that we have identified enough rematerialization opportunities to
- // achieve our goal, and sets Progress to true when this particular reduction
- // in pressure was helpful toward that goal.
- auto ReduceRPInRegion = [&](auto OptIt, Register Reg, LaneBitmask Mask,
- bool &Progress) -> bool {
- GCNRPTarget &Target = OptIt->getSecond();
- if (!Target.isSaveBeneficial(Reg))
- return false;
- Progress = true;
- Target.saveReg(Reg, Mask, DAG.MRI);
- if (Target.satisfied())
- OptRegions.erase(OptIt->getFirst());
- return OptRegions.empty();
- };
+bool PreRARematStage::collectRematRegs(ArrayRef<uint64_t> RegionFreq) {
+ assert(RegionFreq.size() == DAG.Regions.size());
// We need up-to-date live-out info. to query live-out register masks in
// regions containing rematerializable instructions.
DAG.RegionLiveOuts.buildLiveRegMap();
- // Cache set of registers that are going to be rematerialized.
- DenseSet<unsigned> RematRegs;
+ // Set of registers already marked for potential remterialization; used for
+ // remat chains checks.
+ DenseSet<Register> RematRegSet;
// Identify rematerializable instructions in the function.
for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
@@ -1782,30 +1950,34 @@ bool PreRARematStage::canIncreaseOccupancyOrReduceSpill() {
if (!isTriviallyReMaterializable(DefMI))
continue;
- // We only support rematerializing virtual registers with one definition.
+ // We only support rematerializing virtual registers with one
+ // definition.
Register Reg = DefMI.getOperand(0).getReg();
if (!Reg.isVirtual() || !DAG.MRI.hasOneDef(Reg))
continue;
// We only care to rematerialize the instruction if it has a single
- // non-debug user in a different region. The using MI may not belong to a
- // region if it is a lone region terminator.
+ // non-debug user in a different region.
+ // FIXME: Allow rematerializations with multiple uses. This should be
+ // relatively easy to support using the current cost model.
MachineInstr *UseMI = DAG.MRI.getOneNonDBGUser(Reg);
if (!UseMI)
continue;
auto UseRegion = MIRegion.find(UseMI);
- if (UseRegion != MIRegion.end() && UseRegion->second == I)
+ if (UseRegion == MIRegion.end() || UseRegion->second == I)
continue;
// Do not rematerialize an instruction if it uses or is used by an
// instruction that we have designated for rematerialization.
// FIXME: Allow for rematerialization chains: this requires 1. updating
- // remat points to account for uses that are rematerialized, and 2. either
- // rematerializing the candidates in careful ordering, or deferring the
- // MBB RP walk until the entire chain has been rematerialized.
- if (Rematerializations.contains(UseMI) ||
- llvm::any_of(DefMI.operands(), [&RematRegs](MachineOperand &MO) {
- return MO.isReg() && RematRegs.contains(MO.getReg());
+ // remat points to account for uses that are rematerialized, and 2.
+ // either rematerializing the candidates in careful ordering, or
+ // deferring the MBB RP walk until the entire chain has been
+ // rematerialized.
+ MachineOperand &UseFirstMO = UseMI->getOperand(0);
+ if ((UseFirstMO.isReg() && RematRegSet.contains(UseFirstMO.getReg())) ||
+ llvm::any_of(DefMI.operands(), [&RematRegSet](MachineOperand &MO) {
+ return MO.isReg() && RematRegSet.contains(MO.getReg());
}))
continue;
@@ -1817,106 +1989,146 @@ bool PreRARematStage::canIncreaseOccupancyOrReduceSpill() {
if (!allUsesAvailableAt(&DefMI, DefIdx, UseIdx))
continue;
- REMAT_DEBUG(dbgs() << "Region " << I << ": remat instruction " << DefMI);
- RematInstruction &Remat =
- Rematerializations.try_emplace(&DefMI, UseMI).first->second;
-
- bool RematUseful = false;
- if (auto It = OptRegions.find(I); It != OptRegions.end()) {
- // Optimistically consider that moving the instruction out of its
- // defining region will reduce RP in the latter; this assumes that
- // maximum RP in the region is reached somewhere between the defining
- // instruction and the end of the region.
- REMAT_DEBUG(dbgs() << " Defining region is optimizable\n");
- LaneBitmask Mask = DAG.RegionLiveOuts.getLiveRegsForRegionIdx(I)[Reg];
- if (ReduceRPInRegion(It, Reg, Mask, RematUseful))
- return true;
- }
-
- for (unsigned LIRegion = 0; LIRegion != E; ++LIRegion) {
- // We are only collecting regions in which the register is a live-in
- // (and may be live-through).
- auto It = DAG.LiveIns[LIRegion].find(Reg);
- if (It == DAG.LiveIns[LIRegion].end() || It->second.none())
- continue;
- Remat.LiveInRegions.insert(LIRegion);
-
- // Account for the reduction in RP due to the rematerialization in an
- // optimizable region in which the defined register is a live-in. This
- // is exact for live-through region but optimistic in the using region,
- // where RP is actually reduced only if maximum RP is reached somewhere
- // between the beginning of the region and the rematerializable
- // instruction's use.
- if (auto It = OptRegions.find(LIRegion); It != OptRegions.end()) {
- REMAT_DEBUG(dbgs() << " Live-in in region " << LIRegion << '\n');
- if (ReduceRPInRegion(It, Reg, DA...
[truncated]
✅ With the latest revision this PR passed the C/C++ code formatter.
| #include "llvm/Support/ErrorHandling.h" | ||
| #include "llvm/Support/raw_ostream.h" | ||
| #include <deque> |
where is this used?
Nowhere, thanks for the catch.
void PreRARematStage::printTargetRegions(bool PrintAll) const {
  if (PrintAll) {
    for (auto [I, Target] : enumerate(RPTargets))
      REMAT_DEBUG(dbgs() << "  [" << I << "] " << Target << '\n');
You only use this under REMAT_DEBUG uses, so can drop all the REMAT_DEBUGs here
PreRARematStage::printTargetRegions is called from some #ifndef NDEBUG blocks; unless I am mistaken, if I don't wrap the dbgs with REMAT_DEBUG these logs will show up even if I didn't ask for the scheduler's debug output.
Then wrap those dangling uses in the _DEBUG blocks, they don't need their own ifndef NDEBUGs
MachineOperand &UseFirstMO = UseMI->getOperand(0);
if ((UseFirstMO.isReg() && RematRegSet.contains(UseFirstMO.getReg())) ||
    llvm::any_of(DefMI.operands(), [&RematRegSet](MachineOperand &MO) {
      return MO.isReg() && RematRegSet.contains(MO.getReg());
This looks backwards? The UseMI is looking at the assumed single def, and DefMI is looking at all of its operands? Should one or both of these checks be verifying it's a use/def?
I think the implementation is right (though it is confusing, so I refactored it to make it clearer).
UseFirstMO.isReg() && RematRegSet.contains(UseFirstMO.getReg())

This ensures that UseMI is not itself an instruction we have already added to the rematerializable list. If the first operand is not a def reg then the instruction couldn't have been considered rematerializable, so RematRegSet won't contain the register.

llvm::any_of(DefMI.operands(), [&RematRegSet](MachineOperand &MO) {
  return MO.isReg() && RematRegSet.contains(MO.getReg())
})

This ensures that no operand of DefMI is defined by an instruction we have added to the rematerializable list. We know that the first operand will clear the test since we only consider registers with a single def as potentially rematerializable, so the defined register is guaranteed to not be in RematRegSet already.
// Store the register's lane bitmask.
unsigned SubReg = DefMI->getOperand(0).getSubReg();
Mask = SubReg ? DAG.TRI->getSubRegIndexLaneMask(SubReg)
Is using LaneBitmask::getAll good enough here?
I can in fact simplify this whole thing to getSubRegIndexLaneMask(SubReg) since it can handle the 0 case, thanks for the catch.
This actually doesn't work since the lane mask returned when subreg is 0 is ~0U, yielding a very large/incorrect number when calling getNumCoveredRegs on it. By sheer luck this did not break tests on a07fef2 but only on 8c142c8, which changes rematerialization priorities in some tests when multiple rematerialization candidates have the same score. I rolled back the change in 70cd6f6.
getNumCoveredRegs is a broken API. You're not allowed to rely on lanemask values this way. You're only allowed to check for overlaps. A LaneMask on its own doesn't mean anything, you need to know the associated register class.
We probably should have a tablegenerated version of this function, or find a different way to get a register count at the use point
Thanks for the context, this is only used for the scoring heuristic and for the sake of time I just resorted to using the register class to get the register size, without accounting for a potential subregister. Left a FIXME for future reference.
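For illustration, here is a minimal sketch of that register-class-based size estimate, consistent with the divideCeil(getRegSizeInBits(RC), 32) snippet quoted later in this thread; the helper name and surrounding context are illustrative and not the PR's actual code.

// Hedged sketch: estimate how many 32-bit registers a rematerializable value
// occupies from its register class, ignoring any subregister index on the
// definition (the FIXME mentioned above). "estimateNumRegs" is an assumed name.
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/CodeGen/TargetRegisterInfo.h"
#include "llvm/Support/MathExtras.h"

static unsigned estimateNumRegs(const llvm::TargetRegisterInfo &TRI,
                                const llvm::MachineRegisterInfo &MRI,
                                llvm::Register Reg) {
  const llvm::TargetRegisterClass *RC = MRI.getRegClass(Reg);
  // Count 32-bit units in the class's register size.
  return llvm::divideCeil(TRI.getRegSizeInBits(*RC), 32);
}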
ping
Force-pushed 70cd6f6 to 3d7c021
// Skip regions with 1 schedulable instruction.
if (DAG.begin() == std::prev(DAG.end()))
  return false;
Can this be pulled up right after the empty region check?
Unfortunately no, ScheduleDAGMILive::enterRegion expects to be called even for single-instruction regions to maintain some internal state.
DenseSet<unsigned> RematRegs;
// Set of registers already marked for potential remterialization; used for
// remat chains checks.
DenseSet<Register> RematRegSet;
Suggested change:
-  DenseSet<Register> RematRegSet;
+  SmallSet<Register, 4> RematRegSet;
Maybe we should be tracking this on RegSubRegPair. Eventually we really need to improve rematerialization for subregisters.
I don't think we need to track subregisters at this point since we only consider registers with one def as rematerializable, so a register that would be partially defined more than once would not be considered.
}

-  if (!canIncreaseOccupancyOrReduceSpill())
+  setObjective();
I think it's possible to not have any remat target regions -- should we return in that case?
Yes, it is possible; however, I think we still want to do all the "always beneficial" remats in those cases as they reduce live ranges and may reduce latency as well by moving instructions to lower-frequency regions.
I'm not sure we should have the "always beneficial" feature. I think the scheduling rematerialization should be used only to reduce spilling or increase occupancy in accordance with the scheduler's goals. The "always beneficial" feature seems to be moving away from that, and supporting it will likely cause complexity in the heuristics.
I wonder why these always beneficial remats were not captured by the MachineSink pass?
I removed the feature from this PR, making this contribution a little shorter/simpler which is good. I still think this is a useful feature to have but I get your point that it doesn't really belong in the scheduler's remat phase. I also don't know why MachineSink is not taking care of those, it's something I want to look into in the future.
Register Reg = Remat->DefMI->getOperand(0).getReg();
unsigned RPScore = 0;
for (unsigned I : TargetRegions.set_bits()) {
  unsigned Freq = std::max(RegionFreq[I], static_cast<uint64_t>(1));
Checking the frequency of the affected region only makes sense if we're going to reduce spilling.
For the occupancy case, this info is basically irrelevant.
I think it can still be helpful to consider region frequency in the occupancy case to e.g. prioritize rematerializing in lower frequency regions to minimize the latency penalty.
GCNRPTracker::LiveRegSet &RegionLiveIns = DAG.LiveIns[I];
// The estimated RP reduction is directly proportional to the size of the
// rematerializable register.
setRPScore(RPScore * SIRegisterInfo::getNumCoveredRegs(Remat->Mask));
For both the occupancy and spilling cases, I would think we would want to prioritize remats into less hot blocks.
The WeightRP and WeightRPMaybe weights are meant to encourage that by giving less weight to regions in which the register is actually used, so between two registers live in the same number of target regions with the same total frequency, the register which is used in the lower frequency region will have higher score i.e. will be prioritized.
The relative value of these two weights can be changed to further prioritize rematerializing into less hot blocks. Alternatively, the frequency of the region being rematerialized to can be made into a separate scoring criteria that takes precedence over the number of target regions in which the register is live. I don't know which option will be better in most cases. What do you think?
IMO, the scoring model should take into account the following (perhaps in order of priority):
1. maximal frequency of affected regions (if spilling)
2. delta between def / remat frequency
3. the number of target regions affected
4. reg / lanemask width.
Maybe 3 and 4 can be combined into a single metric.
For the occupancy case, the frequency of the intermediate target blocks doesn't matter -- we remove latency from the original def region, and we add latency to the remat region. We shouldn't care about the intermediate frequency as doing a remat has no impact on those regions -- if we do enough remats we will increase occupancy which is function wide.
For the spilling case, the frequency of the intermediate target blocks does matter. If our RP is above the threshold, then we will spill somewhere, and we would prefer not to spill in the hottest blocks. Thus, we should prioritize reducing RP in the highest frequency regions first.
> For the occupancy case, the frequency of the intermediate target blocks doesn't matter -- we remove latency from the original def region, and we add latency to the remat region. We shouldn't care about the intermediate frequency as doing a remat has no impact on those regions -- if we do enough remats we will increase occupancy which is function wide.
> For the spilling case, the frequency of the intermediate target blocks does matter. If our RP is above the threshold, then we will spill somewhere, we would prefer not to spill in the hottest blocks. Thus, we should prioritize reducing RP in the highest frequency regions first.
Thanks a lot for the explanation, that makes sense to me now. Updated the scoring method to reflect this.
> IMO, the scoring model should take into account the following (perhaps in order of priority):
> - maximal frequency of affected regions (if spilling)
> - delta between def / remat frequency
> - the number of target regions affected
> - reg / lanemask width.
With the above remark in mind those make more sense than what I had indeed. Implemented exactly those scoring components with those priorities (with merged 3/4).
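To make the agreed-upon ordering concrete, here is a hedged sketch of a comparator encoding those priorities; the type and field names (RematScoreSketch, MaxTargetFreq, FreqGain, RegionImpact) are illustrative and are not the PR's actual ScoredRemat members.

// Hedged sketch of a candidate ordering following the priority list above:
// (1) frequency of the hottest affected region when reducing spilling,
// (2) defining-region minus remat-region frequency (prefer moving work into
//     colder blocks), (3) number of affected target regions weighted by
//     register width. All names are illustrative.
#include <cstdint>
#include <tuple>

struct RematScoreSketch {
  uint64_t MaxTargetFreq = 0; // Only meaningful when the objective is spilling.
  int64_t FreqGain = 0;       // Defining-region frequency minus remat-region frequency.
  uint64_t RegionImpact = 0;  // Affected target regions times register width.

  // Smaller compares first, so sorting ascending leaves the best candidate at
  // the tail, matching the "rematerialize from the vector's tail" loop.
  bool operator<(const RematScoreSketch &O) const {
    return std::tie(MaxTargetFreq, FreqGain, RegionImpact) <
           std::tie(O.MaxTargetFreq, O.FreqGain, O.RegionImpact);
  }
};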
// most cases cheaper than spilling. We still give a bonus to remats for
// which we are able to do the calculation.
if (InstrLatencyGain && *InstrLatencyGain < 0) {
  int SpillLatencyGain = SaveCost * Remat->DefFrequency;
This spill cost model seems a bit flaky -- can we get away with just checking the def vs remat frequency and the latency?
I have moved the calculation to ScoredRemat::getLatencyGain but it is the same idea i.e. I'm comparing the latency difference induced by the rematerializable MI being moved to a different region to the additional latency that would come from saving/restoring to/from stack. The former should be smaller than the latter in almost all cases. Does that make sense?
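As a rough illustration of the comparison described here (the check was later dropped from the PR, see below), a hedged sketch with purely illustrative names and abstract cost units:

// Hedged sketch, not the PR's code: remat is worthwhile if the latency added
// by executing the defining instruction in the (possibly hotter) use region is
// smaller than the latency a stack save/restore pair would add instead.
#include <cstdint>

static bool rematLatencyBeatsSpill(int64_t RematLatency, int64_t SaveCost,
                                   int64_t RestoreCost, int64_t DefFreq,
                                   int64_t UseFreq) {
  // Latency change from moving the defining instruction to the use region.
  const int64_t AddedLatency = (UseFreq - DefFreq) * RematLatency;
  // Latency the spilling alternative would add: a save near the def plus a
  // restore near the use.
  const int64_t SpillLatency = DefFreq * SaveCost + UseFreq * RestoreCost;
  return AddedLatency < SpillLatency;
}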
Force-pushed e979f80 to 007d9b9
-  ResetTargetRegions();
-  if (!OptRegions.empty() || DAG.MinOccupancy >= MFI.getMaxWavesPerEU()) {
+  if (TargetRegions.any() || DAG.MinOccupancy >= MFI.getMaxWavesPerEU()) {
Does this take into account the LDS & workgroup size?
Yes, MFI.getMaxWavesPerEU comes from getWavesPerEU which takes into account those things for the maximum achievable occupancy.
// This save is exact in beneficial regions but optimistic in all other
// regions where the register is live.
RPTargets[I].saveReg(Reg, Remat.Mask, DAG.MRI);
DAG.LiveIns[I].erase(Reg);
Do we need to check if the remat fully covers the LiveIn mask? Maybe due to the one-def check this isn't needed?
I think this isn't needed because of the one-def check indeed.
// regions where the register is live.
RPTargets[I].saveReg(Reg, Remat.Mask, DAG.MRI);
DAG.LiveIns[I].erase(Reg);
DAG.RegionLiveOuts.getLiveRegsForRegionIdx(I).erase(Reg);
Ditto
Ditto as well
}

void GCNScheduleDAGMILive::deleteMI(unsigned RegionIdx, MachineInstr *MI) {
  updateRegionBoundaries(Regions[RegionIdx], MI, nullptr);
I think the updateRegionBoundaries is broken -- in the case where we are removing the only instruction from the region.
Probably the best way to fix is it to update the MIRegion for each remat, and check this when updating the boundaries.
I ended up deleting updateRegionBoundaries because I was the only user and it was a somewhat long function that can be replaced with a couple simple lines at both use points. However I don't think I changed the implementation by doing that, and I am not seeing an issue when deleting the only instruction from the region in deleteMI.
If MI is the only instruction in the region we have Region.second == std::next(MachineBasicBlock::iterator(MI)) so after the function we will have Region.first == Region.second. test_rollback_remats_emptydefregion in machine-scheduler-sink-trivial-remats.mir tests that we can completely empty a region (bb.1) and then rollback instructions to it again and it seems to work without issue.
> test_rollback_remats_emptydefregion in machine-scheduler-sink-trivial-remats.mir tests that we can completely empty a region (bb.1) and then rollback instructions to it again and it seems to work without issue.
Can you add a test where we do this in the case of multiple scheduling regions in a block? We have multiple sched regions in a block, but, through remat, we empty the first one? You can use sched.barrier(i32 0) to create sched regions. My concern is that we may run into issues if Region.second != MBB.end()
Added test_rollback_remats_emptydefregion_barrier (renamed the existing test to test_rollback_remats_emptydefregion_block). It seems to be working fine (and I checked the logs to make sure it is actually rematerializing and then rolling back remats). Let me know if there is still a concern.
  return divideCeil(DAG.TRI->getRegSizeInBits(RC), 32);
}

unsigned PreRARematStage::ScoredRemat::getLatencyGain(
I think we might want to consider just removing this latency spill cost check as it just adds more code to support. For all current users, the remat latency will be less than the spill latency.
Maybe we can consider adding this if we remat more expensive instructions / as needed.
Makes sense, removed it.
Force-pushed 874692b to d452fa9
    const GCNScheduleDAGMILive &DAG)
    : Remat(Remat), NumRegs(getNumRegs(DAG)), FreqDiff(getFreqDiff(Freq)) {}

unsigned PreRARematStage::ScoredRemat::getNumRegs(
Can we use the RematReg::Mask ?
Matt mentioned here that I cannot rely on the mask + SIRegisterInfo::getNumCoveredRegs to reliably derive the number of registers so I am now using the RC. However I don't know of a way to get the number of registers covered by a subregister from the RC + SubIdx.
getNumChannelsFromSubReg?
Oh, that's implemented with getNumCoveredRegs
But size of subregister class should work too
f41fddc passed functional internal testing.
-  ResetTargetRegions();
-  if (!OptRegions.empty() || DAG.MinOccupancy >= MFI.getMaxWavesPerEU()) {
+  if (TargetRegions.any() || DAG.MinOccupancy >= MFI.getMaxWavesPerEU()) {
Can we use/hoist the MFI.getMaxWavesPerEU() check as part of an early exit check now that we no longer do always profitable remats?
It makes sense that we should be able to do that but after trying to move this to the beginning of PreRARematStage::initGCNSchedStage it can have an impact when the "amdgpu-num-sgpr" or "amdgpu-num-vgpr" attributes are set with values lower than the maximums we would have gotten by default. If the condition is met we will ignore these harder to achieve RP targets, whereas setObjective takes those attributes into account through getMaxNumSGPRs/getMaxNumVGPRs. The effect can be seen on the small_num_sgprs_as_spill/small_num_vgprs_as_spill tests in machine-scheduler-sink-trivial-remats-attr.mir.
I am unsure whether we should care about honoring these attributes if we are already at maximum achievable occupancy though.
Ah - makes sense. I think we should continue honoring these attributes and exit in the way we do. Other scheduling heuristics take this into account, and I can see how the attribute would be useful for dev / testing purposes.
I think we can still early exit on maxWavesPerEU if the numVGPRs and numSGPRs attributes aren't specified.
| dbgs() << "reduce spilling (minimum target occupancy is " | ||
| << MFI.getMinWavesPerEU() << ")\n"; | ||
| } | ||
| printTargetRegions(/*PrintAll=*/TargetRegions.none()); |
I think TargetRegions.none() will always be false here -- since setObjective would have returned false above?
Indeed, this also allows me to remove the function argument altogether, thanks.
// The difference between defining and using frequency is in the range
// [-MaxDiff, MaxDiff], shift it to [0, 2 x MaxDiff] to stay in the positive
// range, then rescale to the representable range in the final score.
const uint64_t FreqDiff = (MaxDiff + (DefOrOne - UseOrMax));
Can you add UnitTests or provide some more info about how this FreqDiff scaling is supposed to work -- it's not very clear just reading through it.
Alternatively, I think you could simplify the ScoredRemat::score handling to use a struct or something instead of the complex bitpack algorithm, which I think would allow us to use a more direct calculation for the relative frequency. I think this will also help against information loss.
I removed the bitpacking algorithm and simply used three class members. It does simplify the logic a bunch at the cost of extra data and extra comparison/sorting time.
It may not matter much at this point since we typically don't have a lot of rematerialization candidates due to the existing limitations, but with support for remat with multiple uses and chains the number of candidates is much larger, so these things may become more important later, and I can always re-introduce that then.
The point of the now removed calculation around FreqDiff was to rescale the range of possible differences between defining and using region to the range of representable values in the bitpacked score (as a function of the number of bits allocated to that scoring component). It mapped the largest possible penalty to latency (DefFreq < UseFreq) to 0 and the smallest possible penalty to latency (DefFreq > UseFreq) to 0xFF..FF, with a linear (up to integer rounding) interpolation of the values in between. favor_lower_frequency in machine-scheduler-rematerialization-scoring.mir shows that we will prefer to rematerialize registers into regions of lower frequency, all other things being equal.
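For context, a hedged sketch of that since-removed rescaling; names (ScoreBits, DefFreq, UseFreq, MaxDiff) are assumed, and frequencies are assumed to have been normalized beforehand so the multiplication cannot overflow.

// Hedged sketch, not the PR's exact code: shift the signed difference between
// defining and using region frequencies from [-MaxDiff, MaxDiff] into
// [0, 2 * MaxDiff], then linearly map it onto the bits reserved for this
// scoring component (worst latency penalty -> 0, best case -> max score).
#include <cstdint>

static uint64_t rescaleFreqDiff(uint64_t DefFreq, uint64_t UseFreq,
                                uint64_t MaxDiff, unsigned ScoreBits = 16) {
  const uint64_t MaxScore = (1ULL << ScoreBits) - 1;
  if (MaxDiff == 0)
    return MaxScore; // All frequencies equal: no latency penalty either way.
  const uint64_t Shifted = MaxDiff + DefFreq - UseFreq; // In [0, 2 * MaxDiff].
  return Shifted * MaxScore / (2 * MaxDiff); // Integer rounding; see discussion.
}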
// Peel off registers we already rematerialized from the vector's tail.
ScoredRemats.truncate(RematIdx + 1);
} while ((updateAndVerifyRPTargets(RecomputeRP) || TargetRegions.any()) &&
Is it possible to hit an infinite loop? If we have a ScoredRemat that is no longer beneficial, but non-empty TargetRegions?
I don't think so. If a remat is no longer beneficial while there are still target regions we will check whether any region was incorrectly unmarked as a target due to our heuristics, then re-score and start a new round. Then, either there is at least one remat which is still useful (the same one or another) and we will be able to make progress, or the first remat we check will have null score and we will stop rematerializing altogether.
if (!Remat->maybeBeneficial(TargetRegions, RPTargets))
  return;
This was redundant with the loop below, which will leave both MaxFreq and RegionImpact at 0 if the remat is not beneficial anyway.
if (RegionImpact != O.RegionImpact)
  return RegionImpact < O.RegionImpact;
// Break ties using pointer to rematerializable register.
return Remat > O.Remat;
Should it be Remat < O.Remat?
I made the comparison in this direction because, all other things being equal, it will favor rematerializing registers with slightly longer live ranges in their defining regions.
Registers are collected in instruction order within each region so registers defined "early" in a region have lower addresses than those defined "late" in the same region. The former also have longer live-ranges in the region since they start earlier and extend to the live-outs. If the defining region is a region with excess RP this can lead to more optimal rematerializations overall in very specific cases (our remat unit tests actually hit that very specific case quite often due to their regular/artificial nature).
TII->reMaterialize(*InsertPos->getParent(), InsertPos, NewReg, 0, DefMI,
                   *DAG.TRI);
MachineInstr *RematMI = &*std::prev(InsertPos);
Remat.UseMI->substituteRegister(Reg, NewReg, 0, *DAG.TRI);
Do we need to specify SubIdx ? If the use is using a subreg?
substituteRegister combines the existing subindex of operands using Reg with any subindex given in the third argument. Setting the latter to 0 preserves whatever subindex Reg already has.
LGTM aside from outstanding comments -- please wait for Matt to take a look.
- Use SchedModel instead of instruction itinerary.
- Don't use getNumConvertedRegs to get number of regs, use RC instead.
- Clarify some comments.
- Other minor changes.
- Change components of scoring system.
- Remove rematerialization of always beneficial registers.
- Remove latency calculation in scoring.
- Other small changes.
- Add some explanatory comments for the calculations.
- Normalize frequencies to minimum to avoid overflows.
- Create helper struct to hold all frequency information (helpful for further improvements).
In cases where the (normalized) maximum region frequency is lower than the number of different frequency scores representable (2^16 currently), the current method of computing the scaling factor shrinks the actually achievable range of frequency scores available (up to a single achievable score in the most extreme cases). This is due to integer rounding when dividing frequencies to fit within the representable range. When the maximum frequency is smaller than the maximum score achievable, use a different rescaling calculation to avoid the expressivity loss.
- Removed some debug-only functions from the header and put them as close as possible to where they are needed.
- Split bitpacked score into separate components compared 1-to-1.
Force-pushed fd86daf to 0e1a6d8
This is a significant refactoring of the scheduler's rematerialization stage meant to improve rematerialization capabilities and lay strong foundations for future improvements.
The core contribution is a scoring system to estimate the benefit of each rematerialization candidate. This score takes into account
The overall stage's flow is as follows. After setting its RP objective (to reach a specific occupancy, or to eliminate spilling), the stage collects all rematerialization candidates in the function. Among those candidates it identifies and rematerializes all those that are always beneficial to rematerialize regardless of the stage's objective. Then, it scores all other candidates and iteratively rematerializes them in decreasing score order until it reaches the RP objective in all regions. As with the current implementation, all affected regions are then rescheduled, and rematerializations conditionally rolled back if they end up not benefiting performance.
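A rough, hedged sketch of that flow as a self-contained skeleton; every name below is an illustrative stand-in, not the PR's actual classes or functions, and the loop structure mirrors what is visible in the patch.

// Illustrative skeleton only: score candidates each round, rematerialize from
// the tail in decreasing score order, stop on non-positive scores, and repeat
// until RP targets are met or no candidates remain.
#include <algorithm>
#include <vector>

struct CandidateSketch {
  int Score = 0; // Recomputed every round; <= 0 means "no longer useful".
};

struct RematStageSketch {
  std::vector<CandidateSketch> Candidates;
  bool TargetsRemaining = true; // Stand-in for TargetRegions.any().

  void rescore() { /* Recompute each candidate's score against RP targets. */ }
  void rematerialize(const CandidateSketch &) { /* Perform one remat. */ }

  void run() {
    while (TargetsRemaining && !Candidates.empty()) {
      rescore();
      // Sort in increasing score order so the best candidates sit at the tail.
      std::stable_sort(Candidates.begin(), Candidates.end(),
                       [](const CandidateSketch &A, const CandidateSketch &B) {
                         return A.Score < B.Score;
                       });
      int Idx = static_cast<int>(Candidates.size()) - 1;
      for (; Idx >= 0 && TargetsRemaining; --Idx) {
        if (Candidates[Idx].Score <= 0) {
          Idx = -1; // Scores only decrease, so nothing useful is left.
          break;
        }
        rematerialize(Candidates[Idx]);
      }
      // Peel off the candidates already rematerialized from the vector's tail.
      Candidates.resize(Idx + 1);
    }
  }
};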
New tests in machine-scheduler-rematerialization-scoring.mir showcase how the scoring system dictates which rematerializations are the most beneficial and therefore performed first.