[AMDGPU][Scheduler] Scoring system for rematerialization candidates #153092
base: main
Conversation
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-backend-amdgpu

Author: Lucas Ramirez (lucas-rami)

Changes

This is a significant refactoring of the scheduler's rematerialization stage, meant to improve rematerialization capabilities and lay strong foundations for future improvements. The core contribution is a scoring system to estimate the benefit of each rematerialization candidate. This score takes into account

The overall stage's flow is as follows. After setting its RP objective (to reach a specific occupancy, or to eliminate spilling), the stage collects all rematerialization candidates in the function. Among those candidates it identifies and rematerializes all those that are always beneficial to rematerialize, regardless of the stage's objective. Then it scores all other candidates and iteratively rematerializes them in decreasing score order until the RP objective is reached in all regions. As with the current implementation, all affected regions are then rescheduled, and rematerializations are conditionally rolled back if they end up not benefiting performance. Test churn in lightly affected tests is due to registers that are considered always beneficial to rematerialize now being rematerialized unconditionally. Many tests in

Patch is 449.27 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/153092.diff

32 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 254b75b784e75..7916e7acd36c1 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -28,10 +28,21 @@
#include "GCNRegPressure.h"
#include "SIMachineFunctionInfo.h"
#include "Utils/AMDGPUBaseInfo.h"
+#include "llvm/ADT/BitVector.h"
#include "llvm/ADT/STLExtras.h"
+#include "llvm/CodeGen/MachineBasicBlock.h"
+#include "llvm/CodeGen/MachineBlockFrequencyInfo.h"
+#include "llvm/CodeGen/MachineBranchProbabilityInfo.h"
#include "llvm/CodeGen/RegisterClassInfo.h"
#include "llvm/MC/LaneBitmask.h"
+#include "llvm/MC/MCInstrItineraries.h"
+#include "llvm/MC/MCSchedule.h"
+#include "llvm/MC/TargetRegistry.h"
#include "llvm/Support/ErrorHandling.h"
+#include "llvm/Support/raw_ostream.h"
+#include <deque>
+#include <limits>
+#include <string>
#define DEBUG_TYPE "machine-scheduler"
@@ -808,6 +819,8 @@ void GCNScheduleDAGMILive::schedule() {
GCNRegPressure
GCNScheduleDAGMILive::getRealRegPressure(unsigned RegionIdx) const {
+ if (Regions[RegionIdx].first == Regions[RegionIdx].second)
+ return llvm::getRegPressure(MRI, LiveIns[RegionIdx]);
GCNDownwardRPTracker RPTracker(*LIS);
RPTracker.advance(Regions[RegionIdx].first, Regions[RegionIdx].second,
&LiveIns[RegionIdx]);
@@ -1089,33 +1102,224 @@ bool ClusteredLowOccStage::initGCNSchedStage() {
#define REMAT_PREFIX "[PreRARemat] "
#define REMAT_DEBUG(X) LLVM_DEBUG(dbgs() << REMAT_PREFIX; X;)
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+void PreRARematStage::printTargetRegions(bool PrintAll) const {
+ if (PrintAll) {
+ for (auto [I, Target] : enumerate(RPTargets))
+ REMAT_DEBUG(dbgs() << " [" << I << "] " << Target << '\n');
+ return;
+ }
+ if (TargetRegions.none()) {
+ REMAT_DEBUG(dbgs() << "No target regions\n");
+ return;
+ }
+ REMAT_DEBUG(dbgs() << "Target regions:\n");
+ for (unsigned I : TargetRegions.set_bits())
+ REMAT_DEBUG(dbgs() << " [" << I << "] " << RPTargets[I] << '\n');
+}
+
+void PreRARematStage::RematReg::print(
+ const DenseMap<MachineInstr *, unsigned> &MIRegion) const {
+ REMAT_DEBUG(dbgs() << " [" << MIRegion.at(DefMI) << "] " << *DefMI);
+ REMAT_DEBUG(dbgs() << " -> used in [" << UseRegion << "] " << *UseMI);
+ const unsigned NumRegions = Live.size();
+ REMAT_DEBUG(dbgs() << " Guaranteed RP reduction in:");
+ for (unsigned I = 0; I < NumRegions; ++I) {
+ if (isBeneficialRegion(I))
+ dbgs() << " [" << I << "]";
+ }
+ dbgs() << '\n';
+ REMAT_DEBUG(dbgs() << " Possible RP reduction in:");
+ for (unsigned I = 0; I < NumRegions; ++I) {
+ if (isMaybeBeneficialRegion(I))
+ dbgs() << " [" << I << "]";
+ }
+ dbgs() << '\n';
+}
+
+#endif
+
bool PreRARematStage::initGCNSchedStage() {
// FIXME: This pass will invalidate cached BBLiveInMap and MBBLiveIns for
// regions inbetween the defs and region we sinked the def to. Will need to be
// fixed if there is another pass after this pass.
assert(!S.hasNextStage());
- if (!GCNSchedStage::initGCNSchedStage() || DAG.Regions.size() == 1)
+ if (!GCNSchedStage::initGCNSchedStage() || DAG.Regions.size() <= 1)
return false;
// Before performing any IR modification record the parent region of each MI
// and the parent MBB of each region.
const unsigned NumRegions = DAG.Regions.size();
- RegionBB.reserve(NumRegions);
for (unsigned I = 0; I < NumRegions; ++I) {
RegionBoundaries Region = DAG.Regions[I];
for (auto MI = Region.first; MI != Region.second; ++MI)
MIRegion.insert({&*MI, I});
- RegionBB.push_back(Region.first->getParent());
+ MachineBasicBlock *ParentMBB = Region.first->getParent();
+ if (Region.second != ParentMBB->end())
+ MIRegion.insert({&*Region.second, I});
+ RegionBB.push_back(ParentMBB);
+ }
+
+ setObjective();
+ REMAT_DEBUG({
+ dbgs() << "Analyzing ";
+ MF.getFunction().printAsOperand(dbgs(), false);
+ dbgs() << ": ";
+ if (TargetRegions.none()) {
+ dbgs() << "no objective to achieve, occupancy is maximal at "
+ << MFI.getMaxWavesPerEU() << '\n';
+ } else if (TargetOcc) {
+ dbgs() << "increase occupancy from " << *TargetOcc - 1 << '\n';
+ } else {
+ dbgs() << "reduce spilling (minimum target occupancy is "
+ << MFI.getMinWavesPerEU() << ")\n";
+ }
+ printTargetRegions(/*PrintAll=*/TargetRegions.none());
+ });
+
+ // Compute region frequencies. 0 encodes an unknown region frequency.
+ SmallVector<uint64_t> RegionFreq;
+ RegionFreq.reserve(NumRegions);
+ assert(DAG.MLI && "MLI not defined in DAG");
+ MachineBranchProbabilityInfo MBPI;
+ MachineBlockFrequencyInfo MBFI(MF, MBPI, *DAG.MLI);
+ uint64_t EntryFreq = MBFI.getEntryFreq().getFrequency();
+ if (EntryFreq) {
+ for (const MachineBasicBlock *MBB : RegionBB)
+ RegionFreq.push_back(MBFI.getBlockFreq(MBB).getFrequency() / EntryFreq);
+ } else {
+ RegionFreq.insert(RegionFreq.end(), RegionBB.size(), 0);
}
+ REMAT_DEBUG({
+ dbgs() << "Region frequencies:\n";
+ for (auto [I, Freq] : enumerate(RegionFreq)) {
+ dbgs() << REMAT_PREFIX << " [" << I << "] ";
+ if (Freq)
+ dbgs() << Freq;
+ else
+ dbgs() << "unknown ";
+ dbgs() << " | " << *DAG.Regions[I].first;
+ }
+ });
- if (!canIncreaseOccupancyOrReduceSpill())
+ if (!collectRematRegs(RegionFreq)) {
+ REMAT_DEBUG(dbgs() << "No rematerializable registers\n");
return false;
+ }
+
+ REMAT_DEBUG({
+ dbgs() << "Rematerializable registers:\n";
+ for (const RematReg &Remat : RematRegs)
+ Remat.print(MIRegion);
+ });
+
+ // Start by rematerializing always beneficial registers. These should never
+ // be rollbacked. All other rematerialization candidates get added to list of
+ // rematerializations that will be scored.
+ REMAT_DEBUG(dbgs() << "==== ALWAYS BENEFICIAL ====\n");
+ SmallVector<ScoredRemat> ScoredRemats;
+ BitVector RecomputeRP(NumRegions);
+ for (const RematReg &Remat : RematRegs) {
+ if (Remat.isAlwaysBeneficial()) {
+ REMAT_DEBUG(dbgs() << "[" << MIRegion[Remat.DefMI]
+ << "] REMAT (always) | " << *Remat.DefMI);
+ rematerialize(Remat, RecomputeRP);
+ } else {
+ ScoredRemats.emplace_back(&Remat, DAG.ST, *DAG.TII);
+ }
+ }
+ unsetSatisifedRPTargets(RescheduleRegions);
+
+#ifndef NDEBUG
+ printTargetRegions();
+ unsigned RoundNum = 0;
+#endif
+
+ // Rematerialize registers in successive rounds until all RP targets are
+ // satisifed or until we run out of rematerialization candidates.
+ while ((updateAndVerifyRPTargets(RecomputeRP) || TargetRegions.any()) &&
+ !ScoredRemats.empty()) {
+ // (Re-)Score and (re-)sort all remats in increasing score order.
+ for (ScoredRemat &Remat : ScoredRemats)
+ Remat.update(TargetRegions, RPTargets, RegionFreq, !TargetOcc);
+ stable_sort(ScoredRemats);
+
+ REMAT_DEBUG({
+ dbgs() << "==== ROUND " << RoundNum << " ====\n";
+ for (const ScoredRemat &SRemat : ScoredRemats) {
+ dbgs() << REMAT_PREFIX << "*" << SRemat.getScore() << "* | "
+ << *SRemat.Remat->DefMI;
+ }
+ });
+
+ RecomputeRP.reset();
+ int RematIdx = ScoredRemats.size() - 1;
+
+ // Rematerialize registers in decreasing score order until we estimate that
+ // all RP targets are satisfied or until rematerialization candidates are no
+ // longer useful to decrease RP.
+ for (; RematIdx >= 0 && TargetRegions.any(); --RematIdx) {
+ const RematReg &Remat = *ScoredRemats[RematIdx].Remat;
+ int Score = ScoredRemats[RematIdx].getScore();
+
+ // Stop when scores become negative. Since scores monotonically decrease
+ // as remats are performed, we know there is nothing useful left to do in
+ // such cases.
+ if (Score <= 0) {
+ REMAT_DEBUG(dbgs() << "Stop remats on non-positive score | "
+ << *Remat.DefMI);
+ RematIdx = -1;
+ break;
+ }
+
+ // When previous rematerializations in this round have already satisfied
+ // RP targets in all regions this rematerialization can impact, we have a
+ // good indication that our scores have diverged significantly from
+ // reality, in which case we interrupt this round and re-score. This also
+ // ensures that every rematerialization we perform is possibly impactful
+ // in at least one target region.
+ if (!Remat.intersectWithTarget(TargetRegions)) {
+ REMAT_DEBUG(dbgs() << "Stop round on stale score | " << *Remat.DefMI);
+ break;
+ }
+
+ REMAT_DEBUG(dbgs() << "[" << MIRegion[Remat.DefMI] << "] REMAT *" << Score
+ << "* | " << *Remat.DefMI);
+ MachineInstr *RematMI = rematerialize(Remat, RecomputeRP);
+ // Every rematerialization done with the objective of increasing occupancy
+ // increases latency. If we don't manage to increase occupancy, we want to
+ // roll them back.
+ if (TargetOcc)
+ Rollbackable.push_back({RematMI, &Remat});
+ unsetSatisifedRPTargets(Remat.Live);
+ }
+
+#ifndef NDEBUG
+ printTargetRegions();
+ ++RoundNum;
+#endif
+
+ // Peel off registers we already rematerialized from the vector's tail.
+ ScoredRemats.truncate(RematIdx + 1);
+ }
+ if (RescheduleRegions.none())
+ return false;
+
+ // Commit all pressure changes to the DAG and compute minimum achieved
+ // occupancy in impacted regions.
+ REMAT_DEBUG(dbgs() << "==== REMAT RESULTS ====\n");
+ unsigned DynamicVGPRBlockSize = MFI.getDynamicVGPRBlockSize();
+ AchievedOcc = MFI.getMaxWavesPerEU();
+ for (unsigned I : RescheduleRegions.set_bits()) {
+ const GCNRegPressure &RP = RPTargets[I].getCurrentRP();;
+ DAG.Pressure[I] = RP;
+ unsigned NewRegionOcc = RP.getOccupancy(ST, DynamicVGPRBlockSize);
+ AchievedOcc = std::min(AchievedOcc, NewRegionOcc);
+ REMAT_DEBUG(dbgs() << "[" << I << "] Achieved occupancy " << NewRegionOcc
+ << " (" << RPTargets[I] << ")\n");
+ }
- // Rematerialize identified instructions and update scheduler's state.
- rematerialize();
- if (GCNTrackers)
- DAG.RegionLiveOuts.buildLiveRegMap();
REMAT_DEBUG({
dbgs() << "Retrying function scheduling with new min. occupancy of "
<< AchievedOcc << " from rematerializing (original was "
@@ -1124,7 +1328,6 @@ bool PreRARematStage::initGCNSchedStage() {
dbgs() << ", target was " << *TargetOcc;
dbgs() << ")\n";
});
-
if (AchievedOcc > DAG.MinOccupancy) {
DAG.MinOccupancy = AchievedOcc;
SIMachineFunctionInfo &MFI = *MF.getInfo<SIMachineFunctionInfo>();
@@ -1151,6 +1354,10 @@ void UnclusteredHighRPStage::finalizeGCNSchedStage() {
}
bool GCNSchedStage::initGCNRegion() {
+ // Skip empty scheduling region.
+ if (DAG.begin() == DAG.end())
+ return false;
+
// Check whether this new region is also a new block.
if (DAG.RegionBegin->getParent() != CurrentMBB)
setupNewBlock();
@@ -1158,8 +1365,8 @@ bool GCNSchedStage::initGCNRegion() {
unsigned NumRegionInstrs = std::distance(DAG.begin(), DAG.end());
DAG.enterRegion(CurrentMBB, DAG.begin(), DAG.end(), NumRegionInstrs);
- // Skip empty scheduling regions (0 or 1 schedulable instructions).
- if (DAG.begin() == DAG.end() || DAG.begin() == std::prev(DAG.end()))
+ // Skip regions with 1 schedulable instruction.
+ if (DAG.begin() == std::prev(DAG.end()))
return false;
LLVM_DEBUG(dbgs() << "********** MI Scheduling **********\n");
@@ -1691,27 +1898,20 @@ bool PreRARematStage::allUsesAvailableAt(const MachineInstr *InstToRemat,
return true;
}
-bool PreRARematStage::canIncreaseOccupancyOrReduceSpill() {
+void PreRARematStage::setObjective() {
const Function &F = MF.getFunction();
- // Maps optimizable regions (i.e., regions at minimum and register-limited
- // occupancy, or regions with spilling) to the target RP we would like to
- // reach.
- DenseMap<unsigned, GCNRPTarget> OptRegions;
+ // Set up "spilling targets" for all regions.
unsigned MaxSGPRs = ST.getMaxNumSGPRs(F);
unsigned MaxVGPRs = ST.getMaxNumVGPRs(F);
- auto ResetTargetRegions = [&]() {
- OptRegions.clear();
- for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
- const GCNRegPressure &RP = DAG.Pressure[I];
- GCNRPTarget Target(MaxSGPRs, MaxVGPRs, MF, RP);
- if (!Target.satisfied())
- OptRegions.insert({I, Target});
- }
- };
+ for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
+ const GCNRegPressure &RP = DAG.Pressure[I];
+ GCNRPTarget &Target = RPTargets.emplace_back(MaxSGPRs, MaxVGPRs, MF, RP);
+ if (!Target.satisfied())
+ TargetRegions.set(I);
+ }
- ResetTargetRegions();
- if (!OptRegions.empty() || DAG.MinOccupancy >= MFI.getMaxWavesPerEU()) {
+ if (TargetRegions.any() || DAG.MinOccupancy >= MFI.getMaxWavesPerEU()) {
// In addition to register usage being above addressable limits, occupancy
// below the minimum is considered like "spilling" as well.
TargetOcc = std::nullopt;
@@ -1719,59 +1919,27 @@ bool PreRARematStage::canIncreaseOccupancyOrReduceSpill() {
// There is no spilling and room to improve occupancy; set up "increased
// occupancy targets" for all regions.
TargetOcc = DAG.MinOccupancy + 1;
- unsigned VGPRBlockSize =
- MF.getInfo<SIMachineFunctionInfo>()->getDynamicVGPRBlockSize();
+ const unsigned VGPRBlockSize = MFI.getDynamicVGPRBlockSize();
MaxSGPRs = ST.getMaxNumSGPRs(*TargetOcc, false);
MaxVGPRs = ST.getMaxNumVGPRs(*TargetOcc, VGPRBlockSize);
- ResetTargetRegions();
- }
- REMAT_DEBUG({
- dbgs() << "Analyzing ";
- MF.getFunction().printAsOperand(dbgs(), false);
- dbgs() << ": ";
- if (OptRegions.empty()) {
- dbgs() << "no objective to achieve, occupancy is maximal at "
- << MFI.getMaxWavesPerEU();
- } else if (!TargetOcc) {
- dbgs() << "reduce spilling (minimum target occupancy is "
- << MFI.getMinWavesPerEU() << ')';
- } else {
- dbgs() << "increase occupancy from " << DAG.MinOccupancy << " to "
- << TargetOcc;
- }
- dbgs() << '\n';
- for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
- if (auto OptIt = OptRegions.find(I); OptIt != OptRegions.end()) {
- dbgs() << REMAT_PREFIX << " [" << I << "] " << OptIt->getSecond()
- << '\n';
- }
+ for (auto [I, Target] : enumerate(RPTargets)) {
+ Target.setTarget(MaxSGPRs, MaxVGPRs);
+ if (!Target.satisfied())
+ TargetRegions.set(I);
}
- });
- if (OptRegions.empty())
- return false;
+ }
+}
- // Accounts for a reduction in RP in an optimizable region. Returns whether we
- // estimate that we have identified enough rematerialization opportunities to
- // achieve our goal, and sets Progress to true when this particular reduction
- // in pressure was helpful toward that goal.
- auto ReduceRPInRegion = [&](auto OptIt, Register Reg, LaneBitmask Mask,
- bool &Progress) -> bool {
- GCNRPTarget &Target = OptIt->getSecond();
- if (!Target.isSaveBeneficial(Reg))
- return false;
- Progress = true;
- Target.saveReg(Reg, Mask, DAG.MRI);
- if (Target.satisfied())
- OptRegions.erase(OptIt->getFirst());
- return OptRegions.empty();
- };
+bool PreRARematStage::collectRematRegs(ArrayRef<uint64_t> RegionFreq) {
+ assert(RegionFreq.size() == DAG.Regions.size());
// We need up-to-date live-out info. to query live-out register masks in
// regions containing rematerializable instructions.
DAG.RegionLiveOuts.buildLiveRegMap();
- // Cache set of registers that are going to be rematerialized.
- DenseSet<unsigned> RematRegs;
+ // Set of registers already marked for potential remterialization; used for
+ // remat chains checks.
+ DenseSet<Register> RematRegSet;
// Identify rematerializable instructions in the function.
for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
@@ -1782,30 +1950,34 @@ bool PreRARematStage::canIncreaseOccupancyOrReduceSpill() {
if (!isTriviallyReMaterializable(DefMI))
continue;
- // We only support rematerializing virtual registers with one definition.
+ // We only support rematerializing virtual registers with one
+ // definition.
Register Reg = DefMI.getOperand(0).getReg();
if (!Reg.isVirtual() || !DAG.MRI.hasOneDef(Reg))
continue;
// We only care to rematerialize the instruction if it has a single
- // non-debug user in a different region. The using MI may not belong to a
- // region if it is a lone region terminator.
+ // non-debug user in a different region.
+ // FIXME: Allow rematerializations with multiple uses. This should be
+ // relatively easy to support using the current cost model.
MachineInstr *UseMI = DAG.MRI.getOneNonDBGUser(Reg);
if (!UseMI)
continue;
auto UseRegion = MIRegion.find(UseMI);
- if (UseRegion != MIRegion.end() && UseRegion->second == I)
+ if (UseRegion == MIRegion.end() || UseRegion->second == I)
continue;
// Do not rematerialize an instruction if it uses or is used by an
// instruction that we have designated for rematerialization.
// FIXME: Allow for rematerialization chains: this requires 1. updating
- // remat points to account for uses that are rematerialized, and 2. either
- // rematerializing the candidates in careful ordering, or deferring the
- // MBB RP walk until the entire chain has been rematerialized.
- if (Rematerializations.contains(UseMI) ||
- llvm::any_of(DefMI.operands(), [&RematRegs](MachineOperand &MO) {
- return MO.isReg() && RematRegs.contains(MO.getReg());
+ // remat points to account for uses that are rematerialized, and 2.
+ // either rematerializing the candidates in careful ordering, or
+ // deferring the MBB RP walk until the entire chain has been
+ // rematerialized.
+ MachineOperand &UseFirstMO = UseMI->getOperand(0);
+ if ((UseFirstMO.isReg() && RematRegSet.contains(UseFirstMO.getReg())) ||
+ llvm::any_of(DefMI.operands(), [&RematRegSet](MachineOperand &MO) {
+ return MO.isReg() && RematRegSet.contains(MO.getReg());
}))
continue;
@@ -1817,106 +1989,146 @@ bool PreRARematStage::canIncreaseOccupancyOrReduceSpill() {
if (!allUsesAvailableAt(&DefMI, DefIdx, UseIdx))
continue;
- REMAT_DEBUG(dbgs() << "Region " << I << ": remat instruction " << DefMI);
- RematInstruction &Remat =
- Rematerializations.try_emplace(&DefMI, UseMI).first->second;
-
- bool RematUseful = false;
- if (auto It = OptRegions.find(I); It != OptRegions.end()) {
- // Optimistically consider that moving the instruction out of its
- // defining region will reduce RP in the latter; this assumes that
- // maximum RP in the region is reached somewhere between the defining
- // instruction and the end of the region.
- REMAT_DEBUG(dbgs() << " Defining region is optimizable\n");
- LaneBitmask Mask = DAG.RegionLiveOuts.getLiveRegsForRegionIdx(I)[Reg];
- if (ReduceRPInRegion(It, Reg, Mask, RematUseful))
- return true;
- }
-
- for (unsigned LIRegion = 0; LIRegion != E; ++LIRegion) {
- // We are only collecting regions in which the register is a live-in
- // (and may be live-through).
- auto It = DAG.LiveIns[LIRegion].find(Reg);
- if (It == DAG.LiveIns[LIRegion].end() || It->second.none())
- continue;
- Remat.LiveInRegions.insert(LIRegion);
-
- // Account for the reduction in RP due to the rematerialization in an
- // optimizable region in which the defined register is a live-in. This
- // is exact for live-through region but optimistic in the using region,
- // where RP is actually reduced only if maximum RP is reached somewhere
- // between the beginning of the region and the rematerializable
- // instruction's use.
- if (auto It = OptRegions.find(LIRegion); It != OptRegions.end()) {
- REMAT_DEBUG(dbgs() << " Live-in in region " << LIRegion << '\n');
- if (ReduceRPInRegion(It, Reg, DA...
[truncated]
✅ With the latest revision this PR passed the C/C++ code formatter.
| #include "llvm/Support/ErrorHandling.h" | ||
| #include "llvm/Support/raw_ostream.h" | ||
| #include <deque> |
where is this used?
Nowhere, thanks for the catch.
void PreRARematStage::printTargetRegions(bool PrintAll) const {
  if (PrintAll) {
    for (auto [I, Target] : enumerate(RPTargets))
      REMAT_DEBUG(dbgs() << "  [" << I << "] " << Target << '\n');
You only use this under REMAT_DEBUG uses, so can drop all the REMAT_DEBUGs here
PreRARematStage::printTargetRegions is called from some #ifndef NDEBUG blocks; unless I am mistaken, if I don't wrap the dbgs with REMAT_DEBUG these logs will show up even if I didn't ask for the scheduler's debug output.
Then wrap those dangling uses in the _DEBUG blocks, they don't need their own ifndef NDEBUGs
MachineOperand &UseFirstMO = UseMI->getOperand(0);
if ((UseFirstMO.isReg() && RematRegSet.contains(UseFirstMO.getReg())) ||
    llvm::any_of(DefMI.operands(), [&RematRegSet](MachineOperand &MO) {
      return MO.isReg() && RematRegSet.contains(MO.getReg());
This looks backwards? The UseMI is looking at the assumed single def, and DefMI is looking at all of its operands? Should one or both of these checks be verifying it's a use/def?
I think the implementation is right (though it is confusing, so I refactored it to make it clearer).
UseFirstMO.isReg() && RematRegSet.contains(UseFirstMO.getReg())

This ensures that UseMI is not itself an instruction we have already added to the rematerializable list. If the first operand is not a def reg then the instruction couldn't have been considered rematerializable, so RematRegSet won't contain the register.

llvm::any_of(DefMI.operands(), [&RematRegSet](MachineOperand &MO) {
  return MO.isReg() && RematRegSet.contains(MO.getReg())
})

This ensures that no operand of DefMI is defined by an instruction we have added to the rematerializable list. We know that the first operand will clear the test since we only consider registers with a single def as potentially rematerializable, so the defined register is guaranteed to not be in RematRegSet already.
// Store the register's lane bitmask.
unsigned SubReg = DefMI->getOperand(0).getSubReg();
Mask = SubReg ? DAG.TRI->getSubRegIndexLaneMask(SubReg)
Is using LaneBitmask::getAll good enough here?
I can in fact simplify this whole thing to getSubRegIndexLaneMask(SubReg) since it can handle the 0 case, thanks for the catch.
This actually doesn't work since the lane mask returned when subreg is 0 is ~0U, yielding a very large/incorrect number when calling getNumCoveredRegs on it. By sheer luck this did not break tests on a07fef2 but only on 8c142c8, which changes rematerialization priorities in some tests when multiple rematerialization candidates have the same score. I rolled back the change in 70cd6f6.
getNumCoveredRegs is a broken API. You're not allowed to rely on lanemask values this way. You're only allowed to check for overlaps. A LaneMask on its own doesn't mean anything, you need to know the associated register class.
We probably should have a tablegenerated version of this function, or find a different way to get a register count at the use point
Thanks for the context, this is only used for the scoring heuristic and for the sake of time I just resorted to using the register class to get the register size, without accounting for a potential subregister. Left a FIXME for future reference.
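For illustration, here is a minimal sketch of that register-class-based size estimate, consistent with the divideCeil(getRegSizeInBits(RC), 32) snippet quoted later in this thread; the helper name and surrounding context are illustrative and not the PR's actual code.

// Hedged sketch: estimate how many 32-bit registers a rematerializable value
// occupies from its register class, ignoring any subregister index on the
// definition (the FIXME mentioned above). "estimateNumRegs" is an assumed name.
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/CodeGen/TargetRegisterInfo.h"
#include "llvm/Support/MathExtras.h"

static unsigned estimateNumRegs(const llvm::TargetRegisterInfo &TRI,
                                const llvm::MachineRegisterInfo &MRI,
                                llvm::Register Reg) {
  const llvm::TargetRegisterClass *RC = MRI.getRegClass(Reg);
  // Count 32-bit units in the class's register size.
  return llvm::divideCeil(TRI.getRegSizeInBits(*RC), 32);
}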
ping
Force-pushed 70cd6f6 to 3d7c021
// Skip regions with 1 schedulable instruction.
if (DAG.begin() == std::prev(DAG.end()))
  return false;
Can this be pulled up right after the empty region check?
Unfortunately no, ScheduleDAGMILive::enterRegion expects to be called even for single-instruction regions to maintain some internal state.
DenseSet<unsigned> RematRegs;
// Set of registers already marked for potential remterialization; used for
// remat chains checks.
DenseSet<Register> RematRegSet;
Suggested change:
-  DenseSet<Register> RematRegSet;
+  SmallSet<Register, 4> RematRegSet;
Maybe we should be tracking this on RegSubRegPair. Eventually we really need to improve rematerialization for subregisters.
I don't think we need to track subregisters at this point since we only consider registers with one def as rematerializable, so a register that would be partially defined more than once would not be considered.
}

-  if (!canIncreaseOccupancyOrReduceSpill())
+  setObjective();
I think it's possible to not have any remat target regions -- should we return in that case?
Yes, it is possible; however, I think we still want to do all the "always beneficial" remats in those cases as they reduce live ranges and may reduce latency as well by moving instructions to lower-frequency regions.
I'm not sure we should have the "always beneficial" feature. I think the scheduling rematerialization should be used only to reduce spilling or increase occupancy in accordance with the scheduler's goals. The "always beneficial" feature seems to be moving away from that, and supporting it will likely cause complexity in the heuristics.
I wonder why these always beneficial remats were not captured by the MachineSink pass?
I removed the feature from this PR, making this contribution a little shorter/simpler which is good. I still think this is a useful feature to have but I get your point that it doesn't really belong in the scheduler's remat phase. I also don't know why MachineSink is not taking care of those, it's something I want to look into in the future.
Register Reg = Remat->DefMI->getOperand(0).getReg();
unsigned RPScore = 0;
for (unsigned I : TargetRegions.set_bits()) {
  unsigned Freq = std::max(RegionFreq[I], static_cast<uint64_t>(1));
Checking the frequency of the affected region only makes sense if we're going to reduce spilling.
For the occupancy case, this info is basically irrelevant.
I think it can still be helpful to consider region frequency in the occupancy case to e.g. prioritize rematerializing in lower frequency regions to minimize the latency penalty.
GCNRPTracker::LiveRegSet &RegionLiveIns = DAG.LiveIns[I];
// The estimated RP reduction is directly proportional to the size of the
// rematerializable register.
setRPScore(RPScore * SIRegisterInfo::getNumCoveredRegs(Remat->Mask));
For both the occupancy and spilling cases, I would think we would want to prioritize remats into less hot blocks.
The WeightRP and WeightRPMaybe weights are meant to encourage that by giving less weight to regions in which the register is actually used, so between two registers live in the same number of target regions with the same total frequency, the register which is used in the lower frequency region will have higher score i.e. will be prioritized.
The relative value of these two weights can be changed to further prioritize rematerializing into less hot blocks. Alternatively, the frequency of the region being rematerialized to can be made into a separate scoring criteria that takes precedence over the number of target regions in which the register is live. I don't know which option will be better in most cases. What do you think?
IMO, the scoring model should take into account the following (perhaps in order of priority):
1. maximal frequency of affected regions (if spilling)
2. delta between def / remat frequency
3. the number of target regions affected
4. reg / lanemask width.
Maybe 3 and 4 can be combined into a single metric.
For the occupancy case, the frequency of the intermediate target blocks doesn't matter -- we remove latency from the original def region, and we add latency to the remat region. We shouldn't care about the intermediate frequency as doing a remat has no impact on those regions -- if we do enough remats we will increase occupancy which is function wide.
For the spilling case, the frequency of the intermediate target blocks does matter. If our RP is above the threshold, then we will spill somewhere, and we would prefer not to spill in the hottest blocks. Thus, we should prioritize reducing RP in the highest frequency regions first.
> For the occupancy case, the frequency of the intermediate target blocks doesn't matter -- we remove latency from the original def region, and we add latency to the remat region. We shouldn't care about the intermediate frequency as doing a remat has no impact on those regions -- if we do enough remats we will increase occupancy which is function wide.
> For the spilling case, the frequency of the intermediate target blocks does matter. If our RP is above the threshold, then we will spill somewhere, we would prefer not to spill in the hottest blocks. Thus, we should prioritize reducing RP in the highest frequency regions first.
Thanks a lot for the explanation, that makes sense to me now. Updated the scoring method to reflect this.
> IMO, the scoring model should take into account the following (perhaps in order of priority):
> - maximal frequency of affected regions (if spilling)
> - delta between def / remat frequency
> - the number of target regions affected
> - reg / lanemask width.
With the above remark in mind those make more sense than what I had indeed. Implemented exactly those scoring components with those priorities (with merged 3/4).
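To make the agreed-upon ordering concrete, here is a hedged sketch of a comparator encoding those priorities; the type and field names (RematScoreSketch, MaxTargetFreq, FreqGain, RegionImpact) are illustrative and are not the PR's actual ScoredRemat members.

// Hedged sketch of a candidate ordering following the priority list above:
// (1) frequency of the hottest affected region when reducing spilling,
// (2) defining-region minus remat-region frequency (prefer moving work into
//     colder blocks), (3) number of affected target regions weighted by
//     register width. All names are illustrative.
#include <cstdint>
#include <tuple>

struct RematScoreSketch {
  uint64_t MaxTargetFreq = 0; // Only meaningful when the objective is spilling.
  int64_t FreqGain = 0;       // Defining-region frequency minus remat-region frequency.
  uint64_t RegionImpact = 0;  // Affected target regions times register width.

  // Smaller compares first, so sorting ascending leaves the best candidate at
  // the tail, matching the "rematerialize from the vector's tail" loop.
  bool operator<(const RematScoreSketch &O) const {
    return std::tie(MaxTargetFreq, FreqGain, RegionImpact) <
           std::tie(O.MaxTargetFreq, O.FreqGain, O.RegionImpact);
  }
};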
// most cases cheaper than spilling. We still give a bonus to remats for
// which we are able to do the calculation.
if (InstrLatencyGain && *InstrLatencyGain < 0) {
  int SpillLatencyGain = SaveCost * Remat->DefFrequency;
This spill cost model seems a bit flaky -- can we get away with just checking the def vs remat frequency and the latency?
I have moved the calculation to ScoredRemat::getLatencyGain but it is the same idea i.e. I'm comparing the latency difference induced by the rematerializable MI being moved to a different region to the additional latency that would come from saving/restoring to/from stack. The former should be smaller than the latter in almost all cases. Does that make sense?
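As a rough illustration of the comparison described here (the check was later dropped from the PR, see below), a hedged sketch with purely illustrative names and abstract cost units:

// Hedged sketch, not the PR's code: remat is worthwhile if the latency added
// by executing the defining instruction in the (possibly hotter) use region is
// smaller than the latency a stack save/restore pair would add instead.
#include <cstdint>

static bool rematLatencyBeatsSpill(int64_t RematLatency, int64_t SaveCost,
                                   int64_t RestoreCost, int64_t DefFreq,
                                   int64_t UseFreq) {
  // Latency change from moving the defining instruction to the use region.
  const int64_t AddedLatency = (UseFreq - DefFreq) * RematLatency;
  // Latency the spilling alternative would add: a save near the def plus a
  // restore near the use.
  const int64_t SpillLatency = DefFreq * SaveCost + UseFreq * RestoreCost;
  return AddedLatency < SpillLatency;
}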
Force-pushed e979f80 to 007d9b9
-  ResetTargetRegions();
-  if (!OptRegions.empty() || DAG.MinOccupancy >= MFI.getMaxWavesPerEU()) {
+  if (TargetRegions.any() || DAG.MinOccupancy >= MFI.getMaxWavesPerEU()) {
Does this take into account the LDS & workgroup size?
Yes, MFI.getMaxWavesPerEU comes from getWavesPerEU which takes into account those things for the maximum achievable occupancy.
// This save is exact in beneficial regions but optimistic in all other
// regions where the register is live.
RPTargets[I].saveReg(Reg, Remat.Mask, DAG.MRI);
DAG.LiveIns[I].erase(Reg);
Do we need to check if the remat fully covers the LiveIn mask? Maybe due to the one-def check this isn't needed?
I think this isn't needed because of the one-def check indeed.
// regions where the register is live.
RPTargets[I].saveReg(Reg, Remat.Mask, DAG.MRI);
DAG.LiveIns[I].erase(Reg);
DAG.RegionLiveOuts.getLiveRegsForRegionIdx(I).erase(Reg);
Ditto
Ditto as well
}

void GCNScheduleDAGMILive::deleteMI(unsigned RegionIdx, MachineInstr *MI) {
  updateRegionBoundaries(Regions[RegionIdx], MI, nullptr);
I think the updateRegionBoundaries is broken -- in the case where we are removing the only instruction from the region.
Probably the best way to fix is it to update the MIRegion for each remat, and check this when updating the boundaries.
I ended up deleting updateRegionBoundaries because I was the only user and it was a somewhat long function that can be replaced with a couple simple lines at both use points. However I don't think I changed the implementation by doing that, and I am not seeing an issue when deleting the only instruction from the region in deleteMI.
If MI is the only instruction in the region we have Region.second == std::next(MachineBasicBlock::iterator(MI)) so after the function we will have Region.first == Region.second. test_rollback_remats_emptydefregion in machine-scheduler-sink-trivial-remats.mir tests that we can completely empty a region (bb.1) and then rollback instructions to it again and it seems to work without issue.
> test_rollback_remats_emptydefregion in machine-scheduler-sink-trivial-remats.mir tests that we can completely empty a region (bb.1) and then rollback instructions to it again and it seems to work without issue.
Can you add a test where we do this in the case of multiple scheduling regions in a block? We have multiple sched regions in a block, but, through remat, we empty the first one? You can use sched.barrier(i32 0) to create sched regions. My concern is that we may run into issues if Region.second != MBB.end()
Added test_rollback_remats_emptydefregion_barrier (renamed the existing test to test_rollback_remats_emptydefregion_block). It seems to be working fine (and I checked the logs to make sure it is actually rematerializing and then rolling back remats). Let me know if there is still a concern.
  return divideCeil(DAG.TRI->getRegSizeInBits(RC), 32);
}

unsigned PreRARematStage::ScoredRemat::getLatencyGain(
I think we might want to consider just removing this latency spill cost check as it just adds more code to support. For all current users, the remat latency will be less than the spill latency.
Maybe we can consider adding this if we remat more expensive instructions / as needed.
Makes sense, removed it.
Force-pushed 874692b to d452fa9
    const GCNScheduleDAGMILive &DAG)
    : Remat(Remat), NumRegs(getNumRegs(DAG)), FreqDiff(getFreqDiff(Freq)) {}

unsigned PreRARematStage::ScoredRemat::getNumRegs(
Can we use the RematReg::Mask ?
Matt mentioned here that I cannot rely on the mask + SIRegisterInfo::getNumCoveredRegs to reliably derive the number of registers so I am now using the RC. However I don't know of a way to get the number of registers covered by a subregister from the RC + SubIdx.
getNumChannelsFromSubReg?
Oh, that's implemented with getNumCoveredRegs
But size of subregister class should work too
f41fddc passed functional internal testing.
-  ResetTargetRegions();
-  if (!OptRegions.empty() || DAG.MinOccupancy >= MFI.getMaxWavesPerEU()) {
+  if (TargetRegions.any() || DAG.MinOccupancy >= MFI.getMaxWavesPerEU()) {
Can we use/hoist the MFI.getMaxWavesPerEU() check as part of an early exit check now that we no longer do always profitable remats?
It makes sense that we should be able to do that but after trying to move this to the beginning of PreRARematStage::initGCNSchedStage it can have an impact when the "amdgpu-num-sgpr" or "amdgpu-num-vgpr" attributes are set with values lower than the maximums we would have gotten by default. If the condition is met we will ignore these harder to achieve RP targets, whereas setObjective takes those attributes into account through getMaxNumSGPRs/getMaxNumVGPRs. The effect can be seen on the small_num_sgprs_as_spill/small_num_vgprs_as_spill tests in machine-scheduler-sink-trivial-remats-attr.mir.
I am unsure whether we should care about honoring these attributes if we are already at maximum achievable occupancy though.
Ah - makes sense. I think we should continue honoring these attributes and exit in the way we do. Other scheduling heuristics take this into account, and I can see how the attribute would be useful for dev / testing purposes.
I think we can still early exit on maxWavesPerEU if the numVGPRs and numSGPRs attributes aren't specified.
| dbgs() << "reduce spilling (minimum target occupancy is " | ||
| << MFI.getMinWavesPerEU() << ")\n"; | ||
| } | ||
| printTargetRegions(/*PrintAll=*/TargetRegions.none()); |
I think TargetRegions.none() will always be false here -- since setObjective would have returned false above?
Indeed, this also allows me to remove the function argument altogether, thanks.
// The difference between defining and using frequency is in the range
// [-MaxDiff, MaxDiff], shift it to [0, 2 x MaxDiff] to stay in the positive
// range, then rescale to the representable range in the final score.
const uint64_t FreqDiff = (MaxDiff + (DefOrOne - UseOrMax));
Can you add UnitTests or provide some more info about how this FreqDiff scaling is supposed to work -- it's not very clear just reading through it.
Alternatively, I think you could simplify the ScoredRemat::score handling to use a struct or something instead of the complex bitpack algorithm, which I think would allow us to use a more direct calculation for the relative frequency. I think this will also help against information loss.
I removed the bitpacking algorithm and simply used three class members. It does simplify the logic a bunch at the cost of extra data and extra comparison/sorting time.
It may not matter much at this point since we typically don't have a lot of rematerialization candidates due to the existing limitations, but with support for remat with multiple uses and chains the number of candidates is much larger, so these things may become more important later, and I can always re-introduce that then.
The point of the now removed calculation around FreqDiff was to rescale the range of possible differences between defining and using region to the range of representable values in the bitpacked score (as a function of the number of bits allocated to that scoring component). It mapped the largest possible penalty to latency (DefFreq < UseFreq) to 0 and the smallest possible penalty to latency (DefFreq > UseFreq) to 0xFF..FF, with a linear (up to integer rounding) interpolation of the values in between. favor_lower_frequency in machine-scheduler-rematerialization-scoring.mir shows that we will prefer to rematerialize registers into regions of lower frequency, all other things being equal.
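For context, a hedged sketch of that since-removed rescaling; names (ScoreBits, DefFreq, UseFreq, MaxDiff) are assumed, and frequencies are assumed to have been normalized beforehand so the multiplication cannot overflow.

// Hedged sketch, not the PR's exact code: shift the signed difference between
// defining and using region frequencies from [-MaxDiff, MaxDiff] into
// [0, 2 * MaxDiff], then linearly map it onto the bits reserved for this
// scoring component (worst latency penalty -> 0, best case -> max score).
#include <cstdint>

static uint64_t rescaleFreqDiff(uint64_t DefFreq, uint64_t UseFreq,
                                uint64_t MaxDiff, unsigned ScoreBits = 16) {
  const uint64_t MaxScore = (1ULL << ScoreBits) - 1;
  if (MaxDiff == 0)
    return MaxScore; // All frequencies equal: no latency penalty either way.
  const uint64_t Shifted = MaxDiff + DefFreq - UseFreq; // In [0, 2 * MaxDiff].
  return Shifted * MaxScore / (2 * MaxDiff); // Integer rounding; see discussion.
}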
// Peel off registers we already rematerialized from the vector's tail.
ScoredRemats.truncate(RematIdx + 1);
} while ((updateAndVerifyRPTargets(RecomputeRP) || TargetRegions.any()) &&
Is it possible to hit an infinite loop? If we have a ScoredRemat that is no longer beneficial, but non-empty TargetRegions?
I don't think so. If a remat is no longer beneficial while there are still target regions we will check whether any region was incorrectly unmarked as a target due to our heuristics, then re-score and start a new round. Then, either there is at least one remat which is still useful (the same one or another) and we will be able to make progress, or the first remat we check will have null score and we will stop rematerializing altogether.
if (!Remat->maybeBeneficial(TargetRegions, RPTargets))
  return;
This was redundant with the loop below, which will leave both MaxFreq and RegionImpact at 0 if the remat is not beneficial anyway.
if (RegionImpact != O.RegionImpact)
  return RegionImpact < O.RegionImpact;
// Break ties using pointer to rematerializable register.
return Remat > O.Remat;
Should it be Remat < O.Remat?
I made the comparison in this direction because, all other things being equal, it will favor rematerializing registers with slightly longer live ranges in their defining regions.
Registers are collected in instruction order within each region so registers defined "early" in a region have lower addresses than those defined "late" in the same region. The former also have longer live-ranges in the region since they start earlier and extend to the live-outs. If the defining region is a region with excess RP this can lead to more optimal rematerializations overall in very specific cases (our remat unit tests actually hit that very specific case quite often due to their regular/artificial nature).
TII->reMaterialize(*InsertPos->getParent(), InsertPos, NewReg, 0, DefMI,
                   *DAG.TRI);
MachineInstr *RematMI = &*std::prev(InsertPos);
Remat.UseMI->substituteRegister(Reg, NewReg, 0, *DAG.TRI);
Do we need to specify SubIdx ? If the use is using a subreg?
substituteRegister combines the existing subindex of operands using Reg with any subindex given in the third argument. Setting the latter to 0 preserves whatever subindex Reg already has.
LGTM aside from outstanding comments -- please wait for Matt to take a look.
- Use SchedModel instead of instruction itinerary.
- Don't use getNumConvertedRegs to get number of regs, use RC instead.
- Clarify some comments.
- Other minor changes.
- Change components of scoring system.
- Remove rematerialization of always beneficial registers.
- Remove latency calculation in scoring.
- Other small changes.
- Add some explanatory comments for the calculations.
- Normalize frequencies to minimum to avoid overflows.
- Create helper struct to hold all frequency information (helpful for further improvements).
In cases where the (normalized) maximum region frequency is lower than the number of different frequency scores representable (2^16 currently), the current method of computing the scaling factor shrinks the actually achievable range of frequency scores available (up to a single achievable score in the most extreme cases). This is due to integer rounding when dividing frequencies to fit within the representable range. When the maximum frequency is smaller than the maximum score achievable, use a different rescaling calculation to avoid the expressivity loss.
- Removed some debug-only functions from the header and put them as close as possible to where they are needed.
- Split bitpacked score into separate components compared 1-to-1.
Force-pushed fd86daf to 0e1a6d8
This is a significant refactoring of the scheduler's rematerialization stage meant to improve rematerialization capabilities and lay strong foundations for future improvements.
The core contribution is a scoring system to estimate the benefit of each rematerialization candidate. This score takes into account
The overall stage's flow is as follows. After setting its RP objective (to reach a specific occupancy, or to eliminate spilling), the stage collects all rematerialization candidates in the function. Among those candidates it identifies and rematerializes all those that are always beneficial to rematerialize regardless of the stage's objective. Then, it scores all other candidates and iteratively rematerializes them in decreasing score order until it reaches the RP objective in all regions. As with the current implementation, all affected regions are then rescheduled, and rematerializations conditionally rolled back if they end up not benefiting performance.
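A rough, hedged sketch of that flow as a self-contained skeleton; every name below is an illustrative stand-in, not the PR's actual classes or functions, and the loop structure mirrors what is visible in the patch.

// Illustrative skeleton only: score candidates each round, rematerialize from
// the tail in decreasing score order, stop on non-positive scores, and repeat
// until RP targets are met or no candidates remain.
#include <algorithm>
#include <vector>

struct CandidateSketch {
  int Score = 0; // Recomputed every round; <= 0 means "no longer useful".
};

struct RematStageSketch {
  std::vector<CandidateSketch> Candidates;
  bool TargetsRemaining = true; // Stand-in for TargetRegions.any().

  void rescore() { /* Recompute each candidate's score against RP targets. */ }
  void rematerialize(const CandidateSketch &) { /* Perform one remat. */ }

  void run() {
    while (TargetsRemaining && !Candidates.empty()) {
      rescore();
      // Sort in increasing score order so the best candidates sit at the tail.
      std::stable_sort(Candidates.begin(), Candidates.end(),
                       [](const CandidateSketch &A, const CandidateSketch &B) {
                         return A.Score < B.Score;
                       });
      int Idx = static_cast<int>(Candidates.size()) - 1;
      for (; Idx >= 0 && TargetsRemaining; --Idx) {
        if (Candidates[Idx].Score <= 0) {
          Idx = -1; // Scores only decrease, so nothing useful is left.
          break;
        }
        rematerialize(Candidates[Idx]);
      }
      // Peel off the candidates already rematerialized from the vector's tail.
      Candidates.resize(Idx + 1);
    }
  }
};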
New tests in machine-scheduler-rematerialization-scoring.mir showcase how the scoring system dictates which rematerializations are the most beneficial and therefore performed first.