-
Notifications
You must be signed in to change notification settings - Fork 15.2k
[AMDGPU][Scheduler] Consistent occupancy calculation during rematerialization #149224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
8 commits
Select commit
Hold shift + click to select a range
640025a
Use occupancy calculation taking into account VGPR combined savings
lucas-rami dbc88c8
Format
lucas-rami 7262c7f
Change approach to VGPR savings accounting
lucas-rami 5bf0b78
Format
lucas-rami e09f596
Remove assert and std::min
lucas-rami 83aab61
Remove redundant check
lucas-rami 890c105
Fix typo on
lucas-rami fa1524d
Remove nested debugs
lucas-rami File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1086,7 +1086,8 @@ bool ClusteredLowOccStage::initGCNSchedStage() { | |
| } | ||
|
|
||
| /// Allows to easily filter for this stage's debug output. | ||
| #define REMAT_DEBUG(X) LLVM_DEBUG(dbgs() << "[PreRARemat] "; X;) | ||
| #define REMAT_PREFIX "[PreRARemat] " | ||
| #define REMAT_DEBUG(X) LLVM_DEBUG(dbgs() << REMAT_PREFIX; X;) | ||
|
|
||
| bool PreRARematStage::initGCNSchedStage() { | ||
| // FIXME: This pass will invalidate cached BBLiveInMap and MBBLiveIns for | ||
|
|
@@ -1115,10 +1116,15 @@ bool PreRARematStage::initGCNSchedStage() { | |
| rematerialize(); | ||
| if (GCNTrackers) | ||
| DAG.RegionLiveOuts.buildLiveRegMap(); | ||
| REMAT_DEBUG( | ||
| dbgs() << "Retrying function scheduling with new min. occupancy of " | ||
| << AchievedOcc << " from rematerializing (original was " | ||
| << DAG.MinOccupancy << ", target was " << TargetOcc << ")\n"); | ||
| REMAT_DEBUG({ | ||
| dbgs() << "Retrying function scheduling with new min. occupancy of " | ||
| << AchievedOcc << " from rematerializing (original was " | ||
| << DAG.MinOccupancy; | ||
| if (TargetOcc) | ||
| dbgs() << ", target was " << *TargetOcc; | ||
| dbgs() << ")\n"; | ||
| }); | ||
|
|
||
| if (AchievedOcc > DAG.MinOccupancy) { | ||
| DAG.MinOccupancy = AchievedOcc; | ||
| SIMachineFunctionInfo &MFI = *MF.getInfo<SIMachineFunctionInfo>(); | ||
|
|
@@ -1540,8 +1546,7 @@ bool ClusteredLowOccStage::shouldRevertScheduling(unsigned WavesAfter) { | |
|
|
||
| bool PreRARematStage::shouldRevertScheduling(unsigned WavesAfter) { | ||
| return GCNSchedStage::shouldRevertScheduling(WavesAfter) || | ||
| mayCauseSpilling(WavesAfter) || | ||
| (IncreaseOccupancy && WavesAfter < TargetOcc); | ||
| mayCauseSpilling(WavesAfter) || (TargetOcc && WavesAfter < TargetOcc); | ||
| } | ||
|
|
||
| bool ILPInitialScheduleStage::shouldRevertScheduling(unsigned WavesAfter) { | ||
|
|
@@ -1687,78 +1692,63 @@ bool PreRARematStage::allUsesAvailableAt(const MachineInstr *InstToRemat, | |
| } | ||
|
|
||
| bool PreRARematStage::canIncreaseOccupancyOrReduceSpill() { | ||
| REMAT_DEBUG({ | ||
| dbgs() << "Collecting rematerializable instructions in "; | ||
| MF.getFunction().printAsOperand(dbgs(), false); | ||
| dbgs() << '\n'; | ||
| }); | ||
| const Function &F = MF.getFunction(); | ||
|
|
||
| // Maps optimizable regions (i.e., regions at minimum and register-limited | ||
| // occupancy, or regions with spilling) to the target RP we would like to | ||
| // reach. | ||
| DenseMap<unsigned, GCNRPTarget> OptRegions; | ||
| const Function &F = MF.getFunction(); | ||
| unsigned DynamicVGPRBlockSize = | ||
| MF.getInfo<SIMachineFunctionInfo>()->getDynamicVGPRBlockSize(); | ||
|
|
||
| std::pair<unsigned, unsigned> WavesPerEU = ST.getWavesPerEU(F); | ||
| const unsigned MaxSGPRsNoSpill = ST.getMaxNumSGPRs(F); | ||
| const unsigned MaxVGPRsNoSpill = ST.getMaxNumVGPRs(F); | ||
| const unsigned MaxSGPRsIncOcc = | ||
| ST.getMaxNumSGPRs(DAG.MinOccupancy + 1, false); | ||
| const unsigned MaxVGPRsIncOcc = | ||
| ST.getMaxNumVGPRs(DAG.MinOccupancy + 1, DynamicVGPRBlockSize); | ||
| IncreaseOccupancy = WavesPerEU.second > DAG.MinOccupancy; | ||
|
|
||
| // Collect optimizable regions. If there is spilling in any region we will | ||
|
||
| // just try to reduce spilling. Otherwise we will try to increase occupancy by | ||
| // one in the whole function. | ||
| for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) { | ||
| GCNRegPressure &RP = DAG.Pressure[I]; | ||
| // We allow ArchVGPR or AGPR savings to count as savings of the other kind | ||
| // of VGPR only when trying to eliminate spilling. We cannot do this when | ||
| // trying to increase occupancy since VGPR class swaps only occur later in | ||
| // the register allocator i.e., the scheduler will not be able to reason | ||
| // about these savings and will not report an increase in the achievable | ||
| // occupancy, triggering rollbacks. | ||
| GCNRPTarget Target(MaxSGPRsNoSpill, MaxVGPRsNoSpill, MF, RP, | ||
| /*CombineVGPRSavings=*/true); | ||
| if (!Target.satisfied() && IncreaseOccupancy) { | ||
| // There is spilling in the region and we were so far trying to increase | ||
| // occupancy. Strop trying that and focus on reducing spilling. | ||
| IncreaseOccupancy = false; | ||
| OptRegions.clear(); | ||
| } else if (IncreaseOccupancy) { | ||
| // There is no spilling in the region, try to increase occupancy. | ||
| Target = GCNRPTarget(MaxSGPRsIncOcc, MaxVGPRsIncOcc, MF, RP, | ||
| /*CombineVGPRSavings=*/false); | ||
| unsigned MaxSGPRs = ST.getMaxNumSGPRs(F); | ||
| unsigned MaxVGPRs = ST.getMaxNumVGPRs(F); | ||
| auto ResetTargetRegions = [&]() { | ||
| OptRegions.clear(); | ||
| for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) { | ||
| const GCNRegPressure &RP = DAG.Pressure[I]; | ||
| GCNRPTarget Target(MaxSGPRs, MaxVGPRs, MF, RP); | ||
| if (!Target.satisfied()) | ||
| OptRegions.insert({I, Target}); | ||
| } | ||
| if (!Target.satisfied()) | ||
| OptRegions.insert({I, Target}); | ||
| } | ||
| if (OptRegions.empty()) | ||
| return false; | ||
| }; | ||
|
|
||
| #ifndef NDEBUG | ||
| if (IncreaseOccupancy) { | ||
| REMAT_DEBUG(dbgs() << "Occupancy minimal (" << DAG.MinOccupancy | ||
| << ") in regions:\n"); | ||
| ResetTargetRegions(); | ||
| if (!OptRegions.empty() || DAG.MinOccupancy >= MFI.getMaxWavesPerEU()) { | ||
| // In addition to register usage being above addressable limits, occupancy | ||
| // below the minimum is considered like "spilling" as well. | ||
| TargetOcc = std::nullopt; | ||
| } else { | ||
| REMAT_DEBUG(dbgs() << "Spilling w.r.t. minimum target occupancy (" | ||
| << WavesPerEU.first << ") in regions:\n"); | ||
| } | ||
| for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) { | ||
| if (auto OptIt = OptRegions.find(I); OptIt != OptRegions.end()) | ||
| REMAT_DEBUG(dbgs() << " [" << I << "] " << OptIt->getSecond() << '\n'); | ||
| // There is no spilling and room to improve occupancy; set up "increased | ||
| // occupancy targets" for all regions. | ||
| TargetOcc = DAG.MinOccupancy + 1; | ||
| unsigned VGPRBlockSize = | ||
| MF.getInfo<SIMachineFunctionInfo>()->getDynamicVGPRBlockSize(); | ||
| MaxSGPRs = ST.getMaxNumSGPRs(*TargetOcc, false); | ||
| MaxVGPRs = ST.getMaxNumVGPRs(*TargetOcc, VGPRBlockSize); | ||
| ResetTargetRegions(); | ||
| } | ||
| #endif | ||
|
|
||
| // When we are reducing spilling, the target is the minimum target number of | ||
| // waves/EU determined by the subtarget. In cases where either one of | ||
| // "amdgpu-num-sgpr" or "amdgpu-num-vgpr" are set on the function, the current | ||
| // minimum region occupancy may be higher than the latter. | ||
| TargetOcc = IncreaseOccupancy ? DAG.MinOccupancy + 1 | ||
| : std::max(DAG.MinOccupancy, WavesPerEU.first); | ||
| REMAT_DEBUG({ | ||
| dbgs() << "Analyzing "; | ||
| MF.getFunction().printAsOperand(dbgs(), false); | ||
| dbgs() << ": "; | ||
| if (OptRegions.empty()) { | ||
| dbgs() << "no objective to achieve, occupancy is maximal at " | ||
| << MFI.getMaxWavesPerEU(); | ||
| } else if (!TargetOcc) { | ||
| dbgs() << "reduce spilling (minimum target occupancy is " | ||
| << MFI.getMinWavesPerEU() << ')'; | ||
| } else { | ||
| dbgs() << "increase occupancy from " << DAG.MinOccupancy << " to " | ||
| << TargetOcc; | ||
| } | ||
| dbgs() << '\n'; | ||
| for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) { | ||
| if (auto OptIt = OptRegions.find(I); OptIt != OptRegions.end()) { | ||
| dbgs() << REMAT_PREFIX << " [" << I << "] " << OptIt->getSecond() | ||
| << '\n'; | ||
| } | ||
| } | ||
| }); | ||
| if (OptRegions.empty()) | ||
| return false; | ||
|
|
||
| // Accounts for a reduction in RP in an optimizable region. Returns whether we | ||
| // estimate that we have identified enough rematerialization opportunities to | ||
|
|
@@ -1767,7 +1757,7 @@ bool PreRARematStage::canIncreaseOccupancyOrReduceSpill() { | |
| auto ReduceRPInRegion = [&](auto OptIt, Register Reg, LaneBitmask Mask, | ||
| bool &Progress) -> bool { | ||
| GCNRPTarget &Target = OptIt->getSecond(); | ||
| if (!Target.isSaveBeneficial(Reg, DAG.MRI)) | ||
| if (!Target.isSaveBeneficial(Reg)) | ||
| return false; | ||
| Progress = true; | ||
| Target.saveReg(Reg, Mask, DAG.MRI); | ||
|
|
@@ -1876,7 +1866,7 @@ bool PreRARematStage::canIncreaseOccupancyOrReduceSpill() { | |
| } | ||
| } | ||
|
|
||
| if (IncreaseOccupancy) { | ||
| if (TargetOcc) { | ||
| // We were trying to increase occupancy but failed, abort the stage. | ||
| REMAT_DEBUG(dbgs() << "Cannot increase occupancy\n"); | ||
| Rematerializations.clear(); | ||
|
|
@@ -1979,7 +1969,9 @@ void PreRARematStage::rematerialize() { | |
| // All regions impacted by at least one rematerialization must be rescheduled. | ||
| // Maximum pressure must also be recomputed for all regions where it changed | ||
| // non-predictably and checked against the target occupancy. | ||
| AchievedOcc = TargetOcc; | ||
| unsigned DynamicVGPRBlockSize = | ||
| MF.getInfo<SIMachineFunctionInfo>()->getDynamicVGPRBlockSize(); | ||
| AchievedOcc = MFI.getMaxWavesPerEU(); | ||
| for (auto &[I, OriginalRP] : ImpactedRegions) { | ||
| bool IsEmptyRegion = DAG.Regions[I].first == DAG.Regions[I].second; | ||
| RescheduleRegions[I] = !IsEmptyRegion; | ||
|
|
@@ -2003,9 +1995,8 @@ void PreRARematStage::rematerialize() { | |
| } | ||
| } | ||
| DAG.Pressure[I] = RP; | ||
| AchievedOcc = std::min( | ||
| AchievedOcc, RP.getOccupancy(ST, MF.getInfo<SIMachineFunctionInfo>() | ||
| ->getDynamicVGPRBlockSize())); | ||
| AchievedOcc = | ||
| std::min(AchievedOcc, RP.getOccupancy(ST, DynamicVGPRBlockSize)); | ||
| } | ||
| REMAT_DEBUG(dbgs() << "Achieved occupancy " << AchievedOcc << "\n"); | ||
| } | ||
|
|
@@ -2035,7 +2026,7 @@ void PreRARematStage::finalizeGCNSchedStage() { | |
| // which case we do not want to rollback either (the rescheduling was already | ||
| // reverted in PreRARematStage::shouldRevertScheduling in such cases). | ||
| unsigned MaxOcc = std::max(AchievedOcc, DAG.MinOccupancy); | ||
| if (!IncreaseOccupancy || MaxOcc >= TargetOcc) | ||
| if (!TargetOcc || MaxOcc >= *TargetOcc) | ||
| return; | ||
|
|
||
| REMAT_DEBUG(dbgs() << "Rolling back all rematerializations\n"); | ||
|
|
||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is only beneficial to reduce the OtherRC if the OtherRC exceeds the addressable limit for that RC and the combined pressure is above the addressable limit for unified RF. In this case we cannot use copies / LiveRange splitting alone to allocate the registers, we must spill.
Since we know the current RC is less than the addressable limit, then it may be enough to just check the unified RP against the addressable limit for unified RF.
By reducing cross RC pressure any time we're over the
MaxUnifiedVGPRs, we are telling the rematerializer to issue cross RC copies to increase occupancy.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies, I am not sure I understand.
I guess we agree on the spilling case ($MaxVGPRs=256 \wedge MaxUnifiedVGPRs=512$ ) since in that case $NumVGPRsInRC \leq MaxVGPRs \wedge RP.getVGPRNum(true) > MaxUnifiedVGPRs \Longrightarrow NumVGPRsInOtherRC > MaxVGPRs$ (modulo the VGPR allocation granule in the unified computation) i.e., we only do cross-RC saves if there are too many excess VGPRs in the other RC to fit through copies in the current RC.
For the occupancy increase case ($0<MaxVGPRs=MaxUnifiedVGPRs\leq256$ ) we always have $NumVGPRsInRC<256$ and $NumVGPRsInOtherRC<256$ otherwise the stage would be trying to reduce spilling. If $NumVGPRsInRC \leq MaxVGPRs \wedge RP.getVGPRNum(true) > MaxUnifiedVGPRs$ , isn't any VGPR/AGPR save beneficial? Is there a chance we increase the number of cross RC copies by always saving there?
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think the way you've implemented it is fine actually.
I was looking at a case where we use the waves-per-eu attribute. Unless amdgpu-agpr-alloc is also specified, we will just split the unified RF in half during RA. In that case, with the current implementation, we would potentially be inserting cross RC copies to increase occupancy. I'm not sure how concerned we should be with this case, since for target's with UnifiedRF, we should be using pure vgpr unless in occupancy 1 case.
However, in the default case (no attributes), RA can allocate up to the maxNumVGPRs for each RC, which should encourage a more optimal split while honoring the unified limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing the investigation and for the insights, I'll keep them in mind for my future remat work.