[AMDGPU][SIInsertWaitCnts] Use RegUnits-based tracking #162077
Conversation
✅ With the latest revision this PR passed the C/C++ code formatter.
@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Changes

The pass was already "reinventing" the concept just to deal with 16 bit registers. Clean up the entire tracking logic to only use register units. There are no test changes because functionality didn't change, except:

- We can now track more LDS DMA IDs if we need it (up to 1 << 16).
- The debug prints changed a bit because we now talk in terms of register units.

This also changes the tracking to use a DenseMap instead of a massive fixed size table. This trades a bit of access speed for a smaller memory footprint. Allocating and memsetting a huge table to zero caused a non-negligible performance impact (I've observed up to 50% of the time in the pass spent in the memcpy built-in on a big test file). I also think we don't access these often enough to really justify using a vector. We do a few accesses per instruction, but not much more. In a huge 120MB LL file, I can barely see the trace of the DenseMap accesses.

I think this can be cleaned up a bit more. I want to see if we can remove PhysRegs from the WaitCnt bracket APIs, so it only reasons in terms of RegUnits and nothing else.

Patch is 35.36 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/162077.diff

1 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index 76bfce8c0f6f9..90a5cd4e87ae7 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -97,7 +97,27 @@ auto inst_counter_types(InstCounterType MaxCounter = NUM_INST_CNTS) {
return enum_seq(LOAD_CNT, MaxCounter);
}
-using RegInterval = std::pair<int, int>;
+/// Integer IDs used to track vector memory locations we may have to wait on.
+/// Encoded as u16 chunks:
+///
+/// [0, MAX_REGUNITS ): MCRegUnit
+/// [FIRST_LDSDMA, LAST_LDSDMA ): LDS DMA IDs
+using VMEMID = uint32_t;
+
+enum : VMEMID {
+ TRACKINGID_RANGE_LEN = (1 << 16),
+
+ REGUNITS_BEGIN = 0,
+ REGUNITS_END = REGUNITS_BEGIN + TRACKINGID_RANGE_LEN,
+
+ // Note for LDSDMA: LDSDMA_BEGIN corresponds to the "common"
+ // entry, which is updated for all LDS DMA operations encountered.
+ // Specific LDS DMA IDs start at LDSDMA_BEGIN + 1.
+ LDSDMA_BEGIN = REGUNITS_END,
+ LDSDMA_END = LDSDMA_BEGIN + TRACKINGID_RANGE_LEN,
+
+ NUM_LDSDMA = TRACKINGID_RANGE_LEN
+};
struct HardwareLimits {
unsigned LoadcntMax; // Corresponds to VMcnt prior to gfx12.
@@ -146,30 +166,6 @@ static constexpr StringLiteral WaitEventTypeName[] = {
#undef AMDGPU_EVENT_NAME
// clang-format on
-// The mapping is:
-// 0 .. SQ_MAX_PGM_VGPRS-1 real VGPRs
-// SQ_MAX_PGM_VGPRS .. NUM_ALL_VGPRS-1 extra VGPR-like slots
-// NUM_ALL_VGPRS .. NUM_ALL_VGPRS+SQ_MAX_PGM_SGPRS-1 real SGPRs
-// NUM_ALL_VGPRS+SQ_MAX_PGM_SGPRS .. SCC
-// We reserve a fixed number of VGPR slots in the scoring tables for
-// special tokens like SCMEM_LDS (needed for buffer load to LDS).
-enum RegisterMapping {
- SQ_MAX_PGM_VGPRS = 2048, // Maximum programmable VGPRs across all targets.
- AGPR_OFFSET = 512, // Maximum programmable ArchVGPRs across all targets.
- SQ_MAX_PGM_SGPRS = 128, // Maximum programmable SGPRs across all targets.
- // Artificial register slots to track LDS writes into specific LDS locations
- // if a location is known. When slots are exhausted or location is
- // unknown use the first slot. The first slot is also always updated in
- // addition to known location's slot to properly generate waits if dependent
- // instruction's location is unknown.
- FIRST_LDS_VGPR = SQ_MAX_PGM_VGPRS, // Extra slots for LDS stores.
- NUM_LDS_VGPRS = 9, // One more than the stores we track.
- NUM_ALL_VGPRS = SQ_MAX_PGM_VGPRS + NUM_LDS_VGPRS, // Where SGPRs start.
- NUM_ALL_ALLOCATABLE = NUM_ALL_VGPRS + SQ_MAX_PGM_SGPRS,
- // Remaining non-allocatable registers
- SCC = NUM_ALL_ALLOCATABLE
-};
-
// Enumerate different types of result-returning VMEM operations. Although
// s_waitcnt orders them all with a single vmcnt counter, in the absence of
// s_waitcnt only instructions of the same VmemType are guaranteed to write
@@ -616,32 +612,26 @@ class WaitcntBrackets {
return getScoreUB(T) - getScoreLB(T);
}
- unsigned getRegScore(int GprNo, InstCounterType T) const {
- if (GprNo < NUM_ALL_VGPRS)
- return VgprScores[T][GprNo];
-
- if (GprNo < NUM_ALL_ALLOCATABLE)
- return SgprScores[getSgprScoresIdx(T)][GprNo - NUM_ALL_VGPRS];
+ unsigned getSGPRScore(MCRegUnit RU, InstCounterType T) const {
+ auto It = SGPRs.find(RU);
+ return (It != SGPRs.end()) ? It->second.Scores[getSgprScoresIdx(T)] : 0;
+ }
- assert(GprNo == SCC);
- return SCCScore;
+ unsigned getVMemScore(VMEMID TID, InstCounterType T) const {
+ auto It = VMem.find(TID);
+ return (It != VMem.end()) ? It->second.Scores[T] : 0;
}
bool merge(const WaitcntBrackets &Other);
- RegInterval getRegInterval(const MachineInstr *MI,
- const MachineOperand &Op) const;
-
bool counterOutOfOrder(InstCounterType T) const;
void simplifyWaitcnt(AMDGPU::Waitcnt &Wait) const;
void simplifyWaitcnt(InstCounterType T, unsigned &Count) const;
- void determineWait(InstCounterType T, RegInterval Interval,
- AMDGPU::Waitcnt &Wait) const;
- void determineWait(InstCounterType T, int RegNo,
- AMDGPU::Waitcnt &Wait) const {
- determineWait(T, {RegNo, RegNo + 1}, Wait);
- }
+ void determineWaitForPhysReg(InstCounterType T, MCPhysReg Reg,
+ AMDGPU::Waitcnt &Wait) const;
+ void determineWaitForLDSDMA(InstCounterType T, VMEMID TID,
+ AMDGPU::Waitcnt &Wait) const;
void tryClearSCCWriteEvent(MachineInstr *Inst);
void applyWaitcnt(const AMDGPU::Waitcnt &Wait);
@@ -690,19 +680,19 @@ class WaitcntBrackets {
// Return true if there might be pending writes to the vgpr-interval by VMEM
// instructions with types different from V.
- bool hasOtherPendingVmemTypes(RegInterval Interval, VmemType V) const {
- for (int RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
- assert(RegNo < NUM_ALL_VGPRS);
- if (VgprVmemTypes[RegNo] & ~(1 << V))
+ bool hasOtherPendingVmemTypes(MCPhysReg Reg, VmemType V) const {
+ for (MCRegUnit RU : regunits(Reg)) {
+ auto It = VMem.find(RU);
+ if (It != VMem.end() && (It->second.VMEMTypes & ~(1 << V)))
return true;
}
return false;
}
- void clearVgprVmemTypes(RegInterval Interval) {
- for (int RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
- assert(RegNo < NUM_ALL_VGPRS);
- VgprVmemTypes[RegNo] = 0;
+ void clearVgprVmemTypes(MCPhysReg Reg) {
+ for (MCRegUnit RU : regunits(Reg)) {
+ if (auto It = VMem.find(RU); It != VMem.end())
+ It->second.VMEMTypes = 0;
}
}
@@ -718,7 +708,7 @@ class WaitcntBrackets {
bool hasPointSampleAccel(const MachineInstr &MI) const;
bool hasPointSamplePendingVmemTypes(const MachineInstr &MI,
- RegInterval Interval) const;
+ MCPhysReg RU) const;
void print(raw_ostream &) const;
void dump() const { print(dbgs()); }
@@ -730,9 +720,24 @@ class WaitcntBrackets {
unsigned MyShift;
unsigned OtherShift;
};
+
+ void determineWaitForScore(InstCounterType T, unsigned Score,
+ AMDGPU::Waitcnt &Wait) const;
+
static bool mergeScore(const MergeInfo &M, unsigned &Score,
unsigned OtherScore);
+ iterator_range<MCRegUnitIterator> regunits(MCPhysReg Reg) const {
+ assert(Reg != AMDGPU::SCC && "Shouldn't be used on SCC");
+ const TargetRegisterClass *RC = Context->TRI->getPhysRegBaseClass(Reg);
+ unsigned Size = Context->TRI->getRegSizeInBits(*RC);
+ if (!Context->TRI->isInAllocatableClass(Reg))
+ return {{}, {}};
+ if (Size == 16 && Context->ST->hasD16Writes32BitVgpr())
+ Reg = Context->TRI->get32BitRegister(Reg);
+ return Context->TRI->regunits(Reg);
+ }
+
void setScoreLB(InstCounterType T, unsigned Val) {
assert(T < NUM_INST_CNTS);
ScoreLBs[T] = Val;
@@ -749,15 +754,26 @@ class WaitcntBrackets {
ScoreLBs[EXP_CNT] = ScoreUBs[EXP_CNT] - Context->getWaitCountMax(EXP_CNT);
}
- void setRegScore(int GprNo, InstCounterType T, unsigned Val) {
- setScoreByInterval({GprNo, GprNo + 1}, T, Val);
+ void setRegScore(MCPhysReg Reg, InstCounterType T, unsigned Val) {
+ const SIRegisterInfo *TRI = Context->TRI;
+ if (Reg == AMDGPU::SCC) {
+ SCCScore = Val;
+ } else if (TRI->isVectorRegister(*Context->MRI, Reg)) {
+ for (MCRegUnit RU : regunits(Reg))
+ VMem[RU].Scores[T] = Val;
+ } else if (TRI->isSGPRReg(*Context->MRI, Reg)) {
+ auto STy = getSgprScoresIdx(T);
+ for (MCRegUnit RU : regunits(Reg))
+ SGPRs[RU].Scores[STy] = Val;
+ }
}
- void setScoreByInterval(RegInterval Interval, InstCounterType CntTy,
- unsigned Score);
+ void setVMemScore(VMEMID TID, InstCounterType T, unsigned Val) {
+ VMem[TID].Scores[T] = Val;
+ }
- void setScoreByOperand(const MachineInstr *MI, const MachineOperand &Op,
- InstCounterType CntTy, unsigned Val);
+ void setScoreByOperand(const MachineOperand &Op, InstCounterType CntTy,
+ unsigned Val);
const SIInsertWaitcnts *Context;
@@ -768,26 +784,39 @@ class WaitcntBrackets {
unsigned LastFlat[NUM_INST_CNTS] = {0};
// Remember the last GDS operation.
unsigned LastGDS = 0;
- // wait_cnt scores for every vgpr.
- // Keep track of the VgprUB and SgprUB to make merge at join efficient.
- int VgprUB = -1;
- int SgprUB = -1;
- unsigned VgprScores[NUM_INST_CNTS][NUM_ALL_VGPRS] = {{0}};
- // Wait cnt scores for every sgpr, the DS_CNT (corresponding to LGKMcnt
- // pre-gfx12) or KM_CNT (gfx12+ only), and X_CNT (gfx1250) are relevant.
- // Row 0 represents the score for either DS_CNT or KM_CNT and row 1 keeps the
- // X_CNT score.
- unsigned SgprScores[2][SQ_MAX_PGM_SGPRS] = {{0}};
+
+ /// The score tracking logic is fragmented as follows:
+ /// - VMem: VGPR RegUnits and LDS DMA IDs, see the VMEMID encoding.
+ /// - SGPRs: SGPR RegUnits
+ /// - SCC
+
+ struct VGPRInfo {
+ // Scores for all instruction counters.
+ unsigned Scores[NUM_INST_CNTS] = {0};
+ // Bitmask of the VmemTypes of VMEM instructions for this VGPR.
+ unsigned VMEMTypes = 0;
+ };
+
+ struct SGPRInfo {
+ // Wait cnt scores for every sgpr, the DS_CNT (corresponding to LGKMcnt
+ // pre-gfx12) or KM_CNT (gfx12+ only), and X_CNT (gfx1250) are relevant.
+ // Row 0 represents the score for either DS_CNT or KM_CNT and row 1 keeps
+ // the X_CNT score.
+ unsigned Scores[2] = {0};
+ };
+
+ DenseMap<VMEMID, VGPRInfo> VMem; // VGPR + LDS DMA
+ DenseMap<MCRegUnit, SGPRInfo> SGPRs;
+
// Reg score for SCC.
unsigned SCCScore = 0;
// The unique instruction that has an SCC write pending, if there is one.
const MachineInstr *PendingSCCWrite = nullptr;
- // Bitmask of the VmemTypes of VMEM instructions that might have a pending
- // write to each vgpr.
- unsigned char VgprVmemTypes[NUM_ALL_VGPRS] = {0};
+
// Store representative LDS DMA operations. The only useful info here is
// alias info. One store is kept per unique AAInfo.
- SmallVector<const MachineInstr *, NUM_LDS_VGPRS - 1> LDSDMAStores;
+ // Entry zero is the "generic" entry that applies to all LDSDMA stores.
+ SmallVector<const MachineInstr *> LDSDMAStores;
};
class SIInsertWaitcntsLegacy : public MachineFunctionPass {
@@ -813,82 +842,10 @@ class SIInsertWaitcntsLegacy : public MachineFunctionPass {
} // end anonymous namespace
-RegInterval WaitcntBrackets::getRegInterval(const MachineInstr *MI,
- const MachineOperand &Op) const {
- if (Op.getReg() == AMDGPU::SCC)
- return {SCC, SCC + 1};
-
- const SIRegisterInfo *TRI = Context->TRI;
- const MachineRegisterInfo *MRI = Context->MRI;
-
- if (!TRI->isInAllocatableClass(Op.getReg()))
- return {-1, -1};
-
- // A use via a PW operand does not need a waitcnt.
- // A partial write is not a WAW.
- assert(!Op.getSubReg() || !Op.isUndef());
-
- RegInterval Result;
-
- MCRegister MCReg = AMDGPU::getMCReg(Op.getReg(), *Context->ST);
- unsigned RegIdx = TRI->getHWRegIndex(MCReg);
-
- const TargetRegisterClass *RC = TRI->getPhysRegBaseClass(Op.getReg());
- unsigned Size = TRI->getRegSizeInBits(*RC);
-
- // AGPRs/VGPRs are tracked every 16 bits, SGPRs by 32 bits
- if (TRI->isVectorRegister(*MRI, Op.getReg())) {
- unsigned Reg = RegIdx << 1 | (AMDGPU::isHi16Reg(MCReg, *TRI) ? 1 : 0);
- assert(!Context->ST->hasMAIInsts() || Reg < AGPR_OFFSET);
- Result.first = Reg;
- if (TRI->isAGPR(*MRI, Op.getReg()))
- Result.first += AGPR_OFFSET;
- assert(Result.first >= 0 && Result.first < SQ_MAX_PGM_VGPRS);
- assert(Size % 16 == 0);
- Result.second = Result.first + (Size / 16);
-
- if (Size == 16 && Context->ST->hasD16Writes32BitVgpr()) {
- // Regardless of which lo16/hi16 is used, consider the full 32-bit
- // register used.
- if (AMDGPU::isHi16Reg(MCReg, *TRI))
- Result.first -= 1;
- else
- Result.second += 1;
- }
- } else if (TRI->isSGPRReg(*MRI, Op.getReg()) && RegIdx < SQ_MAX_PGM_SGPRS) {
- // SGPRs including VCC, TTMPs and EXEC but excluding read-only scalar
- // sources like SRC_PRIVATE_BASE.
- Result.first = RegIdx + NUM_ALL_VGPRS;
- Result.second = Result.first + divideCeil(Size, 32);
- } else {
- return {-1, -1};
- }
-
- return Result;
-}
-
-void WaitcntBrackets::setScoreByInterval(RegInterval Interval,
- InstCounterType CntTy,
- unsigned Score) {
- for (int RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
- if (RegNo < NUM_ALL_VGPRS) {
- VgprUB = std::max(VgprUB, RegNo);
- VgprScores[CntTy][RegNo] = Score;
- } else if (RegNo < NUM_ALL_ALLOCATABLE) {
- SgprUB = std::max(SgprUB, RegNo - NUM_ALL_VGPRS);
- SgprScores[getSgprScoresIdx(CntTy)][RegNo - NUM_ALL_VGPRS] = Score;
- } else {
- assert(RegNo == SCC);
- SCCScore = Score;
- }
- }
-}
-
-void WaitcntBrackets::setScoreByOperand(const MachineInstr *MI,
- const MachineOperand &Op,
+void WaitcntBrackets::setScoreByOperand(const MachineOperand &Op,
InstCounterType CntTy, unsigned Score) {
- RegInterval Interval = getRegInterval(MI, Op);
- setScoreByInterval(Interval, CntTy, Score);
+ assert(Op.isReg());
+ setRegScore(Op.getReg().asMCReg(), CntTy, Score);
}
// Return true if the subtarget is one that enables Point Sample Acceleration
@@ -911,12 +868,12 @@ bool WaitcntBrackets::hasPointSampleAccel(const MachineInstr &MI) const {
// one that has outstanding writes to vmem-types different than VMEM_NOSAMPLER
// (this is the type that a point sample accelerated instruction effectively
// becomes)
-bool WaitcntBrackets::hasPointSamplePendingVmemTypes(
- const MachineInstr &MI, RegInterval Interval) const {
+bool WaitcntBrackets::hasPointSamplePendingVmemTypes(const MachineInstr &MI,
+ MCPhysReg Reg) const {
if (!hasPointSampleAccel(MI))
return false;
- return hasOtherPendingVmemTypes(Interval, VMEM_NOSAMPLER);
+ return hasOtherPendingVmemTypes(Reg, VMEM_NOSAMPLER);
}
void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
@@ -943,57 +900,52 @@ void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
// All GDS operations must protect their address register (same as
// export.)
if (const auto *AddrOp = TII->getNamedOperand(Inst, AMDGPU::OpName::addr))
- setScoreByOperand(&Inst, *AddrOp, EXP_CNT, CurrScore);
+ setScoreByOperand(*AddrOp, EXP_CNT, CurrScore);
if (Inst.mayStore()) {
if (const auto *Data0 =
TII->getNamedOperand(Inst, AMDGPU::OpName::data0))
- setScoreByOperand(&Inst, *Data0, EXP_CNT, CurrScore);
+ setScoreByOperand(*Data0, EXP_CNT, CurrScore);
if (const auto *Data1 =
TII->getNamedOperand(Inst, AMDGPU::OpName::data1))
- setScoreByOperand(&Inst, *Data1, EXP_CNT, CurrScore);
+ setScoreByOperand(*Data1, EXP_CNT, CurrScore);
} else if (SIInstrInfo::isAtomicRet(Inst) && !SIInstrInfo::isGWS(Inst) &&
Inst.getOpcode() != AMDGPU::DS_APPEND &&
Inst.getOpcode() != AMDGPU::DS_CONSUME &&
Inst.getOpcode() != AMDGPU::DS_ORDERED_COUNT) {
for (const MachineOperand &Op : Inst.all_uses()) {
if (TRI->isVectorRegister(*MRI, Op.getReg()))
- setScoreByOperand(&Inst, Op, EXP_CNT, CurrScore);
+ setScoreByOperand(Op, EXP_CNT, CurrScore);
}
}
} else if (TII->isFLAT(Inst)) {
if (Inst.mayStore()) {
- setScoreByOperand(&Inst,
- *TII->getNamedOperand(Inst, AMDGPU::OpName::data),
+ setScoreByOperand(*TII->getNamedOperand(Inst, AMDGPU::OpName::data),
EXP_CNT, CurrScore);
} else if (SIInstrInfo::isAtomicRet(Inst)) {
- setScoreByOperand(&Inst,
- *TII->getNamedOperand(Inst, AMDGPU::OpName::data),
+ setScoreByOperand(*TII->getNamedOperand(Inst, AMDGPU::OpName::data),
EXP_CNT, CurrScore);
}
} else if (TII->isMIMG(Inst)) {
if (Inst.mayStore()) {
- setScoreByOperand(&Inst, Inst.getOperand(0), EXP_CNT, CurrScore);
+ setScoreByOperand(Inst.getOperand(0), EXP_CNT, CurrScore);
} else if (SIInstrInfo::isAtomicRet(Inst)) {
- setScoreByOperand(&Inst,
- *TII->getNamedOperand(Inst, AMDGPU::OpName::data),
+ setScoreByOperand(*TII->getNamedOperand(Inst, AMDGPU::OpName::data),
EXP_CNT, CurrScore);
}
} else if (TII->isMTBUF(Inst)) {
if (Inst.mayStore())
- setScoreByOperand(&Inst, Inst.getOperand(0), EXP_CNT, CurrScore);
+ setScoreByOperand(Inst.getOperand(0), EXP_CNT, CurrScore);
} else if (TII->isMUBUF(Inst)) {
if (Inst.mayStore()) {
- setScoreByOperand(&Inst, Inst.getOperand(0), EXP_CNT, CurrScore);
+ setScoreByOperand(Inst.getOperand(0), EXP_CNT, CurrScore);
} else if (SIInstrInfo::isAtomicRet(Inst)) {
- setScoreByOperand(&Inst,
- *TII->getNamedOperand(Inst, AMDGPU::OpName::data),
+ setScoreByOperand(*TII->getNamedOperand(Inst, AMDGPU::OpName::data),
EXP_CNT, CurrScore);
}
} else if (TII->isLDSDIR(Inst)) {
// LDSDIR instructions attach the score to the destination.
- setScoreByOperand(&Inst,
- *TII->getNamedOperand(Inst, AMDGPU::OpName::vdst),
+ setScoreByOperand(*TII->getNamedOperand(Inst, AMDGPU::OpName::vdst),
EXP_CNT, CurrScore);
} else {
if (TII->isEXP(Inst)) {
@@ -1003,18 +955,18 @@ void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
// score.
for (MachineOperand &DefMO : Inst.all_defs()) {
if (TRI->isVGPR(*MRI, DefMO.getReg())) {
- setScoreByOperand(&Inst, DefMO, EXP_CNT, CurrScore);
+ setScoreByOperand(DefMO, EXP_CNT, CurrScore);
}
}
}
for (const MachineOperand &Op : Inst.all_uses()) {
if (TRI->isVectorRegister(*MRI, Op.getReg()))
- setScoreByOperand(&Inst, Op, EXP_CNT, CurrScore);
+ setScoreByOperand(Op, EXP_CNT, CurrScore);
}
}
} else if (T == X_CNT) {
for (const MachineOperand &Op : Inst.all_uses())
- setScoreByOperand(&Inst, Op, T, CurrScore);
+ setScoreByOperand(Op, T, CurrScore);
} else /* LGKM_CNT || EXP_CNT || VS_CNT || NUM_INST_CNTS */ {
// Match the score to the destination registers.
//
@@ -1026,9 +978,9 @@ void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
// Special cases where implicit register defs exists, such as M0 or VCC,
// but none with memory instructions.
for (const MachineOperand &Op : Inst.defs()) {
- RegInterval Interval = getRegInterval(&Inst, Op);
if (T == LOAD_CNT || T == SAMPLE_CNT || T == BVH_CNT) {
- if (Interval.first >= NUM_ALL_VGPRS)
+ if (!Context->TRI->isVectorRegister(*Context->MRI,
+ Op.getReg())) // TODO: add wrapper
continue;
if (updateVMCntOnly(Inst)) {
// updateVMCntOnly should only leave us with VGPRs
@@ -1041,11 +993,11 @@ void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
// this with another potential dependency
if (hasPointSampleAccel(Inst))
TypesMask |= 1 << VMEM_NOSAMPLER;
- for (int RegNo = Interval.first; RegNo < Interval.second; ++RegNo)
- VgprVmemTypes[RegNo] |= TypesMask;
+ for (MCRegUnit RU : regunits(Op.getReg().asMCReg()))
+ VMem[RU].VMEMTypes |= TypesMask;
}
}
- setScoreByInterval(Interval, T, CurrScore);
+ setScoreByOperand(Op, T, CurrScore);
}
if (Inst.mayStore() &&
(TII->isDS(Inst) || TII->mayWriteLDSThroughDMA(Inst))) {
@@ -1076,19 +1028,19...
[truncated]
You can get the unit's "root" register and test that. See MCRegUnitRootIterator. AMDGPU does not use ad hoc aliasing so I think every unit should have exactly one root.
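For reference, that approach would look roughly like the sketch below; isSGPRReg is the existing SIRegisterInfo helper, while isSGPRUnit is just an illustrative name for the wrapper, not code from this patch:

    // Sketch: classify a register unit by looking at its root register.
    // AMDGPU has no ad hoc aliasing, so each unit should have exactly one root.
    static bool isSGPRUnit(MCRegUnit RU, const SIRegisterInfo &TRI,
                           const MachineRegisterInfo &MRI) {
      for (MCRegUnitRootIterator Root(RU, &TRI); Root.isValid(); ++Root)
        return TRI.isSGPRReg(MRI, *Root);
      return false;
    }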
Is it worth adding this (potentially expensive?) query just so I can merge a few methods together, though? Perhaps I can try it in another diff on top of this? This diff already changes a lot of things; I want to make sure it doesn't become too big to review or debug in case of an issue.
Ping
ritter-x2a left a comment:
It would be nice to see the impact of the DenseMap vs the vector (or lack thereof) in a compile-time regression tracker, but it doesn't seem like we have one for AMDGPU workloads working right now.
The reasoning in the commit message sounds reasonable to me, though, so LGTM.
Why would this get memcpied, and how big is it? I have a hard time believing this is big enough to be a problem.
What we had in VgprScores was a [NUM_INST_CNTS][NUM_ALL_VGPRS] array of unsigned. That's (roughly) 4B * 10 * 2048 = 81920B of data per basic block just for VGPRs. Most of these entries will never be used; it's just wasted memory. Note that the main reason for this patch is streamlining the code of InsertWaitCnt; moving the tracking to a DenseMap is a secondary benefit.
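Back-of-the-envelope version of that math (the counter count of 10 is an assumed order of magnitude, and only the VGPR table is counted):

    // Old layout: one unsigned score per (counter, VGPR slot), zero-initialized
    // for every WaitcntBrackets instance.
    constexpr size_t NumInstCnts = 10;    // assumed number of counters
    constexpr size_t NumVgprSlots = 2048; // SQ_MAX_PGM_VGPRS
    constexpr size_t OldVgprTableBytes =
        sizeof(unsigned) * NumInstCnts * NumVgprSlots; // 4 * 10 * 2048 = 81920 B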
It's memcpyed here to save the WaitCntBracket state at the end of a basic block, to be used as the initial state for a successor block.
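Roughly, the propagation looks like the sketch below; the container, function, and field names are illustrative, not the pass's actual data structures:

    // The bracket state at the end of a block seeds each successor's incoming
    // state (copied on the first edge, merged at join points). With the old
    // fixed-size arrays, that copy was effectively a large memcpy per edge.
    DenseMap<const MachineBasicBlock *, WaitcntBrackets> Incoming;

    void propagateToSuccessors(const MachineBasicBlock &MBB,
                               const WaitcntBrackets &EndState) {
      for (const MachineBasicBlock *Succ : MBB.successors()) {
        auto [It, Inserted] = Incoming.try_emplace(Succ, EndState); // copy
        if (!Inserted)
          It->second.merge(EndState); // join point: merge, don't overwrite
      }
    }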
Maybe this should use SparseBitVector?
It stores counters, not single bits. I think using a SparseBitVector wouldn't work here.
    using VMEMID = uint32_t;

    enum : VMEMID {
      TRACKINGID_RANGE_LEN = (1 << 16),
Is this just an arbitrary value larger than MAX_REGUNITS? Can you assert somewhere that it is >= TRI.getNumRegUnits()?
Added in the ctor of WaitcntBrackets. I also added a comment to clarify the value is arbitrary and can be changed if more is needed.
      LDSDMA_BEGIN = REGUNITS_END,
      LDSDMA_END = LDSDMA_BEGIN + TRACKINGID_RANGE_LEN,

      NUM_LDSDMA = TRACKINGID_RANGE_LEN
This is a lot more than the 9 that we used to track! But I guess there is no downside, now that we are using DenseMap instead of arrays?
Yes, it's a u32 and each "slice" is u16, so we can store 65536 slices and we only use 2.
It's a bit overkill; we can always re-partition the VMEMID if we ever have an issue with it. It's an implementation detail.
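As a standalone sketch of that partitioning (toVMEMID is the name used in the patch; toLDSDMAID is only illustrative):

    #include <cassert>
    #include <cstdint>

    using VMEMID = uint32_t;

    enum : VMEMID {
      TRACKINGID_RANGE_LEN = 1u << 16, // arbitrary u16 slice, >= number of reg units
      REGUNITS_BEGIN = 0,
      REGUNITS_END = REGUNITS_BEGIN + TRACKINGID_RANGE_LEN,
      LDSDMA_BEGIN = REGUNITS_END, // slot 0 is the "common" LDS DMA entry
      LDSDMA_END = LDSDMA_BEGIN + TRACKINGID_RANGE_LEN,
    };

    // Register units occupy the first u16 slice.
    inline VMEMID toVMEMID(unsigned RegUnit) {
      assert(RegUnit < TRACKINGID_RANGE_LEN && "reg unit outside tracked range");
      return REGUNITS_BEGIN + RegUnit;
    }

    // LDS DMA slots occupy the second slice.
    inline VMEMID toLDSDMAID(unsigned Slot) {
      assert(Slot < TRACKINGID_RANGE_LEN && "LDS DMA slot outside tracked range");
      return LDSDMA_BEGIN + Slot;
    }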
    auto It = SGPRs.find(RU);
    return It != SGPRs.end() ? It->second.Scores[getSgprScoresIdx(T)] : 0;
Isn't this just SGPRs.lookup?
We could, but it'd create a temporary value just to extract a zero out of it. Right now Scores is just 2 ints so it wouldn't be a problem, but I prefer to use the find pattern to be consistent with the rest, and also to avoid the temporary in case we add more to Scores over time and the temporary becomes big.
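A self-contained illustration of that tradeoff (SGPRInfo here is just the two-score struct from the patch, with a plain unsigned key standing in for MCRegUnit):

    #include "llvm/ADT/DenseMap.h"

    struct SGPRInfo {
      unsigned Scores[2] = {0};
    };

    unsigned getScore(const llvm::DenseMap<unsigned, SGPRInfo> &SGPRs,
                      unsigned RU, unsigned Idx) {
      // lookup() returns the mapped value by value, default-constructing a
      // temporary SGPRInfo when the key is absent:
      //   unsigned S = SGPRs.lookup(RU).Scores[Idx];
      // find() reads the stored value in place and falls back to 0 if missing:
      auto It = SGPRs.find(RU);
      return It != SGPRs.end() ? It->second.Scores[Idx] : 0;
    }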
    void determineWaitForPhysReg(InstCounterType T, MCPhysReg Reg,
                                 AMDGPU::Waitcnt &Wait) const;
    void determineWaitForLDSDMA(InstCounterType T, VMEMID TID,
                                AMDGPU::Waitcnt &Wait) const;
Could these be two overloads of determineWait, or does that cause problems because MCPhysReg and VMEMID have more or less the same underlying type?
Yeah they're both integers. I could make the VMEMID an enum class but that'd add a bunch of casts all over the place. It's a tradeoff
    // The score tracking logic is fragmented as follows:
    // - VMem: VGPR RegUnits and LDS DMA IDs, see the VMEMID encoding.
    // - SGPRs: SGPR RegUnits
    // - SCC
Why can't SCC be handled like any other SGPR?
IIRC, SCC is not an SGPR. At least it's not part of the SGPR reg classes.
This patch aims to be NFCI, so I didn't try hard to fix things like these because I didn't want to bloat the patch too much. I want to come back to the pass and take another look once this lands so I added a TODO.
SCC isn't an SGPR; it's not general purpose and not allocatable
      }
    }

    void WaitcntBrackets::setScoreByOperand(const MachineInstr *MI,
If you wanted you could precommit the change to remove MI here, since it is trivially unused here and in getRegInterval. That would avoid some churn in this PR.
    // entry, which is updated for all LDS DMA operations encountered.
    // Specific LDS DMA IDs start at LDSDMA_BEGIN + 1.
    LDSDMA_BEGIN = REGUNITS_END,
    LDSDMA_END = LDSDMA_BEGIN + TRACKINGID_RANGE_LEN,
I'm not sure LDSDMA_END is needed? In a couple of places you use it in assertions, but there is nothing that would prevent those assertions from failing if LDS IDs happened to climb high enough.
determineWaitForLDSDMA would fail because it checks the ID is in the right range
Right, it would fail an assertion, but you should not be writing assertions that can fail on valid user input, and if I understand correctly then a program that accesses enough different LDS allocations will fail the assertion.
I added a check so we can't allocate IDs above the limit, like we had before
    for (int J = 0; J <= VgprUB; J++) {
      unsigned RegScore = getRegScore(J, T);
    for (auto &[ID, Info] : VMem) {
Is this going to print the entries in non-deterministic order? That could be annoying, although it is only debug output.
I sorted the keys, I think quality of debug output is important.
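Something along these lines, for example (a sketch, not the patch's exact print code):

    // Print DenseMap entries in key order so debug output is stable across runs.
    SmallVector<VMEMID, 16> Keys;
    for (const auto &Entry : VMem)
      Keys.push_back(Entry.first);
    llvm::sort(Keys);
    for (VMEMID ID : Keys)
      OS << "  vmem " << ID << ": " << VMem.find(ID)->second.Scores[T] << '\n';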
One more thought: is there a risk of the DenseMaps growing ever larger because we never remove entries from them?
I added a method to purge the map. The map can't grow huge in an uncontrolled way: the worst case (if we don't purge it) is that we end up with one entry for each register unit used across the function in every WaitcntBrackets instance. I collected statistics locally using an assert in the destructor of WaitcntBrackets.
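Conceptually, purging could look something like the sketch below; the method name and the staleness predicate are illustrative, not the patch's exact code:

    // Drop VMem entries whose scores can no longer require a wait, i.e. every
    // counter's score is already at or below that counter's lower bound.
    void WaitcntBrackets::purgeResolvedEntries() {
      SmallVector<VMEMID, 8> Dead;
      for (const auto &[ID, Info] : VMem) {
        bool Pending = false;
        for (auto T : inst_counter_types())
          Pending |= Info.Scores[T] > getScoreLB(T);
        if (!Pending)
          Dead.push_back(ID);
      }
      for (VMEMID ID : Dead)
        VMem.erase(ID);
    }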
    } else if (TRI->isVectorRegister(*Context->MRI, Reg)) {
      for (MCRegUnit RU : regunits(Reg))
        VMem[toVMEMID(RU)].Scores[T] = Val;
    } else if (TRI->isSGPRReg(*Context->MRI, Reg)) {
Can't this be else if (isSGPR()) else { }? There aren't registers that are something else?
Not that are handled right now but I'll add an unreachable there just in case.
Yes, I think the assert sounds useful.
I added a debug-only destructor that checks the maps.
    ; There are 8 pseudo registers defined to track LDS DMA dependencies.

    define amdgpu_kernel void @buffer_load_lds_dword_10_arrays(<4 x i32> %rsrc, i32 %i1, i32 %i2, i32 %i3, i32 %i4, i32 %i5, i32 %i6, i32 %i7, i32 %i8, i32 %i9, ptr addrspace(1) %out) {
#170660 should fix this. I'll rebase once it lands.
Clean up the tracking logic to rely on register units. The pass was already "reinventing" the concept just to deal with 16 bit registers. There are no test changes, functionality is the same, except we can now track more LDS DMA IDs if we need it. The debug prints also changed a bit because we now talk in terms of register units.

This also changes the tracking to use a DenseMap instead of a massive fixed size table. This trades a bit of access speed for a smaller memory footprint. Allocating and memsetting a huge table to zero caused a non-negligible performance impact (I've observed up to 50% of the time in the pass spent in the `memcpy` built-in). I also think we don't access these often enough to really justify using a vector. We do a few accesses per instruction, but not much more. In a huge 120MB LL file, I can barely see the trace of the DenseMap accesses.

This still isn't as clean as I'd like it to be though. There is a mix of "VMEMID", "LDS DMA ID", "SGPR RegUnit" and "PhysReg" in the API of WaitCntBrackets. There is no type safety to avoid mix-ups as these are all integers. We could add another layer of abstraction on top, but I feel like it's going to add too much code/boilerplate for such a small issue.
jayfoad left a comment:
LGTM, thanks for your patience.
    if (Slot)
      setRegScore(FIRST_LDS_VGPR, T, CurrScore);
    setVMemScore(LDSDMA_BEGIN, T, CurrScore);
    if (Slot && Slot < NUM_LDSDMA)
Just a question: I think Slot can be zero here but only if MemOp does not have a suitable MMO with AA info. Is that case still handled conservatively correctly?
Yes, we always set the LDSDMA_BEGIN slot no matter what, so we can fall back to that if needed.
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/123/builds/31920

The pass was already "reinventing" the concept just to deal with 16 bit registers. Clean up the entire tracking logic to only use register units.
There are no test changes because functionality didn't change, except:
- We can now track more LDS DMA IDs if we need it (up to 1 << 16).
- The debug prints changed a bit because we now talk in terms of register units.

This also changes the tracking to use a DenseMap instead of a massive fixed size table. This trades a bit of access speed for a smaller memory footprint. Allocating and memsetting a huge table to zero caused a non-negligible performance impact (I've observed up to 50% of the time in the pass spent in the memcpy built-in on a big test file). I also think we don't access these often enough to really justify using a vector. We do a few accesses per instruction, but not much more. In a huge 120MB LL file, I can barely see the trace of the DenseMap accesses.
I think this can be cleaned up a bit more. I want to see if we can remove PhysRegs from the WaitCnt bracket APIs, so it only reasons in terms of RegUnits and nothing else.
One issue I've had with that is that I'm not sure how to tell if an RU is an SGPR or VGPR. I don't think RUs are always considered registers, so I can't really fetch their reg class.