[AMDGPU][SIInsertWaitCnts] Use RegUnits-based tracking #162077
Conversation
✅ With the latest revision this PR passed the C/C++ code formatter.
@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Changes

The pass was already "reinventing" the concept just to deal with 16 bit registers. Clean up the entire tracking logic to only use register units. There are no test changes because functionality didn't change, except:

- We can now track more LDS DMA IDs if we need it (up to 1 << 16).
- The debug prints changed a bit because we now talk in terms of register units.

This also changes the tracking to use a DenseMap instead of a massive fixed size table. This trades a bit of access speed for a smaller memory footprint. Allocating and memsetting a huge table to zero caused a non-negligible performance impact (I've observed up to 50% of the time in the pass spent in the memcpy built-in on a big test file). I also think we don't access these often enough to really justify using a vector. We do a few accesses per instruction, but not much more. In a huge 120MB LL file, I can barely see the trace of the DenseMap accesses.

I think this can be cleaned up a bit more. I want to see if we can remove PhysRegs from the WaitCnt bracket APIs, so it only reasons in terms of RegUnits and nothing else.

Patch is 35.36 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/162077.diff

1 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index 76bfce8c0f6f9..90a5cd4e87ae7 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -97,7 +97,27 @@ auto inst_counter_types(InstCounterType MaxCounter = NUM_INST_CNTS) {
return enum_seq(LOAD_CNT, MaxCounter);
}
-using RegInterval = std::pair<int, int>;
+/// Integer IDs used to track vector memory locations we may have to wait on.
+/// Encoded as u16 chunks:
+///
+/// [0, MAX_REGUNITS ): MCRegUnit
+/// [FIRST_LDSDMA, LAST_LDSDMA ): LDS DMA IDs
+using VMEMID = uint32_t;
+
+enum : VMEMID {
+ TRACKINGID_RANGE_LEN = (1 << 16),
+
+ REGUNITS_BEGIN = 0,
+ REGUNITS_END = REGUNITS_BEGIN + TRACKINGID_RANGE_LEN,
+
+ // Note for LDSDMA: LDSDMA_BEGIN corresponds to the "common"
+ // entry, which is updated for all LDS DMA operations encountered.
+ // Specific LDS DMA IDs start at LDSDMA_BEGIN + 1.
+ LDSDMA_BEGIN = REGUNITS_END,
+ LDSDMA_END = LDSDMA_BEGIN + TRACKINGID_RANGE_LEN,
+
+ NUM_LDSDMA = TRACKINGID_RANGE_LEN
+};
struct HardwareLimits {
unsigned LoadcntMax; // Corresponds to VMcnt prior to gfx12.
@@ -146,30 +166,6 @@ static constexpr StringLiteral WaitEventTypeName[] = {
#undef AMDGPU_EVENT_NAME
// clang-format on
-// The mapping is:
-// 0 .. SQ_MAX_PGM_VGPRS-1 real VGPRs
-// SQ_MAX_PGM_VGPRS .. NUM_ALL_VGPRS-1 extra VGPR-like slots
-// NUM_ALL_VGPRS .. NUM_ALL_VGPRS+SQ_MAX_PGM_SGPRS-1 real SGPRs
-// NUM_ALL_VGPRS+SQ_MAX_PGM_SGPRS .. SCC
-// We reserve a fixed number of VGPR slots in the scoring tables for
-// special tokens like SCMEM_LDS (needed for buffer load to LDS).
-enum RegisterMapping {
- SQ_MAX_PGM_VGPRS = 2048, // Maximum programmable VGPRs across all targets.
- AGPR_OFFSET = 512, // Maximum programmable ArchVGPRs across all targets.
- SQ_MAX_PGM_SGPRS = 128, // Maximum programmable SGPRs across all targets.
- // Artificial register slots to track LDS writes into specific LDS locations
- // if a location is known. When slots are exhausted or location is
- // unknown use the first slot. The first slot is also always updated in
- // addition to known location's slot to properly generate waits if dependent
- // instruction's location is unknown.
- FIRST_LDS_VGPR = SQ_MAX_PGM_VGPRS, // Extra slots for LDS stores.
- NUM_LDS_VGPRS = 9, // One more than the stores we track.
- NUM_ALL_VGPRS = SQ_MAX_PGM_VGPRS + NUM_LDS_VGPRS, // Where SGPRs start.
- NUM_ALL_ALLOCATABLE = NUM_ALL_VGPRS + SQ_MAX_PGM_SGPRS,
- // Remaining non-allocatable registers
- SCC = NUM_ALL_ALLOCATABLE
-};
-
// Enumerate different types of result-returning VMEM operations. Although
// s_waitcnt orders them all with a single vmcnt counter, in the absence of
// s_waitcnt only instructions of the same VmemType are guaranteed to write
@@ -616,32 +612,26 @@ class WaitcntBrackets {
return getScoreUB(T) - getScoreLB(T);
}
- unsigned getRegScore(int GprNo, InstCounterType T) const {
- if (GprNo < NUM_ALL_VGPRS)
- return VgprScores[T][GprNo];
-
- if (GprNo < NUM_ALL_ALLOCATABLE)
- return SgprScores[getSgprScoresIdx(T)][GprNo - NUM_ALL_VGPRS];
+ unsigned getSGPRScore(MCRegUnit RU, InstCounterType T) const {
+ auto It = SGPRs.find(RU);
+ return (It != SGPRs.end()) ? It->second.Scores[getSgprScoresIdx(T)] : 0;
+ }
- assert(GprNo == SCC);
- return SCCScore;
+ unsigned getVMemScore(VMEMID TID, InstCounterType T) const {
+ auto It = VMem.find(TID);
+ return (It != VMem.end()) ? It->second.Scores[T] : 0;
}
bool merge(const WaitcntBrackets &Other);
- RegInterval getRegInterval(const MachineInstr *MI,
- const MachineOperand &Op) const;
-
bool counterOutOfOrder(InstCounterType T) const;
void simplifyWaitcnt(AMDGPU::Waitcnt &Wait) const;
void simplifyWaitcnt(InstCounterType T, unsigned &Count) const;
- void determineWait(InstCounterType T, RegInterval Interval,
- AMDGPU::Waitcnt &Wait) const;
- void determineWait(InstCounterType T, int RegNo,
- AMDGPU::Waitcnt &Wait) const {
- determineWait(T, {RegNo, RegNo + 1}, Wait);
- }
+ void determineWaitForPhysReg(InstCounterType T, MCPhysReg Reg,
+ AMDGPU::Waitcnt &Wait) const;
+ void determineWaitForLDSDMA(InstCounterType T, VMEMID TID,
+ AMDGPU::Waitcnt &Wait) const;
void tryClearSCCWriteEvent(MachineInstr *Inst);
void applyWaitcnt(const AMDGPU::Waitcnt &Wait);
@@ -690,19 +680,19 @@ class WaitcntBrackets {
// Return true if there might be pending writes to the vgpr-interval by VMEM
// instructions with types different from V.
- bool hasOtherPendingVmemTypes(RegInterval Interval, VmemType V) const {
- for (int RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
- assert(RegNo < NUM_ALL_VGPRS);
- if (VgprVmemTypes[RegNo] & ~(1 << V))
+ bool hasOtherPendingVmemTypes(MCPhysReg Reg, VmemType V) const {
+ for (MCRegUnit RU : regunits(Reg)) {
+ auto It = VMem.find(RU);
+ if (It != VMem.end() && (It->second.VMEMTypes & ~(1 << V)))
return true;
}
return false;
}
- void clearVgprVmemTypes(RegInterval Interval) {
- for (int RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
- assert(RegNo < NUM_ALL_VGPRS);
- VgprVmemTypes[RegNo] = 0;
+ void clearVgprVmemTypes(MCPhysReg Reg) {
+ for (MCRegUnit RU : regunits(Reg)) {
+ if (auto It = VMem.find(RU); It != VMem.end())
+ It->second.VMEMTypes = 0;
}
}
@@ -718,7 +708,7 @@ class WaitcntBrackets {
bool hasPointSampleAccel(const MachineInstr &MI) const;
bool hasPointSamplePendingVmemTypes(const MachineInstr &MI,
- RegInterval Interval) const;
+ MCPhysReg RU) const;
void print(raw_ostream &) const;
void dump() const { print(dbgs()); }
@@ -730,9 +720,24 @@ class WaitcntBrackets {
unsigned MyShift;
unsigned OtherShift;
};
+
+ void determineWaitForScore(InstCounterType T, unsigned Score,
+ AMDGPU::Waitcnt &Wait) const;
+
static bool mergeScore(const MergeInfo &M, unsigned &Score,
unsigned OtherScore);
+ iterator_range<MCRegUnitIterator> regunits(MCPhysReg Reg) const {
+ assert(Reg != AMDGPU::SCC && "Shouldn't be used on SCC");
+ const TargetRegisterClass *RC = Context->TRI->getPhysRegBaseClass(Reg);
+ unsigned Size = Context->TRI->getRegSizeInBits(*RC);
+ if (!Context->TRI->isInAllocatableClass(Reg))
+ return {{}, {}};
+ if (Size == 16 && Context->ST->hasD16Writes32BitVgpr())
+ Reg = Context->TRI->get32BitRegister(Reg);
+ return Context->TRI->regunits(Reg);
+ }
+
void setScoreLB(InstCounterType T, unsigned Val) {
assert(T < NUM_INST_CNTS);
ScoreLBs[T] = Val;
@@ -749,15 +754,26 @@ class WaitcntBrackets {
ScoreLBs[EXP_CNT] = ScoreUBs[EXP_CNT] - Context->getWaitCountMax(EXP_CNT);
}
- void setRegScore(int GprNo, InstCounterType T, unsigned Val) {
- setScoreByInterval({GprNo, GprNo + 1}, T, Val);
+ void setRegScore(MCPhysReg Reg, InstCounterType T, unsigned Val) {
+ const SIRegisterInfo *TRI = Context->TRI;
+ if (Reg == AMDGPU::SCC) {
+ SCCScore = Val;
+ } else if (TRI->isVectorRegister(*Context->MRI, Reg)) {
+ for (MCRegUnit RU : regunits(Reg))
+ VMem[RU].Scores[T] = Val;
+ } else if (TRI->isSGPRReg(*Context->MRI, Reg)) {
+ auto STy = getSgprScoresIdx(T);
+ for (MCRegUnit RU : regunits(Reg))
+ SGPRs[RU].Scores[STy] = Val;
+ }
}
- void setScoreByInterval(RegInterval Interval, InstCounterType CntTy,
- unsigned Score);
+ void setVMemScore(VMEMID TID, InstCounterType T, unsigned Val) {
+ VMem[TID].Scores[T] = Val;
+ }
- void setScoreByOperand(const MachineInstr *MI, const MachineOperand &Op,
- InstCounterType CntTy, unsigned Val);
+ void setScoreByOperand(const MachineOperand &Op, InstCounterType CntTy,
+ unsigned Val);
const SIInsertWaitcnts *Context;
@@ -768,26 +784,39 @@ class WaitcntBrackets {
unsigned LastFlat[NUM_INST_CNTS] = {0};
// Remember the last GDS operation.
unsigned LastGDS = 0;
- // wait_cnt scores for every vgpr.
- // Keep track of the VgprUB and SgprUB to make merge at join efficient.
- int VgprUB = -1;
- int SgprUB = -1;
- unsigned VgprScores[NUM_INST_CNTS][NUM_ALL_VGPRS] = {{0}};
- // Wait cnt scores for every sgpr, the DS_CNT (corresponding to LGKMcnt
- // pre-gfx12) or KM_CNT (gfx12+ only), and X_CNT (gfx1250) are relevant.
- // Row 0 represents the score for either DS_CNT or KM_CNT and row 1 keeps the
- // X_CNT score.
- unsigned SgprScores[2][SQ_MAX_PGM_SGPRS] = {{0}};
+
+ /// The score tracking logic is fragmented as follows:
+ /// - VMem: VGPR RegUnits and LDS DMA IDs, see the VMEMID encoding.
+ /// - SGPRs: SGPR RegUnits
+ /// - SCC
+
+ struct VGPRInfo {
+ // Scores for all instruction counters.
+ unsigned Scores[NUM_INST_CNTS] = {0};
+ // Bitmask of the VmemTypes of VMEM instructions for this VGPR.
+ unsigned VMEMTypes = 0;
+ };
+
+ struct SGPRInfo {
+ // Wait cnt scores for every sgpr, the DS_CNT (corresponding to LGKMcnt
+ // pre-gfx12) or KM_CNT (gfx12+ only), and X_CNT (gfx1250) are relevant.
+ // Row 0 represents the score for either DS_CNT or KM_CNT and row 1 keeps
+ // the X_CNT score.
+ unsigned Scores[2] = {0};
+ };
+
+ DenseMap<VMEMID, VGPRInfo> VMem; // VGPR + LDS DMA
+ DenseMap<MCRegUnit, SGPRInfo> SGPRs;
+
// Reg score for SCC.
unsigned SCCScore = 0;
// The unique instruction that has an SCC write pending, if there is one.
const MachineInstr *PendingSCCWrite = nullptr;
- // Bitmask of the VmemTypes of VMEM instructions that might have a pending
- // write to each vgpr.
- unsigned char VgprVmemTypes[NUM_ALL_VGPRS] = {0};
+
// Store representative LDS DMA operations. The only useful info here is
// alias info. One store is kept per unique AAInfo.
- SmallVector<const MachineInstr *, NUM_LDS_VGPRS - 1> LDSDMAStores;
+ // Entry zero is the "generic" entry that applies to all LDSDMA stores.
+ SmallVector<const MachineInstr *> LDSDMAStores;
};
class SIInsertWaitcntsLegacy : public MachineFunctionPass {
@@ -813,82 +842,10 @@ class SIInsertWaitcntsLegacy : public MachineFunctionPass {
} // end anonymous namespace
-RegInterval WaitcntBrackets::getRegInterval(const MachineInstr *MI,
- const MachineOperand &Op) const {
- if (Op.getReg() == AMDGPU::SCC)
- return {SCC, SCC + 1};
-
- const SIRegisterInfo *TRI = Context->TRI;
- const MachineRegisterInfo *MRI = Context->MRI;
-
- if (!TRI->isInAllocatableClass(Op.getReg()))
- return {-1, -1};
-
- // A use via a PW operand does not need a waitcnt.
- // A partial write is not a WAW.
- assert(!Op.getSubReg() || !Op.isUndef());
-
- RegInterval Result;
-
- MCRegister MCReg = AMDGPU::getMCReg(Op.getReg(), *Context->ST);
- unsigned RegIdx = TRI->getHWRegIndex(MCReg);
-
- const TargetRegisterClass *RC = TRI->getPhysRegBaseClass(Op.getReg());
- unsigned Size = TRI->getRegSizeInBits(*RC);
-
- // AGPRs/VGPRs are tracked every 16 bits, SGPRs by 32 bits
- if (TRI->isVectorRegister(*MRI, Op.getReg())) {
- unsigned Reg = RegIdx << 1 | (AMDGPU::isHi16Reg(MCReg, *TRI) ? 1 : 0);
- assert(!Context->ST->hasMAIInsts() || Reg < AGPR_OFFSET);
- Result.first = Reg;
- if (TRI->isAGPR(*MRI, Op.getReg()))
- Result.first += AGPR_OFFSET;
- assert(Result.first >= 0 && Result.first < SQ_MAX_PGM_VGPRS);
- assert(Size % 16 == 0);
- Result.second = Result.first + (Size / 16);
-
- if (Size == 16 && Context->ST->hasD16Writes32BitVgpr()) {
- // Regardless of which lo16/hi16 is used, consider the full 32-bit
- // register used.
- if (AMDGPU::isHi16Reg(MCReg, *TRI))
- Result.first -= 1;
- else
- Result.second += 1;
- }
- } else if (TRI->isSGPRReg(*MRI, Op.getReg()) && RegIdx < SQ_MAX_PGM_SGPRS) {
- // SGPRs including VCC, TTMPs and EXEC but excluding read-only scalar
- // sources like SRC_PRIVATE_BASE.
- Result.first = RegIdx + NUM_ALL_VGPRS;
- Result.second = Result.first + divideCeil(Size, 32);
- } else {
- return {-1, -1};
- }
-
- return Result;
-}
-
-void WaitcntBrackets::setScoreByInterval(RegInterval Interval,
- InstCounterType CntTy,
- unsigned Score) {
- for (int RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
- if (RegNo < NUM_ALL_VGPRS) {
- VgprUB = std::max(VgprUB, RegNo);
- VgprScores[CntTy][RegNo] = Score;
- } else if (RegNo < NUM_ALL_ALLOCATABLE) {
- SgprUB = std::max(SgprUB, RegNo - NUM_ALL_VGPRS);
- SgprScores[getSgprScoresIdx(CntTy)][RegNo - NUM_ALL_VGPRS] = Score;
- } else {
- assert(RegNo == SCC);
- SCCScore = Score;
- }
- }
-}
-
-void WaitcntBrackets::setScoreByOperand(const MachineInstr *MI,
- const MachineOperand &Op,
+void WaitcntBrackets::setScoreByOperand(const MachineOperand &Op,
InstCounterType CntTy, unsigned Score) {
- RegInterval Interval = getRegInterval(MI, Op);
- setScoreByInterval(Interval, CntTy, Score);
+ assert(Op.isReg());
+ setRegScore(Op.getReg().asMCReg(), CntTy, Score);
}
// Return true if the subtarget is one that enables Point Sample Acceleration
@@ -911,12 +868,12 @@ bool WaitcntBrackets::hasPointSampleAccel(const MachineInstr &MI) const {
// one that has outstanding writes to vmem-types different than VMEM_NOSAMPLER
// (this is the type that a point sample accelerated instruction effectively
// becomes)
-bool WaitcntBrackets::hasPointSamplePendingVmemTypes(
- const MachineInstr &MI, RegInterval Interval) const {
+bool WaitcntBrackets::hasPointSamplePendingVmemTypes(const MachineInstr &MI,
+ MCPhysReg Reg) const {
if (!hasPointSampleAccel(MI))
return false;
- return hasOtherPendingVmemTypes(Interval, VMEM_NOSAMPLER);
+ return hasOtherPendingVmemTypes(Reg, VMEM_NOSAMPLER);
}
void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
@@ -943,57 +900,52 @@ void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
// All GDS operations must protect their address register (same as
// export.)
if (const auto *AddrOp = TII->getNamedOperand(Inst, AMDGPU::OpName::addr))
- setScoreByOperand(&Inst, *AddrOp, EXP_CNT, CurrScore);
+ setScoreByOperand(*AddrOp, EXP_CNT, CurrScore);
if (Inst.mayStore()) {
if (const auto *Data0 =
TII->getNamedOperand(Inst, AMDGPU::OpName::data0))
- setScoreByOperand(&Inst, *Data0, EXP_CNT, CurrScore);
+ setScoreByOperand(*Data0, EXP_CNT, CurrScore);
if (const auto *Data1 =
TII->getNamedOperand(Inst, AMDGPU::OpName::data1))
- setScoreByOperand(&Inst, *Data1, EXP_CNT, CurrScore);
+ setScoreByOperand(*Data1, EXP_CNT, CurrScore);
} else if (SIInstrInfo::isAtomicRet(Inst) && !SIInstrInfo::isGWS(Inst) &&
Inst.getOpcode() != AMDGPU::DS_APPEND &&
Inst.getOpcode() != AMDGPU::DS_CONSUME &&
Inst.getOpcode() != AMDGPU::DS_ORDERED_COUNT) {
for (const MachineOperand &Op : Inst.all_uses()) {
if (TRI->isVectorRegister(*MRI, Op.getReg()))
- setScoreByOperand(&Inst, Op, EXP_CNT, CurrScore);
+ setScoreByOperand(Op, EXP_CNT, CurrScore);
}
}
} else if (TII->isFLAT(Inst)) {
if (Inst.mayStore()) {
- setScoreByOperand(&Inst,
- *TII->getNamedOperand(Inst, AMDGPU::OpName::data),
+ setScoreByOperand(*TII->getNamedOperand(Inst, AMDGPU::OpName::data),
EXP_CNT, CurrScore);
} else if (SIInstrInfo::isAtomicRet(Inst)) {
- setScoreByOperand(&Inst,
- *TII->getNamedOperand(Inst, AMDGPU::OpName::data),
+ setScoreByOperand(*TII->getNamedOperand(Inst, AMDGPU::OpName::data),
EXP_CNT, CurrScore);
}
} else if (TII->isMIMG(Inst)) {
if (Inst.mayStore()) {
- setScoreByOperand(&Inst, Inst.getOperand(0), EXP_CNT, CurrScore);
+ setScoreByOperand(Inst.getOperand(0), EXP_CNT, CurrScore);
} else if (SIInstrInfo::isAtomicRet(Inst)) {
- setScoreByOperand(&Inst,
- *TII->getNamedOperand(Inst, AMDGPU::OpName::data),
+ setScoreByOperand(*TII->getNamedOperand(Inst, AMDGPU::OpName::data),
EXP_CNT, CurrScore);
}
} else if (TII->isMTBUF(Inst)) {
if (Inst.mayStore())
- setScoreByOperand(&Inst, Inst.getOperand(0), EXP_CNT, CurrScore);
+ setScoreByOperand(Inst.getOperand(0), EXP_CNT, CurrScore);
} else if (TII->isMUBUF(Inst)) {
if (Inst.mayStore()) {
- setScoreByOperand(&Inst, Inst.getOperand(0), EXP_CNT, CurrScore);
+ setScoreByOperand(Inst.getOperand(0), EXP_CNT, CurrScore);
} else if (SIInstrInfo::isAtomicRet(Inst)) {
- setScoreByOperand(&Inst,
- *TII->getNamedOperand(Inst, AMDGPU::OpName::data),
+ setScoreByOperand(*TII->getNamedOperand(Inst, AMDGPU::OpName::data),
EXP_CNT, CurrScore);
}
} else if (TII->isLDSDIR(Inst)) {
// LDSDIR instructions attach the score to the destination.
- setScoreByOperand(&Inst,
- *TII->getNamedOperand(Inst, AMDGPU::OpName::vdst),
+ setScoreByOperand(*TII->getNamedOperand(Inst, AMDGPU::OpName::vdst),
EXP_CNT, CurrScore);
} else {
if (TII->isEXP(Inst)) {
@@ -1003,18 +955,18 @@ void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
// score.
for (MachineOperand &DefMO : Inst.all_defs()) {
if (TRI->isVGPR(*MRI, DefMO.getReg())) {
- setScoreByOperand(&Inst, DefMO, EXP_CNT, CurrScore);
+ setScoreByOperand(DefMO, EXP_CNT, CurrScore);
}
}
}
for (const MachineOperand &Op : Inst.all_uses()) {
if (TRI->isVectorRegister(*MRI, Op.getReg()))
- setScoreByOperand(&Inst, Op, EXP_CNT, CurrScore);
+ setScoreByOperand(Op, EXP_CNT, CurrScore);
}
}
} else if (T == X_CNT) {
for (const MachineOperand &Op : Inst.all_uses())
- setScoreByOperand(&Inst, Op, T, CurrScore);
+ setScoreByOperand(Op, T, CurrScore);
} else /* LGKM_CNT || EXP_CNT || VS_CNT || NUM_INST_CNTS */ {
// Match the score to the destination registers.
//
@@ -1026,9 +978,9 @@ void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
// Special cases where implicit register defs exists, such as M0 or VCC,
// but none with memory instructions.
for (const MachineOperand &Op : Inst.defs()) {
- RegInterval Interval = getRegInterval(&Inst, Op);
if (T == LOAD_CNT || T == SAMPLE_CNT || T == BVH_CNT) {
- if (Interval.first >= NUM_ALL_VGPRS)
+ if (!Context->TRI->isVectorRegister(*Context->MRI,
+ Op.getReg())) // TODO: add wrapper
continue;
if (updateVMCntOnly(Inst)) {
// updateVMCntOnly should only leave us with VGPRs
@@ -1041,11 +993,11 @@ void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
// this with another potential dependency
if (hasPointSampleAccel(Inst))
TypesMask |= 1 << VMEM_NOSAMPLER;
- for (int RegNo = Interval.first; RegNo < Interval.second; ++RegNo)
- VgprVmemTypes[RegNo] |= TypesMask;
+ for (MCRegUnit RU : regunits(Op.getReg().asMCReg()))
+ VMem[RU].VMEMTypes |= TypesMask;
}
}
- setScoreByInterval(Interval, T, CurrScore);
+ setScoreByOperand(Op, T, CurrScore);
}
if (Inst.mayStore() &&
(TII->isDS(Inst) || TII->mayWriteLDSThroughDMA(Inst))) {
@@ -1076,19 +1028,19...
[truncated]
You can get the unit's "root" register and test that. See MCRegUnitRootIterator. AMDGPU does not use ad hoc aliasing so I think every unit should have exactly one root.
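For reference, that approach would look roughly like the sketch below; isSGPRReg is the existing SIRegisterInfo helper, while isSGPRUnit is just an illustrative name for the wrapper, not code from this patch:

    // Sketch: classify a register unit by looking at its root register.
    // AMDGPU has no ad hoc aliasing, so each unit should have exactly one root.
    static bool isSGPRUnit(MCRegUnit RU, const SIRegisterInfo &TRI,
                           const MachineRegisterInfo &MRI) {
      for (MCRegUnitRootIterator Root(RU, &TRI); Root.isValid(); ++Root)
        return TRI.isSGPRReg(MRI, *Root);
      return false;
    }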
Is it worth adding this (potentially expensive?) query just so I can merge a few methods together, though? Perhaps I can try it in another diff on top of this? This diff already changes a lot of things; I want to make sure it doesn't become too big to review or debug in case of an issue.
Ping
ritter-x2a left a comment:
It would be nice to see the impact of the DenseMap vs the vector (or lack thereof) in a compile-time regression tracker, but it doesn't seem like we have one for AMDGPU workloads working right now.
The reasoning in the commit message sounds reasonable to me, though, so LGTM.
Why would this get memcpied, and how big is it? I have a hard time believing this is big enough to be a problem.
What we had in VgprScores was a [NUM_INST_CNTS][NUM_ALL_VGPRS] array of unsigned. That's (roughly) 4B * 10 * 2048 = 81920B of data per basic block just for VGPRs. Most of these entries will never be used; it's just wasted memory. Note that the main reason for this patch is streamlining the code of InsertWaitCnt; moving the tracking to a DenseMap is a secondary benefit.
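Back-of-the-envelope version of that math (the counter count of 10 is an assumed order of magnitude, and only the VGPR table is counted):

    // Old layout: one unsigned score per (counter, VGPR slot), zero-initialized
    // for every WaitcntBrackets instance.
    constexpr size_t NumInstCnts = 10;    // assumed number of counters
    constexpr size_t NumVgprSlots = 2048; // SQ_MAX_PGM_VGPRS
    constexpr size_t OldVgprTableBytes =
        sizeof(unsigned) * NumInstCnts * NumVgprSlots; // 4 * 10 * 2048 = 81920 B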
It's memcpyed here to save the WaitCntBracket state at the end of a basic block, to be used as the initial state for a successor block.
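Roughly, the propagation looks like the sketch below; the container, function, and field names are illustrative, not the pass's actual data structures:

    // The bracket state at the end of a block seeds each successor's incoming
    // state (copied on the first edge, merged at join points). With the old
    // fixed-size arrays, that copy was effectively a large memcpy per edge.
    DenseMap<const MachineBasicBlock *, WaitcntBrackets> Incoming;

    void propagateToSuccessors(const MachineBasicBlock &MBB,
                               const WaitcntBrackets &EndState) {
      for (const MachineBasicBlock *Succ : MBB.successors()) {
        auto [It, Inserted] = Incoming.try_emplace(Succ, EndState); // copy
        if (!Inserted)
          It->second.merge(EndState); // join point: merge, don't overwrite
      }
    }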
Maybe this should use SparseBitVector?
It stores counters, not single bits. I think using a SparseBitVector wouldn't work here.
    using VMEMID = uint32_t;

    enum : VMEMID {
      TRACKINGID_RANGE_LEN = (1 << 16),
Is this just an arbitrary value larger than MAX_REGUNITS? Can you assert somewhere that it is >= TRI.getNumRegUnits()?
Added in the ctor of WaitcntBrackets. I also added a comment to clarify the value is arbitrary and can be changed if more is needed.
      LDSDMA_BEGIN = REGUNITS_END,
      LDSDMA_END = LDSDMA_BEGIN + TRACKINGID_RANGE_LEN,

      NUM_LDSDMA = TRACKINGID_RANGE_LEN
This is a lot more than the 9 that we used to track! But I guess there is no downside, now that we are using DenseMap instead of arrays?
Yes, it's a u32 and each "slice" is u16, so we can store 65536 slices and we only use 2.
It's a bit overkill; we can always re-partition the VMEMID if we ever have an issue with it. It's an implementation detail.
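As a standalone sketch of that partitioning (toVMEMID is the name used in the patch; toLDSDMAID is only illustrative):

    #include <cassert>
    #include <cstdint>

    using VMEMID = uint32_t;

    enum : VMEMID {
      TRACKINGID_RANGE_LEN = 1u << 16, // arbitrary u16 slice, >= number of reg units
      REGUNITS_BEGIN = 0,
      REGUNITS_END = REGUNITS_BEGIN + TRACKINGID_RANGE_LEN,
      LDSDMA_BEGIN = REGUNITS_END, // slot 0 is the "common" LDS DMA entry
      LDSDMA_END = LDSDMA_BEGIN + TRACKINGID_RANGE_LEN,
    };

    // Register units occupy the first u16 slice.
    inline VMEMID toVMEMID(unsigned RegUnit) {
      assert(RegUnit < TRACKINGID_RANGE_LEN && "reg unit outside tracked range");
      return REGUNITS_BEGIN + RegUnit;
    }

    // LDS DMA slots occupy the second slice.
    inline VMEMID toLDSDMAID(unsigned Slot) {
      assert(Slot < TRACKINGID_RANGE_LEN && "LDS DMA slot outside tracked range");
      return LDSDMA_BEGIN + Slot;
    }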
    auto It = SGPRs.find(RU);
    return It != SGPRs.end() ? It->second.Scores[getSgprScoresIdx(T)] : 0;
Isn't this just SGPRs.lookup?
We could, but it'd create a temporary value just to extract a zero out of it. Right now Scores is just 2 ints so it wouldn't be a problem, but I prefer to use the find pattern to be consistent with the rest, and also to avoid the temporary in case we add more to Scores over time and the temporary becomes big.
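A self-contained illustration of that tradeoff (SGPRInfo here is just the two-score struct from the patch, with a plain unsigned key standing in for MCRegUnit):

    #include "llvm/ADT/DenseMap.h"

    struct SGPRInfo {
      unsigned Scores[2] = {0};
    };

    unsigned getScore(const llvm::DenseMap<unsigned, SGPRInfo> &SGPRs,
                      unsigned RU, unsigned Idx) {
      // lookup() returns the mapped value by value, default-constructing a
      // temporary SGPRInfo when the key is absent:
      //   unsigned S = SGPRs.lookup(RU).Scores[Idx];
      // find() reads the stored value in place and falls back to 0 if missing:
      auto It = SGPRs.find(RU);
      return It != SGPRs.end() ? It->second.Scores[Idx] : 0;
    }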
    void determineWaitForPhysReg(InstCounterType T, MCPhysReg Reg,
                                 AMDGPU::Waitcnt &Wait) const;
    void determineWaitForLDSDMA(InstCounterType T, VMEMID TID,
                                AMDGPU::Waitcnt &Wait) const;
Could these be two overloads of determineWait, or does that cause problems because MCPhysReg and VMEMID have more or less the same underlying type?
Yeah they're both integers. I could make the VMEMID an enum class but that'd add a bunch of casts all over the place. It's a tradeoff
    // The score tracking logic is fragmented as follows:
    // - VMem: VGPR RegUnits and LDS DMA IDs, see the VMEMID encoding.
    // - SGPRs: SGPR RegUnits
    // - SCC
Why can't SCC be handled like any other SGPR?
IIRC, SCC is not an SGPR. At least it's not part of the SGPR reg classes.
This patch aims to be NFCI, so I didn't try hard to fix things like these because I didn't want to bloat the patch too much. I want to come back to the pass and take another look once this lands so I added a TODO.
SCC isn't an SGPR; it's not general purpose and not allocatable
      }
    }

    void WaitcntBrackets::setScoreByOperand(const MachineInstr *MI,
If you wanted you could precommit the change to remove MI here, since it is trivially unused here and in getRegInterval. That would avoid some churn in this PR.
    // entry, which is updated for all LDS DMA operations encountered.
    // Specific LDS DMA IDs start at LDSDMA_BEGIN + 1.
    LDSDMA_BEGIN = REGUNITS_END,
    LDSDMA_END = LDSDMA_BEGIN + TRACKINGID_RANGE_LEN,
I'm not sure LDSDMA_END is needed? In a couple of places you use it in assertions, but there is nothing that would prevent those assertions from failing if LDS IDs happened to climb high enough.
determineWaitForLDSDMA would fail because it checks the ID is in the right range
Right, it would fail an assertion, but you should not be writing assertions that can fail on valid user input, and if I understand correctly then a program that accesses enough different LDS allocations will fail the assertion.
I added a check so we can't allocate IDs above the limit, like we had before
    for (int J = 0; J <= VgprUB; J++) {
      unsigned RegScore = getRegScore(J, T);
    for (auto &[ID, Info] : VMem) {
Is this going to print the entries in non-deterministic order? That could be annoying, although it is only debug output.
I sorted the keys, I think quality of debug output is important.
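Something along these lines, for example (a sketch, not the patch's exact print code):

    // Print DenseMap entries in key order so debug output is stable across runs.
    SmallVector<VMEMID, 16> Keys;
    for (const auto &Entry : VMem)
      Keys.push_back(Entry.first);
    llvm::sort(Keys);
    for (VMEMID ID : Keys)
      OS << "  vmem " << ID << ": " << VMem.find(ID)->second.Scores[T] << '\n';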
One more thought: is there a risk of the DenseMaps growing ever larger because we never remove entries from them?
I added a method to purge the map. The map can't grow huge in an uncontrolled way: the worst case (if we don't purge it) is that we end up with one entry for each register unit used across the function in every WaitcntBrackets instance. I collected statistics locally using an assert in the destructor of WaitcntBrackets.
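Conceptually, purging could look something like the sketch below; the method name and the staleness predicate are illustrative, not the patch's exact code:

    // Drop VMem entries whose scores can no longer require a wait, i.e. every
    // counter's score is already at or below that counter's lower bound.
    void WaitcntBrackets::purgeResolvedEntries() {
      SmallVector<VMEMID, 8> Dead;
      for (const auto &[ID, Info] : VMem) {
        bool Pending = false;
        for (auto T : inst_counter_types())
          Pending |= Info.Scores[T] > getScoreLB(T);
        if (!Pending)
          Dead.push_back(ID);
      }
      for (VMEMID ID : Dead)
        VMem.erase(ID);
    }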
    } else if (TRI->isVectorRegister(*Context->MRI, Reg)) {
      for (MCRegUnit RU : regunits(Reg))
        VMem[toVMEMID(RU)].Scores[T] = Val;
    } else if (TRI->isSGPRReg(*Context->MRI, Reg)) {
Can't this be else if (isSGPR()) else { }? There aren't registers that are something else?
Not that are handled right now but I'll add an unreachable there just in case.
Yes, I think the assert sounds useful.
I added a debug-only destructor that checks the maps.
    ; There are 8 pseudo registers defined to track LDS DMA dependencies.

    define amdgpu_kernel void @buffer_load_lds_dword_10_arrays(<4 x i32> %rsrc, i32 %i1, i32 %i2, i32 %i3, i32 %i4, i32 %i5, i32 %i6, i32 %i7, i32 %i8, i32 %i9, ptr addrspace(1) %out) {
#170660 should fix this. I'll rebase once it lands.
Clean up the tracking logic to rely on register units. The pass was already "reinventing" the concept just to deal with 16 bit registers. There are no test changes, functionality is the same, except we can now track more LDS DMA IDs if we need it. The debug prints also changed a bit because we now talk in terms of register units.

This also changes the tracking to use a DenseMap instead of a massive fixed size table. This trades a bit of access speed for a smaller memory footprint. Allocating and memsetting a huge table to zero caused a non-negligible performance impact (I've observed up to 50% of the time in the pass spent in the `memcpy` built-in). I also think we don't access these often enough to really justify using a vector. We do a few accesses per instruction, but not much more. In a huge 120MB LL file, I can barely see the trace of the DenseMap accesses.

This still isn't as clean as I'd like it to be though. There is a mix of "VMEMID", "LDS DMA ID", "SGPR RegUnit" and "PhysReg" in the API of WaitCntBrackets. There is no type safety to avoid mix-ups as these are all integers. We could add another layer of abstraction on top, but I feel like it's going to add too much code/boilerplate for such a small issue.
jayfoad left a comment:
LGTM, thanks for your patience.
    if (Slot)
      setRegScore(FIRST_LDS_VGPR, T, CurrScore);
    setVMemScore(LDSDMA_BEGIN, T, CurrScore);
    if (Slot && Slot < NUM_LDSDMA)
Just a question: I think Slot can be zero here but only if MemOp does not have a suitable MMO with AA info. Is that case still handled conservatively correctly?
Yes, we always set the LDSDMA_BEGIN slot no matter what, so we can fall back to that if needed.
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/123/builds/31920

The pass was already "reinventing" the concept just to deal with 16 bit registers. Clean up the entire tracking logic to only use register units.
There are no test changes because functionality didn't change, except:
- We can now track more LDS DMA IDs if we need it (up to 1 << 16).
- The debug prints changed a bit because we now talk in terms of register units.

This also changes the tracking to use a DenseMap instead of a massive fixed size table. This trades a bit of access speed for a smaller memory footprint. Allocating and memsetting a huge table to zero caused a non-negligible performance impact (I've observed up to 50% of the time in the pass spent in the memcpy built-in on a big test file). I also think we don't access these often enough to really justify using a vector. We do a few accesses per instruction, but not much more. In a huge 120MB LL file, I can barely see the trace of the DenseMap accesses.
I think this can be cleaned up a bit more. I want to see if we can remove PhysRegs from the WaitCnt bracket APIs, so it only reasons in terms of RegUnits and nothing else.
One issue I've had with that is that I'm not sure how to tell if an RU is an SGPR or VGPR. I don't think RUs are always considered registers, so I can't really fetch their reg class.