@Pierre-vh
Contributor

@Pierre-vh Pierre-vh commented Oct 6, 2025

The pass was already "reinventing" the concept of register units just to deal with 16-bit registers. Clean up the entire tracking logic to only use register units.

There are no test changes because functionality didn't change, except:

  • We can now track more LDS DMA IDs if we need it (up to 1 << 16)
  • The debug prints also changed a bit because we now talk in terms of register units.

This also changes the tracking to use a DenseMap instead of a massive fixed size table. This trades a bit of access speed for a smaller memory footprint. Allocating and memsetting a huge table to zero caused a non-negligible performance impact (I've observed up to 50% of the time in the pass spent in the memcpy built-in on a big test file).

I also think we don't access these often enough to really justify using a vector. We do a few accesses per instruction, but not much more. In a huge 120MB LL file, I can barely see the trace of the DenseMap accesses.
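
The trade-off described above can be sketched in isolation. This is a minimal model, not the pass's code: `std::unordered_map` stands in for LLVM's `DenseMap`, and `SparseScores`, `SlotInfo` and `NumCounters` are hypothetical names chosen for the illustration. It only shows that a sparse map stores entries for IDs that were actually written, while reads of untouched IDs still yield a zero score.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Assumption: illustrative counter count, not the pass's real NUM_INST_CNTS.
constexpr unsigned NumCounters = 10;

// Per-ID payload: one score per counter, zero-initialized.
struct SlotInfo {
  unsigned Scores[NumCounters] = {0};
};

// Sparse score storage: only IDs that were written occupy memory.
class SparseScores {
public:
  void set(std::uint32_t ID, unsigned Counter, unsigned Val) {
    assert(Counter < NumCounters);
    Map[ID].Scores[Counter] = Val; // creates the entry on first write
  }

  // Reads never insert: a missing ID simply reports a score of zero.
  unsigned get(std::uint32_t ID, unsigned Counter) const {
    assert(Counter < NumCounters);
    auto It = Map.find(ID);
    return It != Map.end() ? It->second.Scores[Counter] : 0;
  }

  std::size_t trackedEntries() const { return Map.size(); }

private:
  std::unordered_map<std::uint32_t, SlotInfo> Map;
};
```

Copying such an object costs time proportional to the number of tracked entries rather than to the size of a fixed table, which is the effect the description attributes to the switch.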

I think this can be cleaned up a bit more. I want to see if we can remove PhysRegs from the WaitCnt bracket APIs, so it only reasons in terms of RegUnits and nothing else.
One issue I've had with that is that I'm not sure how to tell if a RU is a SGPR or VGPR. I don't think RUs are always considered registers, so I can't really fetch their regclass.

@github-actions

github-actions bot commented Oct 6, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@Pierre-vh Pierre-vh marked this pull request as ready for review October 6, 2025 12:31
@llvmbot
Member

llvmbot commented Oct 6, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Patch is 35.36 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/162077.diff

1 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp (+225-273)
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index 76bfce8c0f6f9..90a5cd4e87ae7 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -97,7 +97,27 @@ auto inst_counter_types(InstCounterType MaxCounter = NUM_INST_CNTS) {
   return enum_seq(LOAD_CNT, MaxCounter);
 }
 
-using RegInterval = std::pair<int, int>;
+/// Integer IDs used to track vector memory locations we may have to wait on.
+/// Encoded as u16 chunks:
+///
+///   [0,            MAX_REGUNITS ): MCRegUnit
+///   [FIRST_LDSDMA, LAST_LDSDMA  ): LDS DMA IDs
+using VMEMID = uint32_t;
+
+enum : VMEMID {
+  TRACKINGID_RANGE_LEN = (1 << 16),
+
+  REGUNITS_BEGIN = 0,
+  REGUNITS_END = REGUNITS_BEGIN + TRACKINGID_RANGE_LEN,
+
+  // Note for LDSDMA: LDSDMA_BEGIN corresponds to the "common"
+  // entry, which is updated for all LDS DMA operations encountered.
+  // Specific LDS DMA IDs start at LDSDMA_BEGIN + 1.
+  LDSDMA_BEGIN = REGUNITS_END,
+  LDSDMA_END = LDSDMA_BEGIN + TRACKINGID_RANGE_LEN,
+
+  NUM_LDSDMA = TRACKINGID_RANGE_LEN
+};
 
 struct HardwareLimits {
   unsigned LoadcntMax; // Corresponds to VMcnt prior to gfx12.
@@ -146,30 +166,6 @@ static constexpr StringLiteral WaitEventTypeName[] = {
 #undef AMDGPU_EVENT_NAME
 // clang-format on
 
-// The mapping is:
-//  0                .. SQ_MAX_PGM_VGPRS-1               real VGPRs
-//  SQ_MAX_PGM_VGPRS .. NUM_ALL_VGPRS-1                  extra VGPR-like slots
-//  NUM_ALL_VGPRS    .. NUM_ALL_VGPRS+SQ_MAX_PGM_SGPRS-1 real SGPRs
-//  NUM_ALL_VGPRS+SQ_MAX_PGM_SGPRS ..                    SCC
-// We reserve a fixed number of VGPR slots in the scoring tables for
-// special tokens like SCMEM_LDS (needed for buffer load to LDS).
-enum RegisterMapping {
-  SQ_MAX_PGM_VGPRS = 2048, // Maximum programmable VGPRs across all targets.
-  AGPR_OFFSET = 512,       // Maximum programmable ArchVGPRs across all targets.
-  SQ_MAX_PGM_SGPRS = 128,  // Maximum programmable SGPRs across all targets.
-  // Artificial register slots to track LDS writes into specific LDS locations
-  // if a location is known. When slots are exhausted or location is
-  // unknown use the first slot. The first slot is also always updated in
-  // addition to known location's slot to properly generate waits if dependent
-  // instruction's location is unknown.
-  FIRST_LDS_VGPR = SQ_MAX_PGM_VGPRS, // Extra slots for LDS stores.
-  NUM_LDS_VGPRS = 9,                 // One more than the stores we track.
-  NUM_ALL_VGPRS = SQ_MAX_PGM_VGPRS + NUM_LDS_VGPRS, // Where SGPRs start.
-  NUM_ALL_ALLOCATABLE = NUM_ALL_VGPRS + SQ_MAX_PGM_SGPRS,
-  // Remaining non-allocatable registers
-  SCC = NUM_ALL_ALLOCATABLE
-};
-
 // Enumerate different types of result-returning VMEM operations. Although
 // s_waitcnt orders them all with a single vmcnt counter, in the absence of
 // s_waitcnt only instructions of the same VmemType are guaranteed to write
@@ -616,32 +612,26 @@ class WaitcntBrackets {
     return getScoreUB(T) - getScoreLB(T);
   }
 
-  unsigned getRegScore(int GprNo, InstCounterType T) const {
-    if (GprNo < NUM_ALL_VGPRS)
-      return VgprScores[T][GprNo];
-
-    if (GprNo < NUM_ALL_ALLOCATABLE)
-      return SgprScores[getSgprScoresIdx(T)][GprNo - NUM_ALL_VGPRS];
+  unsigned getSGPRScore(MCRegUnit RU, InstCounterType T) const {
+    auto It = SGPRs.find(RU);
+    return (It != SGPRs.end()) ? It->second.Scores[getSgprScoresIdx(T)] : 0;
+  }
 
-    assert(GprNo == SCC);
-    return SCCScore;
+  unsigned getVMemScore(VMEMID TID, InstCounterType T) const {
+    auto It = VMem.find(TID);
+    return (It != VMem.end()) ? It->second.Scores[T] : 0;
   }
 
   bool merge(const WaitcntBrackets &Other);
 
-  RegInterval getRegInterval(const MachineInstr *MI,
-                             const MachineOperand &Op) const;
-
   bool counterOutOfOrder(InstCounterType T) const;
   void simplifyWaitcnt(AMDGPU::Waitcnt &Wait) const;
   void simplifyWaitcnt(InstCounterType T, unsigned &Count) const;
 
-  void determineWait(InstCounterType T, RegInterval Interval,
-                     AMDGPU::Waitcnt &Wait) const;
-  void determineWait(InstCounterType T, int RegNo,
-                     AMDGPU::Waitcnt &Wait) const {
-    determineWait(T, {RegNo, RegNo + 1}, Wait);
-  }
+  void determineWaitForPhysReg(InstCounterType T, MCPhysReg Reg,
+                               AMDGPU::Waitcnt &Wait) const;
+  void determineWaitForLDSDMA(InstCounterType T, VMEMID TID,
+                              AMDGPU::Waitcnt &Wait) const;
   void tryClearSCCWriteEvent(MachineInstr *Inst);
 
   void applyWaitcnt(const AMDGPU::Waitcnt &Wait);
@@ -690,19 +680,19 @@ class WaitcntBrackets {
 
   // Return true if there might be pending writes to the vgpr-interval by VMEM
   // instructions with types different from V.
-  bool hasOtherPendingVmemTypes(RegInterval Interval, VmemType V) const {
-    for (int RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
-      assert(RegNo < NUM_ALL_VGPRS);
-      if (VgprVmemTypes[RegNo] & ~(1 << V))
+  bool hasOtherPendingVmemTypes(MCPhysReg Reg, VmemType V) const {
+    for (MCRegUnit RU : regunits(Reg)) {
+      auto It = VMem.find(RU);
+      if (It != VMem.end() && (It->second.VMEMTypes & ~(1 << V)))
         return true;
     }
     return false;
   }
 
-  void clearVgprVmemTypes(RegInterval Interval) {
-    for (int RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
-      assert(RegNo < NUM_ALL_VGPRS);
-      VgprVmemTypes[RegNo] = 0;
+  void clearVgprVmemTypes(MCPhysReg Reg) {
+    for (MCRegUnit RU : regunits(Reg)) {
+      if (auto It = VMem.find(RU); It != VMem.end())
+        It->second.VMEMTypes = 0;
     }
   }
 
@@ -718,7 +708,7 @@ class WaitcntBrackets {
 
   bool hasPointSampleAccel(const MachineInstr &MI) const;
   bool hasPointSamplePendingVmemTypes(const MachineInstr &MI,
-                                      RegInterval Interval) const;
+                                      MCPhysReg RU) const;
 
   void print(raw_ostream &) const;
   void dump() const { print(dbgs()); }
@@ -730,9 +720,24 @@ class WaitcntBrackets {
     unsigned MyShift;
     unsigned OtherShift;
   };
+
+  void determineWaitForScore(InstCounterType T, unsigned Score,
+                             AMDGPU::Waitcnt &Wait) const;
+
   static bool mergeScore(const MergeInfo &M, unsigned &Score,
                          unsigned OtherScore);
 
+  iterator_range<MCRegUnitIterator> regunits(MCPhysReg Reg) const {
+    assert(Reg != AMDGPU::SCC && "Shouldn't be used on SCC");
+    const TargetRegisterClass *RC = Context->TRI->getPhysRegBaseClass(Reg);
+    unsigned Size = Context->TRI->getRegSizeInBits(*RC);
+    if (!Context->TRI->isInAllocatableClass(Reg))
+      return {{}, {}};
+    if (Size == 16 && Context->ST->hasD16Writes32BitVgpr())
+      Reg = Context->TRI->get32BitRegister(Reg);
+    return Context->TRI->regunits(Reg);
+  }
+
   void setScoreLB(InstCounterType T, unsigned Val) {
     assert(T < NUM_INST_CNTS);
     ScoreLBs[T] = Val;
@@ -749,15 +754,26 @@ class WaitcntBrackets {
       ScoreLBs[EXP_CNT] = ScoreUBs[EXP_CNT] - Context->getWaitCountMax(EXP_CNT);
   }
 
-  void setRegScore(int GprNo, InstCounterType T, unsigned Val) {
-    setScoreByInterval({GprNo, GprNo + 1}, T, Val);
+  void setRegScore(MCPhysReg Reg, InstCounterType T, unsigned Val) {
+    const SIRegisterInfo *TRI = Context->TRI;
+    if (Reg == AMDGPU::SCC) {
+      SCCScore = Val;
+    } else if (TRI->isVectorRegister(*Context->MRI, Reg)) {
+      for (MCRegUnit RU : regunits(Reg))
+        VMem[RU].Scores[T] = Val;
+    } else if (TRI->isSGPRReg(*Context->MRI, Reg)) {
+      auto STy = getSgprScoresIdx(T);
+      for (MCRegUnit RU : regunits(Reg))
+        SGPRs[RU].Scores[STy] = Val;
+    }
   }
 
-  void setScoreByInterval(RegInterval Interval, InstCounterType CntTy,
-                          unsigned Score);
+  void setVMemScore(VMEMID TID, InstCounterType T, unsigned Val) {
+    VMem[TID].Scores[T] = Val;
+  }
 
-  void setScoreByOperand(const MachineInstr *MI, const MachineOperand &Op,
-                         InstCounterType CntTy, unsigned Val);
+  void setScoreByOperand(const MachineOperand &Op, InstCounterType CntTy,
+                         unsigned Val);
 
   const SIInsertWaitcnts *Context;
 
@@ -768,26 +784,39 @@ class WaitcntBrackets {
   unsigned LastFlat[NUM_INST_CNTS] = {0};
   // Remember the last GDS operation.
   unsigned LastGDS = 0;
-  // wait_cnt scores for every vgpr.
-  // Keep track of the VgprUB and SgprUB to make merge at join efficient.
-  int VgprUB = -1;
-  int SgprUB = -1;
-  unsigned VgprScores[NUM_INST_CNTS][NUM_ALL_VGPRS] = {{0}};
-  // Wait cnt scores for every sgpr, the DS_CNT (corresponding to LGKMcnt
-  // pre-gfx12) or KM_CNT (gfx12+ only), and X_CNT (gfx1250) are relevant.
-  // Row 0 represents the score for either DS_CNT or KM_CNT and row 1 keeps the
-  // X_CNT score.
-  unsigned SgprScores[2][SQ_MAX_PGM_SGPRS] = {{0}};
+
+  /// The score tracking logic is fragmented as follows:
+  /// - VMem: VGPR RegUnits and LDS DMA IDs, see the VMEMID encoding.
+  /// - SGPRs: SGPR RegUnits
+  /// - SCC
+
+  struct VGPRInfo {
+    // Scores for all instruction counters.
+    unsigned Scores[NUM_INST_CNTS] = {0};
+    // Bitmask of the VmemTypes of VMEM instructions for this VGPR.
+    unsigned VMEMTypes = 0;
+  };
+
+  struct SGPRInfo {
+    // Wait cnt scores for every sgpr, the DS_CNT (corresponding to LGKMcnt
+    // pre-gfx12) or KM_CNT (gfx12+ only), and X_CNT (gfx1250) are relevant.
+    // Row 0 represents the score for either DS_CNT or KM_CNT and row 1 keeps
+    // the X_CNT score.
+    unsigned Scores[2] = {0};
+  };
+
+  DenseMap<VMEMID, VGPRInfo> VMem; // VGPR + LDS DMA
+  DenseMap<MCRegUnit, SGPRInfo> SGPRs;
+
   // Reg score for SCC.
   unsigned SCCScore = 0;
   // The unique instruction that has an SCC write pending, if there is one.
   const MachineInstr *PendingSCCWrite = nullptr;
-  // Bitmask of the VmemTypes of VMEM instructions that might have a pending
-  // write to each vgpr.
-  unsigned char VgprVmemTypes[NUM_ALL_VGPRS] = {0};
+
   // Store representative LDS DMA operations. The only useful info here is
   // alias info. One store is kept per unique AAInfo.
-  SmallVector<const MachineInstr *, NUM_LDS_VGPRS - 1> LDSDMAStores;
+  // Entry zero is the "generic" entry that applies to all LDSDMA stores.
+  SmallVector<const MachineInstr *> LDSDMAStores;
 };
 
 class SIInsertWaitcntsLegacy : public MachineFunctionPass {
@@ -813,82 +842,10 @@ class SIInsertWaitcntsLegacy : public MachineFunctionPass {
 
 } // end anonymous namespace
 
-RegInterval WaitcntBrackets::getRegInterval(const MachineInstr *MI,
-                                            const MachineOperand &Op) const {
-  if (Op.getReg() == AMDGPU::SCC)
-    return {SCC, SCC + 1};
-
-  const SIRegisterInfo *TRI = Context->TRI;
-  const MachineRegisterInfo *MRI = Context->MRI;
-
-  if (!TRI->isInAllocatableClass(Op.getReg()))
-    return {-1, -1};
-
-  // A use via a PW operand does not need a waitcnt.
-  // A partial write is not a WAW.
-  assert(!Op.getSubReg() || !Op.isUndef());
-
-  RegInterval Result;
-
-  MCRegister MCReg = AMDGPU::getMCReg(Op.getReg(), *Context->ST);
-  unsigned RegIdx = TRI->getHWRegIndex(MCReg);
-
-  const TargetRegisterClass *RC = TRI->getPhysRegBaseClass(Op.getReg());
-  unsigned Size = TRI->getRegSizeInBits(*RC);
-
-  // AGPRs/VGPRs are tracked every 16 bits, SGPRs by 32 bits
-  if (TRI->isVectorRegister(*MRI, Op.getReg())) {
-    unsigned Reg = RegIdx << 1 | (AMDGPU::isHi16Reg(MCReg, *TRI) ? 1 : 0);
-    assert(!Context->ST->hasMAIInsts() || Reg < AGPR_OFFSET);
-    Result.first = Reg;
-    if (TRI->isAGPR(*MRI, Op.getReg()))
-      Result.first += AGPR_OFFSET;
-    assert(Result.first >= 0 && Result.first < SQ_MAX_PGM_VGPRS);
-    assert(Size % 16 == 0);
-    Result.second = Result.first + (Size / 16);
-
-    if (Size == 16 && Context->ST->hasD16Writes32BitVgpr()) {
-      // Regardless of which lo16/hi16 is used, consider the full 32-bit
-      // register used.
-      if (AMDGPU::isHi16Reg(MCReg, *TRI))
-        Result.first -= 1;
-      else
-        Result.second += 1;
-    }
-  } else if (TRI->isSGPRReg(*MRI, Op.getReg()) && RegIdx < SQ_MAX_PGM_SGPRS) {
-    // SGPRs including VCC, TTMPs and EXEC but excluding read-only scalar
-    // sources like SRC_PRIVATE_BASE.
-    Result.first = RegIdx + NUM_ALL_VGPRS;
-    Result.second = Result.first + divideCeil(Size, 32);
-  } else {
-    return {-1, -1};
-  }
-
-  return Result;
-}
-
-void WaitcntBrackets::setScoreByInterval(RegInterval Interval,
-                                         InstCounterType CntTy,
-                                         unsigned Score) {
-  for (int RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
-    if (RegNo < NUM_ALL_VGPRS) {
-      VgprUB = std::max(VgprUB, RegNo);
-      VgprScores[CntTy][RegNo] = Score;
-    } else if (RegNo < NUM_ALL_ALLOCATABLE) {
-      SgprUB = std::max(SgprUB, RegNo - NUM_ALL_VGPRS);
-      SgprScores[getSgprScoresIdx(CntTy)][RegNo - NUM_ALL_VGPRS] = Score;
-    } else {
-      assert(RegNo == SCC);
-      SCCScore = Score;
-    }
-  }
-}
-
-void WaitcntBrackets::setScoreByOperand(const MachineInstr *MI,
-                                        const MachineOperand &Op,
+void WaitcntBrackets::setScoreByOperand(const MachineOperand &Op,
                                         InstCounterType CntTy, unsigned Score) {
-  RegInterval Interval = getRegInterval(MI, Op);
-  setScoreByInterval(Interval, CntTy, Score);
+  assert(Op.isReg());
+  setRegScore(Op.getReg().asMCReg(), CntTy, Score);
 }
 
 // Return true if the subtarget is one that enables Point Sample Acceleration
@@ -911,12 +868,12 @@ bool WaitcntBrackets::hasPointSampleAccel(const MachineInstr &MI) const {
 // one that has outstanding writes to vmem-types different than VMEM_NOSAMPLER
 // (this is the type that a point sample accelerated instruction effectively
 // becomes)
-bool WaitcntBrackets::hasPointSamplePendingVmemTypes(
-    const MachineInstr &MI, RegInterval Interval) const {
+bool WaitcntBrackets::hasPointSamplePendingVmemTypes(const MachineInstr &MI,
+                                                     MCPhysReg Reg) const {
   if (!hasPointSampleAccel(MI))
     return false;
 
-  return hasOtherPendingVmemTypes(Interval, VMEM_NOSAMPLER);
+  return hasOtherPendingVmemTypes(Reg, VMEM_NOSAMPLER);
 }
 
 void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
@@ -943,57 +900,52 @@ void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
       // All GDS operations must protect their address register (same as
       // export.)
       if (const auto *AddrOp = TII->getNamedOperand(Inst, AMDGPU::OpName::addr))
-        setScoreByOperand(&Inst, *AddrOp, EXP_CNT, CurrScore);
+        setScoreByOperand(*AddrOp, EXP_CNT, CurrScore);
 
       if (Inst.mayStore()) {
         if (const auto *Data0 =
                 TII->getNamedOperand(Inst, AMDGPU::OpName::data0))
-          setScoreByOperand(&Inst, *Data0, EXP_CNT, CurrScore);
+          setScoreByOperand(*Data0, EXP_CNT, CurrScore);
         if (const auto *Data1 =
                 TII->getNamedOperand(Inst, AMDGPU::OpName::data1))
-          setScoreByOperand(&Inst, *Data1, EXP_CNT, CurrScore);
+          setScoreByOperand(*Data1, EXP_CNT, CurrScore);
       } else if (SIInstrInfo::isAtomicRet(Inst) && !SIInstrInfo::isGWS(Inst) &&
                  Inst.getOpcode() != AMDGPU::DS_APPEND &&
                  Inst.getOpcode() != AMDGPU::DS_CONSUME &&
                  Inst.getOpcode() != AMDGPU::DS_ORDERED_COUNT) {
         for (const MachineOperand &Op : Inst.all_uses()) {
           if (TRI->isVectorRegister(*MRI, Op.getReg()))
-            setScoreByOperand(&Inst, Op, EXP_CNT, CurrScore);
+            setScoreByOperand(Op, EXP_CNT, CurrScore);
         }
       }
     } else if (TII->isFLAT(Inst)) {
       if (Inst.mayStore()) {
-        setScoreByOperand(&Inst,
-                          *TII->getNamedOperand(Inst, AMDGPU::OpName::data),
+        setScoreByOperand(*TII->getNamedOperand(Inst, AMDGPU::OpName::data),
                           EXP_CNT, CurrScore);
       } else if (SIInstrInfo::isAtomicRet(Inst)) {
-        setScoreByOperand(&Inst,
-                          *TII->getNamedOperand(Inst, AMDGPU::OpName::data),
+        setScoreByOperand(*TII->getNamedOperand(Inst, AMDGPU::OpName::data),
                           EXP_CNT, CurrScore);
       }
     } else if (TII->isMIMG(Inst)) {
       if (Inst.mayStore()) {
-        setScoreByOperand(&Inst, Inst.getOperand(0), EXP_CNT, CurrScore);
+        setScoreByOperand(Inst.getOperand(0), EXP_CNT, CurrScore);
       } else if (SIInstrInfo::isAtomicRet(Inst)) {
-        setScoreByOperand(&Inst,
-                          *TII->getNamedOperand(Inst, AMDGPU::OpName::data),
+        setScoreByOperand(*TII->getNamedOperand(Inst, AMDGPU::OpName::data),
                           EXP_CNT, CurrScore);
       }
     } else if (TII->isMTBUF(Inst)) {
       if (Inst.mayStore())
-        setScoreByOperand(&Inst, Inst.getOperand(0), EXP_CNT, CurrScore);
+        setScoreByOperand(Inst.getOperand(0), EXP_CNT, CurrScore);
     } else if (TII->isMUBUF(Inst)) {
       if (Inst.mayStore()) {
-        setScoreByOperand(&Inst, Inst.getOperand(0), EXP_CNT, CurrScore);
+        setScoreByOperand(Inst.getOperand(0), EXP_CNT, CurrScore);
       } else if (SIInstrInfo::isAtomicRet(Inst)) {
-        setScoreByOperand(&Inst,
-                          *TII->getNamedOperand(Inst, AMDGPU::OpName::data),
+        setScoreByOperand(*TII->getNamedOperand(Inst, AMDGPU::OpName::data),
                           EXP_CNT, CurrScore);
       }
     } else if (TII->isLDSDIR(Inst)) {
       // LDSDIR instructions attach the score to the destination.
-      setScoreByOperand(&Inst,
-                        *TII->getNamedOperand(Inst, AMDGPU::OpName::vdst),
+      setScoreByOperand(*TII->getNamedOperand(Inst, AMDGPU::OpName::vdst),
                         EXP_CNT, CurrScore);
     } else {
       if (TII->isEXP(Inst)) {
@@ -1003,18 +955,18 @@ void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
         // score.
         for (MachineOperand &DefMO : Inst.all_defs()) {
           if (TRI->isVGPR(*MRI, DefMO.getReg())) {
-            setScoreByOperand(&Inst, DefMO, EXP_CNT, CurrScore);
+            setScoreByOperand(DefMO, EXP_CNT, CurrScore);
           }
         }
       }
       for (const MachineOperand &Op : Inst.all_uses()) {
         if (TRI->isVectorRegister(*MRI, Op.getReg()))
-          setScoreByOperand(&Inst, Op, EXP_CNT, CurrScore);
+          setScoreByOperand(Op, EXP_CNT, CurrScore);
       }
     }
   } else if (T == X_CNT) {
     for (const MachineOperand &Op : Inst.all_uses())
-      setScoreByOperand(&Inst, Op, T, CurrScore);
+      setScoreByOperand(Op, T, CurrScore);
   } else /* LGKM_CNT || EXP_CNT || VS_CNT || NUM_INST_CNTS */ {
     // Match the score to the destination registers.
     //
@@ -1026,9 +978,9 @@ void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
     // Special cases where implicit register defs exists, such as M0 or VCC,
     // but none with memory instructions.
     for (const MachineOperand &Op : Inst.defs()) {
-      RegInterval Interval = getRegInterval(&Inst, Op);
       if (T == LOAD_CNT || T == SAMPLE_CNT || T == BVH_CNT) {
-        if (Interval.first >= NUM_ALL_VGPRS)
+        if (!Context->TRI->isVectorRegister(*Context->MRI,
+                                            Op.getReg())) // TODO: add wrapper
           continue;
         if (updateVMCntOnly(Inst)) {
           // updateVMCntOnly should only leave us with VGPRs
@@ -1041,11 +993,11 @@ void WaitcntBrackets::updateByEvent(WaitEventType E, MachineInstr &Inst) {
           // this with another potential dependency
           if (hasPointSampleAccel(Inst))
             TypesMask |= 1 << VMEM_NOSAMPLER;
-          for (int RegNo = Interval.first; RegNo < Interval.second; ++RegNo)
-            VgprVmemTypes[RegNo] |= TypesMask;
+          for (MCRegUnit RU : regunits(Op.getReg().asMCReg()))
+            VMem[RU].VMEMTypes |= TypesMask;
         }
       }
-      setScoreByInterval(Interval, T, CurrScore);
+      setScoreByOperand(Op, T, CurrScore);
     }
     if (Inst.mayStore() &&
         (TII->isDS(Inst) || TII->mayWriteLDSThroughDMA(Inst))) {
@@ -1076,19 +1028,19...
[truncated]

@jayfoad
Contributor

jayfoad commented Oct 6, 2025

One issue I've had with that is that I'm not sure how to tell if a RU is a SGPR or VGPR.

You can get the unit's "root" register and test that. See MCRegUnitRootIterator. AMDGPU does not use ad hoc aliasing so I think every unit should have exactly one root.

@Pierre-vh
Contributor Author

One issue I've had with that is that I'm not sure how to tell if a RU is a SGPR or VGPR.

You can get the unit's "root" register and test that. See MCRegUnitRootIterator. AMDGPU does not use ad hoc aliasing so I think every unit should have exactly one root.

Is it worth adding this (potentially expensive?) query just so I can merge a few methods together, though?
I also thought about merging everything into a single map, then making the value of the map some type of variant with different fields depending on whether the key is a VGPR/SGPR/SCC/LDSDMA. That'd be a bit less space efficient, but would streamline the implementation.

Perhaps I can try it in another diff on top of this? This diff already changes a lot of things, and I want to make sure it doesn't become too big to review or debug in case of an issue.

@Pierre-vh Pierre-vh force-pushed the users/pierre-vh/refactor-insertwaitcnt-regunits branch from 3ec5466 to 34a77ec Compare October 7, 2025 08:28
@Pierre-vh Pierre-vh requested a review from arsenm October 7, 2025 08:30
@Pierre-vh Pierre-vh force-pushed the users/pierre-vh/refactor-insertwaitcnt-regunits branch from 34a77ec to 70baf17 Compare October 10, 2025 12:15
@Pierre-vh
Contributor Author

Ping

@Pierre-vh Pierre-vh force-pushed the users/pierre-vh/refactor-insertwaitcnt-regunits branch from 70baf17 to 9cecce3 Compare October 28, 2025 12:03
@Pierre-vh Pierre-vh requested a review from shiltian October 28, 2025 12:03
@Pierre-vh Pierre-vh force-pushed the users/pierre-vh/refactor-insertwaitcnt-regunits branch from 9cecce3 to e1a66e0 Compare November 12, 2025 10:00
@Pierre-vh Pierre-vh requested a review from ritter-x2a November 12, 2025 10:00
Member

@ritter-x2a ritter-x2a left a comment

It would be nice to see the impact of the DenseMap vs the vector (or lack thereof) in a compile-time regression tracker, but it doesn't seem like we have one for AMDGPU workloads working right now.
The reasoning in the commit message sounds reasonable to me, though, so LGTM.

@arsenm
Contributor

arsenm commented Nov 13, 2025

This also changes the tracking to use a DenseMap instead of a massive fixed size table. This trades a bit of access speed for a smaller memory footprint. Allocating and memsetting a huge table to zero caused a non-negligible performance impact (I've observed up to 50% of the time in the pass spent in the memcpy built-in on a big test file).

Why would this get memcpied, and how big is it? I have a hard time believing this is big enough to be a problem

@Pierre-vh
Contributor Author

Why would this get memcpied, and how big is it? I have a hard time believing this is big enough to be a problem

What we had in WaitcntBrackets was

unsigned VgprScores[NUM_INST_CNTS][NUM_ALL_VGPRS] = {{0}};

That's (roughly) 4B * 10 * 2048 = 81920B of data per basic block just for VGPRs. Most of these entries will never be used, it's just wasted memory.
I can't find the source of the memcpy; I can just see a memcpy built-in taking about 30-40% of the runtime of the pass before the patch. I think it's likely inserted by the compiler to initialize the object or copy it somewhere.

Note that the main reason for this patch is streamlining the code of InsertWaitCnt. Moving the tracking to DenseMap is something I've done alongside it as it makes things simpler and (in my limited testing) is either as fast, or faster than the big array.
I could revert to using an array if there is a good reason for it.
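
The size estimate above can be checked with a small standalone sketch. The VGPR constants are copied from the `RegisterMapping` enum that the diff removes; the counter count of 10 is an assumption matching the rough figure quoted in the comment, not a value confirmed by the patch. `OldVgprTable`, `oldTableBytes` and `roughEstimateBytes` are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstddef>

// Values copied from the RegisterMapping enum removed by the diff.
constexpr unsigned SQ_MAX_PGM_VGPRS = 2048;
constexpr unsigned NUM_LDS_VGPRS = 9;
constexpr unsigned NUM_ALL_VGPRS = SQ_MAX_PGM_VGPRS + NUM_LDS_VGPRS;

// Assumption: 10 counters, as in the rough estimate quoted above.
constexpr unsigned NUM_INST_CNTS_ASSUMED = 10;

// Model of the old fixed-size VGPR score table held by every bracket object.
struct OldVgprTable {
  unsigned VgprScores[NUM_INST_CNTS_ASSUMED][NUM_ALL_VGPRS] = {{0}};
};

// Exact size of the modelled table, including the extra LDS slots.
constexpr std::size_t oldTableBytes() { return sizeof(OldVgprTable); }

// The rounded figure from the comment: element size * counters * 2048.
constexpr std::size_t roughEstimateBytes() {
  return sizeof(unsigned) * NUM_INST_CNTS_ASSUMED * SQ_MAX_PGM_VGPRS;
}
```

With a 4-byte `unsigned`, the rounded estimate evaluates to 81920 bytes; the modelled table is slightly larger because it also covers the nine LDS slots.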

@jayfoad
Contributor

jayfoad commented Nov 14, 2025

Why would this get memcpied

It's memcpyed here to save the WaitcntBrackets state at the end of a basic block, to be used as the initial state for a successor block:

SuccBI.Incoming = std::make_unique<WaitcntBrackets>(*Brackets);
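
To illustrate why that line copies the whole state, here is a hedged model: `BracketsModel`, `CopyCount` and `snapshotForSuccessor` are hypothetical stand-ins, and the table dimensions are assumed. The point is only that `std::make_unique<T>(Existing)` invokes the copy constructor, so every byte of an embedded fixed-size table is duplicated for each snapshot.

```cpp
#include <cassert>
#include <cstring>
#include <memory>

// Counts copy-constructor calls so the copy is observable.
unsigned CopyCount = 0;

// Hypothetical stand-in for a bracket object with an embedded fixed table.
struct BracketsModel {
  unsigned Table[10][2048] = {{0}}; // assumed dimensions, roughly 80 KiB

  BracketsModel() = default;
  BracketsModel(const BracketsModel &Other) {
    // A compiler-generated copy would do the equivalent of this memcpy.
    std::memcpy(Table, Other.Table, sizeof(Table));
    ++CopyCount;
  }
};

// Mirrors the shape of the quoted line: heap-allocate a copy of the
// current state to seed a successor block.
std::unique_ptr<BracketsModel>
snapshotForSuccessor(const BracketsModel &Current) {
  return std::make_unique<BracketsModel>(Current);
}
```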

@arsenm
Contributor

arsenm commented Nov 15, 2025

Maybe this should use SparseBitVector?

@Pierre-vh
Contributor Author

Maybe this should use SparseBitVector?

It stores counters, not single bits. I think using a SparseBitVector will just make this more complicated than it has to be.
Is there a good reason why DenseMap isn't good here?

using VMEMID = uint32_t;

enum : VMEMID {
TRACKINGID_RANGE_LEN = (1 << 16),
Contributor

Is this just an arbitrary value larger than MAX_REGUNITS? Can you assert somewhere that it is >= TRI.getNumRegUnits()?

Contributor Author

Added in the ctor of WaitcntBrackets. I also added a comment to clarify the value is arbitrary and can be changed if more is needed.

LDSDMA_BEGIN = REGUNITS_END,
LDSDMA_END = LDSDMA_BEGIN + TRACKINGID_RANGE_LEN,

NUM_LDSDMA = TRACKINGID_RANGE_LEN
Contributor

This is a lot more than the 9 that we used to track! But I guess there is no downside, now that we are using DenseMap instead of arrays?

Contributor Author

Yes, it's a u32 and each "slice" is a u16, so we can store 65536 slices and we only use 2.
It's a bit overkill, but we can always re-partition the VMEMID if we ever have an issue with it. It's an implementation detail.
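
The partitioning discussed here can be sketched as follows. The enum mirrors the values shown in the diff; the helpers `isRegUnitID`, `isLDSDMAID` and `ldsDmaSlotToID` are hypothetical and exist only to show how an ID falls into one of the two ranges currently in use.

```cpp
#include <cassert>
#include <cstdint>

using VMEMID = std::uint32_t;

// Same layout as the enum in the diff: two consecutive ranges of 1 << 16 IDs.
enum : VMEMID {
  TRACKINGID_RANGE_LEN = (1u << 16),

  REGUNITS_BEGIN = 0,
  REGUNITS_END = REGUNITS_BEGIN + TRACKINGID_RANGE_LEN,

  // LDSDMA_BEGIN is the "common" entry; specific IDs start one above it.
  LDSDMA_BEGIN = REGUNITS_END,
  LDSDMA_END = LDSDMA_BEGIN + TRACKINGID_RANGE_LEN,
};

// Hypothetical helper: true if the ID lies in the register-unit range.
inline bool isRegUnitID(VMEMID ID) { return ID < REGUNITS_END; }

// Hypothetical helper: true if the ID lies in the LDS DMA range.
inline bool isLDSDMAID(VMEMID ID) {
  return ID >= LDSDMA_BEGIN && ID < LDSDMA_END;
}

// Hypothetical helper: map a slot index (0 = common entry) to its ID.
inline VMEMID ldsDmaSlotToID(unsigned Slot) {
  assert(Slot < TRACKINGID_RANGE_LEN && "slot outside the LDS DMA range");
  return LDSDMA_BEGIN + Slot;
}
```

Because the ranges are only compared against constants, changing `TRACKINGID_RANGE_LEN` would re-partition the ID space without touching the helpers.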

Comment on lines +622 to +647
auto It = SGPRs.find(RU);
return It != SGPRs.end() ? It->second.Scores[getSgprScoresIdx(T)] : 0;
Contributor

Isn't this just SGPRs.lookup?

Contributor Author

We could, but it'd create a temporary value just to extract a zero out of it. Right now Scores is just two ints so it wouldn't be a problem, but I prefer to use the find pattern to be consistent with the rest, and also to avoid the temporary in case we add more to Scores over time and the temporary becomes big.

Comment on lines +640 to +667
void determineWaitForPhysReg(InstCounterType T, MCPhysReg Reg,
AMDGPU::Waitcnt &Wait) const;
void determineWaitForLDSDMA(InstCounterType T, VMEMID TID,
AMDGPU::Waitcnt &Wait) const;
Contributor

Could these be two overloads of determineWait, or does that cause problems because MCPhysReg and VMEMID have more or less the same underlying type?

Contributor Author

Yeah, they're both integers. I could make the VMEMID an enum class, but that'd add a bunch of casts all over the place. It's a tradeoff.
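
The tradeoff mentioned here can be made concrete with a small sketch. Everything below is hypothetical: `PhysReg` and `StrongVMemID` are stand-ins rather than the LLVM types, and the `determineWait` overloads return a tag only so the chosen overload is visible. With a scoped enumeration the two overloads cannot be confused, at the cost of an explicit conversion wherever a raw integer becomes an ID.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a physical register number (plain integer type).
using PhysReg = std::uint16_t;

// Strongly typed ID: no implicit conversion from or to integers.
enum class StrongVMemID : std::uint32_t {};

// Hypothetical overload set; each returns a tag identifying itself.
inline int determineWait(PhysReg) { return 1; }
inline int determineWait(StrongVMemID) { return 2; }

// The explicit conversion that call sites would need when building an ID.
inline StrongVMemID toVMemID(std::uint32_t Raw) {
  return static_cast<StrongVMemID>(Raw);
}

// And the reverse, e.g. for use as a map key or in range checks.
inline std::uint32_t toRaw(StrongVMemID ID) {
  return static_cast<std::uint32_t>(ID);
}
```

Separate function names, as the patch uses, avoid both the ambiguity and the conversions, which is presumably why the author kept them.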

// The score tracking logic is fragmented as follows:
// - VMem: VGPR RegUnits and LDS DMA IDs, see the VMEMID encoding.
// - SGPRs: SGPR RegUnits
// - SCC
Contributor

Why can't SCC be handled like any other SGPR?

Contributor Author

IIRC, SCC is not an SGPR. At least it's not part of the SGPR reg classes.

This patch aims to be NFCI, so I didn't try hard to fix things like these because I didn't want to bloat the patch too much. I want to come back to the pass and take another look once this lands so I added a TODO.

Contributor

SCC isn't an SGPR; it's not general purpose and not allocatable

}
}

void WaitcntBrackets::setScoreByOperand(const MachineInstr *MI,
Contributor

If you wanted you could precommit the change to remove MI here, since it is trivially unused here and in getRegInterval. That would avoid some churn in this PR.

// entry, which is updated for all LDS DMA operations encountered.
// Specific LDS DMA IDs start at LDSDMA_BEGIN + 1.
LDSDMA_BEGIN = REGUNITS_END,
LDSDMA_END = LDSDMA_BEGIN + TRACKINGID_RANGE_LEN,
Contributor

I'm not sure LDSDMA_END is needed? In a couple of places you use it in assertions, but there is nothing that would prevent those assertions from failing if LDS IDs happened to climb high enough.

Contributor Author

determineWaitForLDSDMA would fail because it checks the ID is in the right range

Contributor

Right, it would fail an assertion, but you should not be writing assertions that can fail on valid user input, and if I understand correctly then a program that accesses enough different LDS allocations will fail the assertion.

Contributor Author

I added a check so we can't allocate IDs above the limit, like we had before
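
One way such a limit check could look is sketched below. The limit value, the string keys, and the function name are assumptions made for illustration; the patch's actual check is not shown in this thread.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical limit on specific slots. The value used by the pass may differ.
constexpr unsigned MaxSpecificSlotsSketch = 4;

// Returns the 1-based slot for Name, allocating it on first use. Returns 0
// when the limit is reached, meaning "no specific slot, rely on the generic
// entry only". No ID above the limit is ever handed out, so a range assertion
// on the ID cannot fire because of the input program.
unsigned getOrAllocateSlotSketch(std::vector<std::string> &Known,
                                 const std::string &Name) {
  for (unsigned I = 0; I < Known.size(); ++I)
    if (Known[I] == Name)
      return I + 1;
  if (Known.size() >= MaxSpecificSlotsSketch)
    return 0;
  Known.push_back(Name);
  return static_cast<unsigned>(Known.size());
}
```

The point of the design is that exceeding the limit degrades precision (more conservative waits) instead of tripping an assertion on valid input.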


for (int J = 0; J <= VgprUB; J++) {
unsigned RegScore = getRegScore(J, T);
for (auto &[ID, Info] : VMem) {
Contributor

Is this going to print the entries in non-deterministic order? That could be annoying, although it is only debug output.

Contributor Author

I sorted the keys, I think quality of debug output is important.
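
Sorting the keys before printing can be sketched like this. `std::unordered_map` again stands in for the DenseMap, and `dumpSortedSketch` and its output format are hypothetical, not the pass's real debug output.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Formats "key:score" pairs in ascending key order, regardless of the
// iteration order of the underlying hash map.
std::string dumpSortedSketch(const std::unordered_map<unsigned, unsigned> &Scores) {
  std::vector<unsigned> Keys;
  Keys.reserve(Scores.size());
  for (const auto &KV : Scores)
    Keys.push_back(KV.first);
  std::sort(Keys.begin(), Keys.end());

  std::string Out;
  for (unsigned K : Keys)
    Out += std::to_string(K) + ":" + std::to_string(Scores.at(K)) + " ";
  return Out;
}
```

The extra vector and sort only cost time when debug output is actually produced, so the tracking itself is unaffected.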

@jayfoad
Contributor

jayfoad commented Nov 27, 2025

One more thought: is there a risk of the DenseMaps growing ever larger because we never remove entries from them? Maybe the merge function would be a good opportunity to purge useless entries?

@Pierre-vh Pierre-vh requested a review from jayfoad November 28, 2025 10:14
@Pierre-vh Pierre-vh force-pushed the users/pierre-vh/refactor-insertwaitcnt-regunits branch 2 times, most recently from f88e053 to e944acd Compare November 28, 2025 11:57
@Pierre-vh
Contributor Author

One more thought: is there a risk of the DenseMaps growing ever larger because we never remove entries from them? Maybe the merge function would be a good opportunity to purge useless entries?

I added a method to purge the map, and I also made clearVgprVmemTypes erase the map entry if it causes it to be empty.

The map can't grow huge in an uncontrolled way. The worst case (if we don't purge it) is that we end up with one entry for each register unit used across the function in every WaitcntBrackets instance.

I collected statistics locally using an assert in the destructor of WaitcntBrackets, and the worst I saw was about 122 VMem map entries that were empty (before implementing the fixes; now it's zero).
Should I add the assert back in? It may be useful to prevent accidental misuse of the map.
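
A rough sketch of the purge and erase-on-clear behaviour described above follows. The entry layout and all names (`EntrySketch`, `purgeEmptySketch`, `clearTypesSketch`) are assumptions for illustration, and `std::unordered_map` stands in for the DenseMap.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical entry layout: two scores plus a bitmask of VMEM types.
struct EntrySketch {
  unsigned Scores[2] = {0, 0};
  unsigned char VMemTypes = 0;
  bool empty() const {
    return Scores[0] == 0 && Scores[1] == 0 && VMemTypes == 0;
  }
};

using MapSketch = std::unordered_map<unsigned, EntrySketch>;

// Erases every entry that carries no information. Returns how many were
// removed, which is the quantity a debug-only check would expect to be zero.
std::size_t purgeEmptySketch(MapSketch &Map) {
  std::size_t Removed = 0;
  for (auto It = Map.begin(); It != Map.end();) {
    if (It->second.empty()) {
      It = Map.erase(It);
      ++Removed;
    } else {
      ++It;
    }
  }
  return Removed;
}

// Clears the type bits for ID and drops the entry if nothing else is left.
void clearTypesSketch(MapSketch &Map, unsigned ID) {
  auto It = Map.find(ID);
  if (It == Map.end())
    return;
  It->second.VMemTypes = 0;
  if (It->second.empty())
    Map.erase(It);
}
```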

@Pierre-vh Pierre-vh force-pushed the users/pierre-vh/refactor-insertwaitcnt-regunits branch from e944acd to 10726b1 Compare December 2, 2025 10:55
} else if (TRI->isVectorRegister(*Context->MRI, Reg)) {
for (MCRegUnit RU : regunits(Reg))
VMem[toVMEMID(RU)].Scores[T] = Val;
} else if (TRI->isSGPRReg(*Context->MRI, Reg)) {
Contributor

Can't this be else if (isSGPR()) else { }? There aren't registers that are something else?

Contributor Author

Not any that are handled right now, but I'll add an unreachable there just in case.

@Pierre-vh Pierre-vh force-pushed the users/pierre-vh/refactor-insertwaitcnt-regunits branch from 10726b1 to 1a77571 Compare December 3, 2025 09:32
@jayfoad
Contributor

jayfoad commented Dec 3, 2025

I collected statistics locally using an assert in the destructor of WaitcntBrackets, and the worst I saw was about 122 VMem map entries that were empty (before implementing the fixes; now it's zero). Should I add the assert back in? It may be useful to prevent accidental misuse of the map.

Yes I think the assert sounds useful.

@Pierre-vh
Contributor Author

Yes I think the assert sounds useful.

I added a debug-only destructor that checks the maps.

@Pierre-vh Pierre-vh requested a review from jayfoad December 3, 2025 15:02

; There are 8 pseudo registers defined to track LDS DMA dependencies.

define amdgpu_kernel void @buffer_load_lds_dword_10_arrays(<4 x i32> %rsrc, i32 %i1, i32 %i2, i32 %i3, i32 %i4, i32 %i5, i32 %i6, i32 %i7, i32 %i8, i32 %i9, ptr addrspace(1) %out) {
Contributor Author

#170660 should fix this. I'll rebase once it lands.

Clean up the tracking logic to rely on register units. The pass was
already "reinventing" the concept just to deal with 16 bit registers.

There are no test changes, functionality is the same, except we can
now track more LDS DMA IDs if we need it. The debug prints also changed
a bit because we now talk in terms of register units.

This also changes the tracking to use a DenseMap instead of a massive
fixed size table. This trades a bit of access speed for a smaller
memory footprint. Allocating and memsetting a huge table to zero
caused a non-negligible performance impact (I've observed up to 50%
of the time in the pass spent in the `memcpy` built-in).

I also think we don't access these often enough to really justify
using a vector. We do a few accesses per instruction, but not much
more. In a huge 120MB LL file, I can barely see the trace of the DenseMap
accesses.

This still isn't as clean as I'd like it to be though. There is a mix
of "VMEMID", "LDS DMA ID", "SGPR RegUnit" and "PhysReg" in the API of WaitcntBrackets.
There is no type safety to avoid mix-ups as these are all integers.
We could add another layer of abstraction on top, but I feel like it's going to add
too much code/boilerplate for such a small issue.
@Pierre-vh Pierre-vh force-pushed the users/pierre-vh/refactor-insertwaitcnt-regunits branch from 9560519 to c1b5840 Compare December 9, 2025 09:19
Contributor

@jayfoad jayfoad left a comment

LGTM, thanks for your patience.

if (Slot)
setRegScore(FIRST_LDS_VGPR, T, CurrScore);
setVMemScore(LDSDMA_BEGIN, T, CurrScore);
if (Slot && Slot < NUM_LDSDMA)
Contributor

Just a question: I think Slot can be zero here but only if MemOp does not have a suitable MMO with AA info. Is that case still handled conservatively correctly?

Contributor Author

Yes, we always set the LDSDMA_BEGIN slot no matter what, so we can fall back to that if needed.
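
The fallback can be pictured with the sketch below. It is an assumption-laden illustration: `GenericSlotSketch` stands in for the LDSDMA_BEGIN entry, slot 0 means "no specific slot known", and the query side is a deliberately conservative guess, not the pass's actual logic.

```cpp
#include <cassert>
#include <unordered_map>

// Stand-in for the generic entry that every LDS DMA operation updates.
constexpr unsigned GenericSlotSketch = 0;

using SlotScoresSketch = std::unordered_map<unsigned, unsigned>;

// Records an LDS DMA operation. Slot == 0 means no specific slot could be
// determined (for example, no usable alias info on the memory operand).
void recordLDSDMASketch(SlotScoresSketch &Scores, unsigned Slot,
                        unsigned CurrScore) {
  Scores[GenericSlotSketch] = CurrScore; // always updated, no matter what
  if (Slot != 0)
    Scores[Slot] = CurrScore;
}

// Returns the score to wait for. A known specific slot with an entry is used
// directly. Anything else falls back to the generic entry, which covers all
// operations and is therefore conservatively correct.
unsigned scoreToWaitForSketch(const SlotScoresSketch &Scores, unsigned Slot) {
  if (Slot != 0) {
    auto It = Scores.find(Slot);
    if (It != Scores.end())
      return It->second;
  }
  auto Generic = Scores.find(GenericSlotSketch);
  return Generic != Scores.end() ? Generic->second : 0;
}
```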

@Pierre-vh Pierre-vh merged commit bf93440 into main Dec 9, 2025
10 checks passed
@Pierre-vh Pierre-vh deleted the users/pierre-vh/refactor-insertwaitcnt-regunits branch December 9, 2025 12:51
@llvm-ci
Collaborator

llvm-ci commented Dec 9, 2025

LLVM Buildbot has detected a new failure on builder clang-hip-vega20 running on hip-vega20-0 while building llvm at step 3 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/123/builds/31920

Here is the relevant piece of the build log for the reference
Step 3 (annotate) failure: '../llvm-zorg/zorg/buildbot/builders/annotated/hip-build.sh --jobs=' (failure)
...
[59/61] Linking CXX executable External/HIP/cmath-hip-7.0.2
[60/61] Building CXX object External/HIP/CMakeFiles/TheNextWeek-hip-7.0.2.dir/workload/ray-tracing/TheNextWeek/main.cc.o
[61/61] Linking CXX executable External/HIP/TheNextWeek-hip-7.0.2
+ build_step 'Testing HIP test-suite'
+ echo '@@@BUILD_STEP Testing HIP test-suite@@@'
+ ninja check-hip-simple
@@@BUILD_STEP Testing HIP test-suite@@@
[0/1] cd /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/test-suite-build/External/HIP && /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/llvm/bin/llvm-lit -sv array-hip-7.0.2.test empty-hip-7.0.2.test with-fopenmp-hip-7.0.2.test saxpy-hip-7.0.2.test memmove-hip-7.0.2.test memset-hip-7.0.2.test split-kernel-args-hip-7.0.2.test builtin-logb-scalbn-hip-7.0.2.test TheNextWeek-hip-7.0.2.test algorithm-hip-7.0.2.test cmath-hip-7.0.2.test complex-hip-7.0.2.test math_h-hip-7.0.2.test new-hip-7.0.2.test blender.test
-- Testing: 15 tests, 15 workers --
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90
FAIL: test-suite :: External/HIP/blender.test (15 of 15)
******************** TEST 'test-suite :: External/HIP/blender.test' FAILED ********************

/home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/test-suite-build/tools/timeit-target --timeout 7200 --limit-core 0 --limit-cpu 7200 --limit-file-size 209715200 --limit-rss-size 838860800 --append-exitstatus --redirect-output /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/test-suite-build/External/HIP/Output/blender.test.out --redirect-input /dev/null --summary /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/test-suite-build/External/HIP/Output/blender.test.time /bin/bash test_blender.sh
/bin/bash verify_blender.sh /home/botworker/bbot/clang-hip-vega20/botworker/clang-hip-vega20/test-suite-build/External/HIP/Output/blender.test.out
Begin Blender test.
TEST_SUITE_HIP_ROOT=/opt/botworker/llvm/External/hip
Render /opt/botworker/llvm/External/hip/Blender_Scenes/290skydemo_release.blend
Blender 4.1.1 (hash e1743a0317bc built 2024-04-15 23:47:45)
Read blend: "/opt/botworker/llvm/External/hip/Blender_Scenes/290skydemo_release.blend"
Could not open as Ogawa file from provided streams.
Unable to open /opt/botworker/llvm/External/hip/Blender_Scenes/290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.002", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.003", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.004", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.001", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
Could not open as Ogawa file from provided streams.
Unable to open /opt/botworker/llvm/External/hip/Blender_Scenes/290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.002", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.003", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.004", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
WARN (bke.modifier): source/blender/blenkernel/intern/modifier.cc:425 BKE_modifier_set_error: Object: "GEO-flag.001", Modifier: "MeshSequenceCache", Could not create reader for file //290skydemo2_flags.abc
I1209 13:04:21.770818 1302883 device.cpp:39] HIPEW initialization succeeded
I1209 13:04:21.774643 1302883 device.cpp:45] Found HIPCC hipcc
I1209 13:04:21.859560 1302883 device.cpp:207] Device has compute preemption or is not used for display.
I1209 13:04:21.859576 1302883 device.cpp:211] Added device "" with id "HIP__0000:83:00".
I1209 13:04:21.859656 1302883 device.cpp:568] Mapped host memory limit set to 1,009,924,165,632 bytes. (940.56G)
I1209 13:04:21.859932 1302883 device_impl.cpp:63] Using AVX2 CPU kernels.
Fra:1 Mem:524.00M (Peak 524.70M) | Time:00:00.48 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Eyepiece_rim
Fra:1 Mem:524.00M (Peak 524.70M) | Time:00:00.48 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Rivets.015
Fra:1 Mem:524.00M (Peak 524.70M) | Time:00:00.48 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Rivets.025
Fra:1 Mem:524.12M (Peak 524.70M) | Time:00:00.48 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Curve_Cables.004
Fra:1 Mem:532.72M (Peak 533.27M) | Time:00:00.48 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Curve_Connectors
Fra:1 Mem:534.08M (Peak 534.08M) | Time:00:00.48 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Curve_Connectors.009
Fra:1 Mem:534.20M (Peak 534.20M) | Time:00:00.48 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Pistons
Fra:1 Mem:534.34M (Peak 534.33M) | Time:00:00.48 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Eyepiece_Insides
Fra:1 Mem:534.72M (Peak 534.72M) | Time:00:00.48 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Head_greeble
Step 12 (Testing HIP test-suite) failure: Testing HIP test-suite (failure)
Fra:1 Mem:535.03M (Peak 535.04M) | Time:00:00.49 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Head_greeble.005
Fra:1 Mem:535.44M (Peak 535.42M) | Time:00:00.49 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Curve_Wires
Fra:1 Mem:602.29M (Peak 602.29M) | Time:00:00.49 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Curve_wires
Fra:1 Mem:605.83M (Peak 605.83M) | Time:00:00.49 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Curve_Connectors.003
Fra:1 Mem:605.88M (Peak 605.89M) | Time:00:00.49 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Head_plates.001
Fra:1 Mem:605.93M (Peak 605.93M) | Time:00:00.49 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | ENV-fog
Fra:1 Mem:607.30M (Peak 607.30M) | Time:00:00.49 | Mem:0.00M, Peak:0.00M | Scene, View Layer | Synchronizing object | GEO-Ground