Skip to content

Conversation

@perlfu
Copy link
Contributor

@perlfu perlfu commented May 7, 2025

Remove recursion to avoid stack overflow on large CFGs.
Avoid worklist for hazard search within single MachineBasicBlock.
Ensure predecessors are visited for all state combinations.

@perlfu perlfu requested review from jayfoad and rampitec May 7, 2025 10:22
@llvmbot
Copy link
Member

llvmbot commented May 7, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Carl Ritson (perlfu)

Changes

Remove recursion to avoid stack overflow on large CFGs.
Avoid worklist for hazard search within single MachineBasicBlock.
Ensure predecessors are visited for all state combinations.


Full diff: https://github.com/llvm/llvm-project/pull/138841.diff

1 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp (+48-31)
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
index aaefe27b1324f..644fbb77a495a 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
@@ -436,42 +436,55 @@ using IsExpiredFn = function_ref<bool(const MachineInstr &, int WaitStates)>;
 using GetNumWaitStatesFn = function_ref<unsigned int(const MachineInstr &)>;
 
 // Search for a hazard in a block and its predecessors.
+// StateT must implement getHashValue().
 template <typename StateT>
 static bool
-hasHazard(StateT State,
+hasHazard(StateT InitialState,
           function_ref<HazardFnResult(StateT &, const MachineInstr &)> IsHazard,
           function_ref<void(StateT &, const MachineInstr &)> UpdateState,
-          const MachineBasicBlock *MBB,
-          MachineBasicBlock::const_reverse_instr_iterator I,
-          DenseSet<const MachineBasicBlock *> &Visited) {
-  for (auto E = MBB->instr_rend(); I != E; ++I) {
-    // No need to look at parent BUNDLE instructions.
-    if (I->isBundle())
-      continue;
+          const MachineBasicBlock *InitialMBB,
+          MachineBasicBlock::const_reverse_instr_iterator InitialI) {
+  SmallVector<std::pair<const MachineBasicBlock *, StateT>> Worklist;
+  DenseSet<std::pair<const MachineBasicBlock *, unsigned>> Visited;
+  const MachineBasicBlock *MBB = InitialMBB;
+  StateT State = InitialState;
+  auto I = InitialI;
+
+  for (;;) {
+    bool Expired = false;
+    for (auto E = MBB->instr_rend(); I != E; ++I) {
+      // No need to look at parent BUNDLE instructions.
+      if (I->isBundle())
+        continue;
 
-    switch (IsHazard(State, *I)) {
-    case HazardFound:
-      return true;
-    case HazardExpired:
-      return false;
-    default:
-      // Continue search
-      break;
-    }
+      auto Result = IsHazard(State, *I);
+      if (Result == HazardFound)
+        return true;
+      if (Result == HazardExpired) {
+        Expired = true;
+        break;
+      }
 
-    if (I->isInlineAsm() || I->isMetaInstruction())
-      continue;
+      if (I->isInlineAsm() || I->isMetaInstruction())
+        continue;
 
-    UpdateState(State, *I);
-  }
+      UpdateState(State, *I);
+    }
 
-  for (MachineBasicBlock *Pred : MBB->predecessors()) {
-    if (!Visited.insert(Pred).second)
-      continue;
+    if (!Expired) {
+      unsigned StateHash = State.getHashValue();
+      for (MachineBasicBlock *Pred : MBB->predecessors()) {
+        if (!Visited.insert(std::pair(Pred, StateHash)).second)
+          continue;
+        Worklist.emplace_back(Pred, State);
+      }
+    }
 
-    if (hasHazard(State, IsHazard, UpdateState, Pred, Pred->instr_rbegin(),
-                  Visited))
-      return true;
+    if (Worklist.empty())
+      break;
+
+    std::tie(MBB, State) = Worklist.pop_back_val();
+    I = MBB->instr_rbegin();
   }
 
   return false;
@@ -1624,6 +1637,10 @@ bool GCNHazardRecognizer::fixVALUPartialForwardingHazard(MachineInstr *MI) {
     SmallDenseMap<Register, int, 4> DefPos;
     int ExecPos = std::numeric_limits<int>::max();
     int VALUs = 0;
+
+    unsigned getHashValue() const {
+      return hash_combine(ExecPos, VALUs, hash_combine_range(DefPos));
+    }
   };
 
   StateType State;
@@ -1718,9 +1735,8 @@ bool GCNHazardRecognizer::fixVALUPartialForwardingHazard(MachineInstr *MI) {
       State.VALUs += 1;
   };
 
-  DenseSet<const MachineBasicBlock *> Visited;
   if (!hasHazard<StateType>(State, IsHazardFn, UpdateStateFn, MI->getParent(),
-                            std::next(MI->getReverseIterator()), Visited))
+                            std::next(MI->getReverseIterator())))
     return false;
 
   BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),
@@ -1761,6 +1777,8 @@ bool GCNHazardRecognizer::fixVALUTransUseHazard(MachineInstr *MI) {
   struct StateType {
     int VALUs = 0;
     int TRANS = 0;
+
+    unsigned getHashValue() const { return hash_combine(VALUs, TRANS); }
   };
 
   StateType State;
@@ -1796,9 +1814,8 @@ bool GCNHazardRecognizer::fixVALUTransUseHazard(MachineInstr *MI) {
       State.TRANS += 1;
   };
 
-  DenseSet<const MachineBasicBlock *> Visited;
   if (!hasHazard<StateType>(State, IsHazardFn, UpdateStateFn, MI->getParent(),
-                            std::next(MI->getReverseIterator()), Visited))
+                            std::next(MI->getReverseIterator())))
     return false;
 
   // Hazard is observed - insert a wait on va_dst counter to ensure hazard is

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't safe in the presence of hash collisions.

As an alternative, could StateT have a "merge" or "min" function which takes two states and chooses the most constrained one, or the most contraining combination of them? Then if we reach the same predecessor twice by different paths we can merge the states and continue with the merged state.

Then perhaps the worklist could be a MapVector<const MachineBasicBlock *, StateT> with no need for a separate Visited set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote this on the basis that is safer than current solution that never revisits a predecessor.

I had considered a solution similar to the one you propose.
But had performance concerns.

I'll work it through and test performance, but I do think we should consider a partial solution such as the one thing patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this more and I concluded a merge or min state is not viable.
The min state would have to be "unsafer" of the two, which could create convergence problems.
Merging states would lead to visiting blocks with synthetic states that don't actually exist.
For some hazards it might not be possible to derive a merged state or determine the least safe state.

I have reworked the implementation to address hash collisions.
This now keeps a deduplicated vector or states, which is the only place states are retained.
Worklist and visited set retain indexes into this vector which are constant and unique for execution of the search.
Hashing is used to map states to vector indexes with a slow path provided for collisions.
(I have seen no collisions in my in depth testing.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this more and I concluded a merge or min state is not viable.
The min state would have to be "unsafer" of the two, which could create convergence problems.
Merging states would lead to visiting blocks with synthetic states that don't actually exist.
For some hazards it might not be possible to derive a merged state or determine the least safe state.

I disagree with your conclusions, but I guess your way is OK for now, since you've implemented it. Just to be clear: the tradeoff you're making here is that you will potentially revisit the same MBB multiple times with different states, corresponding to different paths back through the CFG that can reach that block (even in the absence of loops).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a fairly simple hazard like GCNHazardRecognizer::fixVALUTransUseHazard.
The state is only two counts {VALUs, transcendental instructions}.
How do you merge for example {3, 0} and {2, 1}?
The most pessimistic option is {2, 0} -- least likely for hazard to expire.
The optimistic option is {3, 1} -- but can miss hazards.
The next visited block might equally expire both of the existing states, but not the pessimistically merged one, potentially increasing the number of blocks search over simply testing each state or at least equally it.
This approach is also likely to yield false positives.
If we avoid duplicate visits for the same state, then convergence is possible.

In the above example I can propose a merge state, but with sets of registers, I do not think I reasonably can.
Hazard behaviour can be inherently different for each state.
Overall my position is that for correctness we must visit each MBB with each possible reachable state.
The concession is that we avoid duplicates visits for the same state, and my testing has already shown this provides a performance gain over the existing implementation

@perlfu
Copy link
Contributor Author

perlfu commented May 22, 2025

Ping

@perlfu
Copy link
Contributor Author

perlfu commented Jun 2, 2025

Ping

@perlfu perlfu requested review from arsenm and piotrAMD June 16, 2025 01:16
@perlfu
Copy link
Contributor Author

perlfu commented Jun 30, 2025

Ping

@perlfu perlfu force-pushed the amdgpu-has-hazard-rework branch from c8c36ab to 4d6dded Compare July 15, 2025 10:39
@perlfu
Copy link
Contributor Author

perlfu commented Jul 16, 2025

Can anyone take another look at this?
I'd like to land this and #138663

Remove recursion to avoid stack overflow on large CFGs.
Avoid worklist for hazard search within single MachineBasicBlock.
Ensure predecessors are visited for all state combinations.
- Handle hashing collisions
@perlfu perlfu force-pushed the amdgpu-has-hazard-rework branch from 4d6dded to 9532454 Compare September 4, 2025 05:53
Copy link
Collaborator

@rovka rovka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine to me with a couple readability nits, but please wait a few days in case Jay wants to have another look.

Copy link
Contributor

@jayfoad jayfoad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the behaviour you've imlemented is OK but I'd really like to see the implementation cleaned up as mentioned inline.

Comment on lines 448 to 449
function_ref<HazardFnResult(StateT &, const MachineInstr &)> IsHazard,
function_ref<void(StateT &, const MachineInstr &)> UpdateState,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not something to address in the current patch, but I think this IsHazard/UpdateState interface is confusing because IsHazard also updates the state. So technically I think there is no need for the separate UpdateState function, it could just become part of what IsHazard does before returning NoHazardFound.

SmallDenseMap<unsigned, unsigned> StateHash2Idx;
SmallVector<StateT> States;

// States contains a vector of unique state structures.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels overcomplicated and it's also similar to what MapVector does internally. I think you could probably reimplement this with a MapVector<StateT, void>, using MapVector::getArrayRef to convert between states and StateIdxs. This would have the (I claim) benefit that StateT would have to support standard DenseMap traits instead of the less standard getHashValue and isEqual methods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stepping back a bit, why do you want this level of indirection in the first place? Is it "just" a performance thing, to avoid copying around states which could be large, and avoid storing multiple copies of identical states? The MapVector solution would still store two copies of each state, since MapVector has a SmallVector of keys+values as well as a DenseMap of the keys.

Another possibility might be to have a DenseMap of pointers to states hashed by the state itself not the pointer value, a bit like this does for MachineInstrs: **

/// Special DenseMapInfo traits to compare MachineInstr* by *value* of the

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could probably reimplement MapVector<StateT, void>, using MapVector::getArrayRef
getArrayRef only provides access to the values, not the keys.

Stepping back a bit, why do you want this level of indirection in the first place?
Avoiding copying for performance is the primary reason. Because the data is backed by a SmallVector pointers are also not an option (will invalid).

DenseMap allows for alternate Key and LookupKey types specified via traits.
I have reimplemented using this mechanism; however, the actual key type needs to be a pointer to the backing SmallVector and an index into it, because traits methods are static -- if they were not static I could capture the SmallVector reference and keys would only need to be indexes into the array.

Overall implementation is more code, but runs with comparable performance to previous version in this PR. Slightly worse, but still net improvement against current code without this patch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

however, the actual key type needs to be a pointer to the backing SmallVector and an index into it, because traits methods are static -- if they were not static I could capture the SmallVector reference and keys would only need to be indexes into the array.

If you use std::deque then you can work with pointers to the StateT objects, not indexes, because deque promises not to move existing allocations as it grows.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can put up a version that uses deque and SmallDenseMap from pointer to pointer if you prefer?
It doesn't look that different to this, but it is measurably a little slower.

A lot of the benefit of this refactor comes from the fact that most hasHazard calls can be handled within a block.
Hence avoiding touching any data structures and doing any memory allocation is beneficial -- most std::deque implementation do alloca on init AFAIK.

Beyond improved correctness the secondary goal of this patch is to try and claw back some of the impact of #138663.
Using std::deque certainly negates a lot of that.
Version prior to current was still the best.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I still don't like the hand-rolled hashtable implementation, but I guess I won't object if it's measurably faster and if you can find someone to review it for correctness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you accept the current implementation using SmallVector and SmallDenseMap?
To re-iterate, this version while not the best is still better than current implementation in both performance and correctness.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll take another look at the current version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this more and I concluded a merge or min state is not viable.
The min state would have to be "unsafer" of the two, which could create convergence problems.
Merging states would lead to visiting blocks with synthetic states that don't actually exist.
For some hazards it might not be possible to derive a merged state or determine the least safe state.

I disagree with your conclusions, but I guess your way is OK for now, since you've implemented it. Just to be clear: the tradeoff you're making here is that you will potentially revisit the same MBB multiple times with different states, corresponding to different paths back through the CFG that can reach that block (even in the absence of loops).

Copy link
Contributor

@jayfoad jayfoad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@perlfu perlfu merged commit e60ca86 into llvm:main Sep 24, 2025
9 checks passed
@llvm-ci
Copy link
Collaborator

llvm-ci commented Sep 26, 2025

LLVM Buildbot has detected a new failure on builder ppc64le-flang-rhel-clang running on ppc64le-flang-rhel-test while building llvm at step 4 "cmake-configure".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/157/builds/40283

Here is the relevant piece of the build log for the reference
Step 4 (cmake-configure) failure: cmake (failure) (timed out)
-- The C compiler identification is Clang 19.1.7
command timed out: 1200 seconds without output running [b'cmake', b'-DLLVM_TARGETS_TO_BUILD=PowerPC', b'-DLLVM_INSTALL_UTILS=ON', b'-DCMAKE_CXX_STANDARD=17', b'-DLLVM_LIT_ARGS=-vj 256', b'-DFLANG_ENABLE_WERROR=ON', b'-DLLVM_ENABLE_ASSERTIONS=ON', b'-DCMAKE_C_COMPILER_LAUNCHER=ccache', b'-DCMAKE_CXX_COMPILER_LAUNCHER=ccache', b'-DLLVM_ENABLE_PROJECTS=mlir;flang;llvm;clang', b'-DLLVM_ENABLE_RUNTIMES=openmp;flang-rt', b'-DCMAKE_BUILD_TYPE=Release', b'-GNinja', b'../llvm-project/llvm'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1203.359984

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants