
Conversation

linuxrocks123 (Contributor) commented Aug 8, 2025

This attempts to resolve SWDEV-547512 by inhibiting alloca to LDS promotion when register pressure is high. Draft because:

  • There are no tests.
  • The heuristic probably needs to be refined.
  • This may not even be a good idea; performance testing must be completed to determine that.

github-actions bot commented Aug 8, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository; in that case, you can instead tag reviewers by name in a comment using @ followed by their GitHub username.

If you receive no comments on your PR for a week, you can request a review by "pinging" the PR: add a comment saying "Ping". The common courtesy ping rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

tgymnich (Member) commented:

Depending on the density of your map, one could use getNumber() on BasicBlock to get a unique number for mapping.
E.g. this could be a SmallVector of sets with a size of F.getMaxBlockNumber().
For the sets I'd suggest using a SmallPtrSet instead.

linuxrocks123 (Contributor Author) replied:

That may perform slightly better, but it would be harder to read. Perhaps it would be possible to encapsulate this algorithm into a set data structure somehow.

linuxrocks123 (Contributor Author) added:

@tgymnich I can change it to SmallPtrSet for the value type, but the problem with using a vector is that I need to know whether the map contains a value. That would require making a vector of pointers to the SmallPtrSet instead of a vector of SmallPtrSets. I'm not sure if that would perform better than what I currently have.

@linuxrocks123 linuxrocks123 force-pushed the swdev-547512 branch 2 times, most recently from 8a3fddc to 177ee8a Compare August 11, 2025 20:07
@kzhuravl kzhuravl requested review from Pierre-vh and shiltian August 12, 2025 14:23
@linuxrocks123 linuxrocks123 marked this pull request as ready for review August 14, 2025 21:31
llvmbot (Member) commented Aug 14, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Patrick Simmons (linuxrocks123)

Changes

This attempts to resolve SWDEV-547512 by inhibiting alloca to LDS promotion when register pressure is high. Draft because:

  • There are no tests.
  • The heuristic probably needs to be refined.
  • This may not even be a good idea; performance testing must be completed to determine that.

Full diff: https://github.com/llvm/llvm-project/pull/152814.diff

1 file affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp (+92)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index f226c7f381aa2..fe41705beeb2a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -27,7 +27,9 @@
 
 #include "AMDGPU.h"
 #include "GCNSubtarget.h"
+#include "SIRegisterInfo.h"
 #include "Utils/AMDGPUBaseInfo.h"
+#include "llvm/ADT/PostOrderIterator.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/Analysis/CaptureTracking.h"
 #include "llvm/Analysis/InstSimplifyFolder.h"
@@ -36,6 +38,7 @@
 #include "llvm/Analysis/ValueTracking.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
 #include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/Instruction.h"
 #include "llvm/IR/IntrinsicInst.h"
 #include "llvm/IR/IntrinsicsAMDGPU.h"
 #include "llvm/IR/IntrinsicsR600.h"
@@ -45,6 +48,9 @@
 #include "llvm/Target/TargetMachine.h"
 #include "llvm/Transforms/Utils/SSAUpdater.h"
 
+#include <algorithm>
+#include <unordered_set>
+
 #define DEBUG_TYPE "amdgpu-promote-alloca"
 
 using namespace llvm;
@@ -100,6 +106,14 @@ class AMDGPUPromoteAllocaImpl {
   unsigned VGPRBudgetRatio;
   unsigned MaxVectorRegs;
 
+  std::unordered_map<BasicBlock *, std::unordered_set<Instruction *>>
+      SGPRLiveIns;
+  size_t getSGPRPressureEstimate(AllocaInst &I);
+
+  std::unordered_map<BasicBlock *, std::unordered_set<Instruction *>>
+      VGPRLiveIns;
+  size_t getVGPRPressureEstimate(AllocaInst &I);
+
   bool IsAMDGCN = false;
   bool IsAMDHSA = false;
 
@@ -1471,9 +1485,87 @@ bool AMDGPUPromoteAllocaImpl::hasSufficientLocalMem(const Function &F) {
   return true;
 }
 
+size_t AMDGPUPromoteAllocaImpl::getSGPRPressureEstimate(AllocaInst &I) {
+  Function &F = *I.getFunction();
+  size_t MaxLive = 0;
+  for (BasicBlock *BB : post_order(&F)) {
+    if (SGPRLiveIns.count(BB))
+      continue;
+
+    std::unordered_set<Instruction *> CurrentlyLive;
+    for (BasicBlock *SuccBB : successors(BB))
+      if (SGPRLiveIns.count(SuccBB))
+        for (const auto &R : SGPRLiveIns[SuccBB])
+          CurrentlyLive.insert(R);
+
+    for (auto RIt = BB->rbegin(); RIt != BB->rend(); RIt++) {
+      if (&*RIt == &I)
+        return MaxLive;
+
+      MaxLive = std::max(MaxLive, CurrentlyLive.size());
+
+      for (auto &Op : RIt->operands())
+        if (!Op.get()->getType()->isVectorTy())
+          if (Instruction *U = dyn_cast<Instruction>(Op))
+            CurrentlyLive.insert(U);
+
+      if (!RIt->getType()->isVectorTy())
+        CurrentlyLive.erase(&*RIt);
+    }
+
+    SGPRLiveIns[BB] = CurrentlyLive;
+  }
+
+  llvm_unreachable("Woops, we fell off the edge of the world.  Bye bye.");
+}
+
+size_t AMDGPUPromoteAllocaImpl::getVGPRPressureEstimate(AllocaInst &I) {
+  Function &F = *I.getParent()->getParent();
+  size_t MaxLive = 0;
+  for (BasicBlock *BB : post_order(&F)) {
+    if (VGPRLiveIns.count(BB))
+      continue;
+
+    std::unordered_set<Instruction *> CurrentlyLive;
+    for (BasicBlock *SuccBB : successors(BB))
+      if (VGPRLiveIns.count(SuccBB))
+        for (const auto &R : VGPRLiveIns[SuccBB])
+          CurrentlyLive.insert(R);
+
+    for (auto RIt = BB->rbegin(); RIt != BB->rend(); RIt++) {
+      if (&*RIt == &I)
+        return MaxLive;
+
+      MaxLive = std::max(MaxLive, CurrentlyLive.size());
+
+      for (auto &Op : RIt->operands())
+        if (Op.get()->getType()->isVectorTy())
+          if (Instruction *U = dyn_cast<Instruction>(Op))
+            CurrentlyLive.insert(U);
+
+      if (RIt->getType()->isVectorTy())
+        CurrentlyLive.erase(&*RIt);
+    }
+
+    VGPRLiveIns[BB] = CurrentlyLive;
+  }
+
+  llvm_unreachable("Woops, we fell off the edge of the world.  Bye bye.");
+}
+
 // FIXME: Should try to pick the most likely to be profitable allocas first.
 bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToLDS(AllocaInst &I,
                                                     bool SufficientLDS) {
+  const unsigned SGPRPressureLimit = AMDGPU::SGPR_32RegClass.getNumRegs();
+  const unsigned VGPRPressureLimit = AMDGPU::VGPR_32RegClass.getNumRegs();
+
+  if (getSGPRPressureEstimate(I) > SGPRPressureLimit ||
+      getVGPRPressureEstimate(I) > VGPRPressureLimit) {
+    LLVM_DEBUG(dbgs() << "Declining to promote " << I
+                      << " to LDS since pressure is relatively high.\n");
+    return false;
+  }
+
   LLVM_DEBUG(dbgs() << "Trying to promote to LDS: " << I << '\n');
 
   if (DisablePromoteAllocaToLDS) {

linuxrocks123 (Contributor Author) commented:

@arsenm I have some thoughts about this that aren't directly related to this PR but are rather related to this pass in general.

I feel like IR-level is too early to be running this pass, and I think that this promotion logic may be more appropriate as part of the register allocator's spill logic. In particular, the spill handler could check if a value can be spilled to LDS instead of the stack and then spill it there instead of the stack. Perhaps the spill cost estimator could also consider LDS-eligibility when deciding how expensive a particular value would be to spill.

Implementing this would require annotating alloca() calls that have been promoted to virtual registers with some sort of notation indicating that it is legal to spill them to LDS instead of the stack, but that shouldn't be too hard to do, right?

Please let me know your thoughts on this.

arsenm (Contributor) left a comment:

I don't think this is a good idea, and you're taking what they are asking for at face value.
If anything we should try to be more aggressive about LDS usage when pressure is high.

I maintain:

  • LDS is always preferable to stack access, which is the alternative here.
  • We basically always want to do this if the LDS budget permits.
  • The testcase claiming this is slower needs more investigation into why it's slower; it sounds more like a second-order effect that should be addressed separately.

arsenm (Contributor) commented Aug 15, 2025

> I feel like IR-level is too early to be running this pass, and I think that this promotion logic may be more appropriate as part of the register allocator's spill logic.

These are complementary things, we ought to be doing both. There have been previous attempts to implement spill to LDS, but that would never replace this. We would need to tweak the heuristics to reserve LDS space for the later spilling, but these solve different problems.

> Implementing this would require annotating alloca() calls that have been promoted to virtual registers with some sort of notation indicating that it is legal to spill them to LDS instead of the stack, but that shouldn't be too hard to do, right?

I'm not really sure what you're describing here. SROA + this pass are the main "promote alloca to virtual register" cases, nothing in codegen does that. The allocas also no longer exist, so there's nothing to track? The allocator isn't trying to reinterpret allocas (nor should it?)

llvm_unreachable("Woops, we fell off the edge of the world. Bye bye.");
}

size_t AMDGPUPromoteAllocaImpl::getVGPRPressureEstimate(AllocaInst &I) {
arsenm (Contributor):

The IR has no knowledge of SGPRs or VGPRs, and you're missing out on all the pressures exposed by legalization

if (Instruction *U = dyn_cast<Instruction>(Op))
CurrentlyLive.insert(U);

if (!RIt->getType()->isVectorTy())
arsenm (Contributor):

SGPR does not mean "not an IR vector type"

linuxrocks123 (Contributor Author):

@arsenm do you know of a better heuristic for estimating SGPR pressure at the IR level?

if (Instruction *U = dyn_cast<Instruction>(Op))
CurrentlyLive.insert(U);

if (RIt->getType()->isVectorTy())
arsenm (Contributor):

VGPR doesn't mean "IR vector type"

linuxrocks123 (Contributor Author) replied Aug 15, 2025:

@arsenm do you know of a better heuristic for estimating VGPR pressure at the IR level?

arsenm (Contributor):

No, but the type has nothing to do with it. We don't have any real / precise attempts at pressure heuristics in IR.

linuxrocks123 (Contributor Author) commented:

> I don't think this is a good idea, and you're taking what they are asking for at face value. If anything we should try to be more aggressive about LDS usage when pressure is high.
>
> I maintain:
>
> * LDS is always preferable to stack access, which is the alternative here.
> * We basically always want to do this if the LDS budget permits
> * The testcase claiming this is slower needs more investigation for why it's slower, it sounds more like a second order effect that should be addressed separately

Thanks for the feedback. I am still becoming familiar with our ISA. Some thoughts:

  • Perhaps the real problem for the user is that SROA is not promoting the alloca() to virtual registers, and SROA should therefore be examined to see if it's missing an opportunity for the user's testcase.
  • Perhaps some optimization eliminates unnecessary loads from and stores to the stack because the stack is known to be thread-local (it is, right?), but fails to do the same for LDS accesses, either because it is unaware of them or because LDS is shared between threads and it therefore cannot assume thread-locality. This could be addressed by improving that optimization and/or by creating new intrinsics for LDS loads and stores that are known to be thread-local.

Do you have any thoughts on these possibilities?

arsenm (Contributor) commented Aug 20, 2025

>   • Perhaps the real problem for the user is that SROA is not promoting the alloca() to virtual registers, and SROA should therefore be examined to see if it's missing an opportunity for the user's testcase.

Yes, that's always preferable. No stack is better than LDS, and LDS is better than stack, which is usually the single worst thing you can use. At some point you need a large chunk of addressable memory; the stack is just the worst case.

>   • Perhaps some optimization is able to eliminate unnecessary loads from and stores to the stack,

Those are the fundamentals of optimization

> because the stack is known to be thread-local (it is, right?), but that optimization is unaware of our LDS intrinsics,

Ordinary LDS usage is just regular load/store with addrspace(3). There aren't special LDS intrinsics for these cases. In this example we're just swapping out the memory access type, which should be treated equivalently well by downstream optimizations.
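The point can be illustrated with hand-written IR (not taken from the patch): promotion swaps the address space of the memory being accessed, while the accesses themselves remain ordinary load and store instructions.

```llvm
; Private stack object on AMDGPU: addrspace(5).
%p = alloca i32, align 4, addrspace(5)
store i32 1, ptr addrspace(5) %p
%v = load i32, ptr addrspace(5) %p

; The LDS form has the same load/store shape, just addrspace(3)
; on a module-level global; no special intrinsics are involved.
@lds = internal addrspace(3) global i32 poison, align 4
store i32 1, ptr addrspace(3) @lds
%w = load i32, ptr addrspace(3) @lds
```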
