
Conversation

linuxrocks123 (Contributor) commented Aug 8, 2025

This attempts to resolve SWDEV-547512 by inhibiting alloca to LDS promotion when register pressure is high. Draft because:

  • There are no tests.
  • The heuristic probably needs to be refined.
  • This may not even be a good idea; performance testing must be completed to determine that.

github-actions bot commented Aug 8, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository; in that case, you can instead tag reviewers by name in a comment using @ followed by their GitHub username.

If you receive no comments on your PR for a week, you can request a review by "pinging" the PR: add a comment saying "Ping". The common courtesy ping rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

tgymnich (Member) commented:

Depending on the density of your map, one could use getNumber() on BasicBlock to get a unique number for mapping.
E.g. this could be a SmallVector of sets with a size of F.getMaxBlockNumber().
For the sets I'd suggest using a SmallPtrSet instead.

linuxrocks123 (Contributor Author) replied:

That may perform slightly better, but it would be harder to read. Perhaps it would be possible to encapsulate this algorithm into a set data structure somehow.

linuxrocks123 (Contributor Author) added:

@tgymnich I can change it to SmallPtrSet for the value type, but the problem with using a vector is that I need to know whether the map contains a value. That would require making a vector of pointers to the SmallPtrSet instead of a vector of SmallPtrSets. I'm not sure if that would perform better than what I currently have.

@linuxrocks123 linuxrocks123 force-pushed the swdev-547512 branch 2 times, most recently from 8a3fddc to 177ee8a Compare August 11, 2025 20:07
@kzhuravl kzhuravl requested review from Pierre-vh and shiltian August 12, 2025 14:23
@linuxrocks123 linuxrocks123 marked this pull request as ready for review August 14, 2025 21:31
llvmbot (Member) commented Aug 14, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Patrick Simmons (linuxrocks123)

Changes

This attempts to resolve SWDEV-547512 by inhibiting alloca to LDS promotion when register pressure is high. Draft because:

  • There are no tests.
  • The heuristic probably needs to be refined.
  • This may not even be a good idea; performance testing must be completed to determine that.

Full diff: https://github.com/llvm/llvm-project/pull/152814.diff

1 file affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp (+92)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index f226c7f381aa2..fe41705beeb2a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -27,7 +27,9 @@
 
 #include "AMDGPU.h"
 #include "GCNSubtarget.h"
+#include "SIRegisterInfo.h"
 #include "Utils/AMDGPUBaseInfo.h"
+#include "llvm/ADT/PostOrderIterator.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/Analysis/CaptureTracking.h"
 #include "llvm/Analysis/InstSimplifyFolder.h"
@@ -36,6 +38,7 @@
 #include "llvm/Analysis/ValueTracking.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
 #include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/Instruction.h"
 #include "llvm/IR/IntrinsicInst.h"
 #include "llvm/IR/IntrinsicsAMDGPU.h"
 #include "llvm/IR/IntrinsicsR600.h"
@@ -45,6 +48,9 @@
 #include "llvm/Target/TargetMachine.h"
 #include "llvm/Transforms/Utils/SSAUpdater.h"
 
+#include <algorithm>
+#include <unordered_set>
+
 #define DEBUG_TYPE "amdgpu-promote-alloca"
 
 using namespace llvm;
@@ -100,6 +106,14 @@ class AMDGPUPromoteAllocaImpl {
   unsigned VGPRBudgetRatio;
   unsigned MaxVectorRegs;
 
+  std::unordered_map<BasicBlock *, std::unordered_set<Instruction *>>
+      SGPRLiveIns;
+  size_t getSGPRPressureEstimate(AllocaInst &I);
+
+  std::unordered_map<BasicBlock *, std::unordered_set<Instruction *>>
+      VGPRLiveIns;
+  size_t getVGPRPressureEstimate(AllocaInst &I);
+
   bool IsAMDGCN = false;
   bool IsAMDHSA = false;
 
@@ -1471,9 +1485,87 @@ bool AMDGPUPromoteAllocaImpl::hasSufficientLocalMem(const Function &F) {
   return true;
 }
 
+size_t AMDGPUPromoteAllocaImpl::getSGPRPressureEstimate(AllocaInst &I) {
+  Function &F = *I.getFunction();
+  size_t MaxLive = 0;
+  for (BasicBlock *BB : post_order(&F)) {
+    if (SGPRLiveIns.count(BB))
+      continue;
+
+    std::unordered_set<Instruction *> CurrentlyLive;
+    for (BasicBlock *SuccBB : successors(BB))
+      if (SGPRLiveIns.count(SuccBB))
+        for (const auto &R : SGPRLiveIns[SuccBB])
+          CurrentlyLive.insert(R);
+
+    for (auto RIt = BB->rbegin(); RIt != BB->rend(); RIt++) {
+      if (&*RIt == &I)
+        return MaxLive;
+
+      MaxLive = std::max(MaxLive, CurrentlyLive.size());
+
+      for (auto &Op : RIt->operands())
+        if (!Op.get()->getType()->isVectorTy())
+          if (Instruction *U = dyn_cast<Instruction>(Op))
+            CurrentlyLive.insert(U);
+
+      if (!RIt->getType()->isVectorTy())
+        CurrentlyLive.erase(&*RIt);
+    }
+
+    SGPRLiveIns[BB] = CurrentlyLive;
+  }
+
+  llvm_unreachable("Woops, we fell off the edge of the world.  Bye bye.");
+}
+
+size_t AMDGPUPromoteAllocaImpl::getVGPRPressureEstimate(AllocaInst &I) {
+  Function &F = *I.getParent()->getParent();
+  size_t MaxLive = 0;
+  for (BasicBlock *BB : post_order(&F)) {
+    if (VGPRLiveIns.count(BB))
+      continue;
+
+    std::unordered_set<Instruction *> CurrentlyLive;
+    for (BasicBlock *SuccBB : successors(BB))
+      if (VGPRLiveIns.count(SuccBB))
+        for (const auto &R : VGPRLiveIns[SuccBB])
+          CurrentlyLive.insert(R);
+
+    for (auto RIt = BB->rbegin(); RIt != BB->rend(); RIt++) {
+      if (&*RIt == &I)
+        return MaxLive;
+
+      MaxLive = std::max(MaxLive, CurrentlyLive.size());
+
+      for (auto &Op : RIt->operands())
+        if (Op.get()->getType()->isVectorTy())
+          if (Instruction *U = dyn_cast<Instruction>(Op))
+            CurrentlyLive.insert(U);
+
+      if (RIt->getType()->isVectorTy())
+        CurrentlyLive.erase(&*RIt);
+    }
+
+    VGPRLiveIns[BB] = CurrentlyLive;
+  }
+
+  llvm_unreachable("Woops, we fell off the edge of the world.  Bye bye.");
+}
+
 // FIXME: Should try to pick the most likely to be profitable allocas first.
 bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToLDS(AllocaInst &I,
                                                     bool SufficientLDS) {
+  const unsigned SGPRPressureLimit = AMDGPU::SGPR_32RegClass.getNumRegs();
+  const unsigned VGPRPressureLimit = AMDGPU::VGPR_32RegClass.getNumRegs();
+
+  if (getSGPRPressureEstimate(I) > SGPRPressureLimit ||
+      getVGPRPressureEstimate(I) > VGPRPressureLimit) {
+    LLVM_DEBUG(dbgs() << "Declining to promote " << I
+                      << " to LDS since pressure is relatively high.\n");
+    return false;
+  }
+
   LLVM_DEBUG(dbgs() << "Trying to promote to LDS: " << I << '\n');
 
   if (DisablePromoteAllocaToLDS) {

linuxrocks123 (Contributor Author) commented:

@arsenm I have some thoughts about this that aren't directly related to this PR but are rather related to this pass in general.

I feel like IR-level is too early to be running this pass, and I think that this promotion logic may be more appropriate as part of the register allocator's spill logic. In particular, the spill handler could check if a value can be spilled to LDS instead of the stack and then spill it there instead of the stack. Perhaps the spill cost estimator could also consider LDS-eligibility when deciding how expensive a particular value would be to spill.

Implementing this would require annotating alloca() calls that have been promoted to virtual registers with some sort of notation indicating that it is legal to spill them to LDS instead of the stack, but that shouldn't be too hard to do, right?

Please let me know your thoughts on this.

arsenm (Contributor) left a comment:

I don't think this is a good idea, and you're taking what they are asking for at face value.
If anything we should try to be more aggressive about LDS usage when pressure is high.

I maintain:

  • LDS is always preferable to stack access, which is the alternative here.
  • We basically always want to do this if the LDS budget permits.
  • The testcase claiming this is slower needs more investigation into why it's slower; it sounds more like a second-order effect that should be addressed separately.

arsenm (Contributor) commented Aug 15, 2025

> I feel like IR-level is too early to be running this pass, and I think that this promotion logic may be more appropriate as part of the register allocator's spill logic.

These are complementary things, we ought to be doing both. There have been previous attempts to implement spill to LDS, but that would never replace this. We would need to tweak the heuristics to reserve LDS space for the later spilling, but these solve different problems.

> Implementing this would require annotating alloca() calls that have been promoted to virtual registers with some sort of notation indicating that it is legal to spill them to LDS instead of the stack, but that shouldn't be too hard to do, right?

I'm not really sure what you're describing here. SROA + this pass are the main "promote alloca to virtual register" cases, nothing in codegen does that. The allocas also no longer exist, so there's nothing to track? The allocator isn't trying to reinterpret allocas (nor should it?)

llvm_unreachable("Woops, we fell off the edge of the world. Bye bye.");
}

size_t AMDGPUPromoteAllocaImpl::getVGPRPressureEstimate(AllocaInst &I) {
arsenm (Contributor):

The IR has no knowledge of SGPRs or VGPRs, and you're missing out on all the pressures exposed by legalization

if (Instruction *U = dyn_cast<Instruction>(Op))
CurrentlyLive.insert(U);

if (!RIt->getType()->isVectorTy())
arsenm (Contributor):

SGPR does not mean "not an IR vector type"

linuxrocks123 (Contributor Author):

@arsenm do you know of a better heuristic for estimating SGPR pressure at the IR level?

if (Instruction *U = dyn_cast<Instruction>(Op))
CurrentlyLive.insert(U);

if (RIt->getType()->isVectorTy())
arsenm (Contributor):

VGPR doesn't mean "IR vector type"

linuxrocks123 (Contributor Author) replied Aug 15, 2025:

@arsenm do you know of a better heuristic for estimating VGPR pressure at the IR level?

arsenm (Contributor):

No, but the type has nothing to do with it. We don't have any real / precise attempts at pressure heuristics in IR.

linuxrocks123 (Contributor Author) commented:

> I don't think this is a good idea, and you're taking what they are asking for at face value. If anything we should try to be more aggressive about LDS usage when pressure is high.
>
> I maintain:
>
> * LDS is always preferable to stack access, which is the alternative here.
> * We basically always want to do this if the LDS budget permits
> * The testcase claiming this is slower needs more investigation for why it's slower, it sounds more like a second order effect that should be addressed separately

Thanks for the feedback. I am still becoming familiar with our ISA. Some thoughts:

  • Perhaps the real problem for the user is that SROA is not promoting the alloca() to virtual registers, and SROA should therefore be examined to see if it's missing an opportunity for the user's testcase.
  • Perhaps some optimization eliminates unnecessary loads from and stores to the stack because the stack is known to be thread-local (it is, right?), but fails to do the same for LDS accesses, either because it is unaware of them or because LDS is shared between threads and it therefore cannot assume thread-locality. This could be addressed by improving that optimization and/or by creating new intrinsics for LDS loads and stores that are known to be thread-local.

Do you have any thoughts on these possibilities?

arsenm (Contributor) commented Aug 20, 2025

>   • Perhaps the real problem for the user is that SROA is not promoting the alloca() to virtual registers, and SROA should therefore be examined to see if it's missing an opportunity for the user's testcase.

Yes, that's always preferable. No stack is better than LDS, and LDS is better than stack, which is usually the single worst thing you can use. At some point you need a large chunk of addressable memory; the stack is just the worst case.

>   • Perhaps some optimization is able to eliminate unnecessary loads from and stores to the stack,

Those are the fundamentals of optimization

> because the stack is known to be thread-local (it is, right?), but that optimization is unaware of our LDS intrinsics,

Ordinary LDS usage is just regular load/store with addrspace(3). There aren't special LDS intrinsics for these cases. In this example we're just swapping out the memory access type, which should be treated equivalently well by downstream optimizations.
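The point can be illustrated with hand-written IR (not taken from the patch): promotion swaps the address space of the memory being accessed, while the accesses themselves remain ordinary load and store instructions.

```llvm
; Private stack object on AMDGPU: addrspace(5).
%p = alloca i32, align 4, addrspace(5)
store i32 1, ptr addrspace(5) %p
%v = load i32, ptr addrspace(5) %p

; The LDS form has the same load/store shape, just addrspace(3)
; on a module-level global; no special intrinsics are involved.
@lds = internal addrspace(3) global i32 poison, align 4
store i32 1, ptr addrspace(3) @lds
%w = load i32, ptr addrspace(3) @lds
```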
