- 
                Notifications
    
You must be signed in to change notification settings  - Fork 15.1k
 
Limit Alloca->LDS promotion based on speculations as to eventual register pressure #152814
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| 
          
 Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using  If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers. If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on density of your map one could use getNumber() on BasicBlock to get a unique number for mapping.
E.g. this could be a SmallVector of sets with a size of F.getMaxBlockNumber().
For the set I'd suggest using a SmallPtrSet instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That may perform slightly better, but it would be harder to read. Perhaps it would be possible to encapsulate this algorithm into a set data structure somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tgymnich I can change it to SmallPtrSet for the value type, but the problem with using a vector is that I need to know whether the map contains a value. That would require making a vector of pointers to the SmallPtrSet instead of a vector of SmallPtrSets. I'm not sure if that would perform better than what I currently have.
8a3fddc    to
    177ee8a      
    Compare
  
    177ee8a    to
    be47b63      
    Compare
  
    | 
          
 @llvm/pr-subscribers-backend-amdgpu Author: Patrick Simmons (linuxrocks123) ChangesThis attempts to resolve SWDEV-547512 by inhibiting alloca to LDS promotion when register pressure is high. Draft because: 
 Full diff: https://github.com/llvm/llvm-project/pull/152814.diff 1 Files Affected: 
 diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index f226c7f381aa2..fe41705beeb2a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -27,7 +27,9 @@
 
 #include "AMDGPU.h"
 #include "GCNSubtarget.h"
+#include "SIRegisterInfo.h"
 #include "Utils/AMDGPUBaseInfo.h"
+#include "llvm/ADT/PostOrderIterator.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/Analysis/CaptureTracking.h"
 #include "llvm/Analysis/InstSimplifyFolder.h"
@@ -36,6 +38,7 @@
 #include "llvm/Analysis/ValueTracking.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
 #include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/Instruction.h"
 #include "llvm/IR/IntrinsicInst.h"
 #include "llvm/IR/IntrinsicsAMDGPU.h"
 #include "llvm/IR/IntrinsicsR600.h"
@@ -45,6 +48,9 @@
 #include "llvm/Target/TargetMachine.h"
 #include "llvm/Transforms/Utils/SSAUpdater.h"
 
+#include <algorithm>
+#include <unordered_set>
+
 #define DEBUG_TYPE "amdgpu-promote-alloca"
 
 using namespace llvm;
@@ -100,6 +106,14 @@ class AMDGPUPromoteAllocaImpl {
   unsigned VGPRBudgetRatio;
   unsigned MaxVectorRegs;
 
+  std::unordered_map<BasicBlock *, std::unordered_set<Instruction *>>
+      SGPRLiveIns;
+  size_t getSGPRPressureEstimate(AllocaInst &I);
+
+  std::unordered_map<BasicBlock *, std::unordered_set<Instruction *>>
+      VGPRLiveIns;
+  size_t getVGPRPressureEstimate(AllocaInst &I);
+
   bool IsAMDGCN = false;
   bool IsAMDHSA = false;
 
@@ -1471,9 +1485,87 @@ bool AMDGPUPromoteAllocaImpl::hasSufficientLocalMem(const Function &F) {
   return true;
 }
 
+size_t AMDGPUPromoteAllocaImpl::getSGPRPressureEstimate(AllocaInst &I) {
+  Function &F = *I.getFunction();
+  size_t MaxLive = 0;
+  for (BasicBlock *BB : post_order(&F)) {
+    if (SGPRLiveIns.count(BB))
+      continue;
+
+    std::unordered_set<Instruction *> CurrentlyLive;
+    for (BasicBlock *SuccBB : successors(BB))
+      if (SGPRLiveIns.count(SuccBB))
+        for (const auto &R : SGPRLiveIns[SuccBB])
+          CurrentlyLive.insert(R);
+
+    for (auto RIt = BB->rbegin(); RIt != BB->rend(); RIt++) {
+      if (&*RIt == &I)
+        return MaxLive;
+
+      MaxLive = std::max(MaxLive, CurrentlyLive.size());
+
+      for (auto &Op : RIt->operands())
+        if (!Op.get()->getType()->isVectorTy())
+          if (Instruction *U = dyn_cast<Instruction>(Op))
+            CurrentlyLive.insert(U);
+
+      if (!RIt->getType()->isVectorTy())
+        CurrentlyLive.erase(&*RIt);
+    }
+
+    SGPRLiveIns[BB] = CurrentlyLive;
+  }
+
+  llvm_unreachable("Woops, we fell off the edge of the world.  Bye bye.");
+}
+
+size_t AMDGPUPromoteAllocaImpl::getVGPRPressureEstimate(AllocaInst &I) {
+  Function &F = *I.getParent()->getParent();
+  size_t MaxLive = 0;
+  for (BasicBlock *BB : post_order(&F)) {
+    if (VGPRLiveIns.count(BB))
+      continue;
+
+    std::unordered_set<Instruction *> CurrentlyLive;
+    for (BasicBlock *SuccBB : successors(BB))
+      if (VGPRLiveIns.count(SuccBB))
+        for (const auto &R : VGPRLiveIns[SuccBB])
+          CurrentlyLive.insert(R);
+
+    for (auto RIt = BB->rbegin(); RIt != BB->rend(); RIt++) {
+      if (&*RIt == &I)
+        return MaxLive;
+
+      MaxLive = std::max(MaxLive, CurrentlyLive.size());
+
+      for (auto &Op : RIt->operands())
+        if (Op.get()->getType()->isVectorTy())
+          if (Instruction *U = dyn_cast<Instruction>(Op))
+            CurrentlyLive.insert(U);
+
+      if (RIt->getType()->isVectorTy())
+        CurrentlyLive.erase(&*RIt);
+    }
+
+    VGPRLiveIns[BB] = CurrentlyLive;
+  }
+
+  llvm_unreachable("Woops, we fell off the edge of the world.  Bye bye.");
+}
+
 // FIXME: Should try to pick the most likely to be profitable allocas first.
 bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToLDS(AllocaInst &I,
                                                     bool SufficientLDS) {
+  const unsigned SGPRPressureLimit = AMDGPU::SGPR_32RegClass.getNumRegs();
+  const unsigned VGPRPressureLimit = AMDGPU::VGPR_32RegClass.getNumRegs();
+
+  if (getSGPRPressureEstimate(I) > SGPRPressureLimit ||
+      getVGPRPressureEstimate(I) > VGPRPressureLimit) {
+    LLVM_DEBUG(dbgs() << "Declining to promote " << I
+                      << " to LDS since pressure is relatively high.\n");
+    return false;
+  }
+
   LLVM_DEBUG(dbgs() << "Trying to promote to LDS: " << I << '\n');
 
   if (DisablePromoteAllocaToLDS) {
 | 
    
| 
           @arsenm I have some thoughts about this that aren't directly related to this PR but are rather related to this pass in general. I feel like IR-level is too early to be running this pass, and I think that this promotion logic may be more appropriate as part of the register allocator's spill logic. In particular, the spill handler could check if a value can be spilled to LDS instead of the stack and then spill it there instead of the stack. Perhaps the spill cost estimator could also consider LDS-eligibility when deciding how expensive a particular value would be to spill. Implementing this would require annotating alloca() calls that have been promoted to virtual registers with some sort of notation indicating that it is legal to spill them to LDS instead of the stack, but that shouldn't be too hard to do, right? Please let me know your thoughts on this.  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a good idea, and you're taking what they are asking for at face value.
If anything we should try to be more aggressive about LDS usage when pressure is high.
I maintain:
- LDS is always preferable to stack access, which is the alternative here.
 - We basically always want to do this if the LDS budget permits
 - The testcase claiming this is slower needs more investigation for why it's slower, it sounds more like a second order effect that should be addressed separately
 
          
 These are complementary things, we ought to be doing both. There have been previous attempts to implement spill to LDS, but that would never replace this. We would need to tweak the heuristics to reserve LDS space for the later spilling, but these solve different problems. 
 I'm not really sure what you're describing here. SROA + this pass are the main "promote alloca to virtual register" cases, nothing in codegen does that. The allocas also no longer exist, so there's nothing to track? The allocator isn't trying to reinterpret allocas (nor should it?)  | 
    
| llvm_unreachable("Woops, we fell off the edge of the world. Bye bye."); | ||
| } | ||
| 
               | 
          ||
| size_t AMDGPUPromoteAllocaImpl::getVGPRPressureEstimate(AllocaInst &I) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The IR has no knowledge of SGPRs or VGPRs, and you're missing out on all the pressures exposed by legalization
| if (Instruction *U = dyn_cast<Instruction>(Op)) | ||
| CurrentlyLive.insert(U); | ||
| 
               | 
          ||
| if (!RIt->getType()->isVectorTy()) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGPR does not mean "not an IR vector type"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arsenm do you know of a better heuristic for estimating SGPR pressure at the IR level?
| if (Instruction *U = dyn_cast<Instruction>(Op)) | ||
| CurrentlyLive.insert(U); | ||
| 
               | 
          ||
| if (RIt->getType()->isVectorTy()) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VGPR doesn't mean "IR vector type"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arsenm do you know of a better heuristic for estimating VGPR pressure at the IR level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, but the type has nothing to do with it. We don't have any real / precise attempts pressure heuristics in IR
          
 Thanks for the feedback. I am still becoming familiar with our ISA. Some thoughts: 
 Do you have any thoughts on these possibilities?  | 
    
          
 Yes, that's always preferable. No stack is better than LDS is better than stack, which is usually the single worst thing you can use. At some point you need a large junk of addressable memory, the stack is just the worst case. 
 Those are the fundamentals of optimization 
 Ordinary LDS usage is just regular load/store with addrspace(3). There aren't special LDS intrinsics for these cases. In this example we're just swapping out the memory access type, which should be treated equivalently well by downstream optimizations  | 
    
This attempts to resolve SWDEV-547512 by inhibiting alloca to LDS promotion when register pressure is high. Draft because: