
Conversation

@alex-t (Contributor) commented Aug 29, 2025

📌 Problem

SSA-based spilling decisions require knowing when each virtual register (or subregister lane) will next be used.
Existing LLVM analyses (e.g. LiveIntervals) provide lifetime coverage, but not the next-use distances that spill heuristics depend on.

Without that information, spill heuristics either over-approximate or fail to distinguish far uses from near ones, especially in the presence of subregisters. For example, given two live registers whose next uses are 3 and 300 instructions away, a spiller should prefer to spill the latter, and it needs per-instruction distances to see the difference.

🧩 Solution

This patch adds a new analysis pass AMDGPUNextUseAnalysis that computes, for every (VReg, LaneMask) at each instruction:

The distance in instructions to the next use (or “infinity” if no future use exists).

Coverage semantics for overlapping subregisters (via LaneBitmask).

Fast queries by instruction iterator.

Design points:

Implemented as a MachineFunctionAnalysis.

SSA-form friendly: queries are well-defined at each def.

Separate from the spiller: consumers can use it independently.
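
For orientation, here is a minimal sketch of how a consumer might query the result, assuming only the API names introduced by this patch (`rankCandidate` and its surrounding wiring are hypothetical):

```cpp
#include "AMDGPUNextUseAnalysis.h"
#include "VRegMaskPair.h"

// Hypothetical helper inside a consumer pass that already holds a
// NextUseResult &NU (e.g. obtained via the analysis manager).
unsigned rankCandidate(NextUseResult &NU, const MachineRegisterInfo &MRI,
                       MachineBasicBlock::iterator I, Register VReg) {
  // Query the whole register: use the register class's full lane mask
  // rather than LaneBitmask::getAll() (see the lane-mask fixes below).
  VRegMaskPair VMP(VReg, MRI.getMaxLaneMaskForVReg(VReg));
  // Returns a finite distance, a high loop-exit rank, or Infinity (65535)
  // when the register has no further use.
  return NU.getNextUseDistance(I, VMP);
}
```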

🔬 Testing

Unit tests in llvm/unittests/Target/AMDGPU/NextUseAnalysisTest.cpp.

MIR tests in llvm/test/CodeGen/AMDGPU/NextUseAnalysis/*.mir:

Cover VGPR and SGPR cases.

Verify next-use distances for subregister lanes.

Use -run-pass=amdgpu-next-use with FileCheck.

🚧 Limitations / Future Work

Integration with the SSA Spiller will be in a follow-up patch.

Currently scoped to AMDGPU target; generalization may be possible later.

Debug dump output (LLVM_DEBUG) may evolve as feedback comes in.

📖 Notes

This patch is self-contained: no pipeline changes, not wired into llc by default.

Provides the analysis infrastructure needed for SSA-based spilling.

@github-actions bot commented Aug 29, 2025

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff origin/main HEAD --extensions h,cpp -- llvm/lib/Target/AMDGPU/AMDGPUNextUseAnalysis.cpp llvm/lib/Target/AMDGPU/AMDGPUNextUseAnalysis.h llvm/lib/Target/AMDGPU/AMDGPUSSARAUtils.h llvm/lib/Target/AMDGPU/VRegMaskPair.h llvm/unittests/Target/AMDGPU/NextUseAnalysisTest.cpp llvm/lib/Target/AMDGPU/AMDGPU.h llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp --diff_from_common_commit

⚠️
The reproduction instructions above might return results for more than one PR
in a stack if you are using a stacked PR workflow. You can limit the results by
changing origin/main to the base branch/commit you want to compare against.
⚠️

View the diff from clang-format here.
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUNextUseAnalysis.cpp b/llvm/lib/Target/AMDGPU/AMDGPUNextUseAnalysis.cpp
index 8bdc109e8..c13597c0e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUNextUseAnalysis.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUNextUseAnalysis.cpp
@@ -1,5 +1,5 @@
-#include "AMDGPU.h"
 #include "AMDGPUNextUseAnalysis.h"
+#include "AMDGPU.h"
 
 #include "llvm/ADT/DenseMap.h"
 #include "llvm/ADT/PostOrderIterator.h"
@@ -22,25 +22,27 @@
 using namespace llvm;
 
 // Command-line option to enable timing instrumentation
-static cl::opt<bool> EnableTimers("amdgpu-next-use-analysis-timers",
-                                  cl::desc("Enable timing for Next Use Analysis"),
-                                  cl::init(false), cl::Hidden);
+static cl::opt<bool>
+    EnableTimers("amdgpu-next-use-analysis-timers",
+                 cl::desc("Enable timing for Next Use Analysis"),
+                 cl::init(false), cl::Hidden);
 
 // Static timers for performance tracking across all analysis runs
 static llvm::TimerGroup TG("amdgpu-next-use", "AMDGPU Next Use Analysis");
 static llvm::Timer AnalyzeTimer("analyze", "Time spent in analyze()", TG);
-static llvm::Timer GetDistanceTimer("getNextUseDistance", 
-                                     "Time spent in getNextUseDistance()", TG);
+static llvm::Timer GetDistanceTimer("getNextUseDistance",
+                                    "Time spent in getNextUseDistance()", TG);
 
 // Three-tier ranking system for spiller decisions
-unsigned NextUseResult::materializeForRank(int64_t Stored, unsigned SnapshotOffset) const {
+unsigned NextUseResult::materializeForRank(int64_t Stored,
+                                           unsigned SnapshotOffset) const {
   int64_t Mat64 = materialize(Stored, SnapshotOffset);
 
   // Tier 1: Finite distances (0 to LoopTag-1) → return as-is
-  // Tier 2: Loop-exit distances (LoopTag to DeadTag-1) → map to 60000-64999 range
-  // Tier 3: Dead registers (DeadTag+) → return Infinity (65535)
+  // Tier 2: Loop-exit distances (LoopTag to DeadTag-1) → map to 60000-64999
+  // range Tier 3: Dead registers (DeadTag+) → return Infinity (65535)
   if (Mat64 >= DeadTag) {
-    return Infinity;  // Tier 3: Dead registers get maximum distance
+    return Infinity; // Tier 3: Dead registers get maximum distance
   }
   if (Mat64 >= LoopTag) {
     // Tier 2: Loop-exit distances get mapped to high range [60000, 64999]
@@ -51,12 +53,11 @@ unsigned NextUseResult::materializeForRank(int64_t Stored, unsigned SnapshotOffs
     return 60000 + ClampedRemainder;
   }
   if (Mat64 <= 0) {
-    return 0;  // Tier 1: Zero-distance for immediate uses
+    return 0; // Tier 1: Zero-distance for immediate uses
   }
-  return static_cast<unsigned>(Mat64);  // Tier 1: Finite distances as-is
+  return static_cast<unsigned>(Mat64); // Tier 1: Finite distances as-is
 }
 
-
 void NextUseResult::init(const MachineFunction &MF) {
   for (auto *L : LI->getLoopsInPreorder()) {
     SmallVector<std::pair<MachineBasicBlock *, MachineBasicBlock *>> Exiting;
@@ -100,7 +101,8 @@ void NextUseResult::analyze(const MachineFunction &MF) {
         VRegDistances SuccDist = UpwardNextUses[SuccNum];
         LLVM_DEBUG({
           dbgs() << "\nMerging "
-                 << "MBB_" << Succ->getNumber() << "." << Succ->getName() << "\n";
+                 << "MBB_" << Succ->getNumber() << "." << Succ->getName()
+                 << "\n";
         });
 
         // Check if the edge from MBB to Succ goes out of the Loop
@@ -116,7 +118,8 @@ void NextUseResult::analyze(const MachineFunction &MF) {
           // Clear out the Loop-Exiting weights.
           for (auto &P : SuccDist) {
             auto &Dists = P.second;
-            // Collect items that need to be updated to avoid iterator invalidation
+            // Collect items that need to be updated to avoid iterator
+            // invalidation
             SmallVector<std::pair<LaneBitmask, int64_t>, 4> ToUpdate;
             for (auto R : Dists) {
               if (R.second >= LoopTag) {
@@ -189,8 +192,9 @@ void NextUseResult::analyze(const MachineFunction &MF) {
           ++Offset;
       }
 
-      // EntryOff needs the TOTAL instruction count for correct predecessor distances
-      // while InstrOffset uses individual instruction offsets for materialization
+      // EntryOff needs the TOTAL instruction count for correct predecessor
+      // distances while InstrOffset uses individual instruction offsets for
+      // materialization
 
       LLVM_DEBUG({
         dbgs() << "\nFinal distances for MBB_" << MBB->getNumber() << "."
@@ -264,9 +268,8 @@ NextUseResult::getSortedSubregUses(const MachineBasicBlock::iterator I,
       });
       for (auto P : reverse(Dists)) {
         LaneBitmask UseMask = P.first;
-        LLVM_DEBUG({
-          dbgs() << "Used mask : [" << PrintLaneMask(UseMask) << "]\n";
-        });
+        LLVM_DEBUG(
+            { dbgs() << "Used mask : [" << PrintLaneMask(UseMask) << "]\n"; });
         if ((UseMask & VMP.getLaneMask()) == UseMask) {
           Result.push_back({VMP.getVReg(), UseMask});
         }
@@ -285,9 +288,8 @@ NextUseResult::getSortedSubregUses(const MachineBasicBlock &MBB,
       NextUseMap[MBBNum].Bottom.contains(VMP.getVReg())) {
     VRegDistances::SortedRecords Dists =
         NextUseMap[MBBNum].Bottom[VMP.getVReg()];
-    LLVM_DEBUG({
-      dbgs() << "Mask : [" << PrintLaneMask(VMP.getLaneMask()) << "]\n";
-    });
+    LLVM_DEBUG(
+        { dbgs() << "Mask : [" << PrintLaneMask(VMP.getLaneMask()) << "]\n"; });
     for (auto P : reverse(Dists)) {
       LaneBitmask UseMask = P.first;
       LLVM_DEBUG(dbgs() << "Used mask : [" << PrintLaneMask(UseMask) << "]\n");
@@ -313,7 +315,7 @@ unsigned NextUseResult::getNextUseDistance(const MachineBasicBlock::iterator I,
                                            const VRegMaskPair VMP) {
   if (EnableTimers)
     GetDistanceTimer.startTimer();
-  
+
   unsigned Dist = Infinity;
   const MachineBasicBlock *MBB = I->getParent();
   unsigned MBBNum = MBB->getNumber();
@@ -323,8 +325,8 @@ unsigned NextUseResult::getNextUseDistance(const MachineBasicBlock::iterator I,
     if (NextUseMap[MBBNum].InstrDist[&*I].contains(VMP.getVReg())) {
       // printSortedRecords(Dists[VMP.VReg], VMP.VReg);
       unsigned SnapOff = NextUseMap[MBBNum].InstrOffset[&*I];
-      getFromSortedRecords(Dists[VMP.getVReg()], VMP.getLaneMask(),
-                           SnapOff, Dist);
+      getFromSortedRecords(Dists[VMP.getVReg()], VMP.getLaneMask(), SnapOff,
+                           Dist);
     }
   }
 
@@ -337,7 +339,7 @@ unsigned NextUseResult::getNextUseDistance(const MachineBasicBlock &MBB,
                                            const VRegMaskPair VMP) {
   if (EnableTimers)
     GetDistanceTimer.startTimer();
-  
+
   unsigned Dist = Infinity;
   unsigned MBBNum = MBB.getNumber();
   if (NextUseMap.contains(MBBNum)) {
@@ -346,7 +348,7 @@ unsigned NextUseResult::getNextUseDistance(const MachineBasicBlock &MBB,
                            VMP.getLaneMask(), 0, Dist);
     }
   }
-  
+
   if (EnableTimers)
     GetDistanceTimer.stopTimer();
   return Dist;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUNextUseAnalysis.h b/llvm/lib/Target/AMDGPU/AMDGPUNextUseAnalysis.h
index c27fa182b..073f4b61c 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUNextUseAnalysis.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUNextUseAnalysis.h
@@ -135,7 +135,7 @@ class NextUseResult {
     void clear(VRegMaskPair VMP) {
       if (NextUseMap.contains(VMP.getVReg())) {
         auto &Dists = NextUseMap[VMP.getVReg()];
-        for (auto It = Dists.begin(); It != Dists.end(); ) {
+        for (auto It = Dists.begin(); It != Dists.end();) {
           LaneBitmask Masked = It->first & ~VMP.getLaneMask();
           if (Masked.none()) {
             It = Dists.erase(It);
@@ -232,7 +232,8 @@ private:
   static constexpr int64_t DeadTag = (int64_t)1 << 60; // ~1e18, >> LoopTag
 
   // Unsigned Infinity for external API/DAG users who want a sentinel.
-  static constexpr unsigned PrintedInfinity = std::numeric_limits<unsigned>::max();
+  static constexpr unsigned PrintedInfinity =
+      std::numeric_limits<unsigned>::max();
 
   const uint16_t Infinity = std::numeric_limits<unsigned short>::max();
   void init(const MachineFunction &MF);
@@ -249,9 +250,9 @@ private:
   struct PrintDist {
     bool IsInfinity;
     bool IsDead;
-    int64_t LoopMultiplier;  // How many LoopTags are in the distance
-    int64_t Rema;            // remainder after extracting LoopTags
-    
+    int64_t LoopMultiplier; // How many LoopTags are in the distance
+    int64_t Rema;           // remainder after extracting LoopTags
+
     PrintDist(int64_t Mat64) {
       if (Mat64 >= DeadTag) {
         IsInfinity = false;
@@ -314,7 +315,8 @@ private:
         if (PDist.LoopMultiplier == 1)
           O << "[ LoopTag+" << PDist.Rema << " ]\n";
         else if (PDist.LoopMultiplier > 1)
-          O << "[ LoopTag*" << PDist.LoopMultiplier << "+" << PDist.Rema << " ]\n";
+          O << "[ LoopTag*" << PDist.LoopMultiplier << "+" << PDist.Rema
+            << " ]\n";
         else
           O << "[ INF+" << PDist.Rema << " ]\n";
       else
diff --git a/llvm/lib/Target/AMDGPU/VRegMaskPair.h b/llvm/lib/Target/AMDGPU/VRegMaskPair.h
index e2f588dcb..1a193c8c9 100644
--- a/llvm/lib/Target/AMDGPU/VRegMaskPair.h
+++ b/llvm/lib/Target/AMDGPU/VRegMaskPair.h
@@ -95,7 +95,7 @@ class LaneCoverageResult {
 
 public:
   LaneCoverageResult() = default;
-  LaneCoverageResult(const LaneBitmask Mask) : Data(Mask), NotCovered(Mask){};
+  LaneCoverageResult(const LaneBitmask Mask) : Data(Mask), NotCovered(Mask) {};
   bool isFullyCovered() { return Data == Covered; }
   bool isFullyUncovered() { return Data == NotCovered; }
   LaneBitmask getCovered() { return Covered; }
@@ -188,7 +188,8 @@ public:
 
   bool contains(const VRegMaskPair &VMP) const {
     auto It = SetStorage.find(VMP.VReg);
-    return It != SetStorage.end() && It->second.find(VMP.LaneMask) != It->second.end();
+    return It != SetStorage.end() &&
+           It->second.find(VMP.LaneMask) != It->second.end();
   }
 
   void clear() {
diff --git a/llvm/unittests/Target/AMDGPU/NextUseAnalysisTest.cpp b/llvm/unittests/Target/AMDGPU/NextUseAnalysisTest.cpp
index cc8e42510..73a48b190 100644
--- a/llvm/unittests/Target/AMDGPU/NextUseAnalysisTest.cpp
+++ b/llvm/unittests/Target/AMDGPU/NextUseAnalysisTest.cpp
@@ -99,7 +99,6 @@ public:
 
 std::unique_ptr<NextUseResult> NextUseAnalysisTestWrapper::Captured;
 
-
 class NextUseAnalysisTestBase : public testing::Test {
 protected:
   std::unique_ptr<LLVMContext> Ctx;
@@ -202,9 +201,10 @@ protected:
                 checkMatch[1].str() == instrMatch[1].str()) {
 
               // Found exact match, look for subsequent Vreg distance patterns
-              // Track full register uses by virtual register to apply precedence
+              // Track full register uses by virtual register to apply
+              // precedence
               llvm::DenseMap<Register, unsigned> fullRegDistances;
-              
+
               for (size_t j = i + 1; j < checkPatterns.size(); ++j) {
                 const std::string &distPattern = checkPatterns[j];
 
@@ -230,25 +230,32 @@ protected:
                   unsigned regNum = std::stoul(vregMatch[1].str());
                   std::string subRegName =
                       vregMatch[3].str(); // May be empty for full register
-                  unsigned distance = std::stoul(vregMatch[4].str());                  Register VReg = Register::index2VirtReg(regNum);
-                  LaneBitmask mask = getSubRegLaneMask(subRegName, VReg, TRI, MRI);
-                  
+                  unsigned distance = std::stoul(vregMatch[4].str());
+                  Register VReg = Register::index2VirtReg(regNum);
+                  LaneBitmask mask =
+                      getSubRegLaneMask(subRegName, VReg, TRI, MRI);
+
                   // Check if this is a full register use
-                  if (subRegName.empty() || mask == MRI->getMaxLaneMaskForVReg(VReg)) {
+                  if (subRegName.empty() ||
+                      mask == MRI->getMaxLaneMaskForVReg(VReg)) {
                     // This is a full register use - record it for precedence
                     fullRegDistances[VReg] = distance;
                     VRegMaskPair VMP(VReg, mask);
                     expectedDistances[VMP] = distance;
                   } else {
-                    // This is a sub-register use  
-                    // Check if we already have a full register use for this VReg with closer distance
+                    // This is a sub-register use
+                    // Check if we already have a full register use for this
+                    // VReg with closer distance
                     auto FullIt = fullRegDistances.find(VReg);
-                    if (FullIt != fullRegDistances.end() && FullIt->second < distance) {
-                      // Full register use exists and is closer - use its distance
+                    if (FullIt != fullRegDistances.end() &&
+                        FullIt->second < distance) {
+                      // Full register use exists and is closer - use its
+                      // distance
                       VRegMaskPair VMP(VReg, mask);
                       expectedDistances[VMP] = FullIt->second;
                     } else {
-                      // No full register use or sub-register is closer - use sub-register distance
+                      // No full register use or sub-register is closer - use
+                      // sub-register distance
                       VRegMaskPair VMP(VReg, mask);
                       expectedDistances[VMP] = distance;
                     }
@@ -290,35 +297,35 @@ protected:
   }
 
   // Helper to find all .mir files in a directory
-  std::vector<std::string> findMirFiles(const std::string& dirPath) {
+  std::vector<std::string> findMirFiles(const std::string &dirPath) {
     std::vector<std::string> mirFiles;
-    
+
     if (!std::filesystem::exists(dirPath)) {
       return mirFiles;
     }
-    
-    for (const auto& entry : std::filesystem::directory_iterator(dirPath)) {
+
+    for (const auto &entry : std::filesystem::directory_iterator(dirPath)) {
       if (entry.is_regular_file() && entry.path().extension() == ".mir") {
         mirFiles.push_back(entry.path().filename().string());
       }
     }
-    
+
     std::sort(mirFiles.begin(), mirFiles.end());
     return mirFiles;
   }
-  
+
   // Helper to parse CHECK patterns from file comments
-  std::vector<std::string> parseCheckPatterns(const std::string& filePath) {
+  std::vector<std::string> parseCheckPatterns(const std::string &filePath) {
     std::vector<std::string> patterns;
     std::ifstream file(filePath);
     std::string line;
-    
+
     while (std::getline(file, line)) {
       if (line.find("# CHECK") == 0) {
         patterns.push_back(line);
       }
     }
-    
+
     return patterns;
   }
 
@@ -382,9 +389,9 @@ protected:
 };
 
 // Parameterized test for all .mir files
-class NextUseAnalysisParameterizedTest : public NextUseAnalysisTestBase, 
-                                         public testing::WithParamInterface<std::string> {
-};
+class NextUseAnalysisParameterizedTest
+    : public NextUseAnalysisTestBase,
+      public testing::WithParamInterface<std::string> {};
 
 // Allow uninstantiated test when test files are not available
 GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(NextUseAnalysisParameterizedTest);
@@ -449,7 +456,7 @@ std::vector<std::string> getMirFiles() {
 
 TEST_P(NextUseAnalysisParameterizedTest, ProcessMirFile) {
   std::string mirFileName = GetParam();
-  
+
   // Get test directory from environment or use default
   std::string testDir = getTestDirectory();
 
@@ -461,11 +468,12 @@ TEST_P(NextUseAnalysisParameterizedTest, ProcessMirFile) {
   }
 
   std::string fullPath = testDir + "/" + mirFileName;
-  
+
   // Parse CHECK patterns from the file
   auto checkPatterns = parseCheckPatterns(fullPath);
-  ASSERT_FALSE(checkPatterns.empty()) << "No CHECK patterns found in " << mirFileName;
-  
+  ASSERT_FALSE(checkPatterns.empty())
+      << "No CHECK patterns found in " << mirFileName;
+
   LLVMContext Ctx;
   legacy::PassManager PM;
   auto Module = parseMIRFile(fullPath, Ctx, *TM, PM);
@@ -677,18 +685,16 @@ body: |
   }
 }
 
-INSTANTIATE_TEST_SUITE_P(
-  AllMirFiles,
-  NextUseAnalysisParameterizedTest,
-  testing::ValuesIn(getMirFiles()),
-  [](const testing::TestParamInfo<std::string>& info) {
-    std::string name = info.param;
-    // Replace non-alphanumeric characters with underscores for valid test names
-    std::replace_if(name.begin(), name.end(), [](char c) { 
-      return !std::isalnum(c); 
-    }, '_');
-    return name;
-  }
-);
+INSTANTIATE_TEST_SUITE_P(AllMirFiles, NextUseAnalysisParameterizedTest,
+                         testing::ValuesIn(getMirFiles()),
+                         [](const testing::TestParamInfo<std::string> &info) {
+                           std::string name = info.param;
+                           // Replace non-alphanumeric characters with
+                           // underscores for valid test names
+                           std::replace_if(
+                               name.begin(), name.end(),
+                               [](char c) { return !std::isalnum(c); }, '_');
+                           return name;
+                         });
 
 } // end anonymous namespace

alex-t added 9 commits August 29, 2025 19:04
  -- removed 'tick' loop updating each live reg distance on each instruction, offset trick used instead
  -- 3 tier ranking applied to the distance materialization
  -- external API unchanged
…it testing

This commit addresses multiple fundamental issues in the NextUseAnalysis
algorithm discovered during unit test development and implements a
comprehensive testing strategy.

## Algorithmic Fixes:

1. **Fix CompareByDist sorting logic**: The comparator effectively sorted by
   LaneBitmask instead of distance: because LaneBitmasks are unique,
   LHS.first == RHS.first was never true, so the distance comparison never
   triggered. Fixed to sort by distance first, with LaneBitmask as the
   tie-breaker.

2. **Fix lane mask coverage conditions**: The coverage logic was backwards
   in getFromSortedRecords. Changed from (UseMask & Mask) == UseMask to
   (Mask & UseMask) == Mask so that a record only matches when the stored use
   covers every queried lane (see the sketch after this list).

3. **Fix LaneBitmask::getAll() usage**: getAll() cannot stand in for a full
   register: it returns the maximum possible mask (0xFFFFFFFF), while e.g.
   VGPR_32's full mask is only 0xF. Now uses MRI->getMaxLaneMaskForVReg(VReg)
   to obtain the correct mask for the register class.

4. **Implement PHI operand filtering**: PHI operands are edge-specific and
   should only be included when source block matches current predecessor.
   Added filtering: if ((BlockOp.getMBB() != MBB) || UseOp.isUndef()) remove from SuccDist.

5. **Fix lane mask coverage in insert method**: Enhanced insert() with proper
   coverage relationships - if existing use covers new use and has closer distance,
   reject new use; if new use covers existing use and has closer distance, replace.

6. **Add universal undef filtering**: Filter out undef operands from analysis
   as they don't represent real register uses.
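
A minimal standalone illustration of the corrected coverage predicate from fix 2 (a sketch, not the patch code itself):

```cpp
#include "llvm/MC/LaneBitmask.h"
using llvm::LaneBitmask;

// A stored use answers a query only if it covers every queried lane.
bool coversQuery(LaneBitmask StoredUseMask, LaneBitmask QueriedMask) {
  return (QueriedMask & StoredUseMask) == QueriedMask;
  // The old, backwards form ((StoredUseMask & QueriedMask) == StoredUseMask)
  // asked the opposite question: whether the query covered the stored use.
}
```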

## Unit Testing Strategy:

- **Test Framework**: Parses LIT tests from llvm/test/CodeGen/AMDGPU/NextUseAnalysis/
  and runs NextUseAnalysis once per MachineFunction, then queries distances for
  each instruction using CHECK patterns as expected results.

- **Control Flow Handling**: Properly handles divergent control flow where both
  full register and sub-register uses may be reachable through independent paths.
  Uses ordering semantics: FullRegUse->SubRegUse indicates converged paths;
  SubRegUse->FullRegUse indicates linear/post-dominated flow.

- **Full vs Sub-register Precedence**: When a full register use has a closer
  distance than a sub-register use, the full register distance is used for all
  sub-register queries of that VReg, reflecting optimal spilling decisions
  (sketched after this list).

- **Enhanced parseExpectedDistances**: Implements precedence logic to handle
  full register vs sub-register distance expectations correctly.
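
The precedence rule as a small standalone sketch (`expectedDistance` and its parameters are illustrative names, not the test code):

```cpp
#include "llvm/ADT/DenseMap.h"
#include "llvm/CodeGen/Register.h"

// Pick the expected distance for a sub-register query, letting a closer
// full-register use of the same VReg take precedence.
unsigned expectedDistance(
    const llvm::DenseMap<llvm::Register, unsigned> &FullRegDistances,
    llvm::Register VReg, unsigned SubRegDistance) {
  auto It = FullRegDistances.find(VReg);
  if (It != FullRegDistances.end() && It->second < SubRegDistance)
    return It->second; // the nearer full-register use dominates
  return SubRegDistance;
}
```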

## Testing Infrastructure:

- Added comprehensive documentation explaining coverage eviction, control flow
  semantics, and spilling optimization strategies
- Enhanced sub-register handling with proper LLVM API usage
- Parameterized tests for all MIR files with environment variable controls

This ensures NextUseAnalysis correctness for register allocation and spilling
decisions while providing robust validation through comprehensive unit tests.
…ce testing infrastructure

This commit adds comprehensive unit testing for the getSortedSubregUses API and
improves the NextUseAnalysis testing framework with several key enhancements:

**New getSortedSubregUses Unit Test:**
- Tests the API that returns sub-register uses sorted by decreasing next-use distance
- Validates optimal spilling decisions where furthest uses appear first
- Uses minimal SSA MIR pattern with 8 sub-registers at varying distances (1-9)
- Verifies exact ordering: sub0 (dist 9) → sub1 (8) → sub2 (7) → sub3 (6) →
  sub28 (4) → sub29 (3) → sub30 (2) → sub31 (1)
- Ensures all results reference the correct virtual register
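
A hedged sketch of driving the tested API (`visitFurthestFirst` and the loop body are illustrative; the query itself is the patch's API):

```cpp
#include "AMDGPUNextUseAnalysis.h"
#include "VRegMaskPair.h"

// Walk a register's sub-register uses, furthest next use first.
void visitFurthestFirst(NextUseResult &NU, const MachineRegisterInfo &MRI,
                        MachineBasicBlock::iterator I, Register VReg) {
  VRegMaskPair Whole(VReg, MRI.getMaxLaneMaskForVReg(VReg));
  // Results are covered by Whole and sorted by decreasing distance, so a
  // spiller would take candidates from the front of this vector.
  for (const VRegMaskPair &Lane : NU.getSortedSubregUses(I, Whole)) {
    (void)Lane; // Lane.getVReg() == VReg; getLaneMask() names the lanes.
  }
}
```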

**Testing Infrastructure Improvements:**
- Added parseMIRString() method for inline MIR testing without temporary files
- Refactored parseMIRFile() to reuse parseMIRString() for better maintainability
Replace C++20 features with C++17-compatible alternatives to allow
compilation with CMAKE_CXX_STANDARD 17:

1. Replace std::erase_if with manual erase loop in
   AMDGPUNextUseAnalysis.h VRegDistances::clear()

2. Replace std::set::contains() with find() != end() pattern in:
   - AMDGPUNextUseAnalysis.h VRegDistances::insert()
   - VRegMaskPair.h VRegMaskPairSet::contains()

3. Fix iterator invalidation bug in AMDGPUNextUseAnalysis.cpp
   analyze() method where loop-exiting weights are cleared.
   Use two-phase approach: collect items needing updates first,
   then apply erase/insert operations to avoid invalidating
   iterators during traversal.

4. Remove unused static materialize() function that was shadowed
   by the NextUseResult member function of the same name.

5. Remove C++20 compilation workaround from X86WinEHUnwindV2.cpp
   that is no longer needed with these C++17 compatibility fixes.

These changes maintain identical behavior while ensuring C++17
compatibility and fixing a crash caused by set modification
during iteration.
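
For reference, a minimal standalone sketch of the C++17-compatible patterns described above (illustrative, not the patch code):

```cpp
#include <set>
#include <utility>

void clearKey(std::set<std::pair<unsigned, long>> &Dists, unsigned Key) {
  // C++20: std::erase_if(Dists, pred);
  // C++17: manual loop, always advancing via the iterator that erase()
  // returns so the loop never touches an invalidated iterator.
  for (auto It = Dists.begin(); It != Dists.end();) {
    if (It->first == Key)
      It = Dists.erase(It);
    else
      ++It;
  }
}

bool containsKey(const std::set<unsigned> &S, unsigned Key) {
  // C++20: return S.contains(Key);
  return S.find(Key) != S.end(); // C++17-compatible equivalent
}
```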
@alex-t alex-t marked this pull request as ready for review October 7, 2025 15:20
@llvmbot (Member) commented Oct 10, 2025

@llvm/pr-subscribers-llvm-regalloc

@llvm/pr-subscribers-backend-amdgpu

Author: None (alex-t)

Changes

(The PR description is quoted here verbatim; see above.)

Patch is 1.96 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/156079.diff

27 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+3)
  • (added) llvm/lib/Target/AMDGPU/AMDGPUNextUseAnalysis.cpp (+426)
  • (added) llvm/lib/Target/AMDGPU/AMDGPUNextUseAnalysis.h (+448)
  • (added) llvm/lib/Target/AMDGPU/AMDGPUSSARAUtils.h (+69)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+1)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
  • (added) llvm/lib/Target/AMDGPU/VRegMaskPair.h (+398)
  • (added) llvm/test/CodeGen/AMDGPU/NextUseAnalysis/README.md (+116)
  • (added) llvm/test/CodeGen/AMDGPU/NextUseAnalysis/complex-control-flow-11blocks.mir (+1269)
  • (added) llvm/test/CodeGen/AMDGPU/NextUseAnalysis/complex-control-flow-14blocks.mir (+1567)
  • (added) llvm/test/CodeGen/AMDGPU/NextUseAnalysis/complex-single-loop-a.mir (+1767)
  • (added) llvm/test/CodeGen/AMDGPU/NextUseAnalysis/complex-single-loop-b.mir (+2646)
  • (added) llvm/test/CodeGen/AMDGPU/NextUseAnalysis/complex-single-loop.mir (+1021)
  • (added) llvm/test/CodeGen/AMDGPU/NextUseAnalysis/if_else_with_loops_nested_in_2_outer_loops.mir (+6346)
  • (added) llvm/test/CodeGen/AMDGPU/NextUseAnalysis/inner_cfg_in_2_nesteed_loops.mir (+2008)
  • (added) llvm/test/CodeGen/AMDGPU/NextUseAnalysis/loop_nested_in_3_outer_loops_complex_cfg.mir (+6116)
  • (added) llvm/test/CodeGen/AMDGPU/NextUseAnalysis/multi_exit_loop_followed_by_simple_loop.mir (+1380)
  • (added) llvm/test/CodeGen/AMDGPU/NextUseAnalysis/nested-loops-with-side-exits-a.mir (+3948)
  • (added) llvm/test/CodeGen/AMDGPU/NextUseAnalysis/nested-loops-with-side-exits-b.mir (+24320)
  • (added) llvm/test/CodeGen/AMDGPU/NextUseAnalysis/sequence_2_loops.mir (+1314)
  • (added) llvm/test/CodeGen/AMDGPU/NextUseAnalysis/simple-linear-block-distances.mir (+731)
  • (added) llvm/test/CodeGen/AMDGPU/NextUseAnalysis/simple-loop-3blocks.mir (+388)
  • (added) llvm/test/CodeGen/AMDGPU/NextUseAnalysis/three-tier-ranking-nested-loops.mir (+268)
  • (added) llvm/test/CodeGen/AMDGPU/NextUseAnalysis/three_loops_sequence_nested_in_outer_loop.mir (+5011)
  • (added) llvm/test/CodeGen/AMDGPU/NextUseAnalysis/triple-nested-loops.mir (+1649)
  • (modified) llvm/unittests/Target/AMDGPU/CMakeLists.txt (+1)
  • (added) llvm/unittests/Target/AMDGPU/NextUseAnalysisTest.cpp (+691)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index ebe38de1636be..9c2980ef735bd 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -248,6 +248,9 @@ extern char &AMDGPUPreloadKernArgPrologLegacyID;
 void initializeAMDGPUPreloadKernelArgumentsLegacyPass(PassRegistry &);
 extern char &AMDGPUPreloadKernelArgumentsLegacyID;
 
+void initializeAMDGPUNextUseAnalysisWrapperPass(PassRegistry &);
+extern char &AMDGPUNextUseAnalysisID;
+
 // Passes common to R600 and SI
 FunctionPass *createAMDGPUPromoteAlloca();
 void initializeAMDGPUPromoteAllocaPass(PassRegistry&);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUNextUseAnalysis.cpp b/llvm/lib/Target/AMDGPU/AMDGPUNextUseAnalysis.cpp
new file mode 100644
index 0000000000000..93d98de3fab3f
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUNextUseAnalysis.cpp
@@ -0,0 +1,426 @@
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/PostOrderIterator.h"
+#include "llvm/ADT/iterator_range.h"
+#include "llvm/CodeGen/MachineDominators.h"
+#include "llvm/CodeGen/MachineFunction.h"
+#include "llvm/CodeGen/MachineFunctionAnalysis.h"
+#include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/CodeGen/MachineRegisterInfo.h"
+#include "llvm/CodeGen/Passes.h"
+#include "llvm/CodeGen/SlotIndexes.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/MC/LaneBitmask.h"
+#include "llvm/Passes/PassBuilder.h"
+#include "llvm/Passes/PassPlugin.h"
+#include "llvm/Support/Timer.h"
+
+#include "AMDGPU.h"
+
+#include "AMDGPUNextUseAnalysis.h"
+
+#define DEBUG_TYPE "amdgpu-next-use"
+
+using namespace llvm;
+
+// namespace {
+
+// Three-tier ranking system for spiller decisions
+unsigned NextUseResult::materializeForRank(int64_t stored, unsigned snapshotOffset) const {
+  int64_t Mat64 = materialize(stored, snapshotOffset);
+
+  // Tier 1: Finite distances (0 to LoopTag-1) → return as-is
+  // Tier 2: Loop-exit distances (LoopTag to DeadTag-1) → map to 60000-64999 range
+  // Tier 3: Dead registers (DeadTag+) → return Infinity (65535)
+  if (Mat64 >= DeadTag) {
+    return Infinity;  // Tier 3: Dead registers get maximum distance
+  } else if (Mat64 >= LoopTag) {
+    // Tier 2: Loop-exit distances get mapped to high range [60000, 64999]
+    int64_t LoopRemainder = Mat64 - LoopTag;
+    // Clamp the remainder to fit in available range (5000 values)
+    unsigned ClampedRemainder = static_cast<unsigned>(
+        std::min(LoopRemainder, static_cast<int64_t>(4999)));
+    return 60000 + ClampedRemainder;
+  } else if (Mat64 <= 0) {
+    return 0;  // Tier 1: Zero-distance for immediate uses
+  } else {
+    return static_cast<unsigned>(Mat64);  // Tier 1: Finite distances as-is
+  }
+}
+
+
+void NextUseResult::init(const MachineFunction &MF) {
+  TG = new TimerGroup("Next Use Analysis",
+                      "Compilation Timers for Next Use Analysis");
+  T1 = new Timer("Next Use Analysis", "Time spent in analyse()", *TG);
+  T2 = new Timer("Next Use Analysis", "Time spent in computeNextUseDistance()",
+                 *TG);
+  for (auto L : LI->getLoopsInPreorder()) {
+    SmallVector<std::pair<MachineBasicBlock *, MachineBasicBlock *>> Exiting;
+    L->getExitEdges(Exiting);
+    for (auto P : Exiting) {
+      LoopExits[P.first->getNumber()] = P.second->getNumber();
+    }
+  }
+}
+
+void NextUseResult::analyze(const MachineFunction &MF) {
+  // Upward-exposed distances are only necessary to convey the data flow from
+  // the block to its predecessors. No need to store it beyond the analyze
+  // function as the analysis users are only interested in the use distances
+  // relatively to the given MI or the given block end.
+  DenseMap<unsigned, VRegDistances> UpwardNextUses;
+  T1->startTimer();
+  bool Changed = true;
+  while (Changed) {
+    Changed = false;
+    for (auto MBB : post_order(&MF)) {
+      unsigned Offset = 0;
+      unsigned MBBNum = MBB->getNumber();
+      VRegDistances Curr, Prev;
+      if (UpwardNextUses.contains(MBBNum)) {
+        Prev = UpwardNextUses[MBBNum];
+      }
+
+      LLVM_DEBUG(dbgs() << "\nMerging successors for "
+                        << "MBB_" << MBB->getNumber() << "." << MBB->getName()
+                        << "\n";);
+
+      for (auto Succ : successors(MBB)) {
+        unsigned SuccNum = Succ->getNumber();
+
+        if (!UpwardNextUses.contains(SuccNum))
+          continue;
+
+        VRegDistances SuccDist = UpwardNextUses[SuccNum];
+        LLVM_DEBUG(dbgs() << "\nMerging "
+                          << "MBB_" << Succ->getNumber() << "."
+                          << Succ->getName() << "\n");
+
+        // Check if the edge from MBB to Succ goes out of the Loop
+        int64_t EdgeWeight = 0;
+        if (LoopExits.contains(MBB->getNumber())) {
+          unsigned ExitTo = LoopExits[MBB->getNumber()];
+          if (SuccNum == ExitTo)
+            EdgeWeight = LoopTag;
+        }
+
+        if (LI->getLoopDepth(MBB) < LI->getLoopDepth(Succ)) {
+          // MBB->Succ is entering the Succ's loop
+          // Clear out the Loop-Exiting weights.
+          for (auto &P : SuccDist) {
+            auto &Dists = P.second;
+            // Collect items that need to be updated to avoid iterator invalidation
+            SmallVector<std::pair<LaneBitmask, int64_t>, 4> ToUpdate;
+            for (auto R : Dists) {
+              if (R.second >= LoopTag) {
+                ToUpdate.push_back(R);
+              }
+            }
+            // Now apply the updates
+            for (auto R : ToUpdate) {
+              Dists.erase(R);
+              R.second -= LoopTag;
+              Dists.insert(R);
+            }
+          }
+        }
+        LLVM_DEBUG(dbgs() << "\nCurr:";
+                   printVregDistances(Curr /*, 0 - we're at the block bottom*/);
+                   dbgs() << "\nSucc:";
+                   printVregDistances(SuccDist, EntryOff[SuccNum], EdgeWeight));
+
+        // Filter out successor's PHI operands with SourceBlock != MBB
+        // PHI operands are only live on their specific incoming edge
+        for (auto &PHI : Succ->phis()) {
+          // Check each PHI operand pair (value, source block)
+          for (unsigned OpIdx = 1; OpIdx < PHI.getNumOperands(); OpIdx += 2) {
+            const MachineOperand &UseOp = PHI.getOperand(OpIdx);
+            const MachineOperand &BlockOp = PHI.getOperand(OpIdx + 1);
+
+            // Skip if this operand doesn't come from current MBB
+            if (BlockOp.getMBB() != MBB) {
+              VRegMaskPair PhiVMP(UseOp, TRI, MRI);
+              // Remove this PHI operand from the successor distances
+              SuccDist.clear(PhiVMP);
+            }
+          }
+        }
+
+        Curr.merge(SuccDist, EntryOff[SuccNum], EdgeWeight);
+        LLVM_DEBUG(dbgs() << "\nCurr after merge:"; printVregDistances(Curr));
+      }
+
+      NextUseMap[MBBNum].Bottom = Curr;
+
+      for (auto &MI : make_range(MBB->rbegin(), MBB->rend())) {
+
+        for (auto &MO : MI.operands()) {
+
+          // Only process virtual register operands
+          // Undef operands don't represent real uses
+          if (!MO.isReg() || !MO.getReg().isVirtual() || MO.isUndef())
+            continue;
+
+          VRegMaskPair P(MO, TRI, MRI);
+          if (MO.isUse()) {
+            Curr.insert(P, -(int64_t)Offset);
+            UsedInBlock[MBB->getNumber()].insert(P);
+          } else if (MO.isDef()) {
+            Curr.clear(P);
+            UsedInBlock[MBB->getNumber()].remove(P);
+          }
+        }
+        NextUseMap[MBBNum].InstrDist[&MI] = Curr;
+        NextUseMap[MBBNum].InstrOffset[&MI] = Offset;
+        // printVregDistances(Curr, Offset);
+        if (!MI.isPHI())
+          ++Offset;
+      }
+
+      // EntryOff needs the TOTAL instruction count for correct predecessor distances
+      // while InstrOffset uses individual instruction offsets for materialization
+
+      LLVM_DEBUG(dbgs() << "\nFinal distances for MBB_" << MBB->getNumber()
+                        << "." << MBB->getName() << "\n";
+                 printVregDistances(Curr, Offset));
+      LLVM_DEBUG(dbgs() << "\nPrevious distances for MBB_" << MBB->getNumber()
+                        << "." << MBB->getName() << "\n";
+                 printVregDistances(Prev, Offset));
+
+      // EntryOff -offset of the first instruction in the block top-down walk
+      EntryOff[MBBNum] = Offset;
+      UpwardNextUses[MBBNum] = std::move(Curr);
+
+      bool Changed4MBB = (Prev != UpwardNextUses[MBBNum]);
+
+      Changed |= Changed4MBB;
+    }
+  }
+  // dumpUsedInBlock();
+  // Dump complete analysis results for testing
+  LLVM_DEBUG(dumpAllNextUseDistances(MF));
+  T1->stopTimer();
+  LLVM_DEBUG(TG->print(llvm::errs()));
+}
+
+void NextUseResult::getFromSortedRecords(
+    const VRegDistances::SortedRecords &Dists, LaneBitmask Mask,
+    unsigned SnapshotOffset, unsigned &D) {
+  LLVM_DEBUG(dbgs() << "Mask : [" << PrintLaneMask(Mask) << "]  "
+                    << "SnapshotOffset=" << SnapshotOffset << "\n");
+
+  // Records are sorted by stored value in increasing order. Since all entries
+  // in this snapshot share the same SnapshotOffset, ordering by stored value
+  // is equivalent to ordering by materialized distance.
+  for (const auto &P : Dists) {
+    const LaneBitmask UseMask = P.first;
+    LLVM_DEBUG(dbgs() << "  UseMask : [" << PrintLaneMask(UseMask) << "]\n");
+
+    // Require full coverage: a use contributes only if it covers the queried
+    // lanes.
+    if ((Mask & UseMask) == Mask) {
+      // Use materializeForRank for three-tier ranking system
+      int64_t Stored = static_cast<int64_t>(P.second);
+      D = materializeForRank(Stored, SnapshotOffset);
+
+      break; // first covering record is the nearest for this snapshot
+    }
+  }
+}
+
+SmallVector<VRegMaskPair>
+NextUseResult::getSortedSubregUses(const MachineBasicBlock::iterator I,
+                                   const VRegMaskPair VMP) {
+  SmallVector<VRegMaskPair> Result;
+  const MachineBasicBlock *MBB = I->getParent();
+  unsigned MBBNum = MBB->getNumber();
+  if (NextUseMap.contains(MBBNum) &&
+      NextUseMap[MBBNum].InstrDist.contains(&*I)) {
+    // VRegDistances Dists = NextUseMap[MBBNum].InstrDist[&*I];
+    if (NextUseMap[MBBNum].InstrDist[&*I].contains(VMP.getVReg())) {
+      VRegDistances::SortedRecords Dists =
+          NextUseMap[MBBNum].InstrDist[&*I][VMP.getVReg()];
+      LLVM_DEBUG(dbgs() << "Mask : [" << PrintLaneMask(VMP.getLaneMask())
+                        << "]\n");
+      for (auto P : reverse(Dists)) {
+        LaneBitmask UseMask = P.first;
+        LLVM_DEBUG(dbgs() << "Used mask : [" << PrintLaneMask(UseMask)
+                          << "]\n");
+        if ((UseMask & VMP.getLaneMask()) == UseMask) {
+          Result.push_back({VMP.getVReg(), UseMask});
+        }
+      }
+    }
+  }
+  return Result;
+}
+
+SmallVector<VRegMaskPair>
+NextUseResult::getSortedSubregUses(const MachineBasicBlock &MBB,
+                                   const VRegMaskPair VMP) {
+  SmallVector<VRegMaskPair> Result;
+  unsigned MBBNum = MBB.getNumber();
+  if (NextUseMap.contains(MBBNum) &&
+      NextUseMap[MBBNum].Bottom.contains(VMP.getVReg())) {
+    VRegDistances::SortedRecords Dists =
+        NextUseMap[MBBNum].Bottom[VMP.getVReg()];
+    LLVM_DEBUG(dbgs() << "Mask : [" << PrintLaneMask(VMP.getLaneMask())
+                      << "]\n");
+    for (auto P : reverse(Dists)) {
+      LaneBitmask UseMask = P.first;
+      LLVM_DEBUG(dbgs() << "Used mask : [" << PrintLaneMask(UseMask) << "]\n");
+      if ((UseMask & VMP.getLaneMask()) == UseMask) {
+        Result.push_back({VMP.getVReg(), UseMask});
+      }
+    }
+  }
+  return Result;
+}
+
+void NextUseResult::dumpUsedInBlock() {
+  LLVM_DEBUG(for (auto P
+                  : UsedInBlock) {
+    dbgs() << "MBB_" << P.first << ":\n";
+    for (auto VMP : P.second) {
+      dbgs() << "[ " << printReg(VMP.getVReg()) << " : <"
+             << PrintLaneMask(VMP.getLaneMask()) << "> ]\n";
+    }
+  });
+}
+
+unsigned NextUseResult::getNextUseDistance(const MachineBasicBlock::iterator I,
+                                           const VRegMaskPair VMP) {
+  unsigned Dist = Infinity;
+  const MachineBasicBlock *MBB = I->getParent();
+  unsigned MBBNum = MBB->getNumber();
+  if (NextUseMap.contains(MBBNum) &&
+      NextUseMap[MBBNum].InstrDist.contains(&*I)) {
+    VRegDistances Dists = NextUseMap[MBBNum].InstrDist[&*I];
+    if (NextUseMap[MBBNum].InstrDist[&*I].contains(VMP.getVReg())) {
+      // printSortedRecords(Dists[VMP.VReg], VMP.VReg);
+      unsigned SnapOff = NextUseMap[MBBNum].InstrOffset[&*I];
+      getFromSortedRecords(Dists[VMP.getVReg()], VMP.getLaneMask(),
+                           SnapOff, Dist);
+    }
+  }
+
+  return Dist;
+}
+
+unsigned NextUseResult::getNextUseDistance(const MachineBasicBlock &MBB,
+                                           const VRegMaskPair VMP) {
+  unsigned Dist = Infinity;
+  unsigned MBBNum = MBB.getNumber();
+  if (NextUseMap.contains(MBBNum)) {
+    if (NextUseMap[MBBNum].Bottom.contains(VMP.getVReg())) {
+      getFromSortedRecords(NextUseMap[MBBNum].Bottom[VMP.getVReg()],
+                           VMP.getLaneMask(), 0, Dist);
+    }
+  }
+  return Dist;
+}
+
+AMDGPUNextUseAnalysis::Result
+AMDGPUNextUseAnalysis::run(MachineFunction &MF,
+                           MachineFunctionAnalysisManager &MFAM) {
+  return AMDGPUNextUseAnalysis::Result(MF,
+                                       MFAM.getResult<SlotIndexesAnalysis>(MF),
+                                       MFAM.getResult<MachineLoopAnalysis>(MF));
+}
+
+AnalysisKey AMDGPUNextUseAnalysis::Key;
+
+//} // namespace
+
+extern "C" LLVM_ATTRIBUTE_WEAK ::llvm::PassPluginLibraryInfo
+llvmGetPassPluginInfo() {
+  return {LLVM_PLUGIN_API_VERSION, "AMDGPUNextUseAnalysisPass",
+          LLVM_VERSION_STRING, [](PassBuilder &PB) {
+            PB.registerAnalysisRegistrationCallback(
+                [](MachineFunctionAnalysisManager &MFAM) {
+                  MFAM.registerPass([] { return AMDGPUNextUseAnalysis(); });
+                });
+          }};
+}
+
+char AMDGPUNextUseAnalysisWrapper::ID = 0;
+char &llvm::AMDGPUNextUseAnalysisID = AMDGPUNextUseAnalysisWrapper::ID;
+INITIALIZE_PASS_BEGIN(AMDGPUNextUseAnalysisWrapper, "amdgpu-next-use",
+                      "AMDGPU Next Use Analysis", false, false)
+INITIALIZE_PASS_DEPENDENCY(SlotIndexesWrapperPass)
+INITIALIZE_PASS_DEPENDENCY(MachineLoopInfoWrapperPass)
+INITIALIZE_PASS_END(AMDGPUNextUseAnalysisWrapper, "amdgpu-next-use",
+                    "AMDGPU Next Use Analysis", false, false)
+
+bool AMDGPUNextUseAnalysisWrapper::runOnMachineFunction(MachineFunction &MF) {
+  NU.Indexes = &getAnalysis<SlotIndexesWrapperPass>().getSI();
+  NU.LI = &getAnalysis<MachineLoopInfoWrapperPass>().getLI();
+  NU.MRI = &MF.getRegInfo();
+  NU.TRI = MF.getSubtarget<GCNSubtarget>().getRegisterInfo();
+  assert(NU.MRI->isSSA());
+  NU.init(MF);
+  NU.analyze(MF);
+  //  LLVM_DEBUG(NU.dump());
+  return false;
+}
+
+void AMDGPUNextUseAnalysisWrapper::getAnalysisUsage(AnalysisUsage &AU) const {
+  AU.setPreservesAll();
+  AU.addRequired<MachineLoopInfoWrapperPass>();
+  AU.addRequired<SlotIndexesWrapperPass>();
+  MachineFunctionPass::getAnalysisUsage(AU);
+}
+
+AMDGPUNextUseAnalysisWrapper::AMDGPUNextUseAnalysisWrapper()
+    : MachineFunctionPass(ID) {
+  initializeAMDGPUNextUseAnalysisWrapperPass(*PassRegistry::getPassRegistry());
+}
+
+void NextUseResult::dumpAllNextUseDistances(const MachineFunction &MF) {
+  LLVM_DEBUG(dbgs() << "=== NextUseAnalysis Results for " << MF.getName()
+                    << " ===\n");
+
+  for (const auto &MBB : MF) {
+    const unsigned MBBNum = MBB.getNumber();
+    LLVM_DEBUG(dbgs() << "\n--- MBB_" << MBBNum << " ---\n");
+
+    if (!NextUseMap.contains(MBBNum)) {
+      LLVM_DEBUG(dbgs() << "  No analysis data for this block\n");
+      continue;
+    }
+
+    const NextUseInfo &Info = NextUseMap.at(MBBNum);
+
+    // Per-instruction dump (materialized with per-MI snapshot offset).
+    for (auto II = MBB.begin(), IE = MBB.end(); II != IE; ++II) {
+      const MachineInstr &MI = *II;
+
+      LLVM_DEBUG(dbgs() << "  Instr: ");
+      LLVM_DEBUG(MI.print(dbgs(), /*IsStandalone=*/false, /*SkipOpers=*/false,
+                          /*SkipDebugLoc=*/true, /*AddNewLine=*/false));
+      LLVM_DEBUG(dbgs() << "\n");
+
+      LLVM_DEBUG(dbgs() << "    Next-use distances:\n");
+      if (Info.InstrDist.contains(&MI)) {
+        const VRegDistances &Dists = Info.InstrDist.at(&MI);
+        const unsigned SnapOff = Info.InstrOffset.lookup(&MI); // 0 if absent
+        const bool Any =
+            printVregDistances(Dists, SnapOff, 0, dbgs(), "      ");
+        if (!Any)
+          LLVM_DEBUG(dbgs() << "      (no register uses)\n");
+      } else {
+        LLVM_DEBUG(dbgs() << "      (no distance data)\n");
+      }
+      LLVM_DEBUG(dbgs() << "\n");
+    }
+
+    // Block-end dump (materialized with offset = 0).
+    LLVM_DEBUG(dbgs() << "  Block End Distances:\n");
+    const bool AnyEnd = printVregDistances(Info.Bottom, /*SnapshotOffset=*/0,
+                                           /* EdgeWeight */ 0, dbgs(), "    ");
+    if (!AnyEnd)
+      LLVM_DEBUG(dbgs() << "    (no registers live at block end)\n");
+  }
+
+  LLVM_DEBUG(dbgs() << "\n=== End NextUseAnalysis Results ===\n");
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUNextUseAnalysis.h b/llvm/lib/Target/AMDGPU/AMDGPUNextUseAnalysis.h
new file mode 100644
index 0000000000000..d938c967acaa5
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUNextUseAnalysis.h
@@ -0,0 +1,448 @@
+//===- AMDGPUNextUseAnalysis.h ----------------------------------------*- C++-
+//*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIB_TARGET_AMDGPU_NEXT_USE_ANALYSIS_H
+#define LLVM_LIB_TARGET_AMDGPU_NEXT_USE_ANALYSIS_H
+
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/CodeGen/MachineLoopInfo.h"
+#include "llvm/CodeGen/SlotIndexes.h"
+
+#include "AMDGPUSSARAUtils.h"
+#include "GCNSubtarget.h"
+#include "SIRegisterInfo.h"
+#include "VRegMaskPair.h"
+
+#include <algorithm>
+#include <limits>
+#include <set>
+
+using namespace llvm;
+
+// namespace {
+
+// Helper function for rebasing successor distances into current block frame
+static inline int64_t rebaseFromSucc(int64_t succStored, unsigned succEntryOff,
+                                     int64_t edgeWeight /*0 or LoopTag*/) {
+  // Move succ-relative value into "current block end" frame.
+  return (int64_t)succStored + (int64_t)succEntryOff + (int64_t)edgeWeight;
+}
+
+class NextUseResult {
+  friend class AMDGPUNextUseAnalysisWrapper;
+  SlotIndexes *Indexes;
+  const MachineRegisterInfo *MRI;
+  const SIRegisterInfo *TRI;
+  MachineLoopInfo *LI;
+
+  TimerGroup *TG;
+  Timer *T1;
+  Timer *T2;
+
+  class VRegDistances {
+
+    using Record = std::pair<LaneBitmask, int64_t>;
+    struct CompareByDist {
+      bool operator()(const Record &LHS, const Record &RHS) const {
+        if (LHS.second != RHS.second)     // Different distances
+          return LHS.second < RHS.second; // Closest first
+        return LHS.first.getAsInteger() <
+               RHS.first.getAsInteger(); // Tiebreaker
+      }
+    };
+
+  public:
+    using SortedRecords = std::set<Record, CompareByDist>;
+
+  private:
+    DenseMap<unsigned, SortedRecords> NextUseMap;
+
+  public:
+    auto begin() { return NextUseMap.begin(); }
+    auto end() { return NextUseMap.end(); }
+
+    auto begin() const { return NextUseMap.begin(); }
+    auto end() const { return NextUseMap.end(); }
+
+    size_t size() const { return NextUseMap.size(); }
+    std::pair<bool, SortedRecords> get(unsigned Key) const {
+      if (NextUseMap.contains(Key))
+        return {true, NextUseMap.find(Key)->second};
+      return {false, SortedRecords()};
+    }
+
+    SortedRecords &operator[](unsigned Key) { return NextUseMap[Key]; }
+
+    SmallVector<unsigned> keys() {
+      SmallVector<unsigned> Keys;
+      for (auto P : NextUseMap)
+        Keys.push_back(P.first);
+      return Keys;
+    }
+
+    bool contains(unsigned Key) { return NextUseMap.contains(Key); }
+
+    bool insert(VRegMaskPair VMP, int6...
[truncated]

@alex-t (Contributor, Author) commented Oct 14, 2025

@arsenm, your comments have been addressed. Could you please take a look?

@alex-t (Contributor, Author) commented Nov 27, 2025

heads up! this is going to be landed soon! @arsenm, @dstutt please let me know if you have any objections.

@arsenm (Contributor) commented Nov 27, 2025

> heads up! this is going to be landed soon! @arsenm, @dstutt please let me know if you have any objections.

This shall not land without proper full review

@alex-t (Contributor, Author) commented Nov 28, 2025

>> heads up! this is going to be landed soon! @arsenm, @dstutt please let me know if you have any objections.
>
> This shall not land without proper full review

@arsenm I addressed all of your comments 1.5 months ago.
I have been waiting for your reply all that time.
Now you ask for a "proper full review". Really?
Please suggest the proper reviewer, then.

Review comment on the diff:

@@ -0,0 +1,445 @@
#include "AMDGPU.h"

Contributor comment: LLVM license thingy
