[AMDGPU] Prioritize allocation of low 256 VGPR classes #167978

rampitec · 2025-11-13T23:31:43Z

If we have 1024 VGPRs available we need to give priority to the
allocation of these registers where operands can only use low 256.
That is noteably scale operands of V_WMMA_SCALE instructions.
Otherwise large tuples will be allocated first and take all low
registers, so we would have to spill to get a room for these
scale registers.

Allocation priority itself does not eliminate spilling completely
in large kernels, although helps to some degree. Increasing spill
weight of a restricted class on top of it helps.

If we have 1024 VGPRs available we need to give priority to the allocation of these registers where operands can only use low 256. That is noteably scale operands of V_WMMA_SCALE instructions. Otherwise large tuples will be allocated first and take all low registers, so we would have to spill to get a room for these scale registers. Allocation priority itself does not eliminate spilling completely in large kernels, although helps to some degree. Increasing spill weight of a restricted class on top of it helps.

rampitec · 2025-11-13T23:31:57Z

[AMDGPU] Prioritize allocation of low 256 VGPR classes #167978 👈 (View in Graphite)
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

llvmbot · 2025-11-13T23:32:38Z

@llvm/pr-subscribers-backend-amdgpu

Author: Stanislav Mekhanoshin (rampitec)

Changes

If we have 1024 VGPRs available we need to give priority to the
allocation of these registers where operands can only use low 256.
That is noteably scale operands of V_WMMA_SCALE instructions.
Otherwise large tuples will be allocated first and take all low
registers, so we would have to spill to get a room for these
scale registers.

Allocation priority itself does not eliminate spilling completely
in large kernels, although helps to some degree. Increasing spill
weight of a restricted class on top of it helps.

Full diff: https://github.com/llvm/llvm-project/pull/167978.diff

2 Files Affected:

(modified) llvm/lib/Target/AMDGPU/SIRegisterInfo.h (+11)
(modified) llvm/lib/Target/AMDGPU/SIRegisterInfo.td (+1-1)

diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.h b/llvm/lib/Target/AMDGPU/SIRegisterInfo.h
index a6af25dfd7d6f..28ab2137b193c 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.h
@@ -501,6 +501,17 @@ class SIRegisterInfo final : public AMDGPUGenRegisterInfo {
 
   SmallVector<StringLiteral>
   getVRegFlagsOfReg(Register Reg, const MachineFunction &MF) const override;
+
+  float
+  getSpillWeightScaleFactor(const TargetRegisterClass *RC) const override {
+    // Prioritize VGPR_32_Lo256 over other classes which may occupy registers
+    // beyond v256.
+    return AMDGPUGenRegisterInfo::getSpillWeightScaleFactor(RC) *
+           ((RC == &AMDGPU::VGPR_32_Lo256RegClass ||
+             RC == &AMDGPU::VReg_64_Lo256_Align2RegClass)
+                ? 2.0
+                : 1.0);
+  }
 };
 
 namespace AMDGPU {
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.td b/llvm/lib/Target/AMDGPU/SIRegisterInfo.td
index abe12c17ae76c..5cff5f2248b02 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.td
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.td
@@ -644,7 +644,7 @@ def VGPR_32_Lo128 : SIRegisterClass<"AMDGPU", !listconcat(Reg32Types.types, Reg1
 // Identical to VGPR_32 except it only contains the low 256 (Lo256) registers.
 def VGPR_32_Lo256 : SIRegisterClass<"AMDGPU", !listconcat(Reg32Types.types, Reg16Types.types), 32,
                                     (add (sequence "VGPR%u", 0, 255))> {
-  let AllocationPriority = 0;
+  let AllocationPriority = !add(3, !mul(BaseClassPriority, BaseClassScaleFactor));
   let GeneratePressureSet = 0;
   let Size = 32;
   let Weight = 1;

rampitec · 2025-11-13T23:32:39Z

I was unable to come up with a reasonably small testcase so far.

arsenm

No test changes?

arsenm · 2025-11-13T23:58:06Z

llvm/lib/Target/AMDGPU/SIRegisterInfo.h

+    // Prioritize VGPR_32_Lo256 over other classes which may occupy registers
+    // beyond v256.
+    return AMDGPUGenRegisterInfo::getSpillWeightScaleFactor(RC) *
+           ((RC == &AMDGPU::VGPR_32_Lo256RegClass ||
+             RC == &AMDGPU::VReg_64_Lo256_Align2RegClass)
+                ? 2.0
+                : 1.0);
+  }


This shouldn't require manual overriding

At some level this is even right. Using this or that register is not more expensive. That is when it comes to spilling it matters.

arsenm · 2025-11-13T23:58:47Z

llvm/lib/Target/AMDGPU/SIRegisterInfo.td

 def VGPR_32_Lo256 : SIRegisterClass<"AMDGPU", !listconcat(Reg32Types.types, Reg16Types.types), 32,
                                    (add (sequence "VGPR%u", 0, 255))> {
-  let AllocationPriority = 0;
+  let AllocationPriority = !add(3, !mul(BaseClassPriority, BaseClassScaleFactor));


Did you try setting CostPerUse? I think that makes more sense than changing priority

That's for registers, right? In this case it shall be per regclass. I.e., even if I increase it for low 256, it will be the same for a smaller and larger size. Here it is really a property of operand, not a register itself.

More or less that would make almost everything worse. Setting CostPerUse to non-zero for all low 256 VGPRs will persuade RA to allocate registers above 256 by default. At the end you will have programs using registers starting just from v256 and nothing at the range v0-v255.

rampitec · 2025-11-14T00:18:42Z

No test changes?

No, at least not yet. The real test is huge and I didn't reproduce spilling in anything smaller yet.

rampitec · 2025-11-14T20:30:35Z

No test changes?

No, at least not yet. The real test is huge and I didn't reproduce spilling in anything smaller yet.

This seems to be very difficult to force RA spilling in a small test, it has to be really huge.

rampitec · 2025-11-15T00:53:51Z

This seems to be very difficult to force RA spilling in a small test, it has to be really huge.

The minimal working testcase llvm-reduce left me with is 102K of IR, 416 IR instructions. Do you want me to add it?

rampitec · 2025-11-15T01:33:52Z

This seems to be very difficult to force RA spilling in a small test, it has to be really huge.

The minimal working testcase llvm-reduce left me with is 102K of IR, 416 IR instructions. Do you want me to add it?

That's the test: #168163. It seems I cannot make it smaller.

rampitec · 2025-11-15T09:36:12Z

Even this is not a magic bullet. I still can seduce RA to spill in the pathological very big case. Still correct, but will spill on lo256 operands. I guess this is the nature of having limited VGPR classes and inability to encode all operands to use all possible registers.

rampitec requested a review from qcolombet November 13, 2025 23:32

rampitec marked this pull request as ready for review November 13, 2025 23:32

llvmbot added the backend:AMDGPU label Nov 13, 2025

arsenm reviewed Nov 13, 2025

View reviewed changes

rampitec mentioned this pull request Nov 15, 2025

[AMDGPU] Add baseline test to show spilling of wmma scale. NFC #168163

Open

[AMDGPU] Prioritize allocation of low 256 VGPR classes #167978

Are you sure you want to change the base?

[AMDGPU] Prioritize allocation of low 256 VGPR classes #167978

Conversation

rampitec commented Nov 13, 2025

Uh oh!

rampitec commented Nov 13, 2025

Uh oh!

llvmbot commented Nov 13, 2025

Uh oh!

rampitec commented Nov 13, 2025

Uh oh!

arsenm left a comment

Choose a reason for hiding this comment

Uh oh!

arsenm Nov 13, 2025

Choose a reason for hiding this comment

Uh oh!

rampitec Nov 14, 2025

Choose a reason for hiding this comment

Uh oh!

arsenm Nov 13, 2025

Choose a reason for hiding this comment

Uh oh!

rampitec Nov 14, 2025

Choose a reason for hiding this comment

Uh oh!

rampitec Nov 15, 2025

Choose a reason for hiding this comment

Uh oh!

rampitec commented Nov 14, 2025

Uh oh!

rampitec commented Nov 14, 2025

Uh oh!

rampitec commented Nov 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rampitec commented Nov 15, 2025

Uh oh!

rampitec commented Nov 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

rampitec commented Nov 15, 2025 •

edited

Loading