-
Notifications
You must be signed in to change notification settings - Fork 15.2k
[AMDGPU] Prioritize allocation of low 256 VGPR classes #167978
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[AMDGPU] Prioritize allocation of low 256 VGPR classes #167978
Conversation
If we have 1024 VGPRs available we need to give priority to the allocation of these registers where operands can only use low 256. That is noteably scale operands of V_WMMA_SCALE instructions. Otherwise large tuples will be allocated first and take all low registers, so we would have to spill to get a room for these scale registers. Allocation priority itself does not eliminate spilling completely in large kernels, although helps to some degree. Increasing spill weight of a restricted class on top of it helps.
|
@llvm/pr-subscribers-backend-amdgpu Author: Stanislav Mekhanoshin (rampitec) ChangesIf we have 1024 VGPRs available we need to give priority to the Allocation priority itself does not eliminate spilling completely Full diff: https://github.com/llvm/llvm-project/pull/167978.diff 2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.h b/llvm/lib/Target/AMDGPU/SIRegisterInfo.h
index a6af25dfd7d6f..28ab2137b193c 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.h
@@ -501,6 +501,17 @@ class SIRegisterInfo final : public AMDGPUGenRegisterInfo {
SmallVector<StringLiteral>
getVRegFlagsOfReg(Register Reg, const MachineFunction &MF) const override;
+
+ float
+ getSpillWeightScaleFactor(const TargetRegisterClass *RC) const override {
+ // Prioritize VGPR_32_Lo256 over other classes which may occupy registers
+ // beyond v256.
+ return AMDGPUGenRegisterInfo::getSpillWeightScaleFactor(RC) *
+ ((RC == &AMDGPU::VGPR_32_Lo256RegClass ||
+ RC == &AMDGPU::VReg_64_Lo256_Align2RegClass)
+ ? 2.0
+ : 1.0);
+ }
};
namespace AMDGPU {
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.td b/llvm/lib/Target/AMDGPU/SIRegisterInfo.td
index abe12c17ae76c..5cff5f2248b02 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.td
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.td
@@ -644,7 +644,7 @@ def VGPR_32_Lo128 : SIRegisterClass<"AMDGPU", !listconcat(Reg32Types.types, Reg1
// Identical to VGPR_32 except it only contains the low 256 (Lo256) registers.
def VGPR_32_Lo256 : SIRegisterClass<"AMDGPU", !listconcat(Reg32Types.types, Reg16Types.types), 32,
(add (sequence "VGPR%u", 0, 255))> {
- let AllocationPriority = 0;
+ let AllocationPriority = !add(3, !mul(BaseClassPriority, BaseClassScaleFactor));
let GeneratePressureSet = 0;
let Size = 32;
let Weight = 1;
|
|
I was unable to come up with a reasonably small testcase so far. |
arsenm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No test changes?
| // Prioritize VGPR_32_Lo256 over other classes which may occupy registers | ||
| // beyond v256. | ||
| return AMDGPUGenRegisterInfo::getSpillWeightScaleFactor(RC) * | ||
| ((RC == &AMDGPU::VGPR_32_Lo256RegClass || | ||
| RC == &AMDGPU::VReg_64_Lo256_Align2RegClass) | ||
| ? 2.0 | ||
| : 1.0); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't require manual overriding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some level this is even right. Using this or that register is not more expensive. That is when it comes to spilling it matters.
| def VGPR_32_Lo256 : SIRegisterClass<"AMDGPU", !listconcat(Reg32Types.types, Reg16Types.types), 32, | ||
| (add (sequence "VGPR%u", 0, 255))> { | ||
| let AllocationPriority = 0; | ||
| let AllocationPriority = !add(3, !mul(BaseClassPriority, BaseClassScaleFactor)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you try setting CostPerUse? I think that makes more sense than changing priority
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's for registers, right? In this case it shall be per regclass. I.e., even if I increase it for low 256, it will be the same for a smaller and larger size. Here it is really a property of operand, not a register itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More or less that would make almost everything worse. Setting CostPerUse to non-zero for all low 256 VGPRs will persuade RA to allocate registers above 256 by default. At the end you will have programs using registers starting just from v256 and nothing at the range v0-v255.
No, at least not yet. The real test is huge and I didn't reproduce spilling in anything smaller yet. |
This seems to be very difficult to force RA spilling in a small test, it has to be really huge. |
The minimal working testcase llvm-reduce left me with is 102K of IR, 416 IR instructions. Do you want me to add it? |
That's the test: #168163. It seems I cannot make it smaller. |
|
Even this is not a magic bullet. I still can seduce RA to spill in the pathological very big case. Still correct, but will spill on lo256 operands. I guess this is the nature of having limited VGPR classes and inability to encode all operands to use all possible registers. |

If we have 1024 VGPRs available we need to give priority to the
allocation of these registers where operands can only use low 256.
That is noteably scale operands of V_WMMA_SCALE instructions.
Otherwise large tuples will be allocated first and take all low
registers, so we would have to spill to get a room for these
scale registers.
Allocation priority itself does not eliminate spilling completely
in large kernels, although helps to some degree. Increasing spill
weight of a restricted class on top of it helps.