Description
The recent change to SIFixSGPRCopies::processPHINode in #169038 addressed the problem we reported in #168761, but it caused a regression in our matmul kernel: PHI nodes that had av_128-class registers were converted to vreg_128, with copies introduced to move the values back into areg_128 registers.
I put a repro case of the kernel I'm looking at here: https://gist.github.com/raiseirql/ebe539468b5a741ff561a85832205a3c
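For reference, the MIR dumps below can be regenerated from the gist with something along these lines (repro.ll, the gfx942 target, and the output names are placeholders for the actual kernel and target):
# Placeholder invocation; substitute the kernel IR from the gist and the real target.
llc -mtriple=amdgcn -mcpu=gfx942 -O3 -stop-before=si-fix-sgpr-copies repro.ll -o before.mir
llc -mtriple=amdgcn -mcpu=gfx942 -O3 -stop-after=si-fix-sgpr-copies repro.ll -o after.mir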
Should the change in #169038 be refined, or should a later pass have cleaned up the codegen to remove the copies?
Due to other recent backend changes, we have needed to apply the patch below in order to get AGPRs used at all. @arsenm had been looking into how to fix this properly.
diff --git a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
index b398db4f7caf..42af5c19cae2 100644
--- a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
@@ -90,7 +90,11 @@ SIMachineFunctionInfo::SIMachineFunctionInfo(const Function &F,
     auto [MinNumAGPRAttr, MaxNumAGPRAttr] =
         AMDGPU::getIntegerPairAttribute(F, "amdgpu-agpr-alloc", {~0u, ~0u},
                                         /*OnlyFirstRequired=*/true);
-    MinNumAGPRs = MinNumAGPRAttr;
+    MinNumAGPRs = ~0u;
+    if (MFMAVGPRForm ||
+        (ST.getMaxNumVGPRs(F) <= ST.getAddressableNumArchVGPRs() &&
+         !mayUseAGPRs(F)))
+      MinNumAGPRs = 0;
   }
   if (AMDGPU::isChainCC(CC)) {
Before si-fix-sgpr-copies, in the matmul inner loop (the same basic pattern remains after si-fix-sgpr-copies without #169038):
bb.2 (%ir-block.634):
; predecessors: %bb.1
...
%706:areg_128_align2 = PHI %352:av_128_align2, %bb.1
%707:areg_128_align2 = PHI %353:av_128_align2, %bb.1
...
%1147:areg_128_align2 = V_MFMA_F32_16X16X32_BF16_mac_e64 %778:av_128_align2, %770:av_128_align2, %706:areg_128_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
%1148:areg_128_align2 = V_MFMA_F32_16X16X32_BF16_mac_e64 %779:av_128_align2, %770:av_128_align2, %707:areg_128_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec

After si-fix-sgpr-copies with #169038:
%2483:vgpr_32 = COPY %791:av_32, implicit $exec
%2485:vgpr_32 = COPY %791:av_32, implicit $exec
%2487:vgpr_32 = COPY %791:av_32, implicit $exec
%2489:vgpr_32 = COPY %791:av_32, implicit $exec
%2491:vgpr_32 = COPY %791:av_32, implicit $exec
%2493:vgpr_32 = COPY %791:av_32, implicit $exec
%2495:vgpr_32 = COPY %791:av_32, implicit $exec
%2497:vgpr_32 = COPY %791:av_32, implicit $exec
bb.1 (%ir-block.184):
; predecessors: %bb.0, %bb.1
successors: %bb.2(0x04000000), %bb.1(0x7c000000); %bb.2(3.12%), %bb.1(96.88%)
...
%328:av_32 = PHI %2483:vgpr_32, %bb.0, %2484:vgpr_32, %bb.1
%329:av_32 = PHI %2485:vgpr_32, %bb.0, %2486:vgpr_32, %bb.1
%330:av_32 = PHI %2487:vgpr_32, %bb.0, %2488:vgpr_32, %bb.1
%331:av_32 = PHI %2489:vgpr_32, %bb.0, %2490:vgpr_32, %bb.1
%332:av_32 = PHI %2491:vgpr_32, %bb.0, %2492:vgpr_32, %bb.1
%333:av_32 = PHI %2493:vgpr_32, %bb.0, %2494:vgpr_32, %bb.1
%334:av_32 = PHI %2495:vgpr_32, %bb.0, %2496:vgpr_32, %bb.1
%335:av_32 = PHI %2497:vgpr_32, %bb.0, %2498:vgpr_32, %bb.1
...
%908:areg_128_align2 = REG_SEQUENCE %332:av_32, %subreg.sub0, %333:av_32, %subreg.sub1, %334:av_32, %subreg.sub2, %335:av_32, %subreg.sub3
%909:areg_128_align2 = REG_SEQUENCE %328:av_32, %subreg.sub0, %329:av_32, %subreg.sub1, %330:av_32, %subreg.sub2, %331:av_32, %subreg.sub3
...
%988:areg_128_align2 = V_MFMA_F32_16X16X32_BF16_mac_e64 %55:av_128_align2, %71:av_128_align2, %908:areg_128_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
%989:areg_128_align2 = V_MFMA_F32_16X16X32_BF16_mac_e64 %54:av_128_align2, %71:av_128_align2, %909:areg_128_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec

The end effect of all this is that the matmul kernel, for a test shape of 8Kx8Kx8K, goes from 0.85ms to over 15ms. The inner loop has several scratch reads and copies between vgpr/agpr:
v_mfma_f32_16x16x32_bf16 a[216:219], v[162:165], v[54:57], a[216:219]
s_nop 4
v_accvgpr_read_b32 v241, a135
v_accvgpr_read_b32 v240, a134
v_accvgpr_read_b32 v239, a133
v_mfma_f32_16x16x32_bf16 a[152:155], v[162:165], v[46:49], a[172:175]
v_accvgpr_read_b32 v238, a132
v_mfma_f32_16x16x32_bf16 a[128:131], v[162:165], v[50:53], a[128:131]
scratch_load_dwordx4 v[166:169], off, off offset:144

cc @jayfoad