Description
The recent change to SIFixSGPRCopies::processPHINode in #169038 addressed the problem we reported in #168761, but it caused a regression in our matmul kernel: PHI nodes that had av_128-class registers were converted to vreg_128, with copies introduced to move the values back into areg_128 registers.
I put a repro case of the kernel I'm looking at here: https://gist.github.com/raiseirql/ebe539468b5a741ff561a85832205a3c
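For reference, the MIR dumps below can be regenerated from the gist with something along these lines (repro.ll, the gfx942 target, and the output names are placeholders for the actual kernel and target):
# Placeholder invocation; substitute the kernel IR from the gist and the real target.
llc -mtriple=amdgcn -mcpu=gfx942 -O3 -stop-before=si-fix-sgpr-copies repro.ll -o before.mir
llc -mtriple=amdgcn -mcpu=gfx942 -O3 -stop-after=si-fix-sgpr-copies repro.ll -o after.mir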
Should the change in #169038 be refined, or should a later pass have cleaned up the codegen to remove the copies?
Due to other recent backend changes, we have needed to apply the patch below in order to get AGPRs used at all. @arsenm had been looking into how to fix this properly.
diff --git a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
index b398db4f7caf..42af5c19cae2 100644
--- a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
@@ -90,7 +90,11 @@ SIMachineFunctionInfo::SIMachineFunctionInfo(const Function &F,
     auto [MinNumAGPRAttr, MaxNumAGPRAttr] =
         AMDGPU::getIntegerPairAttribute(F, "amdgpu-agpr-alloc", {~0u, ~0u},
                                         /*OnlyFirstRequired=*/true);
-    MinNumAGPRs = MinNumAGPRAttr;
+    MinNumAGPRs = ~0u;
+    if (MFMAVGPRForm ||
+        (ST.getMaxNumVGPRs(F) <= ST.getAddressableNumArchVGPRs() &&
+         !mayUseAGPRs(F)))
+      MinNumAGPRs = 0;
   }
   if (AMDGPU::isChainCC(CC)) {
Before si-fix-sgpr-copies, in the matmul inner loop (the same basic pattern remains after si-fix-sgpr-copies without #169038):
bb.2 (%ir-block.634):
; predecessors: %bb.1
...
%706:areg_128_align2 = PHI %352:av_128_align2, %bb.1
%707:areg_128_align2 = PHI %353:av_128_align2, %bb.1
...
%1147:areg_128_align2 = V_MFMA_F32_16X16X32_BF16_mac_e64 %778:av_128_align2, %770:av_128_align2, %706:areg_128_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
%1148:areg_128_align2 = V_MFMA_F32_16X16X32_BF16_mac_e64 %779:av_128_align2, %770:av_128_align2, %707:areg_128_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec

After si-fix-sgpr-copies with #169038:
%2483:vgpr_32 = COPY %791:av_32, implicit $exec
%2485:vgpr_32 = COPY %791:av_32, implicit $exec
%2487:vgpr_32 = COPY %791:av_32, implicit $exec
%2489:vgpr_32 = COPY %791:av_32, implicit $exec
%2491:vgpr_32 = COPY %791:av_32, implicit $exec
%2493:vgpr_32 = COPY %791:av_32, implicit $exec
%2495:vgpr_32 = COPY %791:av_32, implicit $exec
%2497:vgpr_32 = COPY %791:av_32, implicit $exec
bb.1 (%ir-block.184):
; predecessors: %bb.0, %bb.1
successors: %bb.2(0x04000000), %bb.1(0x7c000000); %bb.2(3.12%), %bb.1(96.88%)
...
%328:av_32 = PHI %2483:vgpr_32, %bb.0, %2484:vgpr_32, %bb.1
%329:av_32 = PHI %2485:vgpr_32, %bb.0, %2486:vgpr_32, %bb.1
%330:av_32 = PHI %2487:vgpr_32, %bb.0, %2488:vgpr_32, %bb.1
%331:av_32 = PHI %2489:vgpr_32, %bb.0, %2490:vgpr_32, %bb.1
%332:av_32 = PHI %2491:vgpr_32, %bb.0, %2492:vgpr_32, %bb.1
%333:av_32 = PHI %2493:vgpr_32, %bb.0, %2494:vgpr_32, %bb.1
%334:av_32 = PHI %2495:vgpr_32, %bb.0, %2496:vgpr_32, %bb.1
%335:av_32 = PHI %2497:vgpr_32, %bb.0, %2498:vgpr_32, %bb.1
...
%908:areg_128_align2 = REG_SEQUENCE %332:av_32, %subreg.sub0, %333:av_32, %subreg.sub1, %334:av_32, %subreg.sub2, %335:av_32, %subreg.sub3
%909:areg_128_align2 = REG_SEQUENCE %328:av_32, %subreg.sub0, %329:av_32, %subreg.sub1, %330:av_32, %subreg.sub2, %331:av_32, %subreg.sub3
...
%988:areg_128_align2 = V_MFMA_F32_16X16X32_BF16_mac_e64 %55:av_128_align2, %71:av_128_align2, %908:areg_128_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
%989:areg_128_align2 = V_MFMA_F32_16X16X32_BF16_mac_e64 %54:av_128_align2, %71:av_128_align2, %909:areg_128_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec

The end effect of all this is that the matmul kernel, for a test shape of 8Kx8Kx8K, goes from 0.85ms to over 15ms. The inner loop has several scratch reads and copies between vgpr/agpr:
v_mfma_f32_16x16x32_bf16 a[216:219], v[162:165], v[54:57], a[216:219]
s_nop 4
v_accvgpr_read_b32 v241, a135
v_accvgpr_read_b32 v240, a134
v_accvgpr_read_b32 v239, a133
v_mfma_f32_16x16x32_bf16 a[152:155], v[162:165], v[46:49], a[172:175]
v_accvgpr_read_b32 v238, a132
v_mfma_f32_16x16x32_bf16 a[128:131], v[162:165], v[50:53], a[128:131]
scratch_load_dwordx4 v[166:169], off, off offset:144

cc @jayfoad