Skip to content

Conversation

@cdevadas
Copy link
Collaborator

For some targets, it is required to identify the COPY instruction
corresponds to the RA inserted live range split. Adding the new
flag MachineInstr::LRSplit to serve the purpose.

Copy link
Collaborator Author

cdevadas commented Nov 25, 2024

@llvmbot
Copy link
Member

llvmbot commented Nov 25, 2024

@llvm/pr-subscribers-llvm-regalloc

Author: Christudasan Devadasan (cdevadas)

Changes

For some targets, it is required to identify the COPY instruction
corresponds to the RA inserted live range split. Adding the new
flag MachineInstr::LRSplit to serve the purpose.


Full diff: https://github.com/llvm/llvm-project/pull/117543.diff

2 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/MachineInstr.h (+2-1)
  • (modified) llvm/lib/CodeGen/SplitKit.cpp (+2)
diff --git a/llvm/include/llvm/CodeGen/MachineInstr.h b/llvm/include/llvm/CodeGen/MachineInstr.h
index ead6bbe1d5f641..4545b205d07466 100644
--- a/llvm/include/llvm/CodeGen/MachineInstr.h
+++ b/llvm/include/llvm/CodeGen/MachineInstr.h
@@ -119,7 +119,8 @@ class MachineInstr
     Disjoint = 1 << 19,      // Each bit is zero in at least one of the inputs.
     NoUSWrap = 1 << 20,      // Instruction supports geps
                              // no unsigned signed wrap.
-    SameSign = 1 << 21       // Both operands have the same sign.
+    SameSign = 1 << 21,      // Both operands have the same sign.
+    LRSplit = 1 << 22        // Instruction for live range split.
   };
 
 private:
diff --git a/llvm/lib/CodeGen/SplitKit.cpp b/llvm/lib/CodeGen/SplitKit.cpp
index eb33b93c197d7c..5042f074c26c45 100644
--- a/llvm/lib/CodeGen/SplitKit.cpp
+++ b/llvm/lib/CodeGen/SplitKit.cpp
@@ -533,6 +533,7 @@ SlotIndex SplitEditor::buildSingleSubRegCopy(
               | getInternalReadRegState(!FirstCopy), SubIdx)
       .addReg(FromReg, 0, SubIdx);
 
+  CopyMI->setFlag(MachineInstr::LRSplit);
   SlotIndexes &Indexes = *LIS.getSlotIndexes();
   if (FirstCopy) {
     Def = Indexes.insertMachineInstrInMaps(*CopyMI, Late).getRegSlot();
@@ -552,6 +553,7 @@ SlotIndex SplitEditor::buildCopy(Register FromReg, Register ToReg,
     // The full vreg is copied.
     MachineInstr *CopyMI =
         BuildMI(MBB, InsertBefore, DebugLoc(), Desc, ToReg).addReg(FromReg);
+    CopyMI->setFlag(MachineInstr::LRSplit);
     return Indexes.insertMachineInstrInMaps(*CopyMI, Late).getRegSlot();
   }
 

Copy link
Collaborator

@qcolombet qcolombet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this information?

At the end of the day this is just a regular copy.

@cdevadas
Copy link
Collaborator Author

Why do you need this information?

At the end of the day this is just a regular copy.

Can you see the other PR in the stack? #117544
It is indeed just another copy. That's the real problem when identifying the LR split instructions from the other COPY instructions. AMDGPU has multiple regalloc pipelines (per regclass). We depend on the BBProlog concept (isBasicBlockPrologue) while the spills/copies are inserted during RA. This is primarily needed to push down the VGPR spills/copies in certain blocks at the right point after the exec mask values are manipulated for divergent execution.
I could have directly used COPY to identify the split instruction if this target hook isBasicBlockPrologue is used only during RA. But it is integrated inside the helper functions SkipPHIsAndLabels & SkipPHIsLabelsAndDebug which are used to skip certain Pseudo/Meta instructions from the BB top. These functions are also used during PHI elimination, MI Sink, etc., and cause trouble.

@cdevadas
Copy link
Collaborator Author

Ping

@qcolombet
Copy link
Collaborator

Why do you need this information?
At the end of the day this is just a regular copy.

Can you see the other PR in the stack? #117544 It is indeed just another copy. That's the real problem when identifying the LR split instructions from the other COPY instructions.

My point is why do you have to distinguish them to begin with?
Can't you apply your "push down" transformation on all the COPYs that you can?

@cdevadas
Copy link
Collaborator Author

cdevadas commented Dec 2, 2024

My point is why do you have to distinguish them to begin with? Can't you apply your "push down" transformation on all the COPYs that you can?

For AMDGPU the bb prolog instructions are the RA inserted spills and LR split copies. Any other instruction included as part of BBProlog would result in wrong insertion point leading to buggy CodeGen. PHI elimination pass, for instance, uses the same hook to identify the insertion point while inserting copies at the predecessor blocks. Any COPY at a block begin that was part of regular CodeGen would then be included in the prolog leading to incorrect insertion points.

@cdevadas
Copy link
Collaborator Author

cdevadas commented Dec 9, 2024

Ping

@cdevadas
Copy link
Collaborator Author

Ping @qcolombet.

@cdevadas
Copy link
Collaborator Author

Ping.

@cdevadas
Copy link
Collaborator Author

cdevadas commented Feb 4, 2025

Ping. @qcolombet this patch addresses a critical bug in the AMDGPU codegen. Please take a look.

@carlobertolli
Copy link
Member

I verified that when making all copy instructions part of the BB prologue, we see LIT regressions pointing to incorrect instruction scheduling (the program semantics is changed).
Example 1:
mubuf-legalize-operands-non-ptr-intrinsics.ll that should look like this

; W64-O0-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:24 ; 4-byte Folded Spill
; W64-O0-NEXT:    s_xor_b64 exec, exec, s[4:5]
; W64-O0-NEXT:    s_cbranch_execnz .LBB0_1
; W64-O0-NEXT:  ; %bb.3:
; W64-O0-NEXT:    s_or_saveexec_b64 s[16:17], -1
; W64-O0-NEXT:    buffer_load_dword v7, off, s[0:3], s32 ; 4-byte Folded Reload
; W64-O0-NEXT:    s_mov_b64 exec, s[16:17]
; W64-O0-NEXT:    s_waitcnt vmcnt(0)
; W64-O0-NEXT:    v_readlane_b32 s4, v7, 1
; W64-O0-NEXT:    v_readlane_b32 s5, v7, 2
; W64-O0-NEXT:    s_mov_b64 exec, s[4:5]
; W64-O0-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:24 ; 4-byte Folded Reload
; W64-O0-NEXT:    s_xor_saveexec_b64 s[4:5], -1

but I get this out

  buffer_store_dword v0, off, s[0:3], s32 offset:24 ; 4-byte Folded Spill
  s_xor_b64 exec, exec, s[4:5]
  s_cbranch_execnz .LBB0_1
; %bb.3:
  buffer_load_dword v0, off, s[0:3], s32 offset:24 ; 4-byte Folded Reload
  s_or_saveexec_b64 s[16:17], -1
  buffer_load_dword v7, off, s[0:3], s32  ; 4-byte Folded Reload
  s_mov_b64 exec, s[16:17]
  s_waitcnt vmcnt(0)
  v_readlane_b32 s4, v7, 1
  v_readlane_b32 s5, v7, 2
  s_mov_b64 exec, s[4:5]
  s_xor_saveexec_b64 s[4:5], -1
  buffer_load_dword v7, off, s[0:3], s32 offset:28 ; 4-byte Folded Reload
  s_mov_b64 exec, s[4:5]

This instruction:

buffer_load_dword v7, off, s[0:3], s32 ; 4-byte Folded Reload
should be after

s_or_saveexec_b64 s[16:17], -1

but it is not anymore (buffer loads are conditionally executed based on exec reg value).

There are also a bunch of unhappy stack traces, such as

Bad machine code: Using an undefined physical register
function: test_loop
basic block: %bb.1 loop.body (0x55555c1565f8)
instruction: renamable $sgpr8 = COPY renamable $sgpr5
operand 1: renamable $sgpr5
LLVM ERROR: Found 1 machine code errors.

in test basic-loop.ll.

This is due to isBasicBlockPrologue being used across different functionalities, such as PHIElimination, MachineIRSink, and others.

I think this patch and the associated #117544 are the minimum viable to fix this issue.
@arsenm can you please verify the above?

@carlobertolli
Copy link
Member

I verified that this PR and the one associate to it also fix:
#166657
The bug report above indicates that the issue is seen in multiple archs, including gfx9 and gfx11.

For some targets, it is required to identify the COPY instruction
corresponds to the RA inserted live range split. Adding the new
flag `MachineInstr::LRSplit` to serve the purpose.
@cdevadas cdevadas force-pushed the users/cdevadas/mi-flag-for-lr-split-insn branch from 90f830f to a9587c1 Compare November 7, 2025 06:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants