[CodeGen] Introduce MI flag for Live Range split instructions #117543

cdevadas · 2024-11-25T11:48:12Z

For some targets, it is required to identify the COPY instruction
corresponds to the RA inserted live range split. Adding the new
flag MachineInstr::LRSplit to serve the purpose.

cdevadas · 2024-11-25T11:48:28Z

This stack of pull requests is managed by Graphite. Learn more about stacking.

llvmbot · 2024-11-25T11:51:33Z

@llvm/pr-subscribers-llvm-regalloc

Author: Christudasan Devadasan (cdevadas)

Changes

For some targets, it is required to identify the COPY instruction
corresponds to the RA inserted live range split. Adding the new
flag MachineInstr::LRSplit to serve the purpose.

Full diff: https://github.com/llvm/llvm-project/pull/117543.diff

2 Files Affected:

(modified) llvm/include/llvm/CodeGen/MachineInstr.h (+2-1)
(modified) llvm/lib/CodeGen/SplitKit.cpp (+2)

diff --git a/llvm/include/llvm/CodeGen/MachineInstr.h b/llvm/include/llvm/CodeGen/MachineInstr.h
index ead6bbe1d5f641..4545b205d07466 100644
--- a/llvm/include/llvm/CodeGen/MachineInstr.h
+++ b/llvm/include/llvm/CodeGen/MachineInstr.h
@@ -119,7 +119,8 @@ class MachineInstr
     Disjoint = 1 << 19,      // Each bit is zero in at least one of the inputs.
     NoUSWrap = 1 << 20,      // Instruction supports geps
                              // no unsigned signed wrap.
-    SameSign = 1 << 21       // Both operands have the same sign.
+    SameSign = 1 << 21,      // Both operands have the same sign.
+    LRSplit = 1 << 22        // Instruction for live range split.
   };
 
 private:
diff --git a/llvm/lib/CodeGen/SplitKit.cpp b/llvm/lib/CodeGen/SplitKit.cpp
index eb33b93c197d7c..5042f074c26c45 100644
--- a/llvm/lib/CodeGen/SplitKit.cpp
+++ b/llvm/lib/CodeGen/SplitKit.cpp
@@ -533,6 +533,7 @@ SlotIndex SplitEditor::buildSingleSubRegCopy(
               | getInternalReadRegState(!FirstCopy), SubIdx)
       .addReg(FromReg, 0, SubIdx);
 
+  CopyMI->setFlag(MachineInstr::LRSplit);
   SlotIndexes &Indexes = *LIS.getSlotIndexes();
   if (FirstCopy) {
     Def = Indexes.insertMachineInstrInMaps(*CopyMI, Late).getRegSlot();
@@ -552,6 +553,7 @@ SlotIndex SplitEditor::buildCopy(Register FromReg, Register ToReg,
     // The full vreg is copied.
     MachineInstr *CopyMI =
         BuildMI(MBB, InsertBefore, DebugLoc(), Desc, ToReg).addReg(FromReg);
+    CopyMI->setFlag(MachineInstr::LRSplit);
     return Indexes.insertMachineInstrInMaps(*CopyMI, Late).getRegSlot();
   }

qcolombet

Why do you need this information?

At the end of the day this is just a regular copy.

cdevadas · 2024-11-26T05:12:30Z

Why do you need this information?

At the end of the day this is just a regular copy.

Can you see the other PR in the stack? #117544
It is indeed just another copy. That's the real problem when identifying the LR split instructions from the other COPY instructions. AMDGPU has multiple regalloc pipelines (per regclass). We depend on the BBProlog concept (isBasicBlockPrologue) while the spills/copies are inserted during RA. This is primarily needed to push down the VGPR spills/copies in certain blocks at the right point after the exec mask values are manipulated for divergent execution.
I could have directly used COPY to identify the split instruction if this target hook isBasicBlockPrologue is used only during RA. But it is integrated inside the helper functions SkipPHIsAndLabels & SkipPHIsLabelsAndDebug which are used to skip certain Pseudo/Meta instructions from the BB top. These functions are also used during PHI elimination, MI Sink, etc., and cause trouble.

cdevadas · 2024-11-29T20:19:03Z

Ping

qcolombet · 2024-12-02T10:18:32Z

Why do you need this information?
At the end of the day this is just a regular copy.

Can you see the other PR in the stack? #117544 It is indeed just another copy. That's the real problem when identifying the LR split instructions from the other COPY instructions.

My point is why do you have to distinguish them to begin with?
Can't you apply your "push down" transformation on all the COPYs that you can?

cdevadas · 2024-12-02T18:44:01Z

My point is why do you have to distinguish them to begin with? Can't you apply your "push down" transformation on all the COPYs that you can?

For AMDGPU the bb prolog instructions are the RA inserted spills and LR split copies. Any other instruction included as part of BBProlog would result in wrong insertion point leading to buggy CodeGen. PHI elimination pass, for instance, uses the same hook to identify the insertion point while inserting copies at the predecessor blocks. Any COPY at a block begin that was part of regular CodeGen would then be included in the prolog leading to incorrect insertion points.

cdevadas · 2024-12-09T10:27:33Z

Ping

cdevadas · 2024-12-20T05:09:30Z

Ping @qcolombet.

cdevadas · 2025-01-28T17:11:12Z

Ping.

cdevadas · 2025-02-04T15:51:03Z

Ping. @qcolombet this patch addresses a critical bug in the AMDGPU codegen. Please take a look.

carlobertolli · 2025-11-04T16:54:41Z

I verified that when making all copy instructions part of the BB prologue, we see LIT regressions pointing to incorrect instruction scheduling (the program semantics is changed).
Example 1:
mubuf-legalize-operands-non-ptr-intrinsics.ll that should look like this

; W64-O0-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:24 ; 4-byte Folded Spill
; W64-O0-NEXT:    s_xor_b64 exec, exec, s[4:5]
; W64-O0-NEXT:    s_cbranch_execnz .LBB0_1
; W64-O0-NEXT:  ; %bb.3:
; W64-O0-NEXT:    s_or_saveexec_b64 s[16:17], -1
; W64-O0-NEXT:    buffer_load_dword v7, off, s[0:3], s32 ; 4-byte Folded Reload
; W64-O0-NEXT:    s_mov_b64 exec, s[16:17]
; W64-O0-NEXT:    s_waitcnt vmcnt(0)
; W64-O0-NEXT:    v_readlane_b32 s4, v7, 1
; W64-O0-NEXT:    v_readlane_b32 s5, v7, 2
; W64-O0-NEXT:    s_mov_b64 exec, s[4:5]
; W64-O0-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:24 ; 4-byte Folded Reload
; W64-O0-NEXT:    s_xor_saveexec_b64 s[4:5], -1

but I get this out

  buffer_store_dword v0, off, s[0:3], s32 offset:24 ; 4-byte Folded Spill
  s_xor_b64 exec, exec, s[4:5]
  s_cbranch_execnz .LBB0_1
; %bb.3:
  buffer_load_dword v0, off, s[0:3], s32 offset:24 ; 4-byte Folded Reload
  s_or_saveexec_b64 s[16:17], -1
  buffer_load_dword v7, off, s[0:3], s32  ; 4-byte Folded Reload
  s_mov_b64 exec, s[16:17]
  s_waitcnt vmcnt(0)
  v_readlane_b32 s4, v7, 1
  v_readlane_b32 s5, v7, 2
  s_mov_b64 exec, s[4:5]
  s_xor_saveexec_b64 s[4:5], -1
  buffer_load_dword v7, off, s[0:3], s32 offset:28 ; 4-byte Folded Reload
  s_mov_b64 exec, s[4:5]

This instruction:

buffer_load_dword v7, off, s[0:3], s32 ; 4-byte Folded Reload
should be after

s_or_saveexec_b64 s[16:17], -1

but it is not anymore (buffer loads are conditionally executed based on exec reg value).

There are also a bunch of unhappy stack traces, such as

Bad machine code: Using an undefined physical register
function: test_loop
basic block: %bb.1 loop.body (0x55555c1565f8)
instruction: renamable $sgpr8 = COPY renamable $sgpr5
operand 1: renamable $sgpr5
LLVM ERROR: Found 1 machine code errors.

in test basic-loop.ll.

This is due to isBasicBlockPrologue being used across different functionalities, such as PHIElimination, MachineIRSink, and others.

I think this patch and the associated #117544 are the minimum viable to fix this issue.
@arsenm can you please verify the above?

carlobertolli · 2025-11-06T18:06:52Z

I verified that this PR and the one associate to it also fix:
#166657
The bug report above indicates that the issue is seen in multiple archs, including gfx9 and gfx11.

For some targets, it is required to identify the COPY instruction corresponds to the RA inserted live range split. Adding the new flag `MachineInstr::LRSplit` to serve the purpose.

cdevadas mentioned this pull request Nov 25, 2024

[AMDGPU] Add liverange split instructions into BB Prolog #117544

Open

cdevadas requested review from arsenm and qcolombet November 25, 2024 11:49

cdevadas marked this pull request as ready for review November 25, 2024 11:50

llvmbot added the llvm:regalloc label Nov 25, 2024

qcolombet reviewed Nov 25, 2024

View reviewed changes

carlobertolli requested review from arsenm and removed request for arsenm November 4, 2025 19:54

carlobertolli mentioned this pull request Nov 6, 2025

[AMDGPU] register spill instructions are generated inside control flow with exec=0 #166657

Open

[CodeGen] Introduce MI flag for Live Range split instructions

a9587c1

For some targets, it is required to identify the COPY instruction corresponds to the RA inserted live range split. Adding the new flag `MachineInstr::LRSplit` to serve the purpose.

cdevadas force-pushed the users/cdevadas/mi-flag-for-lr-split-insn branch from 90f830f to a9587c1 Compare November 7, 2025 06:50

llvmbot added the llvm:codegen label Nov 7, 2025

[CodeGen] Introduce MI flag for Live Range split instructions #117543

Are you sure you want to change the base?

[CodeGen] Introduce MI flag for Live Range split instructions #117543

Uh oh!

Conversation

cdevadas commented Nov 25, 2024

Uh oh!

cdevadas commented Nov 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Nov 25, 2024

Uh oh!

qcolombet left a comment

Choose a reason for hiding this comment

Uh oh!

cdevadas commented Nov 26, 2024

Uh oh!

cdevadas commented Nov 29, 2024

Uh oh!

qcolombet commented Dec 2, 2024

Uh oh!

cdevadas commented Dec 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cdevadas commented Dec 9, 2024

Uh oh!

cdevadas commented Dec 20, 2024

Uh oh!

cdevadas commented Jan 28, 2025

Uh oh!

cdevadas commented Feb 4, 2025

Uh oh!

carlobertolli commented Nov 4, 2025

Uh oh!

carlobertolli commented Nov 6, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

cdevadas commented Nov 25, 2024 •

edited

Loading

cdevadas commented Dec 2, 2024 •

edited

Loading