
Conversation

@lukel97
Contributor

@lukel97 lukel97 commented Nov 6, 2025

In some workloads we see an argument passed on the stack being loaded, only to be immediately spilled to a different stack slot and then reloaded from that spill slot later on.

We can avoid the unnecessary spill by marking loads as rematerializable and loading directly from where the argument was originally passed on the stack. TargetInstrInfo::isReMaterializableImpl checks that any load is MI.isDereferenceableInvariantLoad(), so it should be safe to move the load down to the remat site.
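
Roughly, the pattern looks like this (illustrative registers and offsets, not taken from an actual benchmark):

    # Before: the incoming stack argument is reloaded via a fresh spill slot.
    lw    a0, 64(sp)        # load the stack argument
    sw    a0, 8(sp)         # spill it to a different slot
    ...                     # a0 is clobbered in between
    lw    a0, 8(sp)         # reload from the spill slot

    # After: the load is rematerializable, so the register allocator re-issues
    # it from the original argument slot and the spill slot goes away.
    lw    a0, 64(sp)        # load the stack argument
    ...                     # a0 is clobbered in between
    lw    a0, 64(sp)        # rematerialized load from the original slot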

This gives a 14.8% reduction in spills in 544.nab_r on rva23u64 -O3, and a few other smaller reductions on llvm-test-suite. I didn't find any benchmarks where the number of spills/reloads increased.

Related: #165761

@llvmbot
Member

llvmbot commented Nov 6, 2025

@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes

In some workloads we see an argument passed on the stack being loaded, only to be immediately spilled to a different stack slot and then reloaded from that spill slot.

We can avoid the unnecessary spill by marking loads as rematerializable. TargetInstrInfo::isReMaterializableImpl checks that any load is MI.isDereferenceableInvariantLoad(), so it should be safe to move the load down to the remat site.

This gives a 14.8% reduction in spills in 544.nab_r on rva23u64 -O3, and a few other smaller reductions on llvm-test-suite. I didn't find any benchmarks where the number of spills/reloads increased.

Related: #165761


Full diff: https://github.com/llvm/llvm-project/pull/166774.diff

6 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVInstrInfo.td (+2-2)
  • (modified) llvm/lib/Target/RISCV/RISCVInstrInfoD.td (+1-1)
  • (modified) llvm/lib/Target/RISCV/RISCVInstrInfoF.td (+1-1)
  • (modified) llvm/lib/Target/RISCV/RISCVInstrInfoZfh.td (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/remat.ll (+92)
  • (modified) llvm/test/CodeGen/RISCV/rvv/pr95865.ll (+1-3)
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfo.td b/llvm/lib/Target/RISCV/RISCVInstrInfo.td
index 9cb53fb27a2d2..84b962b2a8607 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfo.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfo.td
@@ -768,7 +768,7 @@ def BGE  : BranchCC_rri<0b101, "bge">;
 def BLTU : BranchCC_rri<0b110, "bltu">;
 def BGEU : BranchCC_rri<0b111, "bgeu">;
 
-let IsSignExtendingOpW = 1, canFoldAsLoad = 1 in {
+let IsSignExtendingOpW = 1, canFoldAsLoad = 1, isReMaterializable = 1 in {
 def LB  : Load_ri<0b000, "lb">, Sched<[WriteLDB, ReadMemBase]>;
 def LH  : Load_ri<0b001, "lh">, Sched<[WriteLDH, ReadMemBase]>;
 def LW  : Load_ri<0b010, "lw">, Sched<[WriteLDW, ReadMemBase]>;
@@ -889,7 +889,7 @@ def CSRRCI : CSR_ii<0b111, "csrrci">;
 /// RV64I instructions
 
 let Predicates = [IsRV64] in {
-let canFoldAsLoad = 1 in {
+let canFoldAsLoad = 1, isReMaterializable = 1 in {
 def LWU   : Load_ri<0b110, "lwu">, Sched<[WriteLDW, ReadMemBase]>;
 def LD    : Load_ri<0b011, "ld">, Sched<[WriteLDD, ReadMemBase]>;
 }
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfoD.td b/llvm/lib/Target/RISCV/RISCVInstrInfoD.td
index 4ffe3e62ac501..deacd41e6469a 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfoD.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfoD.td
@@ -71,7 +71,7 @@ defvar DExtsRV64 = [DExt, ZdinxExt];
 //===----------------------------------------------------------------------===//
 
 let Predicates = [HasStdExtD] in {
-let canFoldAsLoad = 1 in
+let canFoldAsLoad = 1, isReMaterializable = 1 in
 def FLD : FPLoad_r<0b011, "fld", FPR64, WriteFLD64>;
 
 // Operands for stores are in the order srcreg, base, offset rather than
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfoF.td b/llvm/lib/Target/RISCV/RISCVInstrInfoF.td
index b30f8ec820c15..bd191001b75ec 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfoF.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfoF.td
@@ -330,7 +330,7 @@ class PseudoFROUND<DAGOperand Ty, ValueType vt, ValueType intvt = XLenVT>
 //===----------------------------------------------------------------------===//
 
 let Predicates = [HasStdExtF] in {
-let canFoldAsLoad = 1 in
+let canFoldAsLoad = 1, isReMaterializable = 1 in
 def FLW : FPLoad_r<0b010, "flw", FPR32, WriteFLD32>;
 
 // Operands for stores are in the order srcreg, base, offset rather than
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfoZfh.td b/llvm/lib/Target/RISCV/RISCVInstrInfoZfh.td
index 1c6a5afcda49b..c172d1739ba61 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfoZfh.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfoZfh.td
@@ -90,7 +90,7 @@ defvar ZfhminDExts = [ZfhminDExt, ZhinxminZdinxExt, ZhinxminZdinx32Ext];
 //===----------------------------------------------------------------------===//
 
 let Predicates = [HasHalfFPLoadStoreMove] in {
-let canFoldAsLoad = 1 in
+let canFoldAsLoad = 1, isReMaterializable = 1 in
 def FLH : FPLoad_r<0b001, "flh", FPR16, WriteFLD16>;
 
 // Operands for stores are in the order srcreg, base, offset rather than
diff --git a/llvm/test/CodeGen/RISCV/remat.ll b/llvm/test/CodeGen/RISCV/remat.ll
index 92ae85f560cd4..88240474f30bf 100644
--- a/llvm/test/CodeGen/RISCV/remat.ll
+++ b/llvm/test/CodeGen/RISCV/remat.ll
@@ -201,3 +201,95 @@ for.end:                                          ; preds = %for.inc, %entry
 }
 
 declare i32 @foo(i32, i32, i32, i32, i32, i32)
+
+define i32 @remat_ld(i32 %0, i32 %1, i32 %2, i32 %3, i32 %4, i32 %5, i32 %6, i32 %7, i32 %stack0, i32 %stack1, ptr %p, ptr %q) {
+; RV32I-LABEL: remat_ld:
+; RV32I:       # %bb.0: # %entry
+; RV32I-NEXT:    addi sp, sp, -64
+; RV32I-NEXT:    .cfi_def_cfa_offset 64
+; RV32I-NEXT:    sw ra, 60(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    sw s0, 56(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    sw s1, 52(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    sw s2, 48(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    sw s3, 44(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    sw s4, 40(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    sw s5, 36(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    sw s6, 32(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    sw s7, 28(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    sw s8, 24(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    sw s9, 20(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    sw s10, 16(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    sw s11, 12(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    .cfi_offset ra, -4
+; RV32I-NEXT:    .cfi_offset s0, -8
+; RV32I-NEXT:    .cfi_offset s1, -12
+; RV32I-NEXT:    .cfi_offset s2, -16
+; RV32I-NEXT:    .cfi_offset s3, -20
+; RV32I-NEXT:    .cfi_offset s4, -24
+; RV32I-NEXT:    .cfi_offset s5, -28
+; RV32I-NEXT:    .cfi_offset s6, -32
+; RV32I-NEXT:    .cfi_offset s7, -36
+; RV32I-NEXT:    .cfi_offset s8, -40
+; RV32I-NEXT:    .cfi_offset s9, -44
+; RV32I-NEXT:    .cfi_offset s10, -48
+; RV32I-NEXT:    .cfi_offset s11, -52
+; RV32I-NEXT:    sw a0, 8(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    lw a2, 68(sp)
+; RV32I-NEXT:    lw a3, 64(sp)
+; RV32I-NEXT:    lw a0, 72(sp)
+; RV32I-NEXT:    lw a1, 76(sp)
+; RV32I-NEXT:    sw a3, 0(a0)
+; RV32I-NEXT:    sw a2, 0(a1)
+; RV32I-NEXT:    #APP
+; RV32I-NEXT:    #NO_APP
+; RV32I-NEXT:    lw a0, 8(sp) # 4-byte Folded Reload
+; RV32I-NEXT:    beqz a0, .LBB1_2
+; RV32I-NEXT:  # %bb.1: # %falsebb
+; RV32I-NEXT:    li a0, 0
+; RV32I-NEXT:    j .LBB1_3
+; RV32I-NEXT:  .LBB1_2: # %truebb
+; RV32I-NEXT:    lw a0, 68(sp)
+; RV32I-NEXT:    lw a1, 64(sp)
+; RV32I-NEXT:    add a0, a1, a0
+; RV32I-NEXT:  .LBB1_3: # %falsebb
+; RV32I-NEXT:    lw ra, 60(sp) # 4-byte Folded Reload
+; RV32I-NEXT:    lw s0, 56(sp) # 4-byte Folded Reload
+; RV32I-NEXT:    lw s1, 52(sp) # 4-byte Folded Reload
+; RV32I-NEXT:    lw s2, 48(sp) # 4-byte Folded Reload
+; RV32I-NEXT:    lw s3, 44(sp) # 4-byte Folded Reload
+; RV32I-NEXT:    lw s4, 40(sp) # 4-byte Folded Reload
+; RV32I-NEXT:    lw s5, 36(sp) # 4-byte Folded Reload
+; RV32I-NEXT:    lw s6, 32(sp) # 4-byte Folded Reload
+; RV32I-NEXT:    lw s7, 28(sp) # 4-byte Folded Reload
+; RV32I-NEXT:    lw s8, 24(sp) # 4-byte Folded Reload
+; RV32I-NEXT:    lw s9, 20(sp) # 4-byte Folded Reload
+; RV32I-NEXT:    lw s10, 16(sp) # 4-byte Folded Reload
+; RV32I-NEXT:    lw s11, 12(sp) # 4-byte Folded Reload
+; RV32I-NEXT:    .cfi_restore ra
+; RV32I-NEXT:    .cfi_restore s0
+; RV32I-NEXT:    .cfi_restore s1
+; RV32I-NEXT:    .cfi_restore s2
+; RV32I-NEXT:    .cfi_restore s3
+; RV32I-NEXT:    .cfi_restore s4
+; RV32I-NEXT:    .cfi_restore s5
+; RV32I-NEXT:    .cfi_restore s6
+; RV32I-NEXT:    .cfi_restore s7
+; RV32I-NEXT:    .cfi_restore s8
+; RV32I-NEXT:    .cfi_restore s9
+; RV32I-NEXT:    .cfi_restore s10
+; RV32I-NEXT:    .cfi_restore s11
+; RV32I-NEXT:    addi sp, sp, 64
+; RV32I-NEXT:    .cfi_def_cfa_offset 0
+; RV32I-NEXT:    ret
+entry:
+  store i32 %stack0, ptr %p
+  store i32 %stack1, ptr %q
+  tail call void asm sideeffect "", "~{x1},~{x3},~{x4},~{x5},~{x6},~{x7},~{x8},~{x9},~{x10},~{x11},~{x12},~{x13},~{x14},~{x15},~{x16},~{x17},~{x18},~{x19},~{x20},~{x21},~{x22},~{x23},~{x24},~{x25},~{x26},~{x27},~{x28},~{x29},~{x30},~{x31}"()
+  %a = icmp eq i32 %0, 0
+  br i1 %a, label %truebb, label %falsebb
+truebb:
+  %b = add i32 %stack0, %stack1
+  ret i32 %b
+falsebb:
+  ret i32 0
+}
diff --git a/llvm/test/CodeGen/RISCV/rvv/pr95865.ll b/llvm/test/CodeGen/RISCV/rvv/pr95865.ll
index ab9849631663c..01d66b344ec2e 100644
--- a/llvm/test/CodeGen/RISCV/rvv/pr95865.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/pr95865.ll
@@ -40,8 +40,6 @@ define i32 @main(i1 %arg.1, i64 %arg.2, i1 %arg.3, i64 %arg.4, i1 %arg.5, <vscal
 ; CHECK-NEXT:    li t0, 12
 ; CHECK-NEXT:    li s0, 4
 ; CHECK-NEXT:    li t1, 20
-; CHECK-NEXT:    ld a1, 112(sp)
-; CHECK-NEXT:    sd a1, 0(sp) # 8-byte Folded Spill
 ; CHECK-NEXT:    vsetivli zero, 8, e32, m2, ta, ma
 ; CHECK-NEXT:    vmv.v.i v8, 0
 ; CHECK-NEXT:    andi t3, a4, 1
@@ -142,7 +140,7 @@ define i32 @main(i1 %arg.1, i64 %arg.2, i1 %arg.3, i64 %arg.4, i1 %arg.5, <vscal
 ; CHECK-NEXT:    j .LBB0_11
 ; CHECK-NEXT:  .LBB0_12: # %for.body7.us.19
 ; CHECK-NEXT:    vsetvli a0, zero, e32, m8, ta, ma
-; CHECK-NEXT:    ld a0, 0(sp) # 8-byte Folded Reload
+; CHECK-NEXT:    ld a0, 112(sp)
 ; CHECK-NEXT:    vmv.s.x v16, a0
 ; CHECK-NEXT:    vmv.v.i v8, 0
 ; CHECK-NEXT:    vsetivli zero, 2, e32, m1, tu, ma

; RV32I-NEXT: sw s9, 20(sp) # 4-byte Folded Spill
; RV32I-NEXT: sw s10, 16(sp) # 4-byte Folded Spill
; RV32I-NEXT: sw s11, 12(sp) # 4-byte Folded Spill
; RV32I-NEXT: .cfi_offset ra, -4
Collaborator

nounwind?

Collaborator

@preames preames left a comment

Summarizing some offline conversation with Craig. We happened to chat about this idea yesterday.

Cases where this might kick in:

  • Arguments - will be frame indices during regalloc, and rewritten afterwards, so no concerns for e.g. dynamic alloc offset calculations.
  • Globals, particularly pc-relative. Remating the LD without the whole sequence might break the assumption of the AUIPC offset?

Craig, can you expand on the latter just a bit? As long as we don't move the AUIPC, isn't the low offset fixed? It's the PC-relative offset from the AUIPC to the global, not from the LD to the global, isn't it?

Luke, could you add a test case which does a constant global load before the asm block, and then a use after? We should be able to remat that, but it's the case we were a bit unclear on so having specific testing is good.

store i32 %stack1, ptr %q
tail call void asm sideeffect "", "~{x1},~{x3},~{x4},~{x5},~{x6},~{x7},~{x8},~{x9},~{x10},~{x11},~{x12},~{x13},~{x14},~{x15},~{x16},~{x17},~{x18},~{x19},~{x20},~{x21},~{x22},~{x23},~{x24},~{x25},~{x26},~{x27},~{x28},~{x29},~{x30},~{x31}"()
%a = icmp eq i32 %0, 0
br i1 %a, label %truebb, label %falsebb
Collaborator

I think you can simplify the test. Is the conditional block doing anything in particular here? If you just had the inline assembly followed by an unconditional use of %stack0 and %stack1, I think that still remats? Or even, just return one of %stack0 or %stack1?

Contributor Author

@lukel97 lukel97 Nov 7, 2025

Yup, I was able to simplify it to just a volatile store after the asm without branching. For some reason, the float stack arguments also require a use before the asm in order to force the initial load off the stack.
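
Roughly the shape it ended up as, for the integer case (a sketch only; the function name and exact argument list here are illustrative, not copied from the committed test):

    define void @remat_ld_sketch(i32 %0, i32 %1, i32 %2, i32 %3, i32 %4, i32 %5,
                                 i32 %6, i32 %7, i32 %stack0, ptr %p) {
    entry:
      ; Clobber all allocatable GPRs so %stack0 cannot be kept in a register
      ; across the asm.
      tail call void asm sideeffect "", "~{x1},~{x3},~{x4},~{x5},~{x6},~{x7},~{x8},~{x9},~{x10},~{x11},~{x12},~{x13},~{x14},~{x15},~{x16},~{x17},~{x18},~{x19},~{x20},~{x21},~{x22},~{x23},~{x24},~{x25},~{x26},~{x27},~{x28},~{x29},~{x30},~{x31}"()
      ; The volatile store after the asm is the only use of the stack argument;
      ; with this patch it is fed by a rematerialized load from the incoming
      ; argument slot instead of a reload from a spill slot.
      store volatile i32 %stack0, ptr %p
      ret void
    }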

@topperc
Collaborator

topperc commented Nov 6, 2025

Craig, can you expand on the latter just a bit? As long as we don't move the AUIPC, isn't the low offset fixed? It's the PC-relative offset from the AUIPC to the global, not from the LD to the global, isn't it?

You're right. Remating the load should be fine: it has a reference to a symbol attached to the auipc, and we can remat it since multiple instructions can refer to the same auipc. It's the auipc that we can't rematerialize, since that would need a new symbol.
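
For reference, the sequence in question looks roughly like this (illustrative symbol and register names):

    .Lpcrel_hi0:
            auipc   a0, %pcrel_hi(g)                # high 20 bits of the PC-relative offset
            lw      a1, %pcrel_lo(.Lpcrel_hi0)(a0)  # low 12 bits reference the auipc's label
            ...
            lw      a2, %pcrel_lo(.Lpcrel_hi0)(a0)  # a second load can reuse the same label,
                                                    # so remating the lw is fine as long as
                                                    # a0 is still available; duplicating the
                                                    # auipc would need a new label instead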

@bababuck
Contributor

bababuck commented Nov 7, 2025

Any reason this wouldn't apply to vector loads as well? Though I don't recall ever seeing any cases in the wild that would benefit from this...

@topperc
Collaborator

topperc commented Nov 7, 2025

Any reason this wouldn't apply to vector loads as well? Though I don't recall ever seeing any cases in the wild that would benefit from this...

I don't think the generic infrastructure can identify an invariant vector load from the stack, since the frame index operand won't be part of the load.

Maybe it's possible for other types of invariant loads?
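
Illustratively (made-up offsets): the scalar load carries the frame address directly, while a vector load only takes a base register, so the frame index ends up on a separate address computation rather than on the load itself:

    lw      a0, 16(sp)       # scalar: sp+offset is an operand of the load itself
    addi    a1, sp, 32       # vector: the address is materialized separately...
    vle32.v v8, (a1)         # ...so the load only sees a plain base register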

lukel97 added a commit that referenced this pull request Nov 7, 2025
This is to allow us to test rematting i64 lds in #166774. Checked and
we're still rematting the same lis on rv64 in this test.
@lukel97
Contributor Author

lukel97 commented Nov 7, 2025

Apologies for the force push; I precommitted 630f43a and rebased on top of it so we can test i64 ld remat.

Luke, could you add a test case which does a constant global load before the asm block, and then a use after? We should be able to remat that, but it's the case we were a bit unclear on so having specific testing is good.

I've added a test for a constant global load, but it looks like we can't rematerialize global loads yet because even though the load is invariant, the upper bits of the address won't be live at the point of rematerialization. I think in order to remat this we either need to extend the live range or allow rematerializing multiple instructions.

; CHECK-NEXT: addi sp, sp, 208
; CHECK-NEXT: ret
entry:
; Force loading the stack arguments to create their live interval
Collaborator

Can you explain this comment? This doesn't make sense to me.

vinay-deshmukh pushed a commit to vinay-deshmukh/llvm-project that referenced this pull request Nov 8, 2025
This is to allow us to test rematting i64 lds in llvm#166774. Checked and
we're still rematting the same lis on rv64 in this test.