[RISCV] Enable rematerialization for scalar loads #166774
Conversation
@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes

In some workloads we see an argument passed on the stack where it is loaded, only for it to be immediately spilled to a different slot on the stack and then reloaded from that spill slot later on. We can avoid the unnecessary spill by marking loads as rematerializable and directly loading from where the argument was originally passed on the stack. TargetTransformInfo::isReMaterializableImpl checks that any loads are MI.isDereferenceableInvariantLoad(), so we should be able to move the load down to the remat site.

This gives a 14.8% reduction in spills in 544.nab_r on rva23u64 -O3, and a few other smaller reductions on llvm-test-suite. I didn't find any benchmarks where the number of spills/reloads increased.

Related: #165761

Full diff: https://github.com/llvm/llvm-project/pull/166774.diff

6 Files Affected:
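As a rough illustration (hypothetical registers and stack offsets, modeled on the pr95865.ll diff below), the pattern being eliminated looks like this:

# Before: the incoming stack argument is loaded, immediately spilled to a
# different stack slot to survive register pressure, and reloaded later.
        ld      a1, 112(sp)        # load the stack argument
        sd      a1, 0(sp)          # 8-byte Folded Spill
        ...
        ld      a0, 0(sp)          # 8-byte Folded Reload

# After: the load is rematerialized at the use, so the extra spill slot and
# the spill/reload pair go away.
        ...
        ld      a0, 112(sp)        # re-load the original stack argument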
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfo.td b/llvm/lib/Target/RISCV/RISCVInstrInfo.td
index 9cb53fb27a2d2..84b962b2a8607 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfo.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfo.td
@@ -768,7 +768,7 @@ def BGE : BranchCC_rri<0b101, "bge">;
def BLTU : BranchCC_rri<0b110, "bltu">;
def BGEU : BranchCC_rri<0b111, "bgeu">;
-let IsSignExtendingOpW = 1, canFoldAsLoad = 1 in {
+let IsSignExtendingOpW = 1, canFoldAsLoad = 1, isReMaterializable = 1 in {
def LB : Load_ri<0b000, "lb">, Sched<[WriteLDB, ReadMemBase]>;
def LH : Load_ri<0b001, "lh">, Sched<[WriteLDH, ReadMemBase]>;
def LW : Load_ri<0b010, "lw">, Sched<[WriteLDW, ReadMemBase]>;
@@ -889,7 +889,7 @@ def CSRRCI : CSR_ii<0b111, "csrrci">;
/// RV64I instructions
let Predicates = [IsRV64] in {
-let canFoldAsLoad = 1 in {
+let canFoldAsLoad = 1, isReMaterializable = 1 in {
def LWU : Load_ri<0b110, "lwu">, Sched<[WriteLDW, ReadMemBase]>;
def LD : Load_ri<0b011, "ld">, Sched<[WriteLDD, ReadMemBase]>;
}
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfoD.td b/llvm/lib/Target/RISCV/RISCVInstrInfoD.td
index 4ffe3e62ac501..deacd41e6469a 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfoD.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfoD.td
@@ -71,7 +71,7 @@ defvar DExtsRV64 = [DExt, ZdinxExt];
//===----------------------------------------------------------------------===//
let Predicates = [HasStdExtD] in {
-let canFoldAsLoad = 1 in
+let canFoldAsLoad = 1, isReMaterializable = 1 in
def FLD : FPLoad_r<0b011, "fld", FPR64, WriteFLD64>;
// Operands for stores are in the order srcreg, base, offset rather than
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfoF.td b/llvm/lib/Target/RISCV/RISCVInstrInfoF.td
index b30f8ec820c15..bd191001b75ec 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfoF.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfoF.td
@@ -330,7 +330,7 @@ class PseudoFROUND<DAGOperand Ty, ValueType vt, ValueType intvt = XLenVT>
//===----------------------------------------------------------------------===//
let Predicates = [HasStdExtF] in {
-let canFoldAsLoad = 1 in
+let canFoldAsLoad = 1, isReMaterializable = 1 in
def FLW : FPLoad_r<0b010, "flw", FPR32, WriteFLD32>;
// Operands for stores are in the order srcreg, base, offset rather than
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfoZfh.td b/llvm/lib/Target/RISCV/RISCVInstrInfoZfh.td
index 1c6a5afcda49b..c172d1739ba61 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfoZfh.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfoZfh.td
@@ -90,7 +90,7 @@ defvar ZfhminDExts = [ZfhminDExt, ZhinxminZdinxExt, ZhinxminZdinx32Ext];
//===----------------------------------------------------------------------===//
let Predicates = [HasHalfFPLoadStoreMove] in {
-let canFoldAsLoad = 1 in
+let canFoldAsLoad = 1, isReMaterializable = 1 in
def FLH : FPLoad_r<0b001, "flh", FPR16, WriteFLD16>;
// Operands for stores are in the order srcreg, base, offset rather than
diff --git a/llvm/test/CodeGen/RISCV/remat.ll b/llvm/test/CodeGen/RISCV/remat.ll
index 92ae85f560cd4..88240474f30bf 100644
--- a/llvm/test/CodeGen/RISCV/remat.ll
+++ b/llvm/test/CodeGen/RISCV/remat.ll
@@ -201,3 +201,95 @@ for.end: ; preds = %for.inc, %entry
}
declare i32 @foo(i32, i32, i32, i32, i32, i32)
+
+define i32 @remat_ld(i32 %0, i32 %1, i32 %2, i32 %3, i32 %4, i32 %5, i32 %6, i32 %7, i32 %stack0, i32 %stack1, ptr %p, ptr %q) {
+; RV32I-LABEL: remat_ld:
+; RV32I: # %bb.0: # %entry
+; RV32I-NEXT: addi sp, sp, -64
+; RV32I-NEXT: .cfi_def_cfa_offset 64
+; RV32I-NEXT: sw ra, 60(sp) # 4-byte Folded Spill
+; RV32I-NEXT: sw s0, 56(sp) # 4-byte Folded Spill
+; RV32I-NEXT: sw s1, 52(sp) # 4-byte Folded Spill
+; RV32I-NEXT: sw s2, 48(sp) # 4-byte Folded Spill
+; RV32I-NEXT: sw s3, 44(sp) # 4-byte Folded Spill
+; RV32I-NEXT: sw s4, 40(sp) # 4-byte Folded Spill
+; RV32I-NEXT: sw s5, 36(sp) # 4-byte Folded Spill
+; RV32I-NEXT: sw s6, 32(sp) # 4-byte Folded Spill
+; RV32I-NEXT: sw s7, 28(sp) # 4-byte Folded Spill
+; RV32I-NEXT: sw s8, 24(sp) # 4-byte Folded Spill
+; RV32I-NEXT: sw s9, 20(sp) # 4-byte Folded Spill
+; RV32I-NEXT: sw s10, 16(sp) # 4-byte Folded Spill
+; RV32I-NEXT: sw s11, 12(sp) # 4-byte Folded Spill
+; RV32I-NEXT: .cfi_offset ra, -4
+; RV32I-NEXT: .cfi_offset s0, -8
+; RV32I-NEXT: .cfi_offset s1, -12
+; RV32I-NEXT: .cfi_offset s2, -16
+; RV32I-NEXT: .cfi_offset s3, -20
+; RV32I-NEXT: .cfi_offset s4, -24
+; RV32I-NEXT: .cfi_offset s5, -28
+; RV32I-NEXT: .cfi_offset s6, -32
+; RV32I-NEXT: .cfi_offset s7, -36
+; RV32I-NEXT: .cfi_offset s8, -40
+; RV32I-NEXT: .cfi_offset s9, -44
+; RV32I-NEXT: .cfi_offset s10, -48
+; RV32I-NEXT: .cfi_offset s11, -52
+; RV32I-NEXT: sw a0, 8(sp) # 4-byte Folded Spill
+; RV32I-NEXT: lw a2, 68(sp)
+; RV32I-NEXT: lw a3, 64(sp)
+; RV32I-NEXT: lw a0, 72(sp)
+; RV32I-NEXT: lw a1, 76(sp)
+; RV32I-NEXT: sw a3, 0(a0)
+; RV32I-NEXT: sw a2, 0(a1)
+; RV32I-NEXT: #APP
+; RV32I-NEXT: #NO_APP
+; RV32I-NEXT: lw a0, 8(sp) # 4-byte Folded Reload
+; RV32I-NEXT: beqz a0, .LBB1_2
+; RV32I-NEXT: # %bb.1: # %falsebb
+; RV32I-NEXT: li a0, 0
+; RV32I-NEXT: j .LBB1_3
+; RV32I-NEXT: .LBB1_2: # %truebb
+; RV32I-NEXT: lw a0, 68(sp)
+; RV32I-NEXT: lw a1, 64(sp)
+; RV32I-NEXT: add a0, a1, a0
+; RV32I-NEXT: .LBB1_3: # %falsebb
+; RV32I-NEXT: lw ra, 60(sp) # 4-byte Folded Reload
+; RV32I-NEXT: lw s0, 56(sp) # 4-byte Folded Reload
+; RV32I-NEXT: lw s1, 52(sp) # 4-byte Folded Reload
+; RV32I-NEXT: lw s2, 48(sp) # 4-byte Folded Reload
+; RV32I-NEXT: lw s3, 44(sp) # 4-byte Folded Reload
+; RV32I-NEXT: lw s4, 40(sp) # 4-byte Folded Reload
+; RV32I-NEXT: lw s5, 36(sp) # 4-byte Folded Reload
+; RV32I-NEXT: lw s6, 32(sp) # 4-byte Folded Reload
+; RV32I-NEXT: lw s7, 28(sp) # 4-byte Folded Reload
+; RV32I-NEXT: lw s8, 24(sp) # 4-byte Folded Reload
+; RV32I-NEXT: lw s9, 20(sp) # 4-byte Folded Reload
+; RV32I-NEXT: lw s10, 16(sp) # 4-byte Folded Reload
+; RV32I-NEXT: lw s11, 12(sp) # 4-byte Folded Reload
+; RV32I-NEXT: .cfi_restore ra
+; RV32I-NEXT: .cfi_restore s0
+; RV32I-NEXT: .cfi_restore s1
+; RV32I-NEXT: .cfi_restore s2
+; RV32I-NEXT: .cfi_restore s3
+; RV32I-NEXT: .cfi_restore s4
+; RV32I-NEXT: .cfi_restore s5
+; RV32I-NEXT: .cfi_restore s6
+; RV32I-NEXT: .cfi_restore s7
+; RV32I-NEXT: .cfi_restore s8
+; RV32I-NEXT: .cfi_restore s9
+; RV32I-NEXT: .cfi_restore s10
+; RV32I-NEXT: .cfi_restore s11
+; RV32I-NEXT: addi sp, sp, 64
+; RV32I-NEXT: .cfi_def_cfa_offset 0
+; RV32I-NEXT: ret
+entry:
+ store i32 %stack0, ptr %p
+ store i32 %stack1, ptr %q
+ tail call void asm sideeffect "", "~{x1},~{x3},~{x4},~{x5},~{x6},~{x7},~{x8},~{x9},~{x10},~{x11},~{x12},~{x13},~{x14},~{x15},~{x16},~{x17},~{x18},~{x19},~{x20},~{x21},~{x22},~{x23},~{x24},~{x25},~{x26},~{x27},~{x28},~{x29},~{x30},~{x31}"()
+ %a = icmp eq i32 %0, 0
+ br i1 %a, label %truebb, label %falsebb
+truebb:
+ %b = add i32 %stack0, %stack1
+ ret i32 %b
+falsebb:
+ ret i32 0
+}
diff --git a/llvm/test/CodeGen/RISCV/rvv/pr95865.ll b/llvm/test/CodeGen/RISCV/rvv/pr95865.ll
index ab9849631663c..01d66b344ec2e 100644
--- a/llvm/test/CodeGen/RISCV/rvv/pr95865.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/pr95865.ll
@@ -40,8 +40,6 @@ define i32 @main(i1 %arg.1, i64 %arg.2, i1 %arg.3, i64 %arg.4, i1 %arg.5, <vscal
; CHECK-NEXT: li t0, 12
; CHECK-NEXT: li s0, 4
; CHECK-NEXT: li t1, 20
-; CHECK-NEXT: ld a1, 112(sp)
-; CHECK-NEXT: sd a1, 0(sp) # 8-byte Folded Spill
; CHECK-NEXT: vsetivli zero, 8, e32, m2, ta, ma
; CHECK-NEXT: vmv.v.i v8, 0
; CHECK-NEXT: andi t3, a4, 1
@@ -142,7 +140,7 @@ define i32 @main(i1 %arg.1, i64 %arg.2, i1 %arg.3, i64 %arg.4, i1 %arg.5, <vscal
; CHECK-NEXT: j .LBB0_11
; CHECK-NEXT: .LBB0_12: # %for.body7.us.19
; CHECK-NEXT: vsetvli a0, zero, e32, m8, ta, ma
-; CHECK-NEXT: ld a0, 0(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld a0, 112(sp)
; CHECK-NEXT: vmv.s.x v16, a0
; CHECK-NEXT: vmv.v.i v8, 0
; CHECK-NEXT: vsetivli zero, 2, e32, m1, tu, ma
llvm/test/CodeGen/RISCV/remat.ll
Outdated
; RV32I-NEXT: sw s9, 20(sp) # 4-byte Folded Spill
; RV32I-NEXT: sw s10, 16(sp) # 4-byte Folded Spill
; RV32I-NEXT: sw s11, 12(sp) # 4-byte Folded Spill
; RV32I-NEXT: .cfi_offset ra, -4
nounwind?
preames left a comment:
Summarizing some offline conversation with Craig. We happened to chat about this idea yesterday.
Cases this might kick in:
- Arguments - will be frame indices during regalloc, and rewritten afterwards, so no concerns for e.g. dynamic alloc offset calculations.
- Globals, particularly pc-relative. Remating the LD without the whole sequence might break the assumption of the AUIPC offset?
Craig, can you expand on the latter just a bit? As long as we don't move the AUIPC, isn't the low offset fixed? It's the pc-relative offset from the AUIPC to the global, not from the LD to the global, isn't it?
Luke, could you add a test case which does a constant global load before the asm block, and then a use after? We should be able to remat that, but it's the case we were a bit unclear on so having specific testing is good.
llvm/test/CodeGen/RISCV/remat.ll
Outdated
store i32 %stack1, ptr %q
tail call void asm sideeffect "", "~{x1},~{x3},~{x4},~{x5},~{x6},~{x7},~{x8},~{x9},~{x10},~{x11},~{x12},~{x13},~{x14},~{x15},~{x16},~{x17},~{x18},~{x19},~{x20},~{x21},~{x22},~{x23},~{x24},~{x25},~{x26},~{x27},~{x28},~{x29},~{x30},~{x31}"()
%a = icmp eq i32 %0, 0
br i1 %a, label %truebb, label %falsebb
I think you can simplify the test. Is the conditional block doing anything in particular here? If you just had the inline assembly followed by an unconditional use of %stack0 and %stack1, I think that still remats? Or even, just return one of %stack0 or %stack1?
Yup, was able to simplify it to just a volatile store after the asm without branching. For some reason, the float stack arguments also require a use before the asm in order to force the initial load off the stack.
You're right. Remating the load should be fine. It has a reference to a symbol on the auipc. We can remat that since we can have multiple instructions referring to the same auipc. It's auipc that we can't rematerialize since that would need a new symbol.
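For reference, a sketch of the pc-relative sequence under discussion (hypothetical symbol g and register choices):

.Lpcrel_hi0:
        auipc   a0, %pcrel_hi(g)                 # high bits; the label is attached to the auipc
        ld      a1, %pcrel_lo(.Lpcrel_hi0)(a0)   # low part refers back to the auipc's label
        ...
        ld      a2, %pcrel_lo(.Lpcrel_hi0)(a0)   # a rematerialized load can reuse the same label,
                                                 # provided a0 (the auipc result) is still live
# The auipc itself can't be rematerialized the same way: a second copy would
# need its own .Lpcrel_hi label for its %pcrel_lo users to reference.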
Any reason this wouldn't apply to vector as well? Though I don't recall ever seeing any cases in the wild that would benefit from this...
I don't think the generic infrastructure can identify an invariant vector load from stack since the frame index operand won't be part of the load. Maybe it's possible for other types of invariant loads?
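To illustrate the point (a sketch, not output from this patch): a scalar stack load carries the frame index as one of its own operands, whereas an RVV unit-stride load only sees an address register, with the frame index hidden behind a separate addi:

        lw      a0, 16(sp)      # scalar: the stack slot is an operand of the load itself
        addi    a1, sp, 32      # vector: the address is materialized separately...
        vle32.v v8, (a1)        # ...so the load only carries a plain register operand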
This is to allow us to test rematting i64 lds in #166774. Checked and we're still rematting the same lis on rv64 in this test.
Force-pushed from e7ebe51 to 2b0f29f
Apologies for the force push, I precommitted 630f43a and rebased on top of it so we can test i64 ld remat.

I've added a test for a constant global load, but it looks like we can't rematerialize global loads yet because even though the load is invariant, the upper bits of the address won't be live at the point of rematerialization. I think in order to remat this we either need to extend the live range or allow rematerializing multiple instructions.
; CHECK-NEXT: addi sp, sp, 208
; CHECK-NEXT: ret
entry:
; Force loading the stack arguments to create their live interval
Can you explain this comment? This doesn't make sense to me.