[RISCV] Enable (non trivial) remat for most scalar instructions #162311
Conversation
This is a follow-up to the recent infrastructure work to generally support non-trivial rematerialization. This is the first in a small series of patches to enable non-trivial remat aggressively in the RISC-V backend. It deliberately avoids both vector instructions and loads, as those seem most likely to expose unexpected interactions. Note that this isn't ready to land just yet. We need to collect both compile-time numbers (in progress) and more perf numbers/stats on at least e.g. spec2017/test-suite. I'm posting it mostly as a placeholder since multiple people were talking about this and I want us to avoid duplicating work.
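For readers less familiar with the terminology: trivial rematerialization recomputes a value from constants only, while non-trivial remat also reads registers that must still hold the right value at the remat point. A toy Python sketch of that check (not LLVM code; all names here are illustrative):

```python
# Toy illustration of why remat beats a spill/reload pair when it is legal.
# A "trivially" rematerializable value depends only on constants
# (e.g. li a0, 42); a "non-trivial" one also reads registers that must
# still be live at the remat point (e.g. addi a0, s0, 16).

from dataclasses import dataclass

@dataclass
class Inst:
    opcode: str
    uses: tuple      # register operands read
    is_cheap: bool   # roughly the isAsCheapAsAMove flag

def can_remat(inst, live_at_use):
    """A def can be rematerialized at a use if the instruction is cheap to
    re-execute and every register it reads is still available there."""
    return inst.is_cheap and all(r in live_at_use for r in inst.uses)

li   = Inst("li",   (),      True)   # trivially rematerializable
addi = Inst("addi", ("s0",), True)   # non-trivial: needs s0 live

print(can_remat(li,   live_at_use=set()))    # True
print(can_remat(addi, live_at_use={"s0"}))   # True
print(can_remat(addi, live_at_use={"a1"}))   # False -> spill/reload instead
```

This is only the legality shape; the real profitability decision inside the register allocator is considerably more involved.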
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-backend-risc-v

Author: Philip Reames (preames)

Patch is 419.60 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/162311.diff

6 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfo.td b/llvm/lib/Target/RISCV/RISCVInstrInfo.td
index 9855c47a63392..f1ac3a5b7e9a5 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfo.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfo.td
@@ -780,21 +780,18 @@ def SB : Store_rri<0b000, "sb">, Sched<[WriteSTB, ReadStoreData, ReadMemBase]>;
def SH : Store_rri<0b001, "sh">, Sched<[WriteSTH, ReadStoreData, ReadMemBase]>;
def SW : Store_rri<0b010, "sw">, Sched<[WriteSTW, ReadStoreData, ReadMemBase]>;
-// ADDI isn't always rematerializable, but isReMaterializable will be used as
-// a hint which is verified in isReMaterializableImpl.
-let isReMaterializable = 1, isAsCheapAsAMove = 1 in
+let isReMaterializable = 1, isAsCheapAsAMove = 1 in {
def ADDI : ALU_ri<0b000, "addi">;
+def XORI : ALU_ri<0b100, "xori">;
+def ORI : ALU_ri<0b110, "ori">;
+}
-let IsSignExtendingOpW = 1 in {
+let IsSignExtendingOpW = 1, isReMaterializable = 1 in {
def SLTI : ALU_ri<0b010, "slti">;
def SLTIU : ALU_ri<0b011, "sltiu">;
}
-let isReMaterializable = 1, isAsCheapAsAMove = 1 in {
-def XORI : ALU_ri<0b100, "xori">;
-def ORI : ALU_ri<0b110, "ori">;
-}
-
+let isReMaterializable = 1 in {
def ANDI : ALU_ri<0b111, "andi">;
def SLLI : Shift_ri<0b00000, 0b001, "slli">,
@@ -826,6 +823,7 @@ def OR : ALU_rr<0b0000000, 0b110, "or", Commutable=1>,
Sched<[WriteIALU, ReadIALU, ReadIALU]>;
def AND : ALU_rr<0b0000000, 0b111, "and", Commutable=1>,
Sched<[WriteIALU, ReadIALU, ReadIALU]>;
+}
let hasSideEffects = 1, mayLoad = 0, mayStore = 0 in {
def FENCE : RVInstI<0b000, OPC_MISC_MEM, (outs),
@@ -893,7 +891,7 @@ def LWU : Load_ri<0b110, "lwu">, Sched<[WriteLDW, ReadMemBase]>;
def LD : Load_ri<0b011, "ld">, Sched<[WriteLDD, ReadMemBase]>;
def SD : Store_rri<0b011, "sd">, Sched<[WriteSTD, ReadStoreData, ReadMemBase]>;
-let IsSignExtendingOpW = 1 in {
+let IsSignExtendingOpW = 1, isReMaterializable = 1 in {
let hasSideEffects = 0, mayLoad = 0, mayStore = 0 in
def ADDIW : RVInstI<0b000, OPC_OP_IMM_32, (outs GPR:$rd),
(ins GPR:$rs1, simm12_lo:$imm12),
@@ -917,7 +915,7 @@ def SRLW : ALUW_rr<0b0000000, 0b101, "srlw">,
Sched<[WriteShiftReg32, ReadShiftReg32, ReadShiftReg32]>;
def SRAW : ALUW_rr<0b0100000, 0b101, "sraw">,
Sched<[WriteShiftReg32, ReadShiftReg32, ReadShiftReg32]>;
-} // IsSignExtendingOpW = 1
+} // IsSignExtendingOpW = 1, isReMaterializable = 1
} // Predicates = [IsRV64]
//===----------------------------------------------------------------------===//
diff --git a/llvm/test/CodeGen/RISCV/GlobalISel/wide-scalar-shift-by-byte-multiple-legalization.ll b/llvm/test/CodeGen/RISCV/GlobalISel/wide-scalar-shift-by-byte-multiple-legalization.ll
index ca9f7637388f7..74c31a229dad4 100644
--- a/llvm/test/CodeGen/RISCV/GlobalISel/wide-scalar-shift-by-byte-multiple-legalization.ll
+++ b/llvm/test/CodeGen/RISCV/GlobalISel/wide-scalar-shift-by-byte-multiple-legalization.ll
@@ -3000,9 +3000,9 @@ define void @lshr_32bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
; RV32I-NEXT: sw s9, 20(sp) # 4-byte Folded Spill
; RV32I-NEXT: sw s10, 16(sp) # 4-byte Folded Spill
; RV32I-NEXT: sw s11, 12(sp) # 4-byte Folded Spill
-; RV32I-NEXT: li a4, 0
+; RV32I-NEXT: li a5, 0
; RV32I-NEXT: lbu a3, 0(a0)
-; RV32I-NEXT: lbu a5, 1(a0)
+; RV32I-NEXT: lbu a4, 1(a0)
; RV32I-NEXT: lbu a6, 2(a0)
; RV32I-NEXT: lbu a7, 3(a0)
; RV32I-NEXT: lbu t0, 4(a0)
@@ -3013,736 +3013,750 @@ define void @lshr_32bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
; RV32I-NEXT: lbu t5, 9(a0)
; RV32I-NEXT: lbu t6, 10(a0)
; RV32I-NEXT: lbu s0, 11(a0)
-; RV32I-NEXT: slli a5, a5, 8
+; RV32I-NEXT: slli a4, a4, 8
; RV32I-NEXT: slli a7, a7, 8
; RV32I-NEXT: slli t1, t1, 8
-; RV32I-NEXT: or a3, a5, a3
-; RV32I-NEXT: or a7, a7, a6
-; RV32I-NEXT: or t1, t1, t0
-; RV32I-NEXT: lbu a6, 13(a0)
-; RV32I-NEXT: lbu a5, 14(a0)
-; RV32I-NEXT: lbu s1, 15(a0)
+; RV32I-NEXT: or a3, a4, a3
+; RV32I-NEXT: or a4, a7, a6
+; RV32I-NEXT: or a7, t1, t0
+; RV32I-NEXT: lbu t0, 13(a0)
+; RV32I-NEXT: lbu a6, 14(a0)
+; RV32I-NEXT: lbu t1, 15(a0)
; RV32I-NEXT: slli t3, t3, 8
; RV32I-NEXT: slli t5, t5, 8
; RV32I-NEXT: slli s0, s0, 8
-; RV32I-NEXT: or t3, t3, t2
-; RV32I-NEXT: or t0, t5, t4
-; RV32I-NEXT: or t5, s0, t6
-; RV32I-NEXT: lbu t2, 1(a1)
-; RV32I-NEXT: lbu t4, 0(a1)
+; RV32I-NEXT: or s1, t3, t2
+; RV32I-NEXT: or t2, t5, t4
+; RV32I-NEXT: or t4, s0, t6
+; RV32I-NEXT: lbu t3, 1(a1)
+; RV32I-NEXT: lbu t5, 0(a1)
; RV32I-NEXT: lbu t6, 2(a1)
; RV32I-NEXT: lbu a1, 3(a1)
-; RV32I-NEXT: slli t2, t2, 8
-; RV32I-NEXT: or s0, t2, t4
-; RV32I-NEXT: slli t2, s1, 8
+; RV32I-NEXT: slli t3, t3, 8
+; RV32I-NEXT: or t5, t3, t5
+; RV32I-NEXT: slli t3, t1, 8
; RV32I-NEXT: slli a1, a1, 8
; RV32I-NEXT: or a1, a1, t6
-; RV32I-NEXT: slli t4, a7, 16
-; RV32I-NEXT: slli a7, t3, 16
-; RV32I-NEXT: slli t3, t5, 16
-; RV32I-NEXT: slli t5, a1, 16
-; RV32I-NEXT: or a1, a7, t1
-; RV32I-NEXT: or a7, t5, s0
+; RV32I-NEXT: slli a4, a4, 16
+; RV32I-NEXT: slli s1, s1, 16
+; RV32I-NEXT: slli t4, t4, 16
+; RV32I-NEXT: slli t1, a1, 16
+; RV32I-NEXT: or s5, s1, a7
+; RV32I-NEXT: or a7, t1, t5
; RV32I-NEXT: slli a7, a7, 3
; RV32I-NEXT: srli t1, a7, 5
; RV32I-NEXT: andi t5, a7, 31
; RV32I-NEXT: neg s3, t5
; RV32I-NEXT: beqz t5, .LBB12_2
; RV32I-NEXT: # %bb.1:
-; RV32I-NEXT: sll a4, a1, s3
+; RV32I-NEXT: sll a5, s5, s3
; RV32I-NEXT: .LBB12_2:
-; RV32I-NEXT: or s7, t4, a3
-; RV32I-NEXT: lbu t4, 12(a0)
-; RV32I-NEXT: lbu t6, 19(a0)
-; RV32I-NEXT: slli s1, a6, 8
-; RV32I-NEXT: or a5, t2, a5
-; RV32I-NEXT: or a3, t3, t0
+; RV32I-NEXT: or a4, a4, a3
+; RV32I-NEXT: lbu t6, 12(a0)
+; RV32I-NEXT: lbu s0, 19(a0)
+; RV32I-NEXT: slli s1, t0, 8
+; RV32I-NEXT: or t0, t3, a6
+; RV32I-NEXT: or a1, t4, t2
; RV32I-NEXT: beqz t1, .LBB12_4
; RV32I-NEXT: # %bb.3:
-; RV32I-NEXT: li s0, 0
+; RV32I-NEXT: mv s11, a4
+; RV32I-NEXT: li a4, 0
; RV32I-NEXT: j .LBB12_5
; RV32I-NEXT: .LBB12_4:
-; RV32I-NEXT: srl s0, s7, a7
-; RV32I-NEXT: or s0, s0, a4
+; RV32I-NEXT: mv s11, a4
+; RV32I-NEXT: srl a6, a4, a7
+; RV32I-NEXT: or a4, a6, a5
; RV32I-NEXT: .LBB12_5:
; RV32I-NEXT: li a6, 0
-; RV32I-NEXT: lbu t0, 17(a0)
-; RV32I-NEXT: lbu a4, 18(a0)
-; RV32I-NEXT: slli s4, t6, 8
-; RV32I-NEXT: or s2, s1, t4
-; RV32I-NEXT: slli a5, a5, 16
-; RV32I-NEXT: li s5, 1
-; RV32I-NEXT: sll t6, a3, s3
+; RV32I-NEXT: lbu s2, 17(a0)
+; RV32I-NEXT: lbu a5, 18(a0)
+; RV32I-NEXT: slli s4, s0, 8
+; RV32I-NEXT: or s1, s1, t6
+; RV32I-NEXT: slli t0, t0, 16
+; RV32I-NEXT: li t3, 1
+; RV32I-NEXT: sll s6, a1, s3
; RV32I-NEXT: beqz t5, .LBB12_7
; RV32I-NEXT: # %bb.6:
-; RV32I-NEXT: mv a6, t6
+; RV32I-NEXT: mv a6, s6
; RV32I-NEXT: .LBB12_7:
; RV32I-NEXT: lbu t2, 16(a0)
-; RV32I-NEXT: lbu t3, 23(a0)
-; RV32I-NEXT: slli s1, t0, 8
-; RV32I-NEXT: or t4, s4, a4
-; RV32I-NEXT: srl a4, a1, a7
-; RV32I-NEXT: or a5, a5, s2
-; RV32I-NEXT: bne t1, s5, .LBB12_9
+; RV32I-NEXT: lbu t4, 23(a0)
+; RV32I-NEXT: slli s0, s2, 8
+; RV32I-NEXT: or t6, s4, a5
+; RV32I-NEXT: srl a3, s5, a7
+; RV32I-NEXT: or a5, t0, s1
+; RV32I-NEXT: sw a3, 0(sp) # 4-byte Folded Spill
+; RV32I-NEXT: bne t1, t3, .LBB12_9
; RV32I-NEXT: # %bb.8:
-; RV32I-NEXT: or s0, a4, a6
+; RV32I-NEXT: or a4, a3, a6
; RV32I-NEXT: .LBB12_9:
; RV32I-NEXT: li t0, 0
-; RV32I-NEXT: lbu s5, 21(a0)
+; RV32I-NEXT: lbu s2, 21(a0)
; RV32I-NEXT: lbu a6, 22(a0)
-; RV32I-NEXT: slli s4, t3, 8
-; RV32I-NEXT: or t2, s1, t2
-; RV32I-NEXT: slli s6, t4, 16
-; RV32I-NEXT: li s8, 2
-; RV32I-NEXT: sll t3, a5, s3
+; RV32I-NEXT: slli s1, t4, 8
+; RV32I-NEXT: or t2, s0, t2
+; RV32I-NEXT: slli s4, t6, 16
+; RV32I-NEXT: li a3, 2
+; RV32I-NEXT: sll s8, a5, s3
; RV32I-NEXT: beqz t5, .LBB12_11
; RV32I-NEXT: # %bb.10:
-; RV32I-NEXT: mv t0, t3
+; RV32I-NEXT: mv t0, s8
; RV32I-NEXT: .LBB12_11:
-; RV32I-NEXT: lbu s1, 20(a0)
-; RV32I-NEXT: lbu s2, 27(a0)
-; RV32I-NEXT: slli s5, s5, 8
-; RV32I-NEXT: or s4, s4, a6
-; RV32I-NEXT: srl t4, a3, a7
-; RV32I-NEXT: or a6, s6, t2
-; RV32I-NEXT: bne t1, s8, .LBB12_13
+; RV32I-NEXT: lbu t6, 20(a0)
+; RV32I-NEXT: lbu s0, 27(a0)
+; RV32I-NEXT: slli s2, s2, 8
+; RV32I-NEXT: or s1, s1, a6
+; RV32I-NEXT: srl t3, a1, a7
+; RV32I-NEXT: or a6, s4, t2
+; RV32I-NEXT: sw s5, 8(sp) # 4-byte Folded Spill
+; RV32I-NEXT: bne t1, a3, .LBB12_13
; RV32I-NEXT: # %bb.12:
-; RV32I-NEXT: or s0, t4, t0
+; RV32I-NEXT: or a4, t3, t0
; RV32I-NEXT: .LBB12_13:
-; RV32I-NEXT: sw s7, 4(sp) # 4-byte Folded Spill
; RV32I-NEXT: li t2, 0
-; RV32I-NEXT: lbu s6, 25(a0)
+; RV32I-NEXT: lbu s4, 25(a0)
; RV32I-NEXT: lbu t0, 26(a0)
-; RV32I-NEXT: slli s8, s2, 8
-; RV32I-NEXT: or s7, s5, s1
-; RV32I-NEXT: slli s9, s4, 16
-; RV32I-NEXT: sll s11, a6, s3
+; RV32I-NEXT: slli s7, s0, 8
+; RV32I-NEXT: or s5, s2, t6
+; RV32I-NEXT: slli s9, s1, 16
+; RV32I-NEXT: li t6, 3
+; RV32I-NEXT: sll t4, a6, s3
; RV32I-NEXT: beqz t5, .LBB12_15
; RV32I-NEXT: # %bb.14:
-; RV32I-NEXT: mv t2, s11
+; RV32I-NEXT: mv t2, t4
; RV32I-NEXT: .LBB12_15:
-; RV32I-NEXT: lbu s1, 24(a0)
-; RV32I-NEXT: lbu s2, 31(a0)
-; RV32I-NEXT: slli s5, s6, 8
-; RV32I-NEXT: or s4, s8, t0
-; RV32I-NEXT: srl ra, a5, a7
-; RV32I-NEXT: or t0, s9, s7
-; RV32I-NEXT: li s6, 3
-; RV32I-NEXT: bne t1, s6, .LBB12_17
+; RV32I-NEXT: lbu s0, 24(a0)
+; RV32I-NEXT: lbu s1, 31(a0)
+; RV32I-NEXT: slli s4, s4, 8
+; RV32I-NEXT: or s2, s7, t0
+; RV32I-NEXT: srl a3, a5, a7
+; RV32I-NEXT: or t0, s9, s5
+; RV32I-NEXT: li s9, 3
+; RV32I-NEXT: bne t1, t6, .LBB12_17
; RV32I-NEXT: # %bb.16:
-; RV32I-NEXT: or s0, ra, t2
+; RV32I-NEXT: or a4, a3, t2
; RV32I-NEXT: .LBB12_17:
+; RV32I-NEXT: mv t6, t3
; RV32I-NEXT: li t2, 0
; RV32I-NEXT: lbu s7, 29(a0)
-; RV32I-NEXT: lbu s6, 30(a0)
-; RV32I-NEXT: slli s8, s2, 8
-; RV32I-NEXT: or s2, s5, s1
-; RV32I-NEXT: slli s5, s4, 16
-; RV32I-NEXT: li s9, 4
-; RV32I-NEXT: sll s1, t0, s3
-; RV32I-NEXT: sw s1, 8(sp) # 4-byte Folded Spill
+; RV32I-NEXT: lbu s5, 30(a0)
+; RV32I-NEXT: slli s1, s1, 8
+; RV32I-NEXT: or s10, s4, s0
+; RV32I-NEXT: slli s2, s2, 16
+; RV32I-NEXT: li a3, 4
+; RV32I-NEXT: sll s0, t0, s3
; RV32I-NEXT: beqz t5, .LBB12_19
; RV32I-NEXT: # %bb.18:
-; RV32I-NEXT: lw t2, 8(sp) # 4-byte Folded Reload
+; RV32I-NEXT: mv t2, s0
; RV32I-NEXT: .LBB12_19:
-; RV32I-NEXT: lbu s1, 28(a0)
+; RV32I-NEXT: lbu t3, 28(a0)
; RV32I-NEXT: slli s7, s7, 8
-; RV32I-NEXT: or s4, s8, s6
-; RV32I-NEXT: srl s10, a6, a7
-; RV32I-NEXT: or a0, s5, s2
-; RV32I-NEXT: bne t1, s9, .LBB12_21
+; RV32I-NEXT: or s4, s1, s5
+; RV32I-NEXT: srl s1, a6, a7
+; RV32I-NEXT: or a0, s2, s10
+; RV32I-NEXT: beq t1, a3, .LBB12_21
; RV32I-NEXT: # %bb.20:
-; RV32I-NEXT: or s0, s10, t2
+; RV32I-NEXT: mv a3, s1
+; RV32I-NEXT: j .LBB12_22
; RV32I-NEXT: .LBB12_21:
+; RV32I-NEXT: mv a3, s1
+; RV32I-NEXT: or a4, s1, t2
+; RV32I-NEXT: .LBB12_22:
+; RV32I-NEXT: li s10, 1
; RV32I-NEXT: li s2, 0
-; RV32I-NEXT: or t2, s7, s1
+; RV32I-NEXT: or t2, s7, t3
; RV32I-NEXT: slli s4, s4, 16
-; RV32I-NEXT: li s9, 5
+; RV32I-NEXT: li s1, 5
; RV32I-NEXT: sll s7, a0, s3
-; RV32I-NEXT: beqz t5, .LBB12_23
-; RV32I-NEXT: # %bb.22:
+; RV32I-NEXT: beqz t5, .LBB12_24
+; RV32I-NEXT: # %bb.23:
; RV32I-NEXT: mv s2, s7
-; RV32I-NEXT: .LBB12_23:
-; RV32I-NEXT: srl s8, t0, a7
+; RV32I-NEXT: .LBB12_24:
+; RV32I-NEXT: sw a1, 4(sp) # 4-byte Folded Spill
+; RV32I-NEXT: srl t3, t0, a7
; RV32I-NEXT: or t2, s4, t2
-; RV32I-NEXT: bne t1, s9, .LBB12_25
-; RV32I-NEXT: # %bb.24:
-; RV32I-NEXT: or s0, s8, s2
-; RV32I-NEXT: .LBB12_25:
-; RV32I-NEXT: li s4, 0
+; RV32I-NEXT: beq t1, s1, .LBB12_26
+; RV32I-NEXT: # %bb.25:
+; RV32I-NEXT: mv a1, t3
+; RV32I-NEXT: j .LBB12_27
+; RV32I-NEXT: .LBB12_26:
+; RV32I-NEXT: mv a1, t3
+; RV32I-NEXT: or a4, t3, s2
+; RV32I-NEXT: .LBB12_27:
+; RV32I-NEXT: li t3, 0
; RV32I-NEXT: li s2, 6
; RV32I-NEXT: sll s5, t2, s3
-; RV32I-NEXT: beqz t5, .LBB12_27
-; RV32I-NEXT: # %bb.26:
-; RV32I-NEXT: mv s4, s5
-; RV32I-NEXT: .LBB12_27:
-; RV32I-NEXT: srl s6, a0, a7
-; RV32I-NEXT: bne t1, s2, .LBB12_29
+; RV32I-NEXT: beqz t5, .LBB12_29
; RV32I-NEXT: # %bb.28:
-; RV32I-NEXT: or s0, s6, s4
+; RV32I-NEXT: mv t3, s5
; RV32I-NEXT: .LBB12_29:
-; RV32I-NEXT: li s3, 7
-; RV32I-NEXT: srl s1, t2, a7
-; RV32I-NEXT: mv s4, s1
-; RV32I-NEXT: bne t1, s3, .LBB12_34
+; RV32I-NEXT: srl s3, a0, a7
+; RV32I-NEXT: beq t1, s2, .LBB12_31
; RV32I-NEXT: # %bb.30:
-; RV32I-NEXT: bnez a7, .LBB12_35
+; RV32I-NEXT: mv ra, s3
+; RV32I-NEXT: j .LBB12_32
; RV32I-NEXT: .LBB12_31:
-; RV32I-NEXT: li s0, 0
-; RV32I-NEXT: bnez t5, .LBB12_36
+; RV32I-NEXT: mv ra, s3
+; RV32I-NEXT: or a4, s3, t3
; RV32I-NEXT: .LBB12_32:
-; RV32I-NEXT: li s4, 2
-; RV32I-NEXT: beqz t1, .LBB12_37
-; RV32I-NEXT: .LBB12_33:
-; RV32I-NEXT: li a4, 0
-; RV32I-NEXT: j .LBB12_38
+; RV32I-NEXT: li s3, 7
+; RV32I-NEXT: srl s4, t2, a7
+; RV32I-NEXT: mv t3, s4
+; RV32I-NEXT: beq t1, s3, .LBB12_34
+; RV32I-NEXT: # %bb.33:
+; RV32I-NEXT: mv t3, a4
; RV32I-NEXT: .LBB12_34:
-; RV32I-NEXT: mv s4, s0
-; RV32I-NEXT: beqz a7, .LBB12_31
-; RV32I-NEXT: .LBB12_35:
-; RV32I-NEXT: sw s4, 4(sp) # 4-byte Folded Spill
-; RV32I-NEXT: li s0, 0
-; RV32I-NEXT: beqz t5, .LBB12_32
+; RV32I-NEXT: mv a4, s11
+; RV32I-NEXT: beqz a7, .LBB12_36
+; RV32I-NEXT: # %bb.35:
+; RV32I-NEXT: mv a4, t3
; RV32I-NEXT: .LBB12_36:
-; RV32I-NEXT: mv s0, t6
-; RV32I-NEXT: li s4, 2
-; RV32I-NEXT: bnez t1, .LBB12_33
-; RV32I-NEXT: .LBB12_37:
-; RV32I-NEXT: or a4, a4, s0
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: li s11, 2
+; RV32I-NEXT: beqz t5, .LBB12_38
+; RV32I-NEXT: # %bb.37:
+; RV32I-NEXT: mv t3, s6
; RV32I-NEXT: .LBB12_38:
-; RV32I-NEXT: li s0, 1
-; RV32I-NEXT: li t6, 0
-; RV32I-NEXT: bnez t5, .LBB12_57
+; RV32I-NEXT: beqz t1, .LBB12_40
; RV32I-NEXT: # %bb.39:
-; RV32I-NEXT: beq t1, s0, .LBB12_58
+; RV32I-NEXT: li s6, 0
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: bnez t5, .LBB12_41
+; RV32I-NEXT: j .LBB12_42
; RV32I-NEXT: .LBB12_40:
-; RV32I-NEXT: li t6, 0
-; RV32I-NEXT: bnez t5, .LBB12_59
+; RV32I-NEXT: lw s6, 0(sp) # 4-byte Folded Reload
+; RV32I-NEXT: or s6, s6, t3
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: beqz t5, .LBB12_42
; RV32I-NEXT: .LBB12_41:
-; RV32I-NEXT: beq t1, s4, .LBB12_60
+; RV32I-NEXT: mv t3, s8
; RV32I-NEXT: .LBB12_42:
-; RV32I-NEXT: li t6, 0
-; RV32I-NEXT: bnez t5, .LBB12_61
-; RV32I-NEXT: .LBB12_43:
-; RV32I-NEXT: li s4, 3
-; RV32I-NEXT: bne t1, s4, .LBB12_45
+; RV32I-NEXT: beq t1, s10, .LBB12_58
+; RV32I-NEXT: # %bb.43:
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: bnez t5, .LBB12_59
; RV32I-NEXT: .LBB12_44:
-; RV32I-NEXT: or a4, s10, t6
+; RV32I-NEXT: beq t1, s11, .LBB12_60
; RV32I-NEXT: .LBB12_45:
-; RV32I-NEXT: li t6, 0
-; RV32I-NEXT: li s4, 4
-; RV32I-NEXT: bnez t5, .LBB12_62
-; RV32I-NEXT: # %bb.46:
-; RV32I-NEXT: beq t1, s4, .LBB12_63
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: bnez t5, .LBB12_61
+; RV32I-NEXT: .LBB12_46:
+; RV32I-NEXT: bne t1, s9, .LBB12_48
; RV32I-NEXT: .LBB12_47:
-; RV32I-NEXT: li t6, 0
-; RV32I-NEXT: bnez t5, .LBB12_64
+; RV32I-NEXT: or s6, a3, t3
; RV32I-NEXT: .LBB12_48:
-; RV32I-NEXT: beq t1, s9, .LBB12_65
-; RV32I-NEXT: .LBB12_49:
-; RV32I-NEXT: mv t6, s1
-; RV32I-NEXT: bne t1, s2, .LBB12_66
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: li s9, 4
+; RV32I-NEXT: bnez t5, .LBB12_62
+; RV32I-NEXT: # %bb.49:
+; RV32I-NEXT: beq t1, s9, .LBB12_63
; RV32I-NEXT: .LBB12_50:
-; RV32I-NEXT: li a4, 0
-; RV32I-NEXT: bne t1, s3, .LBB12_67
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: bnez t5, .LBB12_64
; RV32I-NEXT: .LBB12_51:
-; RV32I-NEXT: beqz a7, .LBB12_53
+; RV32I-NEXT: beq t1, s1, .LBB12_65
; RV32I-NEXT: .LBB12_52:
-; RV32I-NEXT: mv a1, a4
+; RV32I-NEXT: mv t3, s4
+; RV32I-NEXT: bne t1, s2, .LBB12_66
; RV32I-NEXT: .LBB12_53:
-; RV32I-NEXT: li a4, 0
-; RV32I-NEXT: li t6, 2
-; RV32I-NEXT: beqz t5, .LBB12_55
-; RV32I-NEXT: # %bb.54:
-; RV32I-NEXT: mv a4, t3
+; RV32I-NEXT: li s6, 0
+; RV32I-NEXT: bne t1, s3, .LBB12_67
+; RV32I-NEXT: .LBB12_54:
+; RV32I-NEXT: bnez a7, .LBB12_68
; RV32I-NEXT: .LBB12_55:
-; RV32I-NEXT: beqz t1, .LBB12_68
-; RV32I-NEXT: # %bb.56:
-; RV32I-NEXT: li a4, 0
-; RV32I-NEXT: j .LBB12_69
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: bnez t5, .LBB12_69
+; RV32I-NEXT: .LBB12_56:
+; RV32I-NEXT: beqz t1, .LBB12_70
; RV32I-NEXT: .LBB12_57:
-; RV32I-NEXT: mv t6, t3
-; RV32I-NEXT: bne t1, s0, .LBB12_40
+; RV32I-NEXT: li s6, 0
+; RV32I-NEXT: j .LBB12_71
; RV32I-NEXT: .LBB12_58:
-; RV32I-NEXT: or a4, t4, t6
-; RV32I-NEXT: li t6, 0
-; RV32I-NEXT: beqz t5, .LBB12_41
+; RV32I-NEXT: or s6, t6, t3
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: beqz t5, .LBB12_44
; RV32I-NEXT: .LBB12_59:
-; RV32I-NEXT: mv t6, s11
-; RV32I-NEXT: bne t1, s4, .LBB12_42
+; RV32I-NEXT: mv t3, t4
+; RV32I-NEXT: bne t1, s11, .LBB12_45
; RV32I-NEXT: .LBB12_60:
-; RV32I-NEXT: or a4, ra, t6
-; RV32I-NEXT: li t6, 0
-; RV32I-NEXT: beqz t5, .LBB12_43
+; RV32I-NEXT: srl s6, a5, a7
+; RV32I-NEXT: or s6, s6, t3
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: beqz t5, .LBB12_46
; RV32I-NEXT: .LBB12_61:
-; RV32I-NEXT: lw t6, 8(sp) # 4-byte Folded Reload
-; RV32I-NEXT: li s4, 3
-; RV32I-NEXT: beq t1, s4, .LBB12_44
-; RV32I-NEXT: j .LBB12_45
+; RV32I-NEXT: mv t3, s0
+; RV32I-NEXT: beq t1, s9, .LBB12_47
+; RV32I-NEXT: j .LBB12_48
; RV32I-NEXT: .LBB12_62:
-; RV32I-NEXT: mv t6, s7
-; RV32I-NEXT: bne t1, s4, .LBB12_47
+; RV32I-NEXT: mv t3, s7
+; RV32I-NEXT: bne t1, s9, .LBB12_50
; RV32I-NEXT: .LBB12_63:
-; RV32I-NEXT: or a4, s8, t6
-; RV32I-NEXT: li t6, 0
-; RV32I-NEXT: beqz t5, .LBB12_48
+; RV32I-NEXT: or s6, a1, t3
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: beqz t5, .LBB12_51
; RV32I-NEXT: .LBB12_64:
-; RV32I-NEXT: mv t6, s5
-; RV32I-NEXT: bne t1, s9, .LBB12_49
+; RV32I-NEXT: mv t3, s5
+; RV32I-NEXT: bne t1, s1, .LBB12_52
; RV32I-NEXT: .LBB12_65:
-; RV32I-NEXT: or a4, s6, t6
-; RV32I-NEXT: mv t6, s1
-; RV32I-NEXT: beq t1, s2, .LBB12_50
+; RV32I-NEXT: or s6, ra, t3
+; RV32I-NEXT: mv t3, s4
+; RV32I-NEXT: beq t1, s2, .LBB12_53
; RV32I-NEXT: .LBB12_66:
-; RV32I-NEXT: mv t6, a4
-; RV32I-NEXT: li a4, 0
-; RV32I-NEXT: beq t1, s3, .LBB12_51
+; RV32I-NEXT: mv t3, s6
+; RV32I-NEXT: li s6, 0
+; RV32I-NEXT: beq t1, s3, .LBB12_54
; RV32I-NEXT: .LBB12_67:
-; RV32I-NEXT: mv a4, t6
-; RV32I-NEXT: bnez a7, .LBB12_52
-; RV32I-NEXT: j .LBB12_53
+; RV32I-NEXT: mv s6, t3
+; RV32I-NEXT: beqz a7, .LBB12_55
; RV32I-NEXT: .LBB12_68:
-; RV32I-NEXT: or a4, t4, a4
-; RV32I-NEXT: .LBB12_69:
-; RV32I-NEXT: li t4, 3
+; RV32I-NEXT: sw s6, 8(sp) # 4-byte Folded Spill
; RV32I-NEXT: li t3, 0
-; RV32I-NEXT: bnez t5, .LBB12_84
-; RV32I-NEXT: # %bb.70:
-; RV32I-NEXT: beq t1, s0, .LBB12_85
+; RV32I-NEXT: beqz t5, .LBB12_56
+; RV32I-NEXT: .LBB12_69:
+; RV32I-NEXT: mv t3, s8
+; RV32I-NEXT: bnez t1, .LBB12_57
+; RV32I-NEXT: .LBB12_70:
+; RV32I-NEXT: or s6, t6, t3
; RV32I-NEXT: .LBB12_71:
+; RV32I-NEXT: li t6, 3
; RV32I-NEXT: li t3, 0
; RV32I-NEXT: bnez t5, .LBB12_86
-; RV32I-NEXT: .LBB12_72:
-; RV32I-NEXT: beq t1, t6, .LBB12_87
+; RV32I-NEXT: # %bb.72:
+; RV32I-NEXT: beq t1, s10, .LBB12_87
; RV32I-NEXT: .LBB12_73:
; RV32I-NEXT: li t3, 0
; RV32I-NEXT: bnez t5, .LBB12_88
; RV32I-NEXT: .LBB12_74:
-; RV32I-NEXT: beq t1, t4, .LBB12_89
+; RV32I-NEXT: beq t1, s11, .LBB12_89
; RV32I-NEXT: .LBB12_75:
; RV32I-NEXT: li t3, 0
; RV32I-NEXT: bnez t5, .LBB12_90
; RV32I-NEXT: .LBB12_76:
-; RV32I...
[truncated]
; RV32I-NEXT: .LBB12_206:
; RV32I-NEXT: mv t3, t4
; RV32I-NEXT: bnez a7, .LBB12_189
; RV32I-NEXT: j .LBB12_190
This code got quite a bit longer. Is it better?
; RV32I-NEXT: .LBB13_206:
; RV32I-NEXT: mv t3, t4
; RV32I-NEXT: bnez a7, .LBB13_189
; RV32I-NEXT: j .LBB13_190
Longer
Here are dyn instcount diffs for an rva22 build of SPEC 2017:
Looking at the static assembly diff, it is large due to lots of very tiny regalloc changes. The obvious outlier is mcf, which I'll need to report back on after having a closer look.
Some quick static results of this on llvm-test-suite, -march=rva23u64 -O3:
And SPEC CPU 2017:
Overall seems to be an improvement but I'm definitely surprised to see that some cases have an increase in number of reloads. The results in 505.mcf_r match @asb's dynamic results. Would be good to get to the bottom of that.
I spent some time having a closer look. There's a very specific hot block in spec_qsort that gets a move and a negate that seems to account for a good chunk of dynamic instcount diff: New:
vs old:
I'll get a minimal reproducer so we can decide whether to put this down to bad luck or something we can address in the context of this patch.
Here's a reduced test case that gives equivalent basic blocks to the above for baseline vs the change introduced in this patch:

; ModuleID = '<stdin>'
source_filename = "<stdin>"
target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128"
target triple = "riscv64-unknown-linux-gnu"
define i64 @ham(ptr %arg, i64 %arg1, i64 %arg2, ptr %arg3, i1 %arg4, ptr %arg5, ptr %arg6, i1 %arg7, i1 %arg8, i64 %arg9) #0 {
bb:
%sub = sub i64 0, %arg2
br i1 %arg4, label %bb10, label %bb12
bb10: ; preds = %bb10, %bb
%phi = phi ptr [ %getelementptr11, %bb10 ], [ %arg, %bb ]
%getelementptr = getelementptr i8, ptr %phi, i64 %arg9
%call = tail call i32 %arg3(ptr %arg, ptr null)
%getelementptr11 = getelementptr i8, ptr %phi, i64 %arg2
br label %bb10
bb12: ; preds = %bb35, %bb29, %bb
%phi13 = phi ptr [ null, %bb ], [ %arg3, %bb29 ], [ null, %bb35 ]
%call14 = tail call i32 %arg3(ptr %arg6, ptr null)
br label %bb15
bb15: ; preds = %bb26, %bb12
%phi16 = phi ptr [ %getelementptr28, %bb26 ], [ %phi13, %bb12 ]
%phi17 = phi ptr [ %phi27, %bb26 ], [ null, %bb12 ]
%call18 = tail call i32 %arg3(ptr %arg3, ptr %arg)
br i1 %arg4, label %bb19, label %bb29
bb19: ; preds = %bb15
br i1 %arg8, label %bb20, label %bb26
bb20: ; preds = %bb20, %bb19
%phi21 = phi ptr [ %getelementptr23, %bb20 ], [ %phi16, %bb19 ]
%phi22 = phi i64 [ %add, %bb20 ], [ %arg1, %bb19 ]
%load = load i64, ptr %phi21, align 8
%getelementptr23 = getelementptr nusw i8, ptr %phi21, i64 8
%add = add i64 %phi22, 1
%icmp = icmp ugt i64 %phi22, 0
br i1 %icmp, label %bb20, label %bb24
bb24: ; preds = %bb20
%getelementptr25 = getelementptr i8, ptr %phi17, i64 %sub
br label %bb26
bb26: ; preds = %bb24, %bb19
%phi27 = phi ptr [ %getelementptr25, %bb24 ], [ %phi17, %bb19 ]
%getelementptr28 = getelementptr i8, ptr %phi16, i64 %sub
br i1 %arg4, label %bb35, label %bb15
bb29: ; preds = %bb29, %bb15
%phi30 = phi i64 [ %add33, %bb29 ], [ 1, %bb15 ]
%phi31 = phi ptr [ %getelementptr32, %bb29 ], [ %phi16, %bb15 ]
%getelementptr32 = getelementptr nusw i8, ptr %phi31, i64 4
store i32 0, ptr %phi31, align 4
%add33 = add i64 %phi30, 1
%icmp34 = icmp ugt i64 %phi30, 0
br i1 %icmp34, label %bb29, label %bb12
bb35: ; preds = %bb26
br i1 %arg7, label %bb36, label %bb12
bb36: ; preds = %bb35
br i1 %arg4, label %bb37, label %bb38
bb37: ; preds = %bb36
ret i64 0
bb38: ; preds = %bb36
br i1 %arg4, label %bb39, label %bb40
bb39: ; preds = %bb38
ret i64 0
bb40: ; preds = %bb38
br i1 %arg4, label %bb41, label %bb42
bb41: ; preds = %bb41, %bb40
br label %bb41
bb42: ; preds = %bb50, %bb40
br label %bb43
bb43: ; preds = %bb49, %bb42
%phi44 = phi ptr [ %getelementptr45, %bb49 ], [ null, %bb42 ]
%getelementptr45 = getelementptr i8, ptr %phi44, i64 %sub
%call46 = tail call i32 %arg3(ptr null, ptr null)
br label %bb47
bb47: ; preds = %bb47, %bb43
%load48 = load i8, ptr %phi44, align 1
store i8 %load48, ptr null, align 1
br i1 %arg7, label %bb47, label %bb49
bb49: ; preds = %bb47
br i1 %arg4, label %bb43, label %bb50
bb50: ; preds = %bb49
%getelementptr51 = getelementptr i8, ptr null, i64 %arg2
%icmp52 = icmp ult ptr %getelementptr51, %arg5
br i1 %icmp52, label %bb42, label %bb53
bb53: ; preds = %bb50
ret i64 0
}
attributes #0 = { "target-features"="+c,+m" }

You can see the divergence in selectOrSplit during regalloc.
Haven't gotten to looking at the regalloc bit yet, but this example illustrates a wildly unprofitable loop-term-fold case. We've got a multiply being inserted into a loop, and redundant instructions in the pre-header block, just to remove one scalar increment in the loop. I'm going to switch to looking at the regalloc bit, but we should remember to come back to this aspect as it's likely a generic problem.
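The unprofitable shape being described can be sketched abstractly (this is a hand-written illustration, not the actual generated code or the spec_qsort source):

```python
# Two equivalent ways to produce the address sequence a pointer-increment
# loop walks through. The term-folded form replaces one cheap add per
# iteration with a multiply per iteration (plus extra preheader setup),
# which is a net loss when the increment was already a single add.

def addrs_increment(base, step, n):
    # Original loop shape: p += step, one add per iteration.
    p, out = base, []
    for _ in range(n):
        out.append(p)
        p += step
    return out

def addrs_term_folded(base, step, n):
    # Term-folded shape: recompute base + i*step each iteration.
    return [base + i * step for i in range(n)]

assert addrs_increment(0x1000, 8, 4) == addrs_term_folded(0x1000, 8, 4)
print(addrs_increment(0x1000, 8, 4))  # [4096, 4104, 4112, 4120]
```

Both compute the same addresses; the complaint above is purely about the per-iteration cost of the second form.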
Before I get to the actual regalloc, one more missed issue. It looks like MachineCSE is failing to CSE two identical negates. I haven't looked into why, but my tentative guess is that the first operand is a COPY of x0, not x0 itself. On the regalloc side, there's a couple of things going on here.
(3) is going to require some more discussion heuristic-wise. We showed originally that the revised heuristic was profitable, so to tackle this case we'll need to refine it while not losing the profitability on other cases.
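The tentative COPY-of-x0 guess above can be illustrated with a toy value-numbering sketch (hypothetical names throughout; this is not how MachineCSE is actually implemented):

```python
# Minimal sketch of the missed CSE: two instructions with the same opcode
# and the same resolved operands should be merged. If one operand is a
# COPY of x0 and the other is x0 itself, a naive operand comparison sees
# two different keys and keeps both negates.

def cse(insts, resolve):
    """Keep only the first instruction for each (opcode, operands) key,
    resolving operands through the given function first."""
    seen, kept = set(), []
    for dst, op, srcs in insts:
        key = (op, tuple(resolve(s) for s in srcs))
        if key not in seen:
            seen.add(key)
            kept.append((dst, op, srcs))
    return kept

insts = [("v1", "sub", ("x0", "a0")),   # neg a0
         ("v2", "sub", ("c0", "a0"))]   # neg a0, but via c0 = COPY x0

# Without looking through copies, nothing is eliminated:
print(len(cse(insts, resolve=lambda r: r)))  # 2
# Looking through the copy exposes the redundancy:
print(len(cse(insts, resolve=lambda r: "x0" if r == "c0" else r)))  # 1
```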