[RISCV] Enable (non trivial) remat for most scalar instructions #162311
Conversation
This is a follow-up to the recent infrastructure work to generally support non-trivial rematerialization. This is the first in a small series of patches to enable non-trivial remat aggressively in the RISC-V backend. It deliberately avoids both vector instructions and loads, as those seem most likely to expose unexpected interactions. Note that this isn't ready to land just yet. We need to collect both compile-time numbers (in progress) and more perf numbers/stats on at least e.g. spec2017/test-suite. I'm posting it mostly as a placeholder since multiple people were talking about this and I want us to avoid duplicating work.
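For readers less familiar with the terminology: trivial rematerialization recomputes a value from constants only, while non-trivial remat also reads registers that must still hold the right value at the remat point. A toy Python sketch of that check (not LLVM code; all names here are illustrative):

```python
# Toy illustration of why remat beats a spill/reload pair when it is legal.
# A "trivially" rematerializable value depends only on constants
# (e.g. li a0, 42); a "non-trivial" one also reads registers that must
# still be live at the remat point (e.g. addi a0, s0, 16).

from dataclasses import dataclass

@dataclass
class Inst:
    opcode: str
    uses: tuple      # register operands read
    is_cheap: bool   # roughly the isAsCheapAsAMove flag

def can_remat(inst, live_at_use):
    """A def can be rematerialized at a use if the instruction is cheap to
    re-execute and every register it reads is still available there."""
    return inst.is_cheap and all(r in live_at_use for r in inst.uses)

li   = Inst("li",   (),      True)   # trivially rematerializable
addi = Inst("addi", ("s0",), True)   # non-trivial: needs s0 live

print(can_remat(li,   live_at_use=set()))    # True
print(can_remat(addi, live_at_use={"s0"}))   # True
print(can_remat(addi, live_at_use={"a1"}))   # False -> spill/reload instead
```

This is only the legality shape; the real profitability decision inside the register allocator is considerably more involved.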
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-backend-risc-v

Author: Philip Reames (preames)

Patch is 419.60 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/162311.diff

6 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfo.td b/llvm/lib/Target/RISCV/RISCVInstrInfo.td
index 9855c47a63392..f1ac3a5b7e9a5 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfo.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfo.td
@@ -780,21 +780,18 @@ def SB : Store_rri<0b000, "sb">, Sched<[WriteSTB, ReadStoreData, ReadMemBase]>;
def SH : Store_rri<0b001, "sh">, Sched<[WriteSTH, ReadStoreData, ReadMemBase]>;
def SW : Store_rri<0b010, "sw">, Sched<[WriteSTW, ReadStoreData, ReadMemBase]>;
-// ADDI isn't always rematerializable, but isReMaterializable will be used as
-// a hint which is verified in isReMaterializableImpl.
-let isReMaterializable = 1, isAsCheapAsAMove = 1 in
+let isReMaterializable = 1, isAsCheapAsAMove = 1 in {
def ADDI : ALU_ri<0b000, "addi">;
+def XORI : ALU_ri<0b100, "xori">;
+def ORI : ALU_ri<0b110, "ori">;
+}
-let IsSignExtendingOpW = 1 in {
+let IsSignExtendingOpW = 1, isReMaterializable = 1 in {
def SLTI : ALU_ri<0b010, "slti">;
def SLTIU : ALU_ri<0b011, "sltiu">;
}
-let isReMaterializable = 1, isAsCheapAsAMove = 1 in {
-def XORI : ALU_ri<0b100, "xori">;
-def ORI : ALU_ri<0b110, "ori">;
-}
-
+let isReMaterializable = 1 in {
def ANDI : ALU_ri<0b111, "andi">;
def SLLI : Shift_ri<0b00000, 0b001, "slli">,
@@ -826,6 +823,7 @@ def OR : ALU_rr<0b0000000, 0b110, "or", Commutable=1>,
Sched<[WriteIALU, ReadIALU, ReadIALU]>;
def AND : ALU_rr<0b0000000, 0b111, "and", Commutable=1>,
Sched<[WriteIALU, ReadIALU, ReadIALU]>;
+}
let hasSideEffects = 1, mayLoad = 0, mayStore = 0 in {
def FENCE : RVInstI<0b000, OPC_MISC_MEM, (outs),
@@ -893,7 +891,7 @@ def LWU : Load_ri<0b110, "lwu">, Sched<[WriteLDW, ReadMemBase]>;
def LD : Load_ri<0b011, "ld">, Sched<[WriteLDD, ReadMemBase]>;
def SD : Store_rri<0b011, "sd">, Sched<[WriteSTD, ReadStoreData, ReadMemBase]>;
-let IsSignExtendingOpW = 1 in {
+let IsSignExtendingOpW = 1, isReMaterializable = 1 in {
let hasSideEffects = 0, mayLoad = 0, mayStore = 0 in
def ADDIW : RVInstI<0b000, OPC_OP_IMM_32, (outs GPR:$rd),
(ins GPR:$rs1, simm12_lo:$imm12),
@@ -917,7 +915,7 @@ def SRLW : ALUW_rr<0b0000000, 0b101, "srlw">,
Sched<[WriteShiftReg32, ReadShiftReg32, ReadShiftReg32]>;
def SRAW : ALUW_rr<0b0100000, 0b101, "sraw">,
Sched<[WriteShiftReg32, ReadShiftReg32, ReadShiftReg32]>;
-} // IsSignExtendingOpW = 1
+} // IsSignExtendingOpW = 1, isReMaterializable = 1
} // Predicates = [IsRV64]
//===----------------------------------------------------------------------===//
diff --git a/llvm/test/CodeGen/RISCV/GlobalISel/wide-scalar-shift-by-byte-multiple-legalization.ll b/llvm/test/CodeGen/RISCV/GlobalISel/wide-scalar-shift-by-byte-multiple-legalization.ll
index ca9f7637388f7..74c31a229dad4 100644
--- a/llvm/test/CodeGen/RISCV/GlobalISel/wide-scalar-shift-by-byte-multiple-legalization.ll
+++ b/llvm/test/CodeGen/RISCV/GlobalISel/wide-scalar-shift-by-byte-multiple-legalization.ll
@@ -3000,9 +3000,9 @@ define void @lshr_32bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
; RV32I-NEXT: sw s9, 20(sp) # 4-byte Folded Spill
; RV32I-NEXT: sw s10, 16(sp) # 4-byte Folded Spill
; RV32I-NEXT: sw s11, 12(sp) # 4-byte Folded Spill
-; RV32I-NEXT: li a4, 0
+; RV32I-NEXT: li a5, 0
; RV32I-NEXT: lbu a3, 0(a0)
-; RV32I-NEXT: lbu a5, 1(a0)
+; RV32I-NEXT: lbu a4, 1(a0)
; RV32I-NEXT: lbu a6, 2(a0)
; RV32I-NEXT: lbu a7, 3(a0)
; RV32I-NEXT: lbu t0, 4(a0)
@@ -3013,736 +3013,750 @@ define void @lshr_32bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
; RV32I-NEXT: lbu t5, 9(a0)
; RV32I-NEXT: lbu t6, 10(a0)
; RV32I-NEXT: lbu s0, 11(a0)
-; RV32I-NEXT: slli a5, a5, 8
+; RV32I-NEXT: slli a4, a4, 8
; RV32I-NEXT: slli a7, a7, 8
; RV32I-NEXT: slli t1, t1, 8
-; RV32I-NEXT: or a3, a5, a3
-; RV32I-NEXT: or a7, a7, a6
-; RV32I-NEXT: or t1, t1, t0
-; RV32I-NEXT: lbu a6, 13(a0)
-; RV32I-NEXT: lbu a5, 14(a0)
-; RV32I-NEXT: lbu s1, 15(a0)
+; RV32I-NEXT: or a3, a4, a3
+; RV32I-NEXT: or a4, a7, a6
+; RV32I-NEXT: or a7, t1, t0
+; RV32I-NEXT: lbu t0, 13(a0)
+; RV32I-NEXT: lbu a6, 14(a0)
+; RV32I-NEXT: lbu t1, 15(a0)
; RV32I-NEXT: slli t3, t3, 8
; RV32I-NEXT: slli t5, t5, 8
; RV32I-NEXT: slli s0, s0, 8
-; RV32I-NEXT: or t3, t3, t2
-; RV32I-NEXT: or t0, t5, t4
-; RV32I-NEXT: or t5, s0, t6
-; RV32I-NEXT: lbu t2, 1(a1)
-; RV32I-NEXT: lbu t4, 0(a1)
+; RV32I-NEXT: or s1, t3, t2
+; RV32I-NEXT: or t2, t5, t4
+; RV32I-NEXT: or t4, s0, t6
+; RV32I-NEXT: lbu t3, 1(a1)
+; RV32I-NEXT: lbu t5, 0(a1)
; RV32I-NEXT: lbu t6, 2(a1)
; RV32I-NEXT: lbu a1, 3(a1)
-; RV32I-NEXT: slli t2, t2, 8
-; RV32I-NEXT: or s0, t2, t4
-; RV32I-NEXT: slli t2, s1, 8
+; RV32I-NEXT: slli t3, t3, 8
+; RV32I-NEXT: or t5, t3, t5
+; RV32I-NEXT: slli t3, t1, 8
; RV32I-NEXT: slli a1, a1, 8
; RV32I-NEXT: or a1, a1, t6
-; RV32I-NEXT: slli t4, a7, 16
-; RV32I-NEXT: slli a7, t3, 16
-; RV32I-NEXT: slli t3, t5, 16
-; RV32I-NEXT: slli t5, a1, 16
-; RV32I-NEXT: or a1, a7, t1
-; RV32I-NEXT: or a7, t5, s0
+; RV32I-NEXT: slli a4, a4, 16
+; RV32I-NEXT: slli s1, s1, 16
+; RV32I-NEXT: slli t4, t4, 16
+; RV32I-NEXT: slli t1, a1, 16
+; RV32I-NEXT: or s5, s1, a7
+; RV32I-NEXT: or a7, t1, t5
; RV32I-NEXT: slli a7, a7, 3
; RV32I-NEXT: srli t1, a7, 5
; RV32I-NEXT: andi t5, a7, 31
; RV32I-NEXT: neg s3, t5
; RV32I-NEXT: beqz t5, .LBB12_2
; RV32I-NEXT: # %bb.1:
-; RV32I-NEXT: sll a4, a1, s3
+; RV32I-NEXT: sll a5, s5, s3
; RV32I-NEXT: .LBB12_2:
-; RV32I-NEXT: or s7, t4, a3
-; RV32I-NEXT: lbu t4, 12(a0)
-; RV32I-NEXT: lbu t6, 19(a0)
-; RV32I-NEXT: slli s1, a6, 8
-; RV32I-NEXT: or a5, t2, a5
-; RV32I-NEXT: or a3, t3, t0
+; RV32I-NEXT: or a4, a4, a3
+; RV32I-NEXT: lbu t6, 12(a0)
+; RV32I-NEXT: lbu s0, 19(a0)
+; RV32I-NEXT: slli s1, t0, 8
+; RV32I-NEXT: or t0, t3, a6
+; RV32I-NEXT: or a1, t4, t2
; RV32I-NEXT: beqz t1, .LBB12_4
; RV32I-NEXT: # %bb.3:
-; RV32I-NEXT: li s0, 0
+; RV32I-NEXT: mv s11, a4
+; RV32I-NEXT: li a4, 0
; RV32I-NEXT: j .LBB12_5
; RV32I-NEXT: .LBB12_4:
-; RV32I-NEXT: srl s0, s7, a7
-; RV32I-NEXT: or s0, s0, a4
+; RV32I-NEXT: mv s11, a4
+; RV32I-NEXT: srl a6, a4, a7
+; RV32I-NEXT: or a4, a6, a5
; RV32I-NEXT: .LBB12_5:
; RV32I-NEXT: li a6, 0
-; RV32I-NEXT: lbu t0, 17(a0)
-; RV32I-NEXT: lbu a4, 18(a0)
-; RV32I-NEXT: slli s4, t6, 8
-; RV32I-NEXT: or s2, s1, t4
-; RV32I-NEXT: slli a5, a5, 16
-; RV32I-NEXT: li s5, 1
-; RV32I-NEXT: sll t6, a3, s3
+; RV32I-NEXT: lbu s2, 17(a0)
+; RV32I-NEXT: lbu a5, 18(a0)
+; RV32I-NEXT: slli s4, s0, 8
+; RV32I-NEXT: or s1, s1, t6
+; RV32I-NEXT: slli t0, t0, 16
+; RV32I-NEXT: li t3, 1
+; RV32I-NEXT: sll s6, a1, s3
; RV32I-NEXT: beqz t5, .LBB12_7
; RV32I-NEXT: # %bb.6:
-; RV32I-NEXT: mv a6, t6
+; RV32I-NEXT: mv a6, s6
; RV32I-NEXT: .LBB12_7:
; RV32I-NEXT: lbu t2, 16(a0)
-; RV32I-NEXT: lbu t3, 23(a0)
-; RV32I-NEXT: slli s1, t0, 8
-; RV32I-NEXT: or t4, s4, a4
-; RV32I-NEXT: srl a4, a1, a7
-; RV32I-NEXT: or a5, a5, s2
-; RV32I-NEXT: bne t1, s5, .LBB12_9
+; RV32I-NEXT: lbu t4, 23(a0)
+; RV32I-NEXT: slli s0, s2, 8
+; RV32I-NEXT: or t6, s4, a5
+; RV32I-NEXT: srl a3, s5, a7
+; RV32I-NEXT: or a5, t0, s1
+; RV32I-NEXT: sw a3, 0(sp) # 4-byte Folded Spill
+; RV32I-NEXT: bne t1, t3, .LBB12_9
; RV32I-NEXT: # %bb.8:
-; RV32I-NEXT: or s0, a4, a6
+; RV32I-NEXT: or a4, a3, a6
; RV32I-NEXT: .LBB12_9:
; RV32I-NEXT: li t0, 0
-; RV32I-NEXT: lbu s5, 21(a0)
+; RV32I-NEXT: lbu s2, 21(a0)
; RV32I-NEXT: lbu a6, 22(a0)
-; RV32I-NEXT: slli s4, t3, 8
-; RV32I-NEXT: or t2, s1, t2
-; RV32I-NEXT: slli s6, t4, 16
-; RV32I-NEXT: li s8, 2
-; RV32I-NEXT: sll t3, a5, s3
+; RV32I-NEXT: slli s1, t4, 8
+; RV32I-NEXT: or t2, s0, t2
+; RV32I-NEXT: slli s4, t6, 16
+; RV32I-NEXT: li a3, 2
+; RV32I-NEXT: sll s8, a5, s3
; RV32I-NEXT: beqz t5, .LBB12_11
; RV32I-NEXT: # %bb.10:
-; RV32I-NEXT: mv t0, t3
+; RV32I-NEXT: mv t0, s8
; RV32I-NEXT: .LBB12_11:
-; RV32I-NEXT: lbu s1, 20(a0)
-; RV32I-NEXT: lbu s2, 27(a0)
-; RV32I-NEXT: slli s5, s5, 8
-; RV32I-NEXT: or s4, s4, a6
-; RV32I-NEXT: srl t4, a3, a7
-; RV32I-NEXT: or a6, s6, t2
-; RV32I-NEXT: bne t1, s8, .LBB12_13
+; RV32I-NEXT: lbu t6, 20(a0)
+; RV32I-NEXT: lbu s0, 27(a0)
+; RV32I-NEXT: slli s2, s2, 8
+; RV32I-NEXT: or s1, s1, a6
+; RV32I-NEXT: srl t3, a1, a7
+; RV32I-NEXT: or a6, s4, t2
+; RV32I-NEXT: sw s5, 8(sp) # 4-byte Folded Spill
+; RV32I-NEXT: bne t1, a3, .LBB12_13
; RV32I-NEXT: # %bb.12:
-; RV32I-NEXT: or s0, t4, t0
+; RV32I-NEXT: or a4, t3, t0
; RV32I-NEXT: .LBB12_13:
-; RV32I-NEXT: sw s7, 4(sp) # 4-byte Folded Spill
; RV32I-NEXT: li t2, 0
-; RV32I-NEXT: lbu s6, 25(a0)
+; RV32I-NEXT: lbu s4, 25(a0)
; RV32I-NEXT: lbu t0, 26(a0)
-; RV32I-NEXT: slli s8, s2, 8
-; RV32I-NEXT: or s7, s5, s1
-; RV32I-NEXT: slli s9, s4, 16
-; RV32I-NEXT: sll s11, a6, s3
+; RV32I-NEXT: slli s7, s0, 8
+; RV32I-NEXT: or s5, s2, t6
+; RV32I-NEXT: slli s9, s1, 16
+; RV32I-NEXT: li t6, 3
+; RV32I-NEXT: sll t4, a6, s3
; RV32I-NEXT: beqz t5, .LBB12_15
; RV32I-NEXT: # %bb.14:
-; RV32I-NEXT: mv t2, s11
+; RV32I-NEXT: mv t2, t4
; RV32I-NEXT: .LBB12_15:
-; RV32I-NEXT: lbu s1, 24(a0)
-; RV32I-NEXT: lbu s2, 31(a0)
-; RV32I-NEXT: slli s5, s6, 8
-; RV32I-NEXT: or s4, s8, t0
-; RV32I-NEXT: srl ra, a5, a7
-; RV32I-NEXT: or t0, s9, s7
-; RV32I-NEXT: li s6, 3
-; RV32I-NEXT: bne t1, s6, .LBB12_17
+; RV32I-NEXT: lbu s0, 24(a0)
+; RV32I-NEXT: lbu s1, 31(a0)
+; RV32I-NEXT: slli s4, s4, 8
+; RV32I-NEXT: or s2, s7, t0
+; RV32I-NEXT: srl a3, a5, a7
+; RV32I-NEXT: or t0, s9, s5
+; RV32I-NEXT: li s9, 3
+; RV32I-NEXT: bne t1, t6, .LBB12_17
; RV32I-NEXT: # %bb.16:
-; RV32I-NEXT: or s0, ra, t2
+; RV32I-NEXT: or a4, a3, t2
; RV32I-NEXT: .LBB12_17:
+; RV32I-NEXT: mv t6, t3
; RV32I-NEXT: li t2, 0
; RV32I-NEXT: lbu s7, 29(a0)
-; RV32I-NEXT: lbu s6, 30(a0)
-; RV32I-NEXT: slli s8, s2, 8
-; RV32I-NEXT: or s2, s5, s1
-; RV32I-NEXT: slli s5, s4, 16
-; RV32I-NEXT: li s9, 4
-; RV32I-NEXT: sll s1, t0, s3
-; RV32I-NEXT: sw s1, 8(sp) # 4-byte Folded Spill
+; RV32I-NEXT: lbu s5, 30(a0)
+; RV32I-NEXT: slli s1, s1, 8
+; RV32I-NEXT: or s10, s4, s0
+; RV32I-NEXT: slli s2, s2, 16
+; RV32I-NEXT: li a3, 4
+; RV32I-NEXT: sll s0, t0, s3
; RV32I-NEXT: beqz t5, .LBB12_19
; RV32I-NEXT: # %bb.18:
-; RV32I-NEXT: lw t2, 8(sp) # 4-byte Folded Reload
+; RV32I-NEXT: mv t2, s0
; RV32I-NEXT: .LBB12_19:
-; RV32I-NEXT: lbu s1, 28(a0)
+; RV32I-NEXT: lbu t3, 28(a0)
; RV32I-NEXT: slli s7, s7, 8
-; RV32I-NEXT: or s4, s8, s6
-; RV32I-NEXT: srl s10, a6, a7
-; RV32I-NEXT: or a0, s5, s2
-; RV32I-NEXT: bne t1, s9, .LBB12_21
+; RV32I-NEXT: or s4, s1, s5
+; RV32I-NEXT: srl s1, a6, a7
+; RV32I-NEXT: or a0, s2, s10
+; RV32I-NEXT: beq t1, a3, .LBB12_21
; RV32I-NEXT: # %bb.20:
-; RV32I-NEXT: or s0, s10, t2
+; RV32I-NEXT: mv a3, s1
+; RV32I-NEXT: j .LBB12_22
; RV32I-NEXT: .LBB12_21:
+; RV32I-NEXT: mv a3, s1
+; RV32I-NEXT: or a4, s1, t2
+; RV32I-NEXT: .LBB12_22:
+; RV32I-NEXT: li s10, 1
; RV32I-NEXT: li s2, 0
-; RV32I-NEXT: or t2, s7, s1
+; RV32I-NEXT: or t2, s7, t3
; RV32I-NEXT: slli s4, s4, 16
-; RV32I-NEXT: li s9, 5
+; RV32I-NEXT: li s1, 5
; RV32I-NEXT: sll s7, a0, s3
-; RV32I-NEXT: beqz t5, .LBB12_23
-; RV32I-NEXT: # %bb.22:
+; RV32I-NEXT: beqz t5, .LBB12_24
+; RV32I-NEXT: # %bb.23:
; RV32I-NEXT: mv s2, s7
-; RV32I-NEXT: .LBB12_23:
-; RV32I-NEXT: srl s8, t0, a7
+; RV32I-NEXT: .LBB12_24:
+; RV32I-NEXT: sw a1, 4(sp) # 4-byte Folded Spill
+; RV32I-NEXT: srl t3, t0, a7
; RV32I-NEXT: or t2, s4, t2
-; RV32I-NEXT: bne t1, s9, .LBB12_25
-; RV32I-NEXT: # %bb.24:
-; RV32I-NEXT: or s0, s8, s2
-; RV32I-NEXT: .LBB12_25:
-; RV32I-NEXT: li s4, 0
+; RV32I-NEXT: beq t1, s1, .LBB12_26
+; RV32I-NEXT: # %bb.25:
+; RV32I-NEXT: mv a1, t3
+; RV32I-NEXT: j .LBB12_27
+; RV32I-NEXT: .LBB12_26:
+; RV32I-NEXT: mv a1, t3
+; RV32I-NEXT: or a4, t3, s2
+; RV32I-NEXT: .LBB12_27:
+; RV32I-NEXT: li t3, 0
; RV32I-NEXT: li s2, 6
; RV32I-NEXT: sll s5, t2, s3
-; RV32I-NEXT: beqz t5, .LBB12_27
-; RV32I-NEXT: # %bb.26:
-; RV32I-NEXT: mv s4, s5
-; RV32I-NEXT: .LBB12_27:
-; RV32I-NEXT: srl s6, a0, a7
-; RV32I-NEXT: bne t1, s2, .LBB12_29
+; RV32I-NEXT: beqz t5, .LBB12_29
; RV32I-NEXT: # %bb.28:
-; RV32I-NEXT: or s0, s6, s4
+; RV32I-NEXT: mv t3, s5
; RV32I-NEXT: .LBB12_29:
-; RV32I-NEXT: li s3, 7
-; RV32I-NEXT: srl s1, t2, a7
-; RV32I-NEXT: mv s4, s1
-; RV32I-NEXT: bne t1, s3, .LBB12_34
+; RV32I-NEXT: srl s3, a0, a7
+; RV32I-NEXT: beq t1, s2, .LBB12_31
; RV32I-NEXT: # %bb.30:
-; RV32I-NEXT: bnez a7, .LBB12_35
+; RV32I-NEXT: mv ra, s3
+; RV32I-NEXT: j .LBB12_32
; RV32I-NEXT: .LBB12_31:
-; RV32I-NEXT: li s0, 0
-; RV32I-NEXT: bnez t5, .LBB12_36
+; RV32I-NEXT: mv ra, s3
+; RV32I-NEXT: or a4, s3, t3
; RV32I-NEXT: .LBB12_32:
-; RV32I-NEXT: li s4, 2
-; RV32I-NEXT: beqz t1, .LBB12_37
-; RV32I-NEXT: .LBB12_33:
-; RV32I-NEXT: li a4, 0
-; RV32I-NEXT: j .LBB12_38
+; RV32I-NEXT: li s3, 7
+; RV32I-NEXT: srl s4, t2, a7
+; RV32I-NEXT: mv t3, s4
+; RV32I-NEXT: beq t1, s3, .LBB12_34
+; RV32I-NEXT: # %bb.33:
+; RV32I-NEXT: mv t3, a4
; RV32I-NEXT: .LBB12_34:
-; RV32I-NEXT: mv s4, s0
-; RV32I-NEXT: beqz a7, .LBB12_31
-; RV32I-NEXT: .LBB12_35:
-; RV32I-NEXT: sw s4, 4(sp) # 4-byte Folded Spill
-; RV32I-NEXT: li s0, 0
-; RV32I-NEXT: beqz t5, .LBB12_32
+; RV32I-NEXT: mv a4, s11
+; RV32I-NEXT: beqz a7, .LBB12_36
+; RV32I-NEXT: # %bb.35:
+; RV32I-NEXT: mv a4, t3
; RV32I-NEXT: .LBB12_36:
-; RV32I-NEXT: mv s0, t6
-; RV32I-NEXT: li s4, 2
-; RV32I-NEXT: bnez t1, .LBB12_33
-; RV32I-NEXT: .LBB12_37:
-; RV32I-NEXT: or a4, a4, s0
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: li s11, 2
+; RV32I-NEXT: beqz t5, .LBB12_38
+; RV32I-NEXT: # %bb.37:
+; RV32I-NEXT: mv t3, s6
; RV32I-NEXT: .LBB12_38:
-; RV32I-NEXT: li s0, 1
-; RV32I-NEXT: li t6, 0
-; RV32I-NEXT: bnez t5, .LBB12_57
+; RV32I-NEXT: beqz t1, .LBB12_40
; RV32I-NEXT: # %bb.39:
-; RV32I-NEXT: beq t1, s0, .LBB12_58
+; RV32I-NEXT: li s6, 0
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: bnez t5, .LBB12_41
+; RV32I-NEXT: j .LBB12_42
; RV32I-NEXT: .LBB12_40:
-; RV32I-NEXT: li t6, 0
-; RV32I-NEXT: bnez t5, .LBB12_59
+; RV32I-NEXT: lw s6, 0(sp) # 4-byte Folded Reload
+; RV32I-NEXT: or s6, s6, t3
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: beqz t5, .LBB12_42
; RV32I-NEXT: .LBB12_41:
-; RV32I-NEXT: beq t1, s4, .LBB12_60
+; RV32I-NEXT: mv t3, s8
; RV32I-NEXT: .LBB12_42:
-; RV32I-NEXT: li t6, 0
-; RV32I-NEXT: bnez t5, .LBB12_61
-; RV32I-NEXT: .LBB12_43:
-; RV32I-NEXT: li s4, 3
-; RV32I-NEXT: bne t1, s4, .LBB12_45
+; RV32I-NEXT: beq t1, s10, .LBB12_58
+; RV32I-NEXT: # %bb.43:
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: bnez t5, .LBB12_59
; RV32I-NEXT: .LBB12_44:
-; RV32I-NEXT: or a4, s10, t6
+; RV32I-NEXT: beq t1, s11, .LBB12_60
; RV32I-NEXT: .LBB12_45:
-; RV32I-NEXT: li t6, 0
-; RV32I-NEXT: li s4, 4
-; RV32I-NEXT: bnez t5, .LBB12_62
-; RV32I-NEXT: # %bb.46:
-; RV32I-NEXT: beq t1, s4, .LBB12_63
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: bnez t5, .LBB12_61
+; RV32I-NEXT: .LBB12_46:
+; RV32I-NEXT: bne t1, s9, .LBB12_48
; RV32I-NEXT: .LBB12_47:
-; RV32I-NEXT: li t6, 0
-; RV32I-NEXT: bnez t5, .LBB12_64
+; RV32I-NEXT: or s6, a3, t3
; RV32I-NEXT: .LBB12_48:
-; RV32I-NEXT: beq t1, s9, .LBB12_65
-; RV32I-NEXT: .LBB12_49:
-; RV32I-NEXT: mv t6, s1
-; RV32I-NEXT: bne t1, s2, .LBB12_66
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: li s9, 4
+; RV32I-NEXT: bnez t5, .LBB12_62
+; RV32I-NEXT: # %bb.49:
+; RV32I-NEXT: beq t1, s9, .LBB12_63
; RV32I-NEXT: .LBB12_50:
-; RV32I-NEXT: li a4, 0
-; RV32I-NEXT: bne t1, s3, .LBB12_67
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: bnez t5, .LBB12_64
; RV32I-NEXT: .LBB12_51:
-; RV32I-NEXT: beqz a7, .LBB12_53
+; RV32I-NEXT: beq t1, s1, .LBB12_65
; RV32I-NEXT: .LBB12_52:
-; RV32I-NEXT: mv a1, a4
+; RV32I-NEXT: mv t3, s4
+; RV32I-NEXT: bne t1, s2, .LBB12_66
; RV32I-NEXT: .LBB12_53:
-; RV32I-NEXT: li a4, 0
-; RV32I-NEXT: li t6, 2
-; RV32I-NEXT: beqz t5, .LBB12_55
-; RV32I-NEXT: # %bb.54:
-; RV32I-NEXT: mv a4, t3
+; RV32I-NEXT: li s6, 0
+; RV32I-NEXT: bne t1, s3, .LBB12_67
+; RV32I-NEXT: .LBB12_54:
+; RV32I-NEXT: bnez a7, .LBB12_68
; RV32I-NEXT: .LBB12_55:
-; RV32I-NEXT: beqz t1, .LBB12_68
-; RV32I-NEXT: # %bb.56:
-; RV32I-NEXT: li a4, 0
-; RV32I-NEXT: j .LBB12_69
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: bnez t5, .LBB12_69
+; RV32I-NEXT: .LBB12_56:
+; RV32I-NEXT: beqz t1, .LBB12_70
; RV32I-NEXT: .LBB12_57:
-; RV32I-NEXT: mv t6, t3
-; RV32I-NEXT: bne t1, s0, .LBB12_40
+; RV32I-NEXT: li s6, 0
+; RV32I-NEXT: j .LBB12_71
; RV32I-NEXT: .LBB12_58:
-; RV32I-NEXT: or a4, t4, t6
-; RV32I-NEXT: li t6, 0
-; RV32I-NEXT: beqz t5, .LBB12_41
+; RV32I-NEXT: or s6, t6, t3
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: beqz t5, .LBB12_44
; RV32I-NEXT: .LBB12_59:
-; RV32I-NEXT: mv t6, s11
-; RV32I-NEXT: bne t1, s4, .LBB12_42
+; RV32I-NEXT: mv t3, t4
+; RV32I-NEXT: bne t1, s11, .LBB12_45
; RV32I-NEXT: .LBB12_60:
-; RV32I-NEXT: or a4, ra, t6
-; RV32I-NEXT: li t6, 0
-; RV32I-NEXT: beqz t5, .LBB12_43
+; RV32I-NEXT: srl s6, a5, a7
+; RV32I-NEXT: or s6, s6, t3
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: beqz t5, .LBB12_46
; RV32I-NEXT: .LBB12_61:
-; RV32I-NEXT: lw t6, 8(sp) # 4-byte Folded Reload
-; RV32I-NEXT: li s4, 3
-; RV32I-NEXT: beq t1, s4, .LBB12_44
-; RV32I-NEXT: j .LBB12_45
+; RV32I-NEXT: mv t3, s0
+; RV32I-NEXT: beq t1, s9, .LBB12_47
+; RV32I-NEXT: j .LBB12_48
; RV32I-NEXT: .LBB12_62:
-; RV32I-NEXT: mv t6, s7
-; RV32I-NEXT: bne t1, s4, .LBB12_47
+; RV32I-NEXT: mv t3, s7
+; RV32I-NEXT: bne t1, s9, .LBB12_50
; RV32I-NEXT: .LBB12_63:
-; RV32I-NEXT: or a4, s8, t6
-; RV32I-NEXT: li t6, 0
-; RV32I-NEXT: beqz t5, .LBB12_48
+; RV32I-NEXT: or s6, a1, t3
+; RV32I-NEXT: li t3, 0
+; RV32I-NEXT: beqz t5, .LBB12_51
; RV32I-NEXT: .LBB12_64:
-; RV32I-NEXT: mv t6, s5
-; RV32I-NEXT: bne t1, s9, .LBB12_49
+; RV32I-NEXT: mv t3, s5
+; RV32I-NEXT: bne t1, s1, .LBB12_52
; RV32I-NEXT: .LBB12_65:
-; RV32I-NEXT: or a4, s6, t6
-; RV32I-NEXT: mv t6, s1
-; RV32I-NEXT: beq t1, s2, .LBB12_50
+; RV32I-NEXT: or s6, ra, t3
+; RV32I-NEXT: mv t3, s4
+; RV32I-NEXT: beq t1, s2, .LBB12_53
; RV32I-NEXT: .LBB12_66:
-; RV32I-NEXT: mv t6, a4
-; RV32I-NEXT: li a4, 0
-; RV32I-NEXT: beq t1, s3, .LBB12_51
+; RV32I-NEXT: mv t3, s6
+; RV32I-NEXT: li s6, 0
+; RV32I-NEXT: beq t1, s3, .LBB12_54
; RV32I-NEXT: .LBB12_67:
-; RV32I-NEXT: mv a4, t6
-; RV32I-NEXT: bnez a7, .LBB12_52
-; RV32I-NEXT: j .LBB12_53
+; RV32I-NEXT: mv s6, t3
+; RV32I-NEXT: beqz a7, .LBB12_55
; RV32I-NEXT: .LBB12_68:
-; RV32I-NEXT: or a4, t4, a4
-; RV32I-NEXT: .LBB12_69:
-; RV32I-NEXT: li t4, 3
+; RV32I-NEXT: sw s6, 8(sp) # 4-byte Folded Spill
; RV32I-NEXT: li t3, 0
-; RV32I-NEXT: bnez t5, .LBB12_84
-; RV32I-NEXT: # %bb.70:
-; RV32I-NEXT: beq t1, s0, .LBB12_85
+; RV32I-NEXT: beqz t5, .LBB12_56
+; RV32I-NEXT: .LBB12_69:
+; RV32I-NEXT: mv t3, s8
+; RV32I-NEXT: bnez t1, .LBB12_57
+; RV32I-NEXT: .LBB12_70:
+; RV32I-NEXT: or s6, t6, t3
; RV32I-NEXT: .LBB12_71:
+; RV32I-NEXT: li t6, 3
; RV32I-NEXT: li t3, 0
; RV32I-NEXT: bnez t5, .LBB12_86
-; RV32I-NEXT: .LBB12_72:
-; RV32I-NEXT: beq t1, t6, .LBB12_87
+; RV32I-NEXT: # %bb.72:
+; RV32I-NEXT: beq t1, s10, .LBB12_87
; RV32I-NEXT: .LBB12_73:
; RV32I-NEXT: li t3, 0
; RV32I-NEXT: bnez t5, .LBB12_88
; RV32I-NEXT: .LBB12_74:
-; RV32I-NEXT: beq t1, t4, .LBB12_89
+; RV32I-NEXT: beq t1, s11, .LBB12_89
; RV32I-NEXT: .LBB12_75:
; RV32I-NEXT: li t3, 0
; RV32I-NEXT: bnez t5, .LBB12_90
; RV32I-NEXT: .LBB12_76:
-; RV32I...
[truncated]
; RV32I-NEXT: .LBB12_206:
; RV32I-NEXT: mv t3, t4
; RV32I-NEXT: bnez a7, .LBB12_189
; RV32I-NEXT: j .LBB12_190
This code got quite a bit longer. Is it better?
; RV32I-NEXT: .LBB13_206:
; RV32I-NEXT: mv t3, t4
; RV32I-NEXT: bnez a7, .LBB13_189
; RV32I-NEXT: j .LBB13_190
Longer
Here are dyn instcount diffs for an rva22 build of SPEC 2017:
Looking at the static assembly diff, it is large due to lots of very tiny regalloc changes. The obvious outlier is mcf, which I'll need to report back on after having a closer look.
Some quick static results of this on llvm-test-suite, -march=rva23u64 -O3:
And SPEC CPU 2017:
Overall seems to be an improvement but I'm definitely surprised to see that some cases have an increase in number of reloads. The results in 505.mcf_r match @asb's dynamic results. Would be good to get to the bottom of that.
I spent some time having a closer look. There's a very specific hot block in spec_qsort that gets a move and a negate that seems to account for a good chunk of dynamic instcount diff: New:
vs old:
I'll get a minimal reproducer so we can decide whether to put this down to bad luck or something we can address in the context of this patch.
Here's a reduced test case that gives equivalent basic blocks to the above for baseline vs the change introduced in this patch:

; ModuleID = '<stdin>'
source_filename = "<stdin>"
target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128"
target triple = "riscv64-unknown-linux-gnu"
define i64 @ham(ptr %arg, i64 %arg1, i64 %arg2, ptr %arg3, i1 %arg4, ptr %arg5, ptr %arg6, i1 %arg7, i1 %arg8, i64 %arg9) #0 {
bb:
%sub = sub i64 0, %arg2
br i1 %arg4, label %bb10, label %bb12
bb10: ; preds = %bb10, %bb
%phi = phi ptr [ %getelementptr11, %bb10 ], [ %arg, %bb ]
%getelementptr = getelementptr i8, ptr %phi, i64 %arg9
%call = tail call i32 %arg3(ptr %arg, ptr null)
%getelementptr11 = getelementptr i8, ptr %phi, i64 %arg2
br label %bb10
bb12: ; preds = %bb35, %bb29, %bb
%phi13 = phi ptr [ null, %bb ], [ %arg3, %bb29 ], [ null, %bb35 ]
%call14 = tail call i32 %arg3(ptr %arg6, ptr null)
br label %bb15
bb15: ; preds = %bb26, %bb12
%phi16 = phi ptr [ %getelementptr28, %bb26 ], [ %phi13, %bb12 ]
%phi17 = phi ptr [ %phi27, %bb26 ], [ null, %bb12 ]
%call18 = tail call i32 %arg3(ptr %arg3, ptr %arg)
br i1 %arg4, label %bb19, label %bb29
bb19: ; preds = %bb15
br i1 %arg8, label %bb20, label %bb26
bb20: ; preds = %bb20, %bb19
%phi21 = phi ptr [ %getelementptr23, %bb20 ], [ %phi16, %bb19 ]
%phi22 = phi i64 [ %add, %bb20 ], [ %arg1, %bb19 ]
%load = load i64, ptr %phi21, align 8
%getelementptr23 = getelementptr nusw i8, ptr %phi21, i64 8
%add = add i64 %phi22, 1
%icmp = icmp ugt i64 %phi22, 0
br i1 %icmp, label %bb20, label %bb24
bb24: ; preds = %bb20
%getelementptr25 = getelementptr i8, ptr %phi17, i64 %sub
br label %bb26
bb26: ; preds = %bb24, %bb19
%phi27 = phi ptr [ %getelementptr25, %bb24 ], [ %phi17, %bb19 ]
%getelementptr28 = getelementptr i8, ptr %phi16, i64 %sub
br i1 %arg4, label %bb35, label %bb15
bb29: ; preds = %bb29, %bb15
%phi30 = phi i64 [ %add33, %bb29 ], [ 1, %bb15 ]
%phi31 = phi ptr [ %getelementptr32, %bb29 ], [ %phi16, %bb15 ]
%getelementptr32 = getelementptr nusw i8, ptr %phi31, i64 4
store i32 0, ptr %phi31, align 4
%add33 = add i64 %phi30, 1
%icmp34 = icmp ugt i64 %phi30, 0
br i1 %icmp34, label %bb29, label %bb12
bb35: ; preds = %bb26
br i1 %arg7, label %bb36, label %bb12
bb36: ; preds = %bb35
br i1 %arg4, label %bb37, label %bb38
bb37: ; preds = %bb36
ret i64 0
bb38: ; preds = %bb36
br i1 %arg4, label %bb39, label %bb40
bb39: ; preds = %bb38
ret i64 0
bb40: ; preds = %bb38
br i1 %arg4, label %bb41, label %bb42
bb41: ; preds = %bb41, %bb40
br label %bb41
bb42: ; preds = %bb50, %bb40
br label %bb43
bb43: ; preds = %bb49, %bb42
%phi44 = phi ptr [ %getelementptr45, %bb49 ], [ null, %bb42 ]
%getelementptr45 = getelementptr i8, ptr %phi44, i64 %sub
%call46 = tail call i32 %arg3(ptr null, ptr null)
br label %bb47
bb47: ; preds = %bb47, %bb43
%load48 = load i8, ptr %phi44, align 1
store i8 %load48, ptr null, align 1
br i1 %arg7, label %bb47, label %bb49
bb49: ; preds = %bb47
br i1 %arg4, label %bb43, label %bb50
bb50: ; preds = %bb49
%getelementptr51 = getelementptr i8, ptr null, i64 %arg2
%icmp52 = icmp ult ptr %getelementptr51, %arg5
br i1 %icmp52, label %bb42, label %bb53
bb53: ; preds = %bb50
ret i64 0
}
attributes #0 = { "target-features"="+c,+m" }

You can see the divergence in selectOrSplit during regalloc.
Haven't gotten to looking at the regalloc bit yet, but this example illustrates a wildly unprofitable loop-term-fold case. We've got a multiply being inserted into a loop, and redundant instructions in the pre-header block, just to remove one scalar increment in the loop. I'm going to switch to looking at the regalloc bit, but we should remember to come back to this aspect as it's likely a generic problem.
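The unprofitable shape being described can be sketched abstractly (this is a hand-written illustration, not the actual generated code or the spec_qsort source):

```python
# Two equivalent ways to produce the address sequence a pointer-increment
# loop walks through. The term-folded form replaces one cheap add per
# iteration with a multiply per iteration (plus extra preheader setup),
# which is a net loss when the increment was already a single add.

def addrs_increment(base, step, n):
    # Original loop shape: p += step, one add per iteration.
    p, out = base, []
    for _ in range(n):
        out.append(p)
        p += step
    return out

def addrs_term_folded(base, step, n):
    # Term-folded shape: recompute base + i*step each iteration.
    return [base + i * step for i in range(n)]

assert addrs_increment(0x1000, 8, 4) == addrs_term_folded(0x1000, 8, 4)
print(addrs_increment(0x1000, 8, 4))  # [4096, 4104, 4112, 4120]
```

Both compute the same addresses; the complaint above is purely about the per-iteration cost of the second form.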
Before I get to the actual regalloc, one more missed issue. It looks like MachineCSE is failing to CSE two identical negates. I haven't looked into why, but my tentative guess is that the first operand is a COPY of x0, not x0 itself. On the regalloc side, there's a couple of things going on here.
(3) is going to require some more discussion heuristic-wise. We showed originally that the revised heuristic was profitable, so to tackle this case we'll need to refine it while not losing the profitability on other cases.
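The tentative COPY-of-x0 guess above can be illustrated with a toy value-numbering sketch (hypothetical names throughout; this is not how MachineCSE is actually implemented):

```python
# Minimal sketch of the missed CSE: two instructions with the same opcode
# and the same resolved operands should be merged. If one operand is a
# COPY of x0 and the other is x0 itself, a naive operand comparison sees
# two different keys and keeps both negates.

def cse(insts, resolve):
    """Keep only the first instruction for each (opcode, operands) key,
    resolving operands through the given function first."""
    seen, kept = set(), []
    for dst, op, srcs in insts:
        key = (op, tuple(resolve(s) for s in srcs))
        if key not in seen:
            seen.add(key)
            kept.append((dst, op, srcs))
    return kept

insts = [("v1", "sub", ("x0", "a0")),   # neg a0
         ("v2", "sub", ("c0", "a0"))]   # neg a0, but via c0 = COPY x0

# Without looking through copies, nothing is eliminated:
print(len(cse(insts, resolve=lambda r: r)))  # 2
# Looking through the copy exposes the redundancy:
print(len(cse(insts, resolve=lambda r: "x0" if r == "c0" else r)))  # 1
```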