
Conversation

@topperc (Collaborator) commented Nov 25, 2025

Rely on ZilsdOptimizer for non-volatile loads/stores. This can avoid extra moves.

Also use Zilsd for non-volatile stores of 0.

@topperc (Collaborator, Author) commented Nov 25, 2025

CC @christian-herber-nxp

@llvmbot (Member) commented Nov 25, 2025

@llvm/pr-subscribers-backend-risc-v

Author: Craig Topper (topperc)

Changes

Rely on ZilsdOptimizer for non-volatile loads/stores. This can avoid extra moves.

Also use Zilsd for non-volatile stores of 0.


Full diff: https://github.com/llvm/llvm-project/pull/169529.diff

2 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+7-2)
  • (modified) llvm/test/CodeGen/RISCV/zilsd.ll (+19-32)
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 19a16197272fe..de60e8461e5d7 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -8404,7 +8404,12 @@ SDValue RISCVTargetLowering::LowerOperation(SDValue Op,
       if (Store->isTruncatingStore())
         return SDValue();
 
-      if (Store->getAlign() < Subtarget.getZilsdAlign())
+      // Expand non-volatile or misaligned stores.
+      // Keep stores of 0 since that doesn't constrain the register allocator.
+      if (!(Store->isVolatile() ||
+            (isa<ConstantSDNode>(StoredVal) &&
+             cast<ConstantSDNode>(StoredVal)->isZero())) ||
+          Store->getAlign() < Subtarget.getZilsdAlign())
         return SDValue();
 
       SDLoc DL(Op);
@@ -14803,7 +14808,7 @@ void RISCVTargetLowering::ReplaceNodeResults(SDNode *N,
       assert(Subtarget.hasStdExtZilsd() && !Subtarget.is64Bit() &&
              "Unexpected custom legalisation");
 
-      if (Ld->getAlign() < Subtarget.getZilsdAlign())
+      if (!Ld->isVolatile() || Ld->getAlign() < Subtarget.getZilsdAlign())
         return;
 
       SDLoc DL(N);
diff --git a/llvm/test/CodeGen/RISCV/zilsd.ll b/llvm/test/CodeGen/RISCV/zilsd.ll
index 27b1ff76f6f05..427b1d83b9d8e 100644
--- a/llvm/test/CodeGen/RISCV/zilsd.ll
+++ b/llvm/test/CodeGen/RISCV/zilsd.ll
@@ -9,9 +9,10 @@
 define i64 @load(ptr %a) nounwind {
 ; CHECK-LABEL: load:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    mv a2, a0
-; CHECK-NEXT:    ld a0, 80(a0)
-; CHECK-NEXT:    ld zero, 0(a2)
+; CHECK-NEXT:    lw a2, 80(a0)
+; CHECK-NEXT:    lw a1, 84(a0)
+; CHECK-NEXT:    ld zero, 0(a0)
+; CHECK-NEXT:    mv a0, a2
 ; CHECK-NEXT:    ret
   %1 = getelementptr i64, ptr %a, i32 10
   %2 = load i64, ptr %1
@@ -44,10 +45,10 @@ define i64 @load_align4(ptr %a) nounwind {
 define void @store(ptr %a, i64 %b) nounwind {
 ; CHECK-LABEL: store:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    mv a3, a2
-; CHECK-NEXT:    mv a2, a1
-; CHECK-NEXT:    sd a2, 0(a0)
-; CHECK-NEXT:    sd a2, 88(a0)
+; CHECK-NEXT:    sw a1, 0(a0)
+; CHECK-NEXT:    sw a2, 4(a0)
+; CHECK-NEXT:    sw a1, 88(a0)
+; CHECK-NEXT:    sw a2, 92(a0)
 ; CHECK-NEXT:    ret
   store i64 %b, ptr %a
   %1 = getelementptr i64, ptr %a, i32 11
@@ -56,25 +57,11 @@ define void @store(ptr %a, i64 %b) nounwind {
 }
 
 define void @store_align4(ptr %a, i64 %b) nounwind {
-; SLOW-LABEL: store_align4:
-; SLOW:       # %bb.0:
-; SLOW-NEXT:    sw a1, 88(a0)
-; SLOW-NEXT:    sw a2, 92(a0)
-; SLOW-NEXT:    ret
-;
-; FAST-LABEL: store_align4:
-; FAST:       # %bb.0:
-; FAST-NEXT:    mv a3, a2
-; FAST-NEXT:    mv a2, a1
-; FAST-NEXT:    sd a2, 88(a0)
-; FAST-NEXT:    ret
-;
-; 4BYTEALIGN-LABEL: store_align4:
-; 4BYTEALIGN:       # %bb.0:
-; 4BYTEALIGN-NEXT:    mv a3, a2
-; 4BYTEALIGN-NEXT:    mv a2, a1
-; 4BYTEALIGN-NEXT:    sd a2, 88(a0)
-; 4BYTEALIGN-NEXT:    ret
+; CHECK-LABEL: store_align4:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    sw a1, 88(a0)
+; CHECK-NEXT:    sw a2, 92(a0)
+; CHECK-NEXT:    ret
   %1 = getelementptr i64, ptr %a, i32 11
   store i64 %b, ptr %1, align 4
   ret void
@@ -158,9 +145,8 @@ define void @store_unaligned(ptr %p, i64 %v) {
 ;
 ; FAST-LABEL: store_unaligned:
 ; FAST:       # %bb.0:
-; FAST-NEXT:    mv a3, a2
-; FAST-NEXT:    mv a2, a1
-; FAST-NEXT:    sd a2, 0(a0)
+; FAST-NEXT:    sw a1, 0(a0)
+; FAST-NEXT:    sw a2, 4(a0)
 ; FAST-NEXT:    ret
 ;
 ; 4BYTEALIGN-LABEL: store_unaligned:
@@ -213,10 +199,11 @@ define void @large_offset(ptr nocapture %p, i64 %d) nounwind {
 ; CHECK:       # %bb.0: # %entry
 ; CHECK-NEXT:    lui a1, 4
 ; CHECK-NEXT:    add a0, a0, a1
-; CHECK-NEXT:    ld a2, -384(a0)
+; CHECK-NEXT:    lw a2, -384(a0)
+; CHECK-NEXT:    lw a1, -380(a0)
 ; CHECK-NEXT:    addi a2, a2, 1
-; CHECK-NEXT:    seqz a1, a2
-; CHECK-NEXT:    add a3, a3, a1
+; CHECK-NEXT:    seqz a3, a2
+; CHECK-NEXT:    add a3, a1, a3
 ; CHECK-NEXT:    sd a2, -384(a0)
 ; CHECK-NEXT:    ret
 entry:

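The combined condition in the first hunk is a double negative, so here is a standalone sketch of the store policy it implements (the names are illustrative stand-ins, not the actual LLVM code): keep the Zilsd sd only if the store is volatile or stores constant zero, and is sufficiently aligned; otherwise expand to two sw and let ZilsdOptimizer try to re-pair them.

#include <cstdio>

// Illustrative model of the new store-lowering condition; the real code
// queries StoreSDNode and the subtarget, these parameters are stand-ins.
static bool keepAsZilsdStore(bool IsVolatile, bool StoresConstantZero,
                             unsigned AlignBytes, unsigned ZilsdAlignBytes) {
  // Expand (split into two sw) unless the store is volatile or stores 0;
  // a store of 0 can be emitted from the x0 register, so it does not
  // constrain the register allocator.
  if (!(IsVolatile || StoresConstantZero))
    return false;
  // Misaligned stores are always expanded.
  return AlignBytes >= ZilsdAlignBytes;
}

int main() {
  // Non-volatile, non-zero: expanded, ZilsdOptimizer may re-pair it later.
  std::printf("%d\n", keepAsZilsdStore(false, false, 8, 8)); // prints 0
  // Volatile: must stay a single sd.
  std::printf("%d\n", keepAsZilsdStore(true, false, 8, 8));  // prints 1
  // Non-volatile store of 0: kept as sd.
  std::printf("%d\n", keepAsZilsdStore(false, true, 8, 8));  // prints 1
}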
@lenary (Member) commented Nov 25, 2025

Can we hold off on this? I'm not super happy with the idea of splitting up the i64 operations just to attempt to pair them later in the pipeline.

I don't think we've investigated other ways to avoid the additional moves.

I'll work on a patch today to increase the allocation priority of GPRPairs (and their equivalents).

I do see that we're always going to have a problem with i64s passed in the nth argument, where n is odd. There's likely no escaping that, unfortunately.

I still owe you numbers on how bad the additional moves are in real-world code; I don't think I'll get to that this week. I don't know whether you have numbers of your own on that.

@lenary (Member) commented Nov 25, 2025

Sorry, I just caught up with the other patch, which does have more data on the code size impact, showing that this should help with that.

I will drop my concerns for the moment, but I do hope we can revisit this if we find other ways to remove enough of the moves to actually help.

@lenary (Member) left a comment

LGTM

@topperc (Collaborator, Author) commented Nov 25, 2025

@christian-herber-nxp can you recheck the numbers now that I've kept stores of 0?

 ; CHECK-NEXT:    lui a1, 4
 ; CHECK-NEXT:    add a0, a0, a1
-; CHECK-NEXT:    ld a2, -384(a0)
+; CHECK-NEXT:    lw a2, -384(a0)
@topperc (Collaborator, Author) commented:

This is a regression: x10 is already allocated. We pick x11 for the odd register first, but it can't pair because its even partner, x10, is already used. So we pick x12 for the even register, but never revisit the odd choice.
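A toy walk-through of that ordering (purely illustrative; this is not how the greedy allocator is actually implemented):

#include <cstdio>

// Register numbers 10..13 stand for x10..x13 (a0..a3).
int main() {
  bool Used[32] = {};
  Used[10] = true; // x10 already holds the pointer argument.

  // The odd half is assigned first; x11 is the first free register.
  int Odd = 11;
  if (Used[Odd - 1]) {
    // Pairing x11 would need its even partner x10, which is taken, so the
    // even half falls back to the next free register, x12, and the earlier
    // choice of x11 is never revisited. x12/x11 is not a consecutive
    // even/odd pair, so no ld can be formed.
    int Even = 12;
    std::printf("odd half: x%d, even half: x%d -> unpaired\n", Odd, Even);
  }
}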

@lenary (Member) commented Nov 25, 2025

I had a thought at lunch, which might be helpful.

What if we codegen the non-volatile loads/stores to PseudoLD_RV32_OPT/PseudoSD_RV32_OPT? That keeps things together as long as possible, but also allows the compiler to split them later if needed. Would this make any sense?

lenary added a commit to lenary/llvm-project that referenced this pull request Nov 25, 2025
This is proposed as an alternative to llvm#169529. The idea here is during
selection, to choose between directly generating `LD`/`SD` or generating
`PseudoLD_RV32_OPT`/`PseudoSD_RV32_OPT` based on the volatility of the
access.

Volatile operations will always become `LD`/`SD`, but non-volatile
operations have a chance of becoming a pair of `LW`/`SW` depending on
the register allocation, which might save some `MV` instructions.

The advantage of this approach is that we don't need to go searching for
instructions to pair (including comparing their memory operands) in the
pre-ra pass, we already know these are paired, but they don't constrain
the register allocator, unlike `LD`/`SD`.

This PR is maybe not enough - we probably have to check that the passes
between ISel and the Pre-RA Load/Store Pairing pass cope with this
correctly.

This also fixes a verifier error with the kill flags.
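As a minimal sketch of that policy (assuming the pseudos behave as the commit message describes; this is not the code from the referenced commit):

#include <cstdio>

// Opcode names mirror the pseudos mentioned above; everything else here is
// a stand-in for the real SelectionDAG machinery.
enum class StoreOpc { SD, PseudoSD_RV32_OPT };

static StoreOpc selectZilsdStore(bool IsVolatile) {
  // A volatile access must stay a single sd. A non-volatile access becomes
  // the _OPT pseudo: the halves stay together through the pipeline, but a
  // later pass may split them back into sw/sw if the allocator could not
  // form a consecutive even/odd pair, saving mv instructions.
  return IsVolatile ? StoreOpc::SD : StoreOpc::PseudoSD_RV32_OPT;
}

int main() {
  std::printf("volatile -> %s\n", selectZilsdStore(true) == StoreOpc::SD
                                      ? "SD" : "PseudoSD_RV32_OPT");
  std::printf("non-volatile -> %s\n", selectZilsdStore(false) == StoreOpc::SD
                                          ? "SD" : "PseudoSD_RV32_OPT");
}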
Comment on lines +48 to +51
; CHECK-NEXT: sw a1, 0(a0)
; CHECK-NEXT: sw a2, 4(a0)
; CHECK-NEXT: sw a1, 88(a0)
; CHECK-NEXT: sw a2, 92(a0)

A Member commented:

Curious: is this also considered a regression in terms of latency, since it turns 2 stores into 4 stores? Or does it not matter, since the latency of sd is 2 * sw?

@christian-herber-nxp

> @christian-herber-nxp can you recheck the numbers now that I've kept stores of 0?

-Oz, -march=rv32imb_zce_zdinx_zilsd_zclsd

embench-iot 1.0:

[image: embench-iot 1.0 code size results]

eembc:

[image: EEMBC code size results]

It is mostly flat, with a big code size expansion on cubic (https://github.com/embench/embench-iot/tree/master/src/cubic).

I will also build #169580 to compare. I do not have a fast build machine, so this will take a couple of hours.

@christian-herber-nxp

embench-iot 1.0:

[image: embench-iot 1.0 comparison results]

The cubic benchmark seems to be exposing a weakness in both approaches. Might be worthwhile to dive deeper into that.

eembc:

[image: EEMBC comparison results]
