[RISCV] Only convert volatile i64 load/store to Zilsd in SelectionDAG. #169529
Conversation
Rely on ZilsdOptimizer for non-volatile loads/stores. This can avoid extra moves. Also use Zilsd for non-volatile stores of 0.
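For reference, the new store-side condition condensed into a standalone helper. The helper name and free-function form are illustrative only; in the patch the check is written inline in RISCVTargetLowering::LowerOperation (see the diff below).

```cpp
#include "llvm/CodeGen/SelectionDAGNodes.h"
#include "llvm/Support/Alignment.h"

using namespace llvm;

// Keep the i64 store as a single Zilsd SD only when the access is volatile
// (it must not be split) or when the stored value is constant 0 (the X0
// pair, which does not constrain the register allocator). Misaligned stores
// and all other non-volatile stores are expanded to two i32 stores and left
// for the ZilsdOptimizer to pair later.
static bool keepStoreAsZilsdSD(const StoreSDNode *Store, SDValue StoredVal,
                               Align ZilsdAlign) {
  if (Store->getAlign() < ZilsdAlign)
    return false;
  if (Store->isVolatile())
    return true;
  if (auto *C = dyn_cast<ConstantSDNode>(StoredVal))
    return C->isZero();
  return false;
}
```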
@llvm/pr-subscribers-backend-risc-v

Author: Craig Topper (topperc)

Changes: Rely on ZilsdOptimizer for non-volatile loads/stores. This can avoid extra moves. Also use Zilsd for non-volatile stores of 0.

Full diff: https://github.com/llvm/llvm-project/pull/169529.diff

2 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 19a16197272fe..de60e8461e5d7 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -8404,7 +8404,12 @@ SDValue RISCVTargetLowering::LowerOperation(SDValue Op,
if (Store->isTruncatingStore())
return SDValue();
- if (Store->getAlign() < Subtarget.getZilsdAlign())
+ // Expand non-volatile or misaligned stores.
+ // Keep stores of 0 since that doesn't constrain the register allocator.
+ if (!(Store->isVolatile() ||
+ (isa<ConstantSDNode>(StoredVal) &&
+ cast<ConstantSDNode>(StoredVal)->isZero())) ||
+ Store->getAlign() < Subtarget.getZilsdAlign())
return SDValue();
SDLoc DL(Op);
@@ -14803,7 +14808,7 @@ void RISCVTargetLowering::ReplaceNodeResults(SDNode *N,
assert(Subtarget.hasStdExtZilsd() && !Subtarget.is64Bit() &&
"Unexpected custom legalisation");
- if (Ld->getAlign() < Subtarget.getZilsdAlign())
+ if (!Ld->isVolatile() || Ld->getAlign() < Subtarget.getZilsdAlign())
return;
SDLoc DL(N);
diff --git a/llvm/test/CodeGen/RISCV/zilsd.ll b/llvm/test/CodeGen/RISCV/zilsd.ll
index 27b1ff76f6f05..427b1d83b9d8e 100644
--- a/llvm/test/CodeGen/RISCV/zilsd.ll
+++ b/llvm/test/CodeGen/RISCV/zilsd.ll
@@ -9,9 +9,10 @@
define i64 @load(ptr %a) nounwind {
; CHECK-LABEL: load:
; CHECK: # %bb.0:
-; CHECK-NEXT: mv a2, a0
-; CHECK-NEXT: ld a0, 80(a0)
-; CHECK-NEXT: ld zero, 0(a2)
+; CHECK-NEXT: lw a2, 80(a0)
+; CHECK-NEXT: lw a1, 84(a0)
+; CHECK-NEXT: ld zero, 0(a0)
+; CHECK-NEXT: mv a0, a2
; CHECK-NEXT: ret
%1 = getelementptr i64, ptr %a, i32 10
%2 = load i64, ptr %1
@@ -44,10 +45,10 @@ define i64 @load_align4(ptr %a) nounwind {
define void @store(ptr %a, i64 %b) nounwind {
; CHECK-LABEL: store:
; CHECK: # %bb.0:
-; CHECK-NEXT: mv a3, a2
-; CHECK-NEXT: mv a2, a1
-; CHECK-NEXT: sd a2, 0(a0)
-; CHECK-NEXT: sd a2, 88(a0)
+; CHECK-NEXT: sw a1, 0(a0)
+; CHECK-NEXT: sw a2, 4(a0)
+; CHECK-NEXT: sw a1, 88(a0)
+; CHECK-NEXT: sw a2, 92(a0)
; CHECK-NEXT: ret
store i64 %b, ptr %a
%1 = getelementptr i64, ptr %a, i32 11
@@ -56,25 +57,11 @@ define void @store(ptr %a, i64 %b) nounwind {
}
define void @store_align4(ptr %a, i64 %b) nounwind {
-; SLOW-LABEL: store_align4:
-; SLOW: # %bb.0:
-; SLOW-NEXT: sw a1, 88(a0)
-; SLOW-NEXT: sw a2, 92(a0)
-; SLOW-NEXT: ret
-;
-; FAST-LABEL: store_align4:
-; FAST: # %bb.0:
-; FAST-NEXT: mv a3, a2
-; FAST-NEXT: mv a2, a1
-; FAST-NEXT: sd a2, 88(a0)
-; FAST-NEXT: ret
-;
-; 4BYTEALIGN-LABEL: store_align4:
-; 4BYTEALIGN: # %bb.0:
-; 4BYTEALIGN-NEXT: mv a3, a2
-; 4BYTEALIGN-NEXT: mv a2, a1
-; 4BYTEALIGN-NEXT: sd a2, 88(a0)
-; 4BYTEALIGN-NEXT: ret
+; CHECK-LABEL: store_align4:
+; CHECK: # %bb.0:
+; CHECK-NEXT: sw a1, 88(a0)
+; CHECK-NEXT: sw a2, 92(a0)
+; CHECK-NEXT: ret
%1 = getelementptr i64, ptr %a, i32 11
store i64 %b, ptr %1, align 4
ret void
@@ -158,9 +145,8 @@ define void @store_unaligned(ptr %p, i64 %v) {
;
; FAST-LABEL: store_unaligned:
; FAST: # %bb.0:
-; FAST-NEXT: mv a3, a2
-; FAST-NEXT: mv a2, a1
-; FAST-NEXT: sd a2, 0(a0)
+; FAST-NEXT: sw a1, 0(a0)
+; FAST-NEXT: sw a2, 4(a0)
; FAST-NEXT: ret
;
; 4BYTEALIGN-LABEL: store_unaligned:
@@ -213,10 +199,11 @@ define void @large_offset(ptr nocapture %p, i64 %d) nounwind {
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: lui a1, 4
; CHECK-NEXT: add a0, a0, a1
-; CHECK-NEXT: ld a2, -384(a0)
+; CHECK-NEXT: lw a2, -384(a0)
+; CHECK-NEXT: lw a1, -380(a0)
; CHECK-NEXT: addi a2, a2, 1
-; CHECK-NEXT: seqz a1, a2
-; CHECK-NEXT: add a3, a3, a1
+; CHECK-NEXT: seqz a3, a2
+; CHECK-NEXT: add a3, a1, a3
; CHECK-NEXT: sd a2, -384(a0)
; CHECK-NEXT: ret
entry:
Can we hold off on this? I'm not super happy with the idea of splitting up the i64 operations just to attempt to pair them later in the pipeline. I don't think we've investigated other ways to avoid the additional moves. I'll work on a patch today to increase the allocation priority of GPRPairs (and their equivalents). I do see that we're always going to have a problem with i64s passed in the nth argument, where n is odd. There's likely no escaping that, unfortunately. I still owe you numbers on how bad the additional moves are in real-world code. I don't think I'll get to that this week. I don't know if you got numbers of your own on that.
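To illustrate the odd-argument case mentioned above, a hypothetical example (not from the patch or its test suite): with the ILP32 ABI on RV32, `x` is passed in a0 and `d` in a1/a2, so storing `d` with a Zilsd `sd` first needs `mv` instructions to build an even/odd register pair, whereas two `sw` instructions can use a1 and a2 directly.

```cpp
// Hypothetical source, for illustration only. With ILP32, 'x' lands in a0
// and 'd' in a1/a2. Zilsd's sd requires an even/odd register pair, so using
// sd here means copying d into such a pair first; lowering to sw/sw lets
// the incoming a1/a2 be stored as-is.
extern "C" void store_second_arg(int x, long long d, long long *p) {
  *p = d;
}
```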
Sorry, I just caught up with the other patch, which does have more data on the code size impact and shows that this should help. I will drop my concerns for the moment, but I do hope we can bring this back if we find other ways to remove the moves that actually help enough.
lenary left a comment:
LGTM
@christian-herber-nxp can you recheck the numbers, since I've kept stores of 0 now?
; CHECK-NEXT: lui a1, 4
; CHECK-NEXT: add a0, a0, a1
; CHECK-NEXT: ld a2, -384(a0)
; CHECK-NEXT: lw a2, -384(a0)
This is a regression. x10 is already allocated. We pick x11 for the odd register first, but this can't pair since x10 is already used. So we pick x12 for the even but don't revisit the odd.
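A toy model of the ordering problem described above, purely illustrative; this is not how the allocator or the pairing pass is actually implemented.

```cpp
#include <set>

// Toy model only: the odd/high half is assigned first and gets x11; forming
// a Zilsd pair would need x10 for the even/low half, but x10 already holds
// the pointer, so the low half falls back to x12 and the odd half is never
// revisited (x13 would have completed a {x12, x13} pair).
struct HalfAssignment {
  unsigned EvenLo, OddHi;
  bool Paired;
};

HalfAssignment assignI64Halves(const std::set<unsigned> &LiveRegs) {
  unsigned OddHi = 11;                // first free register: x11
  if (!LiveRegs.count(OddHi - 1))     // is x10 free to complete a pair?
    return {OddHi - 1, OddHi, true};  // yes: a single ld into {x10, x11}
  unsigned EvenLo = 12;               // no: take the next free register, x12
  return {EvenLo, OddHi, false};      // {x12, x11} cannot pair -> lw + lw
}
```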
I had a thought at lunch, which might be helpful. What if we codegen the non-volatile loads/stores to paired pseudos and let a later pass decide whether to keep them as LD/SD?
This is proposed as an alternative to llvm#169529. The idea here is, during selection, to choose between directly generating `LD`/`SD` or generating `PseudoLD_RV32_OPT`/`PseudoSD_RV32_OPT` based on the volatility of the access. Volatile operations will always become `LD`/`SD`, but non-volatile operations have a chance of becoming a pair of `LW`/`SW` depending on the register allocation, which might save some `MV` instructions. The advantage of this approach is that we don't need to go searching for instructions to pair (including comparing their memory operands) in the pre-RA pass; we already know these are paired, but they don't constrain the register allocator the way `LD`/`SD` do. This PR is maybe not enough - we probably need to check that the passes between ISel and the pre-RA load/store pairing pass cope with this correctly. This also fixes a verifier error with the kill flags.
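A minimal sketch of the selection decision being proposed. The pseudo name is taken from the description above; the enum and function below are illustrative only and are not code from either PR.

```cpp
// Illustrative only. Volatile i64 loads must stay a single LD; non-volatile
// ones become a paired pseudo that the pre-RA pairing pass can keep as LD or
// split into two LWs, depending on how register allocation turns out.
enum class I64LoadLowering { LD, PseudoLD_RV32_OPT };

I64LoadLowering selectI64Load(bool IsVolatile) {
  if (IsVolatile)
    return I64LoadLowering::LD;
  return I64LoadLowering::PseudoLD_RV32_OPT;
}
```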
; CHECK-NEXT: sw a1, 0(a0)
; CHECK-NEXT: sw a2, 4(a0)
; CHECK-NEXT: sw a1, 88(a0)
; CHECK-NEXT: sw a2, 92(a0)
Curious: is this also considered a regression in terms of latency, since it turns 2 stores into 4 stores? Or does it not matter, since the latency of sd is 2 * sw?
Oz, march=rv32imb_zce_zdinx_zilsd_zclsd, embench-iot 1.0 and eembc results (charts omitted):
It is mostly flat, with a big code size expansion on cubic (https://github.com/embench/embench-iot/tree/master/src/cubic). I will also build #169580 to compare. I do not have a fast build machine, so this will take a couple of hours.