[AArch64] Replace expensive move from wzr by two moves via floating point immediate #146538
Conversation
@llvm/pr-subscribers-backend-aarch64

Author: Julian Nagele (juliannagele)

Changes:

We've noticed that inserting 0 into a known vector lane is implemented via a move from wzr, i.e., moving between register banks. We think it will be cheaper (and have seen improvements on our benchmarks) to materialize 0 into a floating point register and insert from there.

Full diff: https://github.com/llvm/llvm-project/pull/146538.diff

2 Files Affected:
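For concreteness, here is a minimal sketch of the transformation, distilled from the updated tests in the diff below (the function name is illustrative):

define <4 x float> @insert_zero_lane(<4 x float> %v) {
  %r = insertelement <4 x float> %v, float 0.000000e+00, i32 3
  ret <4 x float> %r
}

; before: insert 0 via a cross-bank move from the GPR zero register
;   mov.s   v0[3], wzr
; after: materialize 0 in a SIMD register, then move lane-to-lane
;   movi.2d v1, #0000000000000000
;   mov.s   v0[3], v1[0]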
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.td b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
index f01b634f44ba6..347f7d4d3c1a3 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -7356,16 +7356,10 @@ def : Pat<(v4f16 (vector_insert (v4f16 V64:$Rn),
(i64 0)),
dsub)>;
-def : Pat<(vector_insert (v8f16 V128:$Rn), (f16 fpimm0), (i64 VectorIndexH:$imm)),
- (INSvi16gpr V128:$Rn, VectorIndexH:$imm, WZR)>;
def : Pat<(vector_insert (v4f16 V64:$Rn), (f16 fpimm0), (i64 VectorIndexH:$imm)),
(EXTRACT_SUBREG (INSvi16gpr (v8f16 (INSERT_SUBREG (v8f16 (IMPLICIT_DEF)), V64:$Rn, dsub)), VectorIndexH:$imm, WZR), dsub)>;
-def : Pat<(vector_insert (v4f32 V128:$Rn), (f32 fpimm0), (i64 VectorIndexS:$imm)),
- (INSvi32gpr V128:$Rn, VectorIndexS:$imm, WZR)>;
def : Pat<(vector_insert (v2f32 V64:$Rn), (f32 fpimm0), (i64 VectorIndexS:$imm)),
(EXTRACT_SUBREG (INSvi32gpr (v4f32 (INSERT_SUBREG (v4f32 (IMPLICIT_DEF)), V64:$Rn, dsub)), VectorIndexS:$imm, WZR), dsub)>;
-def : Pat<(vector_insert v2f64:$Rn, (f64 fpimm0), (i64 VectorIndexD:$imm)),
- (INSvi64gpr V128:$Rn, VectorIndexS:$imm, XZR)>;
def : Pat<(v8f16 (vector_insert (v8f16 V128:$Rn),
(f16 FPR16:$Rm), (i64 VectorIndexH:$imm))),
@@ -8035,6 +8029,18 @@ def MOVIv2d_ns : SIMDModifiedImmVectorNoShift<1, 1, 0, 0b1110, V128,
"movi", ".2d",
[(set (v2i64 V128:$Rd), (AArch64movi_edit imm0_255:$imm8))]>;
+def : Pat<(vector_insert (v8f16 V128:$Rn), (f16 fpimm0), (i64 VectorIndexH:$imm)),
+ (INSvi16lane V128:$Rn, VectorIndexH:$imm,
+ (v8f16 (MOVIv2d_ns (i32 0))), (i64 0))>;
+
+def : Pat<(vector_insert (v4f32 V128:$Rn), (f32 fpimm0), (i64 VectorIndexS:$imm)),
+ (INSvi32lane V128:$Rn, VectorIndexS:$imm,
+ (v4f32 (MOVIv2d_ns (i32 0))), (i64 0))>;
+
+def : Pat<(vector_insert (v2f64 V128:$Rn), (f64 fpimm0), (i64 VectorIndexD:$imm)),
+ (INSvi64lane V128:$Rn, VectorIndexD:$imm,
+ (v2f64 (MOVIv2d_ns (i32 0))), (i64 0))>;
+
let Predicates = [HasNEON] in {
def : Pat<(v2i64 immAllZerosV), (MOVIv2d_ns (i32 0))>;
def : Pat<(v4i32 immAllZerosV), (MOVIv2d_ns (i32 0))>;
diff --git a/llvm/test/CodeGen/AArch64/arm64-vector-insertion.ll b/llvm/test/CodeGen/AArch64/arm64-vector-insertion.ll
index ff28c7817d143..3638c1b5005ad 100644
--- a/llvm/test/CodeGen/AArch64/arm64-vector-insertion.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-vector-insertion.ll
@@ -172,8 +172,9 @@ define <8 x half> @test_insert_v8f16_insert_1(half %a) {
; CHECK-LABEL: test_insert_v8f16_insert_1:
; CHECK: // %bb.0:
; CHECK-NEXT: // kill: def $h0 killed $h0 def $q0
+; CHECK-NEXT: movi.2d v1, #0000000000000000
; CHECK-NEXT: dup.8h v0, v0[0]
-; CHECK-NEXT: mov.h v0[7], wzr
+; CHECK-NEXT: mov.h v0[7], v1[0]
; CHECK-NEXT: ret
%v.0 = insertelement <8 x half> <half undef, half undef, half undef, half undef, half undef, half undef, half undef, half 0.0>, half %a, i32 0
%v.1 = insertelement <8 x half> %v.0, half %a, i32 1
@@ -278,8 +279,9 @@ define <4 x float> @test_insert_3_f32_undef_zero_vector(float %a) {
; CHECK-LABEL: test_insert_3_f32_undef_zero_vector:
; CHECK: // %bb.0:
; CHECK-NEXT: // kill: def $s0 killed $s0 def $q0
+; CHECK-NEXT: movi.2d v1, #0000000000000000
; CHECK-NEXT: dup.4s v0, v0[0]
-; CHECK-NEXT: mov.s v0[3], wzr
+; CHECK-NEXT: mov.s v0[3], v1[0]
; CHECK-NEXT: ret
%v.0 = insertelement <4 x float> <float undef, float undef, float undef, float 0.000000e+00>, float %a, i32 0
%v.1 = insertelement <4 x float> %v.0, float %a, i32 1
@@ -362,7 +364,8 @@ define <4 x half> @test_insert_v4f16_f16_zero(<4 x half> %a) {
define <8 x half> @test_insert_v8f16_f16_zero(<8 x half> %a) {
; CHECK-LABEL: test_insert_v8f16_f16_zero:
; CHECK: // %bb.0:
-; CHECK-NEXT: mov.h v0[6], wzr
+; CHECK-NEXT: movi.2d v1, #0000000000000000
+; CHECK-NEXT: mov.h v0[6], v1[0]
; CHECK-NEXT: ret
%v.0 = insertelement <8 x half> %a, half 0.000000e+00, i32 6
ret <8 x half> %v.0
@@ -382,7 +385,8 @@ define <2 x float> @test_insert_v2f32_f32_zero(<2 x float> %a) {
define <4 x float> @test_insert_v4f32_f32_zero(<4 x float> %a) {
; CHECK-LABEL: test_insert_v4f32_f32_zero:
; CHECK: // %bb.0:
-; CHECK-NEXT: mov.s v0[3], wzr
+; CHECK-NEXT: movi.2d v1, #0000000000000000
+; CHECK-NEXT: mov.s v0[3], v1[0]
; CHECK-NEXT: ret
%v.0 = insertelement <4 x float> %a, float 0.000000e+00, i32 3
ret <4 x float> %v.0
@@ -391,7 +395,8 @@ define <4 x float> @test_insert_v4f32_f32_zero(<4 x float> %a) {
define <2 x double> @test_insert_v2f64_f64_zero(<2 x double> %a) {
; CHECK-LABEL: test_insert_v2f64_f64_zero:
; CHECK: // %bb.0:
-; CHECK-NEXT: mov.d v0[1], xzr
+; CHECK-NEXT: movi.2d v1, #0000000000000000
+; CHECK-NEXT: mov.d v0[1], v1[0]
; CHECK-NEXT: ret
%v.0 = insertelement <2 x double> %a, double 0.000000e+00, i32 1
ret <2 x double> %v.0
I think in theory this shouldn't be much of an improvement, if the existing operation is already split into micro-ops and one of them is essentially the new instruction. In practice it is probably a good idea, though.
There are certain CPUs where it will not be helpful, however. Do you mind adding a tuning feature that keeps the old patterns for cortex-a53, a55, a510, a520 and a310? Those are all in-order cores, and for them this will just be extra instructions.
def : Pat<(vector_insert (v8f16 V128:$Rn), (f16 fpimm0), (i64 VectorIndexH:$imm)),
          (INSvi16lane V128:$Rn, VectorIndexH:$imm,
                       (v8f16 (MOVIv2d_ns (i32 0))), (i64 0))>;

def : Pat<(vector_insert (v4f32 V128:$Rn), (f32 fpimm0), (i64 VectorIndexS:$imm)),
          (INSvi32lane V128:$Rn, VectorIndexS:$imm,
                       (v4f32 (MOVIv2d_ns (i32 0))), (i64 0))>;

def : Pat<(vector_insert (v2f64 V128:$Rn), (f64 fpimm0), (i64 VectorIndexD:$imm)),
          (INSvi64lane V128:$Rn, VectorIndexD:$imm,
                       (v2f64 (MOVIv2d_ns (i32 0))), (i64 0))>;
Does it need these new patterns? Or is the codegen without any pattern already OK?
Hm, sorry, I'm not sure I understand; without those patterns we get the move from wzr (which we want to avoid).
I mean: if we remove the INS wzr patterns above, do we need the new patterns, or is that already handled by the existing "insert" and "zero a register" patterns?
Ohh, I see, it is, yes :)
Sure, do you maybe have an example I can work off of? I'm not sure how to go about this.
Sure, do you maybe have an example I can work off of? I'm not sure how to go about this.
I think that DisableFastIncVL/HasFastIncVL shows how to add a subtarget feature and make patterns dependent on it. The old patterns can be made dependent on it, I believe, and it can be added to the processors that would not benefit. (FastIncVL is perhaps a bit odd, as it doesn't appear to be used anywhere.)
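For illustration, a rough sketch of that approach, following the style of existing AArch64 subtarget features; the feature and predicate names here are hypothetical, not necessarily what the patch will end up using:

// Hypothetical tuning feature: CPUs with this feature keep the old
// INS-from-WZR patterns instead of materializing 0 in a SIMD register first.
def FeatureUseWzrToVecMove
    : SubtargetFeature<"use-wzr-to-vec-move", "UseWzrToVecMove", "true",
                       "Use a move from WZR/XZR to insert 0 into a vector lane">;

// Predicate the old patterns can be gated on (assumes the auto-generated
// Subtarget accessor for the field above).
def UseWzrToVecMove : Predicate<"Subtarget->useWzrToVecMove()">;

The feature would then be added to the tuning-feature lists of the in-order cores named above.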
I mean: if we remove the INS wzr patterns above, do we need the new patterns, or is that already handled by the existing "insert" and "zero a register" patterns?
Instead of introducing new patterns, guard existing ones by a tuning feature
Thanks! And you were right about not needing new patterns, so I've removed the ones I added and put the existing wzr patterns behind the new tuning feature.
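Concretely, the end state would presumably look something like this sketch (predicate name illustrative, as above): the original GPR pattern from the diff stays, gated on the tuning feature, and likewise for the v8f16 and v2f64 variants.

let Predicates = [UseWzrToVecMove] in {
  def : Pat<(vector_insert (v4f32 V128:$Rn), (f32 fpimm0), (i64 VectorIndexS:$imm)),
            (INSvi32gpr V128:$Rn, VectorIndexS:$imm, WZR)>;
}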
LGTM, thanks
[AArch64] Replace expensive move from wzr by two moves via floating point immediate (#146538)

We've noticed that inserting 0 into a known vector lane is implemented via a move from wzr, i.e., moving between register banks. We think it will be cheaper (and have seen improvements on our benchmarks) to materialize 0 into a floating point register and insert from there.

PR: llvm/llvm-project#146538
/cherry-pick e333d60
/pull-request #150001