
[AArch64] Replace expensive move from wzr by two moves via floating point immediate #146538

Merged
merged 5 commits into llvm:main on Jul 16, 2025

Conversation

@juliannagele (Member)

We've noticed that inserting 0 into a known vector lane is implemented via a move from wzr, i.e., moving between register banks. We think it will be cheaper (and have seen improvements on our benchmarks) to materialize 0 into a floating point register and insert from there.
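To make the change concrete, here is the before/after codegen for one of the updated tests below (inserting 0.0 into lane 3 of a <4 x float>). The first sequence moves from the general-purpose zero register across register banks; the second materializes zero in a SIMD register and performs a lane-to-lane move:

// before: GPR-to-vector move from the zero register
mov.s   v0[3], wzr

// after: materialize 0 in a vector register, then insert lane-to-lane
movi.2d v1, #0000000000000000
mov.s   v0[3], v1[0]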

@llvmbot (Member) commented Jul 1, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Julian Nagele (juliannagele)

Changes

We've noticed that inserting 0 into a known vector lane is implemented via a move from wzr, i.e., moving between register banks. We think it will be cheaper (and have seen improvements on our benchmarks) to materialize 0 into a floating point register and insert from there.


Full diff: https://github.com/llvm/llvm-project/pull/146538.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64InstrInfo.td (+12-6)
  • (modified) llvm/test/CodeGen/AArch64/arm64-vector-insertion.ll (+10-5)
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.td b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
index f01b634f44ba6..347f7d4d3c1a3 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -7356,16 +7356,10 @@ def : Pat<(v4f16 (vector_insert (v4f16 V64:$Rn),
               (i64 0)),
             dsub)>;
 
-def : Pat<(vector_insert (v8f16 V128:$Rn), (f16 fpimm0), (i64 VectorIndexH:$imm)),
-          (INSvi16gpr V128:$Rn, VectorIndexH:$imm, WZR)>;
 def : Pat<(vector_insert (v4f16 V64:$Rn), (f16 fpimm0), (i64 VectorIndexH:$imm)),
           (EXTRACT_SUBREG (INSvi16gpr (v8f16 (INSERT_SUBREG (v8f16 (IMPLICIT_DEF)), V64:$Rn, dsub)), VectorIndexH:$imm, WZR), dsub)>;
-def : Pat<(vector_insert (v4f32 V128:$Rn), (f32 fpimm0), (i64 VectorIndexS:$imm)),
-          (INSvi32gpr V128:$Rn, VectorIndexS:$imm, WZR)>;
 def : Pat<(vector_insert (v2f32 V64:$Rn), (f32 fpimm0), (i64 VectorIndexS:$imm)),
           (EXTRACT_SUBREG (INSvi32gpr (v4f32 (INSERT_SUBREG (v4f32 (IMPLICIT_DEF)), V64:$Rn, dsub)), VectorIndexS:$imm, WZR), dsub)>;
-def : Pat<(vector_insert v2f64:$Rn, (f64 fpimm0), (i64 VectorIndexD:$imm)),
-          (INSvi64gpr V128:$Rn, VectorIndexS:$imm, XZR)>;
 
 def : Pat<(v8f16 (vector_insert (v8f16 V128:$Rn),
             (f16 FPR16:$Rm), (i64 VectorIndexH:$imm))),
@@ -8035,6 +8029,18 @@ def MOVIv2d_ns   : SIMDModifiedImmVectorNoShift<1, 1, 0, 0b1110, V128,
                                                 "movi", ".2d",
                    [(set (v2i64 V128:$Rd), (AArch64movi_edit imm0_255:$imm8))]>;
 
+def : Pat<(vector_insert (v8f16 V128:$Rn), (f16 fpimm0), (i64 VectorIndexH:$imm)),
+          (INSvi16lane V128:$Rn, VectorIndexH:$imm,
+          (v8f16 (MOVIv2d_ns (i32 0))), (i64 0))>;
+
+def : Pat<(vector_insert (v4f32 V128:$Rn), (f32 fpimm0), (i64 VectorIndexS:$imm)),
+          (INSvi32lane V128:$Rn, VectorIndexS:$imm,
+          (v4f32 (MOVIv2d_ns (i32 0))), (i64 0))>;
+
+def : Pat<(vector_insert (v2f64 V128:$Rn), (f64 fpimm0), (i64 VectorIndexD:$imm)),
+          (INSvi64lane V128:$Rn, VectorIndexD:$imm,
+          (v2f64 (MOVIv2d_ns (i32 0))), (i64 0))>;
+
 let Predicates = [HasNEON] in {
 def : Pat<(v2i64 immAllZerosV), (MOVIv2d_ns (i32 0))>;
 def : Pat<(v4i32 immAllZerosV), (MOVIv2d_ns (i32 0))>;
diff --git a/llvm/test/CodeGen/AArch64/arm64-vector-insertion.ll b/llvm/test/CodeGen/AArch64/arm64-vector-insertion.ll
index ff28c7817d143..3638c1b5005ad 100644
--- a/llvm/test/CodeGen/AArch64/arm64-vector-insertion.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-vector-insertion.ll
@@ -172,8 +172,9 @@ define <8 x half> @test_insert_v8f16_insert_1(half %a) {
 ; CHECK-LABEL: test_insert_v8f16_insert_1:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    // kill: def $h0 killed $h0 def $q0
+; CHECK-NEXT:    movi.2d v1, #0000000000000000
 ; CHECK-NEXT:    dup.8h v0, v0[0]
-; CHECK-NEXT:    mov.h v0[7], wzr
+; CHECK-NEXT:    mov.h v0[7], v1[0]
 ; CHECK-NEXT:    ret
   %v.0 = insertelement <8 x half> <half undef, half undef, half undef, half undef, half undef, half undef, half undef, half 0.0>, half %a, i32 0
   %v.1 = insertelement <8 x half> %v.0, half %a, i32 1
@@ -278,8 +279,9 @@ define <4 x float> @test_insert_3_f32_undef_zero_vector(float %a) {
 ; CHECK-LABEL: test_insert_3_f32_undef_zero_vector:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    // kill: def $s0 killed $s0 def $q0
+; CHECK-NEXT:    movi.2d v1, #0000000000000000
 ; CHECK-NEXT:    dup.4s v0, v0[0]
-; CHECK-NEXT:    mov.s v0[3], wzr
+; CHECK-NEXT:    mov.s v0[3], v1[0]
 ; CHECK-NEXT:    ret
   %v.0 = insertelement <4 x float> <float undef, float undef, float undef, float 0.000000e+00>, float %a, i32 0
   %v.1 = insertelement <4 x float> %v.0, float %a, i32 1
@@ -362,7 +364,8 @@ define <4 x half> @test_insert_v4f16_f16_zero(<4 x half> %a) {
 define <8 x half> @test_insert_v8f16_f16_zero(<8 x half> %a) {
 ; CHECK-LABEL: test_insert_v8f16_f16_zero:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    mov.h v0[6], wzr
+; CHECK-NEXT:    movi.2d v1, #0000000000000000
+; CHECK-NEXT:    mov.h v0[6], v1[0]
 ; CHECK-NEXT:    ret
   %v.0 = insertelement <8 x half> %a, half 0.000000e+00, i32 6
   ret <8 x half> %v.0
@@ -382,7 +385,8 @@ define <2 x float> @test_insert_v2f32_f32_zero(<2 x float> %a) {
 define <4 x float> @test_insert_v4f32_f32_zero(<4 x float> %a) {
 ; CHECK-LABEL: test_insert_v4f32_f32_zero:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    mov.s v0[3], wzr
+; CHECK-NEXT:    movi.2d v1, #0000000000000000
+; CHECK-NEXT:    mov.s v0[3], v1[0]
 ; CHECK-NEXT:    ret
   %v.0 = insertelement <4 x float> %a, float 0.000000e+00, i32 3
   ret <4 x float> %v.0
@@ -391,7 +395,8 @@ define <4 x float> @test_insert_v4f32_f32_zero(<4 x float> %a) {
 define <2 x double> @test_insert_v2f64_f64_zero(<2 x double> %a) {
 ; CHECK-LABEL: test_insert_v2f64_f64_zero:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    mov.d v0[1], xzr
+; CHECK-NEXT:    movi.2d v1, #0000000000000000
+; CHECK-NEXT:    mov.d v0[1], v1[0]
 ; CHECK-NEXT:    ret
   %v.0 = insertelement <2 x double> %a, double 0.000000e+00, i32 1
   ret <2 x double> %v.0

@davemgreen (Collaborator) left a comment

I think that in theory this shouldn't be much of an improvement, if the existing operation is already split and one of the micro-ops is essentially the new instruction. In practice it is probably a good idea though.

There are certain CPUs where it will not be helpful. Do you mind adding a tuning feature that keeps the old patterns for cortex-a53, a55, a510, a520 and a310? Those are all in-order cores, and this will just be extra instructions.

Comment on lines 8032 to 8042
def : Pat<(vector_insert (v8f16 V128:$Rn), (f16 fpimm0), (i64 VectorIndexH:$imm)),
          (INSvi16lane V128:$Rn, VectorIndexH:$imm,
          (v8f16 (MOVIv2d_ns (i32 0))), (i64 0))>;

def : Pat<(vector_insert (v4f32 V128:$Rn), (f32 fpimm0), (i64 VectorIndexS:$imm)),
          (INSvi32lane V128:$Rn, VectorIndexS:$imm,
          (v4f32 (MOVIv2d_ns (i32 0))), (i64 0))>;

def : Pat<(vector_insert (v2f64 V128:$Rn), (f64 fpimm0), (i64 VectorIndexD:$imm)),
          (INSvi64lane V128:$Rn, VectorIndexD:$imm,
          (v2f64 (MOVIv2d_ns (i32 0))), (i64 0))>;
@davemgreen (Collaborator):
Does it need these new patterns? Or is the codegen without any pattern already OK?

@juliannagele (Member, Author):
Hm, sorry, I'm not sure I understand: without those patterns we get the move from wzr (which we want to avoid).

@davemgreen (Collaborator):
I mean: if we remove the INS wzr patterns above, do we need the new patterns, or is that already handled by the existing "insert" and "zero a register" patterns?

@juliannagele (Member, Author):
Ohh, I see, it is, yes :)
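(For context: the diff above already keeps the FPR-based insert patterns and the NEON zeroing patterns, such as this line visible in the hunk at the bottom of AArch64InstrInfo.td:

def : Pat<(v4i32 immAllZerosV), (MOVIv2d_ns (i32 0))>;

so once the GPR/WZR patterns are removed, "zero a register" plus "insert from an FP register" compose into the movi + lane-move sequence without any new patterns.)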

@juliannagele (Member, Author):
> There are certain CPUs where it will not be helpful. Do you mind adding a tuning feature that keeps the old patterns for cortex-a53, a55, a510, a520 and a310? Those are all in-order cores, and this will just be extra instructions.

Sure, do you maybe have an example I can work off of? I'm not sure how to go about this.

@davemgreen (Collaborator) left a comment
> Sure, do you maybe have an example I can work off of? I'm not sure how to go about this.

I think that DisableFastIncVL/HasFastIncVL shows how to add a subtarget feature and make patterns dependent on it. The old patterns can be made dependent on it, I believe, and it can be added to the processors that would not benefit. (FastIncVL is perhaps a bit odd as it doesn't appear to be used anywhere.)
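A rough sketch of that idiom follows; the names are illustrative placeholders modeled on the suggestion above, not necessarily what the patch ends up using. A tuning feature is declared as a SubtargetFeature, exposed as a Predicate, and the old pattern is guarded by it:

// Hypothetical tuning feature, e.g. in AArch64Features.td.
def FeatureUseWzrToVecMove : SubtargetFeature<"use-wzr-to-vec-move",
    "UseWzrToVecMove", "true",
    "Use a move from WZR to insert 0 into a vector lane">;

// Matching predicate for use in instruction-selection patterns.
def UseWzrToVecMove : Predicate<"Subtarget->useWzrToVecMove()">;

// One of the existing GPR-based patterns, now gated on the feature.
let Predicates = [UseWzrToVecMove] in
def : Pat<(vector_insert (v4f32 V128:$Rn), (f32 fpimm0), (i64 VectorIndexS:$imm)),
          (INSvi32gpr V128:$Rn, VectorIndexS:$imm, WZR)>;

The feature would then be listed among the tuning features of the in-order cores named above (in AArch64Processors.td).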

@davemgreen changed the title from "[ISel] Replace expensive move from wzr by two moves via floating point immediate" to "[AArch64] Replace expensive move from wzr by two moves via floating point immediate" on Jul 14, 2025
Instead of introducing new patterns, guard existing ones by tuning feature
@juliannagele (Member, Author):
Thanks! And you were right about not needing new patterns, so I've removed the ones I added, put the existing INS .. WZR ones behind a feature, and added that feature to the CPUs you mentioned.


@davemgreen (Collaborator) left a comment
LGTM, thanks

@fhahn (Contributor) left a comment
LGTM, thanks

@fhahn merged commit e333d60 into llvm:main on Jul 16, 2025
9 checks passed
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Jul 16, 2025
… floating point immediate (#146538)

We've noticed that inserting 0 into a known vector lane is implemented
via a move from wzr, i.e., moving between register banks. We think it
will be cheaper (and have seen improvements on our benchmarks) to
materialize 0 into a floating point register and insert from there.

PR: llvm/llvm-project#146538
@juliannagele added this to the LLVM 21.x Release milestone on Jul 22, 2025
@github-project-automation bot moved this to Needs Triage in LLVM Release Status on Jul 22, 2025
@juliannagele (Member, Author):
/cherry-pick e333d60

@llvmbot (Member) commented Jul 22, 2025

/pull-request #150001

@llvmbot moved this from Needs Triage to Done in LLVM Release Status on Jul 22, 2025