[AArch64] Replace expensive move from wzr by two moves via floating point immediate #146538
Conversation
@llvm/pr-subscribers-backend-aarch64

Author: Julian Nagele (juliannagele)

Changes:

We've noticed that inserting 0 into a known vector lane is implemented via a move from wzr, i.e., moving between register banks. We think it will be cheaper (and have seen improvements on our benchmarks) to materialize 0 into a floating point register and insert from there.

Full diff: https://github.com/llvm/llvm-project/pull/146538.diff

2 Files Affected:
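For concreteness, here is a minimal sketch of the transformation, distilled from the updated tests in the diff below (the function name is illustrative):

define <4 x float> @insert_zero_lane(<4 x float> %v) {
  %r = insertelement <4 x float> %v, float 0.000000e+00, i32 3
  ret <4 x float> %r
}

; before: insert 0 via a cross-bank move from the GPR zero register
;   mov.s   v0[3], wzr
; after: materialize 0 in a SIMD register, then move lane-to-lane
;   movi.2d v1, #0000000000000000
;   mov.s   v0[3], v1[0]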
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.td b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
index f01b634f44ba6..347f7d4d3c1a3 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -7356,16 +7356,10 @@ def : Pat<(v4f16 (vector_insert (v4f16 V64:$Rn),
(i64 0)),
dsub)>;
-def : Pat<(vector_insert (v8f16 V128:$Rn), (f16 fpimm0), (i64 VectorIndexH:$imm)),
- (INSvi16gpr V128:$Rn, VectorIndexH:$imm, WZR)>;
def : Pat<(vector_insert (v4f16 V64:$Rn), (f16 fpimm0), (i64 VectorIndexH:$imm)),
(EXTRACT_SUBREG (INSvi16gpr (v8f16 (INSERT_SUBREG (v8f16 (IMPLICIT_DEF)), V64:$Rn, dsub)), VectorIndexH:$imm, WZR), dsub)>;
-def : Pat<(vector_insert (v4f32 V128:$Rn), (f32 fpimm0), (i64 VectorIndexS:$imm)),
- (INSvi32gpr V128:$Rn, VectorIndexS:$imm, WZR)>;
def : Pat<(vector_insert (v2f32 V64:$Rn), (f32 fpimm0), (i64 VectorIndexS:$imm)),
(EXTRACT_SUBREG (INSvi32gpr (v4f32 (INSERT_SUBREG (v4f32 (IMPLICIT_DEF)), V64:$Rn, dsub)), VectorIndexS:$imm, WZR), dsub)>;
-def : Pat<(vector_insert v2f64:$Rn, (f64 fpimm0), (i64 VectorIndexD:$imm)),
- (INSvi64gpr V128:$Rn, VectorIndexS:$imm, XZR)>;
def : Pat<(v8f16 (vector_insert (v8f16 V128:$Rn),
(f16 FPR16:$Rm), (i64 VectorIndexH:$imm))),
@@ -8035,6 +8029,18 @@ def MOVIv2d_ns : SIMDModifiedImmVectorNoShift<1, 1, 0, 0b1110, V128,
"movi", ".2d",
[(set (v2i64 V128:$Rd), (AArch64movi_edit imm0_255:$imm8))]>;
+def : Pat<(vector_insert (v8f16 V128:$Rn), (f16 fpimm0), (i64 VectorIndexH:$imm)),
+ (INSvi16lane V128:$Rn, VectorIndexH:$imm,
+ (v8f16 (MOVIv2d_ns (i32 0))), (i64 0))>;
+
+def : Pat<(vector_insert (v4f32 V128:$Rn), (f32 fpimm0), (i64 VectorIndexS:$imm)),
+ (INSvi32lane V128:$Rn, VectorIndexS:$imm,
+ (v4f32 (MOVIv2d_ns (i32 0))), (i64 0))>;
+
+def : Pat<(vector_insert (v2f64 V128:$Rn), (f64 fpimm0), (i64 VectorIndexD:$imm)),
+ (INSvi64lane V128:$Rn, VectorIndexD:$imm,
+ (v2f64 (MOVIv2d_ns (i32 0))), (i64 0))>;
+
let Predicates = [HasNEON] in {
def : Pat<(v2i64 immAllZerosV), (MOVIv2d_ns (i32 0))>;
def : Pat<(v4i32 immAllZerosV), (MOVIv2d_ns (i32 0))>;
diff --git a/llvm/test/CodeGen/AArch64/arm64-vector-insertion.ll b/llvm/test/CodeGen/AArch64/arm64-vector-insertion.ll
index ff28c7817d143..3638c1b5005ad 100644
--- a/llvm/test/CodeGen/AArch64/arm64-vector-insertion.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-vector-insertion.ll
@@ -172,8 +172,9 @@ define <8 x half> @test_insert_v8f16_insert_1(half %a) {
; CHECK-LABEL: test_insert_v8f16_insert_1:
; CHECK: // %bb.0:
; CHECK-NEXT: // kill: def $h0 killed $h0 def $q0
+; CHECK-NEXT: movi.2d v1, #0000000000000000
; CHECK-NEXT: dup.8h v0, v0[0]
-; CHECK-NEXT: mov.h v0[7], wzr
+; CHECK-NEXT: mov.h v0[7], v1[0]
; CHECK-NEXT: ret
%v.0 = insertelement <8 x half> <half undef, half undef, half undef, half undef, half undef, half undef, half undef, half 0.0>, half %a, i32 0
%v.1 = insertelement <8 x half> %v.0, half %a, i32 1
@@ -278,8 +279,9 @@ define <4 x float> @test_insert_3_f32_undef_zero_vector(float %a) {
; CHECK-LABEL: test_insert_3_f32_undef_zero_vector:
; CHECK: // %bb.0:
; CHECK-NEXT: // kill: def $s0 killed $s0 def $q0
+; CHECK-NEXT: movi.2d v1, #0000000000000000
; CHECK-NEXT: dup.4s v0, v0[0]
-; CHECK-NEXT: mov.s v0[3], wzr
+; CHECK-NEXT: mov.s v0[3], v1[0]
; CHECK-NEXT: ret
%v.0 = insertelement <4 x float> <float undef, float undef, float undef, float 0.000000e+00>, float %a, i32 0
%v.1 = insertelement <4 x float> %v.0, float %a, i32 1
@@ -362,7 +364,8 @@ define <4 x half> @test_insert_v4f16_f16_zero(<4 x half> %a) {
define <8 x half> @test_insert_v8f16_f16_zero(<8 x half> %a) {
; CHECK-LABEL: test_insert_v8f16_f16_zero:
; CHECK: // %bb.0:
-; CHECK-NEXT: mov.h v0[6], wzr
+; CHECK-NEXT: movi.2d v1, #0000000000000000
+; CHECK-NEXT: mov.h v0[6], v1[0]
; CHECK-NEXT: ret
%v.0 = insertelement <8 x half> %a, half 0.000000e+00, i32 6
ret <8 x half> %v.0
@@ -382,7 +385,8 @@ define <2 x float> @test_insert_v2f32_f32_zero(<2 x float> %a) {
define <4 x float> @test_insert_v4f32_f32_zero(<4 x float> %a) {
; CHECK-LABEL: test_insert_v4f32_f32_zero:
; CHECK: // %bb.0:
-; CHECK-NEXT: mov.s v0[3], wzr
+; CHECK-NEXT: movi.2d v1, #0000000000000000
+; CHECK-NEXT: mov.s v0[3], v1[0]
; CHECK-NEXT: ret
%v.0 = insertelement <4 x float> %a, float 0.000000e+00, i32 3
ret <4 x float> %v.0
@@ -391,7 +395,8 @@ define <4 x float> @test_insert_v4f32_f32_zero(<4 x float> %a) {
define <2 x double> @test_insert_v2f64_f64_zero(<2 x double> %a) {
; CHECK-LABEL: test_insert_v2f64_f64_zero:
; CHECK: // %bb.0:
-; CHECK-NEXT: mov.d v0[1], xzr
+; CHECK-NEXT: movi.2d v1, #0000000000000000
+; CHECK-NEXT: mov.d v0[1], v1[0]
; CHECK-NEXT: ret
%v.0 = insertelement <2 x double> %a, double 0.000000e+00, i32 1
ret <2 x double> %v.0
I think in theory this shouldn't be much of an improvement, if the existing operation is already split into micro-ops and one of them is essentially the new instruction. In practice it is probably a good idea, though.
There are certain CPUs where it will not be helpful, however. Do you mind adding a tuning feature that keeps the old patterns for cortex-a53, a55, a510, a520 and a310? Those are all in-order cores, and for them this will just be extra instructions.
def : Pat<(vector_insert (v8f16 V128:$Rn), (f16 fpimm0), (i64 VectorIndexH:$imm)),
          (INSvi16lane V128:$Rn, VectorIndexH:$imm,
                       (v8f16 (MOVIv2d_ns (i32 0))), (i64 0))>;

def : Pat<(vector_insert (v4f32 V128:$Rn), (f32 fpimm0), (i64 VectorIndexS:$imm)),
          (INSvi32lane V128:$Rn, VectorIndexS:$imm,
                       (v4f32 (MOVIv2d_ns (i32 0))), (i64 0))>;

def : Pat<(vector_insert (v2f64 V128:$Rn), (f64 fpimm0), (i64 VectorIndexD:$imm)),
          (INSvi64lane V128:$Rn, VectorIndexD:$imm,
                       (v2f64 (MOVIv2d_ns (i32 0))), (i64 0))>;
Does it need these new patterns? Or is the codegen without any pattern already OK?
Hm, sorry, I'm not sure I understand; without those patterns we get the move from wzr (which we want to avoid).
I mean: if we remove the INS wzr patterns above, do we need the new patterns, or is that already handled by the existing "insert" and "zero a register" patterns?
Ohh, I see, it is, yes :)
Sure, do you maybe have an example I can work off of? I'm not sure how to go about this.
Sure, do you maybe have an example I can work off of? I'm not sure how to go about this.
I think that DisableFastIncVL/HasFastIncVL shows how to add a subtarget feature and make patterns dependent on it. The old patterns can be made dependent on it, I believe, and it can be added to the processors that would not benefit. (FastIncVL is perhaps a bit odd, as it doesn't appear to be used anywhere.)
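For illustration, a rough sketch of that approach, following the style of existing AArch64 subtarget features; the feature and predicate names here are hypothetical, not necessarily what the patch will end up using:

// Hypothetical tuning feature: CPUs with this feature keep the old
// INS-from-WZR patterns instead of materializing 0 in a SIMD register first.
def FeatureUseWzrToVecMove
    : SubtargetFeature<"use-wzr-to-vec-move", "UseWzrToVecMove", "true",
                       "Use a move from WZR/XZR to insert 0 into a vector lane">;

// Predicate the old patterns can be gated on (assumes the auto-generated
// Subtarget accessor for the field above).
def UseWzrToVecMove : Predicate<"Subtarget->useWzrToVecMove()">;

The feature would then be added to the tuning-feature lists of the in-order cores named above.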
I mean: if we remove the INS wzr patterns above, do we need the new patterns, or is that already handled by the existing "insert" and "zero a register" patterns?
Instead of introducing new patterns, guard existing ones by a tuning feature
Thanks! And you were right about not needing new patterns, so I've removed the ones I added and put the existing wzr patterns behind the new tuning feature.
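Concretely, the end state would presumably look something like this sketch (predicate name illustrative, as above): the original GPR pattern from the diff stays, gated on the tuning feature, and likewise for the v8f16 and v2f64 variants.

let Predicates = [UseWzrToVecMove] in {
  def : Pat<(vector_insert (v4f32 V128:$Rn), (f32 fpimm0), (i64 VectorIndexS:$imm)),
            (INSvi32gpr V128:$Rn, VectorIndexS:$imm, WZR)>;
}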
LGTM, thanks
[AArch64] Replace expensive move from wzr by two moves via floating point immediate (#146538)

We've noticed that inserting 0 into a known vector lane is implemented via a move from wzr, i.e., moving between register banks. We think it will be cheaper (and have seen improvements on our benchmarks) to materialize 0 into a floating point register and insert from there.

PR: llvm/llvm-project#146538
/cherry-pick e333d60
/pull-request #150001