Conversation

gbossu (Contributor) commented on Aug 7, 2025:

The patch adds patterns to select the EXT_ZZI_CONSTRUCTIVE pseudo
instead of the EXT_ZZI destructive instruction for vector_splice. This
only works when the two inputs to vector_splice are identical.

Since the source and destination registers are no longer tied, the register
allocator has more freedom, and many MOVs get replaced with MOVPRFX.

In some cases, however, the register allocator could simply have reused the
input register for the output but preferred not to. As a result, some test
cases now contain more instructions: a MOVPRFX appears where previously no
MOV was needed.

This is a chained PR: #152552 - #152553 - #152554

llvmbot (Member) commented on Aug 7, 2025:

@llvm/pr-subscribers-backend-aarch64

Author: Gaëtan Bossu (gbossu)

Changes

The patch changes existing patterns to select the EXT_ZZZI pseudo
instead of the EXT_ZZI destructive instruction for vector_splice.

Since the source and destination registers are no longer tied, the register
allocator has more freedom, and many MOVs get replaced with MOVPRFX.

In some cases, however, the register allocator could simply have reused the
input register for the output but preferred not to. As a result, some test
cases now contain more instructions: a MOVPRFX appears where previously no
MOV was needed.


Patch is 154.60 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/152554.diff

21 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64PostCoalescerPass.cpp (+7-3)
  • (modified) llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td (+4-4)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-extract-subvector.ll (+21-20)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-fp-to-int.ll (+24-20)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-int-extends.ll (+30-24)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-int-rem.ll (+20-20)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-int-to-fp.ll (+24-20)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-limit-duplane.ll (+8-6)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-masked-loads.ll (+70-56)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-partial-reduce.ll (+14-14)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-shuffles.ll (+21-20)
  • (modified) llvm/test/CodeGen/AArch64/sve-pr92779.ll (+9-9)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-extend-trunc.ll (+15-12)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-to-int.ll (+150-136)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-extends.ll (+413-327)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-rem.ll (+108-108)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-to-fp.ll (+152-132)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-limit-duplane.ll (+8-7)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-masked-load.ll (+14-12)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-masked-store.ll (+20-18)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-reductions.ll (+52-42)
diff --git a/llvm/lib/Target/AArch64/AArch64PostCoalescerPass.cpp b/llvm/lib/Target/AArch64/AArch64PostCoalescerPass.cpp
index cdf2822f3ed9d..b7d69b68af4ee 100644
--- a/llvm/lib/Target/AArch64/AArch64PostCoalescerPass.cpp
+++ b/llvm/lib/Target/AArch64/AArch64PostCoalescerPass.cpp
@@ -53,9 +53,6 @@ bool AArch64PostCoalescer::runOnMachineFunction(MachineFunction &MF) {
   if (skipFunction(MF.getFunction()))
     return false;
 
-  AArch64FunctionInfo *FuncInfo = MF.getInfo<AArch64FunctionInfo>();
-  if (!FuncInfo->hasStreamingModeChanges())
-    return false;
 
   MRI = &MF.getRegInfo();
   LIS = &getAnalysis<LiveIntervalsWrapperPass>().getLIS();
@@ -86,6 +83,13 @@ bool AArch64PostCoalescer::runOnMachineFunction(MachineFunction &MF) {
         Changed = true;
         break;
       }
+      case AArch64::EXT_ZZZI:
+        Register DstReg = MI.getOperand(0).getReg();
+        Register SrcReg1 = MI.getOperand(1).getReg();
+        if (SrcReg1 != DstReg) {
+          MRI->setRegAllocationHint(DstReg, 0, SrcReg1);
+        }
+        break;
       }
     }
   }
diff --git a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
index 85e647af6684c..a3ca0cb73cd43 100644
--- a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
@@ -2135,19 +2135,19 @@ let Predicates = [HasSVE_or_SME] in {
   // Splice with lane bigger or equal to 0
   foreach VT = [nxv16i8] in
     def : Pat<(VT (vector_splice VT:$Z1, VT:$Z2, (i64 (sve_ext_imm_0_255 i32:$index)))),
-              (EXT_ZZI  ZPR:$Z1, ZPR:$Z2, imm0_255:$index)>;
+              (EXT_ZZZI  ZPR:$Z1, ZPR:$Z2, imm0_255:$index)>;
 
   foreach VT = [nxv8i16, nxv8f16, nxv8bf16] in
     def : Pat<(VT (vector_splice VT:$Z1, VT:$Z2, (i64 (sve_ext_imm_0_127 i32:$index)))),
-              (EXT_ZZI  ZPR:$Z1, ZPR:$Z2, imm0_255:$index)>;
+              (EXT_ZZZI  ZPR:$Z1, ZPR:$Z2, imm0_255:$index)>;
 
   foreach VT = [nxv4i32, nxv4f16, nxv4f32, nxv4bf16] in
     def : Pat<(VT (vector_splice VT:$Z1, VT:$Z2, (i64 (sve_ext_imm_0_63 i32:$index)))),
-              (EXT_ZZI  ZPR:$Z1, ZPR:$Z2, imm0_255:$index)>;
+              (EXT_ZZZI  ZPR:$Z1, ZPR:$Z2, imm0_255:$index)>;
 
   foreach VT = [nxv2i64, nxv2f16, nxv2f32, nxv2f64, nxv2bf16] in
     def : Pat<(VT (vector_splice VT:$Z1, VT:$Z2, (i64 (sve_ext_imm_0_31 i32:$index)))),
-              (EXT_ZZI  ZPR:$Z1, ZPR:$Z2, imm0_255:$index)>;
+              (EXT_ZZZI  ZPR:$Z1, ZPR:$Z2, imm0_255:$index)>;
 
   defm CMPHS_PPzZZ : sve_int_cmp_0<0b000, "cmphs", SETUGE, SETULE>;
   defm CMPHI_PPzZZ : sve_int_cmp_0<0b001, "cmphi", SETUGT, SETULT>;
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-subvector.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-subvector.ll
index 800f95d97af4c..7b438743487e1 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-subvector.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-subvector.ll
@@ -50,7 +50,7 @@ define void @extract_v32i8_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_range(
 ; CHECK-LABEL: extract_v32i8_halves:
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    ldr z0, [x0]
-; CHECK-NEXT:    mov z1.d, z0.d
+; CHECK-NEXT:    movprfx z1, z0
 ; CHECK-NEXT:    ext z1.b, z1.b, z0.b, #16
 ; CHECK-NEXT:    str q1, [x1]
 ; CHECK-NEXT:    str q0, [x2]
@@ -68,7 +68,7 @@ define void @extract_v32i8_half_unaligned(ptr %in, ptr %out) #0 vscale_range(2,2
 ; CHECK-LABEL: extract_v32i8_half_unaligned:
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    ldr z0, [x0]
-; CHECK-NEXT:    mov z1.d, z0.d
+; CHECK-NEXT:    movprfx z1, z0
 ; CHECK-NEXT:    ext z1.b, z1.b, z0.b, #16
 ; CHECK-NEXT:    ext v0.16b, v0.16b, v1.16b, #4
 ; CHECK-NEXT:    str q0, [x1]
@@ -84,15 +84,16 @@ define void @extract_v32i8_quarters(ptr %in, ptr %out, ptr %out2, ptr %out3, ptr
 ; CHECK-LABEL: extract_v32i8_quarters:
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    ldr z0, [x0]
-; CHECK-NEXT:    mov z1.d, z0.d
-; CHECK-NEXT:    mov z2.d, z0.d
+; CHECK-NEXT:    movprfx z1, z0
 ; CHECK-NEXT:    ext z1.b, z1.b, z0.b, #16
+; CHECK-NEXT:    movprfx z2, z0
 ; CHECK-NEXT:    ext z2.b, z2.b, z0.b, #24
+; CHECK-NEXT:    movprfx z3, z0
+; CHECK-NEXT:    ext z3.b, z3.b, z0.b, #8
 ; CHECK-NEXT:    str d1, [x1]
 ; CHECK-NEXT:    str d2, [x2]
 ; CHECK-NEXT:    str d0, [x3]
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #8
-; CHECK-NEXT:    str d0, [x4]
+; CHECK-NEXT:    str d3, [x4]
 ; CHECK-NEXT:    ret
 entry:
   %b = load <32 x i8>, ptr %in
@@ -126,7 +127,7 @@ define void @extract_v64i8_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_range(
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    ldr z0, [x0]
 ; CHECK-NEXT:    ptrue p0.b, vl32
-; CHECK-NEXT:    mov z1.d, z0.d
+; CHECK-NEXT:    movprfx z1, z0
 ; CHECK-NEXT:    ext z1.b, z1.b, z0.b, #32
 ; CHECK-NEXT:    st1b { z1.b }, p0, [x1]
 ; CHECK-NEXT:    st1b { z0.b }, p0, [x2]
@@ -207,7 +208,7 @@ define void @extract_v16i16_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_range
 ; CHECK-LABEL: extract_v16i16_halves:
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    ldr z0, [x0]
-; CHECK-NEXT:    mov z1.d, z0.d
+; CHECK-NEXT:    movprfx z1, z0
 ; CHECK-NEXT:    ext z1.b, z1.b, z0.b, #16
 ; CHECK-NEXT:    str q1, [x1]
 ; CHECK-NEXT:    str q0, [x2]
@@ -240,7 +241,7 @@ define void @extract_v32i16_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_range
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    ldr z0, [x0]
 ; CHECK-NEXT:    ptrue p0.h, vl16
-; CHECK-NEXT:    mov z1.d, z0.d
+; CHECK-NEXT:    movprfx z1, z0
 ; CHECK-NEXT:    ext z1.b, z1.b, z0.b, #32
 ; CHECK-NEXT:    st1h { z1.h }, p0, [x1]
 ; CHECK-NEXT:    st1h { z0.h }, p0, [x2]
@@ -322,7 +323,7 @@ define void @extract_v8i32_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_range(
 ; CHECK-LABEL: extract_v8i32_halves:
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    ldr z0, [x0]
-; CHECK-NEXT:    mov z1.d, z0.d
+; CHECK-NEXT:    movprfx z1, z0
 ; CHECK-NEXT:    ext z1.b, z1.b, z0.b, #16
 ; CHECK-NEXT:    str q1, [x1]
 ; CHECK-NEXT:    str q0, [x2]
@@ -355,7 +356,7 @@ define void @extract_v16i32_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_range
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    ldr z0, [x0]
 ; CHECK-NEXT:    ptrue p0.s, vl8
-; CHECK-NEXT:    mov z1.d, z0.d
+; CHECK-NEXT:    movprfx z1, z0
 ; CHECK-NEXT:    ext z1.b, z1.b, z0.b, #32
 ; CHECK-NEXT:    st1w { z1.s }, p0, [x1]
 ; CHECK-NEXT:    st1w { z0.s }, p0, [x2]
@@ -426,7 +427,7 @@ define void @extract_v4i64_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_range(
 ; CHECK-LABEL: extract_v4i64_halves:
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    ldr z0, [x0]
-; CHECK-NEXT:    mov z1.d, z0.d
+; CHECK-NEXT:    movprfx z1, z0
 ; CHECK-NEXT:    ext z1.b, z1.b, z0.b, #16
 ; CHECK-NEXT:    str q1, [x1]
 ; CHECK-NEXT:    str q0, [x2]
@@ -459,7 +460,7 @@ define void @extract_v8i64_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_range(
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    ldr z0, [x0]
 ; CHECK-NEXT:    ptrue p0.d, vl4
-; CHECK-NEXT:    mov z1.d, z0.d
+; CHECK-NEXT:    movprfx z1, z0
 ; CHECK-NEXT:    ext z1.b, z1.b, z0.b, #32
 ; CHECK-NEXT:    st1d { z1.d }, p0, [x1]
 ; CHECK-NEXT:    st1d { z0.d }, p0, [x2]
@@ -553,7 +554,7 @@ define void @extract_v16half_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_rang
 ; CHECK-LABEL: extract_v16half_halves:
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    ldr z0, [x0]
-; CHECK-NEXT:    mov z1.d, z0.d
+; CHECK-NEXT:    movprfx z1, z0
 ; CHECK-NEXT:    ext z1.b, z1.b, z0.b, #16
 ; CHECK-NEXT:    str q1, [x1]
 ; CHECK-NEXT:    str q0, [x2]
@@ -586,7 +587,7 @@ define void @extract_v32half_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_rang
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    ldr z0, [x0]
 ; CHECK-NEXT:    ptrue p0.h, vl16
-; CHECK-NEXT:    mov z1.d, z0.d
+; CHECK-NEXT:    movprfx z1, z0
 ; CHECK-NEXT:    ext z1.b, z1.b, z0.b, #32
 ; CHECK-NEXT:    st1h { z1.h }, p0, [x1]
 ; CHECK-NEXT:    st1h { z0.h }, p0, [x2]
@@ -668,7 +669,7 @@ define void @extract_v8float_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_rang
 ; CHECK-LABEL: extract_v8float_halves:
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    ldr z0, [x0]
-; CHECK-NEXT:    mov z1.d, z0.d
+; CHECK-NEXT:    movprfx z1, z0
 ; CHECK-NEXT:    ext z1.b, z1.b, z0.b, #16
 ; CHECK-NEXT:    str q1, [x1]
 ; CHECK-NEXT:    str q0, [x2]
@@ -701,7 +702,7 @@ define void @extract_v16float_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_ran
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    ldr z0, [x0]
 ; CHECK-NEXT:    ptrue p0.s, vl8
-; CHECK-NEXT:    mov z1.d, z0.d
+; CHECK-NEXT:    movprfx z1, z0
 ; CHECK-NEXT:    ext z1.b, z1.b, z0.b, #32
 ; CHECK-NEXT:    st1w { z1.s }, p0, [x1]
 ; CHECK-NEXT:    st1w { z0.s }, p0, [x2]
@@ -772,7 +773,7 @@ define void @extract_v4double_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_ran
 ; CHECK-LABEL: extract_v4double_halves:
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    ldr z0, [x0]
-; CHECK-NEXT:    mov z1.d, z0.d
+; CHECK-NEXT:    movprfx z1, z0
 ; CHECK-NEXT:    ext z1.b, z1.b, z0.b, #16
 ; CHECK-NEXT:    str q1, [x1]
 ; CHECK-NEXT:    str q0, [x2]
@@ -805,7 +806,7 @@ define void @extract_v8double_halves(ptr %in, ptr %out, ptr %out2) #0 vscale_ran
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    ldr z0, [x0]
 ; CHECK-NEXT:    ptrue p0.d, vl4
-; CHECK-NEXT:    mov z1.d, z0.d
+; CHECK-NEXT:    movprfx z1, z0
 ; CHECK-NEXT:    ext z1.b, z1.b, z0.b, #32
 ; CHECK-NEXT:    st1d { z1.d }, p0, [x1]
 ; CHECK-NEXT:    st1d { z0.d }, p0, [x2]
@@ -908,7 +909,7 @@ define void @extract_subvector_legalization_v8i32() vscale_range(2,2) #0 {
 ; CHECK-NEXT:    add x8, x8, :lo12:.LCPI59_0
 ; CHECK-NEXT:    ptrue p1.d
 ; CHECK-NEXT:    ldr z0, [x8]
-; CHECK-NEXT:    mov z1.d, z0.d
+; CHECK-NEXT:    movprfx z1, z0
 ; CHECK-NEXT:    ext z1.b, z1.b, z0.b, #16
 ; CHECK-NEXT:    cmeq v0.4s, v0.4s, #0
 ; CHECK-NEXT:    cmeq v1.4s, v1.4s, #0
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-to-int.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-to-int.ll
index af54b146c5b66..c8f6d98f5a63f 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-to-int.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-to-int.ll
@@ -150,13 +150,14 @@ define void @fcvtzu_v16f16_v16i32(ptr %a, ptr %b) #0 {
 ; VBITS_GE_256-NEXT:    mov x8, #8 // =0x8
 ; VBITS_GE_256-NEXT:    ld1h { z0.h }, p0/z, [x0]
 ; VBITS_GE_256-NEXT:    ptrue p0.s, vl8
-; VBITS_GE_256-NEXT:    uunpklo z1.s, z0.h
-; VBITS_GE_256-NEXT:    ext z0.b, z0.b, z0.b, #16
+; VBITS_GE_256-NEXT:    movprfx z1, z0
+; VBITS_GE_256-NEXT:    ext z1.b, z1.b, z0.b, #16
 ; VBITS_GE_256-NEXT:    uunpklo z0.s, z0.h
-; VBITS_GE_256-NEXT:    fcvtzu z1.s, p0/m, z1.h
+; VBITS_GE_256-NEXT:    uunpklo z1.s, z1.h
 ; VBITS_GE_256-NEXT:    fcvtzu z0.s, p0/m, z0.h
-; VBITS_GE_256-NEXT:    st1w { z1.s }, p0, [x1]
-; VBITS_GE_256-NEXT:    st1w { z0.s }, p0, [x1, x8, lsl #2]
+; VBITS_GE_256-NEXT:    fcvtzu z1.s, p0/m, z1.h
+; VBITS_GE_256-NEXT:    st1w { z0.s }, p0, [x1]
+; VBITS_GE_256-NEXT:    st1w { z1.s }, p0, [x1, x8, lsl #2]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: fcvtzu_v16f16_v16i32:
@@ -551,13 +552,14 @@ define void @fcvtzu_v8f32_v8i64(ptr %a, ptr %b) #0 {
 ; VBITS_GE_256-NEXT:    mov x8, #4 // =0x4
 ; VBITS_GE_256-NEXT:    ld1w { z0.s }, p0/z, [x0]
 ; VBITS_GE_256-NEXT:    ptrue p0.d, vl4
-; VBITS_GE_256-NEXT:    uunpklo z1.d, z0.s
-; VBITS_GE_256-NEXT:    ext z0.b, z0.b, z0.b, #16
+; VBITS_GE_256-NEXT:    movprfx z1, z0
+; VBITS_GE_256-NEXT:    ext z1.b, z1.b, z0.b, #16
 ; VBITS_GE_256-NEXT:    uunpklo z0.d, z0.s
-; VBITS_GE_256-NEXT:    fcvtzu z1.d, p0/m, z1.s
+; VBITS_GE_256-NEXT:    uunpklo z1.d, z1.s
 ; VBITS_GE_256-NEXT:    fcvtzu z0.d, p0/m, z0.s
-; VBITS_GE_256-NEXT:    st1d { z1.d }, p0, [x1]
-; VBITS_GE_256-NEXT:    st1d { z0.d }, p0, [x1, x8, lsl #3]
+; VBITS_GE_256-NEXT:    fcvtzu z1.d, p0/m, z1.s
+; VBITS_GE_256-NEXT:    st1d { z0.d }, p0, [x1]
+; VBITS_GE_256-NEXT:    st1d { z1.d }, p0, [x1, x8, lsl #3]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: fcvtzu_v8f32_v8i64:
@@ -1043,13 +1045,14 @@ define void @fcvtzs_v16f16_v16i32(ptr %a, ptr %b) #0 {
 ; VBITS_GE_256-NEXT:    mov x8, #8 // =0x8
 ; VBITS_GE_256-NEXT:    ld1h { z0.h }, p0/z, [x0]
 ; VBITS_GE_256-NEXT:    ptrue p0.s, vl8
-; VBITS_GE_256-NEXT:    uunpklo z1.s, z0.h
-; VBITS_GE_256-NEXT:    ext z0.b, z0.b, z0.b, #16
+; VBITS_GE_256-NEXT:    movprfx z1, z0
+; VBITS_GE_256-NEXT:    ext z1.b, z1.b, z0.b, #16
 ; VBITS_GE_256-NEXT:    uunpklo z0.s, z0.h
-; VBITS_GE_256-NEXT:    fcvtzs z1.s, p0/m, z1.h
+; VBITS_GE_256-NEXT:    uunpklo z1.s, z1.h
 ; VBITS_GE_256-NEXT:    fcvtzs z0.s, p0/m, z0.h
-; VBITS_GE_256-NEXT:    st1w { z1.s }, p0, [x1]
-; VBITS_GE_256-NEXT:    st1w { z0.s }, p0, [x1, x8, lsl #2]
+; VBITS_GE_256-NEXT:    fcvtzs z1.s, p0/m, z1.h
+; VBITS_GE_256-NEXT:    st1w { z0.s }, p0, [x1]
+; VBITS_GE_256-NEXT:    st1w { z1.s }, p0, [x1, x8, lsl #2]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: fcvtzs_v16f16_v16i32:
@@ -1444,13 +1447,14 @@ define void @fcvtzs_v8f32_v8i64(ptr %a, ptr %b) #0 {
 ; VBITS_GE_256-NEXT:    mov x8, #4 // =0x4
 ; VBITS_GE_256-NEXT:    ld1w { z0.s }, p0/z, [x0]
 ; VBITS_GE_256-NEXT:    ptrue p0.d, vl4
-; VBITS_GE_256-NEXT:    uunpklo z1.d, z0.s
-; VBITS_GE_256-NEXT:    ext z0.b, z0.b, z0.b, #16
+; VBITS_GE_256-NEXT:    movprfx z1, z0
+; VBITS_GE_256-NEXT:    ext z1.b, z1.b, z0.b, #16
 ; VBITS_GE_256-NEXT:    uunpklo z0.d, z0.s
-; VBITS_GE_256-NEXT:    fcvtzs z1.d, p0/m, z1.s
+; VBITS_GE_256-NEXT:    uunpklo z1.d, z1.s
 ; VBITS_GE_256-NEXT:    fcvtzs z0.d, p0/m, z0.s
-; VBITS_GE_256-NEXT:    st1d { z1.d }, p0, [x1]
-; VBITS_GE_256-NEXT:    st1d { z0.d }, p0, [x1, x8, lsl #3]
+; VBITS_GE_256-NEXT:    fcvtzs z1.d, p0/m, z1.s
+; VBITS_GE_256-NEXT:    st1d { z0.d }, p0, [x1]
+; VBITS_GE_256-NEXT:    st1d { z1.d }, p0, [x1, x8, lsl #3]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: fcvtzs_v8f32_v8i64:
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-int-extends.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-int-extends.ll
index 4feb86305f8f6..d2fa65599b973 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-length-int-extends.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-int-extends.ll
@@ -77,11 +77,12 @@ define void @sext_v32i8_v32i16(ptr %in, ptr %out) #0 {
 ; VBITS_GE_256-NEXT:    ld1b { z0.b }, p0/z, [x0]
 ; VBITS_GE_256-NEXT:    ptrue p0.h, vl16
 ; VBITS_GE_256-NEXT:    add z0.b, z0.b, z0.b
-; VBITS_GE_256-NEXT:    sunpklo z1.h, z0.b
-; VBITS_GE_256-NEXT:    ext z0.b, z0.b, z0.b, #16
+; VBITS_GE_256-NEXT:    movprfx z1, z0
+; VBITS_GE_256-NEXT:    ext z1.b, z1.b, z0.b, #16
 ; VBITS_GE_256-NEXT:    sunpklo z0.h, z0.b
-; VBITS_GE_256-NEXT:    st1h { z1.h }, p0, [x1]
-; VBITS_GE_256-NEXT:    st1h { z0.h }, p0, [x1, x8, lsl #1]
+; VBITS_GE_256-NEXT:    sunpklo z1.h, z1.b
+; VBITS_GE_256-NEXT:    st1h { z0.h }, p0, [x1]
+; VBITS_GE_256-NEXT:    st1h { z1.h }, p0, [x1, x8, lsl #1]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: sext_v32i8_v32i16:
@@ -326,11 +327,12 @@ define void @sext_v16i16_v16i32(ptr %in, ptr %out) #0 {
 ; VBITS_GE_256-NEXT:    ld1h { z0.h }, p0/z, [x0]
 ; VBITS_GE_256-NEXT:    ptrue p0.s, vl8
 ; VBITS_GE_256-NEXT:    add z0.h, z0.h, z0.h
-; VBITS_GE_256-NEXT:    sunpklo z1.s, z0.h
-; VBITS_GE_256-NEXT:    ext z0.b, z0.b, z0.b, #16
+; VBITS_GE_256-NEXT:    movprfx z1, z0
+; VBITS_GE_256-NEXT:    ext z1.b, z1.b, z0.b, #16
 ; VBITS_GE_256-NEXT:    sunpklo z0.s, z0.h
-; VBITS_GE_256-NEXT:    st1w { z1.s }, p0, [x1]
-; VBITS_GE_256-NEXT:    st1w { z0.s }, p0, [x1, x8, lsl #2]
+; VBITS_GE_256-NEXT:    sunpklo z1.s, z1.h
+; VBITS_GE_256-NEXT:    st1w { z0.s }, p0, [x1]
+; VBITS_GE_256-NEXT:    st1w { z1.s }, p0, [x1, x8, lsl #2]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: sext_v16i16_v16i32:
@@ -490,11 +492,12 @@ define void @sext_v8i32_v8i64(ptr %in, ptr %out) #0 {
 ; VBITS_GE_256-NEXT:    ld1w { z0.s }, p0/z, [x0]
 ; VBITS_GE_256-NEXT:    ptrue p0.d, vl4
 ; VBITS_GE_256-NEXT:    add z0.s, z0.s, z0.s
-; VBITS_GE_256-NEXT:    sunpklo z1.d, z0.s
-; VBITS_GE_256-NEXT:    ext z0.b, z0.b, z0.b, #16
+; VBITS_GE_256-NEXT:    movprfx z1, z0
+; VBITS_GE_256-NEXT:    ext z1.b, z1.b, z0.b, #16
 ; VBITS_GE_256-NEXT:    sunpklo z0.d, z0.s
-; VBITS_GE_256-NEXT:    st1d { z1.d }, p0, [x1]
-; VBITS_GE_256-NEXT:    st1d { z0.d }, p0, [x1, x8, lsl #3]
+; VBITS_GE_256-NEXT:    sunpklo z1.d, z1.s
+; VBITS_GE_256-NEXT:    st1d { z0.d }, p0, [x1]
+; VBITS_GE_256-NEXT:    st1d { z1.d }, p0, [x1, x8, lsl #3]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: sext_v8i32_v8i64:
@@ -573,11 +576,12 @@ define void @zext_v32i8_v32i16(ptr %in, ptr %out) #0 {
 ; VBITS_GE_256-NEXT:    ld1b { z0.b }, p0/z, [x0]
 ; VBITS_GE_256-NEXT:    ptrue p0.h, vl16
 ; VBITS_GE_256-NEXT:    add z0.b, z0.b, z0.b
-; VBITS_GE_256-NEXT:    uunpklo z1.h, z0.b
-; VBITS_GE_256-NEXT:    ext z0.b, z0.b, z0.b, #16
+; VBITS_GE_256-NEXT:    movprfx z1, z0
+; VBITS_GE_256-NEXT:    ext z1.b, z1.b, z0.b, #16
 ; VBITS_GE_256-NEXT:    uunpklo z0.h, z0.b
-; VBITS_GE_256-NEXT:    st1h { z1.h }, p0, [x1]
-; VBITS_GE_256-NEXT:    st1h { z0.h }, p0, [x1, x8, lsl #1]
+; VBITS_GE_256-NEXT:    uunpklo z1.h, z1.b
+; VBITS_GE_256-NEXT:    st1h { z0.h }, p0, [x1]
+; VBITS_GE_256-NEXT:    st1h { z1.h }, p0, [x1, x8, lsl #1]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: zext_v32i8_v32i16:
@@ -822,11 +826,12 @@ define void @zext_v16i16_v16i32(ptr %in, ptr %out) #0 {
 ; VBITS_GE_256-NEXT:    ld1h { z0.h }, p0/z, [x0]
 ; VBITS_GE_256-NEXT:    ptrue p0.s, vl8
 ; VBITS_GE_256-NEXT:    add z0.h, z0.h, z0.h
-; VBITS_GE_256-NEXT:    uunpklo z1.s, z0.h
-; VBITS_GE_256-NEXT:    ext z0.b, z0.b, z0.b, #16
+; VBITS_GE_256-NEXT:    movprfx z1, z0
+; VBITS_GE_256-NEXT:    ext z1.b, z1.b, z0.b, #16
 ; VBITS_GE_256-NEXT:    uunpklo z0.s, z0.h
-; VBITS_GE_256-NEXT:    st1w { z1.s }, p0, [x1]
-; VBITS_GE_256-NEXT:    st1w { z0.s }, p0, [x1, x8, lsl #2]
+; VBITS_GE_256-NEXT:    uunpklo z1.s, z1.h
+; VBITS_GE_256-NEXT:    st1w { z0.s }, p0, [x1]
+; VBITS_GE_256-NEXT:    st1w { z1.s }, p0, [x1, x8, lsl #2]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: zext_v16i16_v16i32:
@@ -986,11 +991,12 @@ define void @zext_v8i32_v8i64(ptr %in, ptr %out) #0 {
 ; VBITS_GE_256-NEXT:    ld1w { z0.s }, p0/z, [x0]
 ; VBITS_GE_256-NEXT:    ptrue p0.d, vl4
 ; VBITS_GE_256-NEXT:    add z0.s, z0.s, z0.s
-; VBITS_GE_256-NEXT:    uunpklo z1.d, z0.s
-; VBITS_GE_256-NEXT:    ext z0.b, z0.b, z0.b, #16
+; VBITS_GE_256-NEXT:    movprfx z1, z0
+; VBITS_GE_256-NEXT:    ext z1.b, z1.b, z0.b, #16
 ; VBITS_GE_256-NEXT:    uunpklo z0.d, z0.s
-; VBITS_GE_256-NEXT:    st1d { z1.d }, p0, [x1]
-; VBITS_GE_256-NEXT:    st1d { z0.d }, p0, [x1, x8, lsl #3]
+; VBITS_GE_256-NEXT:    uunpklo z1.d, z1.s
+; VBITS_GE_256-NEXT:    st1d { z0.d }, p0, [x1]
+; VBITS_GE_256-NEXT:    st1d { z1.d }, p0, [x1, x8, lsl #3]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: zext_v8i32_v8i64:
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-int-rem.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-int-rem.ll
index 2d78945399176..27be84419d59e 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-length-int-rem.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-int-rem.ll
@@ -259,17 +259,17 @@ define void @srem_v256i8(ptr %a, ptr %b) vscale_range(16,0) #0 {
 ; CHECK-NEXT:    sunpklo z2.s, z2.h
 ; CHECK-NEXT:    sunpklo z3.s, z3.h
 ; CHECK-NEXT:    sdivr z4.s, p1/m, z4.s, z5.s
-; CHECK-NEXT:    mov z5.d, z0.d
+; CHECK-NEXT:    movprfx z5, z0
 ; CHECK-NEXT:    ext z5.b, z5.b, z0.b, #128
 ; CHECK-NEXT:    sunpklo z5.h, z5.b
 ; CHECK-NEXT:    sunpklo z7.s, z5.h
 ; CHECK-NEXT:    ext z5.b, z5.b, z5.b, #128
-; CHECK-NEXT:    sdivr z2.s, p1/m, z2.s, z3.s
-; CHECK-NEXT:    mov z3.d, z1.d
 ; CHECK-NEXT:    sunpklo z5.s, z5.h
+; CHECK-NEXT:    sdivr z2.s, p1/m, z2.s, z3.s
+; CHECK-NEXT:    movprfx z3, z1
 ; CHECK-NEXT:    ext z3.b, z3.b, z1.b, #128
-; CHECK-NEXT:    uzp1 z4.h, z4.h, z4.h
 ; CHECK-NEXT:    sunpklo z3.h, z3.b
+; CHECK-NEXT:    uzp1 z4.h, z4.h, z4.h
 ; CHECK-NEXT:    sunpklo z6.s, z3.h
 ; CHECK-NEXT:    ext z3.b, z3.b, z3.b, #128
 ; CHECK-NEXT:    sunpklo z3.s, z3.h
@@ -420,11 +420,11 @@ define void @srem_v16i16(ptr %a, ptr %b) #0 {
 ; VBITS_GE_256-NEXT:    ld1h { z1.h }, p0/z, [x1]
 ; VBITS_GE_256-NEXT:    sunpklo z2.s, z1.h
 ; VBITS_GE_256-NEXT:    sunpklo z3.s, z0.h
-; VBITS_GE_256-NEXT:    mov z4.d, z0.d
+; VBITS_GE_256-NEXT:    movprfx z4, z0
 ; VBITS_GE_256-NEXT:    ext z4...
[truncated]

; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x1, x8, lsl #2]
; VBITS_GE_256-NEXT: fcvtzu z1.s, p0/m, z1.h
; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x1]
; VBITS_GE_256-NEXT: st1w { z1.s }, p0, [x1, x8, lsl #2]
gbossu (PR author) commented:

In that example, we do get one more instruction now (the movprfx), but I think the schedule is actually better because we eliminate one dependency between the ext and the second uunpklo. Now the two uunpklo instructions can execute in parallel.

This is the theme of the test updates in general: sometimes more instructions, but more freedom for the MachineScheduler.
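
For reference, a sketch of the IR behind this kind of VBITS_GE_256 test (the body is an assumption reconstructed from the test name, not copied from the file), with the before/after dependency structure spelled out:

```llvm
; Assumed shape of @fcvtzu_v16f16_v16i32: load 16 halves, convert to i32,
; store; with 256-bit vectors the two 8-element halves are handled separately.
define void @fcvtzu_v16f16_v16i32(ptr %a, ptr %b) {
  %op = load <16 x half>, ptr %a
  %res = fptoui <16 x half> %op to <16 x i32>
  store <16 x i32> %res, ptr %b
  ret void
}

; Before: the destructive ext recycled z0, so the second uunpklo had to wait
; for it:
;   uunpklo z1.s, z0.h
;   ext     z0.b, z0.b, z0.b, #16
;   uunpklo z0.s, z0.h
; After: the ext result lands in z1 while z0 keeps the original value, so the
; two uunpklo no longer feed through the same register and can run in parallel:
;   movprfx z1, z0
;   ext     z1.b, z1.b, z0.b, #16
;   uunpklo z0.s, z0.h
;   uunpklo z1.s, z1.h
```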

gbossu force-pushed the users/gbossu.vector.extract.movprfx.3 branch from 56ac99a to 4ad6acf on August 7, 2025 17:51
llvm deleted a comment from github-actions bot on Aug 7, 2025
gbossu force-pushed the users/gbossu.vector.extract.movprfx.2 branch from a6be08b to 970513b on August 12, 2025 15:49
gbossu force-pushed the users/gbossu.vector.extract.movprfx.3 branch from 4ad6acf to b6e3365 on August 12, 2025 15:51
gbossu changed the title from "[AArch64][ISel] Select constructive EXT_ZZZI pseudo instruction" to "[AArch64][ISel] Select constructive EXT_ZZI pseudo instruction" on Aug 12, 2025
; VBITS_GE_256-NEXT: sunpklo z1.h, z0.b
; VBITS_GE_256-NEXT: ext z0.b, z0.b, z0.b, #16
; VBITS_GE_256-NEXT: movprfx z1, z0
; VBITS_GE_256-NEXT: ext z1.b, z1.b, z0.b, #16
SamTebbs33 (Collaborator) commented on Aug 13, 2025:

According to the documentation for movprfx, the following instruction shouldn't use the destination register again in other operands, but the first source register for the ext is the same as the destination register.

Reply (Collaborator):

That is just an artefact of the instruction's textual format. The instruction only has a single dedicated input register.

EXT <Zdn>.B, <Zdn>.B, <Zm>.B, #<imm>

Reply (Collaborator):

Ah, is it not this ext?


gbossu force-pushed the users/gbossu.vector.extract.movprfx.2 branch from 970513b to 910fc50 on August 15, 2025 08:09
gbossu force-pushed the users/gbossu.vector.extract.movprfx.3 branch from b6e3365 to 087e105 on August 15, 2025 08:11
Base automatically changed from users/gbossu.vector.extract.movprfx.2 to main on August 15, 2025 12:15
gbossu force-pushed the users/gbossu.vector.extract.movprfx.3 branch from 087e105 to fbcac18 on August 15, 2025 12:18
gbossu merged commit 9828745 into main on Aug 15, 2025; 9 checks passed.
gbossu deleted the users/gbossu.vector.extract.movprfx.3 branch on August 15, 2025 13:30.
gbossu added a commit that referenced this pull request on Sep 3, 2025: …rs (#155601)

Previously, only 128-bit "NEON" vectors were given sensible costs.
Cores with vscale>1 can use SVE's EXT instruction to perform a
fixed-length subvector extract.

This is a follow-up from the codegen patches at #152554. They show that
with the help of MOVPRFX, we can do subvector extracts with roughly one
instruction. We now at least give sensible costs for extracting 128-bit
halves from a 256-bit vector.
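
A hypothetical example of the kind of extract the follow-up assigns a cost to (the IR is illustrative, not taken from that patch): taking the high 128-bit half of a 256-bit fixed-length vector, which a vscale==2 SVE core can do with a single EXT, possibly prefixed by a MOVPRFX:

```llvm
define <4 x i32> @extract_high_half(<8 x i32> %v) {
  ; Take elements 4..7: the upper 128 bits of the 256-bit input.
  %hi = shufflevector <8 x i32> %v, <8 x i32> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  ret <4 x i32> %hi
}
```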