
Conversation

arsenm (Contributor) commented Oct 2, 2025

Introduce a target hook so this behavior can be flipped incrementally,
target by target, with the accompanying test changes, and start by
implementing it for AMDGPU.

This appears to be a forgotten switch flip from 2015. The terminal
rule does a nicer job with subregister copies. Most of the test
changes are improvements or neutral, and the few regressions are
slight. The worst AMDGPU regressions are for true16 in the atomic
tests, but I think those stem from existing true16 issues.

arsenm (Contributor, Author) commented Oct 2, 2025

llvmbot (Member) commented Oct 2, 2025

@llvm/pr-subscribers-backend-systemz
@llvm/pr-subscribers-backend-risc-v
@llvm/pr-subscribers-backend-x86
@llvm/pr-subscribers-backend-powerpc

@llvm/pr-subscribers-llvm-globalisel

Author: Matt Arsenault (arsenm)

Changes

This appears to be a forgotten switch flip from 2015. The terminal
rule does a nicer job with subregister copies. Most of the test
changes are improvements or neutral, and the few regressions are
slight. The worst AMDGPU regressions are for true16 in the atomic
tests, but I think those stem from existing true16 issues.

I also had to hack many Hexagon tests to disable the rule; I have no
idea how to update them. They appear to test the specific scheduling
and packet formation of later machine passes, so any change in the
incoming MIR likely hides whatever was originally intended. I'll open
an issue to fix up these tests once this lands.
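Since `-terminal-rule` remains a `cl::opt` (see the RegisterCoalescer.cpp hunk in the diff), the old behavior can presumably be pinned per test with a RUN-line flag. A hypothetical example of such an edit; the actual Hexagon tests touched are listed in the patch:

```llvm
; Hypothetical RUN line pinning the old default; -terminal-rule is the
; cl::opt whose default this patch flips.
; RUN: llc -march=hexagon -terminal-rule=0 < %s | FileCheck %s
```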


Patch is 1.65 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/161621.diff

152 Files Affected:

  • (modified) llvm/lib/CodeGen/RegisterCoalescer.cpp (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/build-vector-two-dup.ll (+5-5)
  • (modified) llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll (+15-24)
  • (modified) llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll (+2-3)
  • (modified) llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll (+34-51)
  • (modified) llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll (+2-3)
  • (modified) llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/sve-extract-fixed-vector.ll (+20-19)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-reshuffle.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-shuffles.ll (+36-36)
  • (modified) llvm/test/CodeGen/AArch64/sve-ptest-removal-sink.ll (+4-4)
  • (modified) llvm/test/CodeGen/AArch64/zext-to-tbl.ll (+46-46)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll (+19-19)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll (+180-194)
  • (modified) llvm/test/CodeGen/AMDGPU/and.ll (+47-53)
  • (modified) llvm/test/CodeGen/AMDGPU/bfe-patterns.ll (+42-42)
  • (modified) llvm/test/CodeGen/AMDGPU/bfi_nested.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/bfm.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/bitreverse.ll (+19-23)
  • (modified) llvm/test/CodeGen/AMDGPU/build_vector.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/combine-cond-add-sub.ll (+16-16)
  • (modified) llvm/test/CodeGen/AMDGPU/divergence-driven-buildvector.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/divergence-driven-sext-inreg.ll (+26-30)
  • (modified) llvm/test/CodeGen/AMDGPU/fabs.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/fdiv.ll (+39-45)
  • (modified) llvm/test/CodeGen/AMDGPU/fmin_legacy.ll (+11-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fnearbyint.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-fabs.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/fp_to_sint.ll (+40-45)
  • (modified) llvm/test/CodeGen/AMDGPU/fp_to_uint.ll (+21-27)
  • (modified) llvm/test/CodeGen/AMDGPU/fshl.ll (+9-10)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_elt.ll (+20-22)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ubfe.ll (+14-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp10.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp2.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log.ll (+18-19)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log10.ll (+18-19)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log2.ll (+3-5)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll (+1038-1012)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmax.ll (+1070-1042)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmin.ll (+1070-1042)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fsub.ll (+1224-1182)
  • (modified) llvm/test/CodeGen/AMDGPU/lshr.v2i16.ll (+12-13)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/max.ll (+40-46)
  • (modified) llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/memmove-var-size.ll (+204-204)
  • (modified) llvm/test/CodeGen/AMDGPU/mul_int24.ll (+60-69)
  • (modified) llvm/test/CodeGen/AMDGPU/mul_uint24-amdgcn.ll (+50-56)
  • (modified) llvm/test/CodeGen/AMDGPU/or.ll (+12-14)
  • (modified) llvm/test/CodeGen/AMDGPU/set-inactive-wwm-overwrite.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/sext-divergence-driven-isel.ll (+7-8)
  • (modified) llvm/test/CodeGen/AMDGPU/shl.v2i16.ll (+30-36)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4f32.v3f32.ll (+9-16)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4i32.v3i32.ll (+9-16)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4p3.v3p3.ll (+9-16)
  • (modified) llvm/test/CodeGen/AMDGPU/sign_extend.ll (+42-48)
  • (modified) llvm/test/CodeGen/AMDGPU/skip-if-dead.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/sminmax.v2i16.ll (+31-32)
  • (modified) llvm/test/CodeGen/AMDGPU/sub.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/sub.v2i16.ll (+18-23)
  • (modified) llvm/test/CodeGen/AMDGPU/udiv.ll (+30-32)
  • (modified) llvm/test/CodeGen/AMDGPU/udiv64.ll (+12-16)
  • (modified) llvm/test/CodeGen/AMDGPU/while-break.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/xor.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/zext-divergence-driven-isel.ll (+7-8)
  • (modified) llvm/test/CodeGen/BPF/objdump_cond_op_2.ll (+2-2)
  • (modified) llvm/test/CodeGen/Hexagon/late_instr.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-carried-1.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-conv3x3-nested.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi11.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi12.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi7.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-kernel-phi1.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-matmul-bitext.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-order-copies.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-order-deps7.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-reuse-phi-6.ll (+1-1)
  • (modified) llvm/test/CodeGen/NVPTX/atomics-b128.ll (+75-75)
  • (modified) llvm/test/CodeGen/NVPTX/atomics-sm70.ll (+20-20)
  • (modified) llvm/test/CodeGen/NVPTX/atomics-sm90.ll (+20-20)
  • (modified) llvm/test/CodeGen/NVPTX/atomics.ll (+6-6)
  • (modified) llvm/test/CodeGen/PowerPC/ctrloop-fp128.ll (+3-3)
  • (modified) llvm/test/CodeGen/PowerPC/licm-xxsplti.ll (+27-27)
  • (modified) llvm/test/CodeGen/PowerPC/loop-instr-form-prepare.ll (+3-5)
  • (modified) llvm/test/CodeGen/PowerPC/perfect-shuffle.ll (+6-6)
  • (modified) llvm/test/CodeGen/PowerPC/sms-phi-1.ll (+3-2)
  • (modified) llvm/test/CodeGen/PowerPC/sms-phi-2.ll (+21-22)
  • (modified) llvm/test/CodeGen/RISCV/branch-on-zero.ll (+6-10)
  • (modified) llvm/test/CodeGen/RISCV/machine-pipeliner.ll (+23-23)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-exact-vlen.ll (+4-6)
  • (modified) llvm/test/CodeGen/RISCV/rvv/pr95865.ll (+21-22)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vandn-sdnode.ll (+33-33)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vcpop-shl-zext-opt.ll (+14-14)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vxrm-insert-out-of-loop.ll (+12-12)
  • (modified) llvm/test/CodeGen/SystemZ/atomicrmw-fadd-01.ll (+6-5)
  • (modified) llvm/test/CodeGen/SystemZ/atomicrmw-fsub-01.ll (+6-5)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/constbound.ll (+9-9)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/varying-outer-2d-reduction.ll (+24-26)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/while-loops.ll (+45-46)
  • (modified) llvm/test/CodeGen/Thumb2/mve-float32regloops.ll (+102-109)
  • (modified) llvm/test/CodeGen/Thumb2/mve-gather-increment.ll (+12-12)
  • (modified) llvm/test/CodeGen/Thumb2/mve-gather-scatter-optimisation.ll (+45-45)
  • (modified) llvm/test/CodeGen/Thumb2/mve-pipelineloops.ll (+26-26)
  • (modified) llvm/test/CodeGen/Thumb2/mve-shuffle.ll (+7-6)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vld4.ll (+7-6)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vmaxnma-commute.ll (+12-12)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vst4.ll (+7-7)
  • (modified) llvm/test/CodeGen/Thumb2/pacbti-m-vla.ll (+1-1)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-shift-in-loop.ll (+6-8)
  • (modified) llvm/test/CodeGen/X86/3addr-16bit.ll (+24-24)
  • (modified) llvm/test/CodeGen/X86/atomic-rm-bit-test.ll (+13-9)
  • (modified) llvm/test/CodeGen/X86/atomicrmw-fadd-fp-vector.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/bitcast-vector-bool.ll (+16-16)
  • (modified) llvm/test/CodeGen/X86/coalescer-dead-flag-verifier-error.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/dag-update-nodetomatch.ll (+52-48)
  • (modified) llvm/test/CodeGen/X86/fold-loop-of-urem.ll (+38-43)
  • (modified) llvm/test/CodeGen/X86/freeze-binary.ll (+14-12)
  • (modified) llvm/test/CodeGen/X86/i128-mul.ll (+87-91)
  • (modified) llvm/test/CodeGen/X86/icmp-abs-C.ll (+11-11)
  • (modified) llvm/test/CodeGen/X86/masked_gather_scatter.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/midpoint-int.ll (+14-14)
  • (modified) llvm/test/CodeGen/X86/mmx-arith.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/mul-constant-i16.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/mul-constant-i32.ll (+8-8)
  • (modified) llvm/test/CodeGen/X86/mul-constant-i8.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/optimize-max-0.ll (+107-104)
  • (modified) llvm/test/CodeGen/X86/parity.ll (+15-15)
  • (modified) llvm/test/CodeGen/X86/rotate-extract.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/smul_fix.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/sshl_sat.ll (+20-20)
  • (modified) llvm/test/CodeGen/X86/sshl_sat_vec.ll (+56-57)
  • (modified) llvm/test/CodeGen/X86/stackmap.ll (+6-3)
  • (modified) llvm/test/CodeGen/X86/subvectorwise-store-of-vector-splat.ll (+105-105)
  • (modified) llvm/test/CodeGen/X86/twoaddr-lea.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/umul_fix.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/ushl_sat.ll (+14-14)
  • (modified) llvm/test/CodeGen/X86/ushl_sat_vec.ll (+55-56)
  • (modified) llvm/test/CodeGen/X86/vector-mulfix-legalize.ll (+17-17)
  • (modified) llvm/test/CodeGen/X86/vector-reduce-xor-bool.ll (+80-80)
  • (modified) llvm/test/CodeGen/X86/wide-scalar-shift-by-byte-multiple-legalization.ll (+3023-3058)
  • (modified) llvm/test/CodeGen/X86/wide-scalar-shift-legalization.ll (+668-676)
  • (modified) llvm/test/CodeGen/X86/widen-load-of-small-alloca-with-zero-upper-half.ll (+165-163)
  • (modified) llvm/test/CodeGen/X86/widen-load-of-small-alloca.ll (+49-46)
  • (modified) llvm/test/CodeGen/X86/x86-shrink-wrapping.ll (+9-9)
  • (modified) llvm/test/CodeGen/X86/xor.ll (+66-66)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/X86/ivchain-X86.ll (+11-10)
diff --git a/llvm/lib/CodeGen/RegisterCoalescer.cpp b/llvm/lib/CodeGen/RegisterCoalescer.cpp
index 7ac1aef83777a..5bd38a916fe4d 100644
--- a/llvm/lib/CodeGen/RegisterCoalescer.cpp
+++ b/llvm/lib/CodeGen/RegisterCoalescer.cpp
@@ -81,7 +81,7 @@ static cl::opt<bool> EnableJoining("join-liveintervals",
 
 static cl::opt<bool> UseTerminalRule("terminal-rule",
                                      cl::desc("Apply the terminal rule"),
-                                     cl::init(false), cl::Hidden);
+                                     cl::init(true), cl::Hidden);
 
 /// Temporary flag to test critical edge unsplitting.
 static cl::opt<bool> EnableJoinSplits(
diff --git a/llvm/test/CodeGen/AArch64/build-vector-two-dup.ll b/llvm/test/CodeGen/AArch64/build-vector-two-dup.ll
index dbbfbea9176f6..f725c19081deb 100644
--- a/llvm/test/CodeGen/AArch64/build-vector-two-dup.ll
+++ b/llvm/test/CodeGen/AArch64/build-vector-two-dup.ll
@@ -188,11 +188,11 @@ entry:
 define <8 x i8> @test11(ptr nocapture noundef readonly %a, ptr nocapture noundef readonly %b) {
 ; CHECK-LABEL: test11:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    ld1r { v1.8b }, [x0]
-; CHECK-NEXT:    ld1r { v2.8b }, [x1]
-; CHECK-NEXT:    mov v0.16b, v1.16b
-; CHECK-NEXT:    mov v0.h[2], v2.h[0]
-; CHECK-NEXT:    mov v0.h[3], v1.h[0]
+; CHECK-NEXT:    ld1r { v0.8b }, [x0]
+; CHECK-NEXT:    ld1r { v1.8b }, [x1]
+; CHECK-NEXT:    fmov d2, d0
+; CHECK-NEXT:    mov v0.h[2], v1.h[0]
+; CHECK-NEXT:    mov v0.h[3], v2.h[0]
 ; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $q0
 ; CHECK-NEXT:    ret
 entry:
diff --git a/llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll b/llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll
index 3230c9e946da7..b3a7ec961b736 100644
--- a/llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll
+++ b/llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll
@@ -20,20 +20,17 @@ define i32 @sink_load_and_copy(i32 %n) {
 ; CHECK-NEXT:    b.lt .LBB0_3
 ; CHECK-NEXT:  // %bb.1: // %for.body.preheader
 ; CHECK-NEXT:    adrp x8, A
-; CHECK-NEXT:    mov w20, w19
-; CHECK-NEXT:    ldr w21, [x8, :lo12:A]
+; CHECK-NEXT:    mov w21, w19
+; CHECK-NEXT:    ldr w20, [x8, :lo12:A]
 ; CHECK-NEXT:  .LBB0_2: // %for.body
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    mov w0, w21
+; CHECK-NEXT:    mov w0, w20
 ; CHECK-NEXT:    bl _Z3usei
-; CHECK-NEXT:    sdiv w20, w20, w0
-; CHECK-NEXT:    subs w19, w19, #1
+; CHECK-NEXT:    sdiv w19, w19, w0
+; CHECK-NEXT:    subs w21, w21, #1
 ; CHECK-NEXT:    b.ne .LBB0_2
-; CHECK-NEXT:    b .LBB0_4
-; CHECK-NEXT:  .LBB0_3:
-; CHECK-NEXT:    mov w20, w19
-; CHECK-NEXT:  .LBB0_4: // %for.cond.cleanup
-; CHECK-NEXT:    mov w0, w20
+; CHECK-NEXT:  .LBB0_3: // %for.cond.cleanup
+; CHECK-NEXT:    mov w0, w19
 ; CHECK-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
 ; CHECK-NEXT:    ldp x30, x21, [sp], #32 // 16-byte Folded Reload
 ; CHECK-NEXT:    ret
@@ -82,15 +79,12 @@ define i32 @cant_sink_successive_call(i32 %n) {
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    mov w0, w20
 ; CHECK-NEXT:    bl _Z3usei
-; CHECK-NEXT:    sdiv w21, w21, w0
-; CHECK-NEXT:    subs w19, w19, #1
+; CHECK-NEXT:    sdiv w19, w19, w0
+; CHECK-NEXT:    subs w21, w21, #1
 ; CHECK-NEXT:    b.ne .LBB1_2
-; CHECK-NEXT:    b .LBB1_4
-; CHECK-NEXT:  .LBB1_3:
-; CHECK-NEXT:    mov w21, w19
-; CHECK-NEXT:  .LBB1_4: // %for.cond.cleanup
+; CHECK-NEXT:  .LBB1_3: // %for.cond.cleanup
+; CHECK-NEXT:    mov w0, w19
 ; CHECK-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
-; CHECK-NEXT:    mov w0, w21
 ; CHECK-NEXT:    ldp x30, x21, [sp], #32 // 16-byte Folded Reload
 ; CHECK-NEXT:    ret
 entry:
@@ -139,15 +133,12 @@ define i32 @cant_sink_successive_store(ptr nocapture readnone %store, i32 %n) {
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    mov w0, w20
 ; CHECK-NEXT:    bl _Z3usei
-; CHECK-NEXT:    sdiv w21, w21, w0
-; CHECK-NEXT:    subs w19, w19, #1
+; CHECK-NEXT:    sdiv w19, w19, w0
+; CHECK-NEXT:    subs w21, w21, #1
 ; CHECK-NEXT:    b.ne .LBB2_2
-; CHECK-NEXT:    b .LBB2_4
-; CHECK-NEXT:  .LBB2_3:
-; CHECK-NEXT:    mov w21, w19
-; CHECK-NEXT:  .LBB2_4: // %for.cond.cleanup
+; CHECK-NEXT:  .LBB2_3: // %for.cond.cleanup
+; CHECK-NEXT:    mov w0, w19
 ; CHECK-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
-; CHECK-NEXT:    mov w0, w21
 ; CHECK-NEXT:    ldp x30, x21, [sp], #32 // 16-byte Folded Reload
 ; CHECK-NEXT:    ret
 entry:
diff --git a/llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll b/llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll
index e7e109170d6a1..338084295fc7f 100644
--- a/llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll
+++ b/llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll
@@ -16,13 +16,12 @@ define i32 @test(ptr %ptr) {
 ; CHECK-NEXT:    mov w9, wzr
 ; CHECK-NEXT:  LBB0_1: ; %.thread
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    lsr w11, w9, #1
 ; CHECK-NEXT:    sub w10, w9, #1
-; CHECK-NEXT:    mov w9, w11
+; CHECK-NEXT:    lsr w9, w9, #1
 ; CHECK-NEXT:    tbnz w10, #0, LBB0_1
 ; CHECK-NEXT:  ; %bb.2: ; %bb343
 ; CHECK-NEXT:    and w9, w10, #0x1
-; CHECK-NEXT:    mov w0, #-1
+; CHECK-NEXT:    mov w0, #-1 ; =0xffffffff
 ; CHECK-NEXT:    str w9, [x8]
 ; CHECK-NEXT:    ret
 bb:
diff --git a/llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll b/llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll
index b947c943ba448..72f6646930624 100644
--- a/llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll
+++ b/llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll
@@ -151,12 +151,11 @@ define void @dont_coalesce_arg_f16(half %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $h0 killed $h0 killed $z0
 ; CHECK-NEXT:    str h0, [sp, #14] // 2-byte Folded Spill
+; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr h0, [sp, #14] // 2-byte Folded Reload
 ; CHECK-NEXT:    bl use_f16
@@ -190,12 +189,11 @@ define void @dont_coalesce_arg_f32(float %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $s0 killed $s0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $s0 killed $s0 killed $z0
 ; CHECK-NEXT:    str s0, [sp, #12] // 4-byte Folded Spill
+; CHECK-NEXT:    // kill: def $s0 killed $s0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr s0, [sp, #12] // 4-byte Folded Reload
 ; CHECK-NEXT:    bl use_f32
@@ -229,12 +227,11 @@ define void @dont_coalesce_arg_f64(double %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_f64
@@ -273,12 +270,11 @@ define void @dont_coalesce_arg_v1i8(<1 x i8> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v16i8
@@ -313,12 +309,11 @@ define void @dont_coalesce_arg_v1i16(<1 x i16> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8i16
@@ -353,12 +348,11 @@ define void @dont_coalesce_arg_v1i32(<1 x i32> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v4i32
@@ -393,12 +387,11 @@ define void @dont_coalesce_arg_v1i64(<1 x i64> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2i64
@@ -433,12 +426,11 @@ define void @dont_coalesce_arg_v1f16(<1 x half> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $h0 killed $h0 killed $z0
 ; CHECK-NEXT:    str h0, [sp, #14] // 2-byte Folded Spill
+; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr h0, [sp, #14] // 2-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8f16
@@ -513,12 +505,11 @@ define void @dont_coalesce_arg_v1f64(<1 x double> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2f64
@@ -557,12 +548,11 @@ define void @dont_coalesce_arg_v16i8(<16 x i8> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v16i8
@@ -596,12 +586,11 @@ define void @dont_coalesce_arg_v8i16(<8 x i16> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8i16
@@ -635,12 +624,11 @@ define void @dont_coalesce_arg_v4i32(<4 x i32> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v4i32
@@ -674,12 +662,11 @@ define void @dont_coalesce_arg_v2i64(<2 x i64> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2i64
@@ -713,12 +700,11 @@ define void @dont_coalesce_arg_v8f16(<8 x half> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8f16
@@ -752,12 +738,11 @@ define void @dont_coalesce_arg_v8bf16(<8 x bfloat> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8bf16
@@ -791,12 +776,11 @@ define void @dont_coalesce_arg_v4f32(<4 x float> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v4f32
@@ -830,12 +814,11 @@ define void @dont_coalesce_arg_v2f64(<2 x double> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2f64
diff --git a/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll b/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
index f2163ad15bafc..df88f37195ed6 100644
--- a/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
+++ b/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
@@ -129,12 +129,11 @@ define <2 x double> @streaming_compatible_with_neon_vectors(<2 x double> %arg) "
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    mrs x19, SVCR
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    mrs x19, SVCR
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
-; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
 ; CHECK-NEXT:    tbz w19, #0, .LBB4_2
 ; CHECK-NEXT:  // %bb.1:
 ; CHECK-NEXT:    smstop sm
diff --git a/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll b/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
index 6c6a691760af3..52a77cb396909 100644
--- a/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
+++ b/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
@@ -147,15 +147,15 @@ define <2 x float> @extract_v2f32_nxv16f32_2(<vscale x 16 x float> %arg) {
 define <4 x i1> @extract_v4i1_nxv32i1_0(<vscale x 32 x i1> %arg) {
 ; CHECK-LABEL: extract_v4i1_nxv32i1_0:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    mov z1.b, p0/z, #1 // =0x1
-; CHECK-NEXT:    umov w8, v1.b[1]
-; CHECK-NEXT:    mov v0.16b, v1.16b
-; CHECK-NEXT:    umov w9, v1.b[2]
+; CHECK-NEXT:    mov z0.b, p0/z, #1 // =0x1
+; CHECK-NEXT:    umov w8, v0.b[1]
+; CHECK-NEXT:    mov v1.16b, v0.16b
 ; CHECK-NEXT:    mov v0.h[1], w8
+; CHECK-NEXT:    umov w8, v1.b[2]
+; CHECK-NEXT:    mov v0.h[2], w8
 ; CHECK-NEXT:    umov w8, v1.b[3]
-; CHECK-NEXT:    mov v0.h[2], w9
 ; CHECK-NEXT:    mov v0.h[3], w8
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $q0
+; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    ret
   %ext = call <4 x i1> @llvm.vector.extract.v4i1.n...
[truncated]

@llvmbot
Member

llvmbot commented Oct 2, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Matt Arsenault (arsenm)

Changes

This appears to be a forgotten switch flip from 2015. It
seems to do a nicer job with subregister copies. Most of the
test changes are improvements or neutral, and the few that
are not are light regressions. The worst AMDGPU regressions
are for true16 in the atomic tests, but I think that's due to
existing true16 issues.

I also had to hack many Hexagon tests to disable the rule. I have
no idea how to update these tests. They appear to be testing specific
scheduling and packet formation in later machine passes, so any change
in the incoming MIR is likely hiding whatever was originally intended.
I'll open an issue to fix up these tests once this lands.
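Since the new default lives behind a hidden cl::opt, the old behavior can still be restored per invocation, which is how the Hexagon tests above sidestep the rule and is also handy for bisecting any of the codegen diffs in this patch. A minimal sketch, assuming the standard llc option syntax; the input file name is illustrative:

```
# Re-disable the terminal rule to compare against pre-patch output.
llc -mtriple=hexagon -terminal-rule=0 -o - swp-example.ll
```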


Patch is 1.65 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/161621.diff

152 Files Affected:

  • (modified) llvm/lib/CodeGen/RegisterCoalescer.cpp (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/build-vector-two-dup.ll (+5-5)
  • (modified) llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll (+15-24)
  • (modified) llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll (+2-3)
  • (modified) llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll (+34-51)
  • (modified) llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll (+2-3)
  • (modified) llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/sve-extract-fixed-vector.ll (+20-19)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-reshuffle.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-shuffles.ll (+36-36)
  • (modified) llvm/test/CodeGen/AArch64/sve-ptest-removal-sink.ll (+4-4)
  • (modified) llvm/test/CodeGen/AArch64/zext-to-tbl.ll (+46-46)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll (+19-19)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll (+180-194)
  • (modified) llvm/test/CodeGen/AMDGPU/and.ll (+47-53)
  • (modified) llvm/test/CodeGen/AMDGPU/bfe-patterns.ll (+42-42)
  • (modified) llvm/test/CodeGen/AMDGPU/bfi_nested.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/bfm.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/bitreverse.ll (+19-23)
  • (modified) llvm/test/CodeGen/AMDGPU/build_vector.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/combine-cond-add-sub.ll (+16-16)
  • (modified) llvm/test/CodeGen/AMDGPU/divergence-driven-buildvector.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/divergence-driven-sext-inreg.ll (+26-30)
  • (modified) llvm/test/CodeGen/AMDGPU/fabs.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/fdiv.ll (+39-45)
  • (modified) llvm/test/CodeGen/AMDGPU/fmin_legacy.ll (+11-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fnearbyint.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-fabs.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/fp_to_sint.ll (+40-45)
  • (modified) llvm/test/CodeGen/AMDGPU/fp_to_uint.ll (+21-27)
  • (modified) llvm/test/CodeGen/AMDGPU/fshl.ll (+9-10)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_elt.ll (+20-22)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ubfe.ll (+14-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp10.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp2.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log.ll (+18-19)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log10.ll (+18-19)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log2.ll (+3-5)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll (+1038-1012)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmax.ll (+1070-1042)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmin.ll (+1070-1042)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fsub.ll (+1224-1182)
  • (modified) llvm/test/CodeGen/AMDGPU/lshr.v2i16.ll (+12-13)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/max.ll (+40-46)
  • (modified) llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/memmove-var-size.ll (+204-204)
  • (modified) llvm/test/CodeGen/AMDGPU/mul_int24.ll (+60-69)
  • (modified) llvm/test/CodeGen/AMDGPU/mul_uint24-amdgcn.ll (+50-56)
  • (modified) llvm/test/CodeGen/AMDGPU/or.ll (+12-14)
  • (modified) llvm/test/CodeGen/AMDGPU/set-inactive-wwm-overwrite.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/sext-divergence-driven-isel.ll (+7-8)
  • (modified) llvm/test/CodeGen/AMDGPU/shl.v2i16.ll (+30-36)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4f32.v3f32.ll (+9-16)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4i32.v3i32.ll (+9-16)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4p3.v3p3.ll (+9-16)
  • (modified) llvm/test/CodeGen/AMDGPU/sign_extend.ll (+42-48)
  • (modified) llvm/test/CodeGen/AMDGPU/skip-if-dead.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/sminmax.v2i16.ll (+31-32)
  • (modified) llvm/test/CodeGen/AMDGPU/sub.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/sub.v2i16.ll (+18-23)
  • (modified) llvm/test/CodeGen/AMDGPU/udiv.ll (+30-32)
  • (modified) llvm/test/CodeGen/AMDGPU/udiv64.ll (+12-16)
  • (modified) llvm/test/CodeGen/AMDGPU/while-break.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/xor.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/zext-divergence-driven-isel.ll (+7-8)
  • (modified) llvm/test/CodeGen/BPF/objdump_cond_op_2.ll (+2-2)
  • (modified) llvm/test/CodeGen/Hexagon/late_instr.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-carried-1.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-conv3x3-nested.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi11.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi12.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi7.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-kernel-phi1.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-matmul-bitext.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-order-copies.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-order-deps7.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-reuse-phi-6.ll (+1-1)
  • (modified) llvm/test/CodeGen/NVPTX/atomics-b128.ll (+75-75)
  • (modified) llvm/test/CodeGen/NVPTX/atomics-sm70.ll (+20-20)
  • (modified) llvm/test/CodeGen/NVPTX/atomics-sm90.ll (+20-20)
  • (modified) llvm/test/CodeGen/NVPTX/atomics.ll (+6-6)
  • (modified) llvm/test/CodeGen/PowerPC/ctrloop-fp128.ll (+3-3)
  • (modified) llvm/test/CodeGen/PowerPC/licm-xxsplti.ll (+27-27)
  • (modified) llvm/test/CodeGen/PowerPC/loop-instr-form-prepare.ll (+3-5)
  • (modified) llvm/test/CodeGen/PowerPC/perfect-shuffle.ll (+6-6)
  • (modified) llvm/test/CodeGen/PowerPC/sms-phi-1.ll (+3-2)
  • (modified) llvm/test/CodeGen/PowerPC/sms-phi-2.ll (+21-22)
  • (modified) llvm/test/CodeGen/RISCV/branch-on-zero.ll (+6-10)
  • (modified) llvm/test/CodeGen/RISCV/machine-pipeliner.ll (+23-23)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-exact-vlen.ll (+4-6)
  • (modified) llvm/test/CodeGen/RISCV/rvv/pr95865.ll (+21-22)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vandn-sdnode.ll (+33-33)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vcpop-shl-zext-opt.ll (+14-14)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vxrm-insert-out-of-loop.ll (+12-12)
  • (modified) llvm/test/CodeGen/SystemZ/atomicrmw-fadd-01.ll (+6-5)
  • (modified) llvm/test/CodeGen/SystemZ/atomicrmw-fsub-01.ll (+6-5)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/constbound.ll (+9-9)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/varying-outer-2d-reduction.ll (+24-26)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/while-loops.ll (+45-46)
  • (modified) llvm/test/CodeGen/Thumb2/mve-float32regloops.ll (+102-109)
  • (modified) llvm/test/CodeGen/Thumb2/mve-gather-increment.ll (+12-12)
  • (modified) llvm/test/CodeGen/Thumb2/mve-gather-scatter-optimisation.ll (+45-45)
  • (modified) llvm/test/CodeGen/Thumb2/mve-pipelineloops.ll (+26-26)
  • (modified) llvm/test/CodeGen/Thumb2/mve-shuffle.ll (+7-6)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vld4.ll (+7-6)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vmaxnma-commute.ll (+12-12)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vst4.ll (+7-7)
  • (modified) llvm/test/CodeGen/Thumb2/pacbti-m-vla.ll (+1-1)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-shift-in-loop.ll (+6-8)
  • (modified) llvm/test/CodeGen/X86/3addr-16bit.ll (+24-24)
  • (modified) llvm/test/CodeGen/X86/atomic-rm-bit-test.ll (+13-9)
  • (modified) llvm/test/CodeGen/X86/atomicrmw-fadd-fp-vector.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/bitcast-vector-bool.ll (+16-16)
  • (modified) llvm/test/CodeGen/X86/coalescer-dead-flag-verifier-error.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/dag-update-nodetomatch.ll (+52-48)
  • (modified) llvm/test/CodeGen/X86/fold-loop-of-urem.ll (+38-43)
  • (modified) llvm/test/CodeGen/X86/freeze-binary.ll (+14-12)
  • (modified) llvm/test/CodeGen/X86/i128-mul.ll (+87-91)
  • (modified) llvm/test/CodeGen/X86/icmp-abs-C.ll (+11-11)
  • (modified) llvm/test/CodeGen/X86/masked_gather_scatter.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/midpoint-int.ll (+14-14)
  • (modified) llvm/test/CodeGen/X86/mmx-arith.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/mul-constant-i16.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/mul-constant-i32.ll (+8-8)
  • (modified) llvm/test/CodeGen/X86/mul-constant-i8.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/optimize-max-0.ll (+107-104)
  • (modified) llvm/test/CodeGen/X86/parity.ll (+15-15)
  • (modified) llvm/test/CodeGen/X86/rotate-extract.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/smul_fix.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/sshl_sat.ll (+20-20)
  • (modified) llvm/test/CodeGen/X86/sshl_sat_vec.ll (+56-57)
  • (modified) llvm/test/CodeGen/X86/stackmap.ll (+6-3)
  • (modified) llvm/test/CodeGen/X86/subvectorwise-store-of-vector-splat.ll (+105-105)
  • (modified) llvm/test/CodeGen/X86/twoaddr-lea.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/umul_fix.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/ushl_sat.ll (+14-14)
  • (modified) llvm/test/CodeGen/X86/ushl_sat_vec.ll (+55-56)
  • (modified) llvm/test/CodeGen/X86/vector-mulfix-legalize.ll (+17-17)
  • (modified) llvm/test/CodeGen/X86/vector-reduce-xor-bool.ll (+80-80)
  • (modified) llvm/test/CodeGen/X86/wide-scalar-shift-by-byte-multiple-legalization.ll (+3023-3058)
  • (modified) llvm/test/CodeGen/X86/wide-scalar-shift-legalization.ll (+668-676)
  • (modified) llvm/test/CodeGen/X86/widen-load-of-small-alloca-with-zero-upper-half.ll (+165-163)
  • (modified) llvm/test/CodeGen/X86/widen-load-of-small-alloca.ll (+49-46)
  • (modified) llvm/test/CodeGen/X86/x86-shrink-wrapping.ll (+9-9)
  • (modified) llvm/test/CodeGen/X86/xor.ll (+66-66)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/X86/ivchain-X86.ll (+11-10)
diff --git a/llvm/lib/CodeGen/RegisterCoalescer.cpp b/llvm/lib/CodeGen/RegisterCoalescer.cpp
index 7ac1aef83777a..5bd38a916fe4d 100644
--- a/llvm/lib/CodeGen/RegisterCoalescer.cpp
+++ b/llvm/lib/CodeGen/RegisterCoalescer.cpp
@@ -81,7 +81,7 @@ static cl::opt<bool> EnableJoining("join-liveintervals",
 
 static cl::opt<bool> UseTerminalRule("terminal-rule",
                                      cl::desc("Apply the terminal rule"),
-                                     cl::init(false), cl::Hidden);
+                                     cl::init(true), cl::Hidden);
 
 /// Temporary flag to test critical edge unsplitting.
 static cl::opt<bool> EnableJoinSplits(
diff --git a/llvm/test/CodeGen/AArch64/build-vector-two-dup.ll b/llvm/test/CodeGen/AArch64/build-vector-two-dup.ll
index dbbfbea9176f6..f725c19081deb 100644
--- a/llvm/test/CodeGen/AArch64/build-vector-two-dup.ll
+++ b/llvm/test/CodeGen/AArch64/build-vector-two-dup.ll
@@ -188,11 +188,11 @@ entry:
 define <8 x i8> @test11(ptr nocapture noundef readonly %a, ptr nocapture noundef readonly %b) {
 ; CHECK-LABEL: test11:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    ld1r { v1.8b }, [x0]
-; CHECK-NEXT:    ld1r { v2.8b }, [x1]
-; CHECK-NEXT:    mov v0.16b, v1.16b
-; CHECK-NEXT:    mov v0.h[2], v2.h[0]
-; CHECK-NEXT:    mov v0.h[3], v1.h[0]
+; CHECK-NEXT:    ld1r { v0.8b }, [x0]
+; CHECK-NEXT:    ld1r { v1.8b }, [x1]
+; CHECK-NEXT:    fmov d2, d0
+; CHECK-NEXT:    mov v0.h[2], v1.h[0]
+; CHECK-NEXT:    mov v0.h[3], v2.h[0]
 ; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $q0
 ; CHECK-NEXT:    ret
 entry:
diff --git a/llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll b/llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll
index 3230c9e946da7..b3a7ec961b736 100644
--- a/llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll
+++ b/llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll
@@ -20,20 +20,17 @@ define i32 @sink_load_and_copy(i32 %n) {
 ; CHECK-NEXT:    b.lt .LBB0_3
 ; CHECK-NEXT:  // %bb.1: // %for.body.preheader
 ; CHECK-NEXT:    adrp x8, A
-; CHECK-NEXT:    mov w20, w19
-; CHECK-NEXT:    ldr w21, [x8, :lo12:A]
+; CHECK-NEXT:    mov w21, w19
+; CHECK-NEXT:    ldr w20, [x8, :lo12:A]
 ; CHECK-NEXT:  .LBB0_2: // %for.body
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    mov w0, w21
+; CHECK-NEXT:    mov w0, w20
 ; CHECK-NEXT:    bl _Z3usei
-; CHECK-NEXT:    sdiv w20, w20, w0
-; CHECK-NEXT:    subs w19, w19, #1
+; CHECK-NEXT:    sdiv w19, w19, w0
+; CHECK-NEXT:    subs w21, w21, #1
 ; CHECK-NEXT:    b.ne .LBB0_2
-; CHECK-NEXT:    b .LBB0_4
-; CHECK-NEXT:  .LBB0_3:
-; CHECK-NEXT:    mov w20, w19
-; CHECK-NEXT:  .LBB0_4: // %for.cond.cleanup
-; CHECK-NEXT:    mov w0, w20
+; CHECK-NEXT:  .LBB0_3: // %for.cond.cleanup
+; CHECK-NEXT:    mov w0, w19
 ; CHECK-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
 ; CHECK-NEXT:    ldp x30, x21, [sp], #32 // 16-byte Folded Reload
 ; CHECK-NEXT:    ret
@@ -82,15 +79,12 @@ define i32 @cant_sink_successive_call(i32 %n) {
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    mov w0, w20
 ; CHECK-NEXT:    bl _Z3usei
-; CHECK-NEXT:    sdiv w21, w21, w0
-; CHECK-NEXT:    subs w19, w19, #1
+; CHECK-NEXT:    sdiv w19, w19, w0
+; CHECK-NEXT:    subs w21, w21, #1
 ; CHECK-NEXT:    b.ne .LBB1_2
-; CHECK-NEXT:    b .LBB1_4
-; CHECK-NEXT:  .LBB1_3:
-; CHECK-NEXT:    mov w21, w19
-; CHECK-NEXT:  .LBB1_4: // %for.cond.cleanup
+; CHECK-NEXT:  .LBB1_3: // %for.cond.cleanup
+; CHECK-NEXT:    mov w0, w19
 ; CHECK-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
-; CHECK-NEXT:    mov w0, w21
 ; CHECK-NEXT:    ldp x30, x21, [sp], #32 // 16-byte Folded Reload
 ; CHECK-NEXT:    ret
 entry:
@@ -139,15 +133,12 @@ define i32 @cant_sink_successive_store(ptr nocapture readnone %store, i32 %n) {
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    mov w0, w20
 ; CHECK-NEXT:    bl _Z3usei
-; CHECK-NEXT:    sdiv w21, w21, w0
-; CHECK-NEXT:    subs w19, w19, #1
+; CHECK-NEXT:    sdiv w19, w19, w0
+; CHECK-NEXT:    subs w21, w21, #1
 ; CHECK-NEXT:    b.ne .LBB2_2
-; CHECK-NEXT:    b .LBB2_4
-; CHECK-NEXT:  .LBB2_3:
-; CHECK-NEXT:    mov w21, w19
-; CHECK-NEXT:  .LBB2_4: // %for.cond.cleanup
+; CHECK-NEXT:  .LBB2_3: // %for.cond.cleanup
+; CHECK-NEXT:    mov w0, w19
 ; CHECK-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
-; CHECK-NEXT:    mov w0, w21
 ; CHECK-NEXT:    ldp x30, x21, [sp], #32 // 16-byte Folded Reload
 ; CHECK-NEXT:    ret
 entry:
diff --git a/llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll b/llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll
index e7e109170d6a1..338084295fc7f 100644
--- a/llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll
+++ b/llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll
@@ -16,13 +16,12 @@ define i32 @test(ptr %ptr) {
 ; CHECK-NEXT:    mov w9, wzr
 ; CHECK-NEXT:  LBB0_1: ; %.thread
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    lsr w11, w9, #1
 ; CHECK-NEXT:    sub w10, w9, #1
-; CHECK-NEXT:    mov w9, w11
+; CHECK-NEXT:    lsr w9, w9, #1
 ; CHECK-NEXT:    tbnz w10, #0, LBB0_1
 ; CHECK-NEXT:  ; %bb.2: ; %bb343
 ; CHECK-NEXT:    and w9, w10, #0x1
-; CHECK-NEXT:    mov w0, #-1
+; CHECK-NEXT:    mov w0, #-1 ; =0xffffffff
 ; CHECK-NEXT:    str w9, [x8]
 ; CHECK-NEXT:    ret
 bb:
diff --git a/llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll b/llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll
index b947c943ba448..72f6646930624 100644
--- a/llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll
+++ b/llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll
@@ -151,12 +151,11 @@ define void @dont_coalesce_arg_f16(half %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $h0 killed $h0 killed $z0
 ; CHECK-NEXT:    str h0, [sp, #14] // 2-byte Folded Spill
+; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr h0, [sp, #14] // 2-byte Folded Reload
 ; CHECK-NEXT:    bl use_f16
@@ -190,12 +189,11 @@ define void @dont_coalesce_arg_f32(float %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $s0 killed $s0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $s0 killed $s0 killed $z0
 ; CHECK-NEXT:    str s0, [sp, #12] // 4-byte Folded Spill
+; CHECK-NEXT:    // kill: def $s0 killed $s0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr s0, [sp, #12] // 4-byte Folded Reload
 ; CHECK-NEXT:    bl use_f32
@@ -229,12 +227,11 @@ define void @dont_coalesce_arg_f64(double %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_f64
@@ -273,12 +270,11 @@ define void @dont_coalesce_arg_v1i8(<1 x i8> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v16i8
@@ -313,12 +309,11 @@ define void @dont_coalesce_arg_v1i16(<1 x i16> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8i16
@@ -353,12 +348,11 @@ define void @dont_coalesce_arg_v1i32(<1 x i32> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v4i32
@@ -393,12 +387,11 @@ define void @dont_coalesce_arg_v1i64(<1 x i64> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2i64
@@ -433,12 +426,11 @@ define void @dont_coalesce_arg_v1f16(<1 x half> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $h0 killed $h0 killed $z0
 ; CHECK-NEXT:    str h0, [sp, #14] // 2-byte Folded Spill
+; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr h0, [sp, #14] // 2-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8f16
@@ -513,12 +505,11 @@ define void @dont_coalesce_arg_v1f64(<1 x double> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2f64
@@ -557,12 +548,11 @@ define void @dont_coalesce_arg_v16i8(<16 x i8> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v16i8
@@ -596,12 +586,11 @@ define void @dont_coalesce_arg_v8i16(<8 x i16> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8i16
@@ -635,12 +624,11 @@ define void @dont_coalesce_arg_v4i32(<4 x i32> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v4i32
@@ -674,12 +662,11 @@ define void @dont_coalesce_arg_v2i64(<2 x i64> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2i64
@@ -713,12 +700,11 @@ define void @dont_coalesce_arg_v8f16(<8 x half> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8f16
@@ -752,12 +738,11 @@ define void @dont_coalesce_arg_v8bf16(<8 x bfloat> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8bf16
@@ -791,12 +776,11 @@ define void @dont_coalesce_arg_v4f32(<4 x float> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v4f32
@@ -830,12 +814,11 @@ define void @dont_coalesce_arg_v2f64(<2 x double> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2f64
diff --git a/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll b/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
index f2163ad15bafc..df88f37195ed6 100644
--- a/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
+++ b/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
@@ -129,12 +129,11 @@ define <2 x double> @streaming_compatible_with_neon_vectors(<2 x double> %arg) "
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    mrs x19, SVCR
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    mrs x19, SVCR
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
-; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
 ; CHECK-NEXT:    tbz w19, #0, .LBB4_2
 ; CHECK-NEXT:  // %bb.1:
 ; CHECK-NEXT:    smstop sm
diff --git a/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll b/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
index 6c6a691760af3..52a77cb396909 100644
--- a/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
+++ b/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
@@ -147,15 +147,15 @@ define <2 x float> @extract_v2f32_nxv16f32_2(<vscale x 16 x float> %arg) {
 define <4 x i1> @extract_v4i1_nxv32i1_0(<vscale x 32 x i1> %arg) {
 ; CHECK-LABEL: extract_v4i1_nxv32i1_0:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    mov z1.b, p0/z, #1 // =0x1
-; CHECK-NEXT:    umov w8, v1.b[1]
-; CHECK-NEXT:    mov v0.16b, v1.16b
-; CHECK-NEXT:    umov w9, v1.b[2]
+; CHECK-NEXT:    mov z0.b, p0/z, #1 // =0x1
+; CHECK-NEXT:    umov w8, v0.b[1]
+; CHECK-NEXT:    mov v1.16b, v0.16b
 ; CHECK-NEXT:    mov v0.h[1], w8
+; CHECK-NEXT:    umov w8, v1.b[2]
+; CHECK-NEXT:    mov v0.h[2], w8
 ; CHECK-NEXT:    umov w8, v1.b[3]
-; CHECK-NEXT:    mov v0.h[2], w9
 ; CHECK-NEXT:    mov v0.h[3], w8
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $q0
+; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    ret
   %ext = call <4 x i1> @llvm.vector.extract.v4i1.n...
[truncated]

@llvmbot
Member

llvmbot commented Oct 2, 2025

@llvm/pr-subscribers-llvm-regalloc

Author: Matt Arsenault (arsenm)

Changes

This appears to be a forgotten switch flip from 2015. Enabling the
terminal rule seems to do a nicer job with subregister copies. Most of
the test changes are improvements or neutral, and only a few are slight
regressions. The worst AMDGPU regressions are for true16 in the atomic
tests, but I think those are due to existing true16 issues.

I also had to hack many Hexagon tests to disable the rule, since I have
no idea how to update them. They appear to test the specific scheduling
and packet formation of later machine passes, so any change in the
incoming MIR likely hides whatever was originally intended. I'll open
an issue to fix up these tests once this lands.
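
For reference, the flipped default is the coalescer's hidden `terminal-rule` option (see the RegisterCoalescer.cpp hunk in the patch), so the old behavior remains reachable from the command line. A minimal sketch for comparing the two modes, assuming an `llc` built from this branch and `test.ll` standing in for any affected test input:

```shell
# Compare codegen with the terminal rule off (old default) vs. on (new default).
llc -mtriple=amdgcn -terminal-rule=0 -o old.s test.ll
llc -mtriple=amdgcn -terminal-rule=1 -o new.s test.ll
diff old.s new.s
```

Being a hidden `cl::opt`, the flag won't show up in `llc -help`, but it is still accepted on the command line.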


Patch is 1.65 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/161621.diff

152 Files Affected:

  • (modified) llvm/lib/CodeGen/RegisterCoalescer.cpp (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/build-vector-two-dup.ll (+5-5)
  • (modified) llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll (+15-24)
  • (modified) llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll (+2-3)
  • (modified) llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll (+34-51)
  • (modified) llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll (+2-3)
  • (modified) llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/sve-extract-fixed-vector.ll (+20-19)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-reshuffle.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-shuffles.ll (+36-36)
  • (modified) llvm/test/CodeGen/AArch64/sve-ptest-removal-sink.ll (+4-4)
  • (modified) llvm/test/CodeGen/AArch64/zext-to-tbl.ll (+46-46)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll (+19-19)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll (+180-194)
  • (modified) llvm/test/CodeGen/AMDGPU/and.ll (+47-53)
  • (modified) llvm/test/CodeGen/AMDGPU/bfe-patterns.ll (+42-42)
  • (modified) llvm/test/CodeGen/AMDGPU/bfi_nested.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/bfm.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/bitreverse.ll (+19-23)
  • (modified) llvm/test/CodeGen/AMDGPU/build_vector.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/combine-cond-add-sub.ll (+16-16)
  • (modified) llvm/test/CodeGen/AMDGPU/divergence-driven-buildvector.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/divergence-driven-sext-inreg.ll (+26-30)
  • (modified) llvm/test/CodeGen/AMDGPU/fabs.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/fdiv.ll (+39-45)
  • (modified) llvm/test/CodeGen/AMDGPU/fmin_legacy.ll (+11-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fnearbyint.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-fabs.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/fp_to_sint.ll (+40-45)
  • (modified) llvm/test/CodeGen/AMDGPU/fp_to_uint.ll (+21-27)
  • (modified) llvm/test/CodeGen/AMDGPU/fshl.ll (+9-10)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_elt.ll (+20-22)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ubfe.ll (+14-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp10.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp2.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log.ll (+18-19)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log10.ll (+18-19)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log2.ll (+3-5)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll (+1038-1012)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmax.ll (+1070-1042)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmin.ll (+1070-1042)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fsub.ll (+1224-1182)
  • (modified) llvm/test/CodeGen/AMDGPU/lshr.v2i16.ll (+12-13)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/max.ll (+40-46)
  • (modified) llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/memmove-var-size.ll (+204-204)
  • (modified) llvm/test/CodeGen/AMDGPU/mul_int24.ll (+60-69)
  • (modified) llvm/test/CodeGen/AMDGPU/mul_uint24-amdgcn.ll (+50-56)
  • (modified) llvm/test/CodeGen/AMDGPU/or.ll (+12-14)
  • (modified) llvm/test/CodeGen/AMDGPU/set-inactive-wwm-overwrite.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/sext-divergence-driven-isel.ll (+7-8)
  • (modified) llvm/test/CodeGen/AMDGPU/shl.v2i16.ll (+30-36)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4f32.v3f32.ll (+9-16)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4i32.v3i32.ll (+9-16)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4p3.v3p3.ll (+9-16)
  • (modified) llvm/test/CodeGen/AMDGPU/sign_extend.ll (+42-48)
  • (modified) llvm/test/CodeGen/AMDGPU/skip-if-dead.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/sminmax.v2i16.ll (+31-32)
  • (modified) llvm/test/CodeGen/AMDGPU/sub.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/sub.v2i16.ll (+18-23)
  • (modified) llvm/test/CodeGen/AMDGPU/udiv.ll (+30-32)
  • (modified) llvm/test/CodeGen/AMDGPU/udiv64.ll (+12-16)
  • (modified) llvm/test/CodeGen/AMDGPU/while-break.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/xor.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/zext-divergence-driven-isel.ll (+7-8)
  • (modified) llvm/test/CodeGen/BPF/objdump_cond_op_2.ll (+2-2)
  • (modified) llvm/test/CodeGen/Hexagon/late_instr.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-carried-1.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-conv3x3-nested.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi11.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi12.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi7.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-kernel-phi1.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-matmul-bitext.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-order-copies.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-order-deps7.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-reuse-phi-6.ll (+1-1)
  • (modified) llvm/test/CodeGen/NVPTX/atomics-b128.ll (+75-75)
  • (modified) llvm/test/CodeGen/NVPTX/atomics-sm70.ll (+20-20)
  • (modified) llvm/test/CodeGen/NVPTX/atomics-sm90.ll (+20-20)
  • (modified) llvm/test/CodeGen/NVPTX/atomics.ll (+6-6)
  • (modified) llvm/test/CodeGen/PowerPC/ctrloop-fp128.ll (+3-3)
  • (modified) llvm/test/CodeGen/PowerPC/licm-xxsplti.ll (+27-27)
  • (modified) llvm/test/CodeGen/PowerPC/loop-instr-form-prepare.ll (+3-5)
  • (modified) llvm/test/CodeGen/PowerPC/perfect-shuffle.ll (+6-6)
  • (modified) llvm/test/CodeGen/PowerPC/sms-phi-1.ll (+3-2)
  • (modified) llvm/test/CodeGen/PowerPC/sms-phi-2.ll (+21-22)
  • (modified) llvm/test/CodeGen/RISCV/branch-on-zero.ll (+6-10)
  • (modified) llvm/test/CodeGen/RISCV/machine-pipeliner.ll (+23-23)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-exact-vlen.ll (+4-6)
  • (modified) llvm/test/CodeGen/RISCV/rvv/pr95865.ll (+21-22)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vandn-sdnode.ll (+33-33)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vcpop-shl-zext-opt.ll (+14-14)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vxrm-insert-out-of-loop.ll (+12-12)
  • (modified) llvm/test/CodeGen/SystemZ/atomicrmw-fadd-01.ll (+6-5)
  • (modified) llvm/test/CodeGen/SystemZ/atomicrmw-fsub-01.ll (+6-5)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/constbound.ll (+9-9)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/varying-outer-2d-reduction.ll (+24-26)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/while-loops.ll (+45-46)
  • (modified) llvm/test/CodeGen/Thumb2/mve-float32regloops.ll (+102-109)
  • (modified) llvm/test/CodeGen/Thumb2/mve-gather-increment.ll (+12-12)
  • (modified) llvm/test/CodeGen/Thumb2/mve-gather-scatter-optimisation.ll (+45-45)
  • (modified) llvm/test/CodeGen/Thumb2/mve-pipelineloops.ll (+26-26)
  • (modified) llvm/test/CodeGen/Thumb2/mve-shuffle.ll (+7-6)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vld4.ll (+7-6)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vmaxnma-commute.ll (+12-12)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vst4.ll (+7-7)
  • (modified) llvm/test/CodeGen/Thumb2/pacbti-m-vla.ll (+1-1)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-shift-in-loop.ll (+6-8)
  • (modified) llvm/test/CodeGen/X86/3addr-16bit.ll (+24-24)
  • (modified) llvm/test/CodeGen/X86/atomic-rm-bit-test.ll (+13-9)
  • (modified) llvm/test/CodeGen/X86/atomicrmw-fadd-fp-vector.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/bitcast-vector-bool.ll (+16-16)
  • (modified) llvm/test/CodeGen/X86/coalescer-dead-flag-verifier-error.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/dag-update-nodetomatch.ll (+52-48)
  • (modified) llvm/test/CodeGen/X86/fold-loop-of-urem.ll (+38-43)
  • (modified) llvm/test/CodeGen/X86/freeze-binary.ll (+14-12)
  • (modified) llvm/test/CodeGen/X86/i128-mul.ll (+87-91)
  • (modified) llvm/test/CodeGen/X86/icmp-abs-C.ll (+11-11)
  • (modified) llvm/test/CodeGen/X86/masked_gather_scatter.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/midpoint-int.ll (+14-14)
  • (modified) llvm/test/CodeGen/X86/mmx-arith.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/mul-constant-i16.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/mul-constant-i32.ll (+8-8)
  • (modified) llvm/test/CodeGen/X86/mul-constant-i8.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/optimize-max-0.ll (+107-104)
  • (modified) llvm/test/CodeGen/X86/parity.ll (+15-15)
  • (modified) llvm/test/CodeGen/X86/rotate-extract.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/smul_fix.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/sshl_sat.ll (+20-20)
  • (modified) llvm/test/CodeGen/X86/sshl_sat_vec.ll (+56-57)
  • (modified) llvm/test/CodeGen/X86/stackmap.ll (+6-3)
  • (modified) llvm/test/CodeGen/X86/subvectorwise-store-of-vector-splat.ll (+105-105)
  • (modified) llvm/test/CodeGen/X86/twoaddr-lea.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/umul_fix.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/ushl_sat.ll (+14-14)
  • (modified) llvm/test/CodeGen/X86/ushl_sat_vec.ll (+55-56)
  • (modified) llvm/test/CodeGen/X86/vector-mulfix-legalize.ll (+17-17)
  • (modified) llvm/test/CodeGen/X86/vector-reduce-xor-bool.ll (+80-80)
  • (modified) llvm/test/CodeGen/X86/wide-scalar-shift-by-byte-multiple-legalization.ll (+3023-3058)
  • (modified) llvm/test/CodeGen/X86/wide-scalar-shift-legalization.ll (+668-676)
  • (modified) llvm/test/CodeGen/X86/widen-load-of-small-alloca-with-zero-upper-half.ll (+165-163)
  • (modified) llvm/test/CodeGen/X86/widen-load-of-small-alloca.ll (+49-46)
  • (modified) llvm/test/CodeGen/X86/x86-shrink-wrapping.ll (+9-9)
  • (modified) llvm/test/CodeGen/X86/xor.ll (+66-66)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/X86/ivchain-X86.ll (+11-10)
diff --git a/llvm/lib/CodeGen/RegisterCoalescer.cpp b/llvm/lib/CodeGen/RegisterCoalescer.cpp
index 7ac1aef83777a..5bd38a916fe4d 100644
--- a/llvm/lib/CodeGen/RegisterCoalescer.cpp
+++ b/llvm/lib/CodeGen/RegisterCoalescer.cpp
@@ -81,7 +81,7 @@ static cl::opt<bool> EnableJoining("join-liveintervals",
 
 static cl::opt<bool> UseTerminalRule("terminal-rule",
                                      cl::desc("Apply the terminal rule"),
-                                     cl::init(false), cl::Hidden);
+                                     cl::init(true), cl::Hidden);
 
 /// Temporary flag to test critical edge unsplitting.
 static cl::opt<bool> EnableJoinSplits(
diff --git a/llvm/test/CodeGen/AArch64/build-vector-two-dup.ll b/llvm/test/CodeGen/AArch64/build-vector-two-dup.ll
index dbbfbea9176f6..f725c19081deb 100644
--- a/llvm/test/CodeGen/AArch64/build-vector-two-dup.ll
+++ b/llvm/test/CodeGen/AArch64/build-vector-two-dup.ll
@@ -188,11 +188,11 @@ entry:
 define <8 x i8> @test11(ptr nocapture noundef readonly %a, ptr nocapture noundef readonly %b) {
 ; CHECK-LABEL: test11:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    ld1r { v1.8b }, [x0]
-; CHECK-NEXT:    ld1r { v2.8b }, [x1]
-; CHECK-NEXT:    mov v0.16b, v1.16b
-; CHECK-NEXT:    mov v0.h[2], v2.h[0]
-; CHECK-NEXT:    mov v0.h[3], v1.h[0]
+; CHECK-NEXT:    ld1r { v0.8b }, [x0]
+; CHECK-NEXT:    ld1r { v1.8b }, [x1]
+; CHECK-NEXT:    fmov d2, d0
+; CHECK-NEXT:    mov v0.h[2], v1.h[0]
+; CHECK-NEXT:    mov v0.h[3], v2.h[0]
 ; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $q0
 ; CHECK-NEXT:    ret
 entry:
diff --git a/llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll b/llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll
index 3230c9e946da7..b3a7ec961b736 100644
--- a/llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll
+++ b/llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll
@@ -20,20 +20,17 @@ define i32 @sink_load_and_copy(i32 %n) {
 ; CHECK-NEXT:    b.lt .LBB0_3
 ; CHECK-NEXT:  // %bb.1: // %for.body.preheader
 ; CHECK-NEXT:    adrp x8, A
-; CHECK-NEXT:    mov w20, w19
-; CHECK-NEXT:    ldr w21, [x8, :lo12:A]
+; CHECK-NEXT:    mov w21, w19
+; CHECK-NEXT:    ldr w20, [x8, :lo12:A]
 ; CHECK-NEXT:  .LBB0_2: // %for.body
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    mov w0, w21
+; CHECK-NEXT:    mov w0, w20
 ; CHECK-NEXT:    bl _Z3usei
-; CHECK-NEXT:    sdiv w20, w20, w0
-; CHECK-NEXT:    subs w19, w19, #1
+; CHECK-NEXT:    sdiv w19, w19, w0
+; CHECK-NEXT:    subs w21, w21, #1
 ; CHECK-NEXT:    b.ne .LBB0_2
-; CHECK-NEXT:    b .LBB0_4
-; CHECK-NEXT:  .LBB0_3:
-; CHECK-NEXT:    mov w20, w19
-; CHECK-NEXT:  .LBB0_4: // %for.cond.cleanup
-; CHECK-NEXT:    mov w0, w20
+; CHECK-NEXT:  .LBB0_3: // %for.cond.cleanup
+; CHECK-NEXT:    mov w0, w19
 ; CHECK-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
 ; CHECK-NEXT:    ldp x30, x21, [sp], #32 // 16-byte Folded Reload
 ; CHECK-NEXT:    ret
@@ -82,15 +79,12 @@ define i32 @cant_sink_successive_call(i32 %n) {
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    mov w0, w20
 ; CHECK-NEXT:    bl _Z3usei
-; CHECK-NEXT:    sdiv w21, w21, w0
-; CHECK-NEXT:    subs w19, w19, #1
+; CHECK-NEXT:    sdiv w19, w19, w0
+; CHECK-NEXT:    subs w21, w21, #1
 ; CHECK-NEXT:    b.ne .LBB1_2
-; CHECK-NEXT:    b .LBB1_4
-; CHECK-NEXT:  .LBB1_3:
-; CHECK-NEXT:    mov w21, w19
-; CHECK-NEXT:  .LBB1_4: // %for.cond.cleanup
+; CHECK-NEXT:  .LBB1_3: // %for.cond.cleanup
+; CHECK-NEXT:    mov w0, w19
 ; CHECK-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
-; CHECK-NEXT:    mov w0, w21
 ; CHECK-NEXT:    ldp x30, x21, [sp], #32 // 16-byte Folded Reload
 ; CHECK-NEXT:    ret
 entry:
@@ -139,15 +133,12 @@ define i32 @cant_sink_successive_store(ptr nocapture readnone %store, i32 %n) {
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    mov w0, w20
 ; CHECK-NEXT:    bl _Z3usei
-; CHECK-NEXT:    sdiv w21, w21, w0
-; CHECK-NEXT:    subs w19, w19, #1
+; CHECK-NEXT:    sdiv w19, w19, w0
+; CHECK-NEXT:    subs w21, w21, #1
 ; CHECK-NEXT:    b.ne .LBB2_2
-; CHECK-NEXT:    b .LBB2_4
-; CHECK-NEXT:  .LBB2_3:
-; CHECK-NEXT:    mov w21, w19
-; CHECK-NEXT:  .LBB2_4: // %for.cond.cleanup
+; CHECK-NEXT:  .LBB2_3: // %for.cond.cleanup
+; CHECK-NEXT:    mov w0, w19
 ; CHECK-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
-; CHECK-NEXT:    mov w0, w21
 ; CHECK-NEXT:    ldp x30, x21, [sp], #32 // 16-byte Folded Reload
 ; CHECK-NEXT:    ret
 entry:
diff --git a/llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll b/llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll
index e7e109170d6a1..338084295fc7f 100644
--- a/llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll
+++ b/llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll
@@ -16,13 +16,12 @@ define i32 @test(ptr %ptr) {
 ; CHECK-NEXT:    mov w9, wzr
 ; CHECK-NEXT:  LBB0_1: ; %.thread
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    lsr w11, w9, #1
 ; CHECK-NEXT:    sub w10, w9, #1
-; CHECK-NEXT:    mov w9, w11
+; CHECK-NEXT:    lsr w9, w9, #1
 ; CHECK-NEXT:    tbnz w10, #0, LBB0_1
 ; CHECK-NEXT:  ; %bb.2: ; %bb343
 ; CHECK-NEXT:    and w9, w10, #0x1
-; CHECK-NEXT:    mov w0, #-1
+; CHECK-NEXT:    mov w0, #-1 ; =0xffffffff
 ; CHECK-NEXT:    str w9, [x8]
 ; CHECK-NEXT:    ret
 bb:
diff --git a/llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll b/llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll
index b947c943ba448..72f6646930624 100644
--- a/llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll
+++ b/llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll
@@ -151,12 +151,11 @@ define void @dont_coalesce_arg_f16(half %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $h0 killed $h0 killed $z0
 ; CHECK-NEXT:    str h0, [sp, #14] // 2-byte Folded Spill
+; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr h0, [sp, #14] // 2-byte Folded Reload
 ; CHECK-NEXT:    bl use_f16
@@ -190,12 +189,11 @@ define void @dont_coalesce_arg_f32(float %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $s0 killed $s0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $s0 killed $s0 killed $z0
 ; CHECK-NEXT:    str s0, [sp, #12] // 4-byte Folded Spill
+; CHECK-NEXT:    // kill: def $s0 killed $s0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr s0, [sp, #12] // 4-byte Folded Reload
 ; CHECK-NEXT:    bl use_f32
@@ -229,12 +227,11 @@ define void @dont_coalesce_arg_f64(double %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_f64
@@ -273,12 +270,11 @@ define void @dont_coalesce_arg_v1i8(<1 x i8> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v16i8
@@ -313,12 +309,11 @@ define void @dont_coalesce_arg_v1i16(<1 x i16> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8i16
@@ -353,12 +348,11 @@ define void @dont_coalesce_arg_v1i32(<1 x i32> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v4i32
@@ -393,12 +387,11 @@ define void @dont_coalesce_arg_v1i64(<1 x i64> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2i64
@@ -433,12 +426,11 @@ define void @dont_coalesce_arg_v1f16(<1 x half> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $h0 killed $h0 killed $z0
 ; CHECK-NEXT:    str h0, [sp, #14] // 2-byte Folded Spill
+; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr h0, [sp, #14] // 2-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8f16
@@ -513,12 +505,11 @@ define void @dont_coalesce_arg_v1f64(<1 x double> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2f64
@@ -557,12 +548,11 @@ define void @dont_coalesce_arg_v16i8(<16 x i8> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v16i8
@@ -596,12 +586,11 @@ define void @dont_coalesce_arg_v8i16(<8 x i16> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8i16
@@ -635,12 +624,11 @@ define void @dont_coalesce_arg_v4i32(<4 x i32> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v4i32
@@ -674,12 +662,11 @@ define void @dont_coalesce_arg_v2i64(<2 x i64> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2i64
@@ -713,12 +700,11 @@ define void @dont_coalesce_arg_v8f16(<8 x half> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8f16
@@ -752,12 +738,11 @@ define void @dont_coalesce_arg_v8bf16(<8 x bfloat> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8bf16
@@ -791,12 +776,11 @@ define void @dont_coalesce_arg_v4f32(<4 x float> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v4f32
@@ -830,12 +814,11 @@ define void @dont_coalesce_arg_v2f64(<2 x double> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2f64
diff --git a/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll b/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
index f2163ad15bafc..df88f37195ed6 100644
--- a/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
+++ b/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
@@ -129,12 +129,11 @@ define <2 x double> @streaming_compatible_with_neon_vectors(<2 x double> %arg) "
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    mrs x19, SVCR
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    mrs x19, SVCR
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
-; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
 ; CHECK-NEXT:    tbz w19, #0, .LBB4_2
 ; CHECK-NEXT:  // %bb.1:
 ; CHECK-NEXT:    smstop sm
diff --git a/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll b/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
index 6c6a691760af3..52a77cb396909 100644
--- a/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
+++ b/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
@@ -147,15 +147,15 @@ define <2 x float> @extract_v2f32_nxv16f32_2(<vscale x 16 x float> %arg) {
 define <4 x i1> @extract_v4i1_nxv32i1_0(<vscale x 32 x i1> %arg) {
 ; CHECK-LABEL: extract_v4i1_nxv32i1_0:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    mov z1.b, p0/z, #1 // =0x1
-; CHECK-NEXT:    umov w8, v1.b[1]
-; CHECK-NEXT:    mov v0.16b, v1.16b
-; CHECK-NEXT:    umov w9, v1.b[2]
+; CHECK-NEXT:    mov z0.b, p0/z, #1 // =0x1
+; CHECK-NEXT:    umov w8, v0.b[1]
+; CHECK-NEXT:    mov v1.16b, v0.16b
 ; CHECK-NEXT:    mov v0.h[1], w8
+; CHECK-NEXT:    umov w8, v1.b[2]
+; CHECK-NEXT:    mov v0.h[2], w8
 ; CHECK-NEXT:    umov w8, v1.b[3]
-; CHECK-NEXT:    mov v0.h[2], w9
 ; CHECK-NEXT:    mov v0.h[3], w8
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $q0
+; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    ret
   %ext = call <4 x i1> @llvm.vector.extract.v4i1.n...
[truncated]

@llvmbot
Member

llvmbot commented Oct 2, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

Changes

This appears to be a forgotten switch flip from 2015. It
seems to do a nicer job with subregister copies. Most of the
test changes are improvements or neutral; only a few are
slight regressions. The worst AMDGPU regressions are for true16
in the atomic tests, but I think that's due to existing true16
issues.

I also had to hack many Hexagon tests to disable the rule. I have
no idea how to update these tests. They appear to be testing specific
scheduling and packet formation of later machine passes, so any change
in the incoming MIR is likely hiding whatever was originally intended.
I'll open an issue to fix up these tests once this lands.
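The flipped default is still a hidden `cl::opt` (`-terminal-rule`), so an individual lit test can pin the old coalescer behavior by passing the flag explicitly. A hedged sketch of what such a RUN-line tweak might look like — the exact triple and flags used in the actual Hexagon tests may differ:

```
; Hypothetical lit RUN line: keep this test on the pre-patch behavior
; by explicitly disabling the terminal rule.
; RUN: llc -march=hexagon -terminal-rule=0 < %s | FileCheck %s
```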


Patch is 1.65 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/161621.diff

152 Files Affected:

  • (modified) llvm/lib/CodeGen/RegisterCoalescer.cpp (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/build-vector-two-dup.ll (+5-5)
  • (modified) llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll (+15-24)
  • (modified) llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll (+2-3)
  • (modified) llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll (+34-51)
  • (modified) llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll (+2-3)
  • (modified) llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/sve-extract-fixed-vector.ll (+20-19)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-reshuffle.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-shuffles.ll (+36-36)
  • (modified) llvm/test/CodeGen/AArch64/sve-ptest-removal-sink.ll (+4-4)
  • (modified) llvm/test/CodeGen/AArch64/zext-to-tbl.ll (+46-46)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll (+19-19)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll (+180-194)
  • (modified) llvm/test/CodeGen/AMDGPU/and.ll (+47-53)
  • (modified) llvm/test/CodeGen/AMDGPU/bfe-patterns.ll (+42-42)
  • (modified) llvm/test/CodeGen/AMDGPU/bfi_nested.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/bfm.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/bitreverse.ll (+19-23)
  • (modified) llvm/test/CodeGen/AMDGPU/build_vector.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/combine-cond-add-sub.ll (+16-16)
  • (modified) llvm/test/CodeGen/AMDGPU/divergence-driven-buildvector.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/divergence-driven-sext-inreg.ll (+26-30)
  • (modified) llvm/test/CodeGen/AMDGPU/fabs.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/fdiv.ll (+39-45)
  • (modified) llvm/test/CodeGen/AMDGPU/fmin_legacy.ll (+11-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fnearbyint.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-fabs.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/fp_to_sint.ll (+40-45)
  • (modified) llvm/test/CodeGen/AMDGPU/fp_to_uint.ll (+21-27)
  • (modified) llvm/test/CodeGen/AMDGPU/fshl.ll (+9-10)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_elt.ll (+20-22)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ubfe.ll (+14-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp10.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp2.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log.ll (+18-19)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log10.ll (+18-19)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log2.ll (+3-5)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll (+1038-1012)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmax.ll (+1070-1042)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmin.ll (+1070-1042)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fsub.ll (+1224-1182)
  • (modified) llvm/test/CodeGen/AMDGPU/lshr.v2i16.ll (+12-13)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/max.ll (+40-46)
  • (modified) llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/memmove-var-size.ll (+204-204)
  • (modified) llvm/test/CodeGen/AMDGPU/mul_int24.ll (+60-69)
  • (modified) llvm/test/CodeGen/AMDGPU/mul_uint24-amdgcn.ll (+50-56)
  • (modified) llvm/test/CodeGen/AMDGPU/or.ll (+12-14)
  • (modified) llvm/test/CodeGen/AMDGPU/set-inactive-wwm-overwrite.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/sext-divergence-driven-isel.ll (+7-8)
  • (modified) llvm/test/CodeGen/AMDGPU/shl.v2i16.ll (+30-36)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4f32.v3f32.ll (+9-16)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4i32.v3i32.ll (+9-16)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4p3.v3p3.ll (+9-16)
  • (modified) llvm/test/CodeGen/AMDGPU/sign_extend.ll (+42-48)
  • (modified) llvm/test/CodeGen/AMDGPU/skip-if-dead.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/sminmax.v2i16.ll (+31-32)
  • (modified) llvm/test/CodeGen/AMDGPU/sub.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/sub.v2i16.ll (+18-23)
  • (modified) llvm/test/CodeGen/AMDGPU/udiv.ll (+30-32)
  • (modified) llvm/test/CodeGen/AMDGPU/udiv64.ll (+12-16)
  • (modified) llvm/test/CodeGen/AMDGPU/while-break.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/xor.ll (+6-7)
  • (modified) llvm/test/CodeGen/AMDGPU/zext-divergence-driven-isel.ll (+7-8)
  • (modified) llvm/test/CodeGen/BPF/objdump_cond_op_2.ll (+2-2)
  • (modified) llvm/test/CodeGen/Hexagon/late_instr.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-carried-1.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-conv3x3-nested.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi11.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi12.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-epilog-phi7.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-kernel-phi1.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-matmul-bitext.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-order-copies.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-order-deps7.ll (+1-1)
  • (modified) llvm/test/CodeGen/Hexagon/swp-reuse-phi-6.ll (+1-1)
  • (modified) llvm/test/CodeGen/NVPTX/atomics-b128.ll (+75-75)
  • (modified) llvm/test/CodeGen/NVPTX/atomics-sm70.ll (+20-20)
  • (modified) llvm/test/CodeGen/NVPTX/atomics-sm90.ll (+20-20)
  • (modified) llvm/test/CodeGen/NVPTX/atomics.ll (+6-6)
  • (modified) llvm/test/CodeGen/PowerPC/ctrloop-fp128.ll (+3-3)
  • (modified) llvm/test/CodeGen/PowerPC/licm-xxsplti.ll (+27-27)
  • (modified) llvm/test/CodeGen/PowerPC/loop-instr-form-prepare.ll (+3-5)
  • (modified) llvm/test/CodeGen/PowerPC/perfect-shuffle.ll (+6-6)
  • (modified) llvm/test/CodeGen/PowerPC/sms-phi-1.ll (+3-2)
  • (modified) llvm/test/CodeGen/PowerPC/sms-phi-2.ll (+21-22)
  • (modified) llvm/test/CodeGen/RISCV/branch-on-zero.ll (+6-10)
  • (modified) llvm/test/CodeGen/RISCV/machine-pipeliner.ll (+23-23)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-exact-vlen.ll (+4-6)
  • (modified) llvm/test/CodeGen/RISCV/rvv/pr95865.ll (+21-22)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vandn-sdnode.ll (+33-33)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vcpop-shl-zext-opt.ll (+14-14)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vxrm-insert-out-of-loop.ll (+12-12)
  • (modified) llvm/test/CodeGen/SystemZ/atomicrmw-fadd-01.ll (+6-5)
  • (modified) llvm/test/CodeGen/SystemZ/atomicrmw-fsub-01.ll (+6-5)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/constbound.ll (+9-9)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/varying-outer-2d-reduction.ll (+24-26)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/while-loops.ll (+45-46)
  • (modified) llvm/test/CodeGen/Thumb2/mve-float32regloops.ll (+102-109)
  • (modified) llvm/test/CodeGen/Thumb2/mve-gather-increment.ll (+12-12)
  • (modified) llvm/test/CodeGen/Thumb2/mve-gather-scatter-optimisation.ll (+45-45)
  • (modified) llvm/test/CodeGen/Thumb2/mve-pipelineloops.ll (+26-26)
  • (modified) llvm/test/CodeGen/Thumb2/mve-shuffle.ll (+7-6)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vld4.ll (+7-6)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vmaxnma-commute.ll (+12-12)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vst4.ll (+7-7)
  • (modified) llvm/test/CodeGen/Thumb2/pacbti-m-vla.ll (+1-1)
  • (modified) llvm/test/CodeGen/WebAssembly/simd-shift-in-loop.ll (+6-8)
  • (modified) llvm/test/CodeGen/X86/3addr-16bit.ll (+24-24)
  • (modified) llvm/test/CodeGen/X86/atomic-rm-bit-test.ll (+13-9)
  • (modified) llvm/test/CodeGen/X86/atomicrmw-fadd-fp-vector.ll (+2-1)
  • (modified) llvm/test/CodeGen/X86/bitcast-vector-bool.ll (+16-16)
  • (modified) llvm/test/CodeGen/X86/coalescer-dead-flag-verifier-error.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/dag-update-nodetomatch.ll (+52-48)
  • (modified) llvm/test/CodeGen/X86/fold-loop-of-urem.ll (+38-43)
  • (modified) llvm/test/CodeGen/X86/freeze-binary.ll (+14-12)
  • (modified) llvm/test/CodeGen/X86/i128-mul.ll (+87-91)
  • (modified) llvm/test/CodeGen/X86/icmp-abs-C.ll (+11-11)
  • (modified) llvm/test/CodeGen/X86/masked_gather_scatter.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/midpoint-int.ll (+14-14)
  • (modified) llvm/test/CodeGen/X86/mmx-arith.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/mul-constant-i16.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/mul-constant-i32.ll (+8-8)
  • (modified) llvm/test/CodeGen/X86/mul-constant-i8.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/optimize-max-0.ll (+107-104)
  • (modified) llvm/test/CodeGen/X86/parity.ll (+15-15)
  • (modified) llvm/test/CodeGen/X86/rotate-extract.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/smul_fix.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/sshl_sat.ll (+20-20)
  • (modified) llvm/test/CodeGen/X86/sshl_sat_vec.ll (+56-57)
  • (modified) llvm/test/CodeGen/X86/stackmap.ll (+6-3)
  • (modified) llvm/test/CodeGen/X86/subvectorwise-store-of-vector-splat.ll (+105-105)
  • (modified) llvm/test/CodeGen/X86/twoaddr-lea.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/umul_fix.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/ushl_sat.ll (+14-14)
  • (modified) llvm/test/CodeGen/X86/ushl_sat_vec.ll (+55-56)
  • (modified) llvm/test/CodeGen/X86/vector-mulfix-legalize.ll (+17-17)
  • (modified) llvm/test/CodeGen/X86/vector-reduce-xor-bool.ll (+80-80)
  • (modified) llvm/test/CodeGen/X86/wide-scalar-shift-by-byte-multiple-legalization.ll (+3023-3058)
  • (modified) llvm/test/CodeGen/X86/wide-scalar-shift-legalization.ll (+668-676)
  • (modified) llvm/test/CodeGen/X86/widen-load-of-small-alloca-with-zero-upper-half.ll (+165-163)
  • (modified) llvm/test/CodeGen/X86/widen-load-of-small-alloca.ll (+49-46)
  • (modified) llvm/test/CodeGen/X86/x86-shrink-wrapping.ll (+9-9)
  • (modified) llvm/test/CodeGen/X86/xor.ll (+66-66)
  • (modified) llvm/test/Transforms/LoopStrengthReduce/X86/ivchain-X86.ll (+11-10)
diff --git a/llvm/lib/CodeGen/RegisterCoalescer.cpp b/llvm/lib/CodeGen/RegisterCoalescer.cpp
index 7ac1aef83777a..5bd38a916fe4d 100644
--- a/llvm/lib/CodeGen/RegisterCoalescer.cpp
+++ b/llvm/lib/CodeGen/RegisterCoalescer.cpp
@@ -81,7 +81,7 @@ static cl::opt<bool> EnableJoining("join-liveintervals",
 
 static cl::opt<bool> UseTerminalRule("terminal-rule",
                                      cl::desc("Apply the terminal rule"),
-                                     cl::init(false), cl::Hidden);
+                                     cl::init(true), cl::Hidden);
 
 /// Temporary flag to test critical edge unsplitting.
 static cl::opt<bool> EnableJoinSplits(
diff --git a/llvm/test/CodeGen/AArch64/build-vector-two-dup.ll b/llvm/test/CodeGen/AArch64/build-vector-two-dup.ll
index dbbfbea9176f6..f725c19081deb 100644
--- a/llvm/test/CodeGen/AArch64/build-vector-two-dup.ll
+++ b/llvm/test/CodeGen/AArch64/build-vector-two-dup.ll
@@ -188,11 +188,11 @@ entry:
 define <8 x i8> @test11(ptr nocapture noundef readonly %a, ptr nocapture noundef readonly %b) {
 ; CHECK-LABEL: test11:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    ld1r { v1.8b }, [x0]
-; CHECK-NEXT:    ld1r { v2.8b }, [x1]
-; CHECK-NEXT:    mov v0.16b, v1.16b
-; CHECK-NEXT:    mov v0.h[2], v2.h[0]
-; CHECK-NEXT:    mov v0.h[3], v1.h[0]
+; CHECK-NEXT:    ld1r { v0.8b }, [x0]
+; CHECK-NEXT:    ld1r { v1.8b }, [x1]
+; CHECK-NEXT:    fmov d2, d0
+; CHECK-NEXT:    mov v0.h[2], v1.h[0]
+; CHECK-NEXT:    mov v0.h[3], v2.h[0]
 ; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $q0
 ; CHECK-NEXT:    ret
 entry:
diff --git a/llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll b/llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll
index 3230c9e946da7..b3a7ec961b736 100644
--- a/llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll
+++ b/llvm/test/CodeGen/AArch64/machine-licm-sink-instr.ll
@@ -20,20 +20,17 @@ define i32 @sink_load_and_copy(i32 %n) {
 ; CHECK-NEXT:    b.lt .LBB0_3
 ; CHECK-NEXT:  // %bb.1: // %for.body.preheader
 ; CHECK-NEXT:    adrp x8, A
-; CHECK-NEXT:    mov w20, w19
-; CHECK-NEXT:    ldr w21, [x8, :lo12:A]
+; CHECK-NEXT:    mov w21, w19
+; CHECK-NEXT:    ldr w20, [x8, :lo12:A]
 ; CHECK-NEXT:  .LBB0_2: // %for.body
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    mov w0, w21
+; CHECK-NEXT:    mov w0, w20
 ; CHECK-NEXT:    bl _Z3usei
-; CHECK-NEXT:    sdiv w20, w20, w0
-; CHECK-NEXT:    subs w19, w19, #1
+; CHECK-NEXT:    sdiv w19, w19, w0
+; CHECK-NEXT:    subs w21, w21, #1
 ; CHECK-NEXT:    b.ne .LBB0_2
-; CHECK-NEXT:    b .LBB0_4
-; CHECK-NEXT:  .LBB0_3:
-; CHECK-NEXT:    mov w20, w19
-; CHECK-NEXT:  .LBB0_4: // %for.cond.cleanup
-; CHECK-NEXT:    mov w0, w20
+; CHECK-NEXT:  .LBB0_3: // %for.cond.cleanup
+; CHECK-NEXT:    mov w0, w19
 ; CHECK-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
 ; CHECK-NEXT:    ldp x30, x21, [sp], #32 // 16-byte Folded Reload
 ; CHECK-NEXT:    ret
@@ -82,15 +79,12 @@ define i32 @cant_sink_successive_call(i32 %n) {
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    mov w0, w20
 ; CHECK-NEXT:    bl _Z3usei
-; CHECK-NEXT:    sdiv w21, w21, w0
-; CHECK-NEXT:    subs w19, w19, #1
+; CHECK-NEXT:    sdiv w19, w19, w0
+; CHECK-NEXT:    subs w21, w21, #1
 ; CHECK-NEXT:    b.ne .LBB1_2
-; CHECK-NEXT:    b .LBB1_4
-; CHECK-NEXT:  .LBB1_3:
-; CHECK-NEXT:    mov w21, w19
-; CHECK-NEXT:  .LBB1_4: // %for.cond.cleanup
+; CHECK-NEXT:  .LBB1_3: // %for.cond.cleanup
+; CHECK-NEXT:    mov w0, w19
 ; CHECK-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
-; CHECK-NEXT:    mov w0, w21
 ; CHECK-NEXT:    ldp x30, x21, [sp], #32 // 16-byte Folded Reload
 ; CHECK-NEXT:    ret
 entry:
@@ -139,15 +133,12 @@ define i32 @cant_sink_successive_store(ptr nocapture readnone %store, i32 %n) {
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    mov w0, w20
 ; CHECK-NEXT:    bl _Z3usei
-; CHECK-NEXT:    sdiv w21, w21, w0
-; CHECK-NEXT:    subs w19, w19, #1
+; CHECK-NEXT:    sdiv w19, w19, w0
+; CHECK-NEXT:    subs w21, w21, #1
 ; CHECK-NEXT:    b.ne .LBB2_2
-; CHECK-NEXT:    b .LBB2_4
-; CHECK-NEXT:  .LBB2_3:
-; CHECK-NEXT:    mov w21, w19
-; CHECK-NEXT:  .LBB2_4: // %for.cond.cleanup
+; CHECK-NEXT:  .LBB2_3: // %for.cond.cleanup
+; CHECK-NEXT:    mov w0, w19
 ; CHECK-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
-; CHECK-NEXT:    mov w0, w21
 ; CHECK-NEXT:    ldp x30, x21, [sp], #32 // 16-byte Folded Reload
 ; CHECK-NEXT:    ret
 entry:
diff --git a/llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll b/llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll
index e7e109170d6a1..338084295fc7f 100644
--- a/llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll
+++ b/llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll
@@ -16,13 +16,12 @@ define i32 @test(ptr %ptr) {
 ; CHECK-NEXT:    mov w9, wzr
 ; CHECK-NEXT:  LBB0_1: ; %.thread
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    lsr w11, w9, #1
 ; CHECK-NEXT:    sub w10, w9, #1
-; CHECK-NEXT:    mov w9, w11
+; CHECK-NEXT:    lsr w9, w9, #1
 ; CHECK-NEXT:    tbnz w10, #0, LBB0_1
 ; CHECK-NEXT:  ; %bb.2: ; %bb343
 ; CHECK-NEXT:    and w9, w10, #0x1
-; CHECK-NEXT:    mov w0, #-1
+; CHECK-NEXT:    mov w0, #-1 ; =0xffffffff
 ; CHECK-NEXT:    str w9, [x8]
 ; CHECK-NEXT:    ret
 bb:
diff --git a/llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll b/llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll
index b947c943ba448..72f6646930624 100644
--- a/llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll
+++ b/llvm/test/CodeGen/AArch64/sme-pstate-sm-changing-call-disable-coalescing.ll
@@ -151,12 +151,11 @@ define void @dont_coalesce_arg_f16(half %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $h0 killed $h0 killed $z0
 ; CHECK-NEXT:    str h0, [sp, #14] // 2-byte Folded Spill
+; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr h0, [sp, #14] // 2-byte Folded Reload
 ; CHECK-NEXT:    bl use_f16
@@ -190,12 +189,11 @@ define void @dont_coalesce_arg_f32(float %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $s0 killed $s0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $s0 killed $s0 killed $z0
 ; CHECK-NEXT:    str s0, [sp, #12] // 4-byte Folded Spill
+; CHECK-NEXT:    // kill: def $s0 killed $s0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr s0, [sp, #12] // 4-byte Folded Reload
 ; CHECK-NEXT:    bl use_f32
@@ -229,12 +227,11 @@ define void @dont_coalesce_arg_f64(double %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_f64
@@ -273,12 +270,11 @@ define void @dont_coalesce_arg_v1i8(<1 x i8> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v16i8
@@ -313,12 +309,11 @@ define void @dont_coalesce_arg_v1i16(<1 x i16> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8i16
@@ -353,12 +348,11 @@ define void @dont_coalesce_arg_v1i32(<1 x i32> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v4i32
@@ -393,12 +387,11 @@ define void @dont_coalesce_arg_v1i64(<1 x i64> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2i64
@@ -433,12 +426,11 @@ define void @dont_coalesce_arg_v1f16(<1 x half> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $h0 killed $h0 killed $z0
 ; CHECK-NEXT:    str h0, [sp, #14] // 2-byte Folded Spill
+; CHECK-NEXT:    // kill: def $h0 killed $h0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr h0, [sp, #14] // 2-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8f16
@@ -513,12 +505,11 @@ define void @dont_coalesce_arg_v1f64(<1 x double> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    str d0, [sp, #8] // 8-byte Folded Spill
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr d0, [sp, #8] // 8-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2f64
@@ -557,12 +548,11 @@ define void @dont_coalesce_arg_v16i8(<16 x i8> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v16i8
@@ -596,12 +586,11 @@ define void @dont_coalesce_arg_v8i16(<8 x i16> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8i16
@@ -635,12 +624,11 @@ define void @dont_coalesce_arg_v4i32(<4 x i32> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v4i32
@@ -674,12 +662,11 @@ define void @dont_coalesce_arg_v2i64(<2 x i64> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2i64
@@ -713,12 +700,11 @@ define void @dont_coalesce_arg_v8f16(<8 x half> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8f16
@@ -752,12 +738,11 @@ define void @dont_coalesce_arg_v8bf16(<8 x bfloat> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v8bf16
@@ -791,12 +776,11 @@ define void @dont_coalesce_arg_v4f32(<4 x float> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v4f32
@@ -830,12 +814,11 @@ define void @dont_coalesce_arg_v2f64(<2 x double> %arg, ptr %ptr) #0 {
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
-; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    mov x19, x0
-; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
 ; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
+; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
 ; CHECK-NEXT:    smstop sm
 ; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
 ; CHECK-NEXT:    bl use_v2f64
diff --git a/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll b/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
index f2163ad15bafc..df88f37195ed6 100644
--- a/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
+++ b/llvm/test/CodeGen/AArch64/sme-streaming-compatible-interface.ll
@@ -129,12 +129,11 @@ define <2 x double> @streaming_compatible_with_neon_vectors(<2 x double> %arg) "
 ; CHECK-NEXT:    stp x30, x19, [sp, #80] // 16-byte Folded Spill
 ; CHECK-NEXT:    sub sp, sp, #16
 ; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
+; CHECK-NEXT:    mrs x19, SVCR
 ; CHECK-NEXT:    add x8, sp, #16
 ; CHECK-NEXT:    // kill: def $q0 killed $q0 def $z0
 ; CHECK-NEXT:    str z0, [x8] // 16-byte Folded Spill
-; CHECK-NEXT:    mrs x19, SVCR
-; CHECK-NEXT:    // kill: def $q0 killed $q0 killed $z0
-; CHECK-NEXT:    str q0, [sp] // 16-byte Folded Spill
 ; CHECK-NEXT:    tbz w19, #0, .LBB4_2
 ; CHECK-NEXT:  // %bb.1:
 ; CHECK-NEXT:    smstop sm
diff --git a/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll b/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
index 6c6a691760af3..52a77cb396909 100644
--- a/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
+++ b/llvm/test/CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
@@ -147,15 +147,15 @@ define <2 x float> @extract_v2f32_nxv16f32_2(<vscale x 16 x float> %arg) {
 define <4 x i1> @extract_v4i1_nxv32i1_0(<vscale x 32 x i1> %arg) {
 ; CHECK-LABEL: extract_v4i1_nxv32i1_0:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    mov z1.b, p0/z, #1 // =0x1
-; CHECK-NEXT:    umov w8, v1.b[1]
-; CHECK-NEXT:    mov v0.16b, v1.16b
-; CHECK-NEXT:    umov w9, v1.b[2]
+; CHECK-NEXT:    mov z0.b, p0/z, #1 // =0x1
+; CHECK-NEXT:    umov w8, v0.b[1]
+; CHECK-NEXT:    mov v1.16b, v0.16b
 ; CHECK-NEXT:    mov v0.h[1], w8
+; CHECK-NEXT:    umov w8, v1.b[2]
+; CHECK-NEXT:    mov v0.h[2], w8
 ; CHECK-NEXT:    umov w8, v1.b[3]
-; CHECK-NEXT:    mov v0.h[2], w9
 ; CHECK-NEXT:    mov v0.h[3], w8
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $q0
+; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
 ; CHECK-NEXT:    ret
   %ext = call <4 x i1> @llvm.vector.extract.v4i1.n...
[truncated]

Collaborator

@qcolombet qcolombet left a comment

To give an opportunity for targets to adopt the terminal rule on their own, I think it makes sense to have a target hook for it, like what we do with join-globalcopies and ::enableJoinGlobalCopies.

I don't remember why I didn't flip the switch when I introduced this rule, so I want to make sure we are friendly with downstream users.

@arsenm
Contributor Author

arsenm commented Oct 2, 2025

To give an opportunity for targets to adopt the terminal rule on their own, I think it make sense to have a target hook for it, like what we do with join-globalcopies and ::enableJoinGlobalCopies

I specifically think we should not have a control for this, and that we should have a policy against introducing this style of target hook. I think these are lazy shortcuts to avoid touching targets that whoever adds the feature doesn't care about, and the project is worse off for it. The reality of target maintenance is that this will never be implemented by any target. If we wait several years, a few might discover it and turn it on.

@preames
Collaborator

preames commented Oct 2, 2025

I specifically think we should not have a control for this, and that we should have a policy against introducing this style of target hook. I think these are lazy shortcuts to avoid touching targets that whoever adds the feature doesn't care about, and the project is worse off for it. The reality of target maintenance is that this will never be implemented by any target. If we wait several years, a few might discover it and turn it on.

@arsenm I generally agree with you here, but I think this is a case where adding a hook would make review much easier. I'd propose adding the hook, enabling it for AMDGPU, then doing individual changes for the other backends, then (if desired) removing the hook.

LGTM from the RISC-V perspective, and the change generally seems reasonable.

This appears to be a forgotten switch flip from 2015. This
seems to do a nicer job with subregister copies. Most of the
test changes are improvements or neutral; only a few are
slight regressions. The worst AMDGPU regressions are for true16
in the atomic tests, but I think that's due to existing true16
issues.

I also had to hack many Hexagon tests to disable the rule. I have
no idea how to update these tests. They appear to be testing specific
scheduling and packet formation of later machine passes, so any change
in the incoming MIR is likely hiding whatever was originally intended.
I'll open an issue to fix up these tests once this lands.
@arsenm arsenm force-pushed the users/arsenm/register-coalescer/enable-terminal-rule branch from a2a66a4 to c8e1e32 on November 1, 2025 02:51
@arsenm arsenm changed the title RegisterCoalescer: Enable terminal rule by default RegisterCoalescer: Enable terminal rule by default for AMDGPU Nov 1, 2025
@arsenm
Contributor Author

arsenm commented Nov 4, 2025

ping

1 similar comment
@arsenm
Contributor Author

arsenm commented Nov 10, 2025

ping

Collaborator

@qcolombet qcolombet left a comment

LGTM

@arsenm arsenm merged commit e95f6fa into main Nov 10, 2025
10 checks passed
@arsenm arsenm deleted the users/arsenm/register-coalescer/enable-terminal-rule branch November 10, 2025 17:37
@llvm-ci
Collaborator

llvm-ci commented Nov 10, 2025

LLVM Buildbot has detected a new failure on builder ml-opt-devrel-x86-64 running on ml-opt-devrel-x86-64-b2 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/175/builds/28529

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/reg-coalescer-subreg-liveness.mir' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 2
/b/ml-opt-devrel-x86-64-b1/build/bin/llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1250 -run-pass=register-coalescer -verify-coalescing -o - /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/reg-coalescer-subreg-liveness.mir | /b/ml-opt-devrel-x86-64-b1/build/bin/FileCheck /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/reg-coalescer-subreg-liveness.mir
# executed command: /b/ml-opt-devrel-x86-64-b1/build/bin/llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1250 -run-pass=register-coalescer -verify-coalescing -o - /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/reg-coalescer-subreg-liveness.mir
# executed command: /b/ml-opt-devrel-x86-64-b1/build/bin/FileCheck /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/reg-coalescer-subreg-liveness.mir
# .---command stderr------------
# | /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/reg-coalescer-subreg-liveness.mir:19:16: error: CHECK-NEXT: is not on the line after the previous match
# |  ; CHECK-NEXT: undef [[S_MOV_B32_1:%[0-9]+]].sub0:sgpr_256 = S_MOV_B32 0
# |                ^
# | <stdin>:145:2: note: 'next' match was here
# |  undef %5.sub0:sgpr_256 = S_MOV_B32 0
# |  ^
# | <stdin>:143:38: note: previous match ended here
# |  undef %3.sub0:sgpr_128 = S_MOV_B32 1
# |                                      ^
# | <stdin>:144:1: note: non-matching line after previous match is here
# |  %13.sub2:sgpr_128 = S_MOV_B32 0
# | ^
# | 
# | Input file: <stdin>
# | Check file: /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/reg-coalescer-subreg-liveness.mir
# | 
# | -dump-input=help explains the following input dump.
# | 
# | Input was:
# | <<<<<<
# |          .
# |          .
# |          .
# |        140:   
# |        141:  %0:sgpr_64 = COPY $sgpr4_sgpr5 
# |        142:  undef %13.sub1:sgpr_128 = S_LOAD_DWORD_IMM %0, 0, 0 :: (dereferenceable invariant load (s32), align 16, addrspace 4) 
# |        143:  undef %3.sub0:sgpr_128 = S_MOV_B32 1 
# |        144:  %13.sub2:sgpr_128 = S_MOV_B32 0 
# |        145:  undef %5.sub0:sgpr_256 = S_MOV_B32 0 
# | next:19      !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  error: match on wrong line
# |        146:  TENSOR_LOAD_TO_LDS_D2 %3, %5, 0, 0, implicit-def dead $tensorcnt, implicit $exec, implicit $tensorcnt 
# |        147:  %13.sub0:sgpr_128 = S_MOV_B32 1 
# |        148:  %3.sub1:sgpr_128 = COPY %3.sub0 
# |        149:   
# |        150:  bb.1: 
# |          .
# |          .
# |          .
# | >>>>>>
# `-----------------------------
...

@llvm-ci
Collaborator

llvm-ci commented Nov 10, 2025

LLVM Buildbot has detected a new failure on builder ml-opt-dev-x86-64 running on ml-opt-dev-x86-64-b2 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/137/builds/28736


@llvm-ci
Collaborator

llvm-ci commented Nov 10, 2025

LLVM Buildbot has detected a new failure on builder ml-opt-rel-x86-64 running on ml-opt-rel-x86-64-b1 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/185/builds/28551


@llvm-ci
Collaborator

llvm-ci commented Nov 10, 2025

LLVM Buildbot has detected a new failure on builder lld-x86_64-ubuntu-fast running on as-builder-4 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/33/builds/26195


ckoparkar added a commit to ckoparkar/llvm-project that referenced this pull request Nov 10, 2025
* main: (63 commits)
  [libc++] Inline vector::__append into resize (llvm#162086)
  [Flang][OpenMP] Move char box bounds generation for Maps to DirectiveCommons.h (llvm#165918)
  RuntimeLibcalls: Add entries for vector sincospi functions (llvm#166981)
  [X86] _mm_addsub_pd is not valid for constexpr (llvm#167363)
  [CIR] Re-land: Recognize constant aggregate initialization of auto vars (llvm#167033)
  [gn build] Port d2521f1
  [gn build] Port caed089
  [gn build] Port 315d705
  [gn build] Port 2345b7d
  [PowerPC] convert memmove to milicode call  .___memmove64[PR]  in 64-bit mode (llvm#167334)
  [HLSL] Add internal linkage attribute to resources (llvm#166844)
  AMDGPU: Update test after e95f6fa
  [bazel] Port llvm#166980: TLI/VectorLibrary refactor (llvm#167354)
  [libc++] Split macros related to hardening into their own header (llvm#167069)
  [libc++][NFC] Remove unused imports from generate_feature_test_macro_components.py (llvm#159591)
  [CIR][NFC] Add test for Complex imag with GUN extension (llvm#167215)
  [BOLT][AArch64] Add more heuristics on epilogue determination (llvm#167077)
  RegisterCoalescer: Enable terminal rule by default for AMDGPU (llvm#161621)
  Revert "[clang] Refactor option-related code from clangDriver into new clangOptions library" (llvm#167348)
  [libc++][docs] Update to refer to P3355R2 (llvm#167267)
  ...
@hstk30-hw
Contributor

diff --git a/llvm/lib/CodeGen/RegisterCoalescer.cpp b/llvm/lib/CodeGen/RegisterCoalescer.cpp
index 25c4375a73ce..e624088a0964 100644
--- a/llvm/lib/CodeGen/RegisterCoalescer.cpp
+++ b/llvm/lib/CodeGen/RegisterCoalescer.cpp
@@ -4150,7 +4150,7 @@ bool RegisterCoalescer::applyTerminalRule(const MachineInstr &Copy) const {
       continue;
     Register OtherSrcReg, OtherReg;
     unsigned OtherSrcSubReg = 0, OtherSubReg = 0;
-    if (!isMoveInstr(*TRI, &Copy, OtherSrcReg, OtherReg, OtherSrcSubReg,
+    if (!isMoveInstr(*TRI, &MI, OtherSrcReg, OtherReg, OtherSrcSubReg,
                      OtherSubReg))
       return false;
     if (OtherReg == SrcReg)

I reviewed the RegisterCoalescer::applyTerminalRule code, and I suspect there is a bug here, introduced by commit 6749ae36b4a33769e7a77cf812d7cd0a908ae3b9 and present ever since: the loop decomposes the loop-invariant Copy on every iteration instead of the neighboring instruction MI it is scanning.

Please review it again. @qcolombet @arsenm

7 participants