[AArch64] Miscompile due to 32-bit insertelement lowered to 8-bit move #140707

@aleks-tmb

After the changes introduced in #136091, we started experiencing a miscompile on AArch64 in our local testing. Here is a reduced LLVM IR example:

; ModuleID = 'Test.ll'
target triple = "aarch64-none-linux-gnu"

define i32 @main(ptr addrspace(1) %p) {
  %1 = load <2 x i32>, ptr addrspace(1) %p, align 4
  %2 = extractelement <2 x i32> %1, i64 0
  %3 = call i32 @llvm.ctpop.i32(i32 %2)
  %4 = insertelement <2 x i32> <i32 -1, i32 poison>, i32 %3, i64 1
  %5 = sub <2 x i32> %1, %4
  store <2 x i32> %5, ptr addrspace(1) %p, align 4
  ret i32 0
}

; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.ctpop.i32(i32)

Here is the llc (trunk) output:

main:
        ldr     d0, [x0]
        movi    v2.2d, #0xffffffffffffffff
        mov     x8, x0
        mov     w0, wzr
        fmov    w9, s0
        fmov    s1, w9
        cnt     v1.8b, v1.8b
        addv    b1, v1.8b
        mov     v2.b[4], v1.b[0]
        sub     v0.2s, v0.2s, v2.2s
        str     d0, [x8]
        ret

https://godbolt.org/z/c5YP7rYbE

The current transformation appears to be incorrect because:

  • The instruction mov v2.b[4], v1.b[0] only updates a single byte (byte 4) of the v2 vector register.
  • However, the LLVM IR expects a full 32-bit insertion into the second element (insertelement at index 1).
  • Because the rest of the 32-bit lane in v2.s[1] remains filled with 0xFF (due to movi v2.2d, #0xFFFFFFFFFFFFFFFF), the resulting subtraction computes an incorrect value.
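The effect on that single lane can be modelled with a few lines of Python. This is only an illustrative sketch, not LLVM or hardware code: the helper names `insert_byte` and `insert_word` are hypothetical, and the input value `0x0000000F` is an arbitrary choice. It relies on the fact that `v2.b[4]` occupies bits 32..39 of the register, which is the least significant byte of `v2.s[1]`.

```python
MASK32 = 0xFFFFFFFF

def insert_byte(lane, byte_index, value):
    """Model of a byte-lane move: replace one byte of a 32-bit lane.

    byte_index 0 is the least significant byte of the lane.
    """
    shift = 8 * byte_index
    cleared = lane & ~(0xFF << shift) & MASK32
    return cleared | ((value & 0xFF) << shift)

def insert_word(lane, value):
    """Model of a 32-bit lane move: the whole lane is overwritten."""
    return value & MASK32

# Lane s[1] of v2 after `movi v2.2d, #0xffffffffffffffff`.
lane = 0xFFFFFFFF

# Population count of an arbitrary example input (assumed value).
popcount = bin(0x0000000F).count("1")

byte_result = insert_byte(lane, 0, popcount)  # what mov v2.b[4], v1.b[0] leaves in s[1]
word_result = insert_word(lane, popcount)     # what insertelement requires

print(hex(byte_result))  # 0xffffff04
print(hex(word_result))  # 0x4
```

The upper three bytes of the lane keep the `0xFF` pattern from `movi`, so the inserted element is `0xFFFFFF04` instead of `4`.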

llc 20.1.0 output (before applying #136091):

main:                                   // @main
        ldr     d1, [x0]
        movi    v0.2d, #0xffffffffffffffff
        mov     x8, x0
        mov     w0, wzr
        fmov    w9, s1
        fmov    s2, w9
        cnt     v2.8b, v2.8b
        addv    b2, v2.8b
        fmov    w9, s2
        mov     v0.s[1], w9
        sub     v0.2s, v1.2s, v0.2s
        str     d0, [x8]
        ret

https://godbolt.org/z/qqvM1b3hc

Why is this correct?

  • mov v0.s[1], w9 fully overwrites the entire 32-bit lane in v0, matching the semantics of LLVM IR's insertelement.
  • This avoids leftover bytes from the earlier movi initialization, ensuring the result is correct.
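For a concrete comparison, the whole function can be modelled end to end. The following is a rough Python sketch under stated assumptions: the function names are hypothetical, the inputs `7` and `10` are arbitrary, and each lowering is reduced to the value it leaves in lane 1 before the `sub`. It is meant to illustrate the difference between the two outputs, not to reproduce the backend's behaviour exactly.

```python
MASK32 = 0xFFFFFFFF

def popcount32(x):
    """Number of set bits in a 32-bit value (models llvm.ctpop.i32)."""
    return bin(x & MASK32).count("1")

def ir_semantics(e0, e1):
    """Reference result of the reduced IR for the loaded vector <e0, e1>."""
    # %4 = <i32 -1, ctpop(e0)>, %5 = %1 - %4 with 32-bit wraparound.
    v4 = (MASK32, popcount32(e0))
    return ((e0 - v4[0]) & MASK32, (e1 - v4[1]) & MASK32)

def word_move_lowering(e0, e1):
    """Model of the 20.1.0 output: mov v0.s[1], w9 overwrites the full lane."""
    lane1 = popcount32(e0)
    return ((e0 - MASK32) & MASK32, (e1 - lane1) & MASK32)

def byte_move_lowering(e0, e1):
    """Model of the trunk output: only the low byte of lane 1 is replaced."""
    lane1 = (MASK32 & ~0xFF) | popcount32(e0)  # upper bytes stay 0xFF
    return ((e0 - MASK32) & MASK32, (e1 - lane1) & MASK32)

expected = ir_semantics(7, 10)
print(expected)                   # (8, 7)
print(word_move_lowering(7, 10))  # (8, 7)
print(byte_move_lowering(7, 10))  # (8, 263)
```

In this model the byte-move variant is off by exactly 256 in the second element, because subtracting `0xFFFFFF00 + c` instead of `c` adds `0x100` modulo 2^32.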

@davemgreen David, could you please take a look when you have a moment?
