
[AIE2P] Large struct phi with zeroinitializer fails to zero ACC2048 registers; crashes at -O0 #805

@paintedsnipedotll


Summary

A phi node that carries a large struct ({i32, i64 x 128}, 129 fields) and takes zeroinitializer as its entry value triggers two issues on aie2p-none-unknown-elf:

  1. -O0 crash: LLVM ERROR: unable to legalize instruction: G_INSERT_VECTOR_ELT
  2. -O2 silent miscompile: the generated code fails to zero the ACC2048 hardware registers, so repeated kernel invocations on the NPU
    alternate between correct and incorrect results.

A smaller struct ({i32, i64 x 32}, 33 fields) with the same pattern works correctly at both -O0 and -O2.

Environment

  • llvm-aie: 20.0.0.2025090701+8c084497 (commit 8c084497)
  • Target: aie2p-none-unknown-elf
  • Hardware (for runtime bug): AMD Ryzen AI NPU (Krackan Point, AIE2P)

Reproducer

See the attached reproducer. It is uploaded as peano_bug_repro.txt; the commands below refer to it as peano_bug_repro.ll.

The struct packs a loop counter plus four ACC2048 accumulators (32 × i64 each, 128 i64 fields in total):

%acc_state = type { i32, i64, i64, ..., i64 }  ; 129 fields

k_loop:
  %state = phi %acc_state [ zeroinitializer, %entry ], [ %state_next, %k_continue ]
  ; extract 4 groups of 32 × i64 → <32 x i64>, call mac.conf, insertvalue back
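
The struct type above is elided in the report. As a rough sketch, the layout it describes (one i32 counter followed by 32 i64 fields per accumulator) can be expanded and indexed as shown here. The helper names `acc_state_type` and `acc_field_range` are hypothetical and are not part of the attached reproducer; they only restate the field counts given in this issue.

```python
# Hypothetical helpers: they expand the elided "{ i32, i64, ..., i64 }" type
# from the field counts stated in the report. Not taken from the reproducer.
I64_PER_ACC = 32  # one ACC2048 value = 32 x i64 = 2048 bits


def acc_state_type(num_accs: int, name: str = "acc_state") -> str:
    """Return an LLVM IR named struct: one i32 counter plus num_accs * 32 i64 fields."""
    fields = ["i32"] + ["i64"] * (num_accs * I64_PER_ACC)
    return f"%{name} = type {{ {', '.join(fields)} }}"


def acc_field_range(group: int) -> range:
    """Struct field indices (field 0 is the counter) that hold accumulator `group`."""
    first = 1 + group * I64_PER_ACC
    return range(first, first + I64_PER_ACC)


if __name__ == "__main__":
    large = acc_state_type(4)
    # Count the i64 fields and add the i32 counter.
    print(large.count("i64") + 1, "fields")
    for g in range(4):
        r = acc_field_range(g)
        print(f"group {g}: fields {r.start}..{r.stop - 1}")
```

Calling `acc_state_type(1, "acc_state_small")` gives the 33-field variant in the same way.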

Reproducing the -O0 crash (no hardware needed)

llc peano_bug_repro.ll --mtriple=aie2p-none-unknown-elf -O0 --filetype=obj -o /dev/null

Expected: compiles successfully.
Actual: LLVM ERROR: unable to legalize instruction: G_INSERT_VECTOR_ELT

Reproducing the -O2 runtime bug (requires AIE2P NPU)

llc peano_bug_repro.ll --mtriple=aie2p-none-unknown-elf -O2 --filetype=obj -o repro.o

Compiles successfully, but when linked into an AIE2P kernel and invoked repeatedly:

  • Even invocations: wrong results (the accumulator retains stale values from the previous invocation, e.g. 68 instead of the expected 32)
  • Odd invocations: correct results (all 32)
  • Only fields 1–64 (first two ACC2048 groups) are affected; fields 65–128 (last two) are always correct
  • Pattern is 100% deterministic
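
To check which accumulator groups are affected in a given run, a host-side comparison along these lines may help. This is a minimal sketch under assumptions: the function name `mismatched_groups` is hypothetical, the output is assumed to be read back as 128 integers in struct-field order (fields 1 to 128, counter excluded), and the sample data is synthetic, shaped like the symptom described above rather than captured from hardware.

```python
FIELDS_PER_ACC = 32  # i64 lanes per ACC2048 group, per the report
NUM_ACCS = 4


def mismatched_groups(values, expected=32):
    """Return the sorted accumulator group indices that contain any lane != expected.

    `values` holds struct fields 1..128 in order (the i32 counter is excluded).
    """
    if len(values) != FIELDS_PER_ACC * NUM_ACCS:
        raise ValueError("expected 128 accumulator lanes")
    bad = set()
    for i, v in enumerate(values):
        if v != expected:
            bad.add(i // FIELDS_PER_ACC)
    return sorted(bad)


# Synthetic data shaped like the reported symptom, not captured hardware output.
good_run = [32] * 128
bad_run = [68] * 64 + [32] * 64

print(mismatched_groups(good_run))
print(mismatched_groups(bad_run))
```

With the synthetic `bad_run`, only groups 0 and 1 are flagged, which corresponds to fields 1–64 in the list above.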

Working case

The same phi pattern with a 33-field struct (1 accumulator instead of 4) works correctly:

%acc_state_small = type { i32, i64, i64, ..., i64 }  ; 33 fields
; → correct ACC zeroing on every invocation, no crash at -O0

Workaround

Restructure the kernel to carry only one <32 x i64> accumulator per loop phi node instead of four.
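
One possible reading of this workaround is to replace the single struct phi with one phi per accumulator. The sketch below only generates the IR text for such phis; the value names (`%acc0`, `%acc0_next`, and so on) and the helper `split_phis` are hypothetical, and the issue only confirms the single-accumulator case, so whether several such phis in one loop avoid the bug is an assumption that would need checking.

```python
def split_phis(num_accs=4, ty="<32 x i64>", entry="%entry", latch="%k_continue"):
    """Emit one LLVM IR phi per accumulator instead of one phi over a 129-field struct."""
    lines = []
    for g in range(num_accs):
        lines.append(
            f"  %acc{g} = phi {ty} [ zeroinitializer, {entry} ], "
            f"[ %acc{g}_next, {latch} ]"
        )
    return lines


# Print the loop header with the block labels used in the report's snippet.
print("k_loop:")
print("\n".join(split_phis()))
```

Passing `ty="%acc_state_small"` instead keeps the 33-field struct form that the report describes as working.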

peano_bug_repro.txt
