ik_llama.cpp for Armv8.0 #556

NotAHero04 · 2025-06-25T07:26:00Z

NotAHero04
Jun 25, 2025

I managed to port ik_llama.cpp to my phone, which has a Snapdragon 680 CPU. Although it runs under heavy emulation, it's still much faster than mainline llama.cpp. All tests were done using the Qwen 3 0.6B model.

What works:

Quants: legacy quants (tested Q4_0, Q8_0), i-quants (IQ4_XS), k-quants (Q4_K_M), iqk-quants (IQ4_KS, IQ5_K).
Flash attention.

What doesn't work:

Trellis quants (tested IQ4_KT), though it might be specific to the model or to my quantization. I'll test it more tonight.
Repacking (both online and quantized forms, tested Q4_0_R8 and Q8_0_R8).

If anyone is interested, I'll publish a fork. It just adds emulation for some NEON dot product and float16 arithmetic intrinsics (mainline also has some level of emulation for v8.0).
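The fork's code is not shown in this thread, so the following is only a minimal sketch of how a `vdotq_s32` replacement could be written for Armv8.0, not the fork's actual implementation. The helper name `dot_s32_emulated` is hypothetical. On AArch64 it uses only baseline NEON intrinsics (`vmull_s8`, `vpaddlq_s16`, `vpaddq_s32`); on other hosts it falls back to a scalar loop with the same result, so the behaviour can be checked anywhere.

```c
#include <stdint.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: acc[i] += sum over j in 0..3 of a[4*i + j] * b[4*i + j],
 * which is the per-lane result that vdotq_s32 produces. */
void dot_s32_emulated(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
#if defined(__ARM_NEON) && defined(__aarch64__)
    const int8x16_t va = vld1q_s8(a);
    const int8x16_t vb = vld1q_s8(b);
    /* Widening multiplies: 16 int8 products held as int16 (no overflow possible). */
    const int16x8_t p_lo = vmull_s8(vget_low_s8(va), vget_low_s8(vb));
    const int16x8_t p_hi = vmull_s8(vget_high_s8(va), vget_high_s8(vb));
    /* Two pairwise-add steps reduce each group of 4 products to one int32 lane. */
    const int32x4_t sums = vpaddq_s32(vpaddlq_s16(p_lo), vpaddlq_s16(p_hi));
    vst1q_s32(acc, vaddq_s32(vld1q_s32(acc), sums));
#else
    /* Portable scalar fallback with identical semantics. */
    for (int i = 0; i < 4; ++i) {
        int32_t sum = 0;
        for (int j = 0; j < 4; ++j) {
            sum += (int32_t)a[4 * i + j] * (int32_t)b[4 * i + j];
        }
        acc[i] += sum;
    }
#endif
}
```

The emulated path needs several instructions where Armv8.2 hardware with the dot product extension needs one, which is the kind of overhead "heavy emulation" refers to.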

ikawrakow · 2025-06-25T07:52:27Z

ikawrakow
Jun 25, 2025
Maintainer

Nice 😄

The repacked variants don't work because the emulation for vdotq_laneq_s32 is incorrect, or is there some other issue? But I guess it may not be worth putting too much effort into this as one would need to use vgetq_lane_X, which will make the dot products quite slow, I think.
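For reference, this is a scalar model of what a lane-indexed dot product such as `vdotq_laneq_s32` computes: every output lane is accumulated against the same 4-byte group of `b`, selected by `lane`. The function name `dot_laneq_s32_reference` is hypothetical and the code is a sketch for checking an emulation, not code from ik_llama.cpp.

```c
#include <stdint.h>

/* Hypothetical scalar reference:
 * acc[i] += sum over j in 0..3 of a[4*i + j] * b[4*lane + j], for lane in 0..3. */
void dot_laneq_s32_reference(int32_t acc[4], const int8_t a[16],
                             const int8_t b[16], int lane) {
    const int8_t *group = b + 4 * lane; /* the 4 bytes shared by all output lanes */
    for (int i = 0; i < 4; ++i) {
        int32_t sum = 0;
        for (int j = 0; j < 4; ++j) {
            sum += (int32_t)a[4 * i + j] * (int32_t)group[j];
        }
        acc[i] += sum;
    }
}
```

A NEON emulation without the dot product extension would first have to extract or broadcast the selected group of `b` before multiplying, which is the extra cost referred to above.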

0 replies

NotAHero04 · 2025-06-25T14:37:21Z

NotAHero04
Jun 25, 2025
Author

I did a fresh recompile and repacking works now! Unfortunately IQ4_KT still doesn't work :(

1 reply

l15y Jul 31, 2025

Build failed:

cmake -B build \
    -DCMAKE_C_FLAGS="-march=armv8-a -O3 -flto -D_GNU_SOURCE  -ffast-math  -fopenmp  -fno-finite-math-only " \
    -DCMAKE_CXX_FLAGS="-march=armv8-a -O3 -flto -D_GNU_SOURCE  -ffast-math  -fopenmp  -fno-finite-math-only  " \
    -DLLAMA_CURL=OFF \
    -DCMAKE_C_COMPILER="/usr/bin/aarch64-linux-gnu-gcc-12" \
    -DCMAKE_CXX_COMPILER="/usr/bin/aarch64-linux-gnu-g++-12" \
    -DCMAKE_SYSTEM_NAME=Linux \
    -DCMAKE_SYSTEM_PROCESSOR=arm \
    -DBUILD_SHARED_LIBS=OFF \
    -DCMAKE_EXE_LINKER_FLAGS="-static -fopenmp" \
    -DGGML_CPU_KLEIDIAI=OFF -DGGML_BLAS=OFF -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON

numactl --cpunodebind=0 --membind=0 cmake --build build --config Release -j $(nproc --all | awk '{print $1}') -t llama-server

Found in the output:

error: inlining failed in call to ‘always_inline’ ‘vdotq_s32(__Int32x4_t, __Int8x16_t, __Int8x16_t)’: target specific option mismatch

The code seems to use vdotq_s32, which requires the Armv8.2 dot product extension.
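GCC reports `target specific option mismatch` when an `always_inline` intrinsic such as `vdotq_s32` is called from code compiled without the matching target feature, here `-march=armv8-a`. One common way to avoid this is to guard such calls with the ACLE feature macro `__ARM_FEATURE_DOTPROD`, which the compiler defines only when the dot product extension is enabled. The sketch below uses a hypothetical function name and only reports which path would be compiled; it is not taken from ik_llama.cpp.

```c
/* Hypothetical helper: returns 1 if this translation unit was compiled with the
 * Arm dot product extension enabled (so vdotq_s32 may be called directly),
 * and 0 if an emulated fallback would have to be used instead. */
int dotprod_enabled_at_compile_time(void) {
#if defined(__ARM_FEATURE_DOTPROD)
    return 1;
#else
    return 0;
#endif
}
```

Building with `-march=armv8.2-a+dotprod` makes GCC define the macro, but that is only appropriate if the target CPU actually implements the extension; for a true Armv8.0 target the fallback path is the one that has to exist.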

ikawrakow · 2025-06-25T15:30:22Z

ikawrakow
Jun 25, 2025
Maintainer

The *_KT quants are very slow on my M2-Max CPU, so it may not be worth putting in the effort to make them work on a v8.0 phone.

1 reply

NotAHero04 Jun 26, 2025
Author

So the KT quants do work after all, I just have to get the model from my PC. And yes, it is unbearably slow. (Q4_0 is 3x faster in TG)

ikawrakow · 2025-06-26T16:57:03Z

ikawrakow
Jun 26, 2025
Maintainer

Yes, the performance of the *_kt quants is very competitive on a GPU, nearly competitive on the two x86_64 CPUs that I have available, 2X slower than the corresponding size quant on the M2-Max CPU, and ridiculously slow on the M2-Max GPU.

But nice you have made all this work!

0 replies
