
Conversation


@fadara01 fadara01 commented Nov 21, 2025

[perf][cpu] Accelerate attention GEMMs (QK, PV) on Arm CPUs with NEON

NEON is an Arm SIMD instruction set extension, mandatory since Armv8-A.

Purpose

Fixes #28981 for Arm CPUs

PR #27954 added cpu_attention_with_kv_cache, which supports chunked prefill, prefix caching, SWA, alibi, softcap, and sinks.

However, it's currently disabled for prefill on Arm CPUs because it's slower than torch.sdpa for relatively long prefills. As a result, chunked prefill, prefix caching, sinks, etc. remained disabled on Arm.

This PR accelerates cpu_attention_with_kv_cache on Arm CPUs by introducing NEON-accelerated GEMMs for QK and PV, enabled with ISA::NEON. With the new GEMMs, cpu_attention_with_kv_cache performs on par with torch.sdpa for long prefills, which allows us to enable it for the prefill path on Arm and thus turn on chunked prefill, prefix caching, sinks, alibi, softcap, etc.
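
To give a sense of the approach, the sketch below shows the kind of NEON micro-kernel this relies on: multiple 128-bit FMA accumulators walking the head dimension, followed by a horizontal reduction, reused across a QK^T tile. This is a hedged illustration only (fp32, head size assumed to be a multiple of 8; the names dot_f32_neon and qk_tile_ref are hypothetical), not the kernel added in cpu_attn_neon.hpp.

```
#include <arm_neon.h>
#include <cstddef>

// Dot product of two fp32 rows of length n (assumes n % 8 == 0).
// Two accumulators hide FMA latency; vaddvq_f32 does the final reduction.
static inline float dot_f32_neon(const float* a, const float* b, size_t n) {
  float32x4_t acc0 = vdupq_n_f32(0.0f);
  float32x4_t acc1 = vdupq_n_f32(0.0f);
  for (size_t i = 0; i < n; i += 8) {
    acc0 = vfmaq_f32(acc0, vld1q_f32(a + i), vld1q_f32(b + i));
    acc1 = vfmaq_f32(acc1, vld1q_f32(a + i + 4), vld1q_f32(b + i + 4));
  }
  return vaddvq_f32(vaddq_f32(acc0, acc1));
}

// Hypothetical QK^T tile (not the PR's actual API):
// scores[i][j] = scale * dot(Q row i, K row j).
void qk_tile_ref(const float* q, const float* k, float* scores,
                 size_t q_rows, size_t k_rows, size_t head_size, float scale) {
  for (size_t i = 0; i < q_rows; ++i)
    for (size_t j = 0; j < k_rows; ++j)
      scores[i * k_rows + j] =
          scale * dot_f32_neon(q + i * head_size, k + j * head_size, head_size);
}
```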

Performance:

Uplift with ISA::NEON vs ISA::VEC:
For batch size = 64, query tokens = kv tokens = 512, q heads = 32, kv heads = 8, head size = 128, block size = 128: using ISA::NEON for cpu_attention_with_kv_cache accelerates prefill attention by 2x compared to the current ISA::VEC path.

For the throughput benchmark below on Arm Neoverse-V2, using cpu_attention_with_kv_cache for prefills and decodes: ISA::NEON yields ~13% higher throughput than ISA::VEC and throughput similar to using torch.sdpa for prefill.

```
export VLLM_CPU_OMP_THREADS_BIND=0-63
export LD_PRELOAD="/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4:/usr/lib/aarch64-linux-gnu/libgomp.so.1"
export VLLM_TARGET_DEVICE=cpu
export VLLM_CPU_KVCACHE_SPACE=64
vllm bench throughput \
  --num-prompts 128 \
  --seed 0 \
  --dataset-name sharegpt \
  --input-len 1024 \
  --output-len 128 \
  --max-model-len 2048 \
  --max-num-batched-tokens 8192 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --load-format dummy
```

Future

This PR establishes a solid reference path for attention GEMMs on Arm CPUs.
Future PRs will accelerate attention further by introducing faster/vectorized exp implementations and by leveraging bfmmla/bfdot for QK and PV on Arm CPUs with bf16 support.
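
As a hedged illustration of that bf16 direction (not code from this PR): under __ARM_FEATURE_BF16, vbfdotq_f32 accumulates pairwise bf16 products into fp32 lanes, so each accumulated element needs half as many vector loads as the fp32 FMA loop above. The helper name dot_bf16_neon is hypothetical.

```
#include <arm_neon.h>
#include <cstddef>

#if defined(__ARM_FEATURE_BF16)
// Hypothetical helper: dot product of two bf16 rows of length n
// (assumes n % 8 == 0), accumulated in fp32 via the bfdot instruction.
static inline float dot_bf16_neon(const bfloat16_t* a, const bfloat16_t* b,
                                  size_t n) {
  float32x4_t acc = vdupq_n_f32(0.0f);
  for (size_t i = 0; i < n; i += 8) {
    // Each vbfdotq_f32 multiplies adjacent bf16 pairs and adds the pair sums
    // into the corresponding fp32 lane of the accumulator.
    acc = vbfdotq_f32(acc, vld1q_bf16(a + i), vld1q_bf16(b + i));
  }
  return vaddvq_f32(acc);
}
#endif
```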

Test Plan

./run-cpu-test-arm.sh
which includes tests/kernels/attention/test_cpu_attn.py

Test Result

All tests pass

Essential Elements of an Effective PR Description Checklist
  • [Y] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [Y] The test plan, such as providing test command.
  • [Y] The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces NEON acceleration for Arm CPU Attention GEMMs, which is a significant performance improvement. The changes are well-structured, with a new cpu_attn_neon.hpp file containing the optimized kernels and modifications in other files to integrate the new ISA path. The NEON implementation itself is solid, using intrinsics and unrolling to achieve better performance. I've found an important issue regarding naming clarity in the new NEON implementation that should be addressed to improve maintainability.

@fadara01 fadara01 force-pushed the accelerate_arm_attention branch from 5be945f to 1846ffd Compare November 21, 2025 18:40
@fadara01 fadara01 force-pushed the accelerate_arm_attention branch from 1846ffd to 28a7367 Compare November 21, 2025 18:52
@fadara01 fadara01 changed the title Accelerate Arm CPU Attention GEMMs with NEON Accelerate CPU Attention GEMMs on Arm with NEON Nov 21, 2025
@fadara01

@mgoin @bigPYJ1151

Can you have a look? This deprecates the torch.sdpa prefill path for Arm CPUs.

@fadara01 fadara01 changed the title Accelerate CPU Attention GEMMs on Arm with NEON [perf][cpu] Accelerate attention GEMMs (QK, PV) on Arm CPUs with NEON Nov 21, 2025
@mgoin mgoin added performance Performance-related issues ready ONLY add when PR is ready to merge/full CI is needed aarch64-cpu labels Nov 21, 2025