
Conversation


@fadara01 fadara01 commented Nov 21, 2025

[perf][cpu] Accelerate attention GEMMs (QK, PV) on Arm CPUs with NEON

NEON is an Arm SIMD instruction set extension, mandatory since Armv8-A.

Purpose

Fixes #28981 for Arm CPUs

PR #27954 added cpu_attention_with_kv_cache, which supports chunked prefill, prefix caching, SWA, alibi, softcap, and sinks.

However, it's currently disabled for prefill on Arm CPUs because it's slower than torch.sdpa for relatively long prefills. As a result, chunked prefill, prefix caching, sinks, etc. remained disabled on Arm.

This PR accelerates cpu_attention_with_kv_cache on Arm CPUs by introducing NEON-accelerated GEMMs for QK and PV, enabled with ISA::NEON. With the new GEMMs, cpu_attention_with_kv_cache performs on par with torch.sdpa for long prefills, which allows us to enable it for the prefill path on Arm and thus turn on chunked prefill, prefix caching, sinks, alibi, softcap, etc.
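
To give a sense of the approach, the sketch below shows the kind of NEON micro-kernel this relies on: multiple 128-bit FMA accumulators walking the head dimension, followed by a horizontal reduction, reused across a QK^T tile. This is a hedged illustration only (fp32, head size assumed to be a multiple of 8; the names dot_f32_neon and qk_tile_ref are hypothetical), not the kernel added in cpu_attn_neon.hpp.

```
#include <arm_neon.h>
#include <cstddef>

// Dot product of two fp32 rows of length n (assumes n % 8 == 0).
// Two accumulators hide FMA latency; vaddvq_f32 does the final reduction.
static inline float dot_f32_neon(const float* a, const float* b, size_t n) {
  float32x4_t acc0 = vdupq_n_f32(0.0f);
  float32x4_t acc1 = vdupq_n_f32(0.0f);
  for (size_t i = 0; i < n; i += 8) {
    acc0 = vfmaq_f32(acc0, vld1q_f32(a + i), vld1q_f32(b + i));
    acc1 = vfmaq_f32(acc1, vld1q_f32(a + i + 4), vld1q_f32(b + i + 4));
  }
  return vaddvq_f32(vaddq_f32(acc0, acc1));
}

// Hypothetical QK^T tile (not the PR's actual API):
// scores[i][j] = scale * dot(Q row i, K row j).
void qk_tile_ref(const float* q, const float* k, float* scores,
                 size_t q_rows, size_t k_rows, size_t head_size, float scale) {
  for (size_t i = 0; i < q_rows; ++i)
    for (size_t j = 0; j < k_rows; ++j)
      scores[i * k_rows + j] =
          scale * dot_f32_neon(q + i * head_size, k + j * head_size, head_size);
}
```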

Performance:

Uplift with ISA::NEON vs ISA::VEC:
For batch size = 64, query tokens = kv tokens = 512, q heads = 32, kv heads = 8, head size = 128, block size = 128: using ISA::NEON for cpu_attention_with_kv_cache accelerates prefill attention by 2x compared to the current ISA::VEC path.

For the throughput benchmark below on Arm Neoverse-V2, using cpu_attention_with_kv_cache for prefills and decodes: ISA::NEON yields ~13% higher throughput than ISA::VEC and throughput similar to using torch.sdpa for prefill.

```
export VLLM_CPU_OMP_THREADS_BIND=0-63
export LD_PRELOAD="/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4:/usr/lib/aarch64-linux-gnu/libgomp.so.1"
export VLLM_TARGET_DEVICE=cpu
export VLLM_CPU_KVCACHE_SPACE=64
vllm bench throughput \
  --num-prompts 128 \
  --seed 0 \
  --dataset-name sharegpt \
  --input-len 1024 \
  --output-len 128 \
  --max-model-len 2048 \
  --max-num-batched-tokens 8192 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --load-format dummy
```

Future

This PR establishes a solid reference path for attention GEMMs on Arm CPUs.
Future PRs will accelerate attention further by introducing faster/vectorized exp implementations and by leveraging bfmmla/bfdot for QK and PV on Arm CPUs with bf16 support.
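
As a hedged illustration of that bf16 direction (not code from this PR): under __ARM_FEATURE_BF16, vbfdotq_f32 accumulates pairwise bf16 products into fp32 lanes, so each accumulated element needs half as many vector loads as the fp32 FMA loop above. The helper name dot_bf16_neon is hypothetical.

```
#include <arm_neon.h>
#include <cstddef>

#if defined(__ARM_FEATURE_BF16)
// Hypothetical helper: dot product of two bf16 rows of length n
// (assumes n % 8 == 0), accumulated in fp32 via the bfdot instruction.
static inline float dot_bf16_neon(const bfloat16_t* a, const bfloat16_t* b,
                                  size_t n) {
  float32x4_t acc = vdupq_n_f32(0.0f);
  for (size_t i = 0; i < n; i += 8) {
    // Each vbfdotq_f32 multiplies adjacent bf16 pairs and adds the pair sums
    // into the corresponding fp32 lane of the accumulator.
    acc = vbfdotq_f32(acc, vld1q_bf16(a + i), vld1q_bf16(b + i));
  }
  return vaddvq_f32(acc);
}
#endif
```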

Test Plan

./run-cpu-test-arm.sh
which includes tests/kernels/attention/test_cpu_attn.py

Test Result

All tests pass

Essential Elements of an Effective PR Description Checklist
  • [Y] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [Y] The test plan, such as providing test command.
  • [Y] The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces NEON acceleration for Arm CPU Attention GEMMs, which is a significant performance improvement. The changes are well-structured, with a new cpu_attn_neon.hpp file containing the optimized kernels and modifications in other files to integrate the new ISA path. The NEON implementation itself is solid, using intrinsics and unrolling to achieve better performance. I've found an important issue regarding naming clarity in the new NEON implementation that should be addressed to improve maintainability.

@fadara01 fadara01 force-pushed the accelerate_arm_attention branch from 5be945f to 1846ffd Compare November 21, 2025 18:40
@fadara01 fadara01 force-pushed the accelerate_arm_attention branch from 1846ffd to 28a7367 Compare November 21, 2025 18:52
@fadara01 fadara01 changed the title Accelerate Arm CPU Attention GEMMs with NEON Accelerate CPU Attention GEMMs on Arm with NEON Nov 21, 2025
@fadara01

@mgoin @bigPYJ1151

Can you have a look? This deprecates the torch.sdpa prefill path for Arm CPUs.

@fadara01 fadara01 changed the title Accelerate CPU Attention GEMMs on Arm with NEON [perf][cpu] Accelerate attention GEMMs (QK, PV) on Arm CPUs with NEON Nov 21, 2025
@mgoin mgoin added performance Performance-related issues ready ONLY add when PR is ready to merge/full CI is needed aarch64-cpu labels Nov 21, 2025