[perf][cpu] Accelerate attention GEMMs (QK, PV) on Arm CPUs with NEON #29193
Conversation
Code Review
This pull request introduces NEON acceleration for Arm CPU Attention GEMMs, which is a significant performance improvement. The changes are well-structured, with a new cpu_attn_neon.hpp file containing the optimized kernels and modifications in other files to integrate the new ISA path. The NEON implementation itself is solid, using intrinsics and unrolling to achieve better performance. I've found an important issue regarding naming clarity in the new NEON implementation that should be addressed to improve maintainability.
5be945f to 1846ffd
PR vllm-project#27954 added cpu_attention_with_kv_cache, which supports chunked prefill, prefix caching, SWA, alibi, softcap, and sinks. However, it is currently disabled for prefill on Arm CPUs because it is slower than torch.sdpa for relatively long prefills, so chunked prefill, prefix caching, sinks, etc. remained unsupported on Arm.

This PR accelerates cpu_attention_with_kv_cache on Arm CPUs by introducing NEON-accelerated GEMMs (enabled with ISA::NEON) for QK and PV. With the new GEMMs, the performance of cpu_attention_with_kv_cache is similar to torch.sdpa for long prefills, which allows us to enable cpu_attention_with_kv_cache on the prefill path on Arm and thus enable chunked prefill, prefix caching, sinks, alibi, softcap, etc.

Performance:

Uplift with ISA::NEON vs ISA::VEC: for batch size = 64, query tokens = kv tokens = 512, q heads = 32, kv heads = 8, head size = 128, block size = 128, using ISA::NEON for cpu_attention_with_kv_cache accelerates prefill attention by 2x compared to the current state with ISA::VEC.

For the throughput benchmark below on Arm Neoverse-V2, using cpu_attention_with_kv_cache for prefills and decodes, ISA::NEON yields ~13% higher throughput than ISA::VEC and similar throughput to using torch.sdpa for prefill.

```
export VLLM_CPU_OMP_THREADS_BIND=0-63
export LD_PRELOAD="/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4:/usr/lib/aarch64-linux-gnu/libgomp.so.1"
export VLLM_TARGET_DEVICE=cpu
export VLLM_CPU_KVCACHE_SPACE=64
vllm bench throughput \
    --num-prompts 128 \
    --seed 0 \
    --dataset-name sharegpt \
    --input-len 1024 \
    --output-len 128 \
    --max-model-len 2048 \
    --max-num-batched-tokens 8192 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --load-format dummy
```

Future PRs will accelerate attention further by introducing faster/vectorized exp implementations and leveraging bfmmla/bfdot for QK and PV on Arm CPUs with bf16.

Signed-off-by: Fadi Arafeh <[email protected]>
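For illustration only, here is a minimal sketch of the PV side of the GEMMs described above: accumulating O += P·V for one query row by broadcasting each softmax probability and using fused multiply-adds across the head dimension, unrolled 4x. The function name, layouts, and the head-size assumption are hypothetical and not the PR's actual cpu_attn_neon.hpp kernel.

```
// Hypothetical sketch (not the PR's kernel): O[m, :] += sum_k P[m, k] * V[k, :]
#include <arm_neon.h>
#include <cstddef>

// p:   [kv_len]          softmax probabilities for one query row
// v:   [kv_len, head_sz] value block, row-major
// out: [head_sz]         output accumulator for that query row
// head_sz is assumed to be a multiple of 16 here (e.g. 128).
void pv_accumulate_row(const float* p, const float* v, float* out,
                       size_t kv_len, size_t head_sz) {
  for (size_t k = 0; k < kv_len; ++k) {
    const float32x4_t pk = vdupq_n_f32(p[k]);   // broadcast P[m, k]
    const float* vrow = v + k * head_sz;
    for (size_t d = 0; d < head_sz; d += 16) {  // 4x unrolled over the head dim
      float32x4_t o0 = vld1q_f32(out + d);
      float32x4_t o1 = vld1q_f32(out + d + 4);
      float32x4_t o2 = vld1q_f32(out + d + 8);
      float32x4_t o3 = vld1q_f32(out + d + 12);
      o0 = vfmaq_f32(o0, pk, vld1q_f32(vrow + d));
      o1 = vfmaq_f32(o1, pk, vld1q_f32(vrow + d + 4));
      o2 = vfmaq_f32(o2, pk, vld1q_f32(vrow + d + 8));
      o3 = vfmaq_f32(o3, pk, vld1q_f32(vrow + d + 12));
      vst1q_f32(out + d, o0);
      vst1q_f32(out + d + 4, o1);
      vst1q_f32(out + d + 8, o2);
      vst1q_f32(out + d + 12, o3);
    }
  }
}
```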
1846ffd to 28a7367
Can you guys have a look? This deprecates the torch.sdpa prefill path for Arm CPUs.
[perf][cpu] Accelerate attention GEMMs (QK, PV) on Arm CPUs with NEON
NEON is an Arm SIMD instruction set extension, compulsory since Armv8-A.

Purpose
Fixes #28981 for Arm CPUs
PR #27954 added cpu_attention_with_kv_cache, which supports chunked prefill, prefix caching, SWA, alibi, softcap, and sinks. However, it is currently disabled for prefill on Arm CPUs because it is slower than torch.sdpa for relatively long prefills, so chunked prefill, prefix caching, sinks, etc. remained disabled on Arm.

This PR accelerates cpu_attention_with_kv_cache on Arm CPUs by introducing NEON-accelerated GEMMs (enabled with ISA::NEON) for QK and PV. With the new GEMMs, the performance of cpu_attention_with_kv_cache is similar to torch.sdpa for long prefills, which allows us to enable cpu_attention_with_kv_cache on the prefill path on Arm and thus enable chunked prefill, prefix caching, sinks, alibi, softcap, etc.
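To make the QK side concrete, below is a hedged sketch of a NEON dot-product micro-kernel computing S[m, n] = scale · Q[m, :]·K[n, :], with four accumulators to hide FMA latency. The name, layout, and the head-size multiple-of-16 assumption are illustrative; the real kernels live in cpu_attn_neon.hpp and may differ.

```
// Hypothetical sketch (not the PR's kernel): one QK dot product with 4 accumulators
#include <arm_neon.h>
#include <cstddef>

// q: [head_sz] one query row;  k: [head_sz] one key row
// head_sz is assumed to be a multiple of 16 (e.g. 128).
float qk_dot(const float* q, const float* k, size_t head_sz, float scale) {
  float32x4_t acc0 = vdupq_n_f32(0.f), acc1 = vdupq_n_f32(0.f);
  float32x4_t acc2 = vdupq_n_f32(0.f), acc3 = vdupq_n_f32(0.f);
  for (size_t d = 0; d < head_sz; d += 16) {
    acc0 = vfmaq_f32(acc0, vld1q_f32(q + d),      vld1q_f32(k + d));
    acc1 = vfmaq_f32(acc1, vld1q_f32(q + d + 4),  vld1q_f32(k + d + 4));
    acc2 = vfmaq_f32(acc2, vld1q_f32(q + d + 8),  vld1q_f32(k + d + 8));
    acc3 = vfmaq_f32(acc3, vld1q_f32(q + d + 12), vld1q_f32(k + d + 12));
  }
  const float32x4_t acc =
      vaddq_f32(vaddq_f32(acc0, acc1), vaddq_f32(acc2, acc3));
  return scale * vaddvq_f32(acc);  // horizontal sum (AArch64)
}
```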
Performance:

Uplift with ISA::NEON vs ISA::VEC:

For batch size = 64, query tokens = kv tokens = 512, q heads = 32, kv heads = 8, head size = 128, block size = 128: using ISA::NEON for cpu_attention_with_kv_cache accelerates prefill attention by 2x compared to the current state with ISA::VEC.

For the throughput benchmark below on Arm Neoverse-V2, using cpu_attention_with_kv_cache for prefills and decodes: ISA::NEON yields ~13% higher throughput than ISA::VEC and similar throughput to using torch.sdpa for prefill.
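The throughput benchmark command referenced above (reproduced from the commit message):

```
export VLLM_CPU_OMP_THREADS_BIND=0-63
export LD_PRELOAD="/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4:/usr/lib/aarch64-linux-gnu/libgomp.so.1"
export VLLM_TARGET_DEVICE=cpu
export VLLM_CPU_KVCACHE_SPACE=64
vllm bench throughput \
    --num-prompts 128 \
    --seed 0 \
    --dataset-name sharegpt \
    --input-len 1024 \
    --output-len 128 \
    --max-model-len 2048 \
    --max-num-batched-tokens 8192 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --load-format dummy
```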
Future

This PR provides a solid reference path for attention GEMMs on Arm CPUs.
Future PRs will accelerate attention further by introducing faster/vectorized exp implementations and leveraging bfmmla/bfdot for QK, PV on Arm CPUs with bf16 support.
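To sketch the bf16 direction, here is an illustrative (not planned-as-is) bf16 dot product using the BFDOT intrinsic. It assumes a compiler and target with the Armv8 BF16 extension (a -march that includes +bf16); the function name and shapes are hypothetical.

```
// Hypothetical sketch: bf16 dot product via BFDOT, accumulated in fp32
#include <arm_neon.h>
#include <cstddef>

#ifdef __ARM_FEATURE_BF16_VECTOR_ARITHMETIC
// a, b: [len] bf16 vectors; len is assumed to be a multiple of 8.
float bf16_dot(const bfloat16_t* a, const bfloat16_t* b, size_t len) {
  float32x4_t acc = vdupq_n_f32(0.f);
  for (size_t i = 0; i < len; i += 8) {
    // BFDOT widens pairs of bf16 products and accumulates them into fp32 lanes
    acc = vbfdotq_f32(acc, vld1q_bf16(a + i), vld1q_bf16(b + i));
  }
  return vaddvq_f32(acc);  // horizontal sum (AArch64)
}
#endif
```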
Test Plan
./run-cpu-test-arm.sh
which includes tests/kernels/attention/test_cpu_attn.py

Test Result
All tests pass