Commit 1846ffd
Accelerate Arm CPU Attention GEMMs with NEON
PR #27954 added cpu_attention_with_kv_cache, which supports chunked prefill, prefix caching,
SWA, ALiBi, softcap, and sinks.
However, it is currently disabled for prefill on Arm CPUs because it is slower than torch.sdpa
for relatively long prefills. Hence chunked prefill, prefix caching, sinks, etc. remained unsupported on Arm.
This PR accelerates cpu_attention_with_kv_cache on Arm CPUs by introducing NEON-accelerated GEMMs
(enabled with ISA::NEON) for the QK and PV steps. With the new GEMMs, cpu_attention_with_kv_cache
performs on par with torch.sdpa for long prefills, which allows us to enable cpu_attention_with_kv_cache
on the prefill path on Arm and thus support chunked prefill, prefix caching, sinks, ALiBi, softcap, etc.
Performance:

Uplift with ISA::NEON vs ISA::VEC:
For batch size = 64, query tokens = kv tokens = 512, q heads = 32, kv heads = 8, head size = 128, block size = 128:
using ISA::NEON for cpu_attention_with_kv_cache accelerates prefill attention by 2x compared to the current state with ISA::VEC.

For the throughput benchmark below on Arm Neoverse-V2, using cpu_attention_with_kv_cache for prefills and decodes:
ISA::NEON yields ~13% higher throughput than ISA::VEC, and similar throughput to using torch.sdpa for prefill.
```
export VLLM_CPU_OMP_THREADS_BIND=0-63
export LD_PRELOAD="/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4:/usr/lib/aarch64-linux-gnu/libgomp.so.1"
export VLLM_TARGET_DEVICE=cpu
export VLLM_CPU_KVCACHE_SPACE=64
vllm bench throughput \
--num-prompts 128 \
--seed 0 \
--dataset-name sharegpt \
--input-len 1024 \
--output-len 128 \
--max-model-len 2048 \
--max-num-batched-tokens 8192 \
--model meta-llama/Llama-3.1-8B-Instruct \
--load-format dummy
```
Future PRs will accelerate attention further by introducing faster, vectorized exp implementations
and by leveraging bfmmla/bfdot for QK and PV on Arm CPUs with bf16.
Signed-off-by: Fadi Arafeh <[email protected]>

1 parent b7f1f49 · commit 1846ffd
File tree (5 files changed: +409 −5 lines):
- csrc/cpu
- vllm/engine
- vllm/v1/attention/backends