Description
Name and Version
```
❯ ./llama-cli --version
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.027 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = false
ggml_metal_device_init: has unified memory = false
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = false
ggml_metal_device_init: recommendedMaxWorkingSetSize = 17163.09 MB
version: 7972 (e06088d)
built with AppleClang 17.0.0.17000319 for Darwin x86_64
```
Operating systems
Mac
Which llama.cpp modules do you know to be affected?
llama-bench, llama-server
Command line
```
❯ ./llama-bench \
    -m ~/models/bartowski/Qwen_Qwen3-14B-GGUF/Qwen_Qwen3-14B-Q4_K_M.gguf \
    -ngl 99 \
    -fa 0,1
```
Problem description & steps to reproduce
It used to be that the Metal backend was much slower than the Vulkan backend on Intel Macs. I did some testing with the latest versions and I'm happy to report that the Metal backend is now faster than the Vulkan backend on my Intel Mac: prompt processing is significantly faster with Metal and token generation is slightly faster. This is all with flash attention disabled.
With flash attention enabled, prompt processing is unaffected but token generation speed drops sharply:
| model | size | params | backend | threads | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | MTL,BLAS | 6 | 0 | pp512 | 64.26 ± 0.22 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | MTL,BLAS | 6 | 0 | tg128 | 43.76 ± 0.13 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | MTL,BLAS | 6 | 1 | pp512 | 63.22 ± 0.08 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | MTL,BLAS | 6 | 1 | tg128 | 11.53 ± 3.72 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | Vulkan,BLAS | 6 | 0 | pp512 | 41.62 ± 0.07 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | Vulkan,BLAS | 6 | 0 | tg128 | 42.50 ± 0.16 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | Vulkan,BLAS | 6 | 1 | pp512 | 41.34 ± 0.38 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | Vulkan,BLAS | 6 | 1 | tg128 | 42.95 ± 0.22 |
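To put a number on the regression, a quick check on the tg128 figures from the table above shows the Metal backend generating tokens roughly 3.8x slower with flash attention enabled:

```python
# tg128 throughput (t/s) for the Metal backend, taken from the table above
metal_fa_off = 43.76  # -fa 0
metal_fa_on = 11.53   # -fa 1

slowdown = metal_fa_off / metal_fa_on
print(f"Metal tg128 slowdown with flash attention: {slowdown:.1f}x")  # → 3.8x
```

(The Vulkan backend shows no such drop: 42.50 vs 42.95 t/s is within noise.)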
I know Intel Macs are not a priority, and I'm happy using the Vulkan backend; I just wanted to report my findings and let others know that using the Metal backend without flash attention might be worth it.
First Bad Commit
No response
Relevant log output
No response