Misc. bug: FA performance on macOS with Metal backend (Intel/AMD) #19431

@soerenkampschroer

Name and Version

❯ ./llama-cli --version
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.027 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = false
ggml_metal_device_init: has unified memory = false
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = false
ggml_metal_device_init: recommendedMaxWorkingSetSize = 17163.09 MB
version: 7972 (e06088d)
built with AppleClang 17.0.0.17000319 for Darwin x86_64

Operating systems

Mac

Which llama.cpp modules do you know to be affected?

llama-bench, llama-server

Command line

❯ ./llama-bench \
    -m ~/models/bartowski/Qwen_Qwen3-14B-GGUF/Qwen_Qwen3-14B-Q4_K_M.gguf \
    -ngl 99 \
    -fa 0,1

Problem description & steps to reproduce

It used to be that the Metal backend was much slower than the Vulkan backend on Intel Macs. I did some testing with the latest versions and I'm happy to report that the Metal backend is now faster than the Vulkan backend on my Intel Mac. Prompt processing is significantly faster with Metal (64.26 vs. 41.62 t/s for pp512) and token generation is slightly faster (43.76 vs. 42.50 t/s for tg128). This is all with flash attention disabled.

With flash attention enabled, prompt processing is unaffected, but token generation on Metal drops from 43.76 t/s to 11.53 t/s:

| model | size | params | backend | threads | fa | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: | ---: |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | MTL,BLAS | 6 | 0 | pp512 | 64.26 ± 0.22 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | MTL,BLAS | 6 | 0 | tg128 | 43.76 ± 0.13 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | MTL,BLAS | 6 | 1 | pp512 | 63.22 ± 0.08 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | MTL,BLAS | 6 | 1 | tg128 | 11.53 ± 3.72 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | Vulkan,BLAS | 6 | 0 | pp512 | 41.62 ± 0.07 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | Vulkan,BLAS | 6 | 0 | tg128 | 42.50 ± 0.16 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | Vulkan,BLAS | 6 | 1 | pp512 | 41.34 ± 0.38 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | Vulkan,BLAS | 6 | 1 | tg128 | 42.95 ± 0.22 |
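
To put a number on the slowdown, here is a small sketch that computes the tg128 ratios from the mean values in the table above. The variable names and the `ratio` helper are only illustrative, and the Metal FA-on figure has a large spread (± 3.72), so the result should be read as approximate.

```shell
# Mean tg128 throughput (t/s), copied from the table above.
metal_fa0=43.76    # Metal, flash attention off
metal_fa1=11.53    # Metal, flash attention on
vulkan_fa1=42.95   # Vulkan, flash attention on

# Illustrative helper: print first value divided by second, two decimals.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f", a / b }'; }

metal_slowdown=$(ratio "$metal_fa0" "$metal_fa1")
vulkan_vs_metal=$(ratio "$vulkan_fa1" "$metal_fa1")

echo "Metal tg128, FA off vs. FA on: ${metal_slowdown}x"
echo "Vulkan vs. Metal tg128, FA on: ${vulkan_vs_metal}x"
```

On these means, enabling flash attention makes Metal token generation roughly 3.8 times slower, while Vulkan is essentially unchanged.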

I know Intel Macs are not a priority, and I'm happy using the Vulkan backend. I just wanted to report my findings and let other people know that using the Metal backend without flash attention might be worth it.

First Bad Commit

No response

Relevant log output

No response
