Description
Name and Version
```
❯ ./llama-cli --version
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.027 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = false
ggml_metal_device_init: has unified memory = false
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = false
ggml_metal_device_init: recommendedMaxWorkingSetSize = 17163.09 MB
version: 7972 (e06088d)
built with AppleClang 17.0.0.17000319 for Darwin x86_64
```
Operating systems
Mac
Which llama.cpp modules do you know to be affected?
llama-bench, llama-server
Command line
```
❯ ./llama-bench \
    -m ~/models/bartowski/Qwen_Qwen3-14B-GGUF/Qwen_Qwen3-14B-Q4_K_M.gguf \
    -ngl 99 \
    -fa 0,1
```
Problem description & steps to reproduce
It used to be that the Metal backend was much slower than the Vulkan backend on Intel Macs. I did some testing with the latest versions and I'm happy to report that the Metal backend is now faster than the Vulkan backend on my Intel Mac: prompt processing is significantly faster with Metal and token generation is slightly faster. This is all with flash attention disabled.
With flash attention enabled, prompt processing is unaffected but token generation speed drops sharply:
| model | size | params | backend | threads | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | MTL,BLAS | 6 | 0 | pp512 | 64.26 ± 0.22 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | MTL,BLAS | 6 | 0 | tg128 | 43.76 ± 0.13 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | MTL,BLAS | 6 | 1 | pp512 | 63.22 ± 0.08 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | MTL,BLAS | 6 | 1 | tg128 | 11.53 ± 3.72 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | Vulkan,BLAS | 6 | 0 | pp512 | 41.62 ± 0.07 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | Vulkan,BLAS | 6 | 0 | tg128 | 42.50 ± 0.16 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | Vulkan,BLAS | 6 | 1 | pp512 | 41.34 ± 0.38 |
| qwen3 14B Q4_K - Medium | 8.38 GiB | 14.77 B | Vulkan,BLAS | 6 | 1 | tg128 | 42.95 ± 0.22 |
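To put a number on the regression, a quick check on the tg128 figures from the table above shows the Metal backend generating tokens roughly 3.8x slower with flash attention enabled:

```python
# tg128 throughput (t/s) for the Metal backend, taken from the table above
metal_fa_off = 43.76  # -fa 0
metal_fa_on = 11.53   # -fa 1

slowdown = metal_fa_off / metal_fa_on
print(f"Metal tg128 slowdown with flash attention: {slowdown:.1f}x")  # → 3.8x
```

(The Vulkan backend shows no such drop: 42.50 vs 42.95 t/s is within noise.)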
I know Intel Macs are not a priority, and I'm happy using the Vulkan backend; I just wanted to report my findings and let others know that using the Metal backend without flash attention might be worth it.
First Bad Commit
No response
Relevant log output
No response