
Conversation

@ggerganov (Member) commented:

Better threadgroup utilization for the floating-point mat-vec kernels:

| Model | Test | t/s (master) | t/s (gg/metal-mul-mv-opt-2) | Speedup |
| --- | --- | --- | --- | --- |
| qwen3 1.7B BF16 | tg32 | 127.14 | 141.96 | 1.12 |
| qwen3 1.7B F16 | tg32 | 127.44 | 142.26 | 1.12 |
| qwen3 1.7B all F32 | tg32 | 83.96 | 86.97 | 1.04 |

There are also some text-generation (TG) gains for MoE models of any quantization, since they use F32 matrix multiplication in the FFN (see the sketch after the table below):

| Model | Test | t/s (master) | t/s (gg/metal-mul-mv-opt-2) | Speedup |
| --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | tg32 | 128.57 | 132.93 | 1.03 |
| qwen3moe 30B.A3B Q4_0 | tg32 | 100.39 | 102.86 | 1.02 |
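
For context, here is a minimal Metal sketch of the threadgroup-utilization idea behind these numbers. It is not the actual llama.cpp kernel; the kernel name, buffer layout, and fixed 32-lane SIMD width are illustrative assumptions. NSG SIMD-groups per threadgroup each reduce one row of the matrix, and NSG is the function constant introduced in this PR's second commit:

```metal
// Hypothetical sketch, not the llama.cpp source: y = A * x with A in half
// precision, using NSG SIMD-groups per threadgroup so every SIMD-group
// processes its own row instead of most of the threadgroup sitting idle.
#include <metal_stdlib>
using namespace metal;

constant ushort NSG [[function_constant(0)]]; // SIMD-groups per threadgroup

kernel void kernel_mul_mv_f16_f32(
        device const half  * A     [[buffer(0)]],
        device const float * x     [[buffer(1)]],
        device       float * y     [[buffer(2)]],
        constant     uint  & ncols [[buffer(3)]],
        constant     uint  & nrows [[buffer(4)]],
        uint   tgpig [[threadgroup_position_in_grid]],
        ushort tiisg [[thread_index_in_simdgroup]],
        ushort sgitg [[simdgroup_index_in_threadgroup]]) {
    // each of the NSG SIMD-groups in this threadgroup takes its own row
    const uint row = tgpig * NSG + sgitg;
    if (row >= nrows) {
        return;
    }
    device const half * a = A + (ulong) row * ncols;

    // the 32 lanes of the SIMD-group accumulate strided slices of the dot product
    float sumf = 0.0f;
    for (uint i = tiisg; i < ncols; i += 32) {
        sumf += (float) a[i] * x[i];
    }

    // reduce within the SIMD-group; lane 0 writes the result
    sumf = simd_sum(sumf);
    if (tiisg == 0) {
        y[row] = sumf;
    }
}
```

With this layout a threadgroup produces NSG output rows per dispatch instead of one, which is the kind of utilization gain the tables above measure.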

@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and Apple Metal (https://en.wikipedia.org/wiki/Metal_(API)) labels on Sep 17, 2025.

@ggerganov force-pushed the gg/metal-mul-mv-opt-2 branch from 5fbb485 to 64c6dcb on September 18, 2025 at 08:32.

@ggerganov merged commit b213fce into master on Sep 18, 2025, with 61 of 62 checks passed.
yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request on Oct 15, 2025:

* metal : improve F32, F16 and BF16 mat-vec multiplication

ggml-ci

* metal : make the NSG a function constant in mul_mv kernels

ggml-ci
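
The second commit makes the SIMD-group count (NSG) a Metal function constant, i.e. a value bound when the pipeline is built rather than a runtime argument. A hedged host-side sketch of what that binding could look like, using metal-cpp for brevity (llama.cpp's Metal backend is Objective-C, and the helper name make_mul_mv_pipeline is hypothetical):

```cpp
// Hypothetical helper, not the llama.cpp API: build one pipeline
// specialization of the mat-vec kernel for a given SIMD-group count.
#include <cstdint>
#include <Metal/Metal.hpp>

MTL::ComputePipelineState * make_mul_mv_pipeline(
        MTL::Device * dev, MTL::Library * lib, uint16_t nsg) {
    // bind NSG at pipeline-creation time; the compiler specializes the
    // kernel for this SIMD-group count instead of branching at runtime
    MTL::FunctionConstantValues * cv = MTL::FunctionConstantValues::alloc()->init();
    cv->setConstantValue(&nsg, MTL::DataTypeUShort, NS::UInteger(0)); // index 0 == NSG

    NS::Error * err = nullptr;
    MTL::Function * fn = lib->newFunction(
            NS::String::string("kernel_mul_mv_f16_f32", NS::UTF8StringEncoding), cv, &err);
    MTL::ComputePipelineState * pso = fn ? dev->newComputePipelineState(fn, &err) : nullptr;

    if (fn) { fn->release(); }
    cv->release();
    return pso; // dispatch with threadgroups of nsg * 32 threads
}
```

The trade-off is one compiled pipeline per NSG value, in exchange for a SIMD-group count the compiler can see and unroll against.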