@ggerganov
Member

fix #12948

Increase text-generation (TG) performance at long contexts for models such as Phi-3 that have a head size of 96.

./bin/llama-batched-bench -m ../models/phi-3-mini-128k-instruct/ggml-model-q8_0.gguf -c 32768 -b 4096 -ub 4096 -npp 0,512,4096,8192,16384 -ntg 128 -npl 1 -lv 1 -fa
  • master
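| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 128 | 1 | 128 | 0.898 | 0.00 | 1.400 | 91.42 | 2.298 | 55.69 |
| 512 | 128 | 1 | 640 | 0.396 | 1294.24 | 1.510 | 84.74 | 1.906 | 335.77 |
| 4096 | 128 | 1 | 4224 | 2.173 | 1884.70 | 2.266 | 56.49 | 4.439 | 951.49 |
| 8192 | 128 | 1 | 8320 | 4.863 | 1684.49 | 3.128 | 40.92 | 7.991 | 1041.18 |
| 16384 | 128 | 1 | 16512 | 12.512 | 1309.50 | 4.867 | 26.30 | 17.378 | 950.15 |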
  • PR
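| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 128 | 1 | 128 | 0.968 | 0.00 | 1.294 | 98.92 | 2.262 | 56.59 |
| 512 | 128 | 1 | 640 | 0.421 | 1217.31 | 1.345 | 95.19 | 1.765 | 362.55 |
| 4096 | 128 | 1 | 4224 | 2.150 | 1904.81 | 1.686 | 75.91 | 3.837 | 1100.97 |
| 8192 | 128 | 1 | 8320 | 4.848 | 1689.73 | 2.074 | 61.72 | 6.922 | 1201.97 |
| 16384 | 128 | 1 | 16512 | 12.522 | 1308.46 | 2.864 | 44.70 | 15.385 | 1073.24 |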
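
To make the comparison easier to read, here is a small sketch that computes the TG speedup per prompt length from the S_TG t/s columns of the two tables above. The dictionary and function names (`master_tg`, `pr_tg`, `tg_speedup`) are illustrative only and are not part of llama.cpp or `llama-batched-bench`.

```python
# S_TG (tokens/s) values copied from the master and PR tables, keyed by PP.
master_tg = {0: 91.42, 512: 84.74, 4096: 56.49, 8192: 40.92, 16384: 26.30}
pr_tg = {0: 98.92, 512: 95.19, 4096: 75.91, 8192: 61.72, 16384: 44.70}


def tg_speedup(before, after):
    """Return {PP: after / before}, the ratio of TG throughput per prompt length."""
    return {pp: after[pp] / before[pp] for pp in before}


speedups = tg_speedup(master_tg, pr_tg)
for pp, ratio in speedups.items():
    print(f"PP={pp:>5}: {ratio:.2f}x")
```

On these numbers the ratio grows with context length, from roughly 1.08x with an empty prompt to roughly 1.70x at PP = 16384, while the prompt-processing speed (S_PP) is essentially unchanged.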

@github-actions (bot) added the ggml and Apple Metal labels on Apr 15, 2025
@ggerganov merged commit f8f820c into master on Apr 15, 2025
59 checks passed
@ggerganov deleted the gg/metal-fa-vec-add-h96 branch on April 15, 2025 11:45
colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025
Development

Successfully merging this pull request may close these issues.

low performance in large contex compared to mlx format model
