xctan commented Oct 31, 2025

This PR optimizes the q2_k and q3_k vector dot product kernels, applying techniques similar to those in #15720.

This change removes vector-length-dependent designs, allowing the kernels to support vector lengths wider than 128 bits.
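For context, the sketch below illustrates what a vector-length-agnostic inner loop can look like. It assumes RISC-V Vector intrinsics, which this excerpt does not explicitly name, and the function `vla_dot_i8` is hypothetical; it is not the PR's actual kernel, only an illustration of the general technique of letting the hardware choose the active vector length per iteration instead of hard-coding a 128-bit width.

```c
// Hypothetical illustration of a vector-length-agnostic dot product:
// the loop queries the active vector length each iteration (vsetvl),
// so the same code runs on 128-, 256-, or 512-bit vector units
// without per-width code paths.
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

static int32_t vla_dot_i8(const int8_t *a, const int8_t *b, size_t n) {
    // accumulator lives in element 0 of a vector register
    vint32m1_t acc = __riscv_vmv_v_x_i32m1(0, 1);
    for (size_t i = 0; i < n; ) {
        size_t vl = __riscv_vsetvl_e8m1(n - i);              // hardware picks VL
        vint8m1_t va = __riscv_vle8_v_i8m1(a + i, vl);
        vint8m1_t vb = __riscv_vle8_v_i8m1(b + i, vl);
        vint16m2_t prod = __riscv_vwmul_vv_i16m2(va, vb, vl); // widening multiply
        acc = __riscv_vwredsum_vs_i16m2_i32m1(prod, acc, vl); // widening reduction into acc
        i += vl;
    }
    return __riscv_vmv_x_s_i32m1_i32(acc);                    // extract element 0
}
```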

The generation speedup decreases from 1.72x at 16 threads to 1.36x at 64 threads, likely due to memory bandwidth constraints.

Perplexity was measured to ensure correctness and remains unchanged:

PR: PPL = 18.8338 +/- 0.18369
master: PPL = 18.8338 +/- 0.18369

xctan commented Oct 31, 2025

Performance data is shown below:

| model | size | params | backend | threads | test | t/s | branch |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3 4B Q2_K - Medium | 1.60 GiB | 3.88 B | CPU | 64 | pp512 | 71.91 ± 0.38 | PR |
| gemma3 4B Q2_K - Medium | 1.60 GiB | 3.88 B | CPU | 64 | pp512 | 66.27 ± 0.06 | master |
| gemma3 4B Q2_K - Medium | 1.60 GiB | 3.88 B | CPU | 64 | tg128 | 23.21 ± 1.23 | PR |
| gemma3 4B Q2_K - Medium | 1.60 GiB | 3.88 B | CPU | 64 | tg128 | 17.07 ± 0.41 | master |
| gemma3 4B Q2_K - Medium | 1.60 GiB | 3.88 B | CPU | 32 | pp512 | 38.13 ± 0.01 | PR |
| gemma3 4B Q2_K - Medium | 1.60 GiB | 3.88 B | CPU | 32 | pp512 | 34.97 ± 0.01 | master |
| gemma3 4B Q2_K - Medium | 1.60 GiB | 3.88 B | CPU | 32 | tg128 | 18.43 ± 0.08 | PR |
| gemma3 4B Q2_K - Medium | 1.60 GiB | 3.88 B | CPU | 32 | tg128 | 11.58 ± 0.02 | master |
| gemma3 4B Q2_K - Medium | 1.60 GiB | 3.88 B | CPU | 16 | pp512 | 19.42 ± 0.01 | PR |
| gemma3 4B Q2_K - Medium | 1.60 GiB | 3.88 B | CPU | 16 | pp512 | 17.59 ± 0.04 | master |
| gemma3 4B Q2_K - Medium | 1.60 GiB | 3.88 B | CPU | 16 | tg128 | 10.76 ± 0.06 | PR |
| gemma3 4B Q2_K - Medium | 1.60 GiB | 3.88 B | CPU | 16 | tg128 | 6.26 ± 0.01 | master |

github-actions bot added the `ggml` label (changes relating to the ggml tensor library for machine learning) on Oct 31, 2025.