slaren commented Jul 31, 2025

  • When --override-tensor (-ot) is used to override tensors to the CPU, other buffer types are now considered as well. In practice, this means the host buffer types are used, which may improve performance when prompt processing is offloaded. Note that mmap must be disabled to use host buffers.
  • Adds a --no-repack (-nr) option to disable weight repacking.
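
To illustrate the idea behind an override such as `exps=CPU`, here is a minimal Python sketch, not llama.cpp's actual implementation. It assumes an override has the form `<pattern>=<buffer type>` and that the pattern is searched for as a regular expression within each tensor name. The helper names, the `"GPU"` default label, and the example tensor names are hypothetical and used only for illustration.

```python
import re

def parse_override(spec):
    """Split a '<pattern>=<buffer type>' string (assumed format) into its parts."""
    pattern, _, buffer_type = spec.rpartition("=")
    return re.compile(pattern), buffer_type

def assign_buffer(tensor_name, overrides, default="GPU"):
    """Return the buffer type of the first override whose pattern occurs in the name."""
    for pattern, buffer_type in overrides:
        if pattern.search(tensor_name):
            return buffer_type
    return default  # hypothetical label for "no override applies"

overrides = [parse_override("exps=CPU")]

# Hypothetical tensor names, for illustration only.
names = ["blk.0.ffn_up_exps.weight", "blk.0.attn_q.weight"]
placement = {name: assign_buffer(name, overrides) for name in names}

for name, buffer_type in placement.items():
    print(f"{name} -> {buffer_type}")
```

Under these assumptions, only the tensor whose name contains `exps` is assigned to the CPU; the other keeps the default placement.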

llama-bench -m Qwen3-30B-A3B-Q4_0.gguf -ot exps=CPU -n 0 -p 32,64,128,256,512,1024 -ub 1024 -mmp 0:

| Model | Test | t/s master | t/s sl/ot-repacking | Speedup |
| --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q4_0 | pp32 | 15.03 | 22.62 | 1.50 |
| qwen3moe 30B.A3B Q4_0 | pp64 | 28.87 | 45.04 | 1.56 |
| qwen3moe 30B.A3B Q4_0 | pp128 | 61.06 | 89.35 | 1.46 |
| qwen3moe 30B.A3B Q4_0 | pp256 | 121.44 | 173.97 | 1.43 |
| qwen3moe 30B.A3B Q4_0 | pp512 | 227.41 | 309.59 | 1.36 |
| qwen3moe 30B.A3B Q4_0 | pp1024 | 421.50 | 594.32 | 1.41 |
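
The Speedup column appears to be the branch throughput divided by the master throughput, rounded to two decimals. A short Python sketch that recomputes it from the figures above (the variable names are illustrative):

```python
# Tokens per second from the benchmark above: test -> (master, sl/ot-repacking)
results = {
    "pp32": (15.03, 22.62),
    "pp64": (28.87, 45.04),
    "pp128": (61.06, 89.35),
    "pp256": (121.44, 173.97),
    "pp512": (227.41, 309.59),
    "pp1024": (421.50, 594.32),
}

# Speedup = branch throughput / master throughput, rounded to two decimals
speedup = {
    test: round(branch / master, 2)
    for test, (master, branch) in results.items()
}

for test, ratio in speedup.items():
    print(f"{test}: {ratio:.2f}x")
```

The recomputed ratios range from 1.36 (pp512) to 1.56 (pp64), consistent with the reported column.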

slaren merged commit d6818d0 into master Jul 31, 2025
47 checks passed
slaren deleted the sl/ot-repacking branch July 31, 2025 16:11
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Aug 1, 2025