### Problem description
The default value of `GGML_SCHED_MAX_COPIES` is 4. With that value, `-sm layer` performs significantly worse than `-sm none` on this setup.
Setting `GGML_SCHED_MAX_COPIES` to 1 brings `-sm layer` performance up to the level of `-sm none` and does not seem to otherwise hurt performance in this use case.
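For reference, the workaround build can be configured like this (a build-configuration sketch, assuming a standard out-of-tree CMake build of llama.cpp; the option names are the ones from this report):

```shell
# Configure a CUDA build with a single scheduler graph copy, which
# effectively disables pipeline parallelism (the default is 4 copies).
cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release
```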
### Benchmarks
With CMake options `-DGGML_CUDA=ON`, `-sm layer` performs worse than `-sm none`:
```
> llama-bench.exe -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -sm layer
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |           pp512 |       3306.84 ± 3.18 |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |           tg128 |         40.06 ± 0.10 |

build: b775345d (5470)
```
```
> llama-bench.exe -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -sm none
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |    sm |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |  none |           pp512 |      3315.12 ± 14.63 |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |  none |           tg128 |         61.38 ± 0.13 |

build: b775345d (5470)
```
With CMake options `-DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA=ON`, `-sm layer` performs the same as `-sm none`:
```
> llama-bench.exe -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -sm layer
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |           pp512 |       3314.12 ± 8.22 |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |           tg128 |         60.46 ± 0.23 |

build: b775345d (5470)
```
```
> llama-bench.exe -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -sm none
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |    sm |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |  none |           pp512 |      3328.29 ± 10.94 |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |  none |           tg128 |         61.30 ± 0.07 |

build: b775345d (5470)
```
### Additional information
Using `--override-tensors` also seems to disable pipeline parallelism, even in builds with `-DGGML_SCHED_MAX_COPIES=4`: with `-v`, the line `llama_context: pipeline parallelism enabled (n_copies=4)` is not printed when `--override-tensors` is used.
```
> llama-bench.exe -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -sm layer -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32)\.=CUDA0;blk\.(33|34|35|36|37|38|39|40|41|42|43|44|45|46|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63)\.=CUDA1"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | ot                    |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------------- | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 | blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32)\.=CUDA0;blk\.(33|34|35|36|37|38|39|40|41|42|43|44|45|46|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63)\.=CUDA1 |           pp512 |      3311.54 ± 12.45 |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 | blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32)\.=CUDA0;blk\.(33|34|35|36|37|38|39|40|41|42|43|44|45|46|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63)\.=CUDA1 |           tg128 |         60.80 ± 0.07 |

build: b775345d (5470)
```
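As an aside, the long `-ot` pattern in the command above was generated mechanically; a small helper along these lines (hypothetical, not part of llama.cpp) builds contiguous-range rules of the same shape (note the command above happens to skip block 47, which a contiguous range would include):

```shell
# Hypothetical helper: build one --override-tensors rule that pins a
# contiguous range of block indices to a single device,
# e.g. "blk\.(0|1|2)\.=CUDA0".
ot_rule() {
  local first=$1 last=$2 device=$3
  local alt
  alt=$(seq -s '|' "$first" "$last")   # e.g. "0|1|2"
  echo "blk\\.(${alt})\\.=${device}"
}

# Split a 64-layer model across two GPUs: blocks 0-32 on CUDA0,
# blocks 33-63 on CUDA1, joined with ';' as -ot expects.
rule="$(ot_rule 0 32 CUDA0);$(ot_rule 33 63 CUDA1)"
echo "$rule"
```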
The model used is https://huggingface.co/Qwen/Qwen3-32B-GGUF/blob/main/Qwen3-32B-Q4_K_M.gguf, but other model architectures exhibit the same behaviour; I tested qwen3, qwen3moe, llama, and gemma3.
Disabling pipeline parallelism also improves performance for models that don't fit on a single GPU in the first place. For example, https://huggingface.co/Qwen/Qwen3-235B-A22B-GGUF/tree/main/Q4_K_M goes from 25 t/s to 60 t/s.
All tests were done on Windows with CUDA Toolkit 12.9.
```
> llama-cli.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
version: 5470 (b775345d)
built with MSVC 19.42.34435.0 for x64
```