
Large performance drop when using pipeline parallelism and layer splitting on multiple GPUs #13751

@matbrez

Problem description

The default value for GGML_SCHED_MAX_COPIES is 4. With that value, -sm layer performs significantly worse than -sm none.
Setting GGML_SCHED_MAX_COPIES to 1 brings -sm layer performance up to the level of -sm none and does not otherwise appear to hurt performance in this use case.
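For reference, rebuilding with the scheduler copy count set to 1 looks roughly like the following (a sketch of the standard llama.cpp CMake workflow; adjust the build directory and generator to your environment):

> cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1
> cmake --build build --config Release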

Benchmarks

CMake options: -DGGML_CUDA=ON. With this build, -sm layer performs worse than -sm none.

> llama-bench.exe -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -sm layer 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |           pp512 |       3306.84 ± 3.18 |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |           tg128 |         40.06 ± 0.10 |

build: b775345d (5470)
> llama-bench.exe -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -sm none
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |    sm |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |  none |           pp512 |      3315.12 ± 14.63 |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |  none |           tg128 |         61.38 ± 0.13 |

build: b775345d (5470)

CMake options: -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA=ON. With this build, -sm layer performs the same as -sm none.

> llama-bench.exe -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -sm layer
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |           pp512 |       3314.12 ± 8.22 |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |           tg128 |         60.46 ± 0.23 |

build: b775345d (5470)
> llama-bench.exe -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -sm none
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |    sm |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |  none |           pp512 |      3328.29 ± 10.94 |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |  none |           tg128 |         61.30 ± 0.07 |

build: b775345d (5470)

Additional information

Using --override-tensors also appears to disable pipeline parallelism, even in builds with -DGGML_SCHED_MAX_COPIES=4. When running with -v, the line llama_context: pipeline parallelism enabled (n_copies=4) is not printed when --override-tensors is used.
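
For reference, one way to check this is to run with -v and look for the n_copies line in the startup log (hypothetical invocation; the log line is the one quoted above and only appears when pipeline parallelism is active):

> llama-cli.exe -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -sm layer -v
...
llama_context: pipeline parallelism enabled (n_copies=4)
...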

> llama-bench.exe -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -sm layer -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32)\.=CUDA0;blk\.(33|34|35|36|37|38|39|40|41|42|43|44|45|46|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63)\.=CUDA1"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | ot                    |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------------- | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 | blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32)\.=CUDA0;blk\.(33|34|35|36|37|38|39|40|41|42|43|44|45|46|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63)\.=CUDA1 |           pp512 |      3311.54 ± 12.45 |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 | blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32)\.=CUDA0;blk\.(33|34|35|36|37|38|39|40|41|42|43|44|45|46|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63)\.=CUDA1 |           tg128 |         60.80 ± 0.07 |

build: b775345d (5470)

The model used is https://huggingface.co/Qwen/Qwen3-32B-GGUF/blob/main/Qwen3-32B-Q4_K_M.gguf, but other model architectures exhibit the same behaviour; I tested qwen3, qwen3moe, llama, and gemma3.

Disabling pipeline parallelism also improves performance for models that don't fit on a single GPU in the first place. For example, https://huggingface.co/Qwen/Qwen3-235B-A22B-GGUF/tree/main/Q4_K_M goes from 25 t/s to 60 t/s.

All tests were done on Windows with CUDA Toolkit 12.9.

> llama-cli.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
version: 5470 (b775345d)
built with MSVC 19.42.34435.0 for x64
