
Large performance drop when using pipeline parallelism and layer splitting on multiple GPUs #13751

@matbrez

Problem description

The default value for GGML_SCHED_MAX_COPIES is 4. With that value, -sm layer performs significantly worse than -sm none.
Setting GGML_SCHED_MAX_COPIES to 1 brings -sm layer performance up to the level of -sm none and does not otherwise appear to hurt performance in this use case.
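For reference, rebuilding with the scheduler copy count set to 1 looks roughly like the following (a sketch of the standard llama.cpp CMake workflow; adjust the build directory and generator to your environment):

> cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1
> cmake --build build --config Release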

Benchmarks

CMake options: -DGGML_CUDA=ON. With this build, -sm layer performs worse than -sm none.

> llama-bench.exe -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -sm layer 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |           pp512 |       3306.84 ± 3.18 |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |           tg128 |         40.06 ± 0.10 |

build: b775345d (5470)
> llama-bench.exe -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -sm none
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |    sm |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |  none |           pp512 |      3315.12 ± 14.63 |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |  none |           tg128 |         61.38 ± 0.13 |

build: b775345d (5470)

CMake options: -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA=ON. With this build, -sm layer performs the same as -sm none.

> llama-bench.exe -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -sm layer
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |           pp512 |       3314.12 ± 8.22 |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |           tg128 |         60.46 ± 0.23 |

build: b775345d (5470)
> llama-bench.exe -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -sm none
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |    sm |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |  none |           pp512 |      3328.29 ± 10.94 |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 |  none |           tg128 |         61.30 ± 0.07 |

build: b775345d (5470)

Additional information

Using --override-tensors also appears to disable pipeline parallelism, even in builds with -DGGML_SCHED_MAX_COPIES=4. When running with -v, the line llama_context: pipeline parallelism enabled (n_copies=4) is not printed when --override-tensors is used.
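
For reference, one way to check this is to run with -v and look for the n_copies line in the startup log (hypothetical invocation; the log line is the one quoted above and only appears when pipeline parallelism is active):

> llama-cli.exe -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -sm layer -v
...
llama_context: pipeline parallelism enabled (n_copies=4)
...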

> llama-bench.exe -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -sm layer -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32)\.=CUDA0;blk\.(33|34|35|36|37|38|39|40|41|42|43|44|45|46|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63)\.=CUDA1"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | ot                    |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------------- | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 | blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32)\.=CUDA0;blk\.(33|34|35|36|37|38|39|40|41|42|43|44|45|46|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63)\.=CUDA1 |           pp512 |      3311.54 ± 12.45 |
| qwen3 32B Q4_K - Medium        |  18.40 GiB |    32.76 B | CUDA       |  99 | blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32)\.=CUDA0;blk\.(33|34|35|36|37|38|39|40|41|42|43|44|45|46|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63)\.=CUDA1 |           tg128 |         60.80 ± 0.07 |

build: b775345d (5470)

The model used is https://huggingface.co/Qwen/Qwen3-32B-GGUF/blob/main/Qwen3-32B-Q4_K_M.gguf, but other model architectures exhibit the same behaviour; I tested qwen3, qwen3moe, llama, and gemma3.

Disabling pipeline parallelism also improves performance for models that don't fit on a single GPU in the first place. For example, https://huggingface.co/Qwen/Qwen3-235B-A22B-GGUF/tree/main/Q4_K_M goes from 25 t/s to 60 t/s.

All tests were done on Windows with CUDA Toolkit 12.9.

> llama-cli.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
version: 5470 (b775345d)
built with MSVC 19.42.34435.0 for x64
