Conversation

jeffbolznv
Collaborator

Add a backend proc to allow the backend to modify the graph. The Vulkan implementation looks at which nodes depend on each other and greedily reorders them to group together nodes that don't depend on each other. It only reorders the nodes; it doesn't change the contents of any of them.

With #15489, this reduces the number of synchronizations needed.
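
Conceptually, the reordering pass works like the following minimal sketch (not the PR's actual code: it uses a simplified node type instead of ggml_tensor and ignores the implicit dependencies through views that the real implementation also tracks). It walks the graph in order and greedily pulls forward later nodes whose inputs have already been scheduled, so independent nodes end up adjacent.

#include <vector>

// simplified stand-in for a graph node; srcs holds the indices of the nodes it reads from
struct node {
    std::vector<int> srcs;
};

// returns a permutation of [0, n) that is still a valid topological order,
// assuming the input is already topologically ordered (as ggml graphs are)
std::vector<int> greedy_reorder(const std::vector<node> & nodes) {
    const int n = (int) nodes.size();

    std::vector<bool> scheduled(n, false);
    std::vector<int>  order;
    order.reserve(n);

    auto ready = [&](int i) {
        for (int s : nodes[i].srcs) {
            if (!scheduled[s]) {
                return false;
            }
        }
        return true;
    };

    for (int i = 0; i < n; ++i) {
        if (scheduled[i]) {
            continue;
        }
        scheduled[i] = true;
        order.push_back(i);

        // greedily pull forward later nodes whose inputs are already scheduled,
        // so that independent nodes end up next to each other and can overlap
        for (int j = i + 1; j < n; ++j) {
            if (!scheduled[j] && ready(j)) {
                scheduled[j] = true;
                order.push_back(j);
            }
        }
    }

    return order;
}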

Performance on 5090:

before

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        217.89 ± 0.62 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        184.30 ± 8.05 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        122.90 ± 3.50 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        647.11 ± 9.85 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        647.63 ± 7.75 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        342.45 ± 2.72 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        232.42 ± 3.79 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        217.38 ± 1.77 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        285.17 ± 4.83 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        242.25 ± 6.27 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        328.79 ± 1.92 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        247.47 ± 2.68 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        294.44 ± 1.64 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         89.74 ± 2.02 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         49.02 ± 0.13 |

after

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        226.14 ± 1.00 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        188.87 ± 8.07 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        130.14 ± 1.87 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        702.48 ± 6.27 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        691.18 ± 4.58 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        356.65 ± 3.19 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        248.13 ± 4.80 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        237.03 ± 3.20 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        306.41 ± 5.12 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        264.83 ± 8.34 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        330.56 ± 2.73 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        254.38 ± 1.73 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        301.31 ± 2.63 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         93.02 ± 0.17 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         49.93 ± 0.16 |

@jeffbolznv jeffbolznv requested review from 0cc4m and slaren September 7, 2025 04:20
@jeffbolznv jeffbolznv requested a review from taronaeo as a code owner September 7, 2025 04:24
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language Apple Metal https://en.wikipedia.org/wiki/Metal_(API) Ascend NPU issues specific to Ascend NPUs OpenCL Issues specific to the OpenCL backend IBM zDNN issues specific to IBM zDNN Accelerator labels Sep 7, 2025
Collaborator

@taronaeo taronaeo left a comment


LGTM for zDNN. If I'm not wrong, this is something we at IBM Research are looking to do :). Looking forward to this!

@taronaeo taronaeo merged commit e68aa10 into ggml-org:master Sep 8, 2025
90 of 91 checks passed
@ggerganov
Member

ggerganov commented Sep 8, 2025

@taronaeo Friendly reminder, when the author of the PR is a collaborator, we usually let them merge it themselves.

@jeffbolznv
Collaborator Author

I would have waited for @0cc4m to merge after he reviews the ggml-vulkan changes (he often also does perf tests on HW I don't have available).

@taronaeo
Collaborator

taronaeo commented Sep 8, 2025

My apologies, jumped the gun on this. I'll take note in the future.

/* .graph_compute = */ ggml_backend_metal_graph_compute,
/* .event_record = */ NULL,
/* .event_wait = */ NULL,
/* .optimize_graph = */ NULL,
Member


.graph_optimize would have been a more consistent name.

Member


Yeah, it could be renamed.

Collaborator Author


I'll fix this when I get back to work, if nobody beats me to it.

Member


Btw, I'm wondering what the benefit of delegating this optimization step to the scheduler is. It seems like the same effect can be achieved by creating an array of indices with the order in which we want to traverse the graph. This can be done inside .graph_compute and can even be interleaved with the GPU work if needed.
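
A rough sketch of that alternative, with hypothetical helpers (build_traversal_order and encode_node are stand-ins, not real llama.cpp functions):

#include <vector>

#include "ggml.h"

// stand-ins for backend-specific logic
std::vector<int> build_traversal_order(struct ggml_cgraph * cgraph); // dependency analysis
void             encode_node(struct ggml_tensor * node);             // per-node dispatch

void graph_compute_reordered(struct ggml_cgraph * cgraph) {
    const std::vector<int> order = build_traversal_order(cgraph);

    // walk the graph through the index array instead of in node order
    for (int idx : order) {
        encode_node(ggml_graph_node(cgraph, idx));
    }
}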

Collaborator Author


ggml-alloc will aggressively reuse memory, which interferes with concurrency. I prototyped a version of this where I did it entirely in the backend, and I basically had to ignore the real allocations and use temporary allocations for all the tensors I wanted to reorder.

Member


Ok got it.

Another question though: isn't the original concern from #15489 (comment) now valid again? Without the actual address ranges, you might miss a dependency between nodes that is not represented by the graph. Back there you solved it by checking the actual address ranges, but here that logic is not present.

Collaborator Author


I'm not completely sure, but I did consider this case. Something like set_rows still operates on (views of) tensors, and I included a check that treats two operations as implicitly dependent if they both view the same tensor.

There aren't any actual allocations at this point, so it all has to be done in terms of tensors; I think this works out.
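
For illustration, a minimal sketch of that kind of view-based check (simplified, not the PR's actual code): follow each tensor's view chain to its base tensor and treat two nodes as implicitly dependent if the bases match.

#include "ggml.h"

// follow the view chain to the tensor that owns the data
static const struct ggml_tensor * base_tensor(const struct ggml_tensor * t) {
    while (t->view_src != nullptr) {
        t = t->view_src;
    }
    return t;
}

// conservative: any two ops touching views of the same tensor are kept ordered
static bool implicit_dependency(const struct ggml_tensor * a, const struct ggml_tensor * b) {
    return base_tensor(a) == base_tensor(b);
}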

Member


For the Metal backend I implemented a backend-agnostic graph optimization that reorders the nodes for improved concurrency while preserving the order of fusable ops, and that does not reorder problematic operators such as GGML_OP_CPY and GGML_OP_SET_ROWS. I think it is generic and fast enough to be used by all backends, but I'm currently testing it only with the Metal backend; it seems to work well so far. If you are interested in trying it out, you can quite easily plug it into the Vulkan backend. The implementation is self-contained in the ggml-metal-common.cpp source:

void ggml_graph_optimize(ggml_cgraph * gf) {
    constexpr int MAX_FUSE = 16;

    const int n = gf->n_nodes;

    enum ggml_op ops[MAX_FUSE];

    std::vector<node_info> nodes;
    nodes.reserve(gf->n_nodes);
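
(The excerpt above is truncated; the full function is in ggml-metal-common.cpp.) If another backend wanted to reuse it, the wiring might look roughly like the sketch below; the hook name and signature are assumptions based on the interface entry added in this PR, not verified code.

#include "ggml-backend.h"

// from ggml-metal-common.cpp (declaration copied from the excerpt above)
void ggml_graph_optimize(ggml_cgraph * gf);

// hypothetical Vulkan-side hook that simply delegates to the shared optimizer
static void ggml_backend_vk_graph_optimize(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
    GGML_UNUSED(backend);

    ggml_graph_optimize(cgraph);
}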

Collaborator Author


I don't think it's possible to generate the most efficient ordering without knowing what is actually (not just theoretically) fusable by the backend. For example, if you have two matmul+adds:

    t0 = matmul ...
    t1 = add t0, ...
    t2 = matmul ...
    t3 = add t2, ...

If the backend fuses matmul+add, then t0,t1,t2,t3 is the correct order: the two fused matmul+adds can run concurrently. But if the backend does not fuse matmul+add, then the better order is t0,t2,t1,t3, so that the two matmuls can run concurrently and the two adds can run concurrently.
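
To make that concrete with hypothetical names (not real llama.cpp code), the ordering decision could hinge on a backend capability query:

#include <vector>

// stand-in for a backend-specific query: does the backend fuse matmul+add?
bool backend_fuses_mul_mat_add();

// node indices of one matmul+add pair from the example above
struct mul_mat_add { int mul_mat, add; };

std::vector<int> order_two_pairs(mul_mat_add p0, mul_mat_add p1) {
    if (backend_fuses_mul_mat_add()) {
        // each pair becomes one fused dispatch, so keep the pairs intact: t0, t1, t2, t3
        return { p0.mul_mat, p0.add, p1.mul_mat, p1.add };
    }
    // unfused: interleave so the two matmuls overlap and then the two adds overlap: t0, t2, t1, t3
    return { p0.mul_mat, p1.mul_mat, p0.add, p1.add };
}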
