**wishstudio** (Contributor) commented:
This PR implements the graph plan APIs for the CUDA backend, and adds code in ggml-backend.cpp to actually use the graph plan APIs when a backend supports them.

The main functional improvement is support for CUDA graphs when the graph is split (e.g. for hybrid inference). Currently the graph update and reuse logic (`ggml_backend_sched_update_plans`) is a simple heuristic: previous plans are only updated when the number of splits and their corresponding backends match the previous run; otherwise the plans are rebuilt. As the benchmarks below show, this universally accelerates hybrid inference tg performance, by up to 30%.

The CUDA graph execution code is refactored and cleaned up. Two of the three original graph plan failure paths are removed: `disable_due_to_failed_graph_capture` and `disable_due_to_too_many_updates`. The former because I found no code that ever sets it to true; the latter because I currently have no idea what its semantics should be in a split-graph scenario, and removing it does not seem to degrade performance at all. Interestingly, I found that on my rig, even repeatedly building a graph and then executing it only once is consistently faster than launching the kernels individually. I suspect this is why performance improved in tests even for CUDA-only workloads, which this PR's optimization does not target. This of course needs to be verified on more hardware configurations.
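The capture-then-replay pattern being compared against individual launches is the standard CUDA runtime stream-capture flow; a generic sketch (not this PR's actual code, error handling omitted, and `run_ggml_kernels` is a hypothetical placeholder for the backend's sequence of kernel launches):

```cpp
#include <cuda_runtime.h>

// Placeholder: issues the backend's kernel launches on `stream`.
void run_ggml_kernels(cudaStream_t stream);

void launch_as_graph(cudaStream_t stream) {
    cudaGraph_t     graph;
    cudaGraphExec_t graph_exec;

    // Record the kernel launches into a graph instead of executing them.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeRelaxed);
    run_ggml_kernels(stream);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate and replay. Per the observation above, even doing this
    // whole build for a single launch measured faster on the author's rig
    // than issuing the same kernels individually.
    cudaGraphInstantiate(&graph_exec, graph, NULL, NULL, 0);
    cudaGraphLaunch(graph_exec, stream);

    cudaGraphDestroy(graph);
    cudaGraphExecDestroy(graph_exec);
}
```

A plausible explanation is that a single `cudaGraphLaunch` amortizes per-kernel launch overhead on the CPU side, but as noted, this needs verification on more hardware.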

Performance comparison (RTX 5090 + 13700K, 128 GB 6400 MT/s RAM):

| model | n_cpu_moe | test | t/s master | t/s PR | speedup |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 0 | pp512 | 9070.14 | 8768.06 | 0.97 |
| gpt-oss 20B MXFP4 MoE | 0 | tg128 | 273.99 | 278.43 | 1.02 |
| gpt-oss 20B MXFP4 MoE | 99 | pp512 | 916.16 | 931.72 | 1.02 |
| gpt-oss 20B MXFP4 MoE | 99 | tg128 | 42.4 | 47.2 | 1.11 |
| gpt-oss 120B MXFP4 MoE | 24 | pp512 | 150.76 | 150.31 | 1.00 |
| gpt-oss 120B MXFP4 MoE | 24 | tg128 | 36.73 | 45.04 | 1.23 |
| gpt-oss 120B MXFP4 MoE | 99 | pp512 | 187.69 | 186.21 | 0.99 |
| gpt-oss 120B MXFP4 MoE | 99 | tg128 | 28.24 | 31.7 | 1.12 |
| glm4moe 106B.A12B Q4_K | 34 | pp512 | 81.4 | 79.9 | 0.98 |
| glm4moe 106B.A12B Q4_K | 34 | tg128 | 18.69 | 21.72 | 1.16 |
| glm4moe 106B.A12B Q4_K | 99 | pp512 | 114.85 | 114.34 | 1.00 |
| glm4moe 106B.A12B Q4_K | 99 | tg128 | 14.59 | 16.01 | 1.10 |
| glm4moe 355B.A32B Q2_K | 99 | pp512 | 25.3 | 26.96 | 1.07 |
| glm4moe 355B.A32B Q2_K | 99 | tg128 | 7.73 | 8.74 | 1.13 |
| qwen3moe 235B.A22B Q3_K | 80 | pp512 | 59.99 | 61.26 | 1.02 |
| qwen3moe 235B.A22B Q3_K | 80 | tg128 | 10.66 | 11.77 | 1.10 |
| qwen3moe 235B.A22B Q3_K | 99 | pp512 | 85.38 | 88.45 | 1.04 |
| qwen3moe 235B.A22B Q3_K | 99 | tg128 | 9.27 | 10.21 | 1.10 |
| qwen3moe 30B.A3B Q4_K | 0 | pp512 | 6806.34 | 7295.45 | 1.07 |
| qwen3moe 30B.A3B Q4_K | 0 | tg128 | 246.12 | 273 | 1.11 |
| qwen3moe 30B.A3B Q4_K | 99 | pp512 | 531.72 | 560.52 | 1.05 |
| qwen3moe 30B.A3B Q4_K | 99 | tg128 | 36.1 | 48.4 | 1.34 |
| qwen3 8B Q8_0 | - | pp512 | 10002.11 | 11383.42 | 1.14 |
| qwen3 8B Q8_0 | - | tg128 | 124.07 | 136.75 | 1.10 |
| llama 13B Q6_K | - | pp512 | 3886.45 | 4175.62 | 1.07 |
| llama 13B Q6_K | - | tg128 | 65.89 | 72.19 | 1.10 |
| llama 8B Q8_0 | - | pp512 | 10092.17 | 11702.73 | 1.16 |
| llama 8B Q8_0 | - | tg128 | 127.96 | 142.67 | 1.11 |
| gemma3 12B Q8_0 | - | pp512 | 6696.89 | 7920.99 | 1.18 |
| gemma3 12B Q8_0 | - | tg128 | 79.01 | 88.73 | 1.12 |
| nemotron_h 9B Q8_0 | - | pp512 | 7243.57 | 7808.74 | 1.08 |
| nemotron_h 9B Q8_0 | - | tg128 | 114.51 | 122.93 | 1.07 |

@github-actions bot added labels **Nvidia GPU** (issues specific to Nvidia GPUs) and **ggml** (changes relating to the ggml tensor library for machine learning) on Oct 13, 2025.