
Conversation

@ggerganov (Member) commented Nov 9, 2025

fix #16657
ref #16276 (review)

This fixes RPC inference when the Metal backend is involved (a short sketch of the idea is included at the end of this description).

Testing:

```sh
# server
make -j && ./bin/rpc-server

# cli
make -j && ./bin/llama-cli -m ../models/gemma-3-4b-it/ggml-model-f16.gguf --rpc localhost:50052 -ngl 99 --no-mmap -no-cnv -p "Hello" --top-k 1 -n 32 -fa on
```

TODO:

  • Check performance impact
  • Cache the responses to avoid extra RPC calls?
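
For anyone skimming this thread: the gist of the change, as I understand it, is that the RPC client should no longer derive a tensor's allocation size purely on its own but ask the server for the size its backend would actually allocate, since a backend such as Metal can require padding that the client cannot know about. A minimal, self-contained sketch of that idea follows (the function names and the 256-byte alignment are made up for illustration; this is not the actual patch):

```cpp
// toy_alloc_size.cpp -- illustrative sketch only, not the actual patch.
// Shows why a locally computed size can disagree with what the server's
// backend really needs; the alignment value below is hypothetical.
#include <cstdint>
#include <cstdio>

// client-side guess: the raw byte count of the tensor data
static uint64_t client_guess_alloc_size(uint64_t n_elements, uint64_t elem_size) {
    return n_elements * elem_size;
}

// stand-in for the server's reply to an RPC_CMD_GET_ALLOC_SIZE request:
// the real backend may round allocations up to its own padding rules
static uint64_t server_alloc_size(uint64_t nbytes) {
    const uint64_t align = 256; // hypothetical backend requirement
    return (nbytes + align - 1) / align * align;
}

int main() {
    const uint64_t guess  = client_guess_alloc_size(1000, 2); // e.g. 1000 f16 values
    const uint64_t actual = server_alloc_size(guess);
    std::printf("client guess : %llu bytes\n", (unsigned long long) guess);
    std::printf("server answer: %llu bytes\n", (unsigned long long) actual);
    // if buffers are sized with 'guess' while the backend needs 'actual',
    // tensor data ends up misplaced -> incorrect inference results
    return 0;
}
```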

@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and Apple Metal labels Nov 9, 2025
@ggerganov (Member, Author)

@jukofyork I think you were doing some experiments with RPC recently - could you check whether this change significantly affects performance in your RPC use cases?

@rgerganov (Collaborator) left a comment

Looking at the debug logs on the server side, I believe this change will only increase TTFT and will not affect PP and TG speeds. Do I get this right?

```diff
@@ -8,4 +7,4 @@
#endif

#define RPC_PROTO_MAJOR_VERSION 3
```
Inline comment on the hunk above (Collaborator):

let's bump the protocol version as this is a breaking change

@ggerganov (Member, Author)

> Looking at the debug logs on the server side, I believe this change will only increase TTFT and will not affect PP and TG speeds. Do I get this right?

Hm, maybe that's correct. Likely because of the graph reuse logic. If you disable the graph reuse, do you see more RPC calls?

```sh
LLAMA_GRAPH_REUSE_DISABLE=1 llama-cli ...
```

@rgerganov (Collaborator)

>> Looking at the debug logs on the server side, I believe this change will only increase TTFT and will not affect PP and TG speeds. Do I get this right?
>
> Hm, maybe that's correct. Likely because of the graph reuse logic. If you disable the graph reuse, do you see more RPC calls?
>
> `LLAMA_GRAPH_REUSE_DISABLE=1 llama-cli ...`

Yes, disabling graph reuse results in a lot of RPC_CMD_GET_ALLOC_SIZE calls for each graph computation.

@ggerganov (Member, Author)

Great, I didn't realize until now that graph reuse saves us almost all of the extra RPC calls. That probably makes caching the RPC responses almost redundant, since I don't think this will have a significant impact on PP.

@ggerganov ggerganov marked this pull request as ready for review November 10, 2025 19:34
@ggerganov ggerganov requested a review from slaren as a code owner November 10, 2025 19:34
@ggerganov (Member, Author)

Some benchmarks would be useful to confirm that the performance is not significantly affected. Other than that, I think it should be good to merge.

@jukofyork (Collaborator)

> @jukofyork I think you were doing some experiments with RPC recently - could you check whether this change significantly affects performance in your RPC use cases?

No problem, but it will likely be Thursday before I can run any tests.

@slaren (Member) commented Nov 11, 2025

I think eventually the proper solution to this and other per-tensor calls such as init_tensor will be to batch all the tensors into a single call. For example, ggml-alloc could be modified to build a list of tensors and obtain all the tensor alloc sizes first, and use that data rather than calling get_alloc_size every time. This way, instead of thousands of calls (each with a network roundtrip), it would only require one. The RPC backend could also cache the results so that calls with identical tensors don't require the network at all.
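
To make the caching half of this concrete, here is a minimal sketch (illustrative types and names, not existing llama.cpp code): the reply to a get_alloc_size query is assumed here to be determined by the tensor's type and shape (ignoring views/strides for simplicity), so the RPC backend could memoize replies keyed on those properties and skip the network entirely for tensors it has already seen; a batched call would then only be needed for the keys that miss the cache.

```cpp
// alloc_size_cache.cpp -- illustrative sketch, not existing llama.cpp code.
#include <array>
#include <cstdint>
#include <cstdio>
#include <map>
#include <tuple>

// the properties assumed here to determine a tensor's allocation size
struct alloc_key {
    int                    type; // tensor data type
    std::array<int64_t, 4> ne;   // tensor dimensions
    bool operator<(const alloc_key & o) const {
        return std::tie(type, ne) < std::tie(o.type, o.ne);
    }
};

// stand-in for the real network round trip (serialize the tensor, send
// RPC_CMD_GET_ALLOC_SIZE, read the server's reply)
static uint64_t get_alloc_size_remote(const alloc_key & key) {
    uint64_t nbytes = 2; // pretend 2-byte elements
    for (int64_t d : key.ne) nbytes *= (uint64_t) d;
    return (nbytes + 255) / 256 * 256; // pretend the backend pads to 256 bytes
}

// memoize replies so identical tensors never hit the network twice
static uint64_t get_alloc_size_cached(const alloc_key & key) {
    static std::map<alloc_key, uint64_t> cache;
    auto it = cache.find(key);
    if (it != cache.end()) {
        return it->second; // cache hit: no RPC call
    }
    const uint64_t size = get_alloc_size_remote(key); // one round trip
    cache.emplace(key, size);
    return size;
}

int main() {
    const alloc_key k = { /*type=*/1, { 4096, 4096, 1, 1 } };
    std::printf("first call : %llu bytes (goes to the server)\n", (unsigned long long) get_alloc_size_cached(k));
    std::printf("second call: %llu bytes (served from cache)\n", (unsigned long long) get_alloc_size_cached(k));
    return 0;
}
```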

@ggerganov force-pushed the gg/rpc-fix-alloc-size branch from 4953693 to 5c6f2a8 on December 3, 2025 11:14
@ggerganov (Member, Author)

@jukofyork @eugr Do you want to do some perf tests with this branch before merging to verify there is no significant performance hit?

@jukofyork (Collaborator)

> @jukofyork @eugr Do you want to do some perf tests with this branch before merging to verify there is no significant performance hit?

Sorry, I forgot all about this - I'll try and test it this afternoon and report back.

@jukofyork (Collaborator)

> @jukofyork @eugr Do you want to do some perf tests with this branch before merging to verify there is no significant performance hit?

Using 3 nodes, each with 2x A6000, running:

```sh
./llama-bench \
        --rpc "192.168.1.2:50052,192.168.1.3:50052" \
        --model ~/models/gguf/GLM-4.6-Q5_X.gguf \
        --n-gpu-layers 99 \
        --flash-attn 1 \
        --n-prompt 512 \
        --n-gen 128,256,512
```

Master

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | pp512 | 364.66 ± 0.90 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | tg128 | 22.01 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | tg256 | 21.98 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | tg512 | 21.69 ± 0.02 |

PR

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | pp512 | 362.57 ± 1.68 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | tg128 | 21.94 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | tg256 | 21.90 ± 0.00 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | tg512 | 21.47 ± 0.06 |

@eugr commented Dec 4, 2025

@ggerganov - sorry for the delay, I've been trying to get some work done :)
Looks like I'm getting a significant performance hit on my Sparks:

Before patch:

```sh
build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf --rpc 192.168.177.12:15001 -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
```

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | pp2048 | 1745.82 ± 16.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | tg32 | 51.07 ± 0.57 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | pp2048 @ d4096 | 1600.32 ± 14.38 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | tg32 @ d4096 | 48.40 ± 0.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | pp2048 @ d8192 | 1465.80 ± 13.11 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | tg32 @ d8192 | 46.95 ± 1.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | pp2048 @ d16384 | 1237.04 ± 12.66 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | tg32 @ d16384 | 43.60 ± 0.41 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | pp2048 @ d32768 | 941.25 ± 10.39 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | tg32 @ d32768 | 37.17 ± 0.52 |

build: bde188d (7275)

Applying the patch:

```sh
curl -L https://patch-diff.githubusercontent.com/raw/ggml-org/llama.cpp/pull/17116.diff | git apply
```

After patch:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | pp2048 | 1768.18 ± 11.56 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | tg32 | 47.47 ± 1.31 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | pp2048 @ d4096 | 1610.21 ± 53.17 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | tg32 @ d4096 | 44.53 ± 0.97 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | pp2048 @ d8192 | 1463.13 ± 16.52 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | tg32 @ d8192 | 43.12 ± 2.64 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | pp2048 @ d16384 | 1233.73 ± 27.81 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | tg32 @ d16384 | 41.26 ± 2.38 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | pp2048 @ d32768 | 928.27 ± 6.69 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | tg32 @ d32768 | 34.99 ± 1.01 |

build: bde188d (7275)


Labels

Apple Metal, ggml


Development

Successfully merging this pull request may close these issues.

Eval bug: Incorrect outputs when running inference in multiple nodes (Mac)

6 participants