
Conversation

@ggerganov (Member) commented Nov 9, 2025

fix #16657
ref #16276 (review)

This fixes RPC inference when the Metal backend is involved (a short sketch of the idea is included at the end of this description).

Testing:

```sh
# server
make -j && ./bin/rpc-server

# cli
make -j && ./bin/llama-cli -m ../models/gemma-3-4b-it/ggml-model-f16.gguf --rpc localhost:50052 -ngl 99 --no-mmap -no-cnv -p "Hello" --top-k 1 -n 32 -fa on
```

TODO:

  • Check performance impact
  • Cache the responses to avoid extra RPC calls?
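
For anyone skimming this thread: the gist of the change, as I understand it, is that the RPC client should no longer derive a tensor's allocation size purely on its own but ask the server for the size its backend would actually allocate, since a backend such as Metal can require padding that the client cannot know about. A minimal, self-contained sketch of that idea follows (the function names and the 256-byte alignment are made up for illustration; this is not the actual patch):

```cpp
// toy_alloc_size.cpp -- illustrative sketch only, not the actual patch.
// Shows why a locally computed size can disagree with what the server's
// backend really needs; the alignment value below is hypothetical.
#include <cstdint>
#include <cstdio>

// client-side guess: the raw byte count of the tensor data
static uint64_t client_guess_alloc_size(uint64_t n_elements, uint64_t elem_size) {
    return n_elements * elem_size;
}

// stand-in for the server's reply to an RPC_CMD_GET_ALLOC_SIZE request:
// the real backend may round allocations up to its own padding rules
static uint64_t server_alloc_size(uint64_t nbytes) {
    const uint64_t align = 256; // hypothetical backend requirement
    return (nbytes + align - 1) / align * align;
}

int main() {
    const uint64_t guess  = client_guess_alloc_size(1000, 2); // e.g. 1000 f16 values
    const uint64_t actual = server_alloc_size(guess);
    std::printf("client guess : %llu bytes\n", (unsigned long long) guess);
    std::printf("server answer: %llu bytes\n", (unsigned long long) actual);
    // if buffers are sized with 'guess' while the backend needs 'actual',
    // tensor data ends up misplaced -> incorrect inference results
    return 0;
}
```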

@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and Apple Metal labels Nov 9, 2025
@ggerganov (Member, Author)

@jukofyork I think you were doing some experiments with RPC recently - could you check whether this change significantly affects performance in your RPC use cases?

@rgerganov (Collaborator) left a comment

Looking at the debug logs on the server side, I believe this change will only increase TTFT and will not affect PP and TG speeds. Do I get this right?

```diff
@@ -8,4 +7,4 @@
#endif

#define RPC_PROTO_MAJOR_VERSION 3
```
Inline comment on the hunk above (Collaborator):

let's bump the protocol version as this is a breaking change

@ggerganov (Member, Author)

> Looking at the debug logs on the server side, I believe this change will only increase TTFT and will not affect PP and TG speeds. Do I get this right?

Hm, maybe that's correct. Likely because of the graph reuse logic. If you disable the graph reuse, do you see more RPC calls?

```sh
LLAMA_GRAPH_REUSE_DISABLE=1 llama-cli ...
```

@rgerganov (Collaborator)

>> Looking at the debug logs on the server side, I believe this change will only increase TTFT and will not affect PP and TG speeds. Do I get this right?
>
> Hm, maybe that's correct. Likely because of the graph reuse logic. If you disable the graph reuse, do you see more RPC calls?
>
> `LLAMA_GRAPH_REUSE_DISABLE=1 llama-cli ...`

Yes, disabling graph reuse results in a lot of RPC_CMD_GET_ALLOC_SIZE calls for each graph computation.

@ggerganov (Member, Author)

Great, I didn't realize until now that graph reuse saves us almost all of the extra RPC calls. That probably makes caching the RPC responses almost redundant, since I don't think this will have a significant impact on PP.

@ggerganov ggerganov marked this pull request as ready for review November 10, 2025 19:34
@ggerganov ggerganov requested a review from slaren as a code owner November 10, 2025 19:34
@ggerganov (Member, Author)

Some benchmarks would be useful to confirm that the performance is not significantly affected. Other than that, I think it should be good to merge.

@jukofyork (Collaborator)

> @jukofyork I think you were doing some experiments with RPC recently - could you check whether this change significantly affects performance in your RPC use cases?

No problem, but it will likely be Thursday before I can run any tests.

@slaren (Member) commented Nov 11, 2025

I think eventually the proper solution to this and other per-tensor calls such as init_tensor will be to batch all the tensors into a single call. For example, ggml-alloc could be modified to build a list of tensors and obtain all the tensor alloc sizes first, and use that data rather than calling get_alloc_size every time. This way, instead of thousands of calls (each with a network roundtrip), it would only require one. The RPC backend could also cache the results so that calls with identical tensors don't require the network at all.
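
To make the caching half of this concrete, here is a minimal sketch (illustrative types and names, not existing llama.cpp code): the reply to a get_alloc_size query is assumed here to be determined by the tensor's type and shape (ignoring views/strides for simplicity), so the RPC backend could memoize replies keyed on those properties and skip the network entirely for tensors it has already seen; a batched call would then only be needed for the keys that miss the cache.

```cpp
// alloc_size_cache.cpp -- illustrative sketch, not existing llama.cpp code.
#include <array>
#include <cstdint>
#include <cstdio>
#include <map>
#include <tuple>

// the properties assumed here to determine a tensor's allocation size
struct alloc_key {
    int                    type; // tensor data type
    std::array<int64_t, 4> ne;   // tensor dimensions
    bool operator<(const alloc_key & o) const {
        return std::tie(type, ne) < std::tie(o.type, o.ne);
    }
};

// stand-in for the real network round trip (serialize the tensor, send
// RPC_CMD_GET_ALLOC_SIZE, read the server's reply)
static uint64_t get_alloc_size_remote(const alloc_key & key) {
    uint64_t nbytes = 2; // pretend 2-byte elements
    for (int64_t d : key.ne) nbytes *= (uint64_t) d;
    return (nbytes + 255) / 256 * 256; // pretend the backend pads to 256 bytes
}

// memoize replies so identical tensors never hit the network twice
static uint64_t get_alloc_size_cached(const alloc_key & key) {
    static std::map<alloc_key, uint64_t> cache;
    auto it = cache.find(key);
    if (it != cache.end()) {
        return it->second; // cache hit: no RPC call
    }
    const uint64_t size = get_alloc_size_remote(key); // one round trip
    cache.emplace(key, size);
    return size;
}

int main() {
    const alloc_key k = { /*type=*/1, { 4096, 4096, 1, 1 } };
    std::printf("first call : %llu bytes (goes to the server)\n", (unsigned long long) get_alloc_size_cached(k));
    std::printf("second call: %llu bytes (served from cache)\n", (unsigned long long) get_alloc_size_cached(k));
    return 0;
}
```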

@ggerganov force-pushed the gg/rpc-fix-alloc-size branch from 4953693 to 5c6f2a8 on December 3, 2025 11:14
@ggerganov (Member, Author)

@jukofyork @eugr Do you want to do some perf tests with this branch before merging to verify there is no significant performance hit?

@jukofyork (Collaborator)

> @jukofyork @eugr Do you want to do some perf tests with this branch before merging to verify there is no significant performance hit?

Sorry, I forgot all about this - I'll try and test it this afternoon and report back.

@jukofyork (Collaborator)

> @jukofyork @eugr Do you want to do some perf tests with this branch before merging to verify there is no significant performance hit?

Using 3 nodes, each with 2x A6000, running:

```sh
./llama-bench \
        --rpc "192.168.1.2:50052,192.168.1.3:50052" \
        --model ~/models/gguf/GLM-4.6-Q5_X.gguf \
        --n-gpu-layers 99 \
        --flash-attn 1 \
        --n-prompt 512 \
        --n-gen 128,256,512
```

Master

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | pp512 | 364.66 ± 0.90 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | tg128 | 22.01 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | tg256 | 21.98 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | tg512 | 21.69 ± 0.02 |

PR

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | pp512 | 362.57 ± 1.68 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | tg128 | 21.94 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | tg256 | 21.90 ± 0.00 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | tg512 | 21.47 ± 0.06 |

@eugr commented Dec 4, 2025

@ggerganov - sorry for the delay, I've been trying to get some work done :)
Looks like I'm getting a significant performance hit on my Sparks:

Before patch:

```sh
build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf --rpc 192.168.177.12:15001 -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
```

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | pp2048 | 1745.82 ± 16.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | tg32 | 51.07 ± 0.57 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | pp2048 @ d4096 | 1600.32 ± 14.38 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | tg32 @ d4096 | 48.40 ± 0.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | pp2048 @ d8192 | 1465.80 ± 13.11 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | tg32 @ d8192 | 46.95 ± 1.02 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | pp2048 @ d16384 | 1237.04 ± 12.66 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | tg32 @ d16384 | 43.60 ± 0.41 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | pp2048 @ d32768 | 941.25 ± 10.39 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | tg32 @ d32768 | 37.17 ± 0.52 |

build: bde188d (7275)

Applying the patch:

```sh
curl -L https://patch-diff.githubusercontent.com/raw/ggml-org/llama.cpp/pull/17116.diff | git apply
```

After patch:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | pp2048 | 1768.18 ± 11.56 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | tg32 | 47.47 ± 1.31 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | pp2048 @ d4096 | 1610.21 ± 53.17 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | tg32 @ d4096 | 44.53 ± 0.97 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | pp2048 @ d8192 | 1463.13 ± 16.52 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | tg32 @ d8192 | 43.12 ± 2.64 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | pp2048 @ d16384 | 1233.73 ± 27.81 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | tg32 @ d16384 | 41.26 ± 2.38 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | pp2048 @ d32768 | 928.27 ± 6.69 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA,RPC | tg32 @ d32768 | 34.99 ± 1.01 |

build: bde188d (7275)


Labels

Apple Metal, ggml


Development

Successfully merging this pull request may close these issues.

Eval bug: Incorrect outputs when running inference in multiple nodes (Mac)

6 participants