Conversation

@rgerganov
Collaborator

Store compute graphs on the server side and reuse them when possible. Compute graphs are kept in a ring buffer with fixed size, so we can avoid serializing and deserializing the same graph every time. Add two new commands:

  • RPC_CMD_GRAPH_COMPUTE_AND_STORE -- store the graph, compute it and return its ID
  • RPC_CMD_GRAPH_RECOMPUTE -- recompute the graph with the given ID

Currently there is no good way to associate an ID with ggml_cgraph, so we abuse tensor->extra of the first node for this purpose.
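
For illustration, here is a minimal sketch of the server-side idea described above (a fixed-size ring buffer of deserialized graphs keyed by an ID); the names, ring size and exact bookkeeping are assumptions, not the actual implementation:

#include "ggml.h"
#include <array>
#include <cstddef>
#include <cstdint>

// illustrative sketch only, not the actual ggml-rpc code
struct stored_graph {
    uint64_t      id    = 0;       // 0 marks an empty slot
    ggml_cgraph * graph = nullptr; // deserialized graph kept alive by the server
};

static constexpr size_t GRAPH_RING_SIZE = 16; // assumed fixed size

struct graph_ring {
    std::array<stored_graph, GRAPH_RING_SIZE> slots{};
    uint64_t next_id = 1;

    // GRAPH_COMPUTE_AND_STORE path: remember the graph and hand back its ID
    uint64_t store(ggml_cgraph * graph) {
        stored_graph & slot = slots[next_id % GRAPH_RING_SIZE];
        slot.id    = next_id;
        slot.graph = graph;
        return next_id++;
    }

    // GRAPH_RECOMPUTE path: look up a stored graph; nullptr if it has been overwritten
    ggml_cgraph * find(uint64_t id) const {
        const stored_graph & slot = slots[id % GRAPH_RING_SIZE];
        return slot.id == id ? slot.graph : nullptr;
    }
};

The client would stash the returned ID in tensor->extra of the graph's first node, so the next compute call can issue RPC_CMD_GRAPH_RECOMPUTE instead of re-serializing and re-sending the whole graph.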

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Aug 18, 2025
@rgerganov
Collaborator Author

The time spent on serializing/deserializing the graph is quite small compared to the graph compute time. I have measured ~1% tg improvement on a 1 Gbps connection:

master

| model | size | params | backend | ngl | test | t/s |
| ------------- | -------: | -------: | -------- | --: | ----: | ------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | pp512 | 116.93 ± 0.34 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg128 | 13.62 ± 0.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg256 | 13.53 ± 0.02 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg512 | 13.27 ± 0.01 |

PR

| model | size | params | backend | ngl | test | t/s |
| ------------- | -------: | -------: | -------- | --: | ----: | ------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | pp512 | 116.79 ± 0.28 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg128 | 13.76 ± 0.06 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg256 | 13.66 ± 0.03 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg512 | 13.37 ± 0.02 |

Not sure if it's worth the complexity we are adding here ...

@jukofyork
Collaborator

When I'm running over RPC, I'm noticing the amount of data sent from the main host to the RPC servers is very asymmetric during token generation, eg:

  • Something like 200MB/s getting sent to each RPC server.
  • Something like 5MB/s getting sent from each RPC server.

Is this the compute graph getting sent (which I can only think must include all the sin/cos values for the positional encodings, or something similar, to have this large a disparity?) and the hidden state(s) getting sent back?

@rgerganov
Collaborator Author

You can set GGML_RPC_DEBUG=1 and then rpc-server will print the commands which are being executed. From the debug logs you can see how much data is being sent and received. For example when using gemma-3-1b-it-q4_0.gguf with a single rpc-server, these are the commands which are used for generating one token:

...
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf17ea040, offset: 0, size: 4608
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1a2a040, offset: 0, size: 4
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1a2a840, offset: 0, size: 8
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1a2b840, offset: 0, size: 8
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1a2c840, offset: 0, size: 65536
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1d2c840, offset: 0, size: 8
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1d2d840, offset: 0, size: 8
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1d2e840, offset: 0, size: 65536
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf252e840, offset: 0, size: 4
[graph_compute] device: 0, n_nodes: 1048, n_tensors: 1449
[get_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1a2a040, offset: 0, size: 1048576
...

For set_tensor you can see how many bytes are received by the server.
For graph_compute, the received data in bytes is n_nodes*8 + n_tensors*296.
For get_tensor you can see how many bytes are sent by the server.
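
If it helps to sanity-check the numbers, the graph_compute payload from the formula above can be computed directly; the 296 bytes per serialized tensor is taken from the comment above rather than re-derived here:

#include <cstdint>

// request payload size for graph_compute, per the formula quoted above:
// 8 bytes per node index plus 296 bytes per serialized tensor
static uint64_t graph_compute_request_size(uint64_t n_nodes, uint64_t n_tensors) {
    return n_nodes * 8 + n_tensors * 296;
}

For the log above, graph_compute_request_size(1048, 1449) == 437288 bytes, i.e. roughly 427 KiB per token on top of the set_tensor/get_tensor traffic.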

@jukofyork
Collaborator

You can set GGML_RPC_DEBUG=1 and then rpc-server will print the commands which are being executed.

Thanks! I'll give this a try and see if I can see what is getting sent.

@slaren
Member

slaren commented Oct 24, 2025

Running the client with GGML_SCHED_DEBUG=1 could also give you an idea of what tensors are being transferred to each server when evaluating a graph (these are the "inputs" of each split). Unfortunately, many tensors don't have very informative names at the moment, but the big one is likely to be the attention mask.

@jukofyork
Collaborator

jukofyork commented Oct 24, 2025

This is what I'm getting for the 2 RPC servers (each with 2 GPUs):

[set_tensor] buffer: 0x562f5d724340, data: 0x7f4302000000, offset: 0, size: 122880
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430201e000, offset: 0, size: 24
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430201e080, offset: 0, size: 48
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430201e100, offset: 0, size: 48
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430201e180, offset: 0, size: 32768
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x562f5d724340, dst->buffer: 0x562f5d162200
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c601e000, offset: 0, size: 24
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c601e080, offset: 0, size: 48
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c601e100, offset: 0, size: 48
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c601e180, offset: 0, size: 32768
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x562f5d162200, data: 0x7f42c603c000, offset: 0, size: 122880
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a000000, offset: 0, size: 122880
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a01e000, offset: 0, size: 24
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a01e080, offset: 0, size: 48
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a01e100, offset: 0, size: 48
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a01e180, offset: 0, size: 32768
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x561d64889540, dst->buffer: 0x561d642c72c0
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e01e000, offset: 0, size: 24
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e01e080, offset: 0, size: 48
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e01e100, offset: 0, size: 48
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e01e180, offset: 0, size: 32768
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e03c000, offset: 0, size: 122880

1080*8 + 1400*296 = 423040

So nothing huge here for 1 token.


But this is what I'm seeing moved over the network whilst generating:

[image: network traffic during token generation]

and here are the last couple of tokens generated:

[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x562f5d162200, data: 0x7f42c60be000, offset: 0, size: 61440
[set_tensor] buffer: 0x562f5d724340, data: 0x7f4302000000, offset: 0, size: 20480
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f000, offset: 0, size: 4
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f080, offset: 0, size: 8
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f180, offset: 0, size: 8
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f280, offset: 0, size: 131072
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x562f5d724340, dst->buffer: 0x562f5d162200
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f000, offset: 0, size: 4
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f080, offset: 0, size: 8
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f180, offset: 0, size: 8
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f280, offset: 0, size: 131072
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x562f5d162200, data: 0x7f42c60be000, offset: 0, size: 20480
[set_tensor] buffer: 0x562f5d724340, data: 0x7f4302000000, offset: 0, size: 143360
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f000, offset: 0, size: 28
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f080, offset: 0, size: 56
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f180, offset: 0, size: 56
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f280, offset: 0, size: 131072
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x562f5d724340, dst->buffer: 0x562f5d162200
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f000, offset: 0, size: 28
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f080, offset: 0, size: 56
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f180, offset: 0, size: 56
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f280, offset: 0, size: 131072
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x562f5d162200, data: 0x7f42c60be000, offset: 0, size: 143360
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e0be000, offset: 0, size: 61440
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a000000, offset: 0, size: 20480
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f000, offset: 0, size: 4
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f080, offset: 0, size: 8
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f180, offset: 0, size: 8
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f280, offset: 0, size: 131072
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x561d64889540, dst->buffer: 0x561d642c72c0
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f000, offset: 0, size: 4
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f080, offset: 0, size: 8
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f180, offset: 0, size: 8
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f280, offset: 0, size: 131072
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e0be000, offset: 0, size: 20480
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a000000, offset: 0, size: 143360
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f000, offset: 0, size: 28
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f080, offset: 0, size: 56
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f180, offset: 0, size: 56
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f280, offset: 0, size: 131072
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x561d64889540, dst->buffer: 0x561d642c72c0
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f000, offset: 0, size: 28
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f080, offset: 0, size: 56
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f180, offset: 0, size: 56
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f280, offset: 0, size: 131072
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e0be000, offset: 0, size: 143360

Is there anything here that can explain the huge discrepancy between what the main host is sending to the 2 RPC servers and what they are sending back?

The amount of data they are sending back looks to be in line with what you would expect for the hidden states.

@jukofyork
Collaborator

jukofyork commented Oct 24, 2025

I think I've deciphered this now (it was confusing because I was also running speculative decoding!).

For graph_compute, the received data in bytes is n_nodes*8 + n_tensors*296

[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400

So this is 1080*8 + 1400*296 = 423040.


[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a000000, offset: 0, size: 20480

This is the hidden state stored as float32 (i.e. 20480/4 = 5120 values).


[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f280, offset: 0, size: 131072

This appears to be 2x the context length, so I assume it is something holding position encodings.


So overall we are sending 423040 + 131072 + 20480 = 574592 bytes and receiving 20480 bytes back here:

[get_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e0be000, offset: 0, size: 20480

which is a 28:1 ratio and in line with the measurements.

@rgerganov
Collaborator Author

The size of the compute graph is substantial in your case (423040 bytes), so reusing the graph (which is what this PR is about) could bring a noticeable improvement for you. I will rebase this PR so you can test it in your environment.

@jukofyork
Collaborator

The size of the compute graph is substantial in your case (423040 bytes), so reusing the graph (which is what this PR is about) could bring a noticeable improvement for you. I will rebase this PR so you can test it in your environment.

Thanks!

@rgerganov
Collaborator Author

@jukofyork I have rebased the PR and I would appreciate it if you could test it in your environment. With this patch the compute graph is stored on the server side and there is no need to send it for every token.

Note that you need to rebuild both client and server parts.

@jukofyork
Collaborator

@jukofyork I have rebased the PR and I would appreciate it if you could test it in your environment. With this patch the compute graph is stored on the server side and there is no need to send it for every token.

Note that you need to rebuild both client and server parts.

Thanks! I'm away for a couple of days, so it will be Thursday/Friday before I can boot up the machines to test it.

@jukofyork
Collaborator

This makes quite a big difference for me:

master

| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| ---------------------- | ---------: | -------: | -------- | --: | -: | ------------------------------- | ----------------------------------- | ----: | ------------: |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | pp512 | 318.55 ± 0.86 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg128 | 15.64 ± 0.00 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg256 | 15.64 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg512 | 15.54 ± 0.01 |

PR

| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| ---------------------- | ---------: | -------: | -------- | --: | -: | ------------------------------- | ----------------------------------- | ----: | ------------: |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | pp512 | 320.18 ± 0.99 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg128 | 19.47 ± 0.02 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg256 | 19.48 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg512 | 19.30 ± 0.01 |

(19.47 + 19.48 + 19.30) / (15.64 + 15.64 + 15.54) = +24.5% tg improvement.

@jukofyork
Collaborator

[image]

This also changes the relative costs: batch_size = 1 and batch_size = 2 now both generate ~19.5 tokens/s (not really relevant for this PR, but it will make more sense when I get time to explain my new "profile guided speculative decoding" code that uses these values...).

@rgerganov rgerganov marked this pull request as ready for review October 28, 2025 17:19
@rgerganov
Collaborator Author

@jukofyork this is great news. I didn't have a chance to test this patch with larger graphs, so thanks for confirming it makes a significant improvement.

@slaren Is there a better way to associate an ID with ggml_cgraph?

@jukofyork
Collaborator

jukofyork commented Oct 30, 2025

@rgerganov @slaren

I've been looking at the RPC code and wonder if there might be a way to simplify/unify all this and remove the RPC_CMD_SET_TENSOR_HASH command, etc.

It's pretty cheap to hash data using #16753 (and with a few hours' work I can probably decipher the 128-bit version and get an extra 1.5x performance for SSE2...), so what if we removed the existing tensor-specific hashing and instead did it in the send_rpc_cmd code:

// RPC request : | rpc_cmd (1 byte) | request_size (8 bytes) | request_data (request_size bytes) |
// No response
static bool send_rpc_cmd(const std::shared_ptr<socket_t> & sock, enum rpc_cmd cmd, const void * input, size_t input_size) {
    uint8_t cmd_byte = cmd;
    if (!send_data(sock->fd, &cmd_byte, sizeof(cmd_byte))) {
        return false;
    }
    if (!send_data(sock->fd, &input_size, sizeof(input_size))) {
        return false;
    }
    if (!send_data(sock->fd, input, input_size)) {
        return false;
    }
    return true;
}

// RPC request : | rpc_cmd (1 byte) | request_size (8 bytes) | request_data (request_size bytes) |
// RPC response: | response_size (8 bytes) | response_data (response_size bytes) |
static bool send_rpc_cmd(const std::shared_ptr<socket_t> & sock, enum rpc_cmd cmd, const void * input, size_t input_size, void * output, size_t output_size) {
    if (!send_rpc_cmd(sock, cmd, input, input_size)) {
        return false;
    }
    // TODO: currently the output_size is always known, do we need support for commands with variable output size?
    // even if we do, we can skip sending output_size from the server for commands with known output size
    uint64_t out_size;
    if (!recv_data(sock->fd, &out_size, sizeof(out_size))) {
        return false;
    }
    if (out_size != output_size) {
        return false;
    }
    if (!recv_data(sock->fd, output, output_size)) {
        return false;
    }
    return true;
}

So:

The RPC servers would still have the (optional) non-volatile cache for large transfers, and the code above would then need to negotiate these transfers; but since these are the large tensors, the added latency would be insignificant, and this is essentially what the RPC_CMD_SET_TENSOR_HASH command is doing anyway; we just move the logic here. As with the current implementation, there would be some host-side-only heuristic(s) applied (minimum size, etc).

But we would also introduce a volatile LRU cache (a rough sketch of the host-side bookkeeping follows this list):

  • It could either cache all data sent/received, or use similar heuristic(s) to the non-volatile cache, but the heuristic(s) would need to be applied both host-side and server-side.
  • The host wouldn't actually need to store the data itself; its LRU cache would just need to store the hashes.
  • As the host knows the LRU cache size and ejection policy the RPC servers will be using, it should be able to keep track of exactly what they have stored in their LRU caches, so there would be no additional latency nor any negotiation needed (ie: it would just be the same as above but with the hash in place of the actual data).
  • The LRU cache would need to be large enough that important data like the compute graph and position embeddings (if that is what that other tensor is that I see getting sent?) don't get ejected by the batches of hidden state data during PP (or some additional heuristic applied to avoid storing hidden state data at all...).
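
As a rough sketch of the host-side bookkeeping only: the host tracks hashes in a small LRU structure that mirrors the server's ejection policy, so it knows whether a payload can be replaced by its hash. The class name, capacity handling and API below are made up for illustration, not part of the existing RPC code:

#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

// hash-only LRU set: the host mirrors the server's cache state without keeping the data
class lru_hash_set {
public:
    explicit lru_hash_set(size_t capacity) : capacity_(capacity) {}

    // returns true if the hash is already cached (send only the hash);
    // returns false if it is new (send the full payload and remember the hash)
    bool check_and_insert(uint64_t hash) {
        auto it = index_.find(hash);
        if (it != index_.end()) {
            order_.splice(order_.begin(), order_, it->second); // mark as most recently used
            return true;
        }
        order_.push_front(hash);
        index_[hash] = order_.begin();
        if (index_.size() > capacity_) { // eject the least recently used hash
            index_.erase(order_.back());
            order_.pop_back();
        }
        return false;
    }

private:
    size_t capacity_;
    std::list<uint64_t> order_;                                          // most recent first
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> index_;  // hash -> position
};

The server would run the same structure but keep the payload alongside each hash, so a hit on the host side means the data can be reconstructed without being re-sent.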

This seems to have lots of benefits.

The only thing I'm not so sure about is whether we are ever sending any pointers (or non-static node indices, etc) over the RPC channel. If so, that will make things much harder and probably a lot more work (but not impossible, eg: using the ideas from https://theboostcpplibraries.com/boost.serialization-pointers-and-references).

@jukofyork
Collaborator

I made a lot of ninja edits just now, so if you're reading from the email notification, please see the comment instead!

@rgerganov
Collaborator Author

I am not sure I understand your proposal. As of now, there are only two commands that benefit from caching -- SET_TENSOR and GRAPH_COMPUTE. Why do we need to modify send_rpc_cmd(), which is used for all RPC commands, and how does this make anything simpler?

Also note that my implementation is reusing compute graphs without serializing them first, so we don't waste CPU time for serialization and hashing.

@slaren
Member

slaren commented Oct 30, 2025

@slaren Is there a better way to associate an ID with ggml_cgraph?

I don't think there is a good way to do this at the moment. This implementation is not reliable either. To do this in a reliable way, you would need to check every node of the graph to determine if it is changed, similar to what the CUDA backend does to implement CUDA graphs. In the future with the graph plan API this may be unnecessary.
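
For reference, a rough sketch of what such a per-node check could look like, using the public ggml graph accessors; this is only loosely modelled on the idea and is not the CUDA backend's actual logic:

#include "ggml.h"
#include <cstring>

// returns true if the new graph matches the previous one node-by-node:
// same node count, and per node the same op, op params, shape and source tensors
static bool cgraph_matches_previous(ggml_cgraph * cur, ggml_cgraph * prev) {
    if (prev == nullptr || ggml_graph_n_nodes(cur) != ggml_graph_n_nodes(prev)) {
        return false;
    }
    for (int i = 0; i < ggml_graph_n_nodes(cur); i++) {
        const ggml_tensor * a = ggml_graph_node(cur,  i);
        const ggml_tensor * b = ggml_graph_node(prev, i);
        if (a->op != b->op ||
            std::memcmp(a->ne, b->ne, sizeof(a->ne)) != 0 ||
            std::memcmp(a->op_params, b->op_params, sizeof(a->op_params)) != 0) {
            return false;
        }
        for (int s = 0; s < GGML_MAX_SRC; s++) {
            if (a->src[s] != b->src[s]) { // same source tensor objects
                return false;
            }
        }
    }
    return true;
}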

    if (!send_data(sock->fd, &cmd_byte, sizeof(cmd_byte))) {
        return false;
    }
    if (!send_data(sock->fd, &input_size, sizeof(input_size))) {
        return false;
    }
    if (!send_data(sock->fd, input, input_size)) {
        return false;
    }

With TCP_NODELAY, each of these calls may be generating a new packet, which may add to the network latency. It may be worth building the entire packet first in a buffer and sending it with a single call to send.

@jukofyork
Collaborator

    if (!send_data(sock->fd, &cmd_byte, sizeof(cmd_byte))) {
        return false;
    }
    if (!send_data(sock->fd, &input_size, sizeof(input_size))) {
        return false;
    }
    if (!send_data(sock->fd, input, input_size)) {
        return false;
    }

With TCP_NODELAY, each of these calls may be generating a new packet, which may add to the network latency. It may be worth building the entire packet first in a buffer and sending it with a single call to send.

I tried this a few days ago but it didn't make any difference for me. Not sure if it's the extra memcpy required to place the 9 bytes before the data, or if it's just that my network stack is still buffering small packets behind the scenes with TCP_NODELAY turned off.

@jukofyork
Collaborator

I am not sure I understand your proposal. As of now, there are only two commands that benefit from caching -- SET_TENSOR and GRAPH_COMPUTE. Why do we need to modify send_rpc_cmd(), which is used for all RPC commands, and how does this make anything simpler?

Also note that my implementation is reusing compute graphs without serializing them first, so we don't waste CPU time for serialization and hashing.

I've got to go out, but I will try and give a more detailed example of what I mean when I get back in. The basic idea is just to extract all the "communication channel" stuff (requiring tunable heuristics, etc) away from the clean RPC interface. send_rpc_cmd is just the point where the heuristics will have the most helpful data to make their decisions, but it could equivalently be done in the lower-level functions as well (although I think it would not be as easy to design the heuristics without the actual command name, etc).

eg: the simple >10MB hashing heuristic is already failing for me where I am sending huge batches; I've tried everything up to 100MB but am still getting some batches hashed and then saved.

@jukofyork
Collaborator

I tried this a few days ago but it didn't make any difference for me. Not sure if it's the extra memcpy required to place the 9 bytes before the data, or if it's just that my network stack is still buffering small packets behind the scenes with TCP_NODELAY turned off.

Here's what I tried for both send_msg and send_rpc_cmd, but it didn't seem to help me at all:

5ce87d1

static bool send_msg(sockfd_t sockfd, const void * msg, size_t msg_size) {
    const size_t header_size = sizeof(msg_size);
    std::vector<uint8_t> buf;
    buf.resize(header_size + msg_size);

    // header
    memcpy(buf.data(), &msg_size, sizeof(msg_size));

    // payload
    if (msg_size > 0) {
        memcpy(buf.data() + header_size, msg, msg_size);
    }

    // single send
    return send_data(sockfd, buf.data(), buf.size());
}
static bool send_rpc_cmd(const std::shared_ptr<socket_t> & sock, enum rpc_cmd cmd, const void * input, size_t input_size) {
    const size_t header_size = 1 + sizeof(input_size);
    std::vector<uint8_t> buf;
    buf.resize(header_size + input_size);

    // header
    buf[0] = static_cast<uint8_t>(cmd);
    memcpy(buf.data() + 1, &input_size, sizeof(input_size));

    // payload
    if (input_size > 0) {
        memcpy(buf.data() + header_size, input, input_size);
    }

    // single send (send_data may still chunk very large buffers, which is fine)
    return send_data(sock->fd, buf.data(), buf.size());
}

@jukofyork
Collaborator

Here's what I tried for both send_msg and send_rpc_cmd, but it didn't seem to help me at all:

5ce87d1

Actually this now seems to be giving me quite a boost! With this PR and that branch merged, I'm now getting this:

| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| ---------------------- | ---------: | -------: | -------- | --: | -: | ------------------------------- | ----------------------------------- | ----: | ------------: |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | pp512 | 354.63 ± 1.07 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg128 | 22.70 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg256 | 22.41 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg512 | 22.07 ± 0.05 |

compared to this the other day:

This makes quite a big difference for me:

master

| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| ---------------------- | ---------: | -------: | -------- | --: | -: | ------------------------------- | ----------------------------------- | ----: | ------------: |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | pp512 | 318.55 ± 0.86 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg128 | 15.64 ± 0.00 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg256 | 15.64 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg512 | 15.54 ± 0.01 |

PR

| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| ---------------------- | ---------: | -------: | -------- | --: | -: | ------------------------------- | ----------------------------------- | ----: | ------------: |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | pp512 | 320.18 ± 0.99 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg128 | 19.47 ± 0.02 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg256 | 19.48 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg512 | 19.30 ± 0.01 |

(19.47 + 19.48 + 19.30) / (15.64 + 15.64 + 15.54) = +24.5% tg improvement.

It may be something else that has changed, so will check next week to be sure, but it seems odd I got around 15 tokens/s when I tested this branch a couple of days ago.

@jukofyork
Collaborator

jukofyork commented Oct 31, 2025

I made a draft PR #16892 to see if this makes any difference to others, or if it only helps with this specific PR, etc.

@jukofyork
Collaborator

jukofyork commented Oct 31, 2025

I also had a go at writing a (very!) hacky proof of concept for the volatile hash cache last night:

// Try to use the volatile cache when the data size falls between these thresholds
const size_t MIN_CACHE_THRESHOLD = 20 * 1024;
const size_t MAX_CACHE_THRESHOLD = 1024 * 1024;

static bool send_data(sockfd_t sockfd, const void * data, size_t size) {
    static std::unordered_set<uint64_t> sent_hashes;

    if (size > MIN_CACHE_THRESHOLD && size < MAX_CACHE_THRESHOLD) {
        uint64_t hash = generate_hash((const uint8_t*)data, size);
        bool is_new = sent_hashes.find(hash) == sent_hashes.end();

        uint8_t flag = is_new ? 1 : 0;
        if (send(sockfd, (const char*)&flag, sizeof(flag), 0) != sizeof(flag)) {
            return false;
        }
        if (send(sockfd, (const char*)&hash, sizeof(hash), 0) != sizeof(hash)) {
            return false;
        }

        if (!is_new) {
            return true;
        }
        sent_hashes.insert(hash);
    }

    size_t bytes_sent = 0;
    while (bytes_sent < size) {
        size_t size_to_send = std::min(size - bytes_sent, MAX_CHUNK_SIZE);
        ssize_t n = send(sockfd, (const char *)data + bytes_sent, size_to_send, 0);
        if (n < 0) {
            GGML_LOG_ERROR("send failed (bytes_sent=%zu, size_to_send=%zu)\n",
                           bytes_sent, size_to_send);
            return false;
        }
        bytes_sent += (size_t)n;
    }
    return true;
}

static bool recv_data(sockfd_t sockfd, void * data, size_t size) {
    static std::unordered_map<uint64_t, std::vector<uint8_t>> recv_cache;

    uint64_t hash = 0;

    if (size > MIN_CACHE_THRESHOLD && size < MAX_CACHE_THRESHOLD) {
        uint8_t flag;
        if (recv(sockfd, (char*)&flag, sizeof(flag), 0) != sizeof(flag)) {
            return false;
        }
        if (recv(sockfd, (char*)&hash, sizeof(hash), 0) != sizeof(hash)) {
            return false;
        }

        if (flag == 0) {
            auto it = recv_cache.find(hash);
            if (it != recv_cache.end()) {
                memcpy(data, it->second.data(), size);
                return true;
            }
            return false;
        }
    }

    size_t bytes_recv = 0;
    while (bytes_recv < size) {
        size_t size_to_recv = std::min(size - bytes_recv, MAX_CHUNK_SIZE);
        ssize_t n = recv(sockfd, (char *)data + bytes_recv, size_to_recv, 0);
        if (n < 0) {
            GGML_LOG_ERROR("recv failed (bytes_recv=%zu, size_to_recv=%zu)\n",
                           bytes_recv, size_to_recv);
            return false;
        }
        if (n == 0) {
            LOG_DBG("recv returned 0 (peer closed?)\n");
            return false;
        }
        bytes_recv += (size_t)n;
    }

    if (size > MIN_CACHE_THRESHOLD && size < MAX_CACHE_THRESHOLD) {
        recv_cache[hash] = std::vector<uint8_t>((uint8_t*)data, (uint8_t*)data + size);
    }

    return true;
}

It did give a small improvement (from 15.5 tokens/s to 17.5 tokens/s), but either the hashing still has a lot of overhead or the repeated serialisation of the compute graph as mentioned by @rgerganov is very costly, as with this PR I get 19.5 tokens/s in comparison! ☹️

I then destroyed the branch trying to hack in the code to join the packets after the push linked above, so will have to take another look at this next week and likely start fresh.

I did find that the other tensor that was exactly 2x the context length (which I suspected was something to do with the position embeddings) is the same each time and could also be cached (this finally produced a balanced flow over the network of about 4-5MB/s, which is purely the hidden state data).

@jukofyork
Collaborator

jukofyork commented Oct 31, 2025

@slaren Is there a better way to associate an ID with ggml_cgraph?

I don't think there is a good way to do this at the moment. This implementation is not reliable either. To do this in a reliable way, you would need to check every node of the graph to determine if it is changed, similar to what the CUDA backend does to implement CUDA graphs. In the future with the graph plan API this may be unnecessary.

It's a pity, as this PR does seem to help a lot (maybe mainly for large MoE models, though?).

From my hacky experiments last night, it definitely looks like @rgerganov is correct that the serialisation + hashing (even using xxhash64) is going to be worse than this PR (my only hope was that it was the multiple packets, but after reading @rgerganov's post on the other PR, I'm not hopeful).

By "visiting every node", does that mean the data of every node needs comparing (as in the full ~450 KB of data for my example), or is it just a set of "node IDs" or similar that need comparing?

@kyuz0

kyuz0 commented Nov 10, 2025

On my setup (two AMD Strix Halo hosts), this PR improved prompt processing by ~5.5%:

  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| minimax-m2 230B.A10B Q6_K      | 180.94 GiB |   228.69 B | ROCm,RPC   |  99 |  1 |    0 |           pp512 |        175.56 ± 1.40 |
| minimax-m2 230B.A10B Q6_K      | 180.94 GiB |   228.69 B | ROCm,RPC   |  99 |  1 |    0 |           tg128 |         17.50 ± 0.08 |

Right now I only have a screenshot of the performance without this PR:

[screenshot]

Happy to do more testing. My setup currently uses USB-to-USB at 10Gbps and I'm trying to see if I can get to 20-40Gbps.

@kyuz0

kyuz0 commented Nov 10, 2025

The improvement was there but minimal for GLM 4.6:

llama-bench -mmp 0 -fa 1 -m GLM-4.6-UD-Q4_K_XL-00001-of-00005.gguf --rpc "192.168.2.2:50052"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| glm4moe 355B.A32B Q4_K - Medium | 189.69 GiB |   356.79 B | ROCm,RPC   |  99 |  1 |    0 |           pp512 |         34.18 ± 0.05 |
| glm4moe 355B.A32B Q4_K - Medium | 189.69 GiB |   356.79 B | ROCm,RPC   |  99 |  1 |    0 |           tg128 |          8.81 ± 0.01 |

@rgerganov
Collaborator Author

@kyuz0 thanks for taking the time to test this on your setup.

I will try to rework this with the approach suggested by @slaren, using the CUDA backend as a reference.

@ggerganov
Member

I wonder if you can simply reuse this approach that I prototyped some time ago for the Metal backend to determine when a graph is the same as the previous one:

https://github.com/ggml-org/llama.cpp/pull/14570/files#diff-4045ccc2edf36af53c4a28fca4a6a24f628b3c9bee7ccb43b06e55dc5e5e199eR5482-R5514

At the time, I thought it was hacky, but I don't actually see a failure case - it simply does a memcmp of the entire graph data at the byte level.

It didn't end up being used because it didn't lead to any improvements for Metal (#14570). But here it could be a reasonable stopgap until the graph plan API lands.
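
A minimal sketch of that byte-level comparison on the client side, assuming the graph has already been serialized into a contiguous buffer before sending (names here are illustrative, not the prototype linked above):

#include <cstdint>
#include <cstring>
#include <vector>

// returns true if the newly serialized graph is byte-identical to the previous one;
// otherwise remembers it for the next comparison
static bool serialized_graph_unchanged(const std::vector<uint8_t> & cur,
                                       std::vector<uint8_t>       & prev) {
    if (!cur.empty() && cur.size() == prev.size() &&
        std::memcmp(cur.data(), prev.data(), cur.size()) == 0) {
        return true;  // safe to ask the server to reuse the stored graph
    }
    prev = cur;       // graph changed: send it again and store the new copy
    return false;
}

Note that this still serializes the graph on the client every time, whereas the reuse path in this PR skips serialization entirely when a stored graph is recomputed, so the two would need to be weighed against each other.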
