Conversation

@rgerganov
Collaborator

Store compute graphs on the server side and reuse them when possible. Compute graphs are kept in a ring buffer with fixed size, so we can avoid serializing and deserializing the same graph every time. Add two new commands:

  • RPC_CMD_GRAPH_COMPUTE_AND_STORE -- store the graph, compute it and return its ID
  • RPC_CMD_GRAPH_RECOMPUTE -- recompute the graph with the given ID

Currently there is no good way to associate an ID with ggml_cgraph, so we abuse tensor->extra of the first node for this purpose.
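
For illustration, here is a minimal sketch of the server-side idea described above (a fixed-size ring buffer of deserialized graphs keyed by an ID); the names, ring size and exact bookkeeping are assumptions, not the actual implementation:

#include "ggml.h"
#include <array>
#include <cstddef>
#include <cstdint>

// illustrative sketch only, not the actual ggml-rpc code
struct stored_graph {
    uint64_t      id    = 0;       // 0 marks an empty slot
    ggml_cgraph * graph = nullptr; // deserialized graph kept alive by the server
};

static constexpr size_t GRAPH_RING_SIZE = 16; // assumed fixed size

struct graph_ring {
    std::array<stored_graph, GRAPH_RING_SIZE> slots{};
    uint64_t next_id = 1;

    // GRAPH_COMPUTE_AND_STORE path: remember the graph and hand back its ID
    uint64_t store(ggml_cgraph * graph) {
        stored_graph & slot = slots[next_id % GRAPH_RING_SIZE];
        slot.id    = next_id;
        slot.graph = graph;
        return next_id++;
    }

    // GRAPH_RECOMPUTE path: look up a stored graph; nullptr if it has been overwritten
    ggml_cgraph * find(uint64_t id) const {
        const stored_graph & slot = slots[id % GRAPH_RING_SIZE];
        return slot.id == id ? slot.graph : nullptr;
    }
};

The client would stash the returned ID in tensor->extra of the graph's first node, so the next compute call can issue RPC_CMD_GRAPH_RECOMPUTE instead of re-serializing and re-sending the whole graph.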

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Aug 18, 2025
@rgerganov
Collaborator Author

The time spent on serializing/deserializing the graph is quite small compared to the graph compute time. I have measured ~1% tg improvement on a 1 Gbps connection:

master

| model | size | params | backend | ngl | test | t/s |
| ------------- | -------: | -------: | -------- | --: | ----: | ------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | pp512 | 116.93 ± 0.34 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg128 | 13.62 ± 0.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg256 | 13.53 ± 0.02 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg512 | 13.27 ± 0.01 |

PR

| model | size | params | backend | ngl | test | t/s |
| ------------- | -------: | -------: | -------- | --: | ----: | ------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | pp512 | 116.79 ± 0.28 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg128 | 13.76 ± 0.06 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg256 | 13.66 ± 0.03 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg512 | 13.37 ± 0.02 |

Not sure if it's worth the complexity we are adding here ...

@jukofyork
Collaborator

When I'm running over RPC, I'm noticing the amount of data sent from the main host to the RPC servers is very asymmetric during token generation, eg:

  • Something like 200MB/s getting sent to each RPC server.
  • Something like 5MB/s getting sent from each RPC server.

Is this the compute graph getting sent (which I can only think must include all the sin/cos values for the positional encodings, or something similar, to have this large a disparity?) and the hidden state(s) getting sent back?

@rgerganov
Collaborator Author

You can set GGML_RPC_DEBUG=1 and then rpc-server will print the commands which are being executed. From the debug logs you can see how much data is being sent and received. For example when using gemma-3-1b-it-q4_0.gguf with a single rpc-server, these are the commands which are used for generating one token:

...
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf17ea040, offset: 0, size: 4608
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1a2a040, offset: 0, size: 4
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1a2a840, offset: 0, size: 8
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1a2b840, offset: 0, size: 8
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1a2c840, offset: 0, size: 65536
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1d2c840, offset: 0, size: 8
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1d2d840, offset: 0, size: 8
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1d2e840, offset: 0, size: 65536
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf252e840, offset: 0, size: 4
[graph_compute] device: 0, n_nodes: 1048, n_tensors: 1449
[get_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1a2a040, offset: 0, size: 1048576
...

For set_tensor you can see how many bytes are received by the server.
For graph_compute, the received data in bytes is n_nodes*8 + n_tensors*296.
For get_tensor you can see how many bytes are sent by the server.
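
If it helps to sanity-check the numbers, the graph_compute payload from the formula above can be computed directly; the 296 bytes per serialized tensor is taken from the comment above rather than re-derived here:

#include <cstdint>

// request payload size for graph_compute, per the formula quoted above:
// 8 bytes per node index plus 296 bytes per serialized tensor
static uint64_t graph_compute_request_size(uint64_t n_nodes, uint64_t n_tensors) {
    return n_nodes * 8 + n_tensors * 296;
}

For the log above, graph_compute_request_size(1048, 1449) == 437288 bytes, i.e. roughly 427 KiB per token on top of the set_tensor/get_tensor traffic.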

@jukofyork
Collaborator

You can set GGML_RPC_DEBUG=1 and then rpc-server will print the commands which are being executed.

Thanks! I'll give this a try and see if I can see what is getting sent.

@slaren
Member

slaren commented Oct 24, 2025

Running the client with GGML_SCHED_DEBUG=1 could also give you an idea of what tensors are being transferred to each server when evaluating a graph (these are the "inputs" of each split). Unfortunately, many tensors don't have very informative names at the moment, but the big one is likely to be the attention mask.

@jukofyork
Collaborator

jukofyork commented Oct 24, 2025

This is what I'm getting for the 2 RPC servers (each with 2 GPUs):

[set_tensor] buffer: 0x562f5d724340, data: 0x7f4302000000, offset: 0, size: 122880
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430201e000, offset: 0, size: 24
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430201e080, offset: 0, size: 48
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430201e100, offset: 0, size: 48
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430201e180, offset: 0, size: 32768
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x562f5d724340, dst->buffer: 0x562f5d162200
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c601e000, offset: 0, size: 24
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c601e080, offset: 0, size: 48
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c601e100, offset: 0, size: 48
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c601e180, offset: 0, size: 32768
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x562f5d162200, data: 0x7f42c603c000, offset: 0, size: 122880
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a000000, offset: 0, size: 122880
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a01e000, offset: 0, size: 24
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a01e080, offset: 0, size: 48
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a01e100, offset: 0, size: 48
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a01e180, offset: 0, size: 32768
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x561d64889540, dst->buffer: 0x561d642c72c0
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e01e000, offset: 0, size: 24
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e01e080, offset: 0, size: 48
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e01e100, offset: 0, size: 48
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e01e180, offset: 0, size: 32768
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e03c000, offset: 0, size: 122880

1080*8 + 1400*296 = 423040

So nothing huge here for 1 token.


But this is what I'm seeing moved over the network whilst generating:

[image: network traffic during token generation]

and here are the last couple of tokens generated:

[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x562f5d162200, data: 0x7f42c60be000, offset: 0, size: 61440
[set_tensor] buffer: 0x562f5d724340, data: 0x7f4302000000, offset: 0, size: 20480
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f000, offset: 0, size: 4
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f080, offset: 0, size: 8
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f180, offset: 0, size: 8
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f280, offset: 0, size: 131072
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x562f5d724340, dst->buffer: 0x562f5d162200
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f000, offset: 0, size: 4
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f080, offset: 0, size: 8
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f180, offset: 0, size: 8
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f280, offset: 0, size: 131072
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x562f5d162200, data: 0x7f42c60be000, offset: 0, size: 20480
[set_tensor] buffer: 0x562f5d724340, data: 0x7f4302000000, offset: 0, size: 143360
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f000, offset: 0, size: 28
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f080, offset: 0, size: 56
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f180, offset: 0, size: 56
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f280, offset: 0, size: 131072
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x562f5d724340, dst->buffer: 0x562f5d162200
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f000, offset: 0, size: 28
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f080, offset: 0, size: 56
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f180, offset: 0, size: 56
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f280, offset: 0, size: 131072
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x562f5d162200, data: 0x7f42c60be000, offset: 0, size: 143360
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e0be000, offset: 0, size: 61440
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a000000, offset: 0, size: 20480
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f000, offset: 0, size: 4
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f080, offset: 0, size: 8
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f180, offset: 0, size: 8
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f280, offset: 0, size: 131072
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x561d64889540, dst->buffer: 0x561d642c72c0
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f000, offset: 0, size: 4
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f080, offset: 0, size: 8
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f180, offset: 0, size: 8
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f280, offset: 0, size: 131072
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e0be000, offset: 0, size: 20480
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a000000, offset: 0, size: 143360
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f000, offset: 0, size: 28
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f080, offset: 0, size: 56
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f180, offset: 0, size: 56
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f280, offset: 0, size: 131072
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x561d64889540, dst->buffer: 0x561d642c72c0
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f000, offset: 0, size: 28
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f080, offset: 0, size: 56
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f180, offset: 0, size: 56
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f280, offset: 0, size: 131072
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e0be000, offset: 0, size: 143360

Is there anything here that can explain the huge discrepancy between what the main host is sending to the 2 RPC servers and what they are sending back?

The amount of data they are sending back looks to be in line with what you would expect for the hidden states.

@jukofyork
Collaborator

jukofyork commented Oct 24, 2025

I think I've deciphered this now (it was confusing because I was also running speculative decoding!).

For graph_compute, the received data in bytes is n_nodes*8 + n_tensors*296

[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400

So this is 1080*8 + 1400*296 = 423040.


[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a000000, offset: 0, size: 20480

This is the hidden state stored as float32 (i.e. 20480/4 = 5120 values).


[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f280, offset: 0, size: 131072

This appears to be 2x the context length, so I assume it is something holding position encodings.


So overall we are sending 423040 + 131072 + 20480 = 574592 bytes and receiving 20480 bytes back here:

[get_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e0be000, offset: 0, size: 20480

which is a 28:1 ratio and in line with the measurements.

@rgerganov
Collaborator Author

The size of the compute graph is substantial in your case (423040 bytes), so reusing the graph (which is what this PR is about) could bring a noticeable improvement for you. I will rebase this PR so you can test it in your environment.

@jukofyork
Collaborator

The size of the compute graph is substantial in your case (423040 bytes), so reusing the graph (which is what this PR is about) could bring a noticeable improvement for you. I will rebase this PR so you can test it in your environment.

Thanks!

@rgerganov
Collaborator Author

@jukofyork I have rebased the PR and I would appreciate it if you could test it in your environment. With this patch the compute graph is stored on the server side and there is no need to send it for every token.

Note that you need to rebuild both client and server parts.

@jukofyork
Collaborator

@jukofyork I have rebased the PR and I would appreciate it if you could test it in your environment. With this patch the compute graph is stored on the server side and there is no need to send it for every token.

Note that you need to rebuild both client and server parts.

Thanks! I'm away for a couple of days, so it will be Thursday/Friday before I can boot up the machines to test it.

@jukofyork
Collaborator

This makes quite a big difference for me:

master

| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| ---------------------- | ---------: | -------: | -------- | --: | -: | ------------------------------- | ----------------------------------- | ----: | ------------: |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | pp512 | 318.55 ± 0.86 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg128 | 15.64 ± 0.00 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg256 | 15.64 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg512 | 15.54 ± 0.01 |

PR

| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| ---------------------- | ---------: | -------: | -------- | --: | -: | ------------------------------- | ----------------------------------- | ----: | ------------: |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | pp512 | 320.18 ± 0.99 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg128 | 19.47 ± 0.02 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg256 | 19.48 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg512 | 19.30 ± 0.01 |

(19.47 + 19.48 + 19.30) / (15.64 + 15.64 + 15.54) = +24.5% tg improvement.

@jukofyork
Collaborator

[image]

This also changes the relative costs: batch_size = 1 and batch_size = 2 now both generate ~19.5 tokens/s (not really relevant for this PR, but it will make more sense when I get time to explain my new "profile guided speculative decoding" code that uses these values...).

@rgerganov rgerganov marked this pull request as ready for review October 28, 2025 17:19
@rgerganov
Collaborator Author

@jukofyork this is great news. I didn't have a chance to test this patch with larger graphs, so thanks for confirming it makes a significant improvement.

@slaren Is there a better way to associate an ID with ggml_cgraph?

@jukofyork
Collaborator

jukofyork commented Oct 30, 2025

@rgerganov @slaren

I've been looking at the RPC code and wonder if there might be a way to simplify/unify all this and remove the RPC_CMD_SET_TENSOR_HASH command, etc.

It's pretty cheap to hash data using #16753 (and with a few hours' work I can probably decipher the 128-bit version and get an extra 1.5x performance for SSE2...), so what if we removed the existing tensor-specific hashing and instead did it in the send_rpc_cmd code:

// RPC request : | rpc_cmd (1 byte) | request_size (8 bytes) | request_data (request_size bytes) |
// No response
static bool send_rpc_cmd(const std::shared_ptr<socket_t> & sock, enum rpc_cmd cmd, const void * input, size_t input_size) {
    uint8_t cmd_byte = cmd;
    if (!send_data(sock->fd, &cmd_byte, sizeof(cmd_byte))) {
        return false;
    }
    if (!send_data(sock->fd, &input_size, sizeof(input_size))) {
        return false;
    }
    if (!send_data(sock->fd, input, input_size)) {
        return false;
    }
    return true;
}

// RPC request : | rpc_cmd (1 byte) | request_size (8 bytes) | request_data (request_size bytes) |
// RPC response: | response_size (8 bytes) | response_data (response_size bytes) |
static bool send_rpc_cmd(const std::shared_ptr<socket_t> & sock, enum rpc_cmd cmd, const void * input, size_t input_size, void * output, size_t output_size) {
    if (!send_rpc_cmd(sock, cmd, input, input_size)) {
        return false;
    }
    // TODO: currently the output_size is always known, do we need support for commands with variable output size?
    // even if we do, we can skip sending output_size from the server for commands with known output size
    uint64_t out_size;
    if (!recv_data(sock->fd, &out_size, sizeof(out_size))) {
        return false;
    }
    if (out_size != output_size) {
        return false;
    }
    if (!recv_data(sock->fd, output, output_size)) {
        return false;
    }
    return true;
}

So:

The RPC servers would still have the (optional) non-volatile cache for large transfers, and the code above would then need to negotiate these transfers; but since these are the large tensors, the added latency would be insignificant, and this is essentially what the RPC_CMD_SET_TENSOR_HASH command is doing anyway; we just move the logic here. As with the current implementation, there would be some host-side-only heuristic(s) applied (minimum size, etc).

But we would also introduce a volatile LRU cache (a rough sketch of the host-side bookkeeping follows this list):

  • It could either cache all data sent/received, or use similar heuristic(s) to the non-volatile cache, but the heuristic(s) would need to be applied both host-side and server-side.
  • The host wouldn't actually need to store the data itself; its LRU cache would just need to store the hashes.
  • As the host knows the LRU cache size and ejection policy the RPC servers will be using, it should be able to keep track of exactly what they have stored in their LRU caches, so there would be no additional latency nor any negotiation needed (ie: it would just be the same as above but with the hash in place of the actual data).
  • The LRU cache would need to be large enough that important data like the compute graph and position embeddings (if that is what that other tensor is that I see getting sent?) don't get ejected by the batches of hidden state data during PP (or some additional heuristic applied to avoid storing hidden state data at all...).
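
As a rough sketch of the host-side bookkeeping only: the host tracks hashes in a small LRU structure that mirrors the server's ejection policy, so it knows whether a payload can be replaced by its hash. The class name, capacity handling and API below are made up for illustration, not part of the existing RPC code:

#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

// hash-only LRU set: the host mirrors the server's cache state without keeping the data
class lru_hash_set {
public:
    explicit lru_hash_set(size_t capacity) : capacity_(capacity) {}

    // returns true if the hash is already cached (send only the hash);
    // returns false if it is new (send the full payload and remember the hash)
    bool check_and_insert(uint64_t hash) {
        auto it = index_.find(hash);
        if (it != index_.end()) {
            order_.splice(order_.begin(), order_, it->second); // mark as most recently used
            return true;
        }
        order_.push_front(hash);
        index_[hash] = order_.begin();
        if (index_.size() > capacity_) { // eject the least recently used hash
            index_.erase(order_.back());
            order_.pop_back();
        }
        return false;
    }

private:
    size_t capacity_;
    std::list<uint64_t> order_;                                          // most recent first
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> index_;  // hash -> position
};

The server would run the same structure but keep the payload alongside each hash, so a hit on the host side means the data can be reconstructed without being re-sent.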

This seems to have lots of benefits.

The only thing I'm not so sure about is whether we are ever sending any pointers (or non-static node indices, etc) over the RPC channel. If so, that will make things much harder and probably a lot more work (but not impossible, eg: using the ideas from https://theboostcpplibraries.com/boost.serialization-pointers-and-references).

@jukofyork
Collaborator

I made a lot of ninja edits just now, so if you're reading from the email notification, please see the comment instead!

@rgerganov
Collaborator Author

I am not sure I understand your proposal. As of now, there are only two commands that benefit from caching -- SET_TENSOR and GRAPH_COMPUTE. Why do we need to modify send_rpc_cmd(), which is used for all RPC commands, and how does this make anything simpler?

Also note that my implementation is reusing compute graphs without serializing them first, so we don't waste CPU time for serialization and hashing.

@slaren
Member

slaren commented Oct 30, 2025

@slaren Is there a better way to associate an ID with ggml_cgraph?

I don't think there is a good way to do this at the moment. This implementation is not reliable either. To do this in a reliable way, you would need to check every node of the graph to determine if it is changed, similar to what the CUDA backend does to implement CUDA graphs. In the future with the graph plan API this may be unnecessary.
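
For reference, a rough sketch of what such a per-node check could look like, using the public ggml graph accessors; this is only loosely modelled on the idea and is not the CUDA backend's actual logic:

#include "ggml.h"
#include <cstring>

// returns true if the new graph matches the previous one node-by-node:
// same node count, and per node the same op, op params, shape and source tensors
static bool cgraph_matches_previous(ggml_cgraph * cur, ggml_cgraph * prev) {
    if (prev == nullptr || ggml_graph_n_nodes(cur) != ggml_graph_n_nodes(prev)) {
        return false;
    }
    for (int i = 0; i < ggml_graph_n_nodes(cur); i++) {
        const ggml_tensor * a = ggml_graph_node(cur,  i);
        const ggml_tensor * b = ggml_graph_node(prev, i);
        if (a->op != b->op ||
            std::memcmp(a->ne, b->ne, sizeof(a->ne)) != 0 ||
            std::memcmp(a->op_params, b->op_params, sizeof(a->op_params)) != 0) {
            return false;
        }
        for (int s = 0; s < GGML_MAX_SRC; s++) {
            if (a->src[s] != b->src[s]) { // same source tensor objects
                return false;
            }
        }
    }
    return true;
}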

    if (!send_data(sock->fd, &cmd_byte, sizeof(cmd_byte))) {
        return false;
    }
    if (!send_data(sock->fd, &input_size, sizeof(input_size))) {
        return false;
    }
    if (!send_data(sock->fd, input, input_size)) {
        return false;
    }

With TCP_NODELAY, each of these calls may be generating a new packet, which may add to the network latency. It may be worth building the entire packet first in a buffer and sending it with a single call to send.

@jukofyork
Collaborator

    if (!send_data(sock->fd, &cmd_byte, sizeof(cmd_byte))) {
        return false;
    }
    if (!send_data(sock->fd, &input_size, sizeof(input_size))) {
        return false;
    }
    if (!send_data(sock->fd, input, input_size)) {
        return false;
    }

With TCP_NODELAY, each of these calls may be generating a new packet, which may add to the network latency. It may be worth building the entire packet first in a buffer and sending it with a single call to send.

I tried this a few days ago but it didn't make any difference for me. Not sure if it's the extra memcpy required to place the 9 bytes before the data, or if it's just that my network stack is still buffering small packets behind the scenes with TCP_NODELAY turned off.

@jukofyork
Collaborator

I am not sure I understand your proposal. As of now, there are only two commands that benefit from caching -- SET_TENSOR and GRAPH_COMPUTE. Why do we need to modify send_rpc_cmd(), which is used for all RPC commands, and how does this make anything simpler?

Also note that my implementation is reusing compute graphs without serializing them first, so we don't waste CPU time for serialization and hashing.

I've got to go out, but I will try and give a more detailed example of what I mean when I get back in. The basic idea is just to extract all the "communication channel" stuff (requiring tunable heuristics, etc) away from the clean RPC interface. send_rpc_cmd is just the point where the heuristics will have the most helpful data to make their decisions, but it could equivalently be done in the lower-level functions as well (although I think it would not be as easy to design the heuristics without the actual command name, etc).

eg: the simple >10MB hashing heuristic is already failing for me where I am sending huge batches; I've tried everything up to 100MB but am still getting some batches hashed and then saved.

@jukofyork
Collaborator

I tried this a few days ago but it didn't make any difference for me. Not sure if it's the extra memcpy required to place the 9 bytes before the data, or if it's just that my network stack is still buffering small packets behind the scenes with TCP_NODELAY turned off.

Here's what I tried for both send_msg and send_rpc_cmd, but it didn't seem to help me at all:

5ce87d1

static bool send_msg(sockfd_t sockfd, const void * msg, size_t msg_size) {
    const size_t header_size = sizeof(msg_size);
    std::vector<uint8_t> buf;
    buf.resize(header_size + msg_size);

    // header
    memcpy(buf.data(), &msg_size, sizeof(msg_size));

    // payload
    if (msg_size > 0) {
        memcpy(buf.data() + header_size, msg, msg_size);
    }

    // single send
    return send_data(sockfd, buf.data(), buf.size());
}
static bool send_rpc_cmd(const std::shared_ptr<socket_t> & sock, enum rpc_cmd cmd, const void * input, size_t input_size) {
    const size_t header_size = 1 + sizeof(input_size);
    std::vector<uint8_t> buf;
    buf.resize(header_size + input_size);

    // header
    buf[0] = static_cast<uint8_t>(cmd);
    memcpy(buf.data() + 1, &input_size, sizeof(input_size));

    // payload
    if (input_size > 0) {
        memcpy(buf.data() + header_size, input, input_size);
    }

    // single send (send_data may still chunk very large buffers, which is fine)
    return send_data(sock->fd, buf.data(), buf.size());
}

@jukofyork
Collaborator

Here's what I tried for both send_msg and send_rpc_cmd, but it didn't seem to help me at all:

5ce87d1

Actually this now seems to be giving me quite a boost! With this PR and that branch merged, I'm now getting this:

| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| ---------------------- | ---------: | -------: | -------- | --: | -: | ------------------------------- | ----------------------------------- | ----: | ------------: |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | pp512 | 354.63 ± 1.07 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg128 | 22.70 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg256 | 22.41 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg512 | 22.07 ± 0.05 |

compared to this the other day:

This makes quite a big difference for me:

master

| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| ---------------------- | ---------: | -------: | -------- | --: | -: | ------------------------------- | ----------------------------------- | ----: | ------------: |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | pp512 | 318.55 ± 0.86 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg128 | 15.64 ± 0.00 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg256 | 15.64 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg512 | 15.54 ± 0.01 |

PR

| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| ---------------------- | ---------: | -------: | -------- | --: | -: | ------------------------------- | ----------------------------------- | ----: | ------------: |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | pp512 | 320.18 ± 0.99 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg128 | 19.47 ± 0.02 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg256 | 19.48 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg512 | 19.30 ± 0.01 |

(19.47 + 19.48 + 19.30) / (15.64 + 15.64 + 15.54) = +24.5% tg improvement.

It may be something else that has changed, so will check next week to be sure, but it seems odd I got around 15 tokens/s when I tested this branch a couple of days ago.

@jukofyork
Collaborator

jukofyork commented Oct 31, 2025

I made a draft PR #16892 to see if this makes any difference to others, or if it only helps with this specific PR, etc.

@jukofyork
Collaborator

jukofyork commented Oct 31, 2025

I also had a go at writing a (very!) hacky proof of concept for the volatile hash cache last night:

// Try to use the volatile cache when the data size falls between these thresholds
const size_t MIN_CACHE_THRESHOLD = 20 * 1024;
const size_t MAX_CACHE_THRESHOLD = 1024 * 1024;

static bool send_data(sockfd_t sockfd, const void * data, size_t size) {
    static std::unordered_set<uint64_t> sent_hashes;

    if (size > MIN_CACHE_THRESHOLD && size < MAX_CACHE_THRESHOLD) {
        uint64_t hash = generate_hash((const uint8_t*)data, size);
        bool is_new = sent_hashes.find(hash) == sent_hashes.end();

        uint8_t flag = is_new ? 1 : 0;
        if (send(sockfd, (const char*)&flag, sizeof(flag), 0) != sizeof(flag)) {
            return false;
        }
        if (send(sockfd, (const char*)&hash, sizeof(hash), 0) != sizeof(hash)) {
            return false;
        }

        if (!is_new) {
            return true;
        }
        sent_hashes.insert(hash);
    }

    size_t bytes_sent = 0;
    while (bytes_sent < size) {
        size_t size_to_send = std::min(size - bytes_sent, MAX_CHUNK_SIZE);
        ssize_t n = send(sockfd, (const char *)data + bytes_sent, size_to_send, 0);
        if (n < 0) {
            GGML_LOG_ERROR("send failed (bytes_sent=%zu, size_to_send=%zu)\n",
                           bytes_sent, size_to_send);
            return false;
        }
        bytes_sent += (size_t)n;
    }
    return true;
}

static bool recv_data(sockfd_t sockfd, void * data, size_t size) {
    static std::unordered_map<uint64_t, std::vector<uint8_t>> recv_cache;

    uint64_t hash = 0;

    if (size > MIN_CACHE_THRESHOLD && size < MAX_CACHE_THRESHOLD) {
        uint8_t flag;
        if (recv(sockfd, (char*)&flag, sizeof(flag), 0) != sizeof(flag)) {
            return false;
        }
        if (recv(sockfd, (char*)&hash, sizeof(hash), 0) != sizeof(hash)) {
            return false;
        }

        if (flag == 0) {
            auto it = recv_cache.find(hash);
            if (it != recv_cache.end()) {
                memcpy(data, it->second.data(), size);
                return true;
            }
            return false;
        }
    }

    size_t bytes_recv = 0;
    while (bytes_recv < size) {
        size_t size_to_recv = std::min(size - bytes_recv, MAX_CHUNK_SIZE);
        ssize_t n = recv(sockfd, (char *)data + bytes_recv, size_to_recv, 0);
        if (n < 0) {
            GGML_LOG_ERROR("recv failed (bytes_recv=%zu, size_to_recv=%zu)\n",
                           bytes_recv, size_to_recv);
            return false;
        }
        if (n == 0) {
            LOG_DBG("recv returned 0 (peer closed?)\n");
            return false;
        }
        bytes_recv += (size_t)n;
    }

    if (size > MIN_CACHE_THRESHOLD && size < MAX_CACHE_THRESHOLD) {
        recv_cache[hash] = std::vector<uint8_t>((uint8_t*)data, (uint8_t*)data + size);
    }

    return true;
}

It did give a small improvement (from 15.5 tokens/s to 17.5 tokens/s), but either the hashing still has a lot of overhead or the repeated serialisation of the compute graph as mentioned by @rgerganov is very costly, as with this PR I get 19.5 tokens/s in comparison! ☹️

I then destroyed the branch trying to hack in the code to join the packets after the push linked above, so will have to take another look at this next week and likely start fresh.

I did find that the other tensor that was exactly 2x the context length (which I suspected was something to do with the position embeddings) is the same each time and could also be cached (this finally produced a balanced flow over the network of about 4-5MB/s, which is purely the hidden state data).

@jukofyork
Collaborator

jukofyork commented Oct 31, 2025

@slaren Is there a better way to associate an ID with ggml_cgraph?

I don't think there is a good way to do this at the moment. This implementation is not reliable either. To do this in a reliable way, you would need to check every node of the graph to determine if it is changed, similar to what the CUDA backend does to implement CUDA graphs. In the future with the graph plan API this may be unnecessary.

It's a pity, as this PR does seem to help a lot (maybe mainly for large MoE models, though?).

From my hacky experiments last night, it definitely looks like @rgerganov is correct that the serialisation + hashing (even using xxhash64) is going to be worse than this PR (my only hope was that it was the multiple packets, but after reading @rgerganov's post on the other PR, I'm not hopeful).

By "visiting every node", does that mean the data of every node needs comparing (as in the full ~450 KB of data for my example), or is it just a set of "node IDs" or similar that need comparing?

@kyuz0

kyuz0 commented Nov 10, 2025

On my setup (two AMD Strix Halo hosts), this PR improved prompt processing by ~5.5%:

  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| minimax-m2 230B.A10B Q6_K      | 180.94 GiB |   228.69 B | ROCm,RPC   |  99 |  1 |    0 |           pp512 |        175.56 ± 1.40 |
| minimax-m2 230B.A10B Q6_K      | 180.94 GiB |   228.69 B | ROCm,RPC   |  99 |  1 |    0 |           tg128 |         17.50 ± 0.08 |

Right now I only have a screenshot of the performance without this PR:

[screenshot]

Happy to do more testing. My setup currently uses USB-to-USB at 10Gbps and I'm trying to see if I can get to 20-40Gbps.

@kyuz0

kyuz0 commented Nov 10, 2025

The improvement was there but minimal for GLM 4.6:

llama-bench -mmp 0 -fa 1 -m GLM-4.6-UD-Q4_K_XL-00001-of-00005.gguf --rpc "192.168.2.2:50052"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| glm4moe 355B.A32B Q4_K - Medium | 189.69 GiB |   356.79 B | ROCm,RPC   |  99 |  1 |    0 |           pp512 |         34.18 ± 0.05 |
| glm4moe 355B.A32B Q4_K - Medium | 189.69 GiB |   356.79 B | ROCm,RPC   |  99 |  1 |    0 |           tg128 |          8.81 ± 0.01 |

@rgerganov
Collaborator Author

@kyuz0 thanks for taking the time to test this on your setup.

I will try to rework this with the approach suggested by @slaren, using the CUDA backend as a reference.

@ggerganov
Member

I wonder if you can simply reuse this approach that I prototyped some time ago for the Metal backend to determine when a graph is the same as the previous one:

https://github.com/ggml-org/llama.cpp/pull/14570/files#diff-4045ccc2edf36af53c4a28fca4a6a24f628b3c9bee7ccb43b06e55dc5e5e199eR5482-R5514

At the time, I thought it was hacky, but I don't actually see a failure case - it simply does a memcmp of the entire graph data at the byte level.

It didn't end up being used because it didn't lead to any improvements for Metal (#14570). But here it could be a reasonable stopgap until the graph plan API lands.
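
A minimal sketch of that byte-level comparison on the client side, assuming the graph has already been serialized into a contiguous buffer before sending (names here are illustrative, not the prototype linked above):

#include <cstdint>
#include <cstring>
#include <vector>

// returns true if the newly serialized graph is byte-identical to the previous one;
// otherwise remembers it for the next comparison
static bool serialized_graph_unchanged(const std::vector<uint8_t> & cur,
                                       std::vector<uint8_t>       & prev) {
    if (!cur.empty() && cur.size() == prev.size() &&
        std::memcmp(cur.data(), prev.data(), cur.size()) == 0) {
        return true;  // safe to ask the server to reuse the stored graph
    }
    prev = cur;       // graph changed: send it again and store the new copy
    return false;
}

Note that this still serializes the graph on the client every time, whereas the reuse path in this PR skips serialization entirely when a stored graph is recomputed, so the two would need to be weighed against each other.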
