
Conversation

firecoperana
Collaborator

Includes various RPC improvements from mainline, including:

  1. allow RPC backends as targets of the override-tensor option
  2. add an argument for the number of threads in the CPU RPC backend
  3. cache the model locally on the RPC server
  4. send over TCP without delay (see the sketch below)
  5. various bug fixes
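For item 4, a minimal sketch of what disabling the TCP send delay typically looks like; the helper name and where it gets called from in ggml-rpc.cpp are assumptions, not the exact code in this PR:

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Disable Nagle's algorithm so small RPC messages go out immediately instead
// of being coalesced; this mainly helps token generation, where many tiny
// messages are exchanged per token.
static bool set_no_delay(int sockfd) {
    int flag = 1;
    return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, (char *)&flag, sizeof(flag)) == 0;
}
```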

firecoperana and others added 20 commits May 31, 2025 17:58
Add more checks to prevent the RPC server from crashing if invalid input
is received from the client
# Conflicts:
#	ggml/src/ggml-rpc.cpp
RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
# Conflicts:
#	ggml/src/ggml-rpc.cpp
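A rough client-side sketch of the "don't wait" idea; `send_all` and the exact message framing (1-byte command, 8-byte payload size, payload) are illustrative assumptions rather than the actual ggml-rpc.cpp helpers:

```cpp
#include <cstdint>
#include <sys/socket.h>
#include <sys/types.h>

// Hypothetical low-level helper: write the whole buffer, retrying on short sends.
static bool send_all(int sockfd, const void * data, size_t size) {
    size_t sent = 0;
    while (sent < size) {
        ssize_t n = send(sockfd, (const char *)data + sent, size - sent, 0);
        if (n <= 0) return false;
        sent += (size_t)n;
    }
    return true;
}

// Fire-and-forget variant used for RPC_CMD_SET_TENSOR: serialize the command
// and payload, but skip the recv of the (always empty) response.
static bool send_cmd_no_reply(int sockfd, uint8_t cmd, const void * input, uint64_t input_size) {
    if (!send_all(sockfd, &cmd, sizeof(cmd)))               return false;
    if (!send_all(sockfd, &input_size, sizeof(input_size))) return false;
    if (!send_all(sockfd, input, (size_t)input_size))       return false;
    return true; // no blocking round trip per tensor upload
}
```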
* fix(rpc): Improve input validation and error handling

The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.

This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:

- **Type Validation:** `deserialize_tensor` now checks that
  `tensor->type` is within the valid `GGML_TYPE_COUNT` range
  *before* calling `ggml_new_tensor_4d`, and returns `nullptr`
  on an invalid type (sketched after this list).
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
  `set_tensor_hash`, and `get_tensor` handlers with error
  logging and returning `false` when data/offset parameters
  are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
  `graph_compute` when calculating required message sizes based
  on client-provided `n_nodes` and `n_tensors`. Returns early
  if the reported sizes conflict with the actual message size or
  would lead to overflow.
- **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from
      `deserialize_tensor` and its recursive calls, propagating
      `nullptr` upwards on failure. Uses `find` instead of `at`
      for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
      and sets the response status to failure if deserialization
      or bounds checks fail.
    - `graph_compute` now checks for `nullptr` return from
      `create_node` and returns failure status correctly. The final
      return value now reflects the actual computation status.
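An abbreviated sketch of that type check; the rest of `deserialize_tensor` and the exact logging call are elided/assumed here:

```cpp
// Reject out-of-range types before ggml_new_tensor_4d() can hit an assertion.
static ggml_tensor * deserialize_tensor(struct ggml_context * ctx, const rpc_tensor * tensor) {
    if (tensor->type >= GGML_TYPE_COUNT) {
        fprintf(stderr, "%s: invalid tensor type received: %u\n", __func__, tensor->type);
        return nullptr;
    }
    ggml_tensor * result = ggml_new_tensor_4d(ctx, (ggml_type) tensor->type,
                                              tensor->ne[0], tensor->ne[1], tensor->ne[2], tensor->ne[3]);
    if (result == nullptr) {
        return nullptr;
    }
    // ... copy remaining fields and validate data/offset bounds ...
    return result;
}
```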

These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.

Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): address pr comments

removed comments and unnecessary returns

Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): ambiguous nullptr from create_node

rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).

This commit clarifies the meaning of nullptr:
- `graph_compute` now checks whether the input `id` was non-zero when
  `create_node` returns nullptr, correctly distinguishing failures
  from intentional null links (see the sketch below).
- `create_node` avoids recursive calls for zero IDs and propagates
  nullptr unambiguously on failure during recursion.
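A caller-side sketch of that disambiguation; the surrounding `graph_compute` code and variable names are abbreviated assumptions based on the description above:

```cpp
// Inside rpc_server::graph_compute, while rebuilding the graph from serialized ids:
uint64_t id = tensor_ids[i];
ggml_tensor * node = create_node(id, ctx, tensor_ptrs, tensor_map);
if (node == nullptr && id != 0) {
    // id == 0 legitimately means "no tensor"; a null result for a non-zero id is an error
    fprintf(stderr, "%s: failed to create node for id %llu\n", __func__, (unsigned long long) id);
    return false;
}
graph->nodes[i] = node;
```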

Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): initial zero check in create_node

The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.

Also removes the log message (added earlier in this branch) that was
emitted when a tensor ID is not found in the provided map.

Signed-off-by: Ville Vesilehto <[email protected]>

* fix(rpc): Handle get_alloc_size failure in server

Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.
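A sketch of the early return in the server's command loop; the command name, message types, and `send_msg` helper are assumptions about the surrounding code, not a quote of it:

```cpp
// Hypothetical excerpt of the command dispatch in rpc_serve_client().
case RPC_CMD_GET_ALLOC_SIZE: {
    rpc_msg_get_alloc_size_rsp response;
    if (!server.get_alloc_size(request, response)) {
        return; // previously the failure was not checked; now the connection is closed
    }
    if (!send_msg(sockfd, &response, sizeof(response))) {
        return;
    }
    break;
}
```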

Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): input size validation in graph_compute

Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.
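Roughly what such a direct comparison looks like when validating the serialized graph header; the field layout (a 32-bit node count followed by 64-bit node ids and a 32-bit tensor count) is an assumption based on the serialization format described above:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Returns false if the client-reported n_nodes cannot fit in the received message.
static bool graph_header_is_sane(const std::vector<uint8_t> & input) {
    if (input.size() < sizeof(uint32_t)) {
        return false;
    }
    uint32_t n_nodes;
    memcpy(&n_nodes, input.data(), sizeof(n_nodes));
    // one direct comparison; with a 64-bit size_t this cannot realistically overflow
    return input.size() >= sizeof(uint32_t) + (size_t)n_nodes*sizeof(uint64_t) + sizeof(uint32_t);
}
```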

Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): remove extra status code setting

Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
The primary failure signal is the `false` return value.

Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): remove redundant check for tensor->type

The redundant check breaks CI on ubuntu-cpu-make: the tensor type is a
uint32_t, so the check is not needed.

Signed-off-by: Ville Vesilehto <[email protected]>

---------

Signed-off-by: Ville Vesilehto <[email protected]>
# Conflicts:
#	ggml/src/ggml-rpc.cpp
Signed-off-by: xiaofei <[email protected]>
# Conflicts:
#	examples/rpc/rpc-server.cpp
@saood06
Collaborator

saood06 commented Jun 1, 2025

Has this been tested? If so, with what models, backends, and configurations? I attempted a similar PR a while ago (see #193), and when I tested it, it did not work with Qwen2.5 72B, since on mainline the PR that added support for "non-512 aligned tensors" was created specifically for that model. I also found that KV cache quantization still did not work with RPC, with or without #193.

@ikawrakow
Owner

I don't use RPC, so need other people to confirm that this works.

@saood06
Collaborator

saood06 commented Jun 1, 2025

I don't use RPC, so need other people to confirm that this works.

I don't mind testing and reviewing this, but before I do, I want to know which of the new models/configurations this PR supports @firecoperana tested and saw a benefit from. I deleted pretty much all the models I previously used for RPC testing while trying to bring support parity up to mainline.

@firecoperana
Collaborator Author

I tested various quants of Deepseek v2.5, v3, v3 0324 models and it works. V3 0324 is the one with MLA support from mainline. Didn't test other models as I don't use them on this repo.

@saood06
Collaborator

saood06 commented Jun 1, 2025

I tested various quants of Deepseek v2.5, v3, v3 0324 models and it works. V3 0324 is the one with MLA support from mainline. Didn't test other models as I don't use them on this repo.

Did you test with -ot or cache quantization? Do you mind sharing performance and what hardware you used?

@firecoperana
Collaborator Author

My main machine is a 3090 with 128GB of DDR4. I used -ot to override individual expert tensors to my other machines (DDR4-3000 RAM and a 3060), with --cache-type-k q8_0 and a batch size of 512, in which case I can load the whole model into VRAM and RAM combined. I use the CPU RPC backend to use RAM from the remote machines. For Deepseek V3 Q2_K_XL, I can get 10 it/s for PP and 3 it/s for inference. Deepseek V2.5 Q4 is about 6-7 it/s for inference.

@saood06
Collaborator

saood06 commented Jun 1, 2025

My main machine is 3090 with 128GB ddr4. I did -ot to override individual expert tensors to my other machines with ddr4 3000mhz ram and 3060, and with --cache-type-k q8_0 and batch size of 512, in which case I can load the whole model into either vram and ram. I use cpu RPC backend to use ram from remote machines. For Deepseek V3 Q2_K_XL, I can get 10 it/s for pp and 3 it/s for inferencing. Deepseek V2.5 Q4 is about 6-7 it/s for inferencing.

Thank you for the details. For now I'll do some testing on Deepseek, with an RPC backend on my 3090 and -ot, with the rest of the model in RAM on the DDR4 server I usually use for inference.

For reference, with my pure IQ4_K_R4 I get speeds similar to what you get with RPC for both PP and TG, so hopefully RPC can improve on that (and since those quants are now supported on CUDA, I won't need to make a new quant).

@firecoperana
Collaborator Author

Be sure to set -t n -c for the CPU backend, where n is the number of threads you want to use for the tensors kept in RAM. -c caches the tensors locally so they can be loaded from local files next time, which is useful if your LAN transfer speed is slow.

@ikawrakow
Owner

No user feedback here, so new strategy: I'll merge this tomorrow. If we don't get bug reports, all is good. If we do get bug reports, all is good too because we know that it needs further work.

@saood06
Collaborator

saood06 commented Jun 2, 2025

No user feedback here, so new strategy: I'll merge this tomorrow. If we don't get bug reports, all is good. If we do get bug reports, all is good too because we know that it needs further work.

I haven't found the time to test this, but I do plan to in the next few days. (I've already downloaded a few of the models I plan to use to test this alongside Deepseek.) Either way I'll give some feedback, even if it has already been merged by then.

@ikawrakow ikawrakow merged commit 8a5f857 into ikawrakow:main Jun 8, 2025
ikawrakow pushed a commit that referenced this pull request Jun 8, 2025
@ikawrakow
Owner

I get build errors after merging this PR, so reverted. Please fix and resubmit.

@firecoperana
Collaborator Author

firecoperana commented Jun 8, 2025

I get build errors after merging this PR, so reverted. Please fix and resubmit.

What's the error? Does the error happen only when you set -DGGML_RPC=OFF?

@ikawrakow
Owner

/home/iwan/other/ik_llama.cpp/common/common.cpp: In function ‘bool gpt_params_find_arg(int, char**, const std::string&, gpt_params&, int&, bool&)’:
/home/iwan/other/ik_llama.cpp/common/common.cpp:1013:13: error: ‘ggml_backend_rpc_buffer_type’ was not declared in this scope; did you mean ‘ggml_backend_cpu_buffer_type’?
 1013 |             ggml_backend_rpc_buffer_type(server.c_str());
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |             ggml_backend_cpu_buffer_type
/home/iwan/other/ik_llama.cpp/common/common.cpp:1016:9: error: ‘ggml_backend_rpc_buffer_type’ was not declared in this scope; did you mean ‘ggml_backend_cpu_buffer_type’?
 1016 |         ggml_backend_rpc_buffer_type(servers.c_str());
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |         ggml_backend_cpu_buffer_type

This is with -DGGML_RPC=OFF
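(The usual fix for this class of error is to guard the RPC-only call in common.cpp behind the RPC compile flag; a minimal sketch, in which the GGML_USE_RPC macro name and the surrounding argument-parsing code are assumptions, not the actual commit:)

```cpp
// Inside gpt_params_find_arg(), where the --rpc server list is handled:
#ifdef GGML_USE_RPC
        // only reference the RPC backend when built with -DGGML_RPC=ON
        ggml_backend_rpc_buffer_type(servers.c_str());
#endif
```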

@firecoperana
Collaborator Author

Fixed

@saood06 saood06 mentioned this pull request Jun 15, 2025
@saood06
Collaborator

saood06 commented Jun 22, 2025

Finally got around to testing this. It seems functional (sweep-bench testing only), but I couldn't get any performance advantage from offloading Deepseek-V3 based models via RPC to my 3090. When I tested that on mainline I also noticed a performance regression (one that grew the more I offloaded).

(I ran with ./llama-sweep-bench -m /mnt/sda/DeepseekR1_0528/DeepseekR1_0528-IQ4_K_R4_ATT1.gguf --numa distribute -t 48 -mla 3 -fa -fmoe -c 4096 --rpc 10.0.0.250:50052 -ot exps=CPU -ngl 99 --warmup-batch)

Measuring at low context values, PP drops from ~10.5 to ~4.5, TG drops from ~3.3 to ~1.

I may revisit when I eventually get an infiniband connection between the two computers and see if that helps.

@firecoperana
Collaborator Author

Finally got around to testing this. It seems functional (sweep-bench testing only), but I couldn't get any performance advantage from offloading Deepseek-V3 based models via RPC to my 3090. I know when I tested that on mainline I also noticed a performance regression (that went up with the more I offloaded).

(I ran with ./llama-sweep-bench -m /mnt/sda/DeepseekR1_0528/DeepseekR1_0528-IQ4_K_R4_ATT1.gguf --numa distribute -t 48 -mla 3 -fa -fmoe -c 4096 --rpc 10.0.0.250:50052 -ot exps=CPU -ngl 99 --warmup-batch)

Measuring at low context values, PP drops from ~10.5 to ~4.5, TG drops from ~3.3 to ~1.

I may revisit when I eventually get an infiniband connection between the two computers and see if that helps.

Can you add --tensor-split 0,99? This will make sure all non-expert layers are offloaded to the RPC machine. You could try to offload expert layers to your 3090 with `blk.(12|13).ffn_.*_exps=RPC[10.0.0.250:50052]` to fully use the 3090's VRAM. You could also try using the 3090 as your main GPU and your server for offloading expert layers. Your speed drop is too large.

@saood06
Collaborator

saood06 commented Jun 23, 2025

Can you add --tensor-split 0,99? This will make sure all non-expert layers are offloaded to RPC machine. You could try to offload expert layers to your 3090 with blk.(12|13).ffn_.*_exps=RPC[10.0.0.250:50052] to fully use 3090's VRAM.

I ran a low context test, but I would still care about maximizing usable context (and I would use far more than 4k). I'm including the log below so you can see what it offloaded.

Output log from `-ot`

Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU

You could also try to use 3090 as your main GPU and your server for offloading expert layers. Your speed drop is too much.

That would mean transferring hundreds of gigs over what is currently a gigabit connection (I know I could then use the -c feature you suggest). I might test that later.

There was definitely a lot of network traffic happening during inference. I don't remember whether that is normal from when I used to run RPC with dense models and simple RPC offloading, which did net me a benefit (even when run like this).

@HariboApfel

I am encountering abysmal performance with ik_llama and RPC. I "assume" it's RPC-related.

I am using

./llama-cli --version
version: 3770 (b5f2f001)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

with the following build flags:

cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_RPC=ON -DGGML_SCHED_MAX_COPIES=1

across 3 hosts, each with 4 A5000 GPUs (24 GB VRAM each); the hosts are only connected via a switch.

./ik_llama.cpp/build/bin/llama-cli \
  --rpc "$RPC_SERVERS" \
  --model models/ubergarm/DeepSeek-R1-0528-GGUF/IQ2_K_R4/DeepSeek-R1-0528-IQ2_K_R4-00001-of-00005.gguf \
  --threads 48 \
  --n-gpu-layers 99 \
  --temp 0.6 \
  --top_p 0.95 \
  --min_p 0.01 \
  --ctx-size 16384 \
  --parallel 1 \
  --flash-attn \
  --verbosity 3 \
  -v \
  -mla 3 -fa \
  -amb 512 \
  -fmoe \
  -ctk q8_0 \
  -ot "\.(3[3-9]|4[0-9]|5[0-9]|6[0-9]|7[0-9]|8[0-9]|9[0-9]|[0-9][0-9][0-9])\.ffn_up_exps.=CPU" \
  --prompt

With the same -ot and comparable settings on llama.cpp I am running at around 7.4 T/s.

using

ik_llama.cpp

llama_print_timings: load time = 639497.76 ms
llama_print_timings: sample time = 0.39 ms / 3 runs (0.13 ms per token, 7772.02 tokens per second)
llama_print_timings: prompt eval time = 70116.00 ms / 220 tokens (318.71 ms per token, 3.14 tokens per second)
llama_print_timings: eval time = 132398.81 ms / 2 runs (66199.41 ms per token, 0.02 tokens per second)
llama_print_timings: total time = 253512.88 ms / 222 tokens

I get this: 66199.41 ms per token, 0.02 tokens per second.

Any help would be appreciated.

@ikawrakow
Owner

I never use RPC, so somebody else should comment.

@HariboApfel

I at least got RPC working after redoing the arguments based on the first post.

using

./ik_llama.cpp/build/bin/llama-cli \
  --model models/ubergarm/DeepSeek-R1-0528-GGUF/IQ2_K_R4/DeepSeek-R1-0528-IQ2_K_R4-00001-of-00005.gguf \
  --threads 48 \
  --n-gpu-layers 99 \
  --temp 0.6 \
  --top_p 0.95 \
  --min_p 0.01 \
  --ctx-size 16384 \
  --flash-attn \
  -mla 3 -fa \
  -amb 512 \
  -fmoe \
  -ctk q8_0 \
  -ot "blk\.(1|2|3|4|5|6)\.ffn_.*=CUDA0" \
  -ot "blk\.(7|8|9|10)\.ffn_.*=CUDA1" \
  -ot "blk\.(11|12|13|14)\.ffn_.*=CUDA2" \
  -ot "blk\.(15|16|17|18)\.ffn_.*=CUDA3" \
  --override-tensor exps=CPU \
  --prompt

Running it on only one host gets me closer to the "expected performance":

llama_print_timings: load time = 98945.88 ms
llama_print_timings: sample time = 45.19 ms / 384 runs (0.12 ms per token, 8497.27 tokens per second)
llama_print_timings: prompt eval time = 5969.18 ms / 224 tokens (26.65 ms per token, 37.53 tokens per second)
llama_print_timings: eval time = 57680.32 ms / 383 runs (150.60 ms per token, 6.64 tokens per second)
llama_print_timings: total time = 63916.49 ms / 607 tokens

With RPC:

./ik_llama.cpp/build/bin/llama-cli \
  --rpc "$RPC_SERVERS" \
  --model models/ubergarm/DeepSeek-R1-0528-GGUF/IQ2_K_R4/DeepSeek-R1-0528-IQ2_K_R4-00001-of-00005.gguf \
  --threads 48 \
  --n-gpu-layers 99 \
  --temp 0.6 \
  --top_p 0.95 \
  --min_p 0.01 \
  --ctx-size 16384 \
  --flash-attn \
  -mla 3 -fa \
  -amb 512 \
  -fmoe \
  -ctk q8_0 \
  -ot "blk\.(1|2|3|4|5|6)\.ffn_.*=CUDA0" \
  -ot "blk\.(7|8|9|10)\.ffn_.*=CUDA1" \
  -ot "blk\.(11|12|13|14)\.ffn_.*=CUDA2" \
  -ot "blk\.(15|16|17|18)\.ffn_.*=CUDA3" \
  -ot "blk\.(19|20|21|22)\.ffn_.*=RPC[10.0.0.28:50052]" \
  -ot "blk\.(23|24|25|26)\.ffn_.*=RPC[10.0.0.28:50053]" \
  -ot "blk\.(27|28|29|30)\.ffn_.*=RPC[10.0.0.28:50054]" \
  -ot "blk\.(31|32|33|34)\.ffn_.*=RPC[10.0.0.28:50055]" \
  -ot "blk\.(35|36|37|38)\.ffn_.*=RPC[10.0.0.40:50052]" \
  -ot "blk\.(39|40|41|42)\.ffn_.*=RPC[10.0.0.40:50053]" \
  -ot "blk\.(43|44|45|46)\.ffn_.*=RPC[10.0.0.40:50054]" \
  -ot "blk\.(47|48|49|50)\.ffn_.*=RPC[10.0.0.40:50055]" \
  --override-tensor exps=CPU \
  --prompt

I get around 5.5 T/s:
llama_print_timings: load time = 568857.08 ms
llama_print_timings: sample time = 963.77 ms / 7798 runs (0.12 ms per token, 8091.13 tokens per second)
llama_print_timings: prompt eval time = 8689.40 ms / 224 tokens (38.79 ms per token, 25.78 tokens per second)
llama_print_timings: eval time = 1420492.95 ms / 7797 runs (182.18 ms per token, 5.49 tokens per second)
llama_print_timings: total time = 1432903.60 ms / 8021 tokens

which is still less than llama.cpp with the same RPC setting and the unsloth/DeepSeek-R1-0528-GGUF/UD-Q2_K_XL quant??

The only real difference would be the -ot setting there.
For llama.cpp I use

--cache-type-k q4_0 \
  --threads 48 \
  --n-gpu-layers 99 \
  --prio 3 \
  --temp 0.6 \
  --top_p 0.95 \
  --min_p 0.01 \
  --flash-attn \
  --ctx-size 16384 \
  -ot "\.(3[5-9]|4[0-9]|5[0-9]|6[0-9]|7[0-9]|8[0-9]|9[0-9]|[0-9][0-9][0-9])\.ffn_up_exps.=CPU" \
  -no-cnv \
  --prompt

giving me 7.5 T/s.

I would have assumed ik_llama is faster.

@ikawrakow
Owner

You can use Unsloth's UD-Q2_K_XL model (or any model that works with llama.cpp) with ik_llama.cpp just fine, and that would be more of an apples-to-apples comparison. It would also be useful to use the same cache type if you are after a performance comparison.

@firecoperana
Collaborator Author

Also, for a fair comparison, please check whether the VRAM buffer allocation and the layers assigned to each GPU and the CPU are the same as on mainline. I use tensor-split to control the exact number of layers for each GPU. Note that ik_llama uses a different tensor-split order than llama.cpp.
