Rpc improvement #480

Conversation
Add more checks which prevent the RPC server from crashing if invalid input is received from a client.
RPC_CMD_SET_TENSOR always returns an empty response, and we send this command 4 times per token. We can improve TG speed by not waiting for this empty response. The performance impact of this change depends on the network latency.
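A minimal sketch of the idea, assuming a plain POSIX socket; the helper names (`send_all`, `send_set_tensor_async`) are illustrative and not the actual `ggml-rpc.cpp` API:

```cpp
#include <cstddef>
#include <cstdint>
#include <sys/socket.h>
#include <sys/types.h>

// Write the full buffer, retrying on short sends.
static bool send_all(int sock, const void * buf, size_t len) {
    const char * p = static_cast<const char *>(buf);
    while (len > 0) {
        ssize_t n = send(sock, p, len, 0);
        if (n <= 0) {
            return false; // connection closed or error
        }
        p   += n;
        len -= static_cast<size_t>(n);
    }
    return true;
}

// Send a SET_TENSOR-style command and return immediately, instead of
// blocking on the server's (always empty) response. This saves one
// network round-trip per call, 4 times per generated token.
static bool send_set_tensor_async(int sock, uint8_t cmd,
                                  const void * payload, uint64_t size) {
    if (!send_all(sock, &cmd,  sizeof(cmd)))  return false;
    if (!send_all(sock, &size, sizeof(size))) return false;
    if (!send_all(sock, payload, static_cast<size_t>(size))) return false;
    return true; // no recv(): the empty response is never waited on
}
```

The matching server-side change would be to stop sending the empty reply for this command, so the byte stream between client and server stays in sync.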
* fix(rpc): Improve input validation and error handling

  The `rpc-server` was vulnerable to denial-of-service attacks via several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed messages could trigger failed assertions (e.g., an invalid `ggml_type`) or out-of-bounds reads/writes leading to `GGML_ABORT` calls, crashing the server process.

  This PR introduces robust input validation and replaces `abort()` calls with graceful error handling:

  - **Type Validation:** `deserialize_tensor` now checks that `tensor->type` is within the valid `GGML_TYPE_COUNT` range *before* calling `ggml_new_tensor_4d`, and returns `nullptr` on an invalid type.
  - **Bounds Checks:** Replaced `GGML_ABORT` in the `set_tensor`, `set_tensor_hash`, and `get_tensor` handlers with error logging and a `false` return when data/offset parameters are out of buffer bounds.
  - **Size Checks:** Added overflow-safe arithmetic checks in `graph_compute` when calculating required message sizes from the client-provided `n_nodes` and `n_tensors`, returning early if the reported sizes conflict with the actual message size or would overflow.
  - **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from `deserialize_tensor` and from its recursive calls, propagating `nullptr` upwards on failure. Uses `find` instead of `at` for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor` and sets the response status to failure if deserialization or bounds checks fail.
    - `graph_compute` now checks for a `nullptr` return from `create_node` and returns failure status correctly. The final return value now reflects the actual computation status.

  These changes improve the RPC server's resilience against malformed client requests, preventing crashes and ensuring errors are handled more gracefully.

  Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): address PR comments

  Removed comments and unnecessary returns.

  Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): ambiguous nullptr from create_node

  `rpc_server::create_node` could previously return `nullptr` either when the input ID was 0 (valid) or when an internal error (deserialization or recursion failure) occurred (invalid). This ambiguity made error handling difficult for the caller (`graph_compute`). This commit clarifies the meaning of `nullptr`:

  - `graph_compute` now checks whether the input `id` was non-zero when `create_node` returns `nullptr`, correctly distinguishing failures from intentional null links.
  - `create_node` avoids recursive calls for zero IDs and propagates `nullptr` unambiguously on failure during recursion.

  Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): initial zero check in create_node

  The caller (`graph_compute`) already checks `id != 0` when handling a `nullptr` return from `create_node`, correctly distinguishing intentional null links from actual errors. This makes the initial `if (id == 0)` check redundant. Also removes the log message, added in this branch, for a tensor ID not found in the provided map.

  Signed-off-by: Ville Vesilehto <[email protected]>

* fix(rpc): Handle get_alloc_size failure in server

  Check the return value of `server.get_alloc_size` in the RPC server loop. If the call fails, return early to close the connection.

  Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): input size validation in graph_compute

  Removes the detailed, step-by-step size calculations and overflow checks in favor of simpler direct comparisons, assuming 64-bit overflow is unlikely.

  Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): remove extra status code setting

  Removes the explicit setting of `response.result = GGML_STATUS_FAILED` when `create_node` returns `nullptr` within `graph_compute`. The primary failure signal is the `false` return value.

  Signed-off-by: Ville Vesilehto <[email protected]>

* refactor(rpc): remove redundant check for tensor->type

  Breaks CI on ubuntu-cpu-make. Tensor type is `uint32_t`, thus the check is not needed.

  Signed-off-by: Ville Vesilehto <[email protected]>

---------

Signed-off-by: Ville Vesilehto <[email protected]>
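A hedged sketch of the validation pattern described in the first commit above; the names (`rpc_tensor_hdr`, `validate_set_tensor`) are hypothetical and the real `ggml-rpc.cpp` signatures differ:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical client-supplied header for a SET_TENSOR request.
struct rpc_tensor_hdr {
    uint32_t type;   // claimed ggml tensor type
    uint64_t offset; // write offset into the target buffer
    uint64_t size;   // payload size in bytes
};

// Validate before touching any buffer; log and return false instead of
// aborting, so one bad client cannot take down the whole server.
static bool validate_set_tensor(const rpc_tensor_hdr & hdr,
                                uint64_t buffer_size,
                                uint32_t ggml_type_count) {
    // Type check: reject out-of-range type ids before they reach
    // ggml_new_tensor_4d, where they would trip an assertion.
    if (hdr.type >= ggml_type_count) {
        fprintf(stderr, "rpc: invalid tensor type %u\n", (unsigned) hdr.type);
        return false;
    }
    // Bounds check written to avoid integer overflow: compare against the
    // remaining space rather than computing hdr.offset + hdr.size, which
    // could wrap around for hostile inputs.
    if (hdr.offset > buffer_size || hdr.size > buffer_size - hdr.offset) {
        fprintf(stderr, "rpc: out-of-bounds write rejected (offset=%llu size=%llu)\n",
                (unsigned long long) hdr.offset, (unsigned long long) hdr.size);
        return false;
    }
    return true; // safe to copy the payload into the buffer
}
```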
Signed-off-by: xiaofei <[email protected]>
Zero out the name and padding buffers.
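For illustration, a tiny sketch of that pattern; the wire struct here is hypothetical, not the actual `rpc_tensor` layout:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical fixed-size wire format for a serialized tensor.
struct rpc_tensor_wire {
    char    name[64];
    uint8_t padding[8];
};

static void clear_wire_fields(rpc_tensor_wire & t) {
    // Zero the fixed-size fields before filling them in, so uninitialized
    // stack bytes never leak to the client over the network.
    memset(t.name,    0, sizeof(t.name));
    memset(t.padding, 0, sizeof(t.padding));
}
```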
Has this been tested? If so, with what models, backends, and configurations? I attempted a similar PR a while ago (see #193), and when tested it did not work with Qwen2.5 72B, since on mainline the PR that added "non-512 aligned tensors" was created specifically to support that model. I also found that KV cache quantization still did not work with RPC, with or without #193.
I don't use RPC, so I need other people to confirm that this works.
I don't mind testing and reviewing this, but before I do, I want to know which new models/configurations supported by this PR @firecoperana tested and saw benefit from. I deleted pretty much all the models I previously used for RPC testing when trying to bring support parity up to mainline.
I tested various quants of Deepseek v2.5, v3, and v3 0324 models, and it works. V3 0324 is the one with MLA support from mainline. I didn't test other models as I don't use them on this repo.
Did you test with
My main machine is a 3090 with 128GB DDR4. I used -ot to override individual expert tensors to my other machines (DDR4-3000 RAM and a 3060), with --cache-type-k q8_0 and a batch size of 512, in which case I can load the whole model into VRAM and RAM combined. I use the CPU RPC backend to use RAM from remote machines. For Deepseek V3 Q2_K_XL, I can get 10 it/s for pp and 3 it/s for inferencing. Deepseek V2.5 Q4 is about 6-7 it/s for inferencing.
Thank you for the details. For now I'll do some testing on Deepseek, with an RPC backend on my 3090. For reference, with my pure IQ4_K_R4 I get similar speeds to what you get with RPC for both PP and TG, so hopefully with RPC it can improve (and since those quants are now supported on CUDA, I won't need to make a new quant).
Be sure to set -t n -c in the CPU backend, where n is the number of threads you want the tensors in RAM to use. -c makes it load tensors from local files next time; this is useful if you have slow LAN transfer speed.
No user feedback here, so new strategy: I'll merge this tomorrow. If we don't get bug reports, all is good. If we do get bug reports, all is good too, because we know that it needs further work.
I haven't found the time to test this, but I do plan to in the next few days. (I've already downloaded a few of the models I plan to use to test this alongside Deepseek.) Either way, I'll give some feedback even if it has already been merged by then.
This reverts commit 8a5f857.
I get build errors after merging this PR, so I reverted it. Please fix and resubmit.
What's the error? Does the error happen only when you set -DGGML_RPC=OFF?
This is with
Fixed.
Finally got around to testing this. It seems functional (sweep-bench testing only), but I couldn't get any performance advantage from offloading Deepseek-V3 based models via RPC to my 3090. I know when I tested that on mainline I also noticed a performance regression (one that grew the more I offloaded). (I ran with ...) Measuring at low context values, PP drops from ~10.5 to ~4.5 and TG drops from ~3.3 to ~1. I may revisit when I eventually get an InfiniBand connection between the two computers and see if that helps.
Can you add --tensor-split 0,99? This will make sure all non-expert layers are offloaded to the RPC machine. You could try offloading expert layers to your 3090 with blk.(12|13).ffn_.*_exps=RPC[10.0.0.250:50052] to fully use the 3090's VRAM. You could also try using the 3090 as your main GPU and your server for offloading expert layers. Your speed drop is too much.
I ran a low context test, but I would still care about maximizing usable context (and I would use far more than 4k). Including the log below so you can see what it offloaded.

Output log from `-ot`:
Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU
That would mean transferring hundreds of gigs over what is currently a gigabit connection (I know I could then use the ...). There definitely was a lot of network traffic happening during inference. I don't remember if that is normal from when I used to run RPC with dense models and simple RPC offloading, which netted me a benefit (even when run like this).
I am encountering abysmal performance with ik_llama and RPC. I "assume" it's RPC related. I am using the following build flags

across 3 hosts, each with 4 A5000 GPUs (24GB VRAM each).

With the same -ot and comparable settings on llama.cpp I am running at around 7.4 T/s; using ik_llama.cpp

I get this: 66199.41 ms per token, 0.02 tokens per second. Any help would be appreciated.
I never use RPC, so somebody else should comment.
I got RPC at least working after redoing the arguments from the first post. Using

to run it on only one host gets me closer to the "expected performance".

With RPC

I get around 5.5 T/s, which is still less than llama.cpp with the same RPC setting and the unsloth/DeepSeek-R1-0528-GGUF UD-Q2_K_XL quant?? The only real difference would be the -ot setting there,

giving me 7.5 T/s. I would have assumed ik_llama is faster.
You can use Unsloth's UD-Q2_K_XL model (or any model that works with
Also, for a fair comparison, please check whether the allocation of VRAM buffers and layers for each GPU and CPU is the same as mainline. I use tensor-split to control the exact number of layers for each GPU. And note that ik_llama has a different order for tensor split than llama.cpp.
Includes various improvements to RPC from mainline, including: