fix(rpc): Improve input validation and error handling #13069

thevilledev · 2025-04-22T14:58:47Z

The rpc-server was vulnerable to Denial of Service attacks via several RPC commands (SET_TENSOR, GRAPH_COMPUTE, etc.). Malformed messages could trigger failed assertions (e.g., invalid ggml_type) or out-of-bounds reads/writes leading to GGML_ABORT calls, crashing the server process.

This PR introduces robust input validation and replaces abort() calls with graceful error handling:

- Type Validation: deserialize_tensor now checks if the tensor->type is within the valid GGML_TYPE_COUNT range before calling ggml_new_tensor_4d. Returns nullptr on invalid type.
- Bounds Checks: Replaced GGML_ABORT in the set_tensor, set_tensor_hash, and get_tensor handlers with error logging and returning false when data/offset parameters are out of buffer bounds.
- Error Propagation:
  - create_node now checks for nullptr return values from deserialize_tensor and its recursive calls, propagating nullptr upwards on failure. Uses find instead of at for safer map access.
  - copy_tensor now checks for nullptr from deserialize_tensor and sets the response status to failure if deserialization or bounds checks fail.
  - graph_compute now checks for a nullptr return from create_node and returns failure status correctly. The final return value now reflects the actual computation status.
  - The RPC_CMD_GET_ALLOC_SIZE handler now checks the return value of server.get_alloc_size in the RPC server loop. If the call fails, it returns early to close the connection.
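To illustrate the bounds-check pattern described in this PR, here is a minimal, self-contained sketch. It is not the actual ggml-rpc code: `fake_buffer`, `in_bounds`, and `set_tensor_checked` are hypothetical names used only for this example. The point is that an out-of-range request is logged and rejected with `false` rather than aborting the process.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a backend buffer; not a ggml type.
struct fake_buffer {
    std::vector<uint8_t> storage;
};

// Returns true only if [offset, offset + size) lies inside the buffer.
// offset + size is never computed, so the check itself cannot wrap around.
static bool in_bounds(const fake_buffer & buf, uint64_t offset, uint64_t size) {
    const uint64_t total = buf.storage.size();
    return offset <= total && size <= total - offset;
}

// Hypothetical handler: logs and returns false on bad input instead of aborting.
static bool set_tensor_checked(fake_buffer & buf, uint64_t offset,
                               const uint8_t * data, uint64_t size) {
    if (!in_bounds(buf, offset, size)) {
        std::fprintf(stderr, "[%s] write of %llu bytes at offset %llu is out of bounds\n",
                     __func__, (unsigned long long) size, (unsigned long long) offset);
        return false; // the caller can drop the connection; the process survives
    }
    if (size > 0) {
        std::memcpy(buf.storage.data() + offset, data, size);
    }
    return true;
}
```

Comparing `size` against `total - offset` instead of computing `offset + size` keeps the check valid even when a client sends values close to the 64-bit maximum.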

lexasub · 2025-04-22T21:18:39Z

In my opinion this may affect performance; maybe put it behind a feature flag (via CMake)?

thevilledev · 2025-04-23T18:07:59Z

> In my opinion this may affect performance; maybe put it behind a feature flag (via CMake)?

I believe it would be interesting to see what the performance impact of this change is. I'm new to the project so pointers welcome if there's a test suite available which would show that.

Slightly off-topic but related: I think there's plenty of opportunities for similar improvements in the RPC server. These range from invalid tensor operations to crashes via deep recursion in create_node, which I would also like to fix. I'd like to work on those one change at a time though.

I think putting multiple critical fixes behind a feature flag would be counterintuitive. I'd rather build benchmark tooling (if needed) and iterate on the fixes so the performance hit is minimal.

rgerganov · 2025-04-24T08:45:39Z

> I believe it would be interesting to see what the performance impact of this change is. I'm new to the project so pointers welcome if there's a test suite available which would show that.

We use llama-bench to test performance

> Slightly off-topic but related: I think there's plenty of opportunities for similar improvements in the RPC server.

The best investment of efforts in this direction would be creating a script/job for coverage guided fuzzing. This way we can automatically test for security issues when we make RPC changes and even integrate it into the CI.
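As a rough idea of what a coverage-guided fuzzing target could look like, here is a minimal libFuzzer-style harness. It is only a sketch under stated assumptions: `fake_set_tensor_header`, `FAKE_BUFFER_SIZE`, and `handle_fake_set_tensor` are hypothetical and do not reflect the real RPC wire format or server API. Only the `LLVMFuzzerTestOneInput` entry point is the standard libFuzzer interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical fixed-size header of a SET_TENSOR-like message; not the real wire format.
struct fake_set_tensor_header {
    uint64_t offset;
    uint64_t size;
};

// Hypothetical size of the destination buffer the handler would write into.
static const uint64_t FAKE_BUFFER_SIZE = 1024;

// Hypothetical handler under test: it must reject malformed input by
// returning false and must never abort or read out of bounds.
static bool handle_fake_set_tensor(const uint8_t * data, size_t len) {
    fake_set_tensor_header hdr;
    if (len < sizeof(hdr)) {
        return false; // message too short to contain a header
    }
    std::memcpy(&hdr, data, sizeof(hdr));
    if (hdr.size != len - sizeof(hdr)) {
        return false; // declared payload size must match the actual payload
    }
    if (hdr.offset > FAKE_BUFFER_SIZE || hdr.size > FAKE_BUFFER_SIZE - hdr.offset) {
        return false; // write would fall outside the destination buffer
    }
    return true;
}

// libFuzzer entry point: called repeatedly with mutated inputs.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t * data, size_t size) {
    (void) handle_fake_set_tensor(data, size);
    return 0;
}
```

With clang, a harness like this is typically built with `-fsanitize=fuzzer,address`, so that any crash, abort, or out-of-bounds access triggered by a generated input is reported together with the input that caused it.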

slaren · 2025-04-24T10:48:56Z

RPC_CMD_GET_ALLOC_SIZE does not check for errors, and if the call to get_alloc_size fails it will leave the client connected in a bad state:

llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp

Line 1470 in 604f0a0

server.get_alloc_size(request, response);
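The kind of check being asked for here can be sketched as follows. This is a simplified model, not the real server loop: `fake_server` and `serve_connection` are hypothetical, and the request and response types are reduced to plain integers. It only shows the control flow of stopping as soon as a handler reports failure.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical server whose handler reports failure through its return value.
struct fake_server {
    bool get_alloc_size(uint64_t request, uint64_t & response) {
        if (request == 0) {
            return false; // stands in for a request that failed validation
        }
        response = request * 2; // arbitrary placeholder result
        return true;
    }
};

// Hypothetical command loop: returns false as soon as a handler fails, so the
// caller can close the connection instead of continuing in a bad state.
static bool serve_connection(fake_server & server,
                             const std::vector<uint64_t> & requests,
                             std::vector<uint64_t> & responses) {
    for (uint64_t request : requests) {
        uint64_t response = 0;
        if (!server.get_alloc_size(request, response)) {
            return false; // stop serving; no response is sent for the bad request
        }
        responses.push_back(response);
    }
    return true;
}
```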

thevilledev · 2025-04-24T17:51:56Z

Thanks @slaren, I added it to this PR since it falls under the same scope (e6dd976).

thevilledev · 2025-04-24T17:53:10Z

> The best investment of efforts in this direction would be creating a script/job for coverage guided fuzzing.

Sounds good, I can look into that after this PR 👍

The `rpc-server` was vulnerable to Denial of Service attacks via several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed messages could trigger failed assertions (e.g., invalid `ggml_type`) or out-of-bounds reads/writes leading to `GGML_ABORT` calls, crashing the server process.

This PR introduces robust input validation and replaces `abort()` calls with graceful error handling:

- **Type Validation:** `deserialize_tensor` now checks if the `tensor->type` is within the valid `GGML_TYPE_COUNT` range *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`, `set_tensor_hash`, and `get_tensor` handlers with error logging and returning `false` when data/offset parameters are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in `graph_compute` when calculating required message sizes based on client-provided `n_nodes` and `n_tensors`. Returns early if the reported sizes conflict with the actual message size or would lead to overflow.
- **Error Propagation:**
  - `create_node` now checks for `nullptr` return values from `deserialize_tensor` and its recursive calls, propagating `nullptr` upwards on failure. Uses `find` instead of `at` for safer map access.
  - `copy_tensor` now checks for `nullptr` from `deserialize_tensor` and sets the response status to failure if deserialization or bounds checks fail.
  - `graph_compute` now checks for `nullptr` return from `create_node` and returns failure status correctly. The final return value now reflects the actual computation status.

These changes improve the RPC server's resilience against malformed client requests, preventing crashes and ensuring errors are handled more gracefully.

Signed-off-by: Ville Vesilehto <[email protected]>
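The "Size Checks" item in the commit message above can be illustrated with a small sketch of overflow-checked size arithmetic. The helper names (`checked_mul`, `checked_add`, `graph_message_size_ok`) are hypothetical, and the message layout in the comment is an assumption made for the example rather than a statement about the real protocol.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Stores a * b in out; returns false if the product would not fit in uint64_t.
static bool checked_mul(uint64_t a, uint64_t b, uint64_t & out) {
    if (a != 0 && b > std::numeric_limits<uint64_t>::max() / a) {
        return false;
    }
    out = a * b;
    return true;
}

// Stores a + b in out; returns false if the sum would not fit in uint64_t.
static bool checked_add(uint64_t a, uint64_t b, uint64_t & out) {
    if (b > std::numeric_limits<uint64_t>::max() - a) {
        return false;
    }
    out = a + b;
    return true;
}

// Assumed layout (illustration only):
// | n_nodes (u32) | n_nodes ids (u64 each) | n_tensors (u32) | n_tensors records |
// Returns true only if the client-reported counts match the real message size.
static bool graph_message_size_ok(uint64_t actual_size, uint64_t n_nodes,
                                  uint64_t n_tensors, uint64_t tensor_record_size) {
    uint64_t nodes_bytes   = 0;
    uint64_t tensors_bytes = 0;
    uint64_t total         = 2 * sizeof(uint32_t); // the two count fields
    if (!checked_mul(n_nodes, sizeof(uint64_t), nodes_bytes))       return false;
    if (!checked_mul(n_tensors, tensor_record_size, tensors_bytes)) return false;
    if (!checked_add(total, nodes_bytes, total))                    return false;
    if (!checked_add(total, tensors_bytes, total))                  return false;
    return total == actual_size;
}
```

When the counts are 32-bit values widened to 64 bits before multiplying by a small record size, the products cannot overflow, which is in line with the later commit that drops the step-by-step checks in favor of direct comparisons.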

rpc_server::create_node could previously return nullptr if the input ID was 0 (valid) or if an internal error (deserialization, recursion failure) occurred (invalid). This ambiguity made error handling difficult for the caller (`graph_compute`). This commit clarifies the meaning of nullptr:

- `graph_compute` now checks if the input 'id' was non-zero when `create_node` returns nullptr, correctly identifying failures versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates nullptr unambiguously on failure during recursion.

Signed-off-by: Ville Vesilehto <[email protected]>
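A simplified model of the nullptr disambiguation described in this commit message is sketched below. The types `fake_record` and `fake_node` and the function `build_graph` are hypothetical; only the idea (a zero ID is an intentional null link, a nullptr for a non-zero ID is a failure) comes from the commit message. The sketch does not guard against cyclic records.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct fake_record { uint64_t src_id; };            // serialized form; 0 means "no source"
struct fake_node   { uint64_t id; fake_node * src; };

// Returns nullptr when the id is unknown or when building a source fails.
// Note: cyclic records are not handled in this sketch.
static fake_node * create_node(uint64_t id,
                               const std::unordered_map<uint64_t, fake_record> & records,
                               std::unordered_map<uint64_t, fake_node> & nodes) {
    auto built = nodes.find(id);
    if (built != nodes.end()) {
        return &built->second; // already constructed
    }
    auto it = records.find(id); // find, not at(): an unknown id does not throw
    if (it == records.end()) {
        return nullptr;
    }
    fake_node * src = nullptr;
    if (it->second.src_id != 0) { // no recursive call for zero ids
        src = create_node(it->second.src_id, records, nodes);
        if (src == nullptr) {
            return nullptr; // propagate the failure upwards
        }
    }
    auto inserted = nodes.emplace(id, fake_node{id, src});
    return &inserted.first->second;
}

// Caller side: nullptr is an error only when the requested id was non-zero.
static bool build_graph(const std::vector<uint64_t> & ids,
                        const std::unordered_map<uint64_t, fake_record> & records,
                        std::unordered_map<uint64_t, fake_node> & nodes,
                        std::vector<fake_node *> & out) {
    for (uint64_t id : ids) {
        fake_node * node = create_node(id, records, nodes);
        if (node == nullptr && id != 0) {
            return false; // real failure, not an intentional null link
        }
        out.push_back(node);
    }
    return true;
}
```

Because the caller already tests `id != 0`, `create_node` itself does not need a separate early return for a zero ID: the lookup simply misses and the resulting nullptr is interpreted as a null link.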

The caller (`graph_compute`) already checks `id != 0` when handling a `nullptr` return from `create_node`, correctly distinguishing intentional null links from actual errors. This makes the initial `if (id == 0)` check redundant. Also removes the log message when a tensor ID is not found in the provided map which was added in this branch. Signed-off-by: Ville Vesilehto <[email protected]>

Removes the explicit setting of `response.result = GGML_STATUS_FAILED` when `create_node` returns `nullptr` within `graph_compute`. The primary failure signal is the `false` return value. Signed-off-by: Ville Vesilehto <[email protected]>

rgerganov

This looks fine to me now, thanks for addressing my comments. I also did some tests and didn't find performance or functional regressions.

@slaren could you please review as well?

slaren · 2025-04-28T16:39:55Z

ggml/src/ggml-rpc/ggml-rpc.cpp

    ggml_tensor * result = ggml_new_tensor_4d(ctx, (ggml_type) tensor->type,
        tensor->ne[0], tensor->ne[1], tensor->ne[2], tensor->ne[3]);
+
+    // ggml_new_tensor_4d might fail if dimensions are invalid, although less likely to crash than invalid type


If ggml_new_tensor fails it will crash; it will not return NULL. The check is still good for future-proofing, but the comment is misleading.

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Apr 22, 2025

rgerganov reviewed Apr 23, 2025

thevilledev force-pushed the fix/tensor-ggml-type branch from bef194d to 604f0a0 on April 23, 2025 18:00

thevilledev force-pushed the fix/tensor-ggml-type branch from e6dd976 to 359e38e on April 26, 2025 06:01

thevilledev added 5 commits April 26, 2025 09:03

refactor(rpc): address pr comments

cd054aa

removed comments and unnecessary returns Signed-off-by: Ville Vesilehto <[email protected]>

fix(rpc): Handle get_alloc_size failure in server

e38c4d7

Check the return value of `server.get_alloc_size` in the RPC server loop. If the call fails, return early to close the connection. Signed-off-by: Ville Vesilehto <[email protected]>

refactor(rpc): input size validation in graph_compute

72c447a

Removes detailed, step-by-step size calculations and overflow checks in favor of simpler direct comparisons, assuming 64-bit overflow is unlikely. Signed-off-by: Ville Vesilehto <[email protected]>

thevilledev force-pushed the fix/tensor-ggml-type branch from 359e38e to 72c447a on April 26, 2025 06:04

rgerganov reviewed Apr 27, 2025

thevilledev added 2 commits April 27, 2025 22:05

refactor(rpc): remove redundant check for tensor->type

099b835

Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus the check is not needed. Signed-off-by: Ville Vesilehto <[email protected]>

rgerganov approved these changes Apr 28, 2025

slaren approved these changes Apr 28, 2025

rgerganov merged commit 43ddab6 into ggml-org:master Apr 28, 2025
48 checks passed
