fix(rpc): Improve input validation and error handling #13069
In my opinion this may affect performance; maybe put it behind a feature flag (via CMake)?
bef194d to
604f0a0
Compare
I believe it would be interesting to see what the performance impact of this change is. I'm new to the project, so pointers are welcome if there's a test suite available which would show that.
Slightly off-topic but related: I think there are plenty of opportunities for similar improvements in the RPC server. From invalid tensor operations to crashing via deep recursion in
I think multiple critical fixes behind a feature flag would be counterintuitive. Rather, build bench tooling (if needed) and iterate on the fixes so there's minimal performance hit.
We use
The best investment of effort in this direction would be to create a script/job for coverage-guided fuzzing. This way we can automatically test for security issues when we make RPC changes and even integrate it into the CI.
llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp Line 1470 in 604f0a0
Sounds good, I can look into that after this PR 👍
The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.
This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:
- **Type Validation:** `deserialize_tensor` now checks if the
`tensor->type` is within the valid `GGML_TYPE_COUNT` range
*before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
`set_tensor_hash`, and `get_tensor` handlers with error
logging and returning `false` when data/offset parameters
are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
`graph_compute` when calculating required message sizes based
on client-provided `n_nodes` and `n_tensors`. Returns early
if the reported sizes conflict with the actual message size or
would lead to overflow.
- **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from
      `deserialize_tensor` and its recursive calls, propagating
      `nullptr` upwards on failure. Uses `find` instead of `at`
      for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
      and sets the response status to failure if deserialization
      or bounds checks fail.
    - `graph_compute` now checks for `nullptr` return from
      `create_node` and returns failure status correctly. The final
      return value now reflects the actual computation status.
These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.
Signed-off-by: Ville Vesilehto <[email protected]>
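The bounds-check bullet above can be illustrated with a minimal sketch. It is not the code from this PR: `demo_buffer`, `range_in_bounds`, and `demo_set_tensor` are hypothetical names, and the buffer is a plain byte vector, not a ggml backend buffer. It only shows the general pattern of rejecting an out-of-range request with `false` where an abort used to be.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a backend buffer; not the ggml type.
struct demo_buffer {
    std::vector<uint8_t> bytes;
};

// Returns true when [offset, offset + size) lies inside a buffer of `capacity` bytes.
// Written with a subtraction so that `offset + size` is never computed and cannot wrap.
static bool range_in_bounds(uint64_t offset, uint64_t size, uint64_t capacity) {
    return offset <= capacity && size <= capacity - offset;
}

// Copies client data into the buffer. A bad request yields `false`
// (the caller can log it and drop the connection) and never aborts the process.
static bool demo_set_tensor(demo_buffer & buf, uint64_t offset,
                            const uint8_t * data, uint64_t size) {
    if (!range_in_bounds(offset, size, buf.bytes.size())) {
        return false;
    }
    if (size > 0) {
        std::memcpy(buf.bytes.data() + offset, data, static_cast<size_t>(size));
    }
    return true;
}
```

Comparing `size` against `capacity - offset` keeps the check correct even when a client sends an offset close to `UINT64_MAX`.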
e6dd976 to
359e38e
Compare
removed comments and unnecessary returns
Signed-off-by: Ville Vesilehto <[email protected]>
rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).
This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
  `create_node` returns nullptr, correctly identifying failures
  versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
  nullptr unambiguously on failure during recursion.
Signed-off-by: Ville Vesilehto <[email protected]>
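A small sketch may help show the convention this commit describes: `nullptr` is a valid result for ID 0 and an error for any other ID. The types and functions below (`demo_tensor`, `demo_node`, `demo_create_node`, `demo_build_graph`) are hypothetical simplifications, each tensor has a single source link, and cyclic links are not guarded against.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical minimal types; the real server works with ggml tensors.
struct demo_tensor { uint64_t id; uint64_t src_id; };
struct demo_node   { uint64_t id; demo_node * src; };

// Returns nullptr both for id == 0 (an intentional "no tensor" link) and for
// failures; the caller decides which one it was by looking at `id`.
// Note: this sketch does not protect against cyclic src links.
static demo_node * demo_create_node(uint64_t id,
                                    const std::unordered_map<uint64_t, demo_tensor> & tensors,
                                    std::unordered_map<uint64_t, demo_node> & nodes) {
    if (id == 0) {
        return nullptr;                  // valid null link, no recursion
    }
    auto done = nodes.find(id);
    if (done != nodes.end()) {
        return &done->second;            // already built
    }
    auto it = tensors.find(id);          // find() does not throw on unknown IDs
    if (it == tensors.end()) {
        return nullptr;                  // failure: unknown tensor ID
    }
    demo_node * src = nullptr;
    if (it->second.src_id != 0) {
        src = demo_create_node(it->second.src_id, tensors, nodes);
        if (src == nullptr) {
            return nullptr;              // propagate failure from the recursion
        }
    }
    auto inserted = nodes.emplace(id, demo_node{id, src});
    return &inserted.first->second;
}

// Caller side: nullptr is an error only when the requested ID was non-zero.
static bool demo_build_graph(const std::vector<uint64_t> & ids,
                             const std::unordered_map<uint64_t, demo_tensor> & tensors,
                             std::unordered_map<uint64_t, demo_node> & nodes) {
    for (uint64_t id : ids) {
        demo_node * node = demo_create_node(id, tensors, nodes);
        if (node == nullptr && id != 0) {
            return false;
        }
    }
    return true;
}
```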
The caller (`graph_compute`) already checks `id != 0` when handling a `nullptr` return from `create_node`, correctly distinguishing intentional null links from actual errors. This makes the initial `if (id == 0)` check redundant.
Also removes the log message when a tensor ID is not found in the provided map which was added in this branch.
Signed-off-by: Ville Vesilehto <[email protected]>
Check the return value of `server.get_alloc_size` in the RPC server loop. If the call fails, return early to close the connection.
Signed-off-by: Ville Vesilehto <[email protected]>
Removes detailed, step-by-step size calculations and overflow checks in favor of simpler direct comparisons, assuming 64-bit overflow is unlikely.
Signed-off-by: Ville Vesilehto <[email protected]>
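For context, the kind of size validation discussed in this commit can be sketched as follows. The message layout (a 32-bit node count, that many 64-bit IDs, a 32-bit tensor count, then fixed-size tensor records) is an assumption made for illustration, as are the name `demo_graph_message_ok` and the `record_size` parameter. The sketch sidesteps overflow by dividing the remaining byte count where a naive check would multiply client-provided counts.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed layout, for illustration only:
// | n_nodes (u32) | n_nodes * u64 IDs | n_tensors (u32) | n_tensors * record |
// `record_size` is a parameter because the real tensor record type is not shown here.
static bool demo_graph_message_ok(const std::vector<uint8_t> & input, size_t record_size) {
    size_t remaining = input.size();
    const uint8_t * p = input.data();

    uint32_t n_nodes = 0;
    if (remaining < sizeof(n_nodes)) {
        return false;
    }
    std::memcpy(&n_nodes, p, sizeof(n_nodes));
    p += sizeof(n_nodes);
    remaining -= sizeof(n_nodes);

    // Divide the bytes that are left so no client-controlled product is formed
    // before it has been validated.
    if (n_nodes > remaining / sizeof(uint64_t)) {
        return false;
    }
    const size_t nodes_bytes = static_cast<size_t>(n_nodes) * sizeof(uint64_t);
    p += nodes_bytes;
    remaining -= nodes_bytes;

    uint32_t n_tensors = 0;
    if (remaining < sizeof(n_tensors)) {
        return false;
    }
    std::memcpy(&n_tensors, p, sizeof(n_tensors));
    remaining -= sizeof(n_tensors);

    if (record_size == 0 || n_tensors > remaining / record_size) {
        return false;
    }
    return true;
}
```

With 32-bit counts and 64-bit sizes the products cannot realistically wrap, which is consistent with the commit's reasoning for preferring direct comparisons; the division form is shown only because it stays correct regardless of the integer widths.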
359e38e to
72c447a
Compare
Removes the explicit setting of `response.result = GGML_STATUS_FAILED` when `create_node` returns `nullptr` within `graph_compute`. The primary signal in case of failure is the `false` return value.
Signed-off-by: Ville Vesilehto <[email protected]>
Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus the check is not needed.
Signed-off-by: Ville Vesilehto <[email protected]>
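The commit does not say which check was removed. One plausible reading, stated here as an assumption, is that a lower-bound comparison on an unsigned field is always false and can trigger a compiler warning that a strict CI build treats as an error. The sketch below uses a hypothetical `demo_type` enum in place of `ggml_type` and shows that only the upper bound needs validating for a `uint32_t` wire value.

```cpp
#include <cstdint>

// Hypothetical enum standing in for ggml_type; only the trailing COUNT sentinel matters.
enum demo_type : int {
    DEMO_TYPE_F32 = 0,
    DEMO_TYPE_F16 = 1,
    DEMO_TYPE_Q4_0 = 2,
    DEMO_TYPE_COUNT
};

// The wire field is unsigned, so it can never be negative: checking the
// upper bound is sufficient before casting to the enum.
static bool demo_type_is_valid(uint32_t wire_type) {
    return wire_type < static_cast<uint32_t>(DEMO_TYPE_COUNT);
}
```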
This looks fine to me now, thanks for addressing my comments. I also did some tests and didn't find performance or functional regressions.
@slaren could you please review as well?
ggml_tensor * result = ggml_new_tensor_4d(ctx, (ggml_type) tensor->type,
    tensor->ne[0], tensor->ne[1], tensor->ne[2], tensor->ne[3]);

// ggml_new_tensor_4d might fail if dimensions are invalid, although less likely to crash than invalid type
If `ggml_new_tensor` fails it will crash; it will not return NULL. The check is still good for future-proofing, but the comment is misleading.
Fixes #13067
The `rpc-server` was vulnerable to Denial of Service attacks via several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed messages could trigger failed assertions (e.g., invalid `ggml_type`) or out-of-bounds reads/writes leading to `GGML_ABORT` calls, crashing the server process.
This PR introduces robust input validation and replaces `abort()` calls with graceful error handling:
- `deserialize_tensor` now checks if the `tensor->type` is within the valid `GGML_TYPE_COUNT` range before calling `ggml_new_tensor_4d`. Returns `nullptr` on invalid type.
- Replaced `GGML_ABORT` in `set_tensor`, `set_tensor_hash`, and `get_tensor` handlers with error logging and returning `false` when data/offset parameters are out of buffer bounds.
- `create_node` now checks for `nullptr` return values from `deserialize_tensor` and its recursive calls, propagating `nullptr` upwards on failure. Uses `find` instead of `at` for safer map access.
- `copy_tensor` now checks for `nullptr` from `deserialize_tensor` and sets the response status to failure if deserialization or bounds checks fail.
- `graph_compute` now checks for `nullptr` return from `create_node` and returns failure status correctly. The final return value now reflects the actual computation status.
- `RPC_CMD_GET_ALLOC_SIZE` now checks the return value of `server.get_alloc_size` in the RPC server loop. If the call fails, return early to close the connection.