
ggml-rpc: chunk send()/recv() to avoid EINVAL for very large tensors over RPC (macOS & others) #15188


Open: Tak-RS wants to merge 4 commits into master from fix/rpc-chunked-io

Conversation

@Tak-RS Tak-RS commented Aug 9, 2025

Fixes #15055

This PR prevents send()/recv() from being called with extremely large buffers during RPC tensor transfer by chunking I/O into 1 GiB pieces. On macOS this avoids intermittent EINVAL errors that previously caused the client to abort when offloading very large models via RPC.

What’s the symptom?
Loading very large GGUFs via RPC would fail with:

client: send: Invalid argument

server: recv: Invalid argument

Reproduced with DeepSeek-R1-0528-* and Qwen3-480B-* models at large quantizations.

Single-node load (no RPC) worked fine.

Root cause (observed)
The OS limits how large a single send()/recv() buffer may be, but very large tensors were transmitted in one call. Splitting the transfer into smaller chunks resolves the issue.

Changes
Add RPC_IO_CHUNK = 1 GiB.

Update send_data() and recv_data() to loop with chunked I/O (sketched below).

Keep existing error logging; behavior is otherwise unchanged.
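
In outline, the fix turns each oversized send() into a capped loop. Below is a minimal sketch of the send side using this PR's initial identifiers (sockfd_t is the file's socket typedef; the review below renames RPC_IO_CHUNK to MAX_CHUNK_SIZE and settles the std::min cap). Error reporting is reduced to the bare minimum here:

// sketch only; requires <algorithm> for std::min plus the platform socket headers
static constexpr size_t RPC_IO_CHUNK = 1024ull * 1024ull * 1024ull; // 1 GiB

static bool send_data(sockfd_t sockfd, const void * data, size_t size) {
    size_t bytes_sent = 0;
    while (bytes_sent < size) {
        // cap each send() so a single syscall never exceeds the OS limit
        size_t size_to_send = std::min(size - bytes_sent, RPC_IO_CHUNK);
        ssize_t n = send(sockfd, (const char *)data + bytes_sent, size_to_send, 0);
        if (n < 0) {
            return false; // hard error; caller logs and aborts the transfer
        }
        bytes_sent += (size_t)n;
    }
    return true;
}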

Why 1 GiB?
Empirically below the limits that triggered EINVAL on macOS.

Large enough to keep throughput good; easy to tune later if needed.

Testing
macOS (Metal): Previously failing large-model RPC offload now completes. Inference runs.

macOS (non-Metal): Build + basic RPC transfer OK.

Linux/Ubuntu: Not tested yet. Relying on CI and maintainer validation. (Happy to test on request; I can also try Docker later.)

Known quirk (non-blocking)
I still see an occasional non-fatal recv: Invalid argument before the big tensor transfer starts, but the run proceeds and finishes. I suspect a minor size-field mismatch during early handshake. If useful, I can follow up with a tiny patch that always serializes message lengths as uint64_t on the wire.
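
For illustration only, fixed-width length framing could look like the sketch below. The helper names are hypothetical (not part of this PR), and it assumes both endpoints share endianness, as the existing RPC code already does:

// hypothetical helpers: always put message lengths on the wire as a
// fixed-width uint64_t so client and server agree on the field's width
static bool send_msg_size(sockfd_t sockfd, uint64_t size) {
    return send_data(sockfd, &size, sizeof(size)); // exactly 8 bytes
}

static bool recv_msg_size(sockfd_t sockfd, uint64_t & size) {
    return recv_data(sockfd, &size, sizeof(size)); // exactly 8 bytes
}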

Performance / compatibility
No API changes.

Chunking is per-call looped send/recv; negligible overhead in my tests vs. “one big send”.

Should be safe across platforms.

Thanks!


Update: verified the cross-OS direction (Linux client to macOS RPC server) as well.

Additional testing

  • Client: Ubuntu 22.04 (glibc), clang/gcc build, commit 0e7aa4e
  • RPC server: macOS (Apple M3 Ultra, 512 GB RAM, Metal enabled)
  • llama.cpp build: Release
  • Model: DeepSeek-R1-0528-Q4_K_M (GGUF format, very large tensor size)
  • Command (client):
    ./build/bin/llama-server \
      -m /path/to/DeepSeek-R1-0528-Q4_K_M.gguf \
      --rpc :50052 -c 3000
  • Command (server):
    ./build/bin/rpc-server -p 50052 --host

Result

  • Large tensor offload succeeds end-to-end. Inference runs normally.
  • No client/server aborts observed.
  • Occasionally still see a non-fatal recv: Invalid argument before the first large transfer; run proceeds normally.

Notes

  • The chunked I/O change fixes the main crash.
  • If you prefer, I can follow up with a tiny patch to always serialize message length fields as uint64_t to silence the early handshake warning.

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Aug 9, 2025
@rgerganov rgerganov (Collaborator) left a comment

Thank you for the bug report and the patch

@@ -32,6 +32,8 @@

 namespace fs = std::filesystem;

+static constexpr size_t RPC_IO_CHUNK = 1024ull * 1024ull * 1024ull; // 1 GiB
Collaborator:

rename to MAX_CHUNK_SIZE

@@ -323,27 +325,43 @@ static std::shared_ptr<socket_t> create_server_socket(const char * host, int por
 static bool send_data(sockfd_t sockfd, const void * data, size_t size) {
     size_t bytes_sent = 0;
     while (bytes_sent < size) {
-        ssize_t n = send(sockfd, (const char *)data + bytes_sent, size - bytes_sent, 0);
+        size_t size_to_send = size - bytes_sent;
Collaborator:

size_t size_to_send = std::max(size - bytes_sent, MAX_CHUNK_SIZE);

Collaborator:

I mean std::min, not std::max, sorry

        if (n < 0) {
#ifndef _WIN32
Collaborator:

replace with GGML_LOG_ERROR
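
For example, the perror/fprintf-style reporting might become something like this sketch (the message text is illustrative; GGML_LOG_ERROR is ggml's printf-style logging macro):

// sketch: report the failed chunk with ggml's logging macro instead of perror
GGML_LOG_ERROR("%s: send failed (bytes_sent=%zu, size_to_send=%zu)\n",
               __func__, bytes_sent, size_to_send);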

@rgerganov (Collaborator) commented:

> I still see an occasional non-fatal recv: Invalid argument before the big tensor transfer starts, but the run proceeds and finishes. I suspect a minor size-field mismatch during early handshake. If useful, I can follow up with a tiny patch that always serializes message lengths as uint64_t on the wire.

Please follow up on how to reproduce this, thanks

…, switch to GGML_LOG_ERROR, handle 0-length send/recv
@Tak-RS Tak-RS force-pushed the fix/rpc-chunked-io branch from 514a5ff to 829d6b6 on August 11, 2025 14:49
@Tak-RS Tak-RS (Author) commented Aug 11, 2025

Thank you for the review and suggestions!

Applied the requested changes:

  • Renamed RPC_IO_CHUNK to MAX_CHUNK_SIZE
  • Switched error logging from perror/fprintf to GGML_LOG_ERROR
  • Corrected chunk size calculation to use std::min(size - bytes_sent, MAX_CHUNK_SIZE) (cap instead of max)
  • Same fix applied in recv_data()
  • Added a check to treat 0-length send/recv as an error

Please let me know if you’d like further changes.

                bytes_sent, size_to_send);
            return false;
        }
        if (n == 0) {
Collaborator:

why do we need this special case for n == 0? if zero bytes are sent, then we should retry again until we send everything or an error occurs (n < 0)

Author:

Got it, I'll remove the special case for n == 0 in send_data() and just retry in the loop as suggested.

@Tak-RS Tak-RS (Author) commented Aug 12, 2025

Thanks — removed the n == 0 special case in send_data(). recv() is unchanged as n == 0 correctly indicates a closed connection there.
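
For context, the asymmetry comes from the socket API itself: send() returning 0 means nothing was written yet and the call can simply be retried, while recv() returning 0 means the peer closed the connection. A sketch of the receive side under that convention, using the post-review identifiers:

// chunked receive loop: n == 0 from recv() signals a closed connection
// and is treated as an error, unlike send() where 0 just means retry
static bool recv_data(sockfd_t sockfd, void * data, size_t size) {
    size_t bytes_recv = 0;
    while (bytes_recv < size) {
        size_t size_to_recv = std::min(size - bytes_recv, MAX_CHUNK_SIZE);
        ssize_t n = recv(sockfd, (char *)data + bytes_recv, size_to_recv, 0);
        if (n <= 0) {
            return false; // 0: peer closed the connection; < 0: error
        }
        bytes_recv += (size_t)n;
    }
    return true;
}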

@rgerganov rgerganov requested a review from slaren August 12, 2025 15:09
         return false;
     }
-    bytes_sent += n;
+    bytes_sent += (size_t)n;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the trailing whitespace

Labels: ggml (changes relating to the ggml tensor library for machine learning)

Successfully merging this pull request may close these issues:

Eval bug: Crash when offloading large models via RPC if model size exceeds ~75% of server RAM