Description
Name and Version
llama.cpp version: build: 6031 (00131d6)
Operating systems
Mac
GGML backends
RPC
Hardware
Client & RPC server
OS: macOS (darwin24.3.0)
Hardware: Apple M3 Ultra, 512 GB RAM
Compiler: Apple clang version 17.0.0 (clang-1700.0.13.5)
Models
The model DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-Q4_0.gguf (379.55 GB) frequently results in errors, while the slightly smaller model DeepSeek-R1-0528-IQ4_NL (378.58 GB) runs without issues.
Similarly, the model DeepSeek-R1-0528-Q4_K_S (380.51 GB) also causes errors.
Larger models such as Q4_K_XL and Q5_K_S can run inference successfully on a single node, but crash when their tensors are offloaded to an RPC server.
Problem description & steps to reproduce
# Describe the bug
When attempting to offload a very large language model (LLM) to a remote RPC server using llama-server, the client process crashes with an abort signal.
The crash occurs consistently when the size of the model tensors being offloaded to the RPC server exceeds approximately 75% of the total physical RAM available on that server. When using models or tensor splits that stay under this ~75% threshold, the offloading process is stable and works as expected.
The final error message from the client before it aborts is: Remote RPC server crashed or returned malformed response.
Crucially, the RPC server process itself keeps running. This suggests the issue is not a fatal crash on the server but a communication failure between the client and a server that is likely under heavy memory pressure: the client may be misinterpreting a timeout or a dropped connection as a server crash.
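I have not bisected the client code, but the abort originates at ggml-rpc.cpp:579 (see the log below). As a rough illustration of why a memory-pressured server could be reported as "crashed", here is a minimal sketch of the kind of blocking receive path an RPC client typically uses; recv_exact and recv_response_or_abort are hypothetical names for illustration, not the actual ggml-rpc.cpp functions:

```cpp
// Illustrative sketch only; not the actual ggml-rpc.cpp code.
// Shows how a blocking receive loop that collapses every short read
// into one failure path can report a slow or memory-pressured server
// as a crash on the client side.
#include <sys/types.h>
#include <sys/socket.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: read exactly `size` bytes or fail.
static bool recv_exact(int sockfd, void * data, size_t size) {
    size_t received = 0;
    while (received < size) {
        // recv() returns 0 on orderly shutdown and -1 on error
        // (including a timeout if SO_RCVTIMEO is set on the socket);
        // both outcomes look identical to the caller.
        ssize_t n = recv(sockfd, (char *) data + received, size - received, 0);
        if (n <= 0) {
            return false;
        }
        received += (size_t) n;
    }
    return true;
}

// Hypothetical response reader: any failure, whatever its cause,
// is reported with the same message and aborts the process.
static void recv_response_or_abort(int sockfd, void * out, size_t size) {
    uint64_t msg_size = 0;
    if (!recv_exact(sockfd, &msg_size, sizeof(msg_size)) ||
        msg_size != size ||
        !recv_exact(sockfd, out, size)) {
        // A timeout, a dropped connection, and a genuine server crash
        // are indistinguishable at this point.
        fprintf(stderr, "Remote RPC server crashed or returned malformed response\n");
        abort();
    }
}
```

If the real code follows a similar pattern, a server that stalls or drops the connection while paging would produce exactly the observed client-side abort without the server process ever dying.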
# To Reproduce
1. Set up an RPC server (rpc-server) on one machine.
2. On a separate client machine (macOS in my case), run llama-server with the --rpc flag pointing at that server.
3. Use a large GGUF model, and configure the tensor split (e.g., via --tensor-split) so that the portion of tensors assigned to the RPC server exceeds ~75% of the server's total RAM.
4. The llama-server on the client will start loading the model and begin offloading tensors.
5. During the tensor loading for the RPC server, the client crashes (see the server-side sketch after these steps).
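The server log below also shows "Null buffer for tensor passed to init_tensor function" around each failed attempt, while the server keeps accepting connections. A plausible shape for such a guard (purely illustrative, not the actual rpc-server source) is consistent with the server surviving while the client aborts:

```cpp
// Illustrative sketch only; not the actual rpc-server source.
#include <cstdio>

struct rpc_tensor_request {
    void * buffer;   // backend buffer the tensor should be placed in
    // ... other deserialized fields ...
};

// Hypothetical handler shape: on a null buffer (e.g. because an
// earlier allocation failed under memory pressure) the server logs
// the message seen in the log below and rejects the request, but
// the process itself keeps serving connections.
static bool handle_init_tensor(const rpc_tensor_request & req) {
    if (req.buffer == nullptr) {
        fprintf(stderr, "Null buffer for tensor passed to init_tensor function\n");
        return false;  // error reply / connection close; server stays up
    }
    // ... initialize the tensor inside the backend buffer ...
    return true;
}
```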
First Bad Commit
No response
Relevant log output
bin/llama-server -m /path/to/models/unsloth/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-Q4_0-00001-of-00008.gguf --rpc 192.168.2.2:50052 -c 3000 --port 8081
build: 6031 (00131d6e) with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.3.0
system info: n_threads = 24, n_threads_batch = 24, total_threads = 32
system_info: n_threads = 24 (n_threads_batch = 24) / 32 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | ACCELERATE = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8081, http threads: 31
main: loading model
srv load_model: loading model '/path/to/models/unsloth/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-Q4_0-00001-of-00008.gguf'
llama_model_load_from_file_impl: using device RPC[192.168.2.2:50052] (RPC[192.168.2.2:50052]) - 393210 MiB free
llama_model_load_from_file_impl: using device Metal (Apple M3 Ultra) - 475135 MiB free
llama_model_loader: additional 7 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 63 key-value pairs and 1086 tensors from DeepSeek-R1-0528-Q4_0-00001-of-00008.gguf (version GGUF V3 (latest))
(...)
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors: Metal_Mapped model buffer size = 724.99 MiB
load_tensors: Metal_Mapped model buffer size = 24856.66 MiB
load_tensors: Metal_Mapped model buffer size = 47364.20 MiB
load_tensors: Metal_Mapped model buffer size = 47465.58 MiB
load_tensors: Metal_Mapped model buffer size = 47472.61 MiB
load_tensors: Metal_Mapped model buffer size = 30826.90 MiB
load_tensors: CPU_Mapped model buffer size = 497.11 MiB
load_tensors: RPC[192.168.2.2:50052] model buffer size = 162749.85 MiB
.......................................................
ggml-rpc.cpp:579: Remote RPC server crashed or returned malformed response
(lldb) process attach --pid 18079
error: attach failed: attach failed (Not allowed to attach to process. Look in the console messages (Console.app), near the debugserver entries, when the attach failed. The subsystem that denied the attach permission will likely have logged an informative message about why it was denied.)
zsh: abort bin/llama-server -m --rpc 192.168.2.2:50052 -c 3000 --port 8081
# RPC Server Log
build-rpc-cuda/bin/rpc-server -p 50052 --host 192.168.2.2
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('192.168.2.2') is != '127.0.0.1'
Never expose the RPC server to an open network!
This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Ultra
ggml_metal_init: picking default device: Apple M3 Ultra
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name: Apple M3 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets = true
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = false
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 412316.86 MB
ggml_metal_init: loaded kernel ...
ggml_metal_init: loaded kernel_add 0x13be08510 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_add_fuse_2 0x13be08c50 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_add_fuse_3 0x13be09200 | th_max = 1024 | th_width = 32
... (many similar lines for different kernels) ...
ggml_metal_init: loaded kernel_cpy_q8_0_f16 0x13be7e440 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_concat 0x13be7e990 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_sqr 0x13be7f0b0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_sqrt 0x13be7f7d0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_sin 0x13be7fef0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cos 0x13be80610 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_neg 0x13be80d30 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_reglu 0x13be811d0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_geglu 0x13be81670 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_swiglu 0x13be81b10 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_geglu_erf 0x13be81fb0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_geglu_quick 0x13be82450 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_sum_rows 0x13be829a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mean 0x13be82ef0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_argmax 0x13be83390 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_pool_2d_avg_f32 0x13be83830 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_pool_2d_max_f32 0x13be83cd0 | th_max = 1024 | th_width = 32
create_backend: using Metal backend
Starting RPC server v2.0.0
endpoint : 192.168.2.2:50052
local cache : n/a
backend memory : 393210 MB
Accepted client connection, free_mem=412310618112, total_mem=412316860416
Client connection closed
...
Accepted client connection, free_mem=412310618112, total_mem=412316860416
Null buffer for tensor passed to init_tensor function
...
Client connection closed
Accepted client connection, free_mem=412310618112, total_mem=412316860416
Client connection closed
...
Accepted client connection, free_mem=412310618112, total_mem=412316860416
Null buffer for tensor passed to init_tensor function
...
Client connection closed
...