Eval bug: Crash when offloading large models via RPC if model size exceeds ~75% of server RAM #15055

@Tak-RS

Description

Name and Version

llama.cpp version: build: 6031 (00131d6)

Operating systems

Mac

GGML backends

RPC

Hardware

Client & RPC server
OS: macOS (darwin24.3.0)
Hardware: Apple M3 Ultra, 512 GB RAM
Compiler: Apple clang version 17.0.0 (clang-1700.0.13.5)

Models

DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-Q4_0.gguf (379.55 GB) frequently triggers the error, while the slightly smaller DeepSeek-R1-0528-IQ4_NL (378.58 GB) runs without issues.
DeepSeek-R1-0528-Q4_K_S (380.51 GB) also triggers the error.
Larger quantizations such as Q4_K_XL and Q5_K_S run inference successfully on a single node, but crash once their tensors are offloaded to an RPC server.

Problem description & steps to reproduce

# Describe the bug
When attempting to offload a very large language model (LLM) to a remote RPC server using llama-server, the client process crashes with an abort signal.

The crash occurs consistently when the size of the model tensors being offloaded to the RPC server exceeds approximately 75% of the total physical RAM available on that server. When using models or tensor splits that stay under this ~75% threshold, the offloading process is stable and works as expected.
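
For what it's worth, the ~75% figure lines up exactly with the backend memory the RPC server itself reports: the server log below shows total_mem = 412316860416 bytes, and 412316860416 / (512 × 1024³) = 0.75, i.e. exactly the recommendedMaxWorkingSetSize that Metal reports on this 512 GB machine. The threshold may therefore coincide with that Metal working-set limit rather than with physical RAM as such; this is only an observation from the logged numbers, not a confirmed cause.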

The final error message from the client before it aborts is: Remote RPC server crashed or returned malformed response.

Crucially, the RPC server process itself does not crash; it keeps running. This suggests the issue is not a fatal crash on the server but rather a communication failure between the client and a server that is likely under heavy memory pressure. The client may be misinterpreting a timeout or a dropped connection as a server crash.
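
To illustrate the suspected failure mode, here is a minimal sketch of a generic blocking read loop. It is not the actual ggml-rpc.cpp code, and recv_exact / recv_response are hypothetical names. A client that treats every failed or short recv() the same way cannot tell a server that is stalled under memory pressure apart from one that has crashed or sent a malformed reply:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <sys/types.h>
#include <sys/socket.h>

// Illustrative only -- NOT the actual ggml-rpc.cpp implementation.
// Hypothetical helper: read exactly `size` bytes from a connected socket.
static bool recv_exact(int sockfd, void * buf, size_t size) {
    uint8_t * p = static_cast<uint8_t *>(buf);
    while (size > 0) {
        ssize_t n = recv(sockfd, p, size, 0);
        if (n <= 0) {
            // n == 0 : peer closed or connection dropped
            // n <  0 : error, including a receive timeout if SO_RCVTIMEO is set
            // All of these collapse into one generic failure here.
            return false;
        }
        p    += n;
        size -= static_cast<size_t>(n);
    }
    return true;
}

// Hypothetical caller: any failure is reported with one generic message,
// so a server that is merely slow under heavy memory pressure looks the
// same to the client as one that crashed or returned garbage.
static bool recv_response(int sockfd, void * out, size_t expected_size) {
    if (!recv_exact(sockfd, out, expected_size)) {
        fprintf(stderr, "Remote RPC server crashed or returned malformed response\n");
        return false;
    }
    return true;
}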

# To Reproduce
Set up an RPC server (the rpc-server binary) on a machine.

On a separate client machine (macOS in my case), run llama-server with the --rpc flag pointing to the server.

Use a large GGUF model. Configure the tensor split (e.g., via --tensor-split) so that the portion of tensors assigned to the RPC server is greater than ~75% of the server's total RAM.
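
A hypothetical invocation for this step (the split values 0.8,0.2, their order, and the model path are illustrative assumptions; in the log below the RPC device is listed before the local Metal device, so the first value would apply to the RPC server):

bin/llama-server -m /path/to/model.gguf --rpc 192.168.2.2:50052 --tensor-split 0.8,0.2 -c 3000 --port 8081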

The llama-server on the client will start loading the model and begin offloading tensors.

During the tensor loading process for the RPC server, the client crashes.

First Bad Commit

No response

Relevant log output

bin/llama-server -m /path/to/models/unsloth/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-Q4_0-00001-of-00008.gguf --rpc 192.168.2.2:50052 -c 3000 --port 8081


build: 6031 (00131d6e) with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.3.0
system info: n_threads = 24, n_threads_batch = 24, total_threads = 32

system_info: n_threads = 24 (n_threads_batch = 24) / 32 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | ACCELERATE = 1 | REPACK = 1 | 

main: binding port with default address family  
main: HTTP server is listening, hostname: 127.0.0.1, port: 8081, http threads: 31  
main: loading model  
srv    load_model: loading model '/path/to/models/unsloth/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-Q4_0-00001-of-00008.gguf'  
llama_model_load_from_file_impl: using device RPC[192.168.2.2:50052] (RPC[192.168.2.2:50052]) - 393210 MiB free  
llama_model_load_from_file_impl: using device Metal (Apple M3 Ultra) - 475135 MiB free  
llama_model_loader: additional 7 GGUFs metadata loaded.  
llama_model_loader: loaded meta data with 63 key-value pairs and 1086 tensors from DeepSeek-R1-0528-Q4_0-00001-of-00008.gguf (version GGUF V3 (latest))  

(...)  

load_tensors: loading model tensors, this can take a while... (mmap = true)  
load_tensors: offloading 61 repeating layers to GPU  
load_tensors: offloading output layer to GPU  
load_tensors: offloaded 62/62 layers to GPU  
load_tensors: Metal_Mapped model buffer size =   724.99 MiB  
load_tensors: Metal_Mapped model buffer size = 24856.66 MiB  
load_tensors: Metal_Mapped model buffer size = 47364.20 MiB  
load_tensors: Metal_Mapped model buffer size = 47465.58 MiB  
load_tensors: Metal_Mapped model buffer size = 47472.61 MiB  
load_tensors: Metal_Mapped model buffer size = 30826.90 MiB  
load_tensors:   CPU_Mapped model buffer size =   497.11 MiB  
load_tensors: RPC[192.168.2.2:50052] model buffer size = 162749.85 MiB  
.......................................................
ggml-rpc.cpp:579: Remote RPC server crashed or returned malformed response  

(lldb) process attach --pid 18079  
error: attach failed: attach failed (Not allowed to attach to process. Look in the console messages (Console.app), near the debugserver entries, when the attach failed. The subsystem that denied the attach permission will likely have logged an informative message about why it was denied.)  

zsh: abort      bin/llama-server -m  --rpc 192.168.2.2:50052 -c 3000 --port 8081  

# RPC Server Log

build-rpc-cuda/bin/rpc-server -p 50052 --host 192.168.2.2

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('192.168.2.2') is != '127.0.0.1'
         Never expose the RPC server to an open network!
         This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Ultra
ggml_metal_init: picking default device: Apple M3 Ultra
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name:   Apple M3 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = true
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 412316.86 MB

ggml_metal_init: loaded kernel ... 

ggml_metal_init: loaded kernel_add                                    0x13be08510 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_add_fuse_2                             0x13be08c50 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_add_fuse_3                             0x13be09200 | th_max = 1024 | th_width =   32
... (many similar lines for different kernels) ...
ggml_metal_init: loaded kernel_cpy_q8_0_f16                           0x13be7e440 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_concat                                 0x13be7e990 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_sqr                                    0x13be7f0b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_sqrt                                   0x13be7f7d0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_sin                                    0x13be7fef0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_cos                                    0x13be80610 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_neg                                    0x13be80d30 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_reglu                                  0x13be811d0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_geglu                                  0x13be81670 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_swiglu                                 0x13be81b10 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_geglu_erf                              0x13be81fb0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_geglu_quick                            0x13be82450 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_sum_rows                               0x13be829a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mean                                   0x13be82ef0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_argmax                                 0x13be83390 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_pool_2d_avg_f32                        0x13be83830 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_pool_2d_max_f32                        0x13be83cd0 | th_max = 1024 | th_width =   32

create_backend: using Metal backend
Starting RPC server v2.0.0
  endpoint       : 192.168.2.2:50052
  local cache    : n/a
  backend memory : 393210 MB
Accepted client connection, free_mem=412310618112, total_mem=412316860416
Client connection closed
...
Accepted client connection, free_mem=412310618112, total_mem=412316860416
Null buffer for tensor passed to init_tensor function
...
Client connection closed
Accepted client connection, free_mem=412310618112, total_mem=412316860416
Client connection closed
...
Accepted client connection, free_mem=412310618112, total_mem=412316860416
Null buffer for tensor passed to init_tensor function
...
Client connection closed
...
