Description
When attempting to run large models (120B parameters, ~60GB) via API remoting, the container fails with:
FATAL: Couldn't allocate the guest-host shared buffer :/
The model loads metadata successfully but fails during tensor loading when trying to allocate the shared buffer for GPU offloading.
Environment
- macOS Version: Darwin 24.6.0
- Hardware: Apple M3 Max (96GB unified memory)
- Podman: 5.4.0
- RamaLama: 0.16.0
- Tarball: llama_cpp-api_remoting-b6298-remoting-0.1.6_b5
- Container Image: quay.io/crcont/remoting:v0.12.1-apir.rc1_apir.b6298-remoting-0.1.6_b5
- Podman Machine Memory: 16GB
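For reference, the machine memory figure above can be confirmed with standard Podman tooling (output columns vary slightly across Podman versions):

```bash
# Show the configured CPUs / memory / disk size of the Podman machine
podman machine list
```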
Steps to Reproduce
1. Download the 120B model:

   ```bash
   ramalama pull "hf://ggml-org/gpt-oss-120b-GGUF"
   ```

2. Configure API remoting:

   ```bash
   cd llama_cpp-api_remoting-b6298-remoting-0.1.6_b5
   bash ./update_krunkit.sh
   bash ./podman_start_machine.api_remoting.sh
   ```

3. Run the model:

   ```bash
   export CONTAINERS_MACHINE_PROVIDER=libkrun
   podman run -d --device /dev/dri -p 8180:8180 \
     -v "/path/to/gpt-oss-120b-GGUF/snapshots/sha256-.../:/models:ro" \
     --name gpt-oss-120b \
     quay.io/crcont/remoting:v0.12.1-apir.rc1_apir.b6298-remoting-0.1.6_b5 \
     llama-server --host 0.0.0.0 --port 8180 \
     --model /models/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
     -ngl 999 --threads 8 -fa
   ```
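The output in the next section is the container log of the server; with the detached container started above it can be followed with, for example:

```bash
# Tail the llama-server output from the detached container
podman logs -f gpt-oss-120b
```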
Observed Behavior
```
APIR: using DRM device /dev/dri/renderD128
APIR: log_call_duration: waited 193625ns for the API Remoting handshake host reply...
APIR: log_call_duration: waited 7.55ms for the API Remoting LoadLibrary host reply...
APIR: ggml_backend_remoting_frontend_reg: initialzed
build: 1 (fc2c13f) with cc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5) for aarch64-redhat-linux
system info: n_threads = 8, n_threads_batch = 8, total_threads = 8
system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8180, http threads: 7
main: loading model
srv load_model: loading model '/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf'
llama_model_load_from_file_impl: using device Metal (Apple M3 Max) - 98298 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 38 key-value pairs and 687 tensors from /models/gpt-oss-120b-mxfp4-00001-of-00003.gguf (version GGUF V3 (latest))
load: printing all EOG tokens:
load: - 199999 ('<|endoftext|>')
load: - 200002 ('<|return|>')
load: - 200007 ('<|end|>')
load: - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
load_tensors: loading model tensors, this can take a while... (mmap = true)
FATAL: Couldn't allocate the guest-host shared buffer :/
```
Expected Behavior
The model should load successfully and the llama-server should start, similar to how smaller models (llama3.2, ~2GB) work correctly.
Additional Context
- Smaller models (~2GB like llama3.2, ~11GB like gpt-oss-20b) work correctly with API remoting
- The host has 128GB unified memory with ~98GB reported free for Metal
- Native execution (ramalama --nocontainer) works fine for the 120B model
- The issue appears to be a hardcoded limit in the guest-host shared buffer allocation
Feature Request
Would it be possible to:
- Make the shared buffer size configurable via an environment variable (e.g., APIR_SHARED_BUFFER_SIZE)? See the sketch after this list.
- Document the maximum model size supported by API remoting?
- Implement dynamic buffer sizing based on available host memory?
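A minimal sketch of how such a knob could be passed to the container, assuming a hypothetical APIR_SHARED_BUFFER_SIZE variable were honored by the remoting backend (the variable does not exist today; its name and unit are illustrative only):

```bash
# Hypothetical: APIR_SHARED_BUFFER_SIZE is the requested option, not an existing one.
# Value expressed in bytes (64 GiB), large enough for the ~60GB model.
export CONTAINERS_MACHINE_PROVIDER=libkrun
podman run -d --device /dev/dri -p 8180:8180 \
  -e APIR_SHARED_BUFFER_SIZE=$((64 * 1024 * 1024 * 1024)) \
  -v "/path/to/gpt-oss-120b-GGUF/snapshots/sha256-.../:/models:ro" \
  --name gpt-oss-120b \
  quay.io/crcont/remoting:v0.12.1-apir.rc1_apir.b6298-remoting-0.1.6_b5 \
  llama-server --host 0.0.0.0 --port 8180 \
  --model /models/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
  -ngl 999 --threads 8 -fa
```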
Workaround
Currently using native execution without containers:
```bash
llama-server --model /path/to/model.gguf --host 0.0.0.0 --port 8180 -ngl 999 -fa
```

This works but loses the container isolation benefits.
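For completeness, the natively-run server can be checked with the health endpoint exposed by recent llama.cpp builds once loading finishes:

```bash
# Returns HTTP 200 with a small JSON status body when the model is ready
curl http://127.0.0.1:8180/health
```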