Large model support - "Couldn't allocate the guest-host shared buffer" error #24

@cloudomate

Description

When attempting to run large models (120B parameters, ~60GB) via API remoting, the container fails with:

FATAL: Couldn't allocate the guest-host shared buffer :/

The model loads metadata successfully but fails during tensor loading when trying to allocate the shared buffer for GPU offloading.

Environment

  • macOS Version: Darwin 24.6.0
  • Hardware: Apple M3 Max (128GB unified memory)
  • Podman: 5.4.0
  • RamaLama: 0.16.0
  • Tarball: llama_cpp-api_remoting-b6298-remoting-0.1.6_b5
  • Container Image: quay.io/crcont/remoting:v0.12.1-apir.rc1_apir.b6298-remoting-0.1.6_b5
  • Podman Machine Memory: 16GB
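
Possibly relevant: the podman machine is allocated only 16GB, well below the ~60GB model size. Whether the guest-host shared buffer has to fit in guest memory is an assumption on my part, but the guest's view of its resources can be checked with standard podman commands:

    # Show the machine's configured resources (memory, CPUs, disk)
    podman machine inspect

    # Show what the guest VM itself reports; free runs inside the VM
    podman machine ssh -- free -h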

Steps to Reproduce

  1. Download the 120B model:

    ramalama pull "hf://ggml-org/gpt-oss-120b-GGUF"
  2. Configure API remoting:

    cd llama_cpp-api_remoting-b6298-remoting-0.1.6_b5
    bash ./update_krunkit.sh
    bash ./podman_start_machine.api_remoting.sh
  3. Run the model:

    export CONTAINERS_MACHINE_PROVIDER=libkrun
    podman run -d --device /dev/dri -p 8180:8180 \
      -v "/path/to/gpt-oss-120b-GGUF/snapshots/sha256-.../:/models:ro" \
      --name gpt-oss-120b \
      quay.io/crcont/remoting:v0.12.1-apir.rc1_apir.b6298-remoting-0.1.6_b5 \
      llama-server --host 0.0.0.0 --port 8180 \
      --model /models/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
      -ngl 999 --threads 8 -fa
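
With smaller models, where this setup works, the server can be sanity-checked once it is up (llama-server exposes a /health endpoint):

    curl http://localhost:8180/health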

Observed Behavior

APIR: using DRM device /dev/dri/renderD128
APIR: log_call_duration: waited 193625ns for the API Remoting handshake host reply...
APIR: log_call_duration: waited 7.55ms for the API Remoting LoadLibrary host reply...
APIR: ggml_backend_remoting_frontend_reg: initialzed
build: 1 (fc2c13f) with cc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5) for aarch64-redhat-linux
system info: n_threads = 8, n_threads_batch = 8, total_threads = 8

system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8180, http threads: 7
main: loading model
srv    load_model: loading model '/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf'
llama_model_load_from_file_impl: using device Metal (Apple M3 Max) - 98298 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 38 key-value pairs and 687 tensors from /models/gpt-oss-120b-mxfp4-00001-of-00003.gguf (version GGUF V3 (latest))
load: printing all EOG tokens:
load:   - 199999 ('<|endoftext|>')
load:   - 200002 ('<|return|>')
load:   - 200007 ('<|end|>')
load:   - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
load_tensors: loading model tensors, this can take a while... (mmap = true)
FATAL: Couldn't allocate the guest-host shared buffer :/

Expected Behavior

The model should load successfully and the llama-server should start, similar to how smaller models (llama3.2, ~2GB) work correctly.

Additional Context

  • Smaller models work correctly with API remoting (llama3.2, ~2GB; gpt-oss-20b, ~11GB)
  • The host has 128GB unified memory, with 98298 MiB (~96GB) reported free for Metal
  • Native execution (ramalama --nocontainer) works fine for the 120B model
  • The issue appears to be a hardcoded limit in the guest-host shared buffer allocation
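
One experiment that could isolate the cause (assuming the shared buffer is carved out of guest memory, so the 16GB machine allocation, rather than a hardcoded constant, may be the effective cap): give the podman machine enough memory to hold the model and retry. The --memory value is in MB, and I am assuming the API-remoting start script from step 2 starts the same (default) machine:

    podman machine stop
    podman machine set --memory 65536
    bash ./podman_start_machine.api_remoting.sh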

Feature Request

Would it be possible to:

  1. Make the shared buffer size configurable via an environment variable (e.g., APIR_SHARED_BUFFER_SIZE; see the sketch after this list)?
  2. Document the maximum model size supported by API remoting?
  3. Implement dynamic buffer sizing based on available host memory?
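
For item 1, usage might look like the following. APIR_SHARED_BUFFER_SIZE does not exist today; the variable name and the byte-denominated value are illustrative only:

    # Hypothetical: request a 64 GiB guest-host shared buffer
    podman run -d --device /dev/dri -p 8180:8180 \
      -e APIR_SHARED_BUFFER_SIZE=$((64 * 1024 * 1024 * 1024)) \
      -v "/path/to/gpt-oss-120b-GGUF/snapshots/sha256-.../:/models:ro" \
      --name gpt-oss-120b \
      quay.io/crcont/remoting:v0.12.1-apir.rc1_apir.b6298-remoting-0.1.6_b5 \
      llama-server --host 0.0.0.0 --port 8180 \
      --model /models/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
      -ngl 999 --threads 8 -fa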

Workaround

Currently using native execution without containers:

llama-server --model /path/to/model.gguf --host 0.0.0.0 --port 8180 -ngl 999 -fa

This works but loses the container isolation benefits.
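
Another possible mitigation, untested and assuming the shared buffer only needs to hold the offloaded tensors: lower -ngl in the containerized command so fewer layers go through API remoting, trading inference speed for isolation:

    # Same podman run as in step 3 of the reproduction, with only the
    # trailing llama-server flags changed (the -ngl value is illustrative)
    llama-server --host 0.0.0.0 --port 8180 \
      --model /models/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
      -ngl 24 --threads 8 -fa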
