Description
When attempting to run large models (120B parameters, ~60GB) via API remoting, the container fails with:
FATAL: Couldn't allocate the guest-host shared buffer :/
The model loads metadata successfully but fails during tensor loading when trying to allocate the shared buffer for GPU offloading.
Environment
- macOS Version: Darwin 24.6.0
- Hardware: Apple M3 Max (96GB unified memory)
- Podman: 5.4.0
- RamaLama: 0.16.0
- Tarball: llama_cpp-api_remoting-b6298-remoting-0.1.6_b5
- Container Image: quay.io/crcont/remoting:v0.12.1-apir.rc1_apir.b6298-remoting-0.1.6_b5
- Podman Machine Memory: 16GB
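For reference, the machine memory figure above can be confirmed with standard Podman tooling (output columns vary slightly across Podman versions):

```bash
# Show the configured CPUs / memory / disk size of the Podman machine
podman machine list
```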
Steps to Reproduce
1. Download the 120B model:

   ```bash
   ramalama pull "hf://ggml-org/gpt-oss-120b-GGUF"
   ```

2. Configure API remoting:

   ```bash
   cd llama_cpp-api_remoting-b6298-remoting-0.1.6_b5
   bash ./update_krunkit.sh
   bash ./podman_start_machine.api_remoting.sh
   ```

3. Run the model:

   ```bash
   export CONTAINERS_MACHINE_PROVIDER=libkrun
   podman run -d --device /dev/dri -p 8180:8180 \
     -v "/path/to/gpt-oss-120b-GGUF/snapshots/sha256-.../:/models:ro" \
     --name gpt-oss-120b \
     quay.io/crcont/remoting:v0.12.1-apir.rc1_apir.b6298-remoting-0.1.6_b5 \
     llama-server --host 0.0.0.0 --port 8180 \
     --model /models/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
     -ngl 999 --threads 8 -fa
   ```
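The output in the next section is the container log of the server; with the detached container started above it can be followed with, for example:

```bash
# Tail the llama-server output from the detached container
podman logs -f gpt-oss-120b
```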
Observed Behavior
```
APIR: using DRM device /dev/dri/renderD128
APIR: log_call_duration: waited 193625ns for the API Remoting handshake host reply...
APIR: log_call_duration: waited 7.55ms for the API Remoting LoadLibrary host reply...
APIR: ggml_backend_remoting_frontend_reg: initialzed
build: 1 (fc2c13f) with cc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5) for aarch64-redhat-linux
system info: n_threads = 8, n_threads_batch = 8, total_threads = 8
system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8180, http threads: 7
main: loading model
srv load_model: loading model '/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf'
llama_model_load_from_file_impl: using device Metal (Apple M3 Max) - 98298 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 38 key-value pairs and 687 tensors from /models/gpt-oss-120b-mxfp4-00001-of-00003.gguf (version GGUF V3 (latest))
load: printing all EOG tokens:
load: - 199999 ('<|endoftext|>')
load: - 200002 ('<|return|>')
load: - 200007 ('<|end|>')
load: - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
load_tensors: loading model tensors, this can take a while... (mmap = true)
FATAL: Couldn't allocate the guest-host shared buffer :/
```
Expected Behavior
The model should load successfully and the llama-server should start, similar to how smaller models (llama3.2, ~2GB) work correctly.
Additional Context
- Smaller models (~2GB like llama3.2, ~11GB like gpt-oss-20b) work correctly with API remoting
- The host has 128GB unified memory with ~98GB reported free for Metal
- Native execution (ramalama --nocontainer) works fine for the 120B model
- The issue appears to be a hardcoded limit in the guest-host shared buffer allocation
Feature Request
Would it be possible to:
- Make the shared buffer size configurable via an environment variable (e.g., APIR_SHARED_BUFFER_SIZE)? See the sketch after this list.
- Document the maximum model size supported by API remoting?
- Implement dynamic buffer sizing based on available host memory?
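A minimal sketch of how such a knob could be passed to the container, assuming a hypothetical APIR_SHARED_BUFFER_SIZE variable were honored by the remoting backend (the variable does not exist today; its name and unit are illustrative only):

```bash
# Hypothetical: APIR_SHARED_BUFFER_SIZE is the requested option, not an existing one.
# Value expressed in bytes (64 GiB), large enough for the ~60GB model.
export CONTAINERS_MACHINE_PROVIDER=libkrun
podman run -d --device /dev/dri -p 8180:8180 \
  -e APIR_SHARED_BUFFER_SIZE=$((64 * 1024 * 1024 * 1024)) \
  -v "/path/to/gpt-oss-120b-GGUF/snapshots/sha256-.../:/models:ro" \
  --name gpt-oss-120b \
  quay.io/crcont/remoting:v0.12.1-apir.rc1_apir.b6298-remoting-0.1.6_b5 \
  llama-server --host 0.0.0.0 --port 8180 \
  --model /models/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
  -ngl 999 --threads 8 -fa
```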
Workaround
Currently using native execution without containers:
```bash
llama-server --model /path/to/model.gguf --host 0.0.0.0 --port 8180 -ngl 999 -fa
```

This works but loses the container isolation benefits.
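For completeness, the natively-run server can be checked with the health endpoint exposed by recent llama.cpp builds once loading finishes:

```bash
# Returns HTTP 200 with a small JSON status body when the model is ready
curl http://127.0.0.1:8180/health
```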