@DajanaV commented Nov 9, 2025

Mirrored from ggml-org/llama.cpp#17116

fix #16657
ref ggml-org/llama.cpp#16276 (review)

This fixes RPC inference when the Metal backend is involved.

Testing:

```sh
# server
make -j && ./bin/rpc-server

# cli
make -j && ./bin/llama-cli -m ../models/gemma-3-4b-it/ggml-model-f16.gguf --rpc localhost:50052 -ngl 99 --no-mmap -no-cnv -p "Hello" --top-k 1 -n 32 -fa on
```

TODO:

  • Check performance impact
  • Cache the responses to avoid extra RPC calls? (see the sketch below)
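
A minimal sketch of the caching idea from the second TODO item. The `rpc_query_alignment()` helper below is a hypothetical stand-in for a single RPC round-trip (the actual llama.cpp RPC API differs); the point is only that a remote device's properties do not change during a session, so the response can be memoized per endpoint and re-queried at most once:

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the real RPC round-trip; the actual
// implementation would send a request over the server socket.
static uint64_t rpc_query_alignment(const std::string & endpoint) {
    (void) endpoint;
    return 32; // placeholder value for the sketch
}

// Per-endpoint cache: the first call pays one RPC round-trip,
// subsequent calls are served from the map.
uint64_t cached_alignment(const std::string & endpoint) {
    static std::mutex mtx;
    static std::unordered_map<std::string, uint64_t> cache;
    std::lock_guard<std::mutex> lock(mtx); // guard against concurrent graph evals
    auto it = cache.find(endpoint);
    if (it != cache.end()) {
        return it->second; // cache hit: no extra RPC call
    }
    uint64_t value = rpc_query_alignment(endpoint);
    cache.emplace(endpoint, value);
    return value;
}
```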

@DajanaV closed this Nov 13, 2025
