Interact with llama-server models
Install this plugin in the same environment as LLM.
llm install llm-llama-server
You'll need to be running a llama-server on port 8080 to use this plugin.
You can run brew install llama.cpp to obtain that binary. Then run it like this:
llama-server -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL
This loads and serves the unsloth/gemma-3-4b-it-GGUF version of Gemma 3 4B, a 3.2GB download.
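If you want to confirm the server is up before wiring it into LLM, you can query it directly. This is a minimal sketch, assuming llama-server is listening on its default port 8080 and exposes the OpenAI-compatible /v1/models endpoint:

import json
import urllib.request

# Query the OpenAI-compatible model listing endpoint exposed by llama-server.
# Assumes the server is running on its default http://localhost:8080.
with urllib.request.urlopen("http://localhost:8080/v1/models") as response:
    payload = json.load(response)

# Print the identifier of each model the server reports.
for model in payload.get("data", []):
    print(model.get("id"))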
To access regular models from LLM, use the llama-server model:
llm -m llama-server "say hi"
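The same model is available from LLM's Python API. A minimal sketch, assuming the plugin is installed in the environment you import llm from:

import llm

# Look up the model registered by this plugin and prompt the llama-server
# instance listening on localhost:8080.
model = llm.get_model("llama-server")
response = model.prompt("say hi")
print(response.text())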
For vision models, use llama-server-vision:
llm -m llama-server-vision describe -a path/to/image.png
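Attachments work from the Python API as well. A sketch assuming a local image at path/to/image.png:

import llm

# Vision-capable model registered by this plugin.
model = llm.get_model("llama-server-vision")

# Pass the image as an attachment alongside the prompt.
response = model.prompt(
    "describe",
    attachments=[llm.Attachment(path="path/to/image.png")],
)
print(response.text())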
For models with tools (which also support vision), use llama-server-tools:
llm -m llama-server-tools -T llm_time 'time?' --td
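Tool calls can also be driven from Python. This sketch assumes LLM's tool support (model.chain() with a tools= list) and defines a hypothetical current_time() function in place of the built-in llm_time tool:

import datetime

import llm


def current_time() -> str:
    """Return the current local time as an ISO 8601 string."""
    return datetime.datetime.now().isoformat()


# Tool-capable model registered by this plugin. chain() lets the model call
# current_time() and then continue the response with the tool's result.
model = llm.get_model("llama-server-tools")
response = model.chain("What time is it?", tools=[current_time])
print(response.text())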
You'll need to run llama-server with the --jinja flag in order for this to work:
llama-server --jinja -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL
Or for a slightly stronger 7.3GB model:
llama-server --jinja -hf unsloth/gemma-3-12b-it-qat-GGUF:Q4_K_M
To set up this plugin locally, first check out the code. Then create a new virtual environment:
cd llm-llama-server
python -m venv venv
source venv/bin/activate
Now install the dependencies and test dependencies:
python -m pip install -e '.[test]'
To run the tests:
python -m pytest