Commit 82ab6bf

Add documentation how to use with llama.cpp (#318)
## Summary

Add documentation on how to run guidellm with a llama.cpp server.

## Details

guidellm can run against a llama.cpp server when the model metadata is prefetched and the server is started with the right arguments.

## Test Plan

Verified that guidellm runs successfully when following the documented steps.

## Related Issues

n/a

---

- [x] "I certify that all code in this PR is my own, except as noted below."

## Use of AI

- [ ] Includes AI-assisted code completion
- [ ] Includes code generated by an AI application
- [ ] Includes AI-generated tests (NOTE: AI-written tests should have a docstring that includes `## WRITTEN BY AI ##`)

Signed-off-by: Radoslav Gerganov <[email protected]>
1 parent 9573703 commit 82ab6bf

File tree

1 file changed (+18, -0 lines)

docs/backends.md

Lines changed: 18 additions & 0 deletions
@@ -40,6 +40,24 @@ docker run --gpus 1 -ti --shm-size 1g --ipc=host --rm -p 8080:80 \
For more information on starting a TGI server, see the [TGI Documentation](https://huggingface.co/docs/text-generation-inference/index).

### 3. llama.cpp

[llama.cpp](https://github.com/ggml-org/llama.cpp) provides a lightweight, OpenAI-compatible server through its [llama-server](https://github.com/ggml-org/llama.cpp/blob/master/tools/server) tool.

To start a llama.cpp server with the gpt-oss-20b model, you can use the following command:
```bash
llama-server -hf ggml-org/gpt-oss-20b-GGUF --alias gpt-oss-20b --ctx-size 0 --jinja -ub 2048 -b 2048
```
Note that we provide the alias `gpt-oss-20b` for the model name because `guidellm` uses it to retrieve the model metadata in JSON format, and this metadata is not included in GGUF model repositories. A simple workaround is to download the metadata files from the safetensors repository and place them in a local directory named after the alias:
```bash
huggingface-cli download openai/gpt-oss-20b --include "*.json" --local-dir gpt-oss-20b/
```
Now you can run `guidellm` as usual, and it will fetch the model metadata from the local directory.
## Expanding Backend Support

GuideLLM is an open platform, and we encourage contributions to extend its backend support. Whether it's adding new server implementations, integrating with Python-based backends, or enhancing existing capabilities, your contributions are welcome. For more details on how to contribute, see the [CONTRIBUTING.md](https://github.com/vllm-project/guidellm/blob/main/CONTRIBUTING.md) file.
