Build and use ik_llama.cpp with CPU or CPU+CUDA

Built on top of ikawrakow/ik_llama.cpp and llama-swap

All commands are provided for Podman and Docker.

The CPU or CUDA sections under Build and Run are enough to get up and running.

Overview

Build

Each Containerfile builds two image tags:

  • swap: Includes only llama-swap and llama-server.
  • full: Includes llama-server, llama-quantize, and other utilities.

To start, download the four files below to a new directory (e.g. ~/ik_llama/), then follow the next steps.

└── ik_llama
    ├── ik_llama-cpu.Containerfile
    ├── ik_llama-cpu-swap.config.yaml
    ├── ik_llama-cuda.Containerfile
    └── ik_llama-cuda-swap.config.yaml
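The four files can be fetched in one go; a sketch, where RAW_BASE is a placeholder for the raw URL of the repository hosting them:

```shell
# RAW_BASE is a placeholder; replace it with the raw URL of the repository
# hosting these four files before running.
RAW_BASE="https://example.com/ik_llama"
mkdir -p ~/ik_llama && cd ~/ik_llama
for f in ik_llama-cpu.Containerfile ik_llama-cpu-swap.config.yaml \
         ik_llama-cuda.Containerfile ik_llama-cuda-swap.config.yaml; do
  curl -fsSLO "$RAW_BASE/$f" || echo "could not fetch $f; download it manually"
done
```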

CPU

podman image build --format Dockerfile --file ik_llama-cpu.Containerfile --target full --tag ik_llama-cpu:full && podman image build --format Dockerfile --file ik_llama-cpu.Containerfile --target swap --tag ik_llama-cpu:swap
docker image build --file ik_llama-cpu.Containerfile --target full --tag ik_llama-cpu:full . && docker image build --file ik_llama-cpu.Containerfile --target swap --tag ik_llama-cpu:swap .

CUDA

podman image build --format Dockerfile --file ik_llama-cuda.Containerfile --target full --tag ik_llama-cuda:full && podman image build --format Dockerfile --file ik_llama-cuda.Containerfile --target swap --tag ik_llama-cuda:swap
docker image build --file ik_llama-cuda.Containerfile --target full --tag ik_llama-cuda:full . && docker image build --file ik_llama-cuda.Containerfile --target swap --tag ik_llama-cuda:swap .

Run

  • Download .gguf model files to your favorite directory (e.g. /my_local_files/gguf).
  • Map it to /models inside the container.
  • Open browser http://localhost:9292 and enjoy the features.
  • API endpoints are available at http://localhost:9292/v1 for use in other applications.

CPU

podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro ik_llama-cpu:swap

CUDA

  • Install NVIDIA drivers and CUDA on the host.
  • For Docker, install the NVIDIA Container Toolkit.
  • For Podman, set up the Container Device Interface (CDI).
  • Identify for your GPU:
    • the CUDA compute capability (e.g. 8.6 for RTX30*0, 8.9 for RTX40*0, 12.0 for RTX50*0), then set CUDA_DOCKER_ARCH in ik_llama-cuda.Containerfile to match (e.g. CUDA_DOCKER_ARCH=86 for RTX30*0, CUDA_DOCKER_ARCH=89 for RTX40*0, CUDA_DOCKER_ARCH=120 for RTX50*0). If you have a mix of different GPUs, list them all, e.g. CUDA_DOCKER_ARCH="86;89;120".
    • the supported CUDA Toolkit version, then adjust CUDA_VERSION in ik_llama-cuda.Containerfile to match your GPU (e.g. CUDA_VERSION=13.1 for RTX50*0).
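The compute capability can be queried on the host instead of looked up in a table; a sketch (nvidia-smi's compute_cap query needs a reasonably recent driver, and the 8.6 fallback is only an example value):

```shell
# Derive CUDA_DOCKER_ARCH from the GPU's compute capability.
if command -v nvidia-smi >/dev/null 2>&1; then
  cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)
else
  cap="8.6"  # example fallback when no NVIDIA driver is present
fi
arch=$(echo "$cap" | tr -d '. ')   # 8.6 -> 86, 12.0 -> 120
echo "CUDA_DOCKER_ARCH=$arch"
```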
podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --device nvidia.com/gpu=all --security-opt=label=disable localhost/ik_llama-cuda:swap
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --runtime nvidia ik_llama-cuda:swap
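Once a container is up, the endpoints can be smoke-tested from the host; a sketch, where "my-model" is a hypothetical name (GET /v1/models lists the names actually configured):

```shell
# Smoke-test the OpenAI-compatible API exposed on port 9292.
models_json=$(curl -s --max-time 2 http://localhost:9292/v1/models \
  || echo '"server not reachable on :9292"')
echo "$models_json"

# Hypothetical model name; substitute one returned by /v1/models.
curl -s --max-time 2 http://localhost:9292/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"my-model","messages":[{"role":"user","content":"Hello"}]}' \
  || echo '"server not reachable on :9292"'
```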

Troubleshooting

  • If CUDA is not available, use ik_llama-cpu instead.
  • If models are not found, ensure you mount the correct directory: -v /my_local_files/gguf:/models:ro
  • If you need to install Podman or Docker, follow the Podman Installation or Install Docker Engine guide for your OS.

Extra

  • CUSTOM_COMMIT can be used to build a specific ik_llama.cpp commit (e.g. 1ec12b8).
podman image build --format Dockerfile --file ik_llama-cpu.Containerfile --target full --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cpu-1ec12b8:full && podman image build --format Dockerfile --file ik_llama-cpu.Containerfile --target swap --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cpu-1ec12b8:swap
docker image build --file ik_llama-cuda.Containerfile --target full --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cuda-1ec12b8:full . && docker image build --file ik_llama-cuda.Containerfile --target swap --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cuda-1ec12b8:swap .
  • Using the tools in the full image:
$ podman run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --entrypoint bash localhost/ik_llama-cpu:full
# ./llama-quantize ...
# python3 gguf-py/scripts/gguf_dump.py ...
# ./llama-perplexity ...
# ./llama-sweep-bench ...
docker run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --runtime nvidia --entrypoint bash ik_llama-cuda:full
# ./llama-quantize ...
# python3 gguf-py/scripts/gguf_dump.py ...
# ./llama-perplexity ...
# ./llama-sweep-bench ...
  • Customize the llama-swap config: save ik_llama-cpu-swap.config.yaml or ik_llama-cuda-swap.config.yaml locally (e.g. under /my_local_files/), then map it to /app/config.yaml inside the container by appending -v /my_local_files/ik_llama-cpu-swap.config.yaml:/app/config.yaml:ro to your podman run ... or docker run ....
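A minimal sketch of what a model entry in such a config can look like; the key names and the /app/llama-server path are assumptions here, and the bundled ik_llama-*-swap.config.yaml files are the authoritative reference for this image:

```yaml
# Sketch of a llama-swap model entry (key names may differ across versions).
models:
  "my-model":                # hypothetical name, shown in the web UI and API
    cmd: |
      /app/llama-server
      --model /models/my-model.gguf
      --port ${PORT}
    ttl: 300                 # optional: unload after 5 minutes of inactivity
```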
  • To run the container in the background, replace -it with -d: podman run -d ... or docker run -d .... To stop it: podman stop ik_llama or docker stop ik_llama.
  • If you build the image on a different machine than the one it will run on, change -DGGML_NATIVE=ON to -DGGML_NATIVE=OFF in the .Containerfile.
  • If you build only for your GPU architecture and want to make use of more KV quantization types, build with -DGGML_IQK_FA_ALL_QUANTS=ON.
  • If you experiment with several CUDA_VERSION values, remember to remove the unused base images, as each takes several GB: identify them with podman image ls or docker image ls, then delete them, e.g. podman image rm docker.io/nvidia/cuda:12.4.0-runtime-ubuntu22.04 && podman image rm docker.io/nvidia/cuda:12.4.0-devel-ubuntu22.04 (or the same with docker image rm).
  • If you want to build without llama-swap, change --target swap to --target server in ik_llama Containerfiles, e.g. docker image build --file ik_llama-cuda.Containerfile --target full --tag ik_llama-cuda:full . && docker image build --file ik_llama-cuda.Containerfile --target server --tag ik_llama-cuda:server .
  • Look to ubergarm for premade quants (and imatrix files) that are designed around ik_llama.cpp, work well on most standard systems, and come with helpful metrics in the model card.
  • Useful graphs and numbers are in @magikRUKKOLA's Perplexity vs Size Graphs topic, covering recent quants (GLM-4.7, Kimi-K2-Thinking, Deepseek-V3.1-Terminus, Deepseek-R1, Qwen3-Coder, Kimi-K2, Chimera, etc.).
  • Build custom quants with Thireus's tools.
  • If you cannot build, download release builds for macOS/Windows/Ubuntu CPU and Windows CUDA from Thireus's fork of ik_llama.cpp.
  • For a KoboldCPP experience, Croco.Cpp is a fork of KoboldCPP that infers GGML/GGUF models on CPU/CUDA with KoboldAI's UI. It is powered partly by ik_llama.cpp and is compatible with most of Ikawrakow's quants except Bitnet.

Credits

All credits to the awesome community:

llama-swap