This guide walks you through setting up llama-go from scratch to running your first inference example. By the end, you'll have a working installation and understand the basic workflow.
Before starting, you'll need:
- Git with submodule support
- Docker (recommended) or a C++ compiler with CMake
- About 1GB of disk space for the build and test model
That's it - we'll handle everything else through containers to avoid dependency issues.
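Before moving on, you can optionally sanity-check that the required tools are on your `PATH` (a quick check, not part of the build itself):

```shell
# Optional: confirm the prerequisite tools are installed
for tool in git docker; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing - install it before continuing"
  fi
done
```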
Clone the repository with its llama.cpp submodule:
```bash
git clone --recurse-submodules https://github.com/tcpipuk/llama-go
cd llama-go
```

If you've already cloned without submodules, initialise them:
```bash
git submodule update --init --recursive
```

We'll use the project's build containers, which include all necessary build tools, including CMake:
```bash
docker run --rm -v $(pwd):/workspace -w /workspace git.tomfos.tr/tom/llama-go:build-cuda \
  bash -c "LIBRARY_PATH=/workspace C_INCLUDE_PATH=/workspace make libbinding.a"
```

This creates several files:

- `libbinding.a` - The main library for Go
- `libllama.so`, `libggml.so`, etc. - Shared libraries needed at runtime
- `libcommon.a` - Common utilities
The build process typically takes 2-5 minutes depending on your system.
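You can confirm the build artefacts landed in the project root before moving on (file names taken from the list above):

```shell
# Check that the static libraries and shared objects were produced
for f in libbinding.a libcommon.a; do
  if [ -f "$f" ]; then echo "$f: ok"; else echo "$f: missing"; fi
done
ls ./*.so 2>/dev/null || echo "no .so files found - check the build output"
```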
For testing, we'll use Qwen3 0.6B - it's small enough to download quickly but capable enough to demonstrate the library:
```bash
wget -q https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf
```

This downloads approximately 600MB. The model uses the GGUF format, which is the current standard for llama.cpp.
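If you want to verify the download before using it, GGUF files begin with the four ASCII bytes `GGUF`, so a quick header check catches truncated or wrong-format files (a small sketch; adjust the filename to whatever you downloaded):

```shell
# Report whether a file carries the GGUF magic header ("GGUF" in the first 4 bytes)
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

if is_gguf Qwen3-0.6B-Q8_0.gguf; then
  echo "header looks like GGUF"
else
  echo "missing or not a GGUF file"
fi
```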
Now test the installation with a simple question:
```bash
docker run --rm -v $(pwd):/workspace -w /workspace golang:latest \
  bash -c "LIBRARY_PATH=/workspace C_INCLUDE_PATH=/workspace LD_LIBRARY_PATH=/workspace \
  go run ./examples -m Qwen3-0.6B-Q8_0.gguf \
  -p 'What is the capital of France?' -n 50"
```

If everything works correctly, you'll see:
- Model loading messages
- Your prompt: "What is the capital of France?"
- Generated text completing your prompt
The inference should complete in under a minute on most systems.
The library requires three environment variables:
- `LIBRARY_PATH`: Tells the Go compiler where to find the static library
- `C_INCLUDE_PATH`: Tells the compiler where to find header files
- `LD_LIBRARY_PATH`: Tells the runtime where to find shared libraries
Without these, you'll see "undefined symbol" or "library not found" errors.
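If you're working outside the containers, you can export all three from the project root in one go (assuming your shell session stays open for the subsequent build and run steps):

```shell
# Point build and runtime library lookups at the llama-go checkout
export LIBRARY_PATH="$PWD"
export C_INCLUDE_PATH="$PWD"
export LD_LIBRARY_PATH="$PWD"
echo "library paths set to: $PWD"
```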
You can also run the example in interactive mode by omitting the `-p` parameter:
```bash
docker run --rm -it -v $(pwd):/workspace -w /workspace golang:latest \
  bash -c "LIBRARY_PATH=/workspace C_INCLUDE_PATH=/workspace LD_LIBRARY_PATH=/workspace \
  go run ./examples -m Qwen3-0.6B-Q8_0.gguf"
```

This starts an interactive session where you can type prompts and see responses in real time.
"cmake: command not found"
- Use the build container as shown above, or install CMake on your system
"No such file or directory: wrapper.cpp"
- Make sure you're in the correct directory and the submodules are initialised
Missing `.so` files after build
- Check that the build completed successfully and didn't exit early due to errors
"undefined symbol" errors
- Ensure `LD_LIBRARY_PATH` includes the directory containing the `.so` files
- Verify the shared libraries exist in your project directory
"failed to load model"
- Check the model file path is correct
- Confirm the file is a valid GGUF format (not GGML or corrupted)
- Ensure you have enough RAM for the model
"context size" warnings
- These are normal for small models and don't affect basic functionality
If inference seems slow:
- The CPU-only build is functional but not optimised for speed
- Consider hardware acceleration options (see building guide)
- Smaller models like Qwen3-0.6B prioritise compatibility over performance
Now that you have a working installation:
- API guide - Complete guide to Model/Context separation, thread safety, streaming, embeddings, and advanced patterns
- Building guide - Hardware acceleration options (CUDA, Metal, Vulkan, etc.)
- Examples - Working code for chat, streaming, embeddings, and speculative decoding
- Hugging Face GGUF models - Try different models
You've successfully:
- Built the llama-go library with all dependencies
- Downloaded and tested with a working language model
- Verified the complete inference pipeline works
- Understood the basic environment setup
- Seen the Model/Context separation pattern in action
Now that the library works, here's how to integrate it into your Go application:
1. Import the package in your Go code:

   ```go
   import llama "github.com/tcpipuk/llama-go"
   ```

2. Use the Model/Context API pattern:

   ```go
   // Load model weights (ModelOption: WithGPULayers, WithMLock, etc.)
   model, err := llama.LoadModel(
       "model.gguf",
       llama.WithGPULayers(-1), // Offload all layers to GPU
   )
   if err != nil {
       return err
   }
   defer model.Close()

   // Create execution context (ContextOption: WithContext, WithBatch, etc.)
   ctx, err := model.NewContext(
       llama.WithContext(2048),
       llama.WithF16Memory(),
   )
   if err != nil {
       return err
   }
   defer ctx.Close()

   // Generate text
   response, err := ctx.Generate("Hello world", llama.WithMaxTokens(50))
   if err != nil {
       return err
   }
   fmt.Println(response)
   ```

3. Build your application with the same environment variables:

   ```bash
   export LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD LD_LIBRARY_PATH=$PWD
   go build -o myapp
   ```

4. Distribute the shared libraries (`.so` files) alongside your binary - see the building guide for deployment details.
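One simple deployment pattern is a launcher script that points `LD_LIBRARY_PATH` at its own directory, so the bundled `.so` files are found no matter where the user invokes the binary from. This is a hypothetical sketch, not part of llama-go itself; `myapp` and the script name are placeholders:

```shell
# Write a launcher that locates the bundled .so files relative to itself
cat > run-myapp.sh <<'EOF'
#!/bin/sh
# Resolve the directory this script lives in, then prepend it to the
# runtime library search path before handing off to the real binary.
DIR="$(cd "$(dirname "$0")" && pwd)"
LD_LIBRARY_PATH="$DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" exec "$DIR/myapp" "$@"
EOF
chmod +x run-myapp.sh
```

Ship `run-myapp.sh`, `myapp`, and the `.so` files together in one directory.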
The API guide shows common patterns like streaming, chat completion, embeddings, concurrent inference, and speculative decoding. For hardware acceleration, see the building guide.