This guide walks you through setting up llama-go from scratch to running your first inference example. By the end, you'll have a working installation and understand the basic workflow.
Before starting, you'll need:
- Git with submodule support
- Docker (recommended) or a C++ compiler with CMake
- About 1GB of disk space for the build and test model
That's it - we'll handle everything else through containers to avoid dependency issues.
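Before moving on, you can optionally sanity-check that the required tools are on your `PATH` (a quick check, not part of the build itself):

```shell
# Optional: confirm the prerequisite tools are installed
for tool in git docker; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing - install it before continuing"
  fi
done
```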
Clone the repository with its llama.cpp submodule:
```bash
git clone --recurse-submodules https://github.com/tcpipuk/llama-go
cd llama-go
```

If you've already cloned without submodules, initialise them:
```bash
git submodule update --init --recursive
```

We'll use the project's build containers, which include all necessary build tools, including CMake:
```bash
docker run --rm -v $(pwd):/workspace -w /workspace git.tomfos.tr/tom/llama-go:build-cuda \
  bash -c "LIBRARY_PATH=/workspace C_INCLUDE_PATH=/workspace make libbinding.a"
```

This creates several files:

- `libbinding.a` - The main library for Go
- `libllama.so`, `libggml.so`, etc. - Shared libraries needed at runtime
- `libcommon.a` - Common utilities
The build process typically takes 2-5 minutes depending on your system.
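You can confirm the build artefacts landed in the project root before moving on (file names taken from the list above):

```shell
# Check that the static libraries and shared objects were produced
for f in libbinding.a libcommon.a; do
  if [ -f "$f" ]; then echo "$f: ok"; else echo "$f: missing"; fi
done
ls ./*.so 2>/dev/null || echo "no .so files found - check the build output"
```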
For testing, we'll use Qwen3 0.6B - it's small enough to download quickly but capable enough to demonstrate the library:
```bash
wget -q https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf
```

This downloads approximately 600MB. The model uses the GGUF format, which is the current standard for llama.cpp.
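If you want to verify the download before using it, GGUF files begin with the four ASCII bytes `GGUF`, so a quick header check catches truncated or wrong-format files (a small sketch; adjust the filename to whatever you downloaded):

```shell
# Report whether a file carries the GGUF magic header ("GGUF" in the first 4 bytes)
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

if is_gguf Qwen3-0.6B-Q8_0.gguf; then
  echo "header looks like GGUF"
else
  echo "missing or not a GGUF file"
fi
```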
Now test the installation with a simple question:
```bash
docker run --rm -v $(pwd):/workspace -w /workspace golang:latest \
  bash -c "LIBRARY_PATH=/workspace C_INCLUDE_PATH=/workspace LD_LIBRARY_PATH=/workspace \
  go run ./examples -m Qwen3-0.6B-Q8_0.gguf \
  -p 'What is the capital of France?' -n 50"
```

If everything works correctly, you'll see:
- Model loading messages
- Your prompt: "What is the capital of France?"
- Generated text completing your prompt
The inference should complete in under a minute on most systems.
The library requires three environment variables:
- `LIBRARY_PATH`: Tells the Go compiler where to find the static library
- `C_INCLUDE_PATH`: Tells the compiler where to find header files
- `LD_LIBRARY_PATH`: Tells the runtime where to find shared libraries
Without these, you'll see "undefined symbol" or "library not found" errors.
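If you're working outside the containers, you can export all three from the project root in one go (assuming your shell session stays open for the subsequent build and run steps):

```shell
# Point build and runtime library lookups at the llama-go checkout
export LIBRARY_PATH="$PWD"
export C_INCLUDE_PATH="$PWD"
export LD_LIBRARY_PATH="$PWD"
echo "library paths set to: $PWD"
```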
You can also run the example in interactive mode by omitting the `-p` parameter:
```bash
docker run --rm -it -v $(pwd):/workspace -w /workspace golang:latest \
  bash -c "LIBRARY_PATH=/workspace C_INCLUDE_PATH=/workspace LD_LIBRARY_PATH=/workspace \
  go run ./examples -m Qwen3-0.6B-Q8_0.gguf"
```

This starts an interactive session where you can type prompts and see responses in real time.
"cmake: command not found"
- Use the build container as shown above, or install CMake on your system
"No such file or directory: wrapper.cpp"
- Make sure you're in the correct directory and the submodules are initialised
Missing `.so` files after build
- Check that the build completed successfully and didn't exit early due to errors
"undefined symbol" errors
- Ensure `LD_LIBRARY_PATH` includes the directory containing the `.so` files
- Verify the shared libraries exist in your project directory
"failed to load model"
- Check the model file path is correct
- Confirm the file is a valid GGUF format (not GGML or corrupted)
- Ensure you have enough RAM for the model
"context size" warnings
- These are normal for small models and don't affect basic functionality
If inference seems slow:
- The CPU-only build is functional but not optimised for speed
- Consider hardware acceleration options (see building guide)
- Smaller models like Qwen3-0.6B prioritise compatibility over performance
Now that you have a working installation:
- API guide - Complete guide to Model/Context separation, thread safety, streaming, embeddings, and advanced patterns
- Building guide - Hardware acceleration options (CUDA, Metal, Vulkan, etc.)
- Examples - Working code for chat, streaming, embeddings, and speculative decoding
- Hugging Face GGUF models - Try different models
You've successfully:
- Built the llama-go library with all dependencies
- Downloaded and tested with a working language model
- Verified the complete inference pipeline works
- Understood the basic environment setup
- Seen the Model/Context separation pattern in action
Now that the library works, here's how to integrate it into your Go application:
1. Import the package in your Go code:

   ```go
   import llama "github.com/tcpipuk/llama-go"
   ```

2. Use the Model/Context API pattern:

   ```go
   // Load model weights (ModelOption: WithGPULayers, WithMLock, etc.)
   model, err := llama.LoadModel(
       "model.gguf",
       llama.WithGPULayers(-1), // Offload all layers to GPU
   )
   if err != nil {
       return err
   }
   defer model.Close()

   // Create execution context (ContextOption: WithContext, WithBatch, etc.)
   ctx, err := model.NewContext(
       llama.WithContext(2048),
       llama.WithF16Memory(),
   )
   if err != nil {
       return err
   }
   defer ctx.Close()

   // Generate text
   response, err := ctx.Generate("Hello world", llama.WithMaxTokens(50))
   if err != nil {
       return err
   }
   fmt.Println(response)
   ```

3. Build your application with the same environment variables:

   ```bash
   export LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD LD_LIBRARY_PATH=$PWD
   go build -o myapp
   ```

4. Distribute the shared libraries (`.so` files) alongside your binary - see the building guide for deployment details.
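One simple deployment pattern is a launcher script that points `LD_LIBRARY_PATH` at its own directory, so the bundled `.so` files are found no matter where the user invokes the binary from. This is a hypothetical sketch, not part of llama-go itself; `myapp` and the script name are placeholders:

```shell
# Write a launcher that locates the bundled .so files relative to itself
cat > run-myapp.sh <<'EOF'
#!/bin/sh
# Resolve the directory this script lives in, then prepend it to the
# runtime library search path before handing off to the real binary.
DIR="$(cd "$(dirname "$0")" && pwd)"
LD_LIBRARY_PATH="$DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" exec "$DIR/myapp" "$@"
EOF
chmod +x run-myapp.sh
```

Ship `run-myapp.sh`, `myapp`, and the `.so` files together in one directory.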
The API guide shows common patterns like streaming, chat completion, embeddings, concurrent inference, and speculative decoding. For hardware acceleration, see the building guide.