This guide walks you through running a Llama Stack server locally using Ollama and Podman.
⚡ Performance Update: Ollama now runs directly on your host machine for significantly better performance. The quickstart uses `make setup-ollama` to automate the installation and configuration.
- Prerequisites
- Supported Models
- Installing the RAG QuickStart
- Quick Start with Podman Compose
- Automatic Document Ingestion
- Using the RAG UI
- Environment Variables
- Redeploying Changes
- Troubleshooting
- Podman or Docker
- podman-compose - Install with `pip install podman-compose`
- Python 3.10 or newer
- uv (Python package manager)
- Ollama - runs on host machine for better performance
You can install Ollama manually or use the automated setup:
Note: If running on Apple Silicon, please read the Apple Silicon Guide.
# Automated setup (recommended - from deploy/local directory)
cd deploy/local
make setup-ollama
# Or check if all dependencies are installed
cd deploy/local
make check-deps
# Or verify manual installation
podman --version
podman-compose --version
python3 --version
uv --version
ollama --version

Note:
- podman-compose is required to orchestrate multiple containers - install it with `pip install podman-compose`
- uv is used for fast Python package management and virtual environment handling
- docker works as well as podman, but these instructions use podman
- Ollama now runs on your host instead of in a container for ~3x better inference performance
- Use `make check-deps` to verify all required dependencies (podman, podman-compose, uv, ollama) are installed
- See OLLAMA_SETUP.md for detailed Ollama configuration and troubleshooting
| Function | Model Name | Hardware | Local Support |
|---|---|---|---|
| Embedding | all-MiniLM-L6-v2 | CPU/GPU | ✓ |
| Generation | meta-llama/Llama-3.2-3B-Instruct | CPU/GPU | ✓ |
Note: Local deployment primarily supports CPU and GPU acceleration through Ollama. The models are automatically downloaded when first requested.
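Once the server is running (see the quick start below), you can confirm what is actually available from each layer. A minimal check, assuming the default ports; the `/v1/models` route is the usual Llama Stack listing endpoint but may vary between releases:

```bash
# Models Ollama has downloaded locally (the generation model shows up here)
ollama list

# Models registered with the running Llama Stack server, including the embedding model
# (assumes the server listens on the default port 8321)
curl -s http://localhost:8321/v1/models
```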
Clone the repo so you have a working copy
git clone https://github.com/rh-ai-quickstart/RAG

Option A: Quick Setup (Recommended)
Use the automated setup (requires navigating to deploy/local first):
cd deploy/local
make setup-ollama

This will install Ollama, start the service, and pull the required model.
Option B: Manual Setup
If you've already installed Ollama but need to start it manually:
# On macOS, Ollama usually starts automatically
# On Linux, you may need to start it manually:
ollama serve &
# Pull the model
ollama pull llama3.2:3b-instruct-fp16
# Optional: Pre-load the model with keepalive
ollama run llama3.2:3b-instruct-fp16 --keepalive 60m

Note: The --keepalive 60m option keeps the model in memory for 60 minutes for faster inference. After this, proceed to use `make start` or `make up` to start all RAG services.
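Whichever option you used, it is worth confirming that Ollama is serving and the model is present before continuing. These optional checks only assume Ollama's default port, 11434:

```bash
# The Ollama API should answer on its default port
curl http://localhost:11434/api/version

# The instruct model should appear in the local model list
ollama list | grep llama3.2
```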
Launch a new terminal window to set the necessary environment variables:
export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
export LLAMA_STACK_PORT=8321

Pull the Docker image:

podman pull docker.io/llamastack/distribution-ollama

Create a local directory for persistent data:

mkdir -p ~/.llama

Run the container:
podman run -it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://host.containers.internal:11434 \
llamastack/distribution-ollama \
--port $LLAMA_STACK_PORT

Optional: Use a custom network:
podman network create llama-net
podman run --privileged --network llama-net -it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
llamastack/distribution-ollama \
--port $LLAMA_STACK_PORT

Verify the container is running:

podman ps

Note: Another option is to connect to a remote Llama Stack service. Based on the instructions in this quickstart, you should have one running in the cluster, and you can use oc port-forward to make it available on localhost.
oc get services -l app.kubernetes.io/name=llamastack
oc port-forward svc/llamastack 8321:8321

This makes the remote service available at localhost:8321.
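With the port-forward running, a quick request from another terminal confirms the remote server is reachable; the `/v1/models` path is an assumption and may differ by Llama Stack version:

```bash
# Should return the models registered on the remote Llama Stack server
curl -s http://localhost:8321/v1/models
```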
For a simpler deployment experience, use the provided Makefile commands, which use podman-compose under the hood to automatically set up Llama Stack and the RAG UI. Ollama runs directly on your host machine instead of in a container for better performance.
💡 Tip: Run `make help` in the deploy/local directory to see all available commands and their descriptions.
- Navigate to the local deployment directory:
cd deploy/local

- First-time setup only - Install and configure Ollama on your host machine:

make setup-ollama

This will:
- Install Ollama on your host machine (if not already installed)
- Start the Ollama service
- Pull the `llama3.2:3b-instruct-fp16` model
Note: This step is only needed once. After Ollama is installed, it will run as a background service (especially on macOS).
- Start all containerized services:
make start
# or use the shorthand alias:
make up

This will:
- Check that Ollama is running on the host (will fail if not running)
- Start Llama Stack server using `podman-compose` (connects to host Ollama)
- Run the ingestion service to populate vector databases
- Start the RAG UI
Important: make start (or make up) only starts the containerized services using podman-compose - it does NOT install or start Ollama. If you haven't run make setup-ollama yet, make start will fail with an error asking you to start Ollama first.
- Access the services:
- RAG UI: http://localhost:8501
- Llama Stack API: http://localhost:8321
- Ollama API (host): http://localhost:11434
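Beyond `make test-services` (covered below), a quick way to confirm each endpoint responds is a round of curl checks; the Llama Stack health path is an assumption and may vary by version:

```bash
# RAG UI (Streamlit) - expect an HTTP 200
curl -s -o /dev/null -w "UI: %{http_code}\n" http://localhost:8501

# Llama Stack API - /v1/health is assumed; fall back to `make test-services` if it 404s
curl -s -o /dev/null -w "Llama Stack: %{http_code}\n" http://localhost:8321/v1/health

# Ollama on the host
curl -s http://localhost:11434/api/version
```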
Note: For detailed information about running Ollama on the host, see OLLAMA_SETUP.md.
# Check if Ollama is running on host
make check-ollama
# View all services status
make status
# View logs for all services
make logs
# View logs for specific service
make logs-ui
make logs-llamastack
make logs-ingestion
# Test all service endpoints
make test-services
# Restart all services
make restart
# Pull latest container images
make pull
# Stop all services (Note: This does NOT stop Ollama on host)
make stop
# Clean up everything (including containers, networks, and volumes)
make cleanup
# Complete reset (cleanup and rebuild everything)
make reset
# View current configuration
make config
# Monitor real-time resource usage
make monitor
# Open shell in llamastack container
make shell-llamastack

Ingestion Commands:
# Re-run ingestion service
make ingest
# Edit ingestion configuration
make ingest-config
# List all vector databases
make list-vector-dbs

Development Mode: Start backend services in containers and run UI locally for faster development:
make dev

This starts Llama Stack in a container (using host Ollama) and runs the UI locally with hot-reload enabled.
UI Build Commands (for advanced users):
# Build the RAG UI container locally
make build-ui
# Build and push UI container to registry
make build-and-push-ui

API Key Management:
# Clear saved TAVILY_SEARCH_API_KEY
make clear-tavily-key

Quick Aliases:
make up # Alias for 'make start'
make down # Alias for 'make stop'
make ps # Alias for 'make status'

The local deployment includes an automatic ingestion service that creates and populates vector databases from various sources.
Edit deploy/local/ingestion-config.yaml to configure your document sources:
pipelines:
  my-docs-pipeline:
    enabled: true
    name: "my-docs-db"
    version: "1.0"
    vector_store_name: "my-docs-db-v1-0"
    source: GITHUB # or S3, URL
    config:
      url: "https://github.com/yourorg/your-repo.git"
      path: "docs"
      branch: "main"

source: GITHUB
config:
  url: "https://github.com/rh-ai-quickstart/RAG.git"
  path: "notebooks/hr"
  branch: "main"
  # token: "ghp_xxx" # Optional for private repos

source: S3
config:
  endpoint: "http://minio:9000"
  bucket: "documents"
  access_key: "minio_user"
  secret_key: "minio_password"
  # prefix: "folder/" # Optional

source: URL
config:
  urls:
    - "https://example.com/document1.pdf"
    - "https://example.com/document2.pdf"

View ingestion progress:
make logs-ingestion

Re-run ingestion:

make ingest

Edit configuration:

make ingest-config

Check created databases:

make list-vector-dbs

The ingestion service will:
- Wait for Llama Stack to be ready
- Fetch documents from configured sources
- Process PDFs with Docling (advanced chunking)
- Create vector databases with embeddings
- Insert all chunks into PGVector
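`make list-vector-dbs` is the supported way to inspect the result, but as a sketch you can also query the Llama Stack API directly once ingestion finishes; the `/v1/vector-dbs` route is an assumption and may differ between Llama Stack versions:

```bash
# List the vector databases Llama Stack knows about after ingestion completes
curl -s http://localhost:8321/v1/vector-dbs
```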
For more details, see the Ingestion Service README.
- Navigate to the frontend directory
cd frontend

- Run the RAG UI application using the provided start script

./start.sh

This script will automatically:
- Create a virtual environment (if it doesn't exist)
- Install/sync all dependencies with `uv sync`
- Start the Streamlit server with auto-reload enabled
- Watch for file changes and automatically restart
- Open your browser to the displayed URL (typically http://localhost:8501)
- Upload your PDF documents through the UI and start asking questions!
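If you prefer not to use start.sh, the manual equivalent looks roughly like this. This is only a sketch: the entry point name app.py is a hypothetical placeholder, so check the frontend directory for the actual Streamlit entry point.

```bash
# Roughly what start.sh automates (entry point filename is hypothetical)
uv sync                      # create the virtual environment and install dependencies
uv run streamlit run app.py  # start the Streamlit server
```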
The RAG UI supports the following optional environment variables for configuration:
| Environment Variable | Description | Default Value |
|---|---|---|
| `LLAMA_STACK_ENDPOINT` | The endpoint for the Llama Stack API server | http://localhost:8321 |
| `FIREWORKS_API_KEY` | API key for Fireworks AI provider (optional) | (empty string) |
| `TOGETHER_API_KEY` | API key for Together AI provider (optional) | (empty string) |
| `SAMBANOVA_API_KEY` | API key for SambaNova provider (optional) | (empty string) |
| `OPENAI_API_KEY` | API key for OpenAI provider (optional) | (empty string) |
| `TAVILY_SEARCH_API_KEY` | API key for Tavily search provider (optional) | (empty string) |
Note: For the default local deployment using Ollama, you typically only need to ensure LLAMA_STACK_ENDPOINT points to your running Llama Stack server. The API keys are only needed if you want to use external AI providers instead of the local Ollama setup.
To set environment variables before starting the UI:
export LLAMA_STACK_ENDPOINT=http://localhost:8321
./start.sh

Deployment after making changes requires a rebuild of the container image using either docker or podman. Replace docker.io with your target container registry such as quay.io.
Note: Use your favorite repository, organization and image name
export CONTAINER_REGISTRY=docker.io # quay.io
export CONTAINER_ORGANIZATION=yourorg
export IMAGE_NAME=rag
export IMAGE_TAG=1.0.0

Build the image:

podman buildx build --platform linux/amd64,linux/arm64 -t $CONTAINER_REGISTRY/$CONTAINER_ORGANIZATION/$IMAGE_NAME:$IMAGE_TAG -f Containerfile .

Log in to the registry:

podman login $CONTAINER_REGISTRY

Push the image:

podman push $CONTAINER_REGISTRY/$CONTAINER_ORGANIZATION/$IMAGE_NAME:$IMAGE_TAG

Update deploy/helm/rag/values.yaml, replacing the placeholders with your values:
image:
  repository: {CONTAINER_REGISTRY}/{CONTAINER_ORGANIZATION}/{IMAGE_NAME}
  pullPolicy: IfNotPresent
  tag: {IMAGE_TAG}

To redeploy to the cluster, run the same make command as you did before.
Check if Ollama is running on host:
make check-ollama
# Or manually:
curl http://localhost:11434/api/version
ollama list # List available models

If Ollama is not running:
# Recommended: Use make command to setup/restart Ollama
make setup-ollama
# Or manually start Ollama (on Linux):
ollama serve &
# On macOS, Ollama typically runs as a background service automatically

Check if containers are running:
# Using make (recommended - shows podman-compose status):
make status
# Or check directly with podman:
podman ps

Test all services:

make test-services

View service logs:
make logs # All services
make logs-llamastack # Llama Stack only
make logs-ui # UI only
make logs-ingestion # Ingestion service

Ollama can't be reached from containers:
If Llama Stack can't connect to Ollama on the host:
# First, ensure Ollama is running
make check-ollama
# If that fails, verify Ollama is listening on all interfaces
export OLLAMA_HOST=0.0.0.0
ollama serve &
# Test from container
podman exec -it rag-llamastack curl http://host.containers.internal:11434/api/version
# On Linux, you may need to use the host IP directly:
podman exec -it rag-llamastack curl http://172.17.0.1:11434/api/version

Model not found:
# Pull the required model
ollama pull llama3.2:3b-instruct-fp16
# List available models
ollama list

Reinstall dependencies if needed:

uv sync --reinstall

Test the client in Python:

uv run python -c "from llama_stack_client import LlamaStackClient; print(LlamaStackClient)"

Check uv environment:

uv pip list

Run commands in the uv environment:

uv run <command>

Reset the entire environment:
rm -rf .venv
uv sync

Complete reset of local deployment:
# Option 1: Use the reset command (cleanup and rebuild)
make reset
# Option 2: Manual step-by-step reset
make cleanup # Remove all containers and volumes
make setup-ollama # Reinstall Ollama and models (if needed)
make start # Start services fresh

For more detailed troubleshooting related to Ollama running on the host, see OLLAMA_SETUP.md.