RAG on Kubernetes

This repository shows how to ship a Retrieval-Augmented Generation (RAG) stack from a local developer workflow to a Kubernetes cluster. It includes a FastAPI backend wired to Postgres/pgvector, a React frontend, local llama.cpp and TEI services, plus the container and infrastructure assets required to run everything end-to-end.

Added Value

  • Opinionated starting point for running a complete RAG workflow (ingest → embed → store → retrieve → generate) without relying on hosted APIs.
  • Reproducible Kubernetes deployment with manifests, secrets scaffolding, and Terraform automation for managed clusters.
  • Local-first developer workflow powered by Docker Compose, hot-reload servers, and shared model volumes so you can iterate quickly.
  • Model-serving targets for both embeddings (TEI) and llama.cpp-based LLMs that map cleanly between local and cluster environments.
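The ingest → embed → store → retrieve flow above can be sketched in a few lines of plain Python. This is a toy illustration only: in the real stack, TEI produces the embeddings, pgvector holds them, and llama.cpp handles generation. All names here are illustrative and not taken from the repo.

```python
# Toy sketch of the RAG flow (ingest -> embed -> store -> retrieve).
# Stand-in embedding: vowel-frequency vector; a real model returns 768 dims.
store: list[tuple[str, list[float]]] = []

def embed(text: str) -> list[float]:
    return [text.count(c) / max(len(text), 1) for c in "aeiou"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def ingest(chunk: str) -> None:
    # Embed each chunk once at ingest time and keep it alongside the text.
    store.append((chunk, embed(chunk)))

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank stored chunks by similarity to the query embedding.
    qv = embed(query)
    ranked = sorted(store, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

The retrieved chunks would then be stuffed into the LLM prompt for the generate step.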

Technology Stack

  • Backend: FastAPI + SQLModel + LangChain, backed by Postgres with pgvector.
  • Frontend: React (Vite) + TypeScript + Tailwind + shadcn/ui + Zustand.
  • RAG Services: Text Embeddings Inference (nomic-embed-text) and llama.cpp running Qwen 2.5 1.5B Instruct.
  • Orchestration: Dockerfiles, Docker Compose, Kubernetes manifests, and Terraform modules under deploy/.
  • Tooling: Alembic migrations, pytest-ready backend, ESLint/TypeScript checks, Git LFS for model pulls.

Prerequisites

  • Docker (with BuildKit) and Docker Compose v2
    • Important: Configure Docker Desktop (or equivalent) with at least 16 GB of RAM (24 GB recommended) and 10 CPUs
    • The embedding and LLM services are resource-intensive and will fail or perform poorly with less memory
    • On macOS: Docker Desktop → Preferences → Resources → increase Memory slider and CPU allocation
  • Python 3.12+ with pip or uv for backend dependencies (Python 3.9 is not supported due to union type syntax).
  • Node.js 20+ and npm for the frontend.
  • kubectl plus a local cluster provider such as kind or minikube for Kubernetes testing.
  • Terraform (optional) when provisioning via deploy/k8s-terraform/.
  • Git LFS for downloading the embedding and LLM model weights.

Environment Variables

  • POSTGRES_URL (required): SQLAlchemy URL including the postgresql+psycopg driver and pgvector parameters.
  • POSTGRES_SERVER, POSTGRES_PORT, POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB: individual settings used when POSTGRES_URL is not provided.
  • EMBEDDING_MODEL, EMBEDDINGS_BASE_URL: switch embedding model or point to an OpenAI-compatible endpoint.
  • TEI_BASE_URL: defaults to the in-cluster or Compose TEI service endpoint.
  • LOCAL_LLM_BASE_URL, LOCAL_LLM_MODEL, LOCAL_LLM_STREAMING: configure llama.cpp/Qwen runtime.
  • PDF_DIR: location for uploaded or seeded PDFs mounted into the backend container.
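The fallback between POSTGRES_URL and the individual settings can be sketched as below. The helper name and defaults are illustrative, not taken from the repo's settings module; the `postgresql+psycopg` driver prefix is the one named above.

```python
# Sketch: prefer POSTGRES_URL, otherwise compose it from individual settings.
def postgres_url(env: dict[str, str]) -> str:
    if env.get("POSTGRES_URL"):
        return env["POSTGRES_URL"]
    return (
        "postgresql+psycopg://"
        f"{env['POSTGRES_USER']}:{env['POSTGRES_PASSWORD']}"
        f"@{env['POSTGRES_SERVER']}:{env.get('POSTGRES_PORT', '5432')}"
        f"/{env['POSTGRES_DB']}"
    )
```

In practice you would pass `dict(os.environ)` (or a pydantic settings object) rather than a hand-built dict.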

How to Run

Docker Compose (Full Stack)

Run the entire application stack (backend, frontend, Postgres, embeddings, and LLM) with a single command:

  1. Prepare environment files:

    # Copy and configure Postgres credentials
    cp .env.postgres.example .env.postgres
    
    # Copy and configure backend environment
    cp backend/.env.example backend/.env

    Edit these files if you need to change default credentials or service endpoints.

  2. Pull model weights (required for embeddings and LLM services):

    # Nomic embeddings model
    mkdir -p ~/rag-tei/models
    cd ~/rag-tei/models
    git lfs install
    git clone https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
    cd nomic-embed-text-v1.5 && git lfs pull && cd ..
    
    # Update config.json inside nomic model with these values:
    # "hidden_size": 768, "num_attention_heads": 12, "num_hidden_layers": 12
    
    # Qwen LLM model
    mkdir -p ~/rag-chat/models
    cd ~/rag-chat/models
    git clone https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF
    cd Qwen2.5-1.5B-Instruct-GGUF && git lfs pull && cd ..
  3. Start all services:

    docker compose up

    Or run in detached mode:

    docker compose up -d
  4. Access the application:

    • Frontend (chat UI): http://localhost:8080
    • Backend API (Swagger UI): http://localhost:8000/docs

  5. View logs (if running in detached mode):

    docker compose logs -f
  6. Stop the stack:

    docker compose down

    To remove all data including the database:

    docker compose down --volumes

Local Development (Without Docker)

For faster iteration with hot-reload during development:

  1. Copy .env.postgres.example to .env.postgres and edit credentials if needed.
  2. Pull the embedding and LLM model weights with Git LFS (see "Build and Run (Docker)" for commands) so TEI and llama.cpp have local data.
  3. Start the supporting services: docker compose up -d postgres_dev embedding_dev llm_dev.
  4. Launch the backend: cd backend && uvicorn app:app --reload --port 8000. Hot reload works against the Compose services.
  5. Start the frontend: cd frontend && npm install && npm run dev (Vite serves on http://localhost:5173 by default).

Container Images

  1. Build the backend image: docker build -f deploy/containers/Dockerfile.backend -t rag-backend:dev .
  2. Build the frontend image: docker build -f deploy/containers/Dockerfile.frontend --build-arg VITE_API_URL=http://localhost:8000 -t rag-frontend:dev .
  3. Build the llama.cpp image (optional, for local GPU/CPU inference): docker build -f deploy/containers/Dockerfile.llamacpp -t rag-llm:qwen2.5-1.5b .

Kubernetes (kind or minikube)

  1. Provision a cluster (example): kind create cluster --config deploy/k8s/kind-cluster.yaml.
  2. Load or push your images so the cluster can pull them (kind load docker-image rag-backend:dev rag-frontend:dev).
  3. Update deploy/k8s/secrets.yaml with your database credentials and ensure the TEI/LLM ConfigMap values match your deployment.
  4. Apply the manifests: kubectl apply -f deploy/k8s/ (namespace, Postgres, embeddings, LLM, backend, frontend, network policies).
  5. Port-forward or expose the services. For example, kubectl port-forward svc/backend -n rag-dev 8000:8000 and browse to http://localhost:8000/docs.

Project Structure

  • backend/: FastAPI app (app.py), SQLModel schemas, services, and Alembic migrations.
  • frontend/: React + Vite UI with Zustand state and shadcn/ui components.
  • deploy/containers/: Dockerfiles for backend, frontend, and llama.cpp runtime.
  • deploy/k8s/: Kubernetes manifests for local clusters (namespace, secrets, Deployments, Services, PVCs, NetworkPolicies).
  • deploy/k8s-terraform/: Terraform modules for provisioning equivalent resources in managed Kubernetes.
  • docker-compose.yml: local stack with Postgres, TEI, llama.cpp, backend, and frontend services.

Directory Tree

.
├── backend/                 FastAPI app, models, services, and migrations
│   ├── app.py               FastAPI entrypoint wiring routes and dependencies
│   └── alembic/             Database migration scripts and env configuration
├── frontend/                React + Vite client with Zustand stores and UI components
│   ├── src/                 Application code, routes, and shared utilities
│   └── public/              Static assets served by Vite
├── deploy/
│   ├── containers/          Dockerfiles for backend, frontend, llama.cpp, and embeddings
│   ├── k8s/                 Kubernetes manifests (namespace, workloads, services, secrets)
│   └── k8s-terraform/       Terraform modules and scripts for provisioning clusters
├── docker-compose.yml       Local development stack definition
└── readme.md                Project overview and operating instructions

Build and Run (Docker)

Nomic Embeddings (CPU-only):

Pull model:

mkdir -p ~/rag-tei/models
cd ~/rag-tei/models

git lfs install
git clone https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
# (optional but safe) ensure all LFS files are present
cd nomic-embed-text-v1.5 && git lfs pull && cd ..

Update config.json inside the nomic model with the following values:

  "hidden_size": 768,
  "num_attention_heads": 12,
  "num_hidden_layers": 12
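A minimal sketch for applying these overrides programmatically instead of editing by hand. The keys and values come from this README; the helper name is mine, and everything else in config.json is left untouched.

```python
import json
from pathlib import Path

# The three overrides this step asks for (values from the README).
PATCH = {"hidden_size": 768, "num_attention_heads": 12, "num_hidden_layers": 12}

def patch_config(path: Path) -> dict:
    """Load config.json, apply the overrides, and write it back."""
    cfg = json.loads(path.read_text())
    cfg.update(PATCH)
    path.write_text(json.dumps(cfg, indent=2))
    return cfg
```

For example: `patch_config(Path("~/rag-tei/models/nomic-embed-text-v1.5/config.json").expanduser())`.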

Pull image:

# CPU-only (portable; good for M1/M2 too)
docker pull ghcr.io/huggingface/text-embeddings-inference:cpu-1.8 --platform linux/amd64

llama.cpp Chat (CPU-only):

Pull model:

mkdir -p ~/rag-chat/models
cd ~/rag-chat/models

git lfs install
git clone https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF
# (optional but safe) ensure all LFS files are present
cd Qwen2.5-1.5B-Instruct-GGUF && git lfs pull && cd ..

Roadmap

  • Add Kustomize/Helm manifests under deploy/k8s/ with Secrets, PVCs, and Ingress.
  • Horizontal Pod Autoscaler, PodDisruptionBudgets, and readiness/liveness probes.
  • Optional GPU-enabled embeddings/LLM runtime (vLLM) and node selectors/taints.
  • Observability: basic logs/metrics and guidance for Prometheus/Grafana.
  • Examples for managed clouds (GKE/EKS/AKS) and Terraform modules.

Contributing

Feedback and PRs welcome. Please treat this as a moving target and include your Kubernetes flavor, versions, and any deployment notes in issues.

Local Development with Docker Compose

Initial Setup

  1. Copy environment files:

    cp .env.postgres.example .env.postgres
    cp backend/.env.sample backend/.env

    Tweak credentials/ports in .env.postgres and backend/.env if needed.

  2. Create Python virtual environment (required for local Alembic):

    cd backend
    python3.12 -m venv venv  # Use Python 3.12 or higher
    source venv/bin/activate
    pip install -r requirements.txt
  3. Start database and supporting services:

    docker compose up -d postgres_dev embedding_dev llm_dev

    Wait for Postgres to be ready (check logs with docker compose logs postgres_dev).

  4. Run database migrations (critical - creates all tables):

    cd backend
    source venv/bin/activate  # Make sure you're in the venv
    alembic upgrade head

    This creates tables like chat_history, documents, and pdf_ingestion in the database.

  5. Start the full stack:

    # Option A: All services including backend/frontend
    docker compose up
    
    # Option B: Backend and frontend only (with supporting services already running)
    docker compose up backend_dev frontend_dev

    IMPORTANT: After starting, the embedding service (embedding_dev) needs to warm up the model. This takes 3-5 minutes on first startup. You'll see logs like:

    embedding_dev | Starting model backend
    embedding_dev | Warming up model
    embedding_dev | Ready
    

    Wait for "Ready" to appear before proceeding. The backend will automatically retry if the embedding service isn't ready yet.

  6. Upload a document (required before asking questions):

    • Open Swagger UI: http://localhost:8000/docs
    • Find the POST /v1/upload endpoint
    • Click "Try it out"
    • Upload a PDF file
    • Click "Execute"
    • You should see a response with the number of chunks processed

    Without documents in the system, the RAG pipeline has nothing to retrieve and answer questions from.

  7. Access the application:

    • Frontend (chat UI): http://localhost:8080
    • Backend API (Swagger UI): http://localhost:8000/docs

  8. Ask questions (through the frontend):

    • Open http://localhost:8080
    • Type your question related to the uploaded documents
    • Note: First query takes 2-3 minutes while the LLM loads and processes. Subsequent queries are faster.
    • The system retrieves relevant document chunks, embeds them, and generates an answer using the Qwen LLM
  9. Data management:

    • Data is stored in the pgdata Docker volume
    • Reset the database: docker compose down --volumes
    • View logs: docker compose logs -f
    • Stop the stack: docker compose down

Service Endpoints

When running docker compose up, the following endpoints are available on your host machine:

| Service            | URL                                | Purpose                                    |
| ------------------ | ---------------------------------- | ------------------------------------------ |
| Frontend (Chat UI) | http://localhost:8080              | Main application: chat with your documents |
| Backend API        | http://localhost:8000              | FastAPI backend (base endpoint)            |
| Swagger UI         | http://localhost:8000/docs         | Upload documents & test API endpoints      |
| ReDoc              | http://localhost:8000/redoc        | Alternative API documentation (read-only)  |
| OpenAPI Schema     | http://localhost:8000/openapi.json | Raw OpenAPI specification                  |
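A quick, stdlib-only way to check whether these endpoints are answering before you start uploading. The helper is illustrative, not part of the repo.

```python
import urllib.error
import urllib.request

def is_ready(url: str, timeout: float = 2.0) -> bool:
    """True if the endpoint answers at all (any HTTP status below 500)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as exc:
        return exc.code < 500  # a 4xx still means the server is up
    except (urllib.error.URLError, OSError):
        return False  # connection refused, DNS failure, or timeout
```

For example, `is_ready("http://localhost:8000/docs")` should turn True once the backend is up.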

Quick Start Flow:

  1. Wait for embedding service to be "Ready" (3-5 minutes)
  2. Upload a PDF via Swagger UI: http://localhost:8000/docs → POST /v1/upload
  3. Open frontend: http://localhost:8080
  4. Ask questions about your documents (first query takes 2-3 minutes)

Swagger UI (/docs) is essential for document uploads. You can:

  • Upload PDF documents via POST /v1/upload endpoint
  • Browse all available endpoints
  • Test endpoints directly in the browser
  • View request/response schemas
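The upload can also be scripted instead of going through Swagger UI. Below is a stdlib-only sketch; the POST /v1/upload path comes from this README, but the helper names and the form field name "file" are assumptions — verify the field name in Swagger UI first.

```python
import json
import urllib.request
import uuid

def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "application/pdf") -> tuple[bytes, str]:
    """Assemble a multipart/form-data body by hand (no third-party deps)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"

def upload_pdf(base_url: str, path: str) -> dict:
    """POST a PDF to the upload endpoint and return the parsed JSON response."""
    with open(path, "rb") as f:
        body, ctype = build_multipart("file", path.rsplit("/", 1)[-1], f.read())
    req = urllib.request.Request(
        f"{base_url}/v1/upload", data=body,
        headers={"Content-Type": ctype}, method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Usage: `upload_pdf("http://localhost:8000", "my.pdf")` — the response should report the number of chunks processed.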

Common Issues & Fixes

Issue: psycopg.OperationalError: nodename nor servname provided

  • Cause: Missing virtual environment or incorrect Python version
  • Fix: Create venv with Python 3.12+ and install requirements as shown above

Issue: relation "chat_history" does not exist

  • Cause: Database migrations were not applied
  • Fix: Run alembic upgrade head in the backend directory (see step 4 above)

Issue: TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

  • Cause: Python 3.9 or earlier (union types | syntax not supported)
  • Fix: Use Python 3.12 or higher for the venv

Issue: Extra inputs are not permitted [type=extra_forbidden]

  • Cause: Missing database_url in config or environment variable mismatch
  • Fix: Ensure DATABASE_URL is set in .env, or configure the settings model to ignore unknown variables

Issue: httpx.ConnectError: [Errno 111] Connection refused when uploading documents

  • Cause: Embedding service is still warming up or not responding
  • Fix: Wait for "Ready" message in embedding_dev logs. The backend retries automatically up to 5 times with exponential backoff.
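The retry behaviour described in that fix can be sketched as below. The helper is illustrative, not the repo's actual code; the attempt count and exponential backoff match what the README states.

```python
import time

def retry_with_backoff(fn, attempts: int = 5, base_delay: float = 1.0,
                       sleep=time.sleep):
    """Call fn(); on ConnectionError wait base_delay * 2**i and try again."""
    for i in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if i == attempts - 1:
                raise  # out of attempts: surface the error
            sleep(base_delay * (2 ** i))  # 1s, 2s, 4s, 8s, ...
```

Injecting `sleep` keeps the helper testable without real waits.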

Issue: Embedding service crashes with exit code 137 (OOM kill)

  • Cause: Docker Desktop doesn't have enough memory allocated
  • Fix: Increase Docker memory to at least 16 GB (24 GB recommended) in Docker Desktop Preferences → Resources

Issue: LLM takes a very long time to respond or frontend seems frozen

  • Cause: Normal behavior on first query - the LLM model is loading into memory
  • Fix: First query takes 2-3 minutes. Subsequent queries are faster. Be patient.

The Compose stack uses the pgvector/pgvector:16 image for persistence and shares credentials with the backend via .env.postgres.
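For reference, a retrieval query against pgvector typically looks like the sketch below. The table and column names are illustrative, not lifted from this repo's schema; `<=>` is pgvector's cosine-distance operator, and the embedding vector is bound as a parameter at execute time.

```python
def top_k_sql(k: int) -> str:
    """Build a parameterised top-k similarity query for pgvector."""
    if k < 1:
        raise ValueError("k must be positive")
    return (
        "SELECT content, embedding <=> %(query_embedding)s::vector AS distance "
        "FROM documents "          # hypothetical table name
        "ORDER BY distance "       # smallest cosine distance = most similar
        f"LIMIT {k}"
    )
```

With psycopg this would run as `cur.execute(top_k_sql(4), {"query_embedding": vec})`.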

Run the stack on Kubernetes (local, prod-like)

Spin up a local Kubernetes cluster on macOS (Apple Silicon supported) that’s close to production, then deploy the stack using the manifests under deploy/k8s/ and your existing images.

Prerequisites

  • Docker Desktop for Mac (Apple Silicon OK)
  • kubectl, helm, kind
# macOS (Homebrew)
brew install kind kubectl helm

Provision the stack with Terraform

# 1. Export ingress hostnames (or add them to .env and run the helper script)
export FRONTEND_HOST=app.localtest.me
export API_HOST=api.localtest.me

# Optional: generate terraform.auto.tfvars from a .env file
./deploy/k8s-terraform/scripts/generate-tfvars.sh

# 2. Initialize and apply
cd deploy/k8s-terraform
terraform init -upgrade
terraform apply \
  -var "backend_image=rag-backend:dev" \
  -var "frontend_image=rag-frontend:dev" \
  -var "postgres_url=postgresql+asyncpg://..." \
  -var "postgres_server=..." \
  -var "postgres_user=..." \
  -var "postgres_password=..." \
  -var "postgres_db=..." \
  -var "enable_tls=false"

Prepare model/data folders (if mounting local data)

If your manifests mount local model directories (e.g., for llama.cpp or embeddings), ensure they exist and are populated as per your current Docker workflow:

mkdir -p ~/rag-chat/models
mkdir -p ~/rag-tei/models

Access the app

The Ingress is configured for app.rag.me with a self-signed cert (issued by selfsigned-issuer).

open https://app.rag.me

If you prefer to skip TLS during testing, remove the TLS section from the Ingress and use http://app.rag.me.

Troubleshooting

  • Pods Pending → check default StorageClass and PVC binding: kubectl get sc,pvc -A.
  • ImagePullBackOff → ensure kind load docker-image ... for locally built images or push to a reachable registry.
  • Apple Silicon perf → x86_64 embeddings under emulation can be slow; consider a native ARM alternative once the flow is verified.
  • DB connectivity → verify backend env/Secret for POSTGRES_DB, POSTGRES_USER, POSTGRES_PASSWORD.

About

Terraform structure for a Kubernetes cluster running an air-gapped RAG solution.
