This repository shows how to ship a Retrieval-Augmented Generation (RAG) stack from a local developer workflow to a Kubernetes cluster. It includes a FastAPI backend wired to Postgres/pgvector, a React frontend, local llama.cpp and TEI services, plus the container and infrastructure assets required to run everything end-to-end.
- Opinionated starting point for running a complete RAG workflow (ingest → embed → store → retrieve → generate) without relying on hosted APIs.
- Reproducible Kubernetes deployment with manifests, secrets scaffolding, and Terraform automation for managed clusters.
- Local-first developer workflow powered by Docker Compose, hot-reload servers, and shared model volumes so you can iterate quickly.
- Model-serving targets for both embeddings (TEI) and llama.cpp-based LLMs that map cleanly between local and cluster environments.
- Backend: FastAPI + SQLModel + LangChain, backed by Postgres with pgvector.
- Frontend: React (Vite) + TypeScript + Tailwind + shadcn/ui + Zustand.
- RAG Services: Text Embeddings Inference (nomic-embed-text) and llama.cpp running Qwen 2.5 1.5B Instruct.
- Orchestration: Dockerfiles, Docker Compose, Kubernetes manifests, and Terraform modules under `deploy/`.
- Tooling: Alembic migrations, pytest-ready backend, ESLint/TypeScript checks, Git LFS for model pulls.
- Docker (with BuildKit) and Docker Compose v2
- Important: Configure Docker Desktop (or equivalent) with at least 16 GB RAM (recommended 24 GB) and 10 CPUs
- The embedding and LLM services are resource-intensive and will fail or perform poorly with less memory
- On macOS: Docker Desktop → Preferences → Resources → increase Memory slider and CPU allocation
- Python 3.12+ with `pip` or `uv` for backend dependencies (Python 3.9 is not supported due to union type syntax).
- Node.js 20+ and `npm` for the frontend.
- `kubectl` plus a local cluster provider such as `kind` or `minikube` for Kubernetes testing.
- Terraform (optional) when provisioning via `deploy/k8s-terraform/`.
- Git LFS for downloading the embedding and LLM model weights.
- `POSTGRES_URL` (required): SQLAlchemy URL including the `postgresql+psycopg` driver and pgvector parameters.
- `POSTGRES_SERVER`, `POSTGRES_PORT`, `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`: individual settings used when `POSTGRES_URL` is not provided.
- `EMBEDDING_MODEL`, `EMBEDDINGS_BASE_URL`: switch the embedding model or point to an OpenAI-compatible endpoint.
- `TEI_BASE_URL`: default is the in-cluster or Compose TEI service.
- `LOCAL_LLM_BASE_URL`, `LOCAL_LLM_MODEL`, `LOCAL_LLM_STREAMING`: configure the llama.cpp/Qwen runtime.
- `PDF_DIR`: location for uploaded or seeded PDFs mounted into the backend container.
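The fallback between `POSTGRES_URL` and the individual settings can be sketched as below. This is a stdlib-only illustration; the default values in it are assumptions, not values read from the backend config.

```python
def resolve_postgres_url(env: dict) -> str:
    """Use POSTGRES_URL when set; otherwise assemble a SQLAlchemy URL
    from the individual POSTGRES_* settings, as described above."""
    if env.get("POSTGRES_URL"):
        return env["POSTGRES_URL"]
    user = env.get("POSTGRES_USER", "postgres")          # defaults here are
    password = env.get("POSTGRES_PASSWORD", "postgres")  # illustrative only
    server = env.get("POSTGRES_SERVER", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    db = env.get("POSTGRES_DB", "rag")
    # The backend expects the psycopg (v3) driver in the URL scheme.
    return f"postgresql+psycopg://{user}:{password}@{server}:{port}/{db}"

print(resolve_postgres_url({"POSTGRES_USER": "rag", "POSTGRES_PASSWORD": "pw",
                            "POSTGRES_SERVER": "postgres_dev", "POSTGRES_DB": "ragdb"}))
# → postgresql+psycopg://rag:pw@postgres_dev:5432/ragdb
```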
Run the entire application stack (backend, frontend, Postgres, embeddings, and LLM) with a single command:
- Prepare environment files:

  ```bash
  # Copy and configure Postgres credentials
  cp .env.postgres.example .env.postgres
  # Copy and configure backend environment
  cp backend/.env.example backend/.env
  ```
Edit these files if you need to change default credentials or service endpoints.
- Pull model weights (required for the embeddings and LLM services):

  ```bash
  # Nomic embeddings model
  mkdir -p ~/rag-tei/models
  cd ~/rag-tei/models
  git lfs install
  git clone https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
  cd nomic-embed-text-v1.5 && git lfs pull && cd ..
  # Update config.json inside the nomic model with these values:
  # "hidden_size": 768, "num_attention_heads": 12, "num_hidden_layers": 12

  # Qwen LLM model
  mkdir -p ~/rag-chat/models
  cd ~/rag-chat/models
  git clone https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF
  cd Qwen2.5-1.5B-Instruct-GGUF && git lfs pull && cd ..
  ```
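The model-pull step above asks you to patch three values into the nomic model's `config.json`. A small sanity check, assuming the clone location used above:

```python
import json
from pathlib import Path

# Values the README asks you to set in the nomic model's config.json
REQUIRED = {"hidden_size": 768, "num_attention_heads": 12, "num_hidden_layers": 12}

def missing_nomic_settings(cfg: dict) -> dict:
    """Return the required settings that are absent or wrong in `cfg`."""
    return {k: v for k, v in REQUIRED.items() if cfg.get(k) != v}

if __name__ == "__main__":
    path = Path.home() / "rag-tei/models/nomic-embed-text-v1.5/config.json"
    if path.exists():
        missing = missing_nomic_settings(json.loads(path.read_text()))
        print("OK" if not missing else f"Still need to set: {missing}")
```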
- Start all services:

  ```bash
  docker compose up
  ```

  Or run in detached mode:

  ```bash
  docker compose up -d
  ```
- Access the application:

  - Frontend: http://localhost:8080
  - Backend API: http://localhost:8000
  - API documentation: http://localhost:8000/docs
- View logs (if running in detached mode):

  ```bash
  docker compose logs -f
  ```
- Stop the stack:

  ```bash
  docker compose down
  ```

  To remove all data including the database:

  ```bash
  docker compose down --volumes
  ```
For faster iteration with hot-reload during development:
- Copy `.env.postgres.example` to `.env.postgres` and edit credentials if needed.
- Pull the embedding and LLM model weights with Git LFS (see "Build and Run (Docker)" for commands) so TEI and llama.cpp have local data.
- Start the supporting services: `docker compose up -d postgres_dev embedding_dev llm_dev`.
- Launch the backend: `cd backend && uvicorn app:app --reload --port 8000`. Hot reload works against the Compose services.
- Start the frontend: `cd frontend && npm install && npm run dev` (Vite serves on http://localhost:5173 by default).
- Build the backend image: `docker build -f deploy/containers/Dockerfile.backend -t rag-backend:dev .`
- Build the frontend image: `docker build -f deploy/containers/Dockerfile.frontend --build-arg VITE_API_URL=http://localhost:8000 -t rag-frontend:dev .`
- Build the llama.cpp image (optional for local GPU/CPU inference): `docker build -f deploy/containers/Dockerfile.llamacpp -t rag-llm:qwen2.5-1.5b .`
- Provision a cluster (example): `kind create cluster --config deploy/k8s/kind-cluster.yaml`.
- Load or push your images so the cluster can pull them (`kind load docker-image rag-backend:dev rag-frontend:dev`).
- Update `deploy/k8s/secrets.yaml` with your database credentials and ensure the TEI/LLM ConfigMap values match your deployment.
- Apply the manifests: `kubectl apply -f deploy/k8s/` (namespace, Postgres, embeddings, LLM, backend, frontend, network policies).
- Port-forward or expose the services. For example, run `kubectl port-forward svc/backend -n rag-dev 8000:8000` and browse to http://localhost:8000/docs.
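To script the "wait until everything is up" part of the steps above, you can parse plain `kubectl get pods` output. A sketch, assuming the default NAME/READY/STATUS column layout:

```python
def pending_pods(kubectl_output: str) -> list[str]:
    """Return names of pods that are not fully Ready, given the plain text
    of `kubectl get pods -n rag-dev` (header row + one row per pod)."""
    waiting = []
    for row in kubectl_output.strip().splitlines()[1:]:
        name, ready, status = row.split()[:3]
        current, desired = ready.split("/")
        if status != "Running" or current != desired:
            waiting.append(name)
    return waiting

sample = """NAME         READY   STATUS    RESTARTS   AGE
backend-0    1/1     Running   0          2m
llm-0        0/1     Pending   0          2m"""
print(pending_pods(sample))  # → ['llm-0']
```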
- `backend/`: FastAPI app (`app.py`), SQLModel schemas, services, and Alembic migrations.
- `frontend/`: React + Vite UI with Zustand state and shadcn/ui components.
- `deploy/containers/`: Dockerfiles for backend, frontend, and llama.cpp runtime.
- `deploy/k8s/`: Kubernetes manifests for local clusters (namespace, secrets, Deployments, Services, PVCs, NetworkPolicies).
- `deploy/k8s-terraform/`: Terraform modules for provisioning equivalent resources in managed Kubernetes.
- `docker-compose.yml`: local stack with Postgres, TEI, llama.cpp, backend, and frontend services.
```
.
├── backend/              FastAPI app, models, services, and migrations
│   ├── app.py            FastAPI entrypoint wiring routes and dependencies
│   └── alembic/          Database migration scripts and env configuration
├── frontend/             React + Vite client with Zustand stores and UI components
│   ├── src/              Application code, routes, and shared utilities
│   └── public/           Static assets served by Vite
├── deploy/
│   ├── containers/       Dockerfiles for backend, frontend, llama.cpp, and embeddings
│   ├── k8s/              Kubernetes manifests (namespace, workloads, services, secrets)
│   └── k8s-terraform/    Terraform modules and scripts for provisioning clusters
├── docker-compose.yml    Local development stack definition
└── readme.md             Project overview and operating instructions
```
Pull model:

```bash
mkdir -p ~/rag-tei/models
cd ~/rag-tei/models
git lfs install
git clone https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
# (optional but safe) ensure all LFS files are present
cd nomic-embed-text-v1.5 && git lfs pull && cd ..
```

Update `config.json` inside the nomic model with the following values:

```json
"hidden_size": 768,
"num_attention_heads": 12,
"num_hidden_layers": 12
```

Pull image:

```bash
# CPU-only (portable; good for M1/M2 too)
docker pull ghcr.io/huggingface/text-embeddings-inference:cpu-1.8 --platform linux/amd64
```

Pull model:

```bash
mkdir -p ~/rag-chat/models
cd ~/rag-chat/models
git lfs install
git clone https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF
# (optional but safe) ensure all LFS files are present
cd Qwen2.5-1.5B-Instruct-GGUF && git lfs pull && cd ..
```

- Add Kustomize/Helm manifests under `deploy/k8s/` with Secrets, PVCs, and Ingress.
- Horizontal Pod Autoscaler, PodDisruptionBudgets, and readiness/liveness probes.
- Optional GPU-enabled embeddings/LLM runtime (vLLM) and node selectors/taints.
- Observability: basic logs/metrics and guidance for Prometheus/Grafana.
- Examples for managed clouds (GKE/EKS/AKS) and Terraform modules.
Feedback and PRs welcome. Please treat this as a moving target and include your Kubernetes flavor, versions, and any deployment notes in issues.
- Copy environment files:

  ```bash
  cp .env.postgres.example .env.postgres
  cp backend/.env.sample backend/.env
  ```

  Tweak credentials/ports in `.env.postgres` and `backend/.env` if needed.

- Create a Python virtual environment (required for local Alembic):

  ```bash
  cd backend
  python3.12 -m venv venv  # Use Python 3.12 or higher
  source venv/bin/activate
  pip install -r requirements.txt
  ```
- Start the database and supporting services:

  ```bash
  docker compose up -d postgres_dev embedding_dev llm_dev
  ```

  Wait for Postgres to be ready (check logs with `docker compose logs postgres_dev`).

- Run database migrations (critical: this creates all tables):

  ```bash
  cd backend
  source venv/bin/activate  # Make sure you're in the venv
  alembic upgrade head
  ```

  This creates tables such as `chat_history`, `documents`, and `pdf_ingestion` in the database.
Start the full stack:
# Option A: All services including backend/frontend docker compose up # Option B: Backend and frontend only (with supporting services already running) docker compose up backend_dev frontend_dev
⏳ IMPORTANT: After starting, the embedding service (
embedding_dev) needs to warm up the model. This takes 3-5 minutes on first startup. You'll see logs like:embedding_dev | Starting model backend embedding_dev | Warming up model embedding_dev | ReadyWait for "Ready" to appear before proceeding. The backend will automatically retry if the embedding service isn't ready yet.
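The retry behaviour described above ("up to 5 times with exponential backoff") can be sketched like this. The attempt count and delays here are illustrative, not read from the backend source:

```python
import time

def wait_for_service(probe, attempts: int = 5, base_delay: float = 1.0) -> bool:
    """Call `probe()` until it returns True, sleeping 1s, 2s, 4s, ...
    between attempts. Returns False if all attempts fail."""
    for attempt in range(attempts):
        if probe():
            return True
        time.sleep(base_delay * 2 ** attempt)
    return False

# Example probe: hit TEI's health endpoint (URL and path are assumptions;
# adjust to whatever your TEI_BASE_URL exposes).
# import urllib.request
# ready = wait_for_service(
#     lambda: urllib.request.urlopen("http://localhost:8081/health").status == 200)
```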
- Upload a document (required before asking questions):

  - Open Swagger UI: http://localhost:8000/docs
  - Find the `POST /v1/upload` endpoint
  - Click "Try it out"
  - Upload a PDF file
  - Click "Execute"
  - You should see a response with the number of chunks processed

  Without documents in the system, the RAG pipeline has nothing to retrieve from when answering questions.
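If you prefer to script the upload instead of clicking through Swagger UI, here is a stdlib-only sketch. The form-field name `file` is an assumption; check the `POST /v1/upload` schema in `/docs` if the server rejects it.

```python
import urllib.request
import uuid

def build_upload_request(pdf_bytes: bytes, filename: str,
                         url: str = "http://localhost:8000/v1/upload",
                         field: str = "file") -> urllib.request.Request:
    """Assemble a multipart/form-data POST for the upload endpoint.
    The `field` name is an assumption, not taken from the backend code."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode() + pdf_bytes + f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )

# With the stack running:
# with open("doc.pdf", "rb") as f:
#     with urllib.request.urlopen(build_upload_request(f.read(), "doc.pdf")) as resp:
#         print(resp.read().decode())
```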
- Access the application:

  - Frontend (Chat UI): http://localhost:8080
  - Backend API: http://localhost:8000
  - Swagger UI (API testing & document upload): http://localhost:8000/docs
  - ReDoc (alternative API documentation): http://localhost:8000/redoc
  - OpenAPI schema: http://localhost:8000/openapi.json
- Ask questions (through the frontend):

  - Open http://localhost:8080
  - Type a question related to the uploaded documents
  - ⏳ Note: the first query takes 2-3 minutes while the LLM loads and processes. Subsequent queries are faster.
  - The system embeds your question, retrieves the most relevant document chunks, and generates an answer using the Qwen LLM
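The retrieval step above boils down to ranking stored chunk embeddings by similarity to the question embedding. A pure-Python sketch of the idea (in the real stack, pgvector does this server-side; the cosine scoring and the toy 2-D vectors here are illustrative):

```python
import math

def top_k_chunks(query_vec: list[float], chunks: list[dict], k: int = 3) -> list[str]:
    """Return the text of the k chunks whose embeddings are most
    cosine-similar to the query embedding."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return [c["text"] for c in ranked[:k]]

chunks = [
    {"text": "Invoices are due in 30 days.", "embedding": [0.9, 0.1]},
    {"text": "The office cat is named Mo.",  "embedding": [0.1, 0.9]},
]
print(top_k_chunks([1.0, 0.0], chunks, k=1))  # → ['Invoices are due in 30 days.']
```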
- Data management:

  - Data is stored in the `pgdata` Docker volume
  - Reset the database: `docker compose down --volumes`
  - View logs: `docker compose logs -f`
  - Stop the stack: `docker compose down`
When running `docker compose up`, the following endpoints are available on your host machine:
| Service | URL | Purpose |
|---|---|---|
| Frontend (Chat UI) | http://localhost:8080 | Main application - chat with your documents |
| Backend API | http://localhost:8000 | FastAPI backend (base endpoint) |
| Swagger UI | http://localhost:8000/docs | Upload documents & test API endpoints |
| ReDoc | http://localhost:8000/redoc | Alternative API documentation (read-only) |
| OpenAPI Schema | http://localhost:8000/openapi.json | Raw OpenAPI specification |
Quick Start Flow:

- Wait for the embedding service to be "Ready" (3-5 minutes)
- Upload a PDF via Swagger UI: http://localhost:8000/docs → `POST /v1/upload`
- Open the frontend: http://localhost:8080
- Ask questions about your documents (the first query takes 2-3 minutes)
Swagger UI (`/docs`) is essential for document uploads. You can:

- Upload PDF documents via the `POST /v1/upload` endpoint
- Browse all available endpoints
- Test endpoints directly in the browser
- View request/response schemas
Issue: `psycopg.OperationalError: nodename nor servname provided`
- Cause: Missing virtual environment or incorrect Python version
- Fix: Create venv with Python 3.12+ and install requirements as shown above
Issue: `relation "chat_history" does not exist`
- Cause: Database migrations were not applied
- Fix: Run `alembic upgrade head` in the backend directory (see step 4 above)
Issue: `TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'`
- Cause: Python 3.9 or earlier (the `|` union type syntax is not supported)
- Fix: Use Python 3.12 or higher for the venv
Issue: `Extra inputs are not permitted [type=extra_forbidden]`
- Cause: Missing `database_url` in config or an environment variable mismatch
- Fix: Ensure `DATABASE_URL` is set in `.env` (or ignored in config)
Issue: `httpx.ConnectError: [Errno 111] Connection refused` when uploading documents
- Cause: Embedding service is still warming up or not responding
- Fix: Wait for "Ready" message in embedding_dev logs. The backend retries automatically up to 5 times with exponential backoff.
Issue: Embedding service crashes with exit code 137 (OOM kill)
- Cause: Docker Desktop doesn't have enough memory allocated
- Fix: Increase Docker memory to at least 16 GB (24 GB recommended) in Docker Desktop Preferences → Resources
Issue: LLM takes a very long time to respond or frontend seems frozen
- Cause: Normal behavior on first query - the LLM model is loading into memory
- Fix: First query takes 2-3 minutes. Subsequent queries are faster. Be patient.
The Compose stack uses the `pgvector/pgvector:16` image for persistence and shares credentials with the backend via `.env.postgres`.
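Under the hood, similarity search against pgvector is a single SQL query. A hedged sketch of its shape (the table and column names are assumptions, not read from the backend models; `<=>` is pgvector's cosine-distance operator):

```python
def similarity_sql(table: str = "documents", column: str = "embedding", k: int = 4) -> str:
    """Build the nearest-neighbour query shape pgvector executes.
    `<=>` computes cosine distance; `<->` would give L2 distance instead."""
    return (
        f"SELECT text, {column} <=> %(query_embedding)s AS distance "
        f"FROM {table} ORDER BY distance LIMIT {k}"
    )

print(similarity_sql())
```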
Spin up a local Kubernetes cluster on macOS (Apple Silicon supported) that’s close to production, then deploy the stack using the manifests under deploy/k8s/ and your existing images.
- Docker Desktop for Mac (Apple Silicon OK)
- `kubectl`, `helm`, `kind`
```bash
# macOS (Homebrew)
brew install kind kubectl helm
```

```bash
# 1. Export ingress hostnames (or add them to .env and run the helper script)
export FRONTEND_HOST=app.localtest.me
export API_HOST=api.localtest.me

# Optional: generate terraform.auto.tfvars from a .env file
./deploy/k8s-terraform/scripts/generate-tfvars.sh

# 2. Initialize and apply
cd deploy/k8s-terraform
terraform init -upgrade
terraform apply \
  -var "backend_image=rag-backend:dev" \
  -var "frontend_image=rag-frontend:dev" \
  -var "postgres_url=postgresql+asyncpg://..." \
  -var "postgres_server=..." \
  -var "postgres_user=..." \
  -var "postgres_password=..." \
  -var "postgres_db=..." \
  -var "enable_tls=false"
```

If your manifests mount local model directories (e.g., for llama.cpp or embeddings), ensure they exist and are populated as per your current Docker workflow:

```bash
mkdir -p ~/rag-chat/models
mkdir -p ~/rag-tei/models
```

The Ingress is configured for app.rag.me with a self-signed cert (issued by `selfsigned-issuer`).

```bash
open https://app.rag.me
```

If you prefer to skip TLS during testing, remove the TLS section from the Ingress and use http://app.rag.me.
- Pods Pending → check the default StorageClass and PVC binding: `kubectl get sc,pvc -A`.
- ImagePullBackOff → ensure `kind load docker-image ...` for locally built images, or push to a reachable registry.
- Apple Silicon perf → x86_64 embeddings under emulation can be slow; consider a native ARM alternative once the flow is verified.
- DB connectivity → verify the backend env/Secret for `POSTGRES_DB`, `POSTGRES_USER`, `POSTGRES_PASSWORD`.