This repository shows how to ship a Retrieval-Augmented Generation (RAG) stack from a local developer workflow to a Kubernetes cluster. It includes a FastAPI backend wired to Postgres/pgvector, a React frontend, local llama.cpp and TEI services, plus the container and infrastructure assets required to run everything end-to-end.
- Opinionated starting point for running a complete RAG workflow (ingest → embed → store → retrieve → generate) without relying on hosted APIs.
- Reproducible Kubernetes deployment with manifests, secrets scaffolding, and Terraform automation for managed clusters.
- Local-first developer workflow powered by Docker Compose, hot-reload servers, and shared model volumes so you can iterate quickly.
- Model-serving targets for both embeddings (TEI) and llama.cpp-based LLMs that map cleanly between local and cluster environments.
- Backend: FastAPI + SQLModel + LangChain, backed by Postgres with pgvector.
- Frontend: React (Vite) + TypeScript + Tailwind + shadcn/ui + Zustand.
- RAG Services: Text Embeddings Inference (nomic-embed-text) and llama.cpp running Qwen 2.5 1.5B Instruct.
- Orchestration: Dockerfiles, Docker Compose, Kubernetes manifests, and Terraform modules under `deploy/`.
- Tooling: Alembic migrations, pytest-ready backend, ESLint/TypeScript checks, Git LFS for model pulls.
- Docker (with BuildKit) and Docker Compose v2
- Important: Configure Docker Desktop (or equivalent) with at least 16 GB RAM (recommended 24 GB) and 10 CPUs
- The embedding and LLM services are resource-intensive and will fail or perform poorly with less memory
- On macOS: Docker Desktop → Preferences → Resources → increase Memory slider and CPU allocation
- Python 3.12+ with `pip` or `uv` for backend dependencies (Python 3.9 is not supported due to union type syntax).
- Node.js 20+ and `npm` for the frontend.
- `kubectl` plus a local cluster provider such as `kind` or `minikube` for Kubernetes testing.
- Terraform (optional) when provisioning via `deploy/k8s-terraform/`.
- Git LFS for downloading the embedding and LLM model weights.
- `POSTGRES_URL` (required): SQLAlchemy URL including the `postgresql+psycopg` driver and pgvector parameters.
- `POSTGRES_SERVER`, `POSTGRES_PORT`, `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`: individual settings used when `POSTGRES_URL` is not provided.
- `EMBEDDING_MODEL`, `EMBEDDINGS_BASE_URL`: switch the embedding model or point to an OpenAI-compatible endpoint.
- `TEI_BASE_URL`: default is the in-cluster or Compose TEI service.
- `LOCAL_LLM_BASE_URL`, `LOCAL_LLM_MODEL`, `LOCAL_LLM_STREAMING`: configure the llama.cpp/Qwen runtime.
- `PDF_DIR`: location for uploaded or seeded PDFs mounted into the backend container.
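The fallback between `POSTGRES_URL` and the individual settings can be sketched as below. This is a stdlib-only illustration; the default values in it are assumptions, not values read from the backend config.

```python
def resolve_postgres_url(env: dict) -> str:
    """Use POSTGRES_URL when set; otherwise assemble a SQLAlchemy URL
    from the individual POSTGRES_* settings, as described above."""
    if env.get("POSTGRES_URL"):
        return env["POSTGRES_URL"]
    user = env.get("POSTGRES_USER", "postgres")          # defaults here are
    password = env.get("POSTGRES_PASSWORD", "postgres")  # illustrative only
    server = env.get("POSTGRES_SERVER", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    db = env.get("POSTGRES_DB", "rag")
    # The backend expects the psycopg (v3) driver in the URL scheme.
    return f"postgresql+psycopg://{user}:{password}@{server}:{port}/{db}"

print(resolve_postgres_url({"POSTGRES_USER": "rag", "POSTGRES_PASSWORD": "pw",
                            "POSTGRES_SERVER": "postgres_dev", "POSTGRES_DB": "ragdb"}))
# → postgresql+psycopg://rag:pw@postgres_dev:5432/ragdb
```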
Run the entire application stack (backend, frontend, Postgres, embeddings, and LLM) with a single command:
- Prepare environment files:

  ```bash
  # Copy and configure Postgres credentials
  cp .env.postgres.example .env.postgres
  # Copy and configure backend environment
  cp backend/.env.example backend/.env
  ```
Edit these files if you need to change default credentials or service endpoints.
- Pull model weights (required for the embeddings and LLM services):

  ```bash
  # Nomic embeddings model
  mkdir -p ~/rag-tei/models
  cd ~/rag-tei/models
  git lfs install
  git clone https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
  cd nomic-embed-text-v1.5 && git lfs pull && cd ..
  # Update config.json inside the nomic model with these values:
  # "hidden_size": 768, "num_attention_heads": 12, "num_hidden_layers": 12

  # Qwen LLM model
  mkdir -p ~/rag-chat/models
  cd ~/rag-chat/models
  git clone https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF
  cd Qwen2.5-1.5B-Instruct-GGUF && git lfs pull && cd ..
  ```
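The model-pull step above asks you to patch three values into the nomic model's `config.json`. A small sanity check, assuming the clone location used above:

```python
import json
from pathlib import Path

# Values the README asks you to set in the nomic model's config.json
REQUIRED = {"hidden_size": 768, "num_attention_heads": 12, "num_hidden_layers": 12}

def missing_nomic_settings(cfg: dict) -> dict:
    """Return the required settings that are absent or wrong in `cfg`."""
    return {k: v for k, v in REQUIRED.items() if cfg.get(k) != v}

if __name__ == "__main__":
    path = Path.home() / "rag-tei/models/nomic-embed-text-v1.5/config.json"
    if path.exists():
        missing = missing_nomic_settings(json.loads(path.read_text()))
        print("OK" if not missing else f"Still need to set: {missing}")
```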
- Start all services:

  ```bash
  docker compose up
  ```

  Or run in detached mode:

  ```bash
  docker compose up -d
  ```
- Access the application:

  - Frontend: http://localhost:8080
  - Backend API: http://localhost:8000
  - API documentation: http://localhost:8000/docs
- View logs (if running in detached mode):

  ```bash
  docker compose logs -f
  ```
- Stop the stack:

  ```bash
  docker compose down
  ```

  To remove all data including the database:

  ```bash
  docker compose down --volumes
  ```
For faster iteration with hot-reload during development:
- Copy `.env.postgres.example` to `.env.postgres` and edit credentials if needed.
- Pull the embedding and LLM model weights with Git LFS (see "Build and Run (Docker)" for commands) so TEI and llama.cpp have local data.
- Start the supporting services: `docker compose up -d postgres_dev embedding_dev llm_dev`.
- Launch the backend: `cd backend && uvicorn app:app --reload --port 8000`. Hot reload works against the Compose services.
- Start the frontend: `cd frontend && npm install && npm run dev` (Vite serves on http://localhost:5173 by default).
- Build the backend image: `docker build -f deploy/containers/Dockerfile.backend -t rag-backend:dev .`
- Build the frontend image: `docker build -f deploy/containers/Dockerfile.frontend --build-arg VITE_API_URL=http://localhost:8000 -t rag-frontend:dev .`
- Build the llama.cpp image (optional for local GPU/CPU inference): `docker build -f deploy/containers/Dockerfile.llamacpp -t rag-llm:qwen2.5-1.5b .`
- Provision a cluster (example): `kind create cluster --config deploy/k8s/kind-cluster.yaml`.
- Load or push your images so the cluster can pull them (`kind load docker-image rag-backend:dev rag-frontend:dev`).
- Update `deploy/k8s/secrets.yaml` with your database credentials and ensure the TEI/LLM ConfigMap values match your deployment.
- Apply the manifests: `kubectl apply -f deploy/k8s/` (namespace, Postgres, embeddings, LLM, backend, frontend, network policies).
- Port-forward or expose the services. For example, run `kubectl port-forward svc/backend -n rag-dev 8000:8000` and browse to http://localhost:8000/docs.
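To script the "wait until everything is up" part of the steps above, you can parse plain `kubectl get pods` output. A sketch, assuming the default NAME/READY/STATUS column layout:

```python
def pending_pods(kubectl_output: str) -> list[str]:
    """Return names of pods that are not fully Ready, given the plain text
    of `kubectl get pods -n rag-dev` (header row + one row per pod)."""
    waiting = []
    for row in kubectl_output.strip().splitlines()[1:]:
        name, ready, status = row.split()[:3]
        current, desired = ready.split("/")
        if status != "Running" or current != desired:
            waiting.append(name)
    return waiting

sample = """NAME         READY   STATUS    RESTARTS   AGE
backend-0    1/1     Running   0          2m
llm-0        0/1     Pending   0          2m"""
print(pending_pods(sample))  # → ['llm-0']
```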
- `backend/`: FastAPI app (`app.py`), SQLModel schemas, services, and Alembic migrations.
- `frontend/`: React + Vite UI with Zustand state and shadcn/ui components.
- `deploy/containers/`: Dockerfiles for backend, frontend, and llama.cpp runtime.
- `deploy/k8s/`: Kubernetes manifests for local clusters (namespace, secrets, Deployments, Services, PVCs, NetworkPolicies).
- `deploy/k8s-terraform/`: Terraform modules for provisioning equivalent resources in managed Kubernetes.
- `docker-compose.yml`: local stack with Postgres, TEI, llama.cpp, backend, and frontend services.
```
.
├── backend/              FastAPI app, models, services, and migrations
│   ├── app.py            FastAPI entrypoint wiring routes and dependencies
│   └── alembic/          Database migration scripts and env configuration
├── frontend/             React + Vite client with Zustand stores and UI components
│   ├── src/              Application code, routes, and shared utilities
│   └── public/           Static assets served by Vite
├── deploy/
│   ├── containers/       Dockerfiles for backend, frontend, llama.cpp, and embeddings
│   ├── k8s/              Kubernetes manifests (namespace, workloads, services, secrets)
│   └── k8s-terraform/    Terraform modules and scripts for provisioning clusters
├── docker-compose.yml    Local development stack definition
└── readme.md             Project overview and operating instructions
```
Pull model:

```bash
mkdir -p ~/rag-tei/models
cd ~/rag-tei/models
git lfs install
git clone https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
# (optional but safe) ensure all LFS files are present
cd nomic-embed-text-v1.5 && git lfs pull && cd ..
```

Update `config.json` inside the nomic model with the following values:

```json
"hidden_size": 768,
"num_attention_heads": 12,
"num_hidden_layers": 12
```

Pull image:

```bash
# CPU-only (portable; good for M1/M2 too)
docker pull ghcr.io/huggingface/text-embeddings-inference:cpu-1.8 --platform linux/amd64
```

Pull model:

```bash
mkdir -p ~/rag-chat/models
cd ~/rag-chat/models
git lfs install
git clone https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF
# (optional but safe) ensure all LFS files are present
cd Qwen2.5-1.5B-Instruct-GGUF && git lfs pull && cd ..
```

- Add Kustomize/Helm manifests under `deploy/k8s/` with Secrets, PVCs, and Ingress.
- Horizontal Pod Autoscaler, PodDisruptionBudgets, and readiness/liveness probes.
- Optional GPU-enabled embeddings/LLM runtime (vLLM) and node selectors/taints.
- Observability: basic logs/metrics and guidance for Prometheus/Grafana.
- Examples for managed clouds (GKE/EKS/AKS) and Terraform modules.
Feedback and PRs welcome. Please treat this as a moving target and include your Kubernetes flavor, versions, and any deployment notes in issues.
- Copy environment files:

  ```bash
  cp .env.postgres.example .env.postgres
  cp backend/.env.sample backend/.env
  ```

  Tweak credentials/ports in `.env.postgres` and `backend/.env` if needed.

- Create a Python virtual environment (required for local Alembic):

  ```bash
  cd backend
  python3.12 -m venv venv  # Use Python 3.12 or higher
  source venv/bin/activate
  pip install -r requirements.txt
  ```
- Start the database and supporting services:

  ```bash
  docker compose up -d postgres_dev embedding_dev llm_dev
  ```

  Wait for Postgres to be ready (check logs with `docker compose logs postgres_dev`).

- Run database migrations (critical: this creates all tables):

  ```bash
  cd backend
  source venv/bin/activate  # Make sure you're in the venv
  alembic upgrade head
  ```

  This creates tables such as `chat_history`, `documents`, and `pdf_ingestion` in the database.
Start the full stack:
# Option A: All services including backend/frontend docker compose up # Option B: Backend and frontend only (with supporting services already running) docker compose up backend_dev frontend_dev
⏳ IMPORTANT: After starting, the embedding service (
embedding_dev) needs to warm up the model. This takes 3-5 minutes on first startup. You'll see logs like:embedding_dev | Starting model backend embedding_dev | Warming up model embedding_dev | ReadyWait for "Ready" to appear before proceeding. The backend will automatically retry if the embedding service isn't ready yet.
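The retry behaviour described above ("up to 5 times with exponential backoff") can be sketched like this. The attempt count and delays here are illustrative, not read from the backend source:

```python
import time

def wait_for_service(probe, attempts: int = 5, base_delay: float = 1.0) -> bool:
    """Call `probe()` until it returns True, sleeping 1s, 2s, 4s, ...
    between attempts. Returns False if all attempts fail."""
    for attempt in range(attempts):
        if probe():
            return True
        time.sleep(base_delay * 2 ** attempt)
    return False

# Example probe: hit TEI's health endpoint (URL and path are assumptions;
# adjust to whatever your TEI_BASE_URL exposes).
# import urllib.request
# ready = wait_for_service(
#     lambda: urllib.request.urlopen("http://localhost:8081/health").status == 200)
```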
- Upload a document (required before asking questions):

  - Open Swagger UI: http://localhost:8000/docs
  - Find the `POST /v1/upload` endpoint
  - Click "Try it out"
  - Upload a PDF file
  - Click "Execute"
  - You should see a response with the number of chunks processed

  Without documents in the system, the RAG pipeline has nothing to retrieve from when answering questions.
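If you prefer to script the upload instead of clicking through Swagger UI, here is a stdlib-only sketch. The form-field name `file` is an assumption; check the `POST /v1/upload` schema in `/docs` if the server rejects it.

```python
import urllib.request
import uuid

def build_upload_request(pdf_bytes: bytes, filename: str,
                         url: str = "http://localhost:8000/v1/upload",
                         field: str = "file") -> urllib.request.Request:
    """Assemble a multipart/form-data POST for the upload endpoint.
    The `field` name is an assumption, not taken from the backend code."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode() + pdf_bytes + f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )

# With the stack running:
# with open("doc.pdf", "rb") as f:
#     with urllib.request.urlopen(build_upload_request(f.read(), "doc.pdf")) as resp:
#         print(resp.read().decode())
```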
- Access the application:

  - Frontend (Chat UI): http://localhost:8080
  - Backend API: http://localhost:8000
  - Swagger UI (API testing & document upload): http://localhost:8000/docs
  - ReDoc (alternative API documentation): http://localhost:8000/redoc
  - OpenAPI schema: http://localhost:8000/openapi.json
- Ask questions (through the frontend):

  - Open http://localhost:8080
  - Type a question related to the uploaded documents
  - ⏳ Note: the first query takes 2-3 minutes while the LLM loads and processes. Subsequent queries are faster.
  - The system embeds your question, retrieves the most relevant document chunks, and generates an answer using the Qwen LLM
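The retrieval step above boils down to ranking stored chunk embeddings by similarity to the question embedding. A pure-Python sketch of the idea (in the real stack, pgvector does this server-side; the cosine scoring and the toy 2-D vectors here are illustrative):

```python
import math

def top_k_chunks(query_vec: list[float], chunks: list[dict], k: int = 3) -> list[str]:
    """Return the text of the k chunks whose embeddings are most
    cosine-similar to the query embedding."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return [c["text"] for c in ranked[:k]]

chunks = [
    {"text": "Invoices are due in 30 days.", "embedding": [0.9, 0.1]},
    {"text": "The office cat is named Mo.",  "embedding": [0.1, 0.9]},
]
print(top_k_chunks([1.0, 0.0], chunks, k=1))  # → ['Invoices are due in 30 days.']
```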
- Data management:

  - Data is stored in the `pgdata` Docker volume
  - Reset the database: `docker compose down --volumes`
  - View logs: `docker compose logs -f`
  - Stop the stack: `docker compose down`
When running `docker compose up`, the following endpoints are available on your host machine:
| Service | URL | Purpose |
|---|---|---|
| Frontend (Chat UI) | http://localhost:8080 | Main application - chat with your documents |
| Backend API | http://localhost:8000 | FastAPI backend (base endpoint) |
| Swagger UI | http://localhost:8000/docs | Upload documents & test API endpoints |
| ReDoc | http://localhost:8000/redoc | Alternative API documentation (read-only) |
| OpenAPI Schema | http://localhost:8000/openapi.json | Raw OpenAPI specification |
Quick Start Flow:

- Wait for the embedding service to be "Ready" (3-5 minutes)
- Upload a PDF via Swagger UI: http://localhost:8000/docs → `POST /v1/upload`
- Open the frontend: http://localhost:8080
- Ask questions about your documents (the first query takes 2-3 minutes)
Swagger UI (`/docs`) is essential for document uploads. You can:

- Upload PDF documents via the `POST /v1/upload` endpoint
- Browse all available endpoints
- Test endpoints directly in the browser
- View request/response schemas
Issue: `psycopg.OperationalError: nodename nor servname provided`
- Cause: Missing virtual environment or incorrect Python version
- Fix: Create venv with Python 3.12+ and install requirements as shown above
Issue: `relation "chat_history" does not exist`
- Cause: Database migrations were not applied
- Fix: Run `alembic upgrade head` in the backend directory (see step 4 above)
Issue: `TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'`
- Cause: Python 3.9 or earlier (the `|` union type syntax is not supported)
- Fix: Use Python 3.12 or higher for the venv
Issue: `Extra inputs are not permitted [type=extra_forbidden]`
- Cause: Missing `database_url` in config or an environment variable mismatch
- Fix: Ensure `DATABASE_URL` is set in `.env` (or ignored in config)
Issue: `httpx.ConnectError: [Errno 111] Connection refused` when uploading documents
- Cause: Embedding service is still warming up or not responding
- Fix: Wait for "Ready" message in embedding_dev logs. The backend retries automatically up to 5 times with exponential backoff.
Issue: Embedding service crashes with exit code 137 (OOM kill)
- Cause: Docker Desktop doesn't have enough memory allocated
- Fix: Increase Docker memory to at least 16 GB (24 GB recommended) in Docker Desktop Preferences → Resources
Issue: LLM takes a very long time to respond or frontend seems frozen
- Cause: Normal behavior on first query - the LLM model is loading into memory
- Fix: First query takes 2-3 minutes. Subsequent queries are faster. Be patient.
The Compose stack uses the `pgvector/pgvector:16` image for persistence and shares credentials with the backend via `.env.postgres`.
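Under the hood, similarity search against pgvector is a single SQL query. A hedged sketch of its shape (the table and column names are assumptions, not read from the backend models; `<=>` is pgvector's cosine-distance operator):

```python
def similarity_sql(table: str = "documents", column: str = "embedding", k: int = 4) -> str:
    """Build the nearest-neighbour query shape pgvector executes.
    `<=>` computes cosine distance; `<->` would give L2 distance instead."""
    return (
        f"SELECT text, {column} <=> %(query_embedding)s AS distance "
        f"FROM {table} ORDER BY distance LIMIT {k}"
    )

print(similarity_sql())
```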
Spin up a local Kubernetes cluster on macOS (Apple Silicon supported) that’s close to production, then deploy the stack using the manifests under deploy/k8s/ and your existing images.
- Docker Desktop for Mac (Apple Silicon OK)
- `kubectl`, `helm`, `kind`
```bash
# macOS (Homebrew)
brew install kind kubectl helm
```

```bash
# 1. Export ingress hostnames (or add them to .env and run the helper script)
export FRONTEND_HOST=app.localtest.me
export API_HOST=api.localtest.me

# Optional: generate terraform.auto.tfvars from a .env file
./deploy/k8s-terraform/scripts/generate-tfvars.sh

# 2. Initialize and apply
cd deploy/k8s-terraform
terraform init -upgrade
terraform apply \
  -var "backend_image=rag-backend:dev" \
  -var "frontend_image=rag-frontend:dev" \
  -var "postgres_url=postgresql+asyncpg://..." \
  -var "postgres_server=..." \
  -var "postgres_user=..." \
  -var "postgres_password=..." \
  -var "postgres_db=..." \
  -var "enable_tls=false"
```

If your manifests mount local model directories (e.g., for llama.cpp or embeddings), ensure they exist and are populated as per your current Docker workflow:

```bash
mkdir -p ~/rag-chat/models
mkdir -p ~/rag-tei/models
```

The Ingress is configured for app.rag.me with a self-signed cert (issued by `selfsigned-issuer`).

```bash
open https://app.rag.me
```

If you prefer to skip TLS during testing, remove the TLS section from the Ingress and use http://app.rag.me.
- Pods Pending → check the default StorageClass and PVC binding: `kubectl get sc,pvc -A`.
- ImagePullBackOff → ensure `kind load docker-image ...` for locally built images, or push to a reachable registry.
- Apple Silicon perf → x86_64 embeddings under emulation can be slow; consider a native ARM alternative once the flow is verified.
- DB connectivity → verify the backend env/Secret for `POSTGRES_DB`, `POSTGRES_USER`, `POSTGRES_PASSWORD`.