diff --git a/e2e-tests/llm-katan/deploy/docs/README.md b/e2e-tests/llm-katan/deploy/docs/README.md
new file mode 100644
index 000000000..9077f5a49
--- /dev/null
+++ b/e2e-tests/llm-katan/deploy/docs/README.md
@@ -0,0 +1,288 @@
+# LLM Katan - Kubernetes Deployment
+
+Comprehensive Kubernetes support for deploying LLM Katan in cloud-native environments.
+
+## Overview
+
+This directory provides production-ready Kubernetes manifests using Kustomize for deploying LLM Katan - a lightweight LLM server designed for testing and development workflows.
+
+## Architecture
+
+### Pod Structure
+
+Each deployment consists of two containers:
+
+- **initContainer (model-downloader)**: Downloads models from HuggingFace to PVC
+  - Image: `python:3.11-slim` (~45MB)
+  - Checks if model exists before downloading
+  - Runs once before main container starts
+
+- **main container (llm-katan)**: Serves the LLM API
+  - Image: `llm-katan:latest` (~1.35GB)
+  - Loads model from PVC cache
+  - Exposes OpenAI-compatible API on port 8000
+
+### Storage
+
+- **PersistentVolumeClaim**: 5Gi for model caching
+- **Mount Path**: `/cache/models/`
+- **Access Mode**: ReadWriteOnce (single Pod write)
+- Models persist across Pod restarts
+
+### Namespace
+
+All resources deploy to the `llm-katan-system` namespace. Each overlay creates isolated instances within this namespace:
+
+- **gpt35**: Simulates GPT-3.5-turbo
+- **claude**: Simulates Claude-3-Haiku
+
+### Resource Naming
+
+Kustomize applies `nameSuffix` to avoid conflicts:
+
+- Base: `llm-katan`
+- gpt35 overlay: `llm-katan-gpt35` (via `nameSuffix: -gpt35`)
+- claude overlay: `llm-katan-claude` (via `nameSuffix: -claude`)
+
+**How it works:**
+
+```yaml
+# overlays/gpt35/kustomization.yaml
+nameSuffix: -gpt35  # Automatically appends to all resource names
+```
+
+This creates unique resource names for each overlay without manual patches, allowing multiple instances to coexist in the same namespace.
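+
+To preview the rendered names before applying anything, you can build an overlay locally (a quick sketch; the exact output depends on your kustomize version):
+
+```bash
+# Render the gpt35 overlay and list the generated resource names
+kubectl kustomize overlays/gpt35/ | grep "name: llm-katan"
+# The Deployment and Service render as llm-katan-gpt35,
+# and the PVC as llm-katan-models-gpt35
+```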
+
+### Networking
+
+- **Service Type**: ClusterIP (internal only)
+- **Port**: 8000 (HTTP)
+- **Endpoints**: `/health`, `/v1/models`, `/v1/chat/completions`
+
+### Health Checks
+
+- **Startup Probe**: 30s initial delay, up to 60 failures (15 min max startup)
+- **Liveness Probe**: 15s delay, checks every 20s
+- **Readiness Probe**: 5s delay, checks every 10s
+
+## Directory Structure
+
+```text
+kubernetes/
+├── base/                      # Base Kubernetes manifests
+│   ├── namespace.yaml         # llm-katan-system namespace
+│   ├── deployment.yaml        # Main deployment with health checks
+│   ├── service.yaml           # ClusterIP service (port 8000)
+│   ├── pvc.yaml               # Model cache storage (5Gi)
+│   └── kustomization.yaml     # Base kustomization
+│
+├── components/                # Reusable Kustomize components
+│   └── common/                # Common labels for all resources
+│       └── kustomization.yaml # Shared label definitions
+│
+└── overlays/                  # Environment-specific configurations
+    ├── gpt35/                 # GPT-3.5-turbo simulation
+    │   └── kustomization.yaml # Overlay with patches for gpt35
+    │
+    └── claude/                # Claude-3-Haiku simulation
+        └── kustomization.yaml # Overlay with patches for claude
+```
+
+## Prerequisites
+
+Before starting, ensure you have the following tools installed:
+
+- [Docker](https://docs.docker.com/get-docker/) - Container runtime
+- [minikube](https://minikube.sigs.k8s.io/docs/start/) or [kind](https://kind.sigs.k8s.io/docs/user/quick-start/#installation) - Local Kubernetes (the examples below use kind)
+- [kubectl](https://kubernetes.io/docs/tasks/tools/) - Kubernetes CLI
+- `kustomize` (built into kubectl 1.14+)
+
+## Configuration
+
+### Environment Variables
+
+Configure via `config.env` or overlay ConfigMaps:
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `YLLM_MODEL` | `Qwen/Qwen3-0.6B` | HuggingFace model to load |
+| `YLLM_SERVED_MODEL_NAME` | (empty) | Model name for API (defaults to `YLLM_MODEL`) |
+| `YLLM_BACKEND` | `transformers` | Backend: `transformers` or `vllm` |
+| `YLLM_HOST` | `0.0.0.0` | Server bind address |
+| `YLLM_PORT` | `8000` | Server port |
+
+### Resource Limits
+
+Default per instance:
+
+```yaml
+resources:
+  requests:
+    cpu: "1"
+    memory: "3Gi"
+  limits:
+    cpu: "2"
+    memory: "6Gi"
+```
+
+### Storage
+
+- **PVC Size**: 5Gi (adjust in overlays if needed)
+- **Access Mode**: ReadWriteOnce
+- **Mount Path**: `/cache/models/`
+- **Purpose**: Cache downloaded models between restarts
+
+## Deployment
+
+### Deploy Single Instance (Base)
+
+```bash
+# From repository root
+cd e2e-tests/llm-katan/deploy/kubernetes
+
+# Deploy with default settings
+kubectl apply -k base/
+
+# Check status
+kubectl get pods -n llm-katan-system
+kubectl logs -n llm-katan-system -l app=llm-katan -f
+
+# Test the deployment (run the port-forward in a separate terminal)
+kubectl port-forward -n llm-katan-system svc/llm-katan 8000:8000
+curl http://localhost:8000/health
+```
+
+### Deploy Multi-Instance (Overlays)
+
+```bash
+# Deploy GPT-3.5-turbo simulation
+kubectl apply -k overlays/gpt35/
+
+# Deploy Claude-3-Haiku simulation
+kubectl apply -k overlays/claude/
+
+# Or deploy both simultaneously
+kubectl apply -k overlays/gpt35/ && kubectl apply -k overlays/claude/
+
+# Verify both are running
+kubectl get pods -n llm-katan-system
+kubectl get svc -n llm-katan-system
+```
+
+## Testing & Verification
+
+### Health Check
+
+```bash
+kubectl port-forward -n llm-katan-system svc/llm-katan 8000:8000
+
+# In a separate terminal:
+curl http://localhost:8000/health
+
+# Expected response:
+# {"status":"ok","model":"Qwen/Qwen3-0.6B","backend":"transformers"}
+```
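+
+The same checks work against an overlay instance through its suffixed Service; the served model name should reflect `YLLM_SERVED_MODEL_NAME` (a sketch, assuming the gpt35 overlay is deployed):
+
+```bash
+# Forward the overlay's Service on a second local port
+kubectl port-forward -n llm-katan-system svc/llm-katan-gpt35 8001:8000 &
+
+# The model list should report the alias (gpt-3.5-turbo) rather than the HF model id
+curl http://localhost:8001/v1/models
+```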
-H "Content-Type: application/json" \ + -d '{ + "model": "Qwen/Qwen3-0.6B", + "messages": [{"role": "user", "content": "Hello!"}] + }' +``` + +### Models Endpoint + +```bash +curl http://localhost:8000/v1/models +``` + +### Metrics (Prometheus) + +```bash +dont forget -> kubectl port-forward -n llm-katan-system svc/llm-katan 8000:8000 +curl http://localhost:8000/metrics + +# Metrics exposed: +# - llm_katan_requests_total +# - llm_katan_tokens_generated_total +# - llm_katan_response_time_seconds +# - llm_katan_uptime_seconds +``` + +## Troubleshooting + +### Common Issues + +**Common pod error:** + + - OOMKilled: Increase memory limits (current: 6Gi) + - ImagePullBackOff: Load image into Kind with kind load docker-image llm-katan:latest + - Init:CrashLoopBackOff: Check initContainer logs for download issues + +**Pod not starting:** + +```bash +# Check pod status +kubectl get pods -n llm-katan-system + +# Describe pod for events +kubectl describe pod -n llm-katan-system -l app.kubernetes.io/name=llm-katan + +# Check initContainer logs (model download) +kubectl logs -n llm-katan-system -l app.kubernetes.io/name=llm-katan -c model-downloader + +# Check main container logs +kubectl logs -n llm-katan-system -l app.kubernetes.io/name=llm-katan -c llm-katan -f +``` + +**LLM Katan not responding::** + +```bash +# Check deployment status +kubectl get deployment -n llm-katan-system + +# Check service +kubectl get svc -n llm-katan-system + +# Check if port-forward is active +ps aux | grep "port-forward" | grep llm-katan + +# Test health endpoint +kubectl port-forward -n llm-katan-system svc/llm-katan-gpt35 8000:8000 & +curl http://localhost:8000/health +``` + +**PVC issues:** + +```bash +# Check PVC status +kubectl get pvc -n llm-katan-system + +# Check PVC details +kubectl describe pvc -n llm-katan-system + +# Check volume contents (if pod is running) +kubectl exec -n llm-katan-system -- ls -lah /cache/models/ +``` + +## Cleanup + +**Remove Specific Overlay:** + +```bash +# Remove gpt35 instance +kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35/ + +# Remove claude instance +kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/overlays/claude/ +``` + +**Remove All llm-katan Resources:** + +```bash +# Delete entire namespace (removes everything) +kubectl delete namespace llm-katan-system + +# Or delete base deployment +kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/base/ +``` + +**Cleanup Kind Cluster:** + +```bash +# Stop Kind cluster +kind delete cluster --name llm-katan-test + +# Or if using default cluster name +kind delete cluster +``` diff --git a/e2e-tests/llm-katan/deploy/kubernetes/base/deployment.yaml b/e2e-tests/llm-katan/deploy/kubernetes/base/deployment.yaml new file mode 100644 index 000000000..a6579e94e --- /dev/null +++ b/e2e-tests/llm-katan/deploy/kubernetes/base/deployment.yaml @@ -0,0 +1,144 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: llm-katan +spec: + selector: + matchLabels: {} + replicas: 1 + template: + metadata: + labels: {} + spec: + # Create a non-root user for security (matching Dockerfile) + securityContext: + fsGroup: 1000 + runAsUser: 1000 + runAsNonRoot: true + + initContainers: + # Pre-download model to cache for faster startup + # Uses lightweight python:3.11-slim image and checks if model exists before downloading + - name: model-downloader + image: python:3.11-slim + imagePullPolicy: IfNotPresent + securityContext: + runAsUser: 0 # Run as root to install packages + runAsNonRoot: false + allowPrivilegeEscalation: 
+## Troubleshooting
+
+### Common Issues
+
+**Common pod errors:**
+
+- `OOMKilled`: Increase memory limits (current limit: 6Gi)
+- `ImagePullBackOff`: Load the image into kind with `kind load docker-image llm-katan:latest`
+- `Init:CrashLoopBackOff`: Check the initContainer logs for download issues
+
+**Pod not starting:**
+
+```bash
+# Check pod status
+kubectl get pods -n llm-katan-system
+
+# Describe pod for events
+kubectl describe pod -n llm-katan-system -l app.kubernetes.io/name=llm-katan
+
+# Check initContainer logs (model download)
+kubectl logs -n llm-katan-system -l app.kubernetes.io/name=llm-katan -c model-downloader
+
+# Check main container logs
+kubectl logs -n llm-katan-system -l app.kubernetes.io/name=llm-katan -c llm-katan -f
+```
+
+**LLM Katan not responding:**
+
+```bash
+# Check deployment status
+kubectl get deployment -n llm-katan-system
+
+# Check service
+kubectl get svc -n llm-katan-system
+
+# Check if port-forward is active
+ps aux | grep "port-forward" | grep llm-katan
+
+# Test health endpoint
+kubectl port-forward -n llm-katan-system svc/llm-katan-gpt35 8000:8000 &
+curl http://localhost:8000/health
+```
+
+**PVC issues:**
+
+```bash
+# Check PVC status
+kubectl get pvc -n llm-katan-system
+
+# Check PVC details
+kubectl describe pvc -n llm-katan-system
+
+# Check volume contents (if pod is running; replace <pod-name> with the actual pod)
+kubectl exec -n llm-katan-system <pod-name> -- ls -lah /cache/models/
+```
+
+## Cleanup
+
+**Remove Specific Overlay:**
+
+```bash
+# Remove gpt35 instance
+kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35/
+
+# Remove claude instance
+kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/overlays/claude/
+```
+
+**Remove All llm-katan Resources:**
+
+```bash
+# Delete entire namespace (removes everything)
+kubectl delete namespace llm-katan-system
+
+# Or delete base deployment
+kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/base/
+```
+
+**Cleanup Kind Cluster:**
+
+```bash
+# Stop Kind cluster
+kind delete cluster --name llm-katan-test
+
+# Or if using default cluster name
+kind delete cluster
+```
diff --git a/e2e-tests/llm-katan/deploy/kubernetes/base/deployment.yaml b/e2e-tests/llm-katan/deploy/kubernetes/base/deployment.yaml
new file mode 100644
index 000000000..a6579e94e
--- /dev/null
+++ b/e2e-tests/llm-katan/deploy/kubernetes/base/deployment.yaml
@@ -0,0 +1,144 @@
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: llm-katan
+spec:
+  selector:
+    # Minimal stable selector matching the Service; the common component
+    # layers on shared app.kubernetes.io/* labels in overlays
+    matchLabels:
+      app: llm-katan
+  replicas: 1
+  template:
+    metadata:
+      labels:
+        app: llm-katan
+    spec:
+      # Run as the non-root user defined in the Dockerfile (UID 1000)
+      securityContext:
+        fsGroup: 1000
+        runAsUser: 1000
+        runAsNonRoot: true
+
+      initContainers:
+        # Pre-download model to cache for faster startup
+        # Uses lightweight python:3.11-slim image and checks if model exists before downloading
+        - name: model-downloader
+          image: python:3.11-slim
+          imagePullPolicy: IfNotPresent
+          securityContext:
+            runAsUser: 0 # Run as root to install packages
+            runAsNonRoot: false
+            allowPrivilegeEscalation: false
+          command: ["/bin/bash", "-c"]
+          args:
+            - |
+              set -e
+
+              MODEL_ID="${YLLM_MODEL:-Qwen/Qwen3-0.6B}"
+              MODEL_DIR=$(basename "$MODEL_ID")
+
+              mkdir -p /cache/models
+              cd /cache/models
+
+              # Check if model already exists in PVC
+              if [ -d "$MODEL_DIR" ]; then
+                echo "Model $MODEL_ID already cached. Skipping download."
+                exit 0
+              fi
+
+              # Model not found, proceed with download
+              echo "Downloading model $MODEL_ID..."
+              pip install --no-cache-dir "huggingface_hub[cli]"
+              hf download "$MODEL_ID" --local-dir "$MODEL_DIR"
+          env:
+            - name: YLLM_MODEL
+              value: "Qwen/Qwen3-0.6B"
+            - name: HF_HUB_CACHE
+              value: "/tmp/hf_cache"
+          volumeMounts:
+            - name: models-volume
+              mountPath: /cache/models
+          resources:
+            requests:
+              memory: "512Mi"
+              cpu: "250m"
+            limits:
+              memory: "1Gi"
+              cpu: "500m"
+
+      containers:
+        - name: llm-katan
+          image: llm-katan:latest
+          imagePullPolicy: IfNotPresent
+
+          # The image entrypoint runs llm-katan; configuration comes from the env vars below
+          # Effective default: llm-katan --model Qwen/Qwen3-0.6B --host 0.0.0.0 --port 8000
+
+          ports:
+            - name: http
+              containerPort: 8000
+              protocol: TCP
+
+          env:
+            # These can be overridden via ConfigMap in overlays
+            - name: YLLM_MODEL
+              value: "/cache/models/Qwen3-0.6B" # Local path to downloaded model
+            - name: YLLM_PORT
+              value: "8000"
+            - name: YLLM_HOST
+              value: "0.0.0.0"
+            - name: YLLM_BACKEND
+              value: "transformers"
+            - name: PYTHONUNBUFFERED
+              value: "1"
+            - name: PYTHONDONTWRITEBYTECODE
+              value: "1"
+
+          volumeMounts:
+            - name: models-volume
+              mountPath: /cache/models
+
+          livenessProbe:
+            httpGet:
+              path: /health
+              port: http
+            initialDelaySeconds: 15
+            periodSeconds: 20
+            timeoutSeconds: 5
+            failureThreshold: 3
+
+          readinessProbe:
+            httpGet:
+              path: /health
+              port: http
+            initialDelaySeconds: 5
+            periodSeconds: 10
+            timeoutSeconds: 3
+            failureThreshold: 3
+
+          startupProbe:
+            httpGet:
+              path: /health
+              port: http
+            initialDelaySeconds: 30
+            periodSeconds: 15
+            timeoutSeconds: 5
+            failureThreshold: 60 # 15 minutes max startup time (for slow model downloads)
+
+          resources:
+            requests:
+              memory: "3Gi"
+              cpu: "1"
+            limits:
+              memory: "6Gi"
+              cpu: "2"
+
+          securityContext:
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: false # HuggingFace needs to write to cache
+            runAsNonRoot: true
+            capabilities:
+              drop:
+                - ALL
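+      # Overlays patch the claimName below to the suffixed PVC name
+      # (e.g., llm-katan-models-gpt35); see overlays/*/kustomization.yaml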
+      volumes:
+        - name: models-volume
+          persistentVolumeClaim:
+            claimName: llm-katan-models
diff --git a/e2e-tests/llm-katan/deploy/kubernetes/base/kustomization.yaml b/e2e-tests/llm-katan/deploy/kubernetes/base/kustomization.yaml
new file mode 100644
index 000000000..f378082c4
--- /dev/null
+++ b/e2e-tests/llm-katan/deploy/kubernetes/base/kustomization.yaml
@@ -0,0 +1,21 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+
+metadata:
+  name: llm-katan-base
+
+namespace: llm-katan-system
+
+resources:
+  - namespace.yaml
+  - pvc.yaml
+  - deployment.yaml
+  - service.yaml
+
+# Images (can be overridden in overlays)
+images:
+  - name: llm-katan
+    newName: llm-katan
+    newTag: latest
diff --git a/e2e-tests/llm-katan/deploy/kubernetes/base/namespace.yaml b/e2e-tests/llm-katan/deploy/kubernetes/base/namespace.yaml
new file mode 100644
index 000000000..f53e19f9a
--- /dev/null
+++ b/e2e-tests/llm-katan/deploy/kubernetes/base/namespace.yaml
@@ -0,0 +1,4 @@
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: llm-katan-system
diff --git a/e2e-tests/llm-katan/deploy/kubernetes/base/pvc.yaml b/e2e-tests/llm-katan/deploy/kubernetes/base/pvc.yaml
new file mode 100644
index 000000000..ed12f2a5f
--- /dev/null
+++ b/e2e-tests/llm-katan/deploy/kubernetes/base/pvc.yaml
@@ -0,0 +1,10 @@
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: llm-katan-models
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 5Gi # Sized for the model cache (~600MB model + overhead)
diff --git a/e2e-tests/llm-katan/deploy/kubernetes/base/service.yaml b/e2e-tests/llm-katan/deploy/kubernetes/base/service.yaml
new file mode 100644
index 000000000..ef4f4d3ec
--- /dev/null
+++ b/e2e-tests/llm-katan/deploy/kubernetes/base/service.yaml
@@ -0,0 +1,14 @@
+apiVersion: v1
+kind: Service
+metadata:
+  name: llm-katan
+spec:
+  type: ClusterIP
+  selector:
+    app: llm-katan
+  ports:
+    - name: http
+      port: 8000
+      targetPort: http
+      protocol: TCP
diff --git a/e2e-tests/llm-katan/deploy/kubernetes/components/common/kustomization.yaml b/e2e-tests/llm-katan/deploy/kubernetes/components/common/kustomization.yaml
new file mode 100644
index 000000000..5312fe4af
--- /dev/null
+++ b/e2e-tests/llm-katan/deploy/kubernetes/components/common/kustomization.yaml
@@ -0,0 +1,10 @@
+apiVersion: kustomize.config.k8s.io/v1alpha1
+kind: Component
+
+# Common labels applied to all resources that use this component
+labels:
+- includeSelectors: true
+  pairs:
+    app.kubernetes.io/name: llm-katan
+    app.kubernetes.io/part-of: semantic-router-workspaces
+    app.kubernetes.io/managed-by: kustomize
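+
+# Note: includeSelectors: true also injects these labels into workload
+# selectors (Deployment matchLabels) and the Service selector, so pod
+# labels and selectors stay in sync across base and overlays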
diff --git a/e2e-tests/llm-katan/deploy/kubernetes/overlays/claude/kustomization.yaml b/e2e-tests/llm-katan/deploy/kubernetes/overlays/claude/kustomization.yaml
new file mode 100644
index 000000000..bb8a55dcf
--- /dev/null
+++ b/e2e-tests/llm-katan/deploy/kubernetes/overlays/claude/kustomization.yaml
@@ -0,0 +1,49 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+
+metadata:
+  name: llm-katan-claude
+
+# Base configuration
+resources:
+  - ../../base
+
+components:
+  - ../../components/common
+
+# Name suffix for multi-instance deployment
+nameSuffix: -claude
+
+# Patches to customize for Claude-3-Haiku simulation
+patches:
+  # Set served model name for API
+  - target:
+      kind: Deployment
+      name: llm-katan
+    patch: |-
+      - op: add
+        path: /spec/template/spec/containers/0/env/-
+        value:
+          name: YLLM_SERVED_MODEL_NAME
+          value: "claude-3-haiku-20240307"
+      - op: add
+        path: /spec/template/metadata/labels/model-alias
+        value: "claude-3-haiku"
+
+  # Add model-alias label to service
+  - target:
+      kind: Service
+      name: llm-katan
+    patch: |-
+      - op: add
+        path: /metadata/labels/model-alias
+        value: "claude-3-haiku"
+
+  # Update PVC reference in deployment to match suffixed PVC name
+  - target:
+      kind: Deployment
+      name: llm-katan
+    patch: |-
+      - op: replace
+        path: /spec/template/spec/volumes/0/persistentVolumeClaim/claimName
+        value: llm-katan-models-claude
diff --git a/e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35/kustomization.yaml b/e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35/kustomization.yaml
new file mode 100644
index 000000000..3f714d60b
--- /dev/null
+++ b/e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35/kustomization.yaml
@@ -0,0 +1,41 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+
+resources:
+  - ../../base
+
+components:
+  - ../../components/common
+
+nameSuffix: -gpt35
+
+patches:
+  - target:
+      kind: Deployment
+      name: llm-katan
+    patch: |-
+      - op: add
+        path: /spec/template/spec/containers/0/env/-
+        value:
+          name: YLLM_SERVED_MODEL_NAME
+          value: "gpt-3.5-turbo"
+      - op: add
+        path: /spec/template/metadata/labels/model-alias
+        value: "gpt-3.5-turbo"
+
+  - target:
+      kind: Service
+      name: llm-katan
+    patch: |-
+      - op: add
+        path: /metadata/labels/model-alias
+        value: "gpt-3.5-turbo"
+
+  # Update PVC reference in deployment to match suffixed PVC name
+  - target:
+      kind: Deployment
+      name: llm-katan
+    patch: |-
+      - op: replace
+        path: /spec/template/spec/volumes/0/persistentVolumeClaim/claimName
+        value: llm-katan-models-gpt35
diff --git a/e2e-tests/llm-katan/llm_katan/config.py b/e2e-tests/llm-katan/llm_katan/config.py
index 1f91d8ac9..138ed9d92 100644
--- a/e2e-tests/llm-katan/llm_katan/config.py
+++ b/e2e-tests/llm-katan/llm_katan/config.py
@@ -31,15 +31,14 @@ def __post_init__(self):
 
         # Apply environment variable overrides
         self.model_name = os.getenv("YLLM_MODEL", self.model_name)
+        self.served_model_name = os.getenv("YLLM_SERVED_MODEL_NAME", self.served_model_name)
         self.port = int(os.getenv("YLLM_PORT", str(self.port)))
         self.backend = os.getenv("YLLM_BACKEND", self.backend)
         self.host = os.getenv("YLLM_HOST", self.host)
 
         # Validate backend
         if self.backend not in ["transformers", "vllm"]:
-            raise ValueError(
-                f"Invalid backend: {self.backend}. Must be 'transformers' or 'vllm'"
-            )
+            raise ValueError(f"Invalid backend: {self.backend}. Must be 'transformers' or 'vllm'")
 
     @property
     def device_auto(self) -> str: