diff --git a/deploy/kubernetes/README.md b/deploy/kubernetes/README.md index 175763cd..ab225a43 100644 --- a/deploy/kubernetes/README.md +++ b/deploy/kubernetes/README.md @@ -1,13 +1,16 @@ # Semantic Router Kubernetes Deployment -This directory contains Kubernetes manifests for deploying the Semantic Router using Kustomize. +This directory contains Kubernetes manifests for deploying the Semantic Router using Kustomize. It provides two modes similar to docker-compose profiles: + +- core: only the semantic-router (no llm-katan) +- llm-katan: semantic-router plus an llm-katan sidecar listening on 8002 (served model name `qwen3`) ## Architecture The deployment consists of: - **ConfigMap**: Contains `config.yaml` and `tools_db.json` configuration files -- **PersistentVolumeClaim**: 10Gi storage for model files +- **PersistentVolumeClaim**: 30Gi storage for model files (adjust based on models you enable) - **Deployment**: - **Init Container**: Downloads/copies model files to persistent volume - **Main Container**: Runs the semantic router service @@ -29,11 +32,11 @@ The deployment consists of: kubectl apply -k deploy/kubernetes/ # Check deployment status -kubectl get pods -l app=semantic-router -n semantic-router -kubectl get services -l app=semantic-router -n semantic-router +kubectl get pods -l app=semantic-router -n vllm-semantic-router-system +kubectl get services -l app=semantic-router -n vllm-semantic-router-system # View logs -kubectl logs -l app=semantic-router -n semantic-router -f +kubectl logs -l app=semantic-router -n vllm-semantic-router-system -f ``` ### Kind (Kubernetes in Docker) Deployment @@ -86,20 +89,20 @@ kubectl wait --for=condition=Ready nodes --all --timeout=300s kubectl apply -k deploy/kubernetes/ # Wait for deployment to be ready -kubectl wait --for=condition=Available deployment/semantic-router -n semantic-router --timeout=600s +kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s ``` **Step 3: Check deployment status** ```bash # Check pods -kubectl get pods -n semantic-router -o wide +kubectl get pods -n vllm-semantic-router-system -o wide # Check services -kubectl get services -n semantic-router +kubectl get services -n vllm-semantic-router-system # View logs -kubectl logs -l app=semantic-router -n semantic-router -f +kubectl logs -l app=semantic-router -n vllm-semantic-router-system -f ``` #### Resource Requirements for Kind @@ -137,13 +140,13 @@ Or using kubectl directly: ```bash # Access Classification API (HTTP REST) -kubectl port-forward -n semantic-router svc/semantic-router 8080:8080 +kubectl port-forward -n vllm-semantic-router-system svc/semantic-router 8080:8080 # Access gRPC API -kubectl port-forward -n semantic-router svc/semantic-router 50051:50051 +kubectl port-forward -n vllm-semantic-router-system svc/semantic-router 50051:50051 # Access metrics -kubectl port-forward -n semantic-router svc/semantic-router-metrics 9190:9190 +kubectl port-forward -n vllm-semantic-router-system svc/semantic-router-metrics 9190:9190 ``` #### Testing the Deployment @@ -195,6 +198,11 @@ kubectl delete -k deploy/kubernetes/ kind delete cluster --name semantic-router-cluster ``` +## Notes on dependencies + +- Gateway API Inference Extension CRDs are required only when using the Envoy AI Gateway integration in `deploy/kubernetes/ai-gateway/`. Follow the installation steps in `website/docs/installation/kubernetes.md` if you plan to use the gateway path. 
+- The core kustomize deployment in this folder does not install Envoy Gateway or AI Gateway; those are optional components documented separately.
+
 ## Make Commands Reference
 
 The project provides comprehensive make targets for managing kind clusters and deployments:
@@ -293,6 +301,11 @@ kubectl top pods -n semantic-router
 # Adjust resource limits in deployment.yaml if needed
 ```
 
+### Storage sizing
+
+- The default PVC is 30Gi. If the enabled models are small, you can reduce it; otherwise reserve at least 2–3x the total model size.
+- If your cluster's default StorageClass isn't named `standard`, change `storageClassName` in `pvc.yaml` accordingly or remove the field to use the default class.
+
 ### Resource Optimization
 
 For different environments, you can adjust resource requirements:
@@ -307,16 +320,43 @@ Edit the `resources` section in `deployment.yaml` accordingly.
 
 ### Kubernetes Manifests (`deploy/kubernetes/`)
 
-- `deployment.yaml` - Main application deployment with optimized resource settings
-- `service.yaml` - Services for gRPC, HTTP API, and metrics
+- `base/` - Shared resources (Namespace, PVC, Service, ConfigMap)
+- `overlays/core/` - Core deployment (no llm-katan)
+- `overlays/llm-katan/` - Deployment with llm-katan sidecar
+- `deployment.yaml` - Plain deployment (used by core overlay)
+- `deployment.katan.yaml` - Sidecar deployment (used by llm-katan overlay)
+- `service.yaml` - gRPC, HTTP API, and metrics services
 - `pvc.yaml` - Persistent volume claim for model storage
 - `namespace.yaml` - Dedicated namespace for the application
-- `config.yaml` - Application configuration
+- `config.yaml` - Application configuration (defaults to qwen3 @ 127.0.0.1:8002)
 - `tools_db.json` - Tools database for semantic routing
-- `kustomization.yaml` - Kustomize configuration for easy deployment
+- `kustomization.yaml` - Root entry (defaults to core overlay)
 
 ### Development Tools
 
+## Choose a mode: core or llm-katan
+
+- Core mode (the root kustomization points to this overlay by default):
+
+  ```bash
+  kubectl apply -k deploy/kubernetes
+  # or explicitly
+  kubectl apply -k deploy/kubernetes/overlays/core
+  ```
+
+- llm-katan mode:
+
+  ```bash
+  kubectl apply -k deploy/kubernetes/overlays/llm-katan
+  ```
+
+Notes for llm-katan:
+
+- The init container will attempt to download `Qwen/Qwen3-0.6B` into `/app/models/Qwen/Qwen3-0.6B` and the embedding model `sentence-transformers/all-MiniLM-L12-v2` into `/app/models/all-MiniLM-L12-v2`. In restricted networks, these downloads may fail; pre-populate the PV or point the init script to your internal artifact store as needed.
+- The default Kubernetes `config.yaml` has been aligned to use `qwen3` and endpoint `127.0.0.1:8002`.
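+
+To verify the sidecar after applying the llm-katan overlay, you can port-forward directly to the pod and exercise its API. This is a minimal smoke test; it assumes llm-katan exposes the standard OpenAI-compatible `/v1/models` and `/v1/chat/completions` routes on port 8002, as configured above:
+
+```bash
+# Forward the sidecar port from the deployment's pod to localhost
+kubectl port-forward -n vllm-semantic-router-system deploy/semantic-router 8002:8002 &
+
+# List served models (should include "qwen3")
+curl -s http://localhost:8002/v1/models
+
+# Send a small chat completion through the sidecar
+curl -s http://localhost:8002/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "qwen3", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 8}'
+```
+
+If these calls fail, check the sidecar logs with `kubectl logs deploy/semantic-router -c llm-katan -n vllm-semantic-router-system`.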
+ - `tools/kind/kind-config.yaml` - Kind cluster configuration for local development - `tools/make/kube.mk` - Make targets for Kubernetes operations - `Makefile` - Root makefile including all make targets diff --git a/deploy/kubernetes/ai-gateway/inference-pool/inference-pool.yaml b/deploy/kubernetes/ai-gateway/inference-pool/inference-pool.yaml index 64afc6f9..7b52e07b 100644 --- a/deploy/kubernetes/ai-gateway/inference-pool/inference-pool.yaml +++ b/deploy/kubernetes/ai-gateway/inference-pool/inference-pool.yaml @@ -11,7 +11,7 @@ spec: - number: 50051 selector: matchLabels: - app: vllm-semantic-router + app: semantic-router endpointPickerRef: name: semantic-router port: diff --git a/deploy/kubernetes/base/kustomization.yaml b/deploy/kubernetes/base/kustomization.yaml new file mode 100644 index 00000000..90192015 --- /dev/null +++ b/deploy/kubernetes/base/kustomization.yaml @@ -0,0 +1,19 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +resources: + - ../namespace.yaml + - ../pvc.yaml + - ../service.yaml + +configMapGenerator: + - name: semantic-router-config + files: + - ../config.yaml + - ../tools_db.json + +namespace: vllm-semantic-router-system + +images: + - name: ghcr.io/vllm-project/semantic-router/extproc + newTag: latest diff --git a/deploy/kubernetes/config.yaml b/deploy/kubernetes/config.yaml index 5bc40cbb..77760013 100644 --- a/deploy/kubernetes/config.yaml +++ b/deploy/kubernetes/config.yaml @@ -1,15 +1,15 @@ bert_model: - model_id: sentence-transformers/all-MiniLM-L12-v2 + model_id: models/all-MiniLM-L12-v2 threshold: 0.6 use_cpu: true semantic_cache: enabled: true - backend_type: "memory" # Options: "memory" or "milvus" + backend_type: "memory" # Options: "memory" or "milvus" similarity_threshold: 0.8 - max_entries: 1000 # Only applies to memory backend + max_entries: 1000 # Only applies to memory backend ttl_seconds: 3600 - eviction_policy: "fifo" + eviction_policy: "fifo" tools: enabled: true @@ -32,13 +32,13 @@ prompt_guard: # NOT supported: domain names (example.com), protocol prefixes (http://), paths (/api), ports in address (use 'port' field) vllm_endpoints: - name: "endpoint1" - address: "127.0.0.1" # IPv4 address - REQUIRED format - port: 8000 + address: "127.0.0.1" # llm-katan sidecar or local backend + port: 8002 weight: 1 model_config: - "openai/gpt-oss-20b": - reasoning_family: "gpt-oss" # This model uses GPT-OSS reasoning syntax + "qwen3": + reasoning_family: "qwen3" # Match docker-compose default model name preferred_endpoints: ["endpoint1"] pii_policy: allow_by_default: true @@ -62,76 +62,76 @@ classifier: categories: - name: business model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.7 - use_reasoning: false # Business performs better without reasoning + use_reasoning: false # Business performs better without reasoning - name: law model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.4 use_reasoning: false - name: psychology model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.6 use_reasoning: false - name: biology model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.9 use_reasoning: false - name: chemistry model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.6 - use_reasoning: true # Enable reasoning for complex chemistry + use_reasoning: true # Enable reasoning for complex chemistry - name: history model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.7 use_reasoning: false - name: other model_scores: - - model: openai/gpt-oss-20b + 
- model: qwen3 score: 0.7 use_reasoning: false - name: health model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.5 use_reasoning: false - name: economics model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 1.0 use_reasoning: false - name: math model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 1.0 - use_reasoning: true # Enable reasoning for complex math + use_reasoning: true # Enable reasoning for complex math - name: physics model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.7 - use_reasoning: true # Enable reasoning for physics + use_reasoning: true # Enable reasoning for physics - name: computer science model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.6 use_reasoning: false - name: philosophy model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.5 use_reasoning: false - name: engineering model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.7 use_reasoning: false -default_model: openai/gpt-oss-20b +default_model: qwen3 # Reasoning family configurations reasoning_families: @@ -164,5 +164,6 @@ api: detailed_goroutine_tracking: true high_resolution_timing: false sample_rate: 1.0 - duration_buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30] + duration_buckets: + [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30] size_buckets: [1, 2, 5, 10, 20, 50, 100, 200] diff --git a/deploy/kubernetes/deployment.katan.yaml b/deploy/kubernetes/deployment.katan.yaml new file mode 100644 index 00000000..3aa74d5b --- /dev/null +++ b/deploy/kubernetes/deployment.katan.yaml @@ -0,0 +1,181 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: semantic-router + namespace: vllm-semantic-router-system + labels: + app: semantic-router +spec: + replicas: 1 + selector: + matchLabels: + app: semantic-router + template: + metadata: + labels: + app: semantic-router + spec: + initContainers: + - name: model-downloader + image: python:3.11-slim + securityContext: + runAsNonRoot: false + allowPrivilegeEscalation: false + command: ["/bin/bash", "-c"] + args: + - | + set -e + echo "Installing Hugging Face CLI..." + pip install --no-cache-dir huggingface_hub[cli] + + echo "Downloading classifier models to persistent volume..." + cd /app/models + + # Download category classifier model + if [ ! -d "category_classifier_modernbert-base_model" ]; then + echo "Downloading category classifier model..." + huggingface-cli download LLM-Semantic-Router/category_classifier_modernbert-base_model --local-dir category_classifier_modernbert-base_model + else + echo "Category classifier model already exists, skipping..." + fi + + # Download PII classifier model + if [ ! -d "pii_classifier_modernbert-base_model" ]; then + echo "Downloading PII classifier model..." + huggingface-cli download LLM-Semantic-Router/pii_classifier_modernbert-base_model --local-dir pii_classifier_modernbert-base_model + else + echo "PII classifier model already exists, skipping..." + fi + + # Download jailbreak classifier model + if [ ! -d "jailbreak_classifier_modernbert-base_model" ]; then + echo "Downloading jailbreak classifier model..." + huggingface-cli download LLM-Semantic-Router/jailbreak_classifier_modernbert-base_model --local-dir jailbreak_classifier_modernbert-base_model + else + echo "Jailbreak classifier model already exists, skipping..." + fi + + # Download PII token classifier model + if [ ! 
-d "pii_classifier_modernbert-base_presidio_token_model" ]; then + echo "Downloading PII token classifier model..." + huggingface-cli download LLM-Semantic-Router/pii_classifier_modernbert-base_presidio_token_model --local-dir pii_classifier_modernbert-base_presidio_token_model + else + echo "PII token classifier model already exists, skipping..." + fi + + # Download embedding model all-MiniLM-L12-v2 + if [ ! -d "all-MiniLM-L12-v2" ]; then + echo "Downloading all-MiniLM-L12-v2 embedding model..." + huggingface-cli download sentence-transformers/all-MiniLM-L12-v2 --local-dir all-MiniLM-L12-v2 + else + echo "all-MiniLM-L12-v2 already exists, skipping..." + fi + + # Optional: Prepare Qwen model directory for llm-katan sidecar + # NOTE: Provide the model content under /app/models/Qwen/Qwen3-0.6B via pre-populated PV + # or customize the following block to fetch from your internal artifact store. + if [ ! -d "Qwen/Qwen3-0.6B" ]; then + echo "Downloading Qwen/Qwen3-0.6B for llm-katan..." + mkdir -p Qwen + huggingface-cli download Qwen/Qwen3-0.6B --local-dir Qwen/Qwen3-0.6B || echo "Warning: Qwen3-0.6B download failed; ensure offline pre-population if needed." + else + echo "Qwen/Qwen3-0.6B already exists, skipping..." + fi + + echo "Model directory listing:" && ls -la /app/models/ + env: + - name: HF_HUB_CACHE + value: /tmp/hf_cache + resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "500m" + volumeMounts: + - name: models-volume + mountPath: /app/models + containers: + - name: semantic-router + image: ghcr.io/vllm-project/semantic-router/extproc:latest + args: ["--secure=true"] + securityContext: + runAsNonRoot: false + allowPrivilegeEscalation: false + ports: + - containerPort: 50051 + name: grpc + protocol: TCP + - containerPort: 9190 + name: metrics + protocol: TCP + - containerPort: 8080 + name: classify-api + protocol: TCP + env: + - name: LD_LIBRARY_PATH + value: "/app/lib" + volumeMounts: + - name: config-volume + mountPath: /app/config + readOnly: true + - name: models-volume + mountPath: /app/models + livenessProbe: + tcpSocket: + port: 50051 + initialDelaySeconds: 60 + periodSeconds: 30 + timeoutSeconds: 10 + failureThreshold: 3 + readinessProbe: + tcpSocket: + port: 50051 + initialDelaySeconds: 90 + periodSeconds: 30 + timeoutSeconds: 10 + failureThreshold: 3 + resources: + requests: + memory: "3Gi" + cpu: "1" + limits: + memory: "6Gi" + cpu: "2" + - name: llm-katan + image: ghcr.io/vllm-project/semantic-router/llm-katan:latest + imagePullPolicy: IfNotPresent + args: + [ + "llm-katan", + "--model", + "/app/models/Qwen/Qwen3-0.6B", + "--served-model-name", + "qwen3", + "--host", + "0.0.0.0", + "--port", + "8002", + ] + ports: + - containerPort: 8002 + name: katan + protocol: TCP + volumeMounts: + - name: models-volume + mountPath: /app/models + resources: + requests: + memory: "1Gi" + cpu: "500m" + limits: + memory: "2Gi" + cpu: "1" + volumes: + - name: config-volume + configMap: + name: semantic-router-config + - name: models-volume + persistentVolumeClaim: + claimName: semantic-router-models diff --git a/deploy/kubernetes/deployment.yaml b/deploy/kubernetes/deployment.yaml index ab7000f9..560b9850 100644 --- a/deploy/kubernetes/deployment.yaml +++ b/deploy/kubernetes/deployment.yaml @@ -16,121 +16,130 @@ spec: app: semantic-router spec: initContainers: - - name: model-downloader - image: python:3.11-slim - securityContext: - runAsNonRoot: false - allowPrivilegeEscalation: false - command: ["/bin/bash", "-c"] - args: - - | - set -e - echo 
"Installing Hugging Face CLI..." - pip install --no-cache-dir huggingface_hub[cli] + - name: model-downloader + image: python:3.11-slim + securityContext: + runAsNonRoot: false + allowPrivilegeEscalation: false + command: ["/bin/bash", "-c"] + args: + - | + set -e + echo "Installing Hugging Face CLI..." + pip install --no-cache-dir huggingface_hub[cli] - echo "Downloading models to persistent volume..." - cd /app/models + echo "Downloading models to persistent volume..." + cd /app/models - # Download category classifier model - if [ ! -d "category_classifier_modernbert-base_model" ]; then - echo "Downloading category classifier model..." - huggingface-cli download LLM-Semantic-Router/category_classifier_modernbert-base_model --local-dir category_classifier_modernbert-base_model - else - echo "Category classifier model already exists, skipping..." - fi + # Download category classifier model + if [ ! -d "category_classifier_modernbert-base_model" ]; then + echo "Downloading category classifier model..." + huggingface-cli download LLM-Semantic-Router/category_classifier_modernbert-base_model --local-dir category_classifier_modernbert-base_model + else + echo "Category classifier model already exists, skipping..." + fi - # Download PII classifier model - if [ ! -d "pii_classifier_modernbert-base_model" ]; then - echo "Downloading PII classifier model..." - huggingface-cli download LLM-Semantic-Router/pii_classifier_modernbert-base_model --local-dir pii_classifier_modernbert-base_model - else - echo "PII classifier model already exists, skipping..." - fi + # Download PII classifier model + if [ ! -d "pii_classifier_modernbert-base_model" ]; then + echo "Downloading PII classifier model..." + huggingface-cli download LLM-Semantic-Router/pii_classifier_modernbert-base_model --local-dir pii_classifier_modernbert-base_model + else + echo "PII classifier model already exists, skipping..." + fi - # Download jailbreak classifier model - if [ ! -d "jailbreak_classifier_modernbert-base_model" ]; then - echo "Downloading jailbreak classifier model..." - huggingface-cli download LLM-Semantic-Router/jailbreak_classifier_modernbert-base_model --local-dir jailbreak_classifier_modernbert-base_model - else - echo "Jailbreak classifier model already exists, skipping..." - fi + # Download jailbreak classifier model + if [ ! -d "jailbreak_classifier_modernbert-base_model" ]; then + echo "Downloading jailbreak classifier model..." + huggingface-cli download LLM-Semantic-Router/jailbreak_classifier_modernbert-base_model --local-dir jailbreak_classifier_modernbert-base_model + else + echo "Jailbreak classifier model already exists, skipping..." + fi - # Download PII token classifier model - if [ ! -d "pii_classifier_modernbert-base_presidio_token_model" ]; then - echo "Downloading PII token classifier model..." - huggingface-cli download LLM-Semantic-Router/pii_classifier_modernbert-base_presidio_token_model --local-dir pii_classifier_modernbert-base_presidio_token_model - else - echo "PII token classifier model already exists, skipping..." - fi + # Download PII token classifier model + if [ ! -d "pii_classifier_modernbert-base_presidio_token_model" ]; then + echo "Downloading PII token classifier model..." + huggingface-cli download LLM-Semantic-Router/pii_classifier_modernbert-base_presidio_token_model --local-dir pii_classifier_modernbert-base_presidio_token_model + else + echo "PII token classifier model already exists, skipping..." + fi - echo "All models downloaded successfully!" 
- ls -la /app/models/ - env: - - name: HF_HUB_CACHE - value: /tmp/hf_cache - # Reduced resource requirements for init container - resources: - requests: - memory: "512Mi" - cpu: "250m" - limits: - memory: "1Gi" - cpu: "500m" - volumeMounts: - - name: models-volume - mountPath: /app/models + # Download embedding model all-MiniLM-L12-v2 + if [ ! -d "all-MiniLM-L12-v2" ]; then + echo "Downloading all-MiniLM-L12-v2 embedding model..." + huggingface-cli download sentence-transformers/all-MiniLM-L12-v2 --local-dir all-MiniLM-L12-v2 + else + echo "all-MiniLM-L12-v2 already exists, skipping..." + fi + + + echo "Model setup complete." + ls -la /app/models/ + env: + - name: HF_HUB_CACHE + value: /tmp/hf_cache + # Reduced resource requirements for init container + resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "500m" + volumeMounts: + - name: models-volume + mountPath: /app/models containers: - - name: semantic-router - image: ghcr.io/vllm-project/semantic-router/extproc:latest - args: ["--secure=true"] - securityContext: - runAsNonRoot: false - allowPrivilegeEscalation: false - ports: - - containerPort: 50051 - name: grpc - protocol: TCP - - containerPort: 9190 - name: metrics - protocol: TCP - - containerPort: 8080 - name: classify-api - protocol: TCP - env: - - name: LD_LIBRARY_PATH - value: "/app/lib" - volumeMounts: + - name: semantic-router + image: ghcr.io/vllm-project/semantic-router/extproc:latest + args: ["--secure=true"] + securityContext: + runAsNonRoot: false + allowPrivilegeEscalation: false + ports: + - containerPort: 50051 + name: grpc + protocol: TCP + - containerPort: 9190 + name: metrics + protocol: TCP + - containerPort: 8080 + name: classify-api + protocol: TCP + env: + - name: LD_LIBRARY_PATH + value: "/app/lib" + volumeMounts: + - name: config-volume + mountPath: /app/config + readOnly: true + - name: models-volume + mountPath: /app/models + livenessProbe: + tcpSocket: + port: 50051 + initialDelaySeconds: 60 + periodSeconds: 30 + timeoutSeconds: 10 + failureThreshold: 3 + readinessProbe: + tcpSocket: + port: 50051 + initialDelaySeconds: 90 + periodSeconds: 30 + timeoutSeconds: 10 + failureThreshold: 3 + # Significantly reduced resource requirements for kind cluster + resources: + requests: + memory: "3Gi" # Reduced from 8Gi + cpu: "1" # Reduced from 2 + limits: + memory: "6Gi" # Reduced from 12Gi + cpu: "2" # Reduced from 4 + volumes: - name: config-volume - mountPath: /app/config - readOnly: true + configMap: + name: semantic-router-config - name: models-volume - mountPath: /app/models - livenessProbe: - tcpSocket: - port: 50051 - initialDelaySeconds: 60 - periodSeconds: 30 - timeoutSeconds: 10 - failureThreshold: 3 - readinessProbe: - tcpSocket: - port: 50051 - initialDelaySeconds: 90 - periodSeconds: 30 - timeoutSeconds: 10 - failureThreshold: 3 - # Significantly reduced resource requirements for kind cluster - resources: - requests: - memory: "3Gi" # Reduced from 8Gi - cpu: "1" # Reduced from 2 - limits: - memory: "6Gi" # Reduced from 12Gi - cpu: "2" # Reduced from 4 - volumes: - - name: config-volume - configMap: - name: semantic-router-config - - name: models-volume - persistentVolumeClaim: - claimName: semantic-router-models + persistentVolumeClaim: + claimName: semantic-router-models diff --git a/deploy/kubernetes/kustomization.yaml b/deploy/kubernetes/kustomization.yaml index 3eae4ac9..65b2ccae 100644 --- a/deploy/kubernetes/kustomization.yaml +++ b/deploy/kubernetes/kustomization.yaml @@ -1,25 +1,6 @@ apiVersion: 
kustomize.config.k8s.io/v1beta1 kind: Kustomization -metadata: - name: semantic-router - +# This root points to the 'core' overlay by default for clarity. resources: -- namespace.yaml -- pvc.yaml -- deployment.yaml -- service.yaml - -# Generate ConfigMap -configMapGenerator: -- name: semantic-router-config - files: - - config.yaml - - tools_db.json - -# Namespace for all resources -namespace: vllm-semantic-router-system - -images: -- name: ghcr.io/vllm-project/semantic-router/extproc - newTag: latest + - overlays/core diff --git a/deploy/kubernetes/overlays/core/kustomization.yaml b/deploy/kubernetes/overlays/core/kustomization.yaml new file mode 100644 index 00000000..59d6cf23 --- /dev/null +++ b/deploy/kubernetes/overlays/core/kustomization.yaml @@ -0,0 +1,6 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +resources: + - ../../base + - ../../deployment.yaml diff --git a/deploy/kubernetes/overlays/llm-katan/kustomization.yaml b/deploy/kubernetes/overlays/llm-katan/kustomization.yaml new file mode 100644 index 00000000..a20ca370 --- /dev/null +++ b/deploy/kubernetes/overlays/llm-katan/kustomization.yaml @@ -0,0 +1,6 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +resources: + - ../../base + - ../../deployment.katan.yaml diff --git a/deploy/kubernetes/pvc.yaml b/deploy/kubernetes/pvc.yaml index 08929306..43b66eb9 100644 --- a/deploy/kubernetes/pvc.yaml +++ b/deploy/kubernetes/pvc.yaml @@ -9,5 +9,5 @@ spec: - ReadWriteOnce resources: requests: - storage: 10Gi + storage: 30Gi storageClassName: standard diff --git a/website/docs/installation/kubernetes.md b/website/docs/installation/kubernetes.md index abad76f2..7024ab7f 100644 --- a/website/docs/installation/kubernetes.md +++ b/website/docs/installation/kubernetes.md @@ -35,12 +35,28 @@ kubectl wait --for=condition=Ready nodes --all --timeout=300s ## Step 2: Deploy vLLM Semantic Router -Configure the semantic router by editing `deploy/kubernetes/config.yaml`. This file contains the vLLM configuration, including model config, endpoints, and policies. +Configure the semantic router by editing `deploy/kubernetes/config.yaml`. This file contains the vLLM configuration, including model config, endpoints, and policies. The repository provides two Kustomize overlays similar to docker-compose profiles: -Deploy the semantic router service with all required components: +- core (default): only the semantic-router + - Path: `deploy/kubernetes/overlays/core` (root `deploy/kubernetes/` points here by default) +- llm-katan: semantic-router + an llm-katan sidecar listening on 8002 and serving model name `qwen3` + - Path: `deploy/kubernetes/overlays/llm-katan` -```bash -# Deploy semantic router using Kustomize +Important notes before you apply manifests: + +- `vllm_endpoints.address` must be an IP address (not hostname) reachable from inside the cluster. If your LLM backends run as K8s Services, use the ClusterIP (for example `10.96.0.10`) and set `port` accordingly. Do not include protocol or path. +- The PVC in `deploy/kubernetes/pvc.yaml` uses `storageClassName: standard`. On some clouds or local clusters, the default StorageClass name may differ (e.g., `standard-rwo`, `gp2`, or a provisioner like local-path). Adjust as needed. +- Default PVC size is 30Gi. Size it to at least 2–3x of your total model footprint to leave room for indexes and updates. +- The initContainer downloads several models from Hugging Face on first run and writes them into the PVC. 
Ensure outbound egress to Hugging Face is allowed and there is at least ~6–8 GiB free space for the models specified. +- Per mode, the init container downloads differ: + - core: classifiers + the embedding model `sentence-transformers/all-MiniLM-L12-v2` into `/app/models/all-MiniLM-L12-v2`. + - llm-katan: everything in core, plus `Qwen/Qwen3-0.6B` into `/app/models/Qwen/Qwen3-0.6B`. +- The default `config.yaml` points to `qwen3` at `127.0.0.1:8002`, which matches the llm-katan overlay. If you use core (no sidecar), either change `vllm_endpoints` to your actual backend Service IP:Port, or deploy the llm-katan overlay. + +Deploy the semantic router service with all required components (core mode by default): + +````bash +# Deploy semantic router (core mode) kubectl apply -k deploy/kubernetes/ # Wait for deployment to be ready (this may take several minutes for model downloads) @@ -48,7 +64,14 @@ kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semant # Verify deployment status kubectl get pods -n vllm-semantic-router-system -``` + +To run with the llm-katan overlay instead: + +```bash +kubectl apply -k deploy/kubernetes/overlays/llm-katan +```` + +```` ## Step 3: Install Envoy Gateway @@ -63,7 +86,7 @@ helm upgrade -i eg oci://docker.io/envoyproxy/gateway-helm \ # Wait for Envoy Gateway to be ready kubectl wait --timeout=300s -n envoy-gateway-system deployment/envoy-gateway --for=condition=Available -``` +```` ## Step 4: Install Envoy AI Gateway @@ -135,26 +158,28 @@ Expected output should show the inference pool in `Accepted` state: ```yaml status: parent: - - conditions: - - lastTransitionTime: "2025-09-27T09:27:32Z" - message: 'InferencePool has been Accepted by controller ai-gateway-controller: - InferencePool reconciled successfully' - observedGeneration: 1 - reason: Accepted - status: "True" - type: Accepted - - lastTransitionTime: "2025-09-27T09:27:32Z" - message: 'Reference resolution by controller ai-gateway-controller: All references - resolved successfully' - observedGeneration: 1 - reason: ResolvedRefs - status: "True" - type: ResolvedRefs - parentRef: - group: gateway.networking.k8s.io - kind: Gateway - name: vllm-semantic-router - namespace: vllm-semantic-router-system + - conditions: + - lastTransitionTime: "2025-09-27T09:27:32Z" + message: + "InferencePool has been Accepted by controller ai-gateway-controller: + InferencePool reconciled successfully" + observedGeneration: 1 + reason: Accepted + status: "True" + type: Accepted + - lastTransitionTime: "2025-09-27T09:27:32Z" + message: + "Reference resolution by controller ai-gateway-controller: All references + resolved successfully" + observedGeneration: 1 + reason: ResolvedRefs + status: "True" + type: ResolvedRefs + parentRef: + group: gateway.networking.k8s.io + kind: Gateway + name: vllm-semantic-router + namespace: vllm-semantic-router-system ``` ## Testing the Deployment
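+
+Before sending traffic through the gateway, you can sanity-check the router itself. This is a minimal smoke test; it relies only on the ports exposed by `deploy/kubernetes/service.yaml` (gRPC 50051, HTTP API 8080, metrics 9190) and assumes metrics are served at the conventional Prometheus `/metrics` path:
+
+```bash
+# Confirm the router pod is Ready
+kubectl get pods -n vllm-semantic-router-system -l app=semantic-router
+
+# Scrape the metrics endpoint through a local port-forward
+kubectl port-forward -n vllm-semantic-router-system svc/semantic-router-metrics 9190:9190 &
+curl -s http://localhost:9190/metrics | head
+
+# Expose the HTTP classification API locally for manual testing
+kubectl port-forward -n vllm-semantic-router-system svc/semantic-router 8080:8080
+```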