72 changes: 56 additions & 16 deletions deploy/kubernetes/README.md
Original file line number Diff line number Diff line change
@@ -1,13 +1,16 @@
# Semantic Router Kubernetes Deployment

This directory contains Kubernetes manifests for deploying the Semantic Router using Kustomize.
This directory contains Kubernetes manifests for deploying the Semantic Router using Kustomize. It provides two modes, analogous to the docker-compose profiles:

- core: only the semantic-router (no llm-katan)
- llm-katan: semantic-router plus an llm-katan sidecar listening on 8002 (served model name `qwen3`)

## Architecture

The deployment consists of:

- **ConfigMap**: Contains `config.yaml` and `tools_db.json` configuration files
- **PersistentVolumeClaim**: 10Gi storage for model files
- **PersistentVolumeClaim**: 30Gi storage for model files (adjust based on the models you enable)
- **Deployment**:
- **Init Container**: Downloads/copies model files to persistent volume
- **Main Container**: Runs the semantic router service
@@ -29,11 +32,11 @@ The deployment consists of:
kubectl apply -k deploy/kubernetes/

# Check deployment status
kubectl get pods -l app=semantic-router -n semantic-router
kubectl get services -l app=semantic-router -n semantic-router
kubectl get pods -l app=semantic-router -n vllm-semantic-router-system
kubectl get services -l app=semantic-router -n vllm-semantic-router-system

# View logs
kubectl logs -l app=semantic-router -n semantic-router -f
kubectl logs -l app=semantic-router -n vllm-semantic-router-system -f
```

### Kind (Kubernetes in Docker) Deployment
@@ -86,20 +89,20 @@ kubectl wait --for=condition=Ready nodes --all --timeout=300s
kubectl apply -k deploy/kubernetes/

# Wait for deployment to be ready
kubectl wait --for=condition=Available deployment/semantic-router -n semantic-router --timeout=600s
kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s
```

**Step 3: Check deployment status**

```bash
# Check pods
kubectl get pods -n semantic-router -o wide
kubectl get pods -n vllm-semantic-router-system -o wide

# Check services
kubectl get services -n semantic-router
kubectl get services -n vllm-semantic-router-system

# View logs
kubectl logs -l app=semantic-router -n semantic-router -f
kubectl logs -l app=semantic-router -n vllm-semantic-router-system -f
```

#### Resource Requirements for Kind
@@ -137,13 +140,13 @@ Or using kubectl directly:

```bash
# Access Classification API (HTTP REST)
kubectl port-forward -n semantic-router svc/semantic-router 8080:8080
kubectl port-forward -n vllm-semantic-router-system svc/semantic-router 8080:8080

# Access gRPC API
kubectl port-forward -n semantic-router svc/semantic-router 50051:50051
kubectl port-forward -n vllm-semantic-router-system svc/semantic-router 50051:50051

# Access metrics
kubectl port-forward -n semantic-router svc/semantic-router-metrics 9190:9190
kubectl port-forward -n vllm-semantic-router-system svc/semantic-router-metrics 9190:9190
```

#### Testing the Deployment
@@ -195,6 +198,11 @@ kubectl delete -k deploy/kubernetes/
kind delete cluster --name semantic-router-cluster
```

## Notes on Dependencies

- Gateway API Inference Extension CRDs are required only when using the Envoy AI Gateway integration in `deploy/kubernetes/ai-gateway/`. Follow the installation steps in `website/docs/installation/kubernetes.md` if you plan to use the gateway path.
- The core kustomize deployment in this folder does not install Envoy Gateway or AI Gateway; those are optional components documented separately.

## Make Commands Reference

The project provides comprehensive make targets for managing kind clusters and deployments:
@@ -293,6 +301,11 @@ kubectl top pods -n semantic-router
# Adjust resource limits in deployment.yaml if needed
```

### Storage Sizing

- The default PVC is 30Gi. If the enabled models are small, you can reduce it; otherwise reserve at least 2–3x the total model size.
- If your cluster's default StorageClass isn't named `standard`, change `storageClassName` in `pvc.yaml` accordingly or remove the field to use the default class.
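The two adjustments above can be sketched as a trimmed `pvc.yaml`. This is an illustrative fragment, not the manifest from this repo; the claim name and labels are assumptions, so match them against the actual `pvc.yaml` before applying:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: semantic-router-models # assumed claim name; use the one in pvc.yaml
  namespace: vllm-semantic-router-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      # Reserve roughly 2-3x the total size of the models you enable.
      storage: 30Gi
  # Remove this field to fall back to the cluster's default StorageClass.
  storageClassName: standard
```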

### Resource Optimization

For different environments, you can adjust resource requirements:
@@ -307,16 +320,43 @@ Edit the `resources` section in `deployment.yaml` accordingly.

### Kubernetes Manifests (`deploy/kubernetes/`)

- `deployment.yaml` - Main application deployment with optimized resource settings
- `service.yaml` - Services for gRPC, HTTP API, and metrics
- `base/` - Shared resources (Namespace, PVC, Service, ConfigMap)
- `overlays/core/` - Core deployment (no llm-katan)
- `overlays/llm-katan/` - Deployment with llm-katan sidecar
- `deployment.yaml` - Plain deployment (used by core overlay)
- `deployment.katan.yaml` - Sidecar deployment (used by llm-katan overlay)
- `service.yaml` - gRPC, HTTP API, and metrics services
- `pvc.yaml` - Persistent volume claim for model storage
- `namespace.yaml` - Dedicated namespace for the application
- `config.yaml` - Application configuration
- `config.yaml` - Application configuration (defaults to qwen3 @ 127.0.0.1:8002)
- `tools_db.json` - Tools database for semantic routing
- `kustomization.yaml` - Kustomize configuration for easy deployment
- `kustomization.yaml` - Root entry (defaults to core overlay)

### Development Tools

## Choose a mode: core or llm-katan

- Core mode (the root kustomization points here by default):

```bash
kubectl apply -k deploy/kubernetes
# or explicitly
kubectl apply -k deploy/kubernetes/overlays/core
```

- llm-katan mode:

```bash
kubectl apply -k deploy/kubernetes/overlays/llm-katan
```
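For orientation, an overlay such as `overlays/llm-katan/` typically just layers its variant resources over the shared base. The following is a minimal sketch under that assumption (the exact resource list in this repo's overlay may differ):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  # Shared Namespace, PVC, Service, and ConfigMap.
  - ../../base
  # Sidecar variant of the deployment (file name taken from the manifest list above).
  - deployment.katan.yaml
```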

Notes for llm-katan:

- The init container will attempt to download `Qwen/Qwen3-0.6B` into `/app/models/Qwen/Qwen3-0.6B` and the embedding model `sentence-transformers/all-MiniLM-L12-v2` into `/app/models/all-MiniLM-L12-v2`. In restricted networks, these downloads may fail—pre-populate the PV or point the init script to your internal artifact store as needed.
- The default Kubernetes `config.yaml` has been aligned to use `qwen3` and endpoint `127.0.0.1:8002`.

- `tools/kind/kind-config.yaml` - Kind cluster configuration for local development
- `tools/make/kube.mk` - Make targets for Kubernetes operations
- `Makefile` - Root makefile including all make targets
@@ -11,7 +11,7 @@ spec:
- number: 50051
selector:
matchLabels:
app: vllm-semantic-router
app: semantic-router
endpointPickerRef:
name: semantic-router
port:
19 changes: 19 additions & 0 deletions deploy/kubernetes/base/kustomization.yaml
@@ -0,0 +1,19 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- ../namespace.yaml
- ../pvc.yaml
- ../service.yaml

configMapGenerator:
- name: semantic-router-config
files:
- ../config.yaml
- ../tools_db.json

namespace: vllm-semantic-router-system

images:
- name: ghcr.io/vllm-project/semantic-router/extproc
newTag: latest
57 changes: 29 additions & 28 deletions deploy/kubernetes/config.yaml
@@ -1,15 +1,15 @@
bert_model:
model_id: sentence-transformers/all-MiniLM-L12-v2
model_id: models/all-MiniLM-L12-v2
threshold: 0.6
use_cpu: true

semantic_cache:
enabled: true
backend_type: "memory" # Options: "memory" or "milvus"
backend_type: "memory" # Options: "memory" or "milvus"
similarity_threshold: 0.8
max_entries: 1000 # Only applies to memory backend
max_entries: 1000 # Only applies to memory backend
ttl_seconds: 3600
eviction_policy: "fifo"
eviction_policy: "fifo"

tools:
enabled: true
@@ -32,13 +32,13 @@ prompt_guard:
# NOT supported: domain names (example.com), protocol prefixes (http://), paths (/api), ports in address (use 'port' field)
vllm_endpoints:
- name: "endpoint1"
address: "127.0.0.1" # IPv4 address - REQUIRED format
port: 8000
address: "127.0.0.1" # llm-katan sidecar or local backend
port: 8002
weight: 1

model_config:
"openai/gpt-oss-20b":
reasoning_family: "gpt-oss" # This model uses GPT-OSS reasoning syntax
"qwen3":
reasoning_family: "qwen3" # Match docker-compose default model name
preferred_endpoints: ["endpoint1"]
pii_policy:
allow_by_default: true
@@ -62,76 +62,76 @@ classifier:
categories:
- name: business
model_scores:
- model: openai/gpt-oss-20b
- model: qwen3
score: 0.7
use_reasoning: false # Business performs better without reasoning
use_reasoning: false # Business performs better without reasoning
- name: law
model_scores:
- model: openai/gpt-oss-20b
- model: qwen3
score: 0.4
use_reasoning: false
- name: psychology
model_scores:
- model: openai/gpt-oss-20b
- model: qwen3
score: 0.6
use_reasoning: false
- name: biology
model_scores:
- model: openai/gpt-oss-20b
- model: qwen3
score: 0.9
use_reasoning: false
- name: chemistry
model_scores:
- model: openai/gpt-oss-20b
- model: qwen3
score: 0.6
use_reasoning: true # Enable reasoning for complex chemistry
use_reasoning: true # Enable reasoning for complex chemistry
- name: history
model_scores:
- model: openai/gpt-oss-20b
- model: qwen3
score: 0.7
use_reasoning: false
- name: other
model_scores:
- model: openai/gpt-oss-20b
- model: qwen3
score: 0.7
use_reasoning: false
- name: health
model_scores:
- model: openai/gpt-oss-20b
- model: qwen3
score: 0.5
use_reasoning: false
- name: economics
model_scores:
- model: openai/gpt-oss-20b
- model: qwen3
score: 1.0
use_reasoning: false
- name: math
model_scores:
- model: openai/gpt-oss-20b
- model: qwen3
score: 1.0
use_reasoning: true # Enable reasoning for complex math
use_reasoning: true # Enable reasoning for complex math
- name: physics
model_scores:
- model: openai/gpt-oss-20b
- model: qwen3
score: 0.7
use_reasoning: true # Enable reasoning for physics
use_reasoning: true # Enable reasoning for physics
- name: computer science
model_scores:
- model: openai/gpt-oss-20b
- model: qwen3
score: 0.6
use_reasoning: false
- name: philosophy
model_scores:
- model: openai/gpt-oss-20b
- model: qwen3
score: 0.5
use_reasoning: false
- name: engineering
model_scores:
- model: openai/gpt-oss-20b
- model: qwen3
score: 0.7
use_reasoning: false

default_model: openai/gpt-oss-20b
default_model: qwen3

# Reasoning family configurations
reasoning_families:
@@ -164,5 +164,6 @@ api:
detailed_goroutine_tracking: true
high_resolution_timing: false
sample_rate: 1.0
duration_buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]
duration_buckets:
[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]
size_buckets: [1, 2, 5, 10, 20, 50, 100, 200]