diff --git a/deploy/kubernetes/README.md b/deploy/kubernetes/README.md index 175763cd..ab225a43 100644 --- a/deploy/kubernetes/README.md +++ b/deploy/kubernetes/README.md @@ -1,13 +1,16 @@ # Semantic Router Kubernetes Deployment -This directory contains Kubernetes manifests for deploying the Semantic Router using Kustomize. +This directory contains Kubernetes manifests for deploying the Semantic Router using Kustomize. It provides two modes similar to docker-compose profiles: + +- core: only the semantic-router (no llm-katan) +- llm-katan: semantic-router plus an llm-katan sidecar listening on 8002 (served model name `qwen3`) ## Architecture The deployment consists of: - **ConfigMap**: Contains `config.yaml` and `tools_db.json` configuration files -- **PersistentVolumeClaim**: 10Gi storage for model files +- **PersistentVolumeClaim**: 30Gi storage for model files (adjust based on models you enable) - **Deployment**: - **Init Container**: Downloads/copies model files to persistent volume - **Main Container**: Runs the semantic router service @@ -29,11 +32,11 @@ The deployment consists of: kubectl apply -k deploy/kubernetes/ # Check deployment status -kubectl get pods -l app=semantic-router -n semantic-router -kubectl get services -l app=semantic-router -n semantic-router +kubectl get pods -l app=semantic-router -n vllm-semantic-router-system +kubectl get services -l app=semantic-router -n vllm-semantic-router-system # View logs -kubectl logs -l app=semantic-router -n semantic-router -f +kubectl logs -l app=semantic-router -n vllm-semantic-router-system -f ``` ### Kind (Kubernetes in Docker) Deployment @@ -86,20 +89,20 @@ kubectl wait --for=condition=Ready nodes --all --timeout=300s kubectl apply -k deploy/kubernetes/ # Wait for deployment to be ready -kubectl wait --for=condition=Available deployment/semantic-router -n semantic-router --timeout=600s +kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s ``` **Step 3: Check deployment status** ```bash # Check pods -kubectl get pods -n semantic-router -o wide +kubectl get pods -n vllm-semantic-router-system -o wide # Check services -kubectl get services -n semantic-router +kubectl get services -n vllm-semantic-router-system # View logs -kubectl logs -l app=semantic-router -n semantic-router -f +kubectl logs -l app=semantic-router -n vllm-semantic-router-system -f ``` #### Resource Requirements for Kind @@ -137,13 +140,13 @@ Or using kubectl directly: ```bash # Access Classification API (HTTP REST) -kubectl port-forward -n semantic-router svc/semantic-router 8080:8080 +kubectl port-forward -n vllm-semantic-router-system svc/semantic-router 8080:8080 # Access gRPC API -kubectl port-forward -n semantic-router svc/semantic-router 50051:50051 +kubectl port-forward -n vllm-semantic-router-system svc/semantic-router 50051:50051 # Access metrics -kubectl port-forward -n semantic-router svc/semantic-router-metrics 9190:9190 +kubectl port-forward -n vllm-semantic-router-system svc/semantic-router-metrics 9190:9190 ``` #### Testing the Deployment @@ -195,6 +198,11 @@ kubectl delete -k deploy/kubernetes/ kind delete cluster --name semantic-router-cluster ``` +## Notes on dependencies + +- Gateway API Inference Extension CRDs are required only when using the Envoy AI Gateway integration in `deploy/kubernetes/ai-gateway/`. Follow the installation steps in `website/docs/installation/kubernetes.md` if you plan to use the gateway path. 
+- The core kustomize deployment in this folder does not install Envoy Gateway or AI Gateway; those are optional components documented separately.
+
 ## Make Commands Reference
 
 The project provides comprehensive make targets for managing kind clusters and deployments:
@@ -293,6 +301,11 @@ kubectl top pods -n semantic-router
 # Adjust resource limits in deployment.yaml if needed
 ```
 
+### Storage sizing
+
+- The default PVC is 30Gi. If the enabled models are small, you can reduce it; otherwise reserve at least 2–3x the total model size.
+- If your cluster's default StorageClass isn't named `standard`, change `storageClassName` in `pvc.yaml` accordingly or remove the field to use the default class.
+
 ### Resource Optimization
 
 For different environments, you can adjust resource requirements:
@@ -307,16 +320,43 @@ Edit the `resources` section in `deployment.yaml` accordingly.
 
 ### Kubernetes Manifests (`deploy/kubernetes/`)
 
-- `deployment.yaml` - Main application deployment with optimized resource settings
-- `service.yaml` - Services for gRPC, HTTP API, and metrics
+- `base/` - Shared resources (Namespace, PVC, Service, ConfigMap)
+- `overlays/core/` - Core deployment (no llm-katan)
+- `overlays/llm-katan/` - Deployment with llm-katan sidecar
+- `deployment.yaml` - Plain deployment (used by core overlay)
+- `deployment.katan.yaml` - Sidecar deployment (used by llm-katan overlay)
+- `service.yaml` - gRPC, HTTP API, and metrics services
 - `pvc.yaml` - Persistent volume claim for model storage
 - `namespace.yaml` - Dedicated namespace for the application
-- `config.yaml` - Application configuration
+- `config.yaml` - Application configuration (defaults to qwen3 @ 127.0.0.1:8002)
 - `tools_db.json` - Tools database for semantic routing
-- `kustomization.yaml` - Kustomize configuration for easy deployment
+- `kustomization.yaml` - Root entry (defaults to core overlay)
 
 ### Development Tools
 
+## Choose a mode: core or llm-katan
+
+- Core mode (the root kustomization points to this overlay by default):
+
+  ```bash
+  kubectl apply -k deploy/kubernetes
+  # or explicitly
+  kubectl apply -k deploy/kubernetes/overlays/core
+  ```
+
+- llm-katan mode:
+
+  ```bash
+  kubectl apply -k deploy/kubernetes/overlays/llm-katan
+  ```
+
+Notes for llm-katan:
+
+- The init container will attempt to download `Qwen/Qwen3-0.6B` into `/app/models/Qwen/Qwen3-0.6B` and the embedding model `sentence-transformers/all-MiniLM-L12-v2` into `/app/models/all-MiniLM-L12-v2`. In restricted networks, these downloads may fail; pre-populate the PV or point the init script to your internal artifact store as needed.
+- The default Kubernetes `config.yaml` has been aligned to use `qwen3` and endpoint `127.0.0.1:8002`.
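+
+To verify the sidecar after applying the llm-katan overlay, you can port-forward directly to the pod and exercise its API. This is a minimal smoke test; it assumes llm-katan exposes the standard OpenAI-compatible `/v1/models` and `/v1/chat/completions` routes on port 8002, as configured above:
+
+```bash
+# Forward the sidecar port from the deployment's pod to localhost
+kubectl port-forward -n vllm-semantic-router-system deploy/semantic-router 8002:8002 &
+
+# List served models (should include "qwen3")
+curl -s http://localhost:8002/v1/models
+
+# Send a small chat completion through the sidecar
+curl -s http://localhost:8002/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "qwen3", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 8}'
+```
+
+If these calls fail, check the sidecar logs with `kubectl logs deploy/semantic-router -c llm-katan -n vllm-semantic-router-system`.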
+ - `tools/kind/kind-config.yaml` - Kind cluster configuration for local development - `tools/make/kube.mk` - Make targets for Kubernetes operations - `Makefile` - Root makefile including all make targets diff --git a/deploy/kubernetes/ai-gateway/inference-pool/inference-pool.yaml b/deploy/kubernetes/ai-gateway/inference-pool/inference-pool.yaml index 64afc6f9..7b52e07b 100644 --- a/deploy/kubernetes/ai-gateway/inference-pool/inference-pool.yaml +++ b/deploy/kubernetes/ai-gateway/inference-pool/inference-pool.yaml @@ -11,7 +11,7 @@ spec: - number: 50051 selector: matchLabels: - app: vllm-semantic-router + app: semantic-router endpointPickerRef: name: semantic-router port: diff --git a/deploy/kubernetes/base/kustomization.yaml b/deploy/kubernetes/base/kustomization.yaml new file mode 100644 index 00000000..90192015 --- /dev/null +++ b/deploy/kubernetes/base/kustomization.yaml @@ -0,0 +1,19 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +resources: + - ../namespace.yaml + - ../pvc.yaml + - ../service.yaml + +configMapGenerator: + - name: semantic-router-config + files: + - ../config.yaml + - ../tools_db.json + +namespace: vllm-semantic-router-system + +images: + - name: ghcr.io/vllm-project/semantic-router/extproc + newTag: latest diff --git a/deploy/kubernetes/config.yaml b/deploy/kubernetes/config.yaml index 5bc40cbb..77760013 100644 --- a/deploy/kubernetes/config.yaml +++ b/deploy/kubernetes/config.yaml @@ -1,15 +1,15 @@ bert_model: - model_id: sentence-transformers/all-MiniLM-L12-v2 + model_id: models/all-MiniLM-L12-v2 threshold: 0.6 use_cpu: true semantic_cache: enabled: true - backend_type: "memory" # Options: "memory" or "milvus" + backend_type: "memory" # Options: "memory" or "milvus" similarity_threshold: 0.8 - max_entries: 1000 # Only applies to memory backend + max_entries: 1000 # Only applies to memory backend ttl_seconds: 3600 - eviction_policy: "fifo" + eviction_policy: "fifo" tools: enabled: true @@ -32,13 +32,13 @@ prompt_guard: # NOT supported: domain names (example.com), protocol prefixes (http://), paths (/api), ports in address (use 'port' field) vllm_endpoints: - name: "endpoint1" - address: "127.0.0.1" # IPv4 address - REQUIRED format - port: 8000 + address: "127.0.0.1" # llm-katan sidecar or local backend + port: 8002 weight: 1 model_config: - "openai/gpt-oss-20b": - reasoning_family: "gpt-oss" # This model uses GPT-OSS reasoning syntax + "qwen3": + reasoning_family: "qwen3" # Match docker-compose default model name preferred_endpoints: ["endpoint1"] pii_policy: allow_by_default: true @@ -62,76 +62,76 @@ classifier: categories: - name: business model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.7 - use_reasoning: false # Business performs better without reasoning + use_reasoning: false # Business performs better without reasoning - name: law model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.4 use_reasoning: false - name: psychology model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.6 use_reasoning: false - name: biology model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.9 use_reasoning: false - name: chemistry model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.6 - use_reasoning: true # Enable reasoning for complex chemistry + use_reasoning: true # Enable reasoning for complex chemistry - name: history model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.7 use_reasoning: false - name: other model_scores: - - model: openai/gpt-oss-20b + 
- model: qwen3 score: 0.7 use_reasoning: false - name: health model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.5 use_reasoning: false - name: economics model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 1.0 use_reasoning: false - name: math model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 1.0 - use_reasoning: true # Enable reasoning for complex math + use_reasoning: true # Enable reasoning for complex math - name: physics model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.7 - use_reasoning: true # Enable reasoning for physics + use_reasoning: true # Enable reasoning for physics - name: computer science model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.6 use_reasoning: false - name: philosophy model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.5 use_reasoning: false - name: engineering model_scores: - - model: openai/gpt-oss-20b + - model: qwen3 score: 0.7 use_reasoning: false -default_model: openai/gpt-oss-20b +default_model: qwen3 # Reasoning family configurations reasoning_families: @@ -164,5 +164,6 @@ api: detailed_goroutine_tracking: true high_resolution_timing: false sample_rate: 1.0 - duration_buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30] + duration_buckets: + [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30] size_buckets: [1, 2, 5, 10, 20, 50, 100, 200] diff --git a/deploy/kubernetes/deployment.katan.yaml b/deploy/kubernetes/deployment.katan.yaml new file mode 100644 index 00000000..3aa74d5b --- /dev/null +++ b/deploy/kubernetes/deployment.katan.yaml @@ -0,0 +1,181 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: semantic-router + namespace: vllm-semantic-router-system + labels: + app: semantic-router +spec: + replicas: 1 + selector: + matchLabels: + app: semantic-router + template: + metadata: + labels: + app: semantic-router + spec: + initContainers: + - name: model-downloader + image: python:3.11-slim + securityContext: + runAsNonRoot: false + allowPrivilegeEscalation: false + command: ["/bin/bash", "-c"] + args: + - | + set -e + echo "Installing Hugging Face CLI..." + pip install --no-cache-dir huggingface_hub[cli] + + echo "Downloading classifier models to persistent volume..." + cd /app/models + + # Download category classifier model + if [ ! -d "category_classifier_modernbert-base_model" ]; then + echo "Downloading category classifier model..." + huggingface-cli download LLM-Semantic-Router/category_classifier_modernbert-base_model --local-dir category_classifier_modernbert-base_model + else + echo "Category classifier model already exists, skipping..." + fi + + # Download PII classifier model + if [ ! -d "pii_classifier_modernbert-base_model" ]; then + echo "Downloading PII classifier model..." + huggingface-cli download LLM-Semantic-Router/pii_classifier_modernbert-base_model --local-dir pii_classifier_modernbert-base_model + else + echo "PII classifier model already exists, skipping..." + fi + + # Download jailbreak classifier model + if [ ! -d "jailbreak_classifier_modernbert-base_model" ]; then + echo "Downloading jailbreak classifier model..." + huggingface-cli download LLM-Semantic-Router/jailbreak_classifier_modernbert-base_model --local-dir jailbreak_classifier_modernbert-base_model + else + echo "Jailbreak classifier model already exists, skipping..." + fi + + # Download PII token classifier model + if [ ! 
-d "pii_classifier_modernbert-base_presidio_token_model" ]; then + echo "Downloading PII token classifier model..." + huggingface-cli download LLM-Semantic-Router/pii_classifier_modernbert-base_presidio_token_model --local-dir pii_classifier_modernbert-base_presidio_token_model + else + echo "PII token classifier model already exists, skipping..." + fi + + # Download embedding model all-MiniLM-L12-v2 + if [ ! -d "all-MiniLM-L12-v2" ]; then + echo "Downloading all-MiniLM-L12-v2 embedding model..." + huggingface-cli download sentence-transformers/all-MiniLM-L12-v2 --local-dir all-MiniLM-L12-v2 + else + echo "all-MiniLM-L12-v2 already exists, skipping..." + fi + + # Optional: Prepare Qwen model directory for llm-katan sidecar + # NOTE: Provide the model content under /app/models/Qwen/Qwen3-0.6B via pre-populated PV + # or customize the following block to fetch from your internal artifact store. + if [ ! -d "Qwen/Qwen3-0.6B" ]; then + echo "Downloading Qwen/Qwen3-0.6B for llm-katan..." + mkdir -p Qwen + huggingface-cli download Qwen/Qwen3-0.6B --local-dir Qwen/Qwen3-0.6B || echo "Warning: Qwen3-0.6B download failed; ensure offline pre-population if needed." + else + echo "Qwen/Qwen3-0.6B already exists, skipping..." + fi + + echo "Model directory listing:" && ls -la /app/models/ + env: + - name: HF_HUB_CACHE + value: /tmp/hf_cache + resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "500m" + volumeMounts: + - name: models-volume + mountPath: /app/models + containers: + - name: semantic-router + image: ghcr.io/vllm-project/semantic-router/extproc:latest + args: ["--secure=true"] + securityContext: + runAsNonRoot: false + allowPrivilegeEscalation: false + ports: + - containerPort: 50051 + name: grpc + protocol: TCP + - containerPort: 9190 + name: metrics + protocol: TCP + - containerPort: 8080 + name: classify-api + protocol: TCP + env: + - name: LD_LIBRARY_PATH + value: "/app/lib" + volumeMounts: + - name: config-volume + mountPath: /app/config + readOnly: true + - name: models-volume + mountPath: /app/models + livenessProbe: + tcpSocket: + port: 50051 + initialDelaySeconds: 60 + periodSeconds: 30 + timeoutSeconds: 10 + failureThreshold: 3 + readinessProbe: + tcpSocket: + port: 50051 + initialDelaySeconds: 90 + periodSeconds: 30 + timeoutSeconds: 10 + failureThreshold: 3 + resources: + requests: + memory: "3Gi" + cpu: "1" + limits: + memory: "6Gi" + cpu: "2" + - name: llm-katan + image: ghcr.io/vllm-project/semantic-router/llm-katan:latest + imagePullPolicy: IfNotPresent + args: + [ + "llm-katan", + "--model", + "/app/models/Qwen/Qwen3-0.6B", + "--served-model-name", + "qwen3", + "--host", + "0.0.0.0", + "--port", + "8002", + ] + ports: + - containerPort: 8002 + name: katan + protocol: TCP + volumeMounts: + - name: models-volume + mountPath: /app/models + resources: + requests: + memory: "1Gi" + cpu: "500m" + limits: + memory: "2Gi" + cpu: "1" + volumes: + - name: config-volume + configMap: + name: semantic-router-config + - name: models-volume + persistentVolumeClaim: + claimName: semantic-router-models diff --git a/deploy/kubernetes/deployment.yaml b/deploy/kubernetes/deployment.yaml index ab7000f9..560b9850 100644 --- a/deploy/kubernetes/deployment.yaml +++ b/deploy/kubernetes/deployment.yaml @@ -16,121 +16,130 @@ spec: app: semantic-router spec: initContainers: - - name: model-downloader - image: python:3.11-slim - securityContext: - runAsNonRoot: false - allowPrivilegeEscalation: false - command: ["/bin/bash", "-c"] - args: - - | - set -e - echo 
"Installing Hugging Face CLI..." - pip install --no-cache-dir huggingface_hub[cli] + - name: model-downloader + image: python:3.11-slim + securityContext: + runAsNonRoot: false + allowPrivilegeEscalation: false + command: ["/bin/bash", "-c"] + args: + - | + set -e + echo "Installing Hugging Face CLI..." + pip install --no-cache-dir huggingface_hub[cli] - echo "Downloading models to persistent volume..." - cd /app/models + echo "Downloading models to persistent volume..." + cd /app/models - # Download category classifier model - if [ ! -d "category_classifier_modernbert-base_model" ]; then - echo "Downloading category classifier model..." - huggingface-cli download LLM-Semantic-Router/category_classifier_modernbert-base_model --local-dir category_classifier_modernbert-base_model - else - echo "Category classifier model already exists, skipping..." - fi + # Download category classifier model + if [ ! -d "category_classifier_modernbert-base_model" ]; then + echo "Downloading category classifier model..." + huggingface-cli download LLM-Semantic-Router/category_classifier_modernbert-base_model --local-dir category_classifier_modernbert-base_model + else + echo "Category classifier model already exists, skipping..." + fi - # Download PII classifier model - if [ ! -d "pii_classifier_modernbert-base_model" ]; then - echo "Downloading PII classifier model..." - huggingface-cli download LLM-Semantic-Router/pii_classifier_modernbert-base_model --local-dir pii_classifier_modernbert-base_model - else - echo "PII classifier model already exists, skipping..." - fi + # Download PII classifier model + if [ ! -d "pii_classifier_modernbert-base_model" ]; then + echo "Downloading PII classifier model..." + huggingface-cli download LLM-Semantic-Router/pii_classifier_modernbert-base_model --local-dir pii_classifier_modernbert-base_model + else + echo "PII classifier model already exists, skipping..." + fi - # Download jailbreak classifier model - if [ ! -d "jailbreak_classifier_modernbert-base_model" ]; then - echo "Downloading jailbreak classifier model..." - huggingface-cli download LLM-Semantic-Router/jailbreak_classifier_modernbert-base_model --local-dir jailbreak_classifier_modernbert-base_model - else - echo "Jailbreak classifier model already exists, skipping..." - fi + # Download jailbreak classifier model + if [ ! -d "jailbreak_classifier_modernbert-base_model" ]; then + echo "Downloading jailbreak classifier model..." + huggingface-cli download LLM-Semantic-Router/jailbreak_classifier_modernbert-base_model --local-dir jailbreak_classifier_modernbert-base_model + else + echo "Jailbreak classifier model already exists, skipping..." + fi - # Download PII token classifier model - if [ ! -d "pii_classifier_modernbert-base_presidio_token_model" ]; then - echo "Downloading PII token classifier model..." - huggingface-cli download LLM-Semantic-Router/pii_classifier_modernbert-base_presidio_token_model --local-dir pii_classifier_modernbert-base_presidio_token_model - else - echo "PII token classifier model already exists, skipping..." - fi + # Download PII token classifier model + if [ ! -d "pii_classifier_modernbert-base_presidio_token_model" ]; then + echo "Downloading PII token classifier model..." + huggingface-cli download LLM-Semantic-Router/pii_classifier_modernbert-base_presidio_token_model --local-dir pii_classifier_modernbert-base_presidio_token_model + else + echo "PII token classifier model already exists, skipping..." + fi - echo "All models downloaded successfully!" 
- ls -la /app/models/ - env: - - name: HF_HUB_CACHE - value: /tmp/hf_cache - # Reduced resource requirements for init container - resources: - requests: - memory: "512Mi" - cpu: "250m" - limits: - memory: "1Gi" - cpu: "500m" - volumeMounts: - - name: models-volume - mountPath: /app/models + # Download embedding model all-MiniLM-L12-v2 + if [ ! -d "all-MiniLM-L12-v2" ]; then + echo "Downloading all-MiniLM-L12-v2 embedding model..." + huggingface-cli download sentence-transformers/all-MiniLM-L12-v2 --local-dir all-MiniLM-L12-v2 + else + echo "all-MiniLM-L12-v2 already exists, skipping..." + fi + + + echo "Model setup complete." + ls -la /app/models/ + env: + - name: HF_HUB_CACHE + value: /tmp/hf_cache + # Reduced resource requirements for init container + resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "500m" + volumeMounts: + - name: models-volume + mountPath: /app/models containers: - - name: semantic-router - image: ghcr.io/vllm-project/semantic-router/extproc:latest - args: ["--secure=true"] - securityContext: - runAsNonRoot: false - allowPrivilegeEscalation: false - ports: - - containerPort: 50051 - name: grpc - protocol: TCP - - containerPort: 9190 - name: metrics - protocol: TCP - - containerPort: 8080 - name: classify-api - protocol: TCP - env: - - name: LD_LIBRARY_PATH - value: "/app/lib" - volumeMounts: + - name: semantic-router + image: ghcr.io/vllm-project/semantic-router/extproc:latest + args: ["--secure=true"] + securityContext: + runAsNonRoot: false + allowPrivilegeEscalation: false + ports: + - containerPort: 50051 + name: grpc + protocol: TCP + - containerPort: 9190 + name: metrics + protocol: TCP + - containerPort: 8080 + name: classify-api + protocol: TCP + env: + - name: LD_LIBRARY_PATH + value: "/app/lib" + volumeMounts: + - name: config-volume + mountPath: /app/config + readOnly: true + - name: models-volume + mountPath: /app/models + livenessProbe: + tcpSocket: + port: 50051 + initialDelaySeconds: 60 + periodSeconds: 30 + timeoutSeconds: 10 + failureThreshold: 3 + readinessProbe: + tcpSocket: + port: 50051 + initialDelaySeconds: 90 + periodSeconds: 30 + timeoutSeconds: 10 + failureThreshold: 3 + # Significantly reduced resource requirements for kind cluster + resources: + requests: + memory: "3Gi" # Reduced from 8Gi + cpu: "1" # Reduced from 2 + limits: + memory: "6Gi" # Reduced from 12Gi + cpu: "2" # Reduced from 4 + volumes: - name: config-volume - mountPath: /app/config - readOnly: true + configMap: + name: semantic-router-config - name: models-volume - mountPath: /app/models - livenessProbe: - tcpSocket: - port: 50051 - initialDelaySeconds: 60 - periodSeconds: 30 - timeoutSeconds: 10 - failureThreshold: 3 - readinessProbe: - tcpSocket: - port: 50051 - initialDelaySeconds: 90 - periodSeconds: 30 - timeoutSeconds: 10 - failureThreshold: 3 - # Significantly reduced resource requirements for kind cluster - resources: - requests: - memory: "3Gi" # Reduced from 8Gi - cpu: "1" # Reduced from 2 - limits: - memory: "6Gi" # Reduced from 12Gi - cpu: "2" # Reduced from 4 - volumes: - - name: config-volume - configMap: - name: semantic-router-config - - name: models-volume - persistentVolumeClaim: - claimName: semantic-router-models + persistentVolumeClaim: + claimName: semantic-router-models diff --git a/deploy/kubernetes/kustomization.yaml b/deploy/kubernetes/kustomization.yaml index 3eae4ac9..65b2ccae 100644 --- a/deploy/kubernetes/kustomization.yaml +++ b/deploy/kubernetes/kustomization.yaml @@ -1,25 +1,6 @@ apiVersion: 
kustomize.config.k8s.io/v1beta1 kind: Kustomization -metadata: - name: semantic-router - +# This root points to the 'core' overlay by default for clarity. resources: -- namespace.yaml -- pvc.yaml -- deployment.yaml -- service.yaml - -# Generate ConfigMap -configMapGenerator: -- name: semantic-router-config - files: - - config.yaml - - tools_db.json - -# Namespace for all resources -namespace: vllm-semantic-router-system - -images: -- name: ghcr.io/vllm-project/semantic-router/extproc - newTag: latest + - overlays/core diff --git a/deploy/kubernetes/overlays/core/kustomization.yaml b/deploy/kubernetes/overlays/core/kustomization.yaml new file mode 100644 index 00000000..59d6cf23 --- /dev/null +++ b/deploy/kubernetes/overlays/core/kustomization.yaml @@ -0,0 +1,6 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +resources: + - ../../base + - ../../deployment.yaml diff --git a/deploy/kubernetes/overlays/llm-katan/kustomization.yaml b/deploy/kubernetes/overlays/llm-katan/kustomization.yaml new file mode 100644 index 00000000..a20ca370 --- /dev/null +++ b/deploy/kubernetes/overlays/llm-katan/kustomization.yaml @@ -0,0 +1,6 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +resources: + - ../../base + - ../../deployment.katan.yaml diff --git a/deploy/kubernetes/pvc.yaml b/deploy/kubernetes/pvc.yaml index 08929306..43b66eb9 100644 --- a/deploy/kubernetes/pvc.yaml +++ b/deploy/kubernetes/pvc.yaml @@ -9,5 +9,5 @@ spec: - ReadWriteOnce resources: requests: - storage: 10Gi + storage: 30Gi storageClassName: standard diff --git a/website/docs/installation/kubernetes.md b/website/docs/installation/kubernetes.md index abad76f2..7024ab7f 100644 --- a/website/docs/installation/kubernetes.md +++ b/website/docs/installation/kubernetes.md @@ -35,12 +35,28 @@ kubectl wait --for=condition=Ready nodes --all --timeout=300s ## Step 2: Deploy vLLM Semantic Router -Configure the semantic router by editing `deploy/kubernetes/config.yaml`. This file contains the vLLM configuration, including model config, endpoints, and policies. +Configure the semantic router by editing `deploy/kubernetes/config.yaml`. This file contains the vLLM configuration, including model config, endpoints, and policies. The repository provides two Kustomize overlays similar to docker-compose profiles: -Deploy the semantic router service with all required components: +- core (default): only the semantic-router + - Path: `deploy/kubernetes/overlays/core` (root `deploy/kubernetes/` points here by default) +- llm-katan: semantic-router + an llm-katan sidecar listening on 8002 and serving model name `qwen3` + - Path: `deploy/kubernetes/overlays/llm-katan` -```bash -# Deploy semantic router using Kustomize +Important notes before you apply manifests: + +- `vllm_endpoints.address` must be an IP address (not hostname) reachable from inside the cluster. If your LLM backends run as K8s Services, use the ClusterIP (for example `10.96.0.10`) and set `port` accordingly. Do not include protocol or path. +- The PVC in `deploy/kubernetes/pvc.yaml` uses `storageClassName: standard`. On some clouds or local clusters, the default StorageClass name may differ (e.g., `standard-rwo`, `gp2`, or a provisioner like local-path). Adjust as needed. +- Default PVC size is 30Gi. Size it to at least 2–3x of your total model footprint to leave room for indexes and updates. +- The initContainer downloads several models from Hugging Face on first run and writes them into the PVC. 
Ensure outbound egress to Hugging Face is allowed and there is at least ~6–8 GiB free space for the models specified. +- Per mode, the init container downloads differ: + - core: classifiers + the embedding model `sentence-transformers/all-MiniLM-L12-v2` into `/app/models/all-MiniLM-L12-v2`. + - llm-katan: everything in core, plus `Qwen/Qwen3-0.6B` into `/app/models/Qwen/Qwen3-0.6B`. +- The default `config.yaml` points to `qwen3` at `127.0.0.1:8002`, which matches the llm-katan overlay. If you use core (no sidecar), either change `vllm_endpoints` to your actual backend Service IP:Port, or deploy the llm-katan overlay. + +Deploy the semantic router service with all required components (core mode by default): + +````bash +# Deploy semantic router (core mode) kubectl apply -k deploy/kubernetes/ # Wait for deployment to be ready (this may take several minutes for model downloads) @@ -48,7 +64,14 @@ kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semant # Verify deployment status kubectl get pods -n vllm-semantic-router-system -``` + +To run with the llm-katan overlay instead: + +```bash +kubectl apply -k deploy/kubernetes/overlays/llm-katan +```` + +```` ## Step 3: Install Envoy Gateway @@ -63,7 +86,7 @@ helm upgrade -i eg oci://docker.io/envoyproxy/gateway-helm \ # Wait for Envoy Gateway to be ready kubectl wait --timeout=300s -n envoy-gateway-system deployment/envoy-gateway --for=condition=Available -``` +```` ## Step 4: Install Envoy AI Gateway @@ -135,26 +158,28 @@ Expected output should show the inference pool in `Accepted` state: ```yaml status: parent: - - conditions: - - lastTransitionTime: "2025-09-27T09:27:32Z" - message: 'InferencePool has been Accepted by controller ai-gateway-controller: - InferencePool reconciled successfully' - observedGeneration: 1 - reason: Accepted - status: "True" - type: Accepted - - lastTransitionTime: "2025-09-27T09:27:32Z" - message: 'Reference resolution by controller ai-gateway-controller: All references - resolved successfully' - observedGeneration: 1 - reason: ResolvedRefs - status: "True" - type: ResolvedRefs - parentRef: - group: gateway.networking.k8s.io - kind: Gateway - name: vllm-semantic-router - namespace: vllm-semantic-router-system + - conditions: + - lastTransitionTime: "2025-09-27T09:27:32Z" + message: + "InferencePool has been Accepted by controller ai-gateway-controller: + InferencePool reconciled successfully" + observedGeneration: 1 + reason: Accepted + status: "True" + type: Accepted + - lastTransitionTime: "2025-09-27T09:27:32Z" + message: + "Reference resolution by controller ai-gateway-controller: All references + resolved successfully" + observedGeneration: 1 + reason: ResolvedRefs + status: "True" + type: ResolvedRefs + parentRef: + group: gateway.networking.k8s.io + kind: Gateway + name: vllm-semantic-router + namespace: vllm-semantic-router-system ``` ## Testing the Deployment
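+
+Before sending traffic through the gateway, you can sanity-check the router itself. This is a minimal smoke test; it relies only on the ports exposed by `deploy/kubernetes/service.yaml` (gRPC 50051, HTTP API 8080, metrics 9190) and assumes metrics are served at the conventional Prometheus `/metrics` path:
+
+```bash
+# Confirm the router pod is Ready
+kubectl get pods -n vllm-semantic-router-system -l app=semantic-router
+
+# Scrape the metrics endpoint through a local port-forward
+kubectl port-forward -n vllm-semantic-router-system svc/semantic-router-metrics 9190:9190 &
+curl -s http://localhost:9190/metrics | head
+
+# Expose the HTTP classification API locally for manual testing
+kubectl port-forward -n vllm-semantic-router-system svc/semantic-router 8080:8080
+```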