Commit cd7e05a

working solution
Signed-off-by: Ryan Cook <[email protected]>
1 parent: ba8d60a

File tree

6 files changed: +141 additions, −66 deletions


deploy/kserve/QUICKSTART.md

Lines changed: 16 additions & 4 deletions
@@ -81,9 +81,9 @@ Just need semantic routing with defaults:
 ./deploy.sh -n myproject -i mymodel -m mymodel
 ```
 
-### Scenario 2: Custom Storage
+### Scenario 2: Custom Storage and Embedding Model
 
-Using a specific storage class or larger PVCs:
+Using a specific storage class, larger PVCs, and custom embedding model:
 
 ```bash
 ./deploy.sh \
@@ -92,9 +92,16 @@ Using a specific storage class or larger PVCs:
   -m mymodel \
   -s gp3-csi \
   --models-pvc-size 20Gi \
-  --cache-pvc-size 10Gi
+  --cache-pvc-size 10Gi \
+  --embedding-model all-mpnet-base-v2
 ```
 
+**Available Embedding Models:**
+- `all-MiniLM-L12-v2` (default) - Balanced speed/quality (~90MB)
+- `all-mpnet-base-v2` - Higher quality, larger (~420MB)
+- `all-MiniLM-L6-v2` - Faster, smaller (~80MB)
+- `paraphrase-multilingual-MiniLM-L12-v2` - Multilingual support
+
 ### Scenario 3: Preview Before Deploying
 
 Want to see what will be created first:
@@ -235,7 +242,12 @@ Simply redeploy:
 
 1. **Run validation tests**:
    ```bash
-   NAMESPACE=<ns> MODEL_NAME=<model> ./test-semantic-routing.sh
+   # Set namespace and model name
+   NAMESPACE=<namespace> MODEL_NAME=<model> ./test-semantic-routing.sh
+
+   # Or let the script auto-detect from your deployment
+   cd deploy/kserve
+   ./test-semantic-routing.sh
    ```
 
 2. **Customize configuration**: See [README.md](./README.md) for detailed configuration options:
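
To confirm which embedding model a rendered deployment actually picked up, the substituted value can be read back from the router ConfigMap. A minimal sketch; the ConfigMap name `semantic-router-config` is an assumption, so check `metadata.name` in configmap-router-config.yaml for the real one:

```bash
# Read the rendered config back out of the cluster and look for model_id.
# "semantic-router-config" is an assumed ConfigMap name - adjust as needed.
kubectl -n myproject get configmap semantic-router-config \
  -o jsonpath='{.data.config\.yaml}' | grep model_id
# expected for the default: model_id: models/all-MiniLM-L12-v2
```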

deploy/kserve/README.md

Lines changed: 5 additions & 0 deletions
@@ -417,7 +417,12 @@ time curl -k -s "https://$ROUTER_URL/v1/chat/completions" \
 Run comprehensive validation tests:
 
 ```bash
+# Set environment variables and run tests
 NAMESPACE=$NAMESPACE MODEL_NAME=my-model-name ./test-semantic-routing.sh
+
+# Or let the script auto-detect from config
+cd deploy/kserve
+./test-semantic-routing.sh
 ```
 
 ## Configuration Deep Dive
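
The inline `VAR=value command` form scopes the variables to a single run; exporting them once is equivalent shell behavior and saves retyping across repeated test runs. A small sketch with placeholder values:

```bash
# Equivalent to the inline form, but the variables persist for the session.
export NAMESPACE=semantic        # placeholder namespace
export MODEL_NAME=granite32-8b   # placeholder model name
cd deploy/kserve
./test-semantic-routing.sh
```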

deploy/kserve/configmap-router-config.yaml

Lines changed: 15 additions & 7 deletions
@@ -8,7 +8,7 @@ metadata:
 data:
   config.yaml: |
     bert_model:
-      model_id: models/all-MiniLM-L12-v2
+      model_id: models/{{EMBEDDING_MODEL}}
       threshold: 0.6
       use_cpu: true
 
@@ -25,7 +25,7 @@ data:
       embedding_model: "bert"
 
     tools:
-      enabled: true
+      enabled: false # Disabled - tools_db.json not included in KServe deployment
       top_k: 3
       similarity_threshold: 0.2
       tools_db_path: "config/tools_db.json"
@@ -203,11 +203,19 @@ data:
       duration_buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]
       size_buckets: [1, 2, 5, 10, 20, 50, 100, 200]
 
-    # Embedding Models Configuration
-    embedding_models:
-      qwen3_model_path: "models/Qwen3-Embedding-0.6B"
-      gemma_model_path: "models/embeddinggemma-300m"
-      use_cpu: true
+    # Embedding Models Configuration (Optional)
+    # These are SEPARATE from the bert_model above and are used for the /v1/embeddings API endpoint.
+    # The bert_model (configured above) is used for semantic caching and tools similarity.
+    #
+    # To enable the embeddings API with Qwen3/Gemma models:
+    # 1. Uncomment the section below
+    # 2. Update the deployment init container to download these models
+    # 3. Note: These models are large (~600MB each) and not required for routing functionality
+    #
+    # embedding_models:
+    #   qwen3_model_path: "models/Qwen3-Embedding-0.6B"
+    #   gemma_model_path: "models/embeddinggemma-300m"
+    #   use_cpu: true
 
     # Observability Configuration
     observability:
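
The new comments draw a line between the always-on `bert_model` (semantic caching and tools similarity) and the optional `embedding_models` block that backs the `/v1/embeddings` endpoint. If that block is uncommented and the models are downloaded, a request along OpenAI-style lines should exercise it; the host, port, and body fields below are assumptions rather than a confirmed API shape:

```bash
# Hedged sketch: probe the optional /v1/embeddings endpoint.
# Router host/port and the request fields are assumptions - adjust
# to the router's actual service address and API contract.
curl -s "http://<router-host>:<port>/v1/embeddings" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3", "input": "What is semantic routing?"}'
```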

deploy/kserve/deploy.sh

Lines changed: 17 additions & 3 deletions
@@ -22,6 +22,13 @@ MODEL_NAME=""
 STORAGE_CLASS=""
 MODELS_PVC_SIZE="10Gi"
 CACHE_PVC_SIZE="5Gi"
+# Embedding model for semantic caching and tools similarity
+# Common options from sentence-transformers:
+#   - all-MiniLM-L12-v2 (default, balanced speed/quality)
+#   - all-mpnet-base-v2 (higher quality, slower)
+#   - all-MiniLM-L6-v2 (faster, lower quality)
+#   - paraphrase-multilingual-MiniLM-L12-v2 (multilingual)
+EMBEDDING_MODEL="all-MiniLM-L12-v2"
 DRY_RUN=false
 SKIP_VALIDATION=false
 
@@ -41,6 +48,7 @@ Optional:
   -s, --storage-class CLASS  StorageClass for PVCs (default: cluster default)
   --models-pvc-size SIZE     Size for models PVC (default: 10Gi)
   --cache-pvc-size SIZE      Size for cache PVC (default: 5Gi)
+  --embedding-model MODEL    BERT embedding model (default: all-MiniLM-L12-v2)
   --dry-run                  Generate manifests without applying
   --skip-validation          Skip pre-deployment validation
   -h, --help                 Show this help message
@@ -49,8 +57,8 @@ Examples:
   # Deploy to namespace 'semantic' with granite32-8b model
   $0 -n semantic -i granite32-8b -m granite32-8b
 
-  # Deploy with custom storage class
-  $0 -n myproject -i llama3-70b -m llama3-70b -s gp3-csi
+  # Deploy with custom storage class and embedding model
+  $0 -n myproject -i llama3-70b -m llama3-70b -s gp3-csi --embedding-model all-mpnet-base-v2
 
   # Dry run to see what will be deployed
   $0 -n semantic -i granite32-8b -m granite32-8b --dry-run
@@ -93,6 +101,10 @@ while [[ $# -gt 0 ]]; do
       CACHE_PVC_SIZE="$2"
       shift 2
       ;;
+    --embedding-model)
+      EMBEDDING_MODEL="$2"
+      shift 2
+      ;;
     --dry-run)
      DRY_RUN=true
       shift
@@ -129,6 +141,7 @@ echo -e "${BLUE}Configuration:${NC}"
 echo "  Namespace: $NAMESPACE"
 echo "  InferenceService: $INFERENCESERVICE_NAME"
 echo "  Model Name: $MODEL_NAME"
+echo "  Embedding Model: $EMBEDDING_MODEL"
 echo "  Storage Class: ${STORAGE_CLASS:-<cluster default>}"
 echo "  Models PVC Size: $MODELS_PVC_SIZE"
 echo "  Cache PVC Size: $CACHE_PVC_SIZE"
@@ -237,7 +250,7 @@ fi
 echo -e "${BLUE}Step 2: Generating manifests...${NC}"
 
 TEMP_DIR=$(mktemp -d)
-trap "rm -rf $TEMP_DIR" EXIT
+trap 'rm -rf "$TEMP_DIR"' EXIT
 
 # Function to substitute variables in a file
 substitute_vars() {
@@ -247,6 +260,7 @@ substitute_vars() {
   sed -e "s/{{NAMESPACE}}/$NAMESPACE/g" \
     -e "s/{{INFERENCESERVICE_NAME}}/$INFERENCESERVICE_NAME/g" \
     -e "s/{{MODEL_NAME}}/$MODEL_NAME/g" \
+    -e "s|{{EMBEDDING_MODEL}}|$EMBEDDING_MODEL|g" \
    -e "s/{{PREDICTOR_SERVICE_IP}}/${PREDICTOR_SERVICE_IP:-10.0.0.1}/g" \
     -e "s/{{MODELS_PVC_SIZE}}/$MODELS_PVC_SIZE/g" \
     -e "s/{{CACHE_PVC_SIZE}}/$CACHE_PVC_SIZE/g" \

deploy/kserve/deployment.yaml

Lines changed: 20 additions & 12 deletions
@@ -63,7 +63,7 @@ spec:
             ("LLM-Semantic-Router/pii_classifier_modernbert-base_model", "pii_classifier_modernbert-base_model"),
             ("LLM-Semantic-Router/jailbreak_classifier_modernbert-base_model", "jailbreak_classifier_modernbert-base_model"),
             ("LLM-Semantic-Router/pii_classifier_modernbert-base_presidio_token_model", "pii_classifier_modernbert-base_presidio_token_model"),
-            ("sentence-transformers/all-MiniLM-L12-v2", "all-MiniLM-L12-v2")
+            ("sentence-transformers/{{EMBEDDING_MODEL}}", "{{EMBEDDING_MODEL}}")
         ]
 
         cache_dir = "/app/cache/hf"
@@ -72,17 +72,25 @@ spec:
         for repo_id, local_dir_name in models:
             local_dir = os.path.join(base_dir, local_dir_name)
 
-            # Check if model already exists
+            # Check if model weights actually exist (not just the directory)
             has_model = False
             if os.path.exists(local_dir):
-                for ext in ['.safetensors', '.bin', 'pytorch_model.']:
-                    for root, dirs, files in os.walk(local_dir):
-                        if any(ext in f for f in files):
+                # Look specifically for model weight files
+                for root, dirs, files in os.walk(local_dir):
+                    for f in files:
+                        if f.endswith('.safetensors') or f.endswith('.bin') or f.startswith('pytorch_model.'):
                             has_model = True
+                            print(f"Found model weights: {f}")
                             break
                     if has_model:
                         break
 
+            # Clean up incomplete downloads
+            if os.path.exists(local_dir) and not has_model:
+                print(f"Removing incomplete download: {local_dir_name}")
+                import shutil
+                shutil.rmtree(local_dir, ignore_errors=True)
+
             if not has_model:
                 print(f"Downloading {repo_id}...")
                 snapshot_download(
@@ -124,11 +132,11 @@ spec:
           value: /tmp/python_user/bin:/usr/local/bin:/usr/bin:/bin
         resources:
           requests:
-            memory: "512Mi"
-            cpu: "250m"
-          limits:
             memory: "1Gi"
             cpu: "500m"
+          limits:
+            memory: "2Gi"
+            cpu: "1"
         volumeMounts:
         - name: models-volume
           mountPath: /app/models
@@ -194,11 +202,11 @@ spec:
           failureThreshold: 3
         resources:
           requests:
-            memory: "3Gi"
-            cpu: "1"
+            memory: "4Gi"
+            cpu: "1500m"
           limits:
-            memory: "6Gi"
-            cpu: "2"
+            memory: "8Gi"
+            cpu: "3"
 
       # Envoy proxy container - routes to KServe endpoints
       - name: envoy-proxy
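
The init-container change stops treating a bare directory as a finished download: only real weight files (`*.safetensors`, `*.bin`, `pytorch_model.*`) count, and a partial directory is removed and re-pulled. The same check is handy when inspecting the models PVC from a debug pod; a minimal sketch, assuming the deployment's `/app/models/<model>` layout. On the resource bumps: `1500m` means 1.5 CPU cores, and requests are what the scheduler reserves while limits are the hard cap.

```bash
# Hedged sketch: verify a model directory on the PVC really contains weights,
# mirroring the init container's new check. The path assumes the /app/models
# layout used by this deployment.
MODEL_DIR="/app/models/all-MiniLM-L12-v2"
if find "$MODEL_DIR" -type f \
     \( -name '*.safetensors' -o -name '*.bin' -o -name 'pytorch_model.*' \) \
     2>/dev/null | grep -q .; then
  echo "weights present"
else
  echo "no weight files - incomplete download, safe to delete and re-pull"
fi
```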
