diff --git a/deploy/kserve/README.md b/deploy/kserve/README.md
new file mode 100644
index 00000000..9ff79be0
--- /dev/null
+++ b/deploy/kserve/README.md
@@ -0,0 +1,289 @@
+# Semantic Router Integration with OpenShift AI KServe
+
+Deploy vLLM Semantic Router as an intelligent gateway for your OpenShift AI KServe InferenceServices.
+
+> **Deployment Focus**: This guide covers deploying the semantic router on **OpenShift AI with KServe**.
+>
+> **Learning about features?** Feature documentation is linked throughout this guide.
+
+## Overview
+
+The semantic router acts as an intelligent API gateway that provides:
+
+- **Intelligent Model Selection**: Automatically routes requests to the best model based on semantic understanding
+- **PII Detection & Protection**: Blocks or redacts sensitive information before it is sent to models
+- **Prompt Guard**: Detects and blocks jailbreak attempts
+- **Semantic Caching**: Reduces latency and cost through intelligent response caching
+- **Category-Specific Prompts**: Injects domain-specific system prompts for better results
+- **Tools Auto-Selection**: Automatically selects relevant tools for function calling
+
+## Prerequisites
+
+Before deploying, ensure you have:
+
+1. **OpenShift Cluster** with OpenShift AI (RHOAI) installed
+2. **KServe InferenceService** already deployed and running
+3. **OpenShift CLI (oc)** installed and logged in
+4. **Cluster admin or namespace admin** permissions
+
+## Quick Deployment
+
+Use the `deploy.sh` script for automated deployment. It handles validation, model downloads, and resource creation:
+
+```bash
+./deploy.sh --namespace <namespace> --inferenceservice <inferenceservice-name> --model <model-name>
+```
+
+**Example:**
+
+```bash
+./deploy.sh -n semantic -i granite32-8b -m granite32-8b
+```
+
+The script validates prerequisites, creates a stable service for your predictor, downloads classification models (~2-3 min), and deploys all resources. Optional flags include `--embedding-model`, `--storage-class`, `--models-pvc-size`, and `--cache-pvc-size`. For manual step-by-step deployment, continue reading below.
+
+## Manual Deployment
+
+### Step 1: Verify InferenceService
+
+Check that your InferenceService is deployed and ready, then create a stable ClusterIP service for its predictor (KServe's default predictor service is headless):
+
+```bash
+NAMESPACE=<your-namespace>
+INFERENCESERVICE_NAME=<your-inferenceservice-name>
+
+# List InferenceServices
+oc get inferenceservice -n $NAMESPACE
+
+# Create stable ClusterIP service for predictor
+cat <<EOF | oc apply -n $NAMESPACE -f -
+apiVersion: v1
+kind: Service
+metadata:
+  name: ${INFERENCESERVICE_NAME}-predictor-stable
+  labels:
+    app: ${INFERENCESERVICE_NAME}
+    component: predictor-stable
+spec:
+  type: ClusterIP
+  selector:
+    serving.kserve.io/inferenceservice: ${INFERENCESERVICE_NAME}
+  ports:
+    - name: http
+      port: 8080
+      targetPort: 8080
+      protocol: TCP
+EOF
+```
+
+### Step 2: Update Configuration
+
+Edit `configmap-router-config.yaml` and `configmap-envoy-config.yaml` and replace the `{{INFERENCESERVICE_NAME}}`, `{{NAMESPACE}}`, `{{MODEL_NAME}}`, `{{EMBEDDING_MODEL}}`, and `{{PREDICTOR_SERVICE_IP}}` template variables with your values; also set `{{MODELS_PVC_SIZE}}` and `{{CACHE_PVC_SIZE}}` in `pvc.yaml` and `{{NAMESPACE}}` in `peerauthentication.yaml` (`deploy.sh` does all of this automatically). The stable service ClusterIP is available via `oc get svc $INFERENCESERVICE_NAME-predictor-stable -n $NAMESPACE -o jsonpath='{.spec.clusterIP}'`, and the in-cluster predictor DNS name follows the format `<inferenceservice-name>-predictor.<namespace>.svc.cluster.local`
+
+### Step 3: Deploy Resources
+
+Apply manifests in order:
+
+```bash
+NAMESPACE=<your-namespace>
+
+# Deploy resources
+oc apply -f serviceaccount.yaml -n $NAMESPACE
+oc apply -f pvc.yaml -n $NAMESPACE
+oc apply -f configmap-router-config.yaml -n $NAMESPACE
+oc apply -f configmap-envoy-config.yaml -n $NAMESPACE
+oc apply -f peerauthentication.yaml -n $NAMESPACE
+oc apply -f deployment.yaml -n $NAMESPACE
+oc apply -f service.yaml -n $NAMESPACE
+oc apply -f route.yaml -n $NAMESPACE
+```
+
+### Step 4: Wait for Ready
+
+Monitor deployment progress:
+
+```bash
+# Watch pod status
+oc get pods -l app=semantic-router -n $NAMESPACE -w
+
+# Check logs
+oc logs -l app=semantic-router -c semantic-router -n $NAMESPACE -f
+```
+
+The pod downloads models (~2-3 minutes) and then starts serving traffic.
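+
+If you prefer to block until the rollout finishes instead of watching the pod list, `oc wait` works as well. This is a convenience sketch; the label selector matches the manifests in this directory and the timeout value is only a suggestion:
+
+```bash
+# Wait for the semantic-router pod to report Ready (the init container's model
+# downloads usually take 2-3 minutes, so allow a generous timeout)
+oc wait --for=condition=Ready pod -l app=semantic-router -n $NAMESPACE --timeout=600s
+```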
+
+## Accessing Services
+
+Get the route URL:
+
+```bash
+ROUTER_URL=$(oc get route semantic-router-kserve -n $NAMESPACE -o jsonpath='{.spec.host}')
+echo "External URL: https://$ROUTER_URL"
+```
+
+Test the deployment:
+
+```bash
+# Test models endpoint
+curl -k "https://$ROUTER_URL/v1/models"
+
+# Test chat completion
+curl -k "https://$ROUTER_URL/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "<model-name>",
+    "messages": [{"role": "user", "content": "What is 2+2?"}],
+    "max_tokens": 50
+  }'
+```
+
+Run validation tests:
+
+```bash
+# Auto-detect configuration
+./test-semantic-routing.sh
+
+# Or specify explicitly
+NAMESPACE=$NAMESPACE MODEL_NAME=<model-name> ./test-semantic-routing.sh
+```
+
+## Monitoring
+
+### Check Deployment Status
+
+```bash
+# Check pods
+oc get pods -l app=semantic-router -n $NAMESPACE
+
+# Check services
+oc get svc -n $NAMESPACE
+
+# Check routes
+oc get routes -n $NAMESPACE
+```
+
+### View Logs
+
+```bash
+# Router logs
+oc logs -l app=semantic-router -c semantic-router -n $NAMESPACE -f
+
+# Model download logs (init container)
+oc logs -l app=semantic-router -c model-downloader -n $NAMESPACE
+
+# Envoy logs
+oc logs -l app=semantic-router -c envoy-proxy -n $NAMESPACE -f
+```
+
+### Metrics
+
+```bash
+# Port-forward metrics endpoint
+POD=$(oc get pods -l app=semantic-router -n $NAMESPACE -o jsonpath='{.items[0].metadata.name}')
+oc port-forward $POD 9190:9190 -n $NAMESPACE
+
+# View metrics
+curl http://localhost:9190/metrics
+```
+
+## Cleanup
+
+Remove all deployed resources:
+
+```bash
+NAMESPACE=<your-namespace>
+
+oc delete route semantic-router-kserve -n $NAMESPACE
+oc delete service semantic-router-kserve -n $NAMESPACE
+oc delete deployment semantic-router-kserve -n $NAMESPACE
+oc delete configmap semantic-router-kserve-config semantic-router-envoy-kserve-config -n $NAMESPACE
+oc delete pvc semantic-router-models semantic-router-cache -n $NAMESPACE
+oc delete peerauthentication semantic-router-kserve-permissive -n $NAMESPACE
+oc delete serviceaccount semantic-router -n $NAMESPACE
+```
+
+**Warning**: Deleting PVCs removes downloaded models and cache data. To preserve data, skip PVC deletion.
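+
+If you deployed with `deploy.sh`, the script also created a stable ClusterIP Service in front of the predictor. It is not part of the manifest list above, so remove it separately; the name below assumes the script's default `<inferenceservice-name>-predictor-stable` naming:
+
+```bash
+# Created by deploy.sh as a stable front for the (headless) KServe predictor service
+oc delete service <inferenceservice-name>-predictor-stable -n $NAMESPACE
+```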
+
+## Troubleshooting
+
+### Pod Not Starting
+
+```bash
+# Check pod status and events
+oc get pods -l app=semantic-router -n $NAMESPACE
+oc describe pod -l app=semantic-router -n $NAMESPACE
+
+# Check init container logs (model download)
+oc logs -l app=semantic-router -c model-downloader -n $NAMESPACE
+```
+
+**Common causes:**
+
+- Network issues downloading models
+- PVC not bound - check storage class
+- Insufficient memory - increase init container resources
+
+### Router Container Crashing
+
+```bash
+# Check router logs
+oc logs -l app=semantic-router -c semantic-router -n $NAMESPACE --previous
+```
+
+**Common causes:**
+
+- Configuration error - validate YAML syntax
+- Invalid IP address - use ClusterIP not DNS in `vllm_endpoints.address`
+- Missing models - verify init container completed
+
+### Cannot Connect to InferenceService
+
+```bash
+# Test from router pod
+POD=$(oc get pods -l app=semantic-router -n $NAMESPACE -o jsonpath='{.items[0].metadata.name}')
+oc exec $POD -c semantic-router -n $NAMESPACE -- \
+  curl -v http://<inferenceservice-name>-predictor.$NAMESPACE.svc.cluster.local:8080/v1/models
+```
+
+**Common causes:**
+
+- InferenceService not ready - check `oc get inferenceservice -n $NAMESPACE`
+- Wrong DNS name - verify format: `<inferenceservice-name>-predictor.<namespace>.svc.cluster.local`
+- Network policy blocking traffic
+- mTLS mode mismatch - ensure PERMISSIVE mode in PeerAuthentication
+
+## Configuration
+
+For detailed configuration options, see the main project documentation:
+
+- **Category Classification**: Train custom models at [Category Classifier Training](../../src/training/classifier_model_fine_tuning/)
+- **PII Detection**: Train custom models at [PII Detection Training](../../src/training/pii_model_fine_tuning/)
+- **Prompt Guard**: Train custom models at [Prompt Guard Training](../../src/training/prompt_guard_fine_tuning/)
+
+## Related Documentation
+
+### Within This Repository
+
+- **[Category Classifier Training](../../src/training/classifier_model_fine_tuning/)** - Train custom category classification models
+- **[PII Detector Training](../../src/training/pii_model_fine_tuning/)** - Train custom PII detection models
+- **[Prompt Guard Training](../../src/training/prompt_guard_fine_tuning/)** - Train custom jailbreak detection models
+
+### Other Deployment Options
+
+- **[OpenShift Deployment](../openshift/)** - Deploy with standalone vLLM containers (not KServe)
+- *This directory* - OpenShift AI KServe deployment (you are here)
+
+### External Resources
+
+- **Main Project**: https://github.com/vllm-project/semantic-router
+- **Full Documentation**: https://vllm-semantic-router.com
+- **KServe Docs**: https://kserve.github.io/website/
diff --git a/deploy/kserve/configmap-envoy-config.yaml b/deploy/kserve/configmap-envoy-config.yaml new file mode 100644 index 00000000..3fe150de --- /dev/null +++ b/deploy/kserve/configmap-envoy-config.yaml @@ -0,0 +1,167 @@ +apiVersion: v1 +kind: ConfigMap +metadata: + name: semantic-router-envoy-kserve-config + labels: + app: semantic-router + component: envoy +data: + envoy.yaml: | + # Envoy configuration for KServe InferenceService integration + # This config routes traffic to KServe predictors based on semantic router decisions + static_resources: + listeners: + - name: listener_0 + address: + socket_address: + address: 0.0.0.0 + port_value: 8801 + filter_chains: + - filters: + - name: envoy.filters.network.http_connection_manager + typed_config: + "@type": 
type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager + stat_prefix: ingress_http + access_log: + - name: envoy.access_loggers.stdout + typed_config: + "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog + log_format: + json_format: + time: "%START_TIME%" + protocol: "%PROTOCOL%" + request_method: "%REQ(:METHOD)%" + request_path: "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%" + response_code: "%RESPONSE_CODE%" + response_flags: "%RESPONSE_FLAGS%" + bytes_received: "%BYTES_RECEIVED%" + bytes_sent: "%BYTES_SENT%" + duration: "%DURATION%" + upstream_host: "%UPSTREAM_HOST%" + upstream_cluster: "%UPSTREAM_CLUSTER%" + upstream_local_address: "%UPSTREAM_LOCAL_ADDRESS%" + request_id: "%REQ(X-REQUEST-ID)%" + selected_model: "%REQ(X-SELECTED-MODEL)%" + selected_endpoint: "%REQ(X-GATEWAY-DESTINATION-ENDPOINT)%" + route_config: + name: local_route + virtual_hosts: + - name: local_service + domains: ["*"] + routes: + # Route /v1/models to semantic router for model aggregation + - match: + path: "/v1/models" + route: + cluster: semantic_router_cluster + timeout: 300s + # Dynamic route - destination determined by x-gateway-destination-endpoint header + - match: + prefix: "/" + route: + cluster: kserve_dynamic_cluster + timeout: 300s + http_filters: + - name: envoy.filters.http.ext_proc + typed_config: + "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor + grpc_service: + envoy_grpc: + cluster_name: extproc_service + allow_mode_override: true + processing_mode: + request_header_mode: "SEND" + response_header_mode: "SEND" + request_body_mode: "BUFFERED" + response_body_mode: "BUFFERED" + request_trailer_mode: "SKIP" + response_trailer_mode: "SKIP" + failure_mode_allow: true + message_timeout: 300s + - name: envoy.filters.http.router + typed_config: + "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router + suppress_envoy_headers: true + http2_protocol_options: + max_concurrent_streams: 100 + initial_stream_window_size: 65536 + initial_connection_window_size: 1048576 + stream_idle_timeout: "300s" + request_timeout: "300s" + common_http_protocol_options: + idle_timeout: "300s" + + clusters: + - name: extproc_service + connect_timeout: 300s + per_connection_buffer_limit_bytes: 52428800 + type: STATIC + lb_policy: ROUND_ROBIN + typed_extension_protocol_options: + envoy.extensions.upstreams.http.v3.HttpProtocolOptions: + "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions + explicit_http_config: + http2_protocol_options: + connection_keepalive: + interval: 300s + timeout: 300s + load_assignment: + cluster_name: extproc_service + endpoints: + - lb_endpoints: + - endpoint: + address: + socket_address: + address: 127.0.0.1 + port_value: 50051 + + # Static cluster for semantic router API + - name: semantic_router_cluster + connect_timeout: 300s + per_connection_buffer_limit_bytes: 52428800 + type: STATIC + lb_policy: ROUND_ROBIN + load_assignment: + cluster_name: semantic_router_cluster + endpoints: + - lb_endpoints: + - endpoint: + address: + socket_address: + address: 127.0.0.1 + port_value: 8080 + typed_extension_protocol_options: + envoy.extensions.upstreams.http.v3.HttpProtocolOptions: + "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions + explicit_http_config: + http_protocol_options: {} + + # DNS-based cluster for KServe InferenceService (headless service) + # Uses service DNS name with container port (8080) for 
Istio routing + # Template variables: {{INFERENCESERVICE_NAME}}, {{NAMESPACE}} + - name: kserve_dynamic_cluster + connect_timeout: 300s + per_connection_buffer_limit_bytes: 52428800 + type: STRICT_DNS + lb_policy: ROUND_ROBIN + dns_lookup_family: V4_ONLY + load_assignment: + cluster_name: kserve_dynamic_cluster + endpoints: + - lb_endpoints: + - endpoint: + address: + socket_address: + address: {{INFERENCESERVICE_NAME}}-predictor.{{NAMESPACE}}.svc.cluster.local + port_value: 8080 + typed_extension_protocol_options: + envoy.extensions.upstreams.http.v3.HttpProtocolOptions: + "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions + explicit_http_config: + http_protocol_options: {} + + admin: + address: + socket_address: + address: "127.0.0.1" + port_value: 19000 diff --git a/deploy/kserve/configmap-router-config.yaml b/deploy/kserve/configmap-router-config.yaml new file mode 100644 index 00000000..83328f19 --- /dev/null +++ b/deploy/kserve/configmap-router-config.yaml @@ -0,0 +1,235 @@ +apiVersion: v1 +kind: ConfigMap +metadata: + name: semantic-router-kserve-config + labels: + app: semantic-router + component: config +data: + config.yaml: | + bert_model: + model_id: models/{{EMBEDDING_MODEL}} + threshold: 0.6 + use_cpu: true + + semantic_cache: + enabled: true + backend_type: "memory" + similarity_threshold: 0.8 + max_entries: 1000 + ttl_seconds: 3600 + eviction_policy: "fifo" + use_hnsw: true + hnsw_m: 16 + hnsw_ef_construction: 200 + embedding_model: "bert" + + tools: + enabled: false # Disabled - tools_db.json not included in KServe deployment + top_k: 3 + similarity_threshold: 0.2 + tools_db_path: "config/tools_db.json" + fallback_to_empty: true + + prompt_guard: + enabled: true + use_modernbert: true + model_id: "models/jailbreak_classifier_modernbert-base_model" + threshold: 0.7 + use_cpu: true + jailbreak_mapping_path: "models/jailbreak_classifier_modernbert-base_model/jailbreak_type_mapping.json" + + # vLLM Endpoints Configuration - Using KServe InferenceService with Istio + # IMPORTANT: Using stable ClusterIP service (not pod IP or headless service) + # - KServe creates headless service by default (no stable ClusterIP) + # - deploy.sh creates a stable ClusterIP service for consistent routing + # - Service ClusterIP remains stable even when predictor pods restart + # - Use HTTP (not HTTPS) on port 8080 - Istio handles mTLS + # Template variables: {{INFERENCESERVICE_NAME}}, {{PREDICTOR_SERVICE_IP}} + vllm_endpoints: + - name: "{{INFERENCESERVICE_NAME}}-endpoint" + address: "{{PREDICTOR_SERVICE_IP}}" # Stable service ClusterIP (auto-populated by deploy script) + port: 8080 # Container port (HTTP - Istio provides mTLS) + weight: 1 + + model_config: + # KServe InferenceService model configuration + # Template variable: {{MODEL_NAME}}, {{INFERENCESERVICE_NAME}} + "{{MODEL_NAME}}": + reasoning_family: "qwen3" # Adjust based on model family: qwen3, deepseek, gpt, gpt-oss + preferred_endpoints: ["{{INFERENCESERVICE_NAME}}-endpoint"] + pii_policy: + allow_by_default: true + pii_types_allowed: ["EMAIL_ADDRESS"] + + # Classifier configuration + classifier: + category_model: + model_id: "models/category_classifier_modernbert-base_model" + use_modernbert: true + threshold: 0.6 + use_cpu: true + category_mapping_path: "models/category_classifier_modernbert-base_model/category_mapping.json" + pii_model: + model_id: "models/pii_classifier_modernbert-base_presidio_token_model" + use_modernbert: true + threshold: 0.7 + use_cpu: true + pii_mapping_path: 
"models/pii_classifier_modernbert-base_presidio_token_model/pii_type_mapping.json" + + # Categories with model scoring + categories: + - name: business + system_prompt: "You are a senior business consultant and strategic advisor with expertise in corporate strategy, operations management, financial analysis, marketing, and organizational development. Provide practical, actionable business advice backed by proven methodologies and industry best practices." + model_scores: + - model: {{MODEL_NAME}} + score: 0.7 + use_reasoning: false + - name: law + system_prompt: "You are a knowledgeable legal expert with comprehensive understanding of legal principles, case law, statutory interpretation, and legal procedures across multiple jurisdictions." + model_scores: + - model: granite32-8b + score: 0.4 + use_reasoning: false + - name: psychology + system_prompt: "You are a psychology expert with deep knowledge of cognitive processes, behavioral patterns, mental health, developmental psychology, social psychology, and therapeutic approaches." + semantic_cache_enabled: true + semantic_cache_similarity_threshold: 0.92 + model_scores: + - model: granite32-8b + score: 0.6 + use_reasoning: false + - name: biology + system_prompt: "You are a biology expert with comprehensive knowledge spanning molecular biology, genetics, cell biology, ecology, evolution, anatomy, physiology, and biotechnology." + model_scores: + - model: granite32-8b + score: 0.9 + use_reasoning: false + - name: chemistry + system_prompt: "You are a chemistry expert specializing in chemical reactions, molecular structures, and laboratory techniques. Provide detailed, step-by-step explanations." + model_scores: + - model: granite32-8b + score: 0.6 + use_reasoning: true + - name: history + system_prompt: "You are a historian with expertise across different time periods and cultures. Provide accurate historical context and analysis." + model_scores: + - model: {{MODEL_NAME}} + score: 0.7 + use_reasoning: false + - name: other + system_prompt: "You are a helpful and knowledgeable assistant. Provide accurate, helpful responses across a wide range of topics." + semantic_cache_enabled: true + semantic_cache_similarity_threshold: 0.75 + model_scores: + - model: {{MODEL_NAME}} + score: 0.7 + use_reasoning: false + - name: health + system_prompt: "You are a health and medical information expert with knowledge of anatomy, physiology, diseases, treatments, preventive care, nutrition, and wellness." + semantic_cache_enabled: true + semantic_cache_similarity_threshold: 0.95 + model_scores: + - model: granite32-8b + score: 0.5 + use_reasoning: false + - name: economics + system_prompt: "You are an economics expert with deep understanding of microeconomics, macroeconomics, econometrics, financial markets, monetary policy, fiscal policy, international trade, and economic theory." + model_scores: + - model: granite32-8b + score: 1.0 + use_reasoning: false + - name: math + system_prompt: "You are a mathematics expert. Provide step-by-step solutions, show your work clearly, and explain mathematical concepts in an understandable way." + model_scores: + - model: granite32-8b + score: 1.0 + use_reasoning: true + - name: physics + system_prompt: "You are a physics expert with deep understanding of physical laws and phenomena. Provide clear explanations with mathematical derivations when appropriate." 
+ model_scores: + - model: granite32-8b + score: 0.7 + use_reasoning: true + - name: computer science + system_prompt: "You are a computer science expert with knowledge of algorithms, data structures, programming languages, and software engineering. Provide clear, practical solutions with code examples when helpful." + model_scores: + - model: granite32-8b + score: 0.6 + use_reasoning: false + - name: philosophy + system_prompt: "You are a philosophy expert with comprehensive knowledge of philosophical traditions, ethical theories, logic, metaphysics, epistemology, political philosophy, and the history of philosophical thought." + model_scores: + - model: granite32-8b + score: 0.5 + use_reasoning: false + - name: engineering + system_prompt: "You are an engineering expert with knowledge across multiple engineering disciplines including mechanical, electrical, civil, chemical, software, and systems engineering." + model_scores: + - model: {{MODEL_NAME}} + score: 0.7 + use_reasoning: false + + default_model: {{MODEL_NAME}} + + # Reasoning family configurations + reasoning_families: + deepseek: + type: "chat_template_kwargs" + parameter: "thinking" + qwen3: + type: "chat_template_kwargs" + parameter: "enable_thinking" + gpt-oss: + type: "reasoning_effort" + parameter: "reasoning_effort" + gpt: + type: "reasoning_effort" + parameter: "reasoning_effort" + + default_reasoning_effort: high + + # API Configuration + api: + batch_classification: + max_batch_size: 100 + concurrency_threshold: 5 + max_concurrency: 8 + metrics: + enabled: true + detailed_goroutine_tracking: true + high_resolution_timing: false + sample_rate: 1.0 + duration_buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30] + size_buckets: [1, 2, 5, 10, 20, 50, 100, 200] + + # Embedding Models Configuration (Optional) + # These are SEPARATE from the bert_model above and are used for the /v1/embeddings API endpoint. + # The bert_model (configured above) is used for semantic caching and tools similarity. + # + # To enable the embeddings API with Qwen3/Gemma models: + # 1. Uncomment the section below + # 2. Update the deployment init container to download these models + # 3. 
Note: These models are large (~600MB each) and not required for routing functionality + # + # embedding_models: + # qwen3_model_path: "models/Qwen3-Embedding-0.6B" + # gemma_model_path: "models/embeddinggemma-300m" + # use_cpu: true + + # Observability Configuration + observability: + tracing: + enabled: false + provider: "opentelemetry" + exporter: + type: "stdout" + endpoint: "localhost:4317" + insecure: true + sampling: + type: "always_on" + rate: 1.0 + resource: + service_name: "vllm-semantic-router" + service_version: "v0.1.0" + deployment_environment: "production" diff --git a/deploy/kserve/deploy.sh b/deploy/kserve/deploy.sh new file mode 100755 index 00000000..61f35562 --- /dev/null +++ b/deploy/kserve/deploy.sh @@ -0,0 +1,414 @@ +#!/bin/bash +# Semantic Router KServe Deployment Helper Script +# This script simplifies deploying the semantic router to work with OpenShift AI KServe InferenceServices +# It handles variable substitution, validation, and deployment + +set -e + +# Colors for output +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +BLUE='\033[0;34m' +NC='\033[0m' # No Color + +# Script directory +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +# Default values +NAMESPACE="" +INFERENCESERVICE_NAME="" +MODEL_NAME="" +STORAGE_CLASS="" +MODELS_PVC_SIZE="10Gi" +CACHE_PVC_SIZE="5Gi" +# Embedding model for semantic caching and tools similarity +# Common options from sentence-transformers: +# - all-MiniLM-L12-v2 (default, balanced speed/quality) +# - all-mpnet-base-v2 (higher quality, slower) +# - all-MiniLM-L6-v2 (faster, lower quality) +# - paraphrase-multilingual-MiniLM-L12-v2 (multilingual) +EMBEDDING_MODEL="all-MiniLM-L12-v2" +DRY_RUN=false +SKIP_VALIDATION=false + +# Usage function +usage() { + cat << EOF +Usage: $0 [OPTIONS] + +Deploy vLLM Semantic Router for OpenShift AI KServe InferenceServices + +Required Options: + -n, --namespace NAMESPACE OpenShift namespace to deploy to + -i, --inferenceservice NAME Name of the KServe InferenceService + -m, --model MODEL_NAME Model name as reported by the InferenceService + +Optional: + -s, --storage-class CLASS StorageClass for PVCs (default: cluster default) + --models-pvc-size SIZE Size for models PVC (default: 10Gi) + --cache-pvc-size SIZE Size for cache PVC (default: 5Gi) + --embedding-model MODEL BERT embedding model (default: all-MiniLM-L12-v2) + --dry-run Generate manifests without applying + --skip-validation Skip pre-deployment validation + -h, --help Show this help message + +Examples: + # Deploy to namespace 'semantic' with granite32-8b model + $0 -n semantic -i granite32-8b -m granite32-8b + + # Deploy with custom storage class and embedding model + $0 -n myproject -i llama3-70b -m llama3-70b -s gp3-csi --embedding-model all-mpnet-base-v2 + + # Dry run to see what will be deployed + $0 -n semantic -i granite32-8b -m granite32-8b --dry-run + +Prerequisites: + - OpenShift CLI (oc) installed and logged in + - OpenShift AI (RHOAI) with KServe installed + - InferenceService already deployed + - Cluster admin or namespace admin permissions + +For more information, see README.md +EOF + exit 1 +} + +# Parse arguments +while [[ $# -gt 0 ]]; do + case $1 in + -n|--namespace) + NAMESPACE="$2" + shift 2 + ;; + -i|--inferenceservice) + INFERENCESERVICE_NAME="$2" + shift 2 + ;; + -m|--model) + MODEL_NAME="$2" + shift 2 + ;; + -s|--storage-class) + STORAGE_CLASS="$2" + shift 2 + ;; + --models-pvc-size) + MODELS_PVC_SIZE="$2" + shift 2 + ;; + --cache-pvc-size) + CACHE_PVC_SIZE="$2" + shift 2 + ;; + 
--embedding-model) + EMBEDDING_MODEL="$2" + shift 2 + ;; + --dry-run) + DRY_RUN=true + shift + ;; + --skip-validation) + SKIP_VALIDATION=true + shift + ;; + -h|--help) + usage + ;; + *) + echo -e "${RED}Unknown option: $1${NC}" + usage + ;; + esac +done + +# Validate required arguments +if [ -z "$NAMESPACE" ] || [ -z "$INFERENCESERVICE_NAME" ] || [ -z "$MODEL_NAME" ]; then + echo -e "${RED}Error: Missing required arguments${NC}" + usage +fi + +# Banner +echo "" +echo "==================================================" +echo " vLLM Semantic Router - KServe Deployment" +echo "==================================================" +echo "" + +# Display configuration +echo -e "${BLUE}Configuration:${NC}" +echo " Namespace: $NAMESPACE" +echo " InferenceService: $INFERENCESERVICE_NAME" +echo " Model Name: $MODEL_NAME" +echo " Embedding Model: $EMBEDDING_MODEL" +echo " Storage Class: ${STORAGE_CLASS:-}" +echo " Models PVC Size: $MODELS_PVC_SIZE" +echo " Cache PVC Size: $CACHE_PVC_SIZE" +echo " Dry Run: $DRY_RUN" +echo "" + +# Pre-deployment validation +if [ "$SKIP_VALIDATION" = false ]; then + echo -e "${BLUE}Step 1: Validating prerequisites...${NC}" + + # Check oc command + if ! command -v oc &> /dev/null; then + echo -e "${RED}✗ Error: 'oc' command not found. Please install OpenShift CLI.${NC}" + exit 1 + fi + echo -e "${GREEN}✓${NC} OpenShift CLI found" + + # Check if logged in + if ! oc whoami &> /dev/null; then + echo -e "${RED}✗ Error: Not logged in to OpenShift. Run 'oc login' first.${NC}" + exit 1 + fi + echo -e "${GREEN}✓${NC} Logged in as $(oc whoami)" + + # Check if namespace exists + if ! oc get namespace "$NAMESPACE" &> /dev/null; then + echo -e "${YELLOW}⚠ Warning: Namespace '$NAMESPACE' does not exist.${NC}" + read -p "Create namespace? (y/n) " -n 1 -r + echo + if [[ $REPLY =~ ^[Yy]$ ]]; then + oc create namespace "$NAMESPACE" + echo -e "${GREEN}✓${NC} Created namespace: $NAMESPACE" + else + echo -e "${RED}✗ Aborted${NC}" + exit 1 + fi + else + echo -e "${GREEN}✓${NC} Namespace exists: $NAMESPACE" + fi + + # Check if InferenceService exists + if ! oc get inferenceservice "$INFERENCESERVICE_NAME" -n "$NAMESPACE" &> /dev/null; then + echo -e "${RED}✗ Error: InferenceService '$INFERENCESERVICE_NAME' not found in namespace '$NAMESPACE'${NC}" + echo " Please deploy your InferenceService first." + exit 1 + fi + echo -e "${GREEN}✓${NC} InferenceService exists: $INFERENCESERVICE_NAME" + + # Check if InferenceService is ready + ISVC_READY=$(oc get inferenceservice "$INFERENCESERVICE_NAME" -n "$NAMESPACE" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}') + if [ "$ISVC_READY" != "True" ]; then + echo -e "${YELLOW}⚠ Warning: InferenceService '$INFERENCESERVICE_NAME' is not ready yet${NC}" + echo " Status: $(oc get inferenceservice "$INFERENCESERVICE_NAME" -n "$NAMESPACE" -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}')" + read -p "Continue anyway? (y/n) " -n 1 -r + echo + if [[ ! $REPLY =~ ^[Yy]$ ]]; then + exit 1 + fi + else + echo -e "${GREEN}✓${NC} InferenceService is ready" + fi + + # Get predictor service URL + PREDICTOR_URL=$(oc get inferenceservice "$INFERENCESERVICE_NAME" -n "$NAMESPACE" -o jsonpath='{.status.components.predictor.address.url}' 2>/dev/null || echo "") + if [ -n "$PREDICTOR_URL" ]; then + echo -e "${GREEN}✓${NC} Predictor URL: $PREDICTOR_URL" + fi + + # Create stable ClusterIP service for predictor (KServe creates headless service by default) + echo "Creating stable ClusterIP service for predictor..." 
+
+    # Use template file for stable service
+    if [ -f "$SCRIPT_DIR/service-predictor-stable.yaml" ]; then
+        # Substitute placeholders inline here: substitute_vars and TEMP_DIR are
+        # only defined later in this script (Step 2), so they cannot be used yet
+        STABLE_SVC_TMP=$(mktemp)
+        sed -e "s/{{NAMESPACE}}/$NAMESPACE/g" \
+            -e "s/{{INFERENCESERVICE_NAME}}/$INFERENCESERVICE_NAME/g" \
+            "$SCRIPT_DIR/service-predictor-stable.yaml" > "$STABLE_SVC_TMP"
+        oc apply -f "$STABLE_SVC_TMP" -n "$NAMESPACE" > /dev/null 2>&1
+        rm -f "$STABLE_SVC_TMP"
+    else
+        # Fallback to inline creation if template not found
+        cat <<EOF | oc apply -f - -n "$NAMESPACE" > /dev/null 2>&1
+apiVersion: v1
+kind: Service
+metadata:
+  name: ${INFERENCESERVICE_NAME}-predictor-stable
+  labels:
+    app: ${INFERENCESERVICE_NAME}
+    component: predictor-stable
+    managed-by: semantic-router-deploy
+  annotations:
+    description: "Stable ClusterIP service for semantic router (KServe headless service doesn't provide stable IP)"
+spec:
+  type: ClusterIP
+  selector:
+    serving.kserve.io/inferenceservice: ${INFERENCESERVICE_NAME}
+  ports:
+    - name: http
+      port: 8080
+      targetPort: 8080
+      protocol: TCP
+EOF
+    fi
+
+    # Get the stable ClusterIP
+    PREDICTOR_SERVICE_IP=$(oc get svc "${INFERENCESERVICE_NAME}-predictor-stable" -n "$NAMESPACE" -o jsonpath='{.spec.clusterIP}' 2>/dev/null || echo "")
+    if [ -z "$PREDICTOR_SERVICE_IP" ]; then
+        echo -e "${RED}✗ Error: Could not get predictor service ClusterIP${NC}"
+        echo "  The stable service was not created properly."
+        exit 1
+    fi
+    echo -e "${GREEN}✓${NC} Predictor service ClusterIP: $PREDICTOR_SERVICE_IP (stable across pod restarts)"
+
+    echo ""
+fi
+
+# Generate manifests
+echo -e "${BLUE}Step 2: Generating manifests...${NC}"
+
+TEMP_DIR=$(mktemp -d)
+trap 'rm -rf "$TEMP_DIR"' EXIT
+
+# Function to substitute variables in a file
+substitute_vars() {
+    local input_file="$1"
+    local output_file="$2"
+
+    sed -e "s/{{NAMESPACE}}/$NAMESPACE/g" \
+        -e "s/{{INFERENCESERVICE_NAME}}/$INFERENCESERVICE_NAME/g" \
+        -e "s/{{MODEL_NAME}}/$MODEL_NAME/g" \
+        -e "s|{{EMBEDDING_MODEL}}|$EMBEDDING_MODEL|g" \
+        -e "s/{{PREDICTOR_SERVICE_IP}}/${PREDICTOR_SERVICE_IP:-10.0.0.1}/g" \
+        -e "s/{{MODELS_PVC_SIZE}}/$MODELS_PVC_SIZE/g" \
+        -e "s/{{CACHE_PVC_SIZE}}/$CACHE_PVC_SIZE/g" \
+        "$input_file" > "$output_file"
+
+    # Handle storage class (optional)
+    if [ -n "$STORAGE_CLASS" ]; then
+        sed -i.bak "s/# storageClassName:.*/storageClassName: $STORAGE_CLASS/g" "$output_file"
+        rm -f "${output_file}.bak"
+    fi
+}
+
+# Process each manifest file
+for file in serviceaccount.yaml pvc.yaml configmap-router-config.yaml configmap-envoy-config.yaml peerauthentication.yaml deployment.yaml service.yaml route.yaml; do
+    if [ -f "$SCRIPT_DIR/$file" ]; then
+        substitute_vars "$SCRIPT_DIR/$file" "$TEMP_DIR/$file"
+        echo -e "${GREEN}✓${NC} Generated: $file"
+    else
+        echo -e "${YELLOW}⚠ Skipping missing file: $file${NC}"
+    fi
+done
+
+echo ""
+
+# Dry run - just show what would be deployed
+if [ "$DRY_RUN" = true ]; then
+    echo -e "${BLUE}Dry run mode - Generated manifests:${NC}"
+    echo ""
+    for file in "$TEMP_DIR"/*.yaml; do
+        echo "--- $(basename "$file") ---"
+        cat "$file"
+        echo ""
+    done
+
+    echo -e "${YELLOW}Dry run complete. No resources were created.${NC}"
+    echo "To deploy for real, run without --dry-run flag."
+ exit 0 +fi + +# Deploy manifests +echo -e "${BLUE}Step 3: Deploying to OpenShift...${NC}" + +oc apply -f "$TEMP_DIR/serviceaccount.yaml" -n "$NAMESPACE" +oc apply -f "$TEMP_DIR/pvc.yaml" -n "$NAMESPACE" +oc apply -f "$TEMP_DIR/configmap-router-config.yaml" -n "$NAMESPACE" +oc apply -f "$TEMP_DIR/configmap-envoy-config.yaml" -n "$NAMESPACE" +oc apply -f "$TEMP_DIR/peerauthentication.yaml" -n "$NAMESPACE" +oc apply -f "$TEMP_DIR/deployment.yaml" -n "$NAMESPACE" +oc apply -f "$TEMP_DIR/service.yaml" -n "$NAMESPACE" +oc apply -f "$TEMP_DIR/route.yaml" -n "$NAMESPACE" + +echo -e "${GREEN}✓${NC} Resources deployed successfully" +echo "" + +# Wait for deployment +echo -e "${BLUE}Step 4: Waiting for deployment to be ready...${NC}" +echo "This may take a few minutes while models are downloaded..." +echo "" + +# Monitor pod status +for i in {1..60}; do + POD_STATUS=$(oc get pods -l app=semantic-router -n "$NAMESPACE" -o jsonpath='{.items[0].status.phase}' 2>/dev/null || echo "") + POD_NAME=$(oc get pods -l app=semantic-router -n "$NAMESPACE" -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "") + + if [ "$POD_STATUS" = "Running" ]; then + READY=$(oc get pods -l app=semantic-router -n "$NAMESPACE" -o jsonpath='{.items[0].status.containerStatuses[*].ready}' 2>/dev/null || echo "") + if [[ "$READY" == *"true true"* ]]; then + echo -e "${GREEN}✓${NC} Pod is ready: $POD_NAME" + break + fi + fi + + # Show init container progress + INIT_STATUS=$(oc get pods -l app=semantic-router -n "$NAMESPACE" -o jsonpath='{.items[0].status.initContainerStatuses[0].state.running}' 2>/dev/null || echo "") + if [ -n "$INIT_STATUS" ]; then + echo -ne "\r Initializing... (downloading models - this takes 2-3 minutes)" + else + echo -ne "\r Waiting for pod... ($i/60)" + fi + + sleep 5 +done + +echo "" + +# Check final status +if ! oc get pods -l app=semantic-router -n "$NAMESPACE" -o jsonpath='{.items[0].status.containerStatuses[*].ready}' 2>/dev/null | grep -q "true true"; then + echo -e "${YELLOW}⚠ Warning: Pod may not be fully ready yet${NC}" + echo " Check status with: oc get pods -l app=semantic-router -n $NAMESPACE" + echo " View logs with: oc logs -l app=semantic-router -c semantic-router -n $NAMESPACE" +fi + +echo "" + +# Get route URL +ROUTE_URL=$(oc get route semantic-router-kserve -n "$NAMESPACE" -o jsonpath='{.spec.host}' 2>/dev/null || echo "") +if [ -n "$ROUTE_URL" ]; then + echo -e "${GREEN}✓${NC} External URL: https://$ROUTE_URL" +else + echo -e "${YELLOW}⚠ Could not determine route URL${NC}" +fi + +echo "" +echo "==================================================" +echo " Deployment Complete!" +echo "==================================================" +echo "" +echo "Next steps:" +echo "" +echo "1. Test the deployment:" +echo " curl -k \"https://$ROUTE_URL/v1/models\"" +echo "" +echo "2. Try a chat completion:" +echo " curl -k \"https://$ROUTE_URL/v1/chat/completions\" \\" +echo " -H 'Content-Type: application/json' \\" +echo " -d '{\"model\": \"$MODEL_NAME\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}]}'" +echo "" +echo "3. Run validation tests:" +echo " NAMESPACE=$NAMESPACE MODEL_NAME=$MODEL_NAME $SCRIPT_DIR/test-semantic-routing.sh" +echo "" +echo "4. View logs:" +echo " oc logs -l app=semantic-router -c semantic-router -n $NAMESPACE -f" +echo "" +echo "5. 
Monitor metrics:" +echo " oc port-forward -n $NAMESPACE svc/semantic-router-kserve 9190:9190" +echo " curl http://localhost:9190/metrics" +echo "" + +# Offer to run tests +if [ "$SKIP_VALIDATION" = false ]; then + echo "" + read -p "Run validation tests now? (y/n) " -n 1 -r + echo + if [[ $REPLY =~ ^[Yy]$ ]]; then + echo "" + export NAMESPACE MODEL_NAME + bash "$SCRIPT_DIR/test-semantic-routing.sh" || true + fi +fi + +echo "" +echo "For more information, see: $SCRIPT_DIR/README.md" +echo "" diff --git a/deploy/kserve/deployment.yaml b/deploy/kserve/deployment.yaml new file mode 100644 index 00000000..a58a62f8 --- /dev/null +++ b/deploy/kserve/deployment.yaml @@ -0,0 +1,275 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: semantic-router-kserve + labels: + app: semantic-router + component: gateway + annotations: + opendatahub.io/dashboard: "true" +spec: + replicas: 1 + selector: + matchLabels: + app: semantic-router + component: gateway + template: + metadata: + labels: + app: semantic-router + component: gateway + app.kubernetes.io/name: semantic-router + app.kubernetes.io/component: gateway + app.kubernetes.io/part-of: vllm-semantic-router + annotations: + sidecar.istio.io/inject: "true" # Enable Istio injection for service mesh integration with KServe + proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }' # Ensure proxy is ready + spec: + serviceAccountName: semantic-router # Create ServiceAccount if RBAC required + # OpenShift security context - let OpenShift assign UID/GID + securityContext: + runAsNonRoot: true + seccompProfile: + type: RuntimeDefault + + # Init container to download models from HuggingFace + initContainers: + - name: model-downloader + image: python:3.11-slim + securityContext: + allowPrivilegeEscalation: false + capabilities: + drop: + - ALL + seccompProfile: + type: RuntimeDefault + command: ["/bin/bash", "-c"] + args: + - | + set -e + echo "Installing Hugging Face Hub..." + pip install --no-cache-dir --user huggingface_hub + + echo "Downloading models to persistent volume..." 
+ cd /app/models + + # Use Python API to download models + python3 << 'PYEOF' + import os + from huggingface_hub import snapshot_download + + models = [ + ("LLM-Semantic-Router/category_classifier_modernbert-base_model", "category_classifier_modernbert-base_model"), + ("LLM-Semantic-Router/pii_classifier_modernbert-base_model", "pii_classifier_modernbert-base_model"), + ("LLM-Semantic-Router/jailbreak_classifier_modernbert-base_model", "jailbreak_classifier_modernbert-base_model"), + ("LLM-Semantic-Router/pii_classifier_modernbert-base_presidio_token_model", "pii_classifier_modernbert-base_presidio_token_model"), + ("sentence-transformers/{{EMBEDDING_MODEL}}", "{{EMBEDDING_MODEL}}") + ] + + cache_dir = "/app/cache/hf" + base_dir = "/app/models" + + for repo_id, local_dir_name in models: + local_dir = os.path.join(base_dir, local_dir_name) + + # Check if model weights actually exist (not just the directory) + has_model = False + if os.path.exists(local_dir): + # Look specifically for model weight files + for root, dirs, files in os.walk(local_dir): + for f in files: + if f.endswith('.safetensors') or f.endswith('.bin') or f.startswith('pytorch_model.'): + has_model = True + print(f"Found model weights: {f}") + break + if has_model: + break + + # Clean up incomplete downloads + if os.path.exists(local_dir) and not has_model: + print(f"Removing incomplete download: {local_dir_name}") + import shutil + shutil.rmtree(local_dir, ignore_errors=True) + + if not has_model: + print(f"Downloading {repo_id}...") + snapshot_download( + repo_id=repo_id, + local_dir=local_dir, + cache_dir=cache_dir + ) + print(f"Downloaded {repo_id}") + else: + print(f"{local_dir_name} already exists, skipping...") + + print("All models downloaded successfully!") + PYEOF + + echo "Model download complete!" + ls -la /app/models/ + + echo "Setting proper permissions..." + find /app/models -type f -exec chmod 644 {} \; || true + find /app/models -type d -exec chmod 755 {} \; || true + + echo "Creating cache directories..." + mkdir -p /app/cache/hf /app/cache/transformers /app/cache/sentence_transformers /app/cache/xdg /app/cache/bert + chmod -R 777 /app/cache/ || true + + echo "Model download complete!" 
+ env: + - name: HF_HUB_CACHE + value: /app/cache/hf + - name: HF_HOME + value: /app/cache/hf + - name: TRANSFORMERS_CACHE + value: /app/cache/transformers + - name: PIP_CACHE_DIR + value: /tmp/pip_cache + - name: PYTHONUSERBASE + value: /tmp/python_user + - name: PATH + value: /tmp/python_user/bin:/usr/local/bin:/usr/bin:/bin + resources: + requests: + memory: "1Gi" + cpu: "500m" + limits: + memory: "2Gi" + cpu: "1" + volumeMounts: + - name: models-volume + mountPath: /app/models + - name: cache-volume + mountPath: /app/cache + + containers: + # Semantic Router container + - name: semantic-router + image: ghcr.io/vllm-project/semantic-router/extproc:latest + imagePullPolicy: Always + securityContext: + allowPrivilegeEscalation: false + capabilities: + drop: + - ALL + seccompProfile: + type: RuntimeDefault + ports: + - containerPort: 50051 + name: grpc + protocol: TCP + - containerPort: 9190 + name: metrics + protocol: TCP + - containerPort: 8080 + name: classify-api + protocol: TCP + env: + - name: LD_LIBRARY_PATH + value: "/app/lib" + - name: HF_HOME + value: "/app/cache/hf" + - name: TRANSFORMERS_CACHE + value: "/app/cache/transformers" + - name: SENTENCE_TRANSFORMERS_HOME + value: "/app/cache/sentence_transformers" + - name: XDG_CACHE_HOME + value: "/app/cache/xdg" + - name: HOME + value: "/tmp/home" + volumeMounts: + - name: config-volume + mountPath: /app/config + readOnly: true + - name: models-volume + mountPath: /app/models + - name: cache-volume + mountPath: /app/cache + livenessProbe: + tcpSocket: + port: 50051 + initialDelaySeconds: 60 + periodSeconds: 30 + timeoutSeconds: 10 + failureThreshold: 3 + readinessProbe: + tcpSocket: + port: 50051 + initialDelaySeconds: 90 + periodSeconds: 30 + timeoutSeconds: 10 + failureThreshold: 3 + resources: + requests: + memory: "4Gi" + cpu: "1500m" + limits: + memory: "8Gi" + cpu: "3" + + # Envoy proxy container - routes to KServe endpoints + - name: envoy-proxy + image: envoyproxy/envoy:v1.35.3 + ports: + - containerPort: 8801 + name: envoy-http + protocol: TCP + - containerPort: 19000 + name: envoy-admin + protocol: TCP + command: ["/usr/local/bin/envoy"] + args: + - "-c" + - "/etc/envoy/envoy.yaml" + - "--component-log-level" + - "ext_proc:info,router:info,http:info" + env: + - name: loglevel + value: "info" + volumeMounts: + - name: envoy-config-volume + mountPath: /etc/envoy + readOnly: true + securityContext: + allowPrivilegeEscalation: false + capabilities: + drop: + - ALL + seccompProfile: + type: RuntimeDefault + livenessProbe: + tcpSocket: + port: 8801 + initialDelaySeconds: 30 + periodSeconds: 30 + timeoutSeconds: 10 + failureThreshold: 3 + readinessProbe: + tcpSocket: + port: 8801 + initialDelaySeconds: 10 + periodSeconds: 15 + timeoutSeconds: 10 + failureThreshold: 3 + resources: + requests: + memory: "256Mi" + cpu: "250m" + limits: + memory: "512Mi" + cpu: "500m" + + volumes: + - name: config-volume + configMap: + name: semantic-router-kserve-config + - name: envoy-config-volume + configMap: + name: semantic-router-envoy-kserve-config + - name: models-volume + persistentVolumeClaim: + claimName: semantic-router-models + - name: cache-volume + persistentVolumeClaim: + claimName: semantic-router-cache diff --git a/deploy/kserve/example-multi-model-config.yaml b/deploy/kserve/example-multi-model-config.yaml new file mode 100644 index 00000000..4faea00f --- /dev/null +++ b/deploy/kserve/example-multi-model-config.yaml @@ -0,0 +1,294 @@ +# Example configuration for multiple KServe InferenceServices +# This shows how to configure the 
semantic router to route between multiple models +# based on query category and complexity + +apiVersion: v1 +kind: ConfigMap +metadata: + name: semantic-router-kserve-config + labels: + app: semantic-router + component: config +data: + config.yaml: | + bert_model: + model_id: models/all-MiniLM-L12-v2 + threshold: 0.6 + use_cpu: true + + semantic_cache: + enabled: true + backend_type: "memory" + similarity_threshold: 0.85 + max_entries: 5000 + ttl_seconds: 7200 + eviction_policy: "lru" + use_hnsw: true + hnsw_m: 16 + hnsw_ef_construction: 200 + embedding_model: "bert" + + tools: + enabled: true + top_k: 5 + similarity_threshold: 0.2 + tools_db_path: "config/tools_db.json" + fallback_to_empty: true + + prompt_guard: + enabled: true + use_modernbert: true + model_id: "models/jailbreak_classifier_modernbert-base_model" + threshold: 0.7 + use_cpu: true + jailbreak_mapping_path: "models/jailbreak_classifier_modernbert-base_model/jailbreak_type_mapping.json" + + # Multiple vLLM Endpoints - KServe InferenceServices + # Example: Small model for simple queries, large model for complex ones + # Replace with your actual namespace + vllm_endpoints: + # Small, fast model (e.g., Granite 3.2 8B) + - name: "granite32-8b-endpoint" + address: "granite32-8b-predictor..svc.cluster.local" + port: 80 + weight: 1 + + # Larger, more capable model (e.g., Granite 3.2 78B or Llama 3.1 70B) + # - name: "granite32-78b-endpoint" + # address: "granite32-78b-predictor..svc.cluster.local" + # port: 80 + # weight: 1 + + # Specialized coding model (e.g., CodeLlama or Granite Code) + # - name: "granite-code-endpoint" + # address: "granite-code-predictor..svc.cluster.local" + # port: 80 + # weight: 1 + + model_config: + # Small model - good for general queries, fast + "granite32-8b": + reasoning_family: "qwen3" + preferred_endpoints: ["granite32-8b-endpoint"] + pii_policy: + allow_by_default: true + pii_types_allowed: ["EMAIL_ADDRESS"] + + # Large model - better for complex reasoning + # "granite32-78b": + # reasoning_family: "qwen3" + # preferred_endpoints: ["granite32-78b-endpoint"] + # pii_policy: + # allow_by_default: true + # pii_types_allowed: ["EMAIL_ADDRESS"] + + # Code-specialized model + # "granite-code": + # reasoning_family: "qwen3" + # preferred_endpoints: ["granite-code-endpoint"] + # pii_policy: + # allow_by_default: true + + classifier: + category_model: + model_id: "models/category_classifier_modernbert-base_model" + use_modernbert: true + threshold: 0.6 + use_cpu: true + category_mapping_path: "models/category_classifier_modernbert-base_model/category_mapping.json" + pii_model: + model_id: "models/pii_classifier_modernbert-base_presidio_token_model" + use_modernbert: true + threshold: 0.7 + use_cpu: true + pii_mapping_path: "models/pii_classifier_modernbert-base_presidio_token_model/pii_type_mapping.json" + + # Category-based routing strategy + # Higher scores route to that model for the category + categories: + # Simple categories → small model + - name: business + system_prompt: "You are a senior business consultant and strategic advisor." + model_scores: + - model: granite32-8b + score: 0.8 + use_reasoning: false + # - model: granite32-78b + # score: 0.6 + # use_reasoning: false + + - name: other + system_prompt: "You are a helpful assistant." + semantic_cache_enabled: true + semantic_cache_similarity_threshold: 0.75 + model_scores: + - model: granite32-8b + score: 1.0 + use_reasoning: false + + # Complex reasoning categories → large model + - name: math + system_prompt: "You are a mathematics expert." 
+ model_scores: + - model: granite32-8b + score: 0.7 + use_reasoning: true + # - model: granite32-78b + # score: 1.0 + # use_reasoning: true + + - name: physics + system_prompt: "You are a physics expert." + model_scores: + - model: granite32-8b + score: 0.7 + use_reasoning: true + # - model: granite32-78b + # score: 0.9 + # use_reasoning: true + + # Coding → specialized code model + - name: computer science + system_prompt: "You are a computer science expert." + model_scores: + # - model: granite-code + # score: 1.0 + # use_reasoning: false + - model: granite32-8b + score: 0.8 + use_reasoning: false + # - model: granite32-78b + # score: 0.6 + # use_reasoning: false + + # Other categories + - name: law + system_prompt: "You are a knowledgeable legal expert." + model_scores: + - model: granite32-8b + score: 0.5 + use_reasoning: false + # - model: granite32-78b + # score: 0.9 + # use_reasoning: false + + - name: psychology + system_prompt: "You are a psychology expert." + semantic_cache_enabled: true + semantic_cache_similarity_threshold: 0.92 + model_scores: + - model: granite32-8b + score: 0.7 + use_reasoning: false + + - name: biology + system_prompt: "You are a biology expert." + model_scores: + - model: granite32-8b + score: 0.9 + use_reasoning: false + + - name: chemistry + system_prompt: "You are a chemistry expert." + model_scores: + - model: granite32-8b + score: 0.7 + use_reasoning: true + # - model: granite32-78b + # score: 0.9 + # use_reasoning: true + + - name: history + system_prompt: "You are a historian." + model_scores: + - model: granite32-8b + score: 0.8 + use_reasoning: false + + - name: health + system_prompt: "You are a health and medical information expert." + semantic_cache_enabled: true + semantic_cache_similarity_threshold: 0.95 + model_scores: + - model: granite32-8b + score: 0.6 + use_reasoning: false + # - model: granite32-78b + # score: 0.8 + # use_reasoning: false + + - name: economics + system_prompt: "You are an economics expert." + model_scores: + - model: granite32-8b + score: 0.9 + use_reasoning: false + + - name: philosophy + system_prompt: "You are a philosophy expert." + model_scores: + - model: granite32-8b + score: 0.6 + use_reasoning: false + # - model: granite32-78b + # score: 0.8 + # use_reasoning: false + + - name: engineering + system_prompt: "You are an engineering expert." 
+ model_scores: + - model: granite32-8b + score: 0.8 + use_reasoning: false + + default_model: granite32-8b + + reasoning_families: + deepseek: + type: "chat_template_kwargs" + parameter: "thinking" + qwen3: + type: "chat_template_kwargs" + parameter: "enable_thinking" + gpt-oss: + type: "reasoning_effort" + parameter: "reasoning_effort" + gpt: + type: "reasoning_effort" + parameter: "reasoning_effort" + + default_reasoning_effort: high + + api: + batch_classification: + max_batch_size: 100 + concurrency_threshold: 5 + max_concurrency: 8 + metrics: + enabled: true + detailed_goroutine_tracking: true + high_resolution_timing: false + sample_rate: 1.0 + duration_buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30] + size_buckets: [1, 2, 5, 10, 20, 50, 100, 200] + + embedding_models: + qwen3_model_path: "models/Qwen3-Embedding-0.6B" + gemma_model_path: "models/embeddinggemma-300m" + use_cpu: true + + observability: + tracing: + enabled: false + provider: "opentelemetry" + exporter: + type: "stdout" + endpoint: "localhost:4317" + insecure: true + sampling: + type: "always_on" + rate: 1.0 + resource: + service_name: "vllm-semantic-router" + service_version: "v0.1.0" + deployment_environment: "production" diff --git a/deploy/kserve/inference-examples/README.md b/deploy/kserve/inference-examples/README.md new file mode 100644 index 00000000..a38e6398 --- /dev/null +++ b/deploy/kserve/inference-examples/README.md @@ -0,0 +1,23 @@ +# KServe InferenceService Examples + +This directory contains example KServe resource configurations for deploying vLLM models on OpenShift AI. + +## Files + +- `servingruntime-granite32-8b.yaml` - ServingRuntime configuration for vLLM with Granite 3.2 8B +- `inferenceservice-granite32-8b.yaml` - InferenceService to deploy the Granite 3.2 8B model + +## Usage + +```bash +# Deploy the ServingRuntime +oc apply -f servingruntime-granite32-8b.yaml + +# Deploy the InferenceService +oc apply -f inferenceservice-granite32-8b.yaml + +# Get the internal service URL for use in semantic router config +oc get inferenceservice granite32-8b -o jsonpath='{.status.components.predictor.address.url}' +``` + +These examples can be customized for your specific models and resource requirements. 
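+
+Once the InferenceService reports `Ready`, a quick sanity check is to port-forward a predictor pod and query its OpenAI-compatible models endpoint. This is a sketch under two assumptions: the predictor pods carry the standard `serving.kserve.io/inferenceservice` label (the same selector the stable service in this directory uses), and the vLLM container listens on port 8080 as configured in the ServingRuntime example:
+
+```bash
+# Find a predictor pod for the granite32-8b InferenceService
+POD=$(oc get pods -l serving.kserve.io/inferenceservice=granite32-8b -o jsonpath='{.items[0].metadata.name}')
+
+# Forward the vLLM port locally, then list the served models
+oc port-forward "$POD" 8080:8080 &
+curl http://localhost:8080/v1/models
+```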
diff --git a/deploy/kserve/inference-examples/inferenceservice-granite32-8b.yaml b/deploy/kserve/inference-examples/inferenceservice-granite32-8b.yaml new file mode 100644 index 00000000..dcd0c102 --- /dev/null +++ b/deploy/kserve/inference-examples/inferenceservice-granite32-8b.yaml @@ -0,0 +1,32 @@ +apiVersion: serving.kserve.io/v1beta1 +kind: InferenceService +metadata: + annotations: + openshift.io/display-name: granite3.2-8b + serving.knative.openshift.io/enablePassthrough: "true" + sidecar.istio.io/inject: "true" + sidecar.istio.io/rewriteAppHTTPProbers: "true" + serving.kserve.io/deploymentMode: RawDeployment + labels: + opendatahub.io/dashboard: "true" + name: granite32-8b +spec: + predictor: + containerConcurrency: 1 + maxReplicas: 1 + minReplicas: 1 + model: + modelFormat: + name: vLLM + name: "" + resources: + limits: + cpu: "2" + memory: 16Gi + nvidia.com/gpu: "1" + requests: + cpu: "2" + memory: 8Gi + nvidia.com/gpu: "1" + runtime: granite32-8b + storageUri: oci://quay.io/redhat-ai-services/modelcar-catalog:granite-3.2-8b-instruct diff --git a/deploy/kserve/inference-examples/servingruntime-granite32-8b.yaml b/deploy/kserve/inference-examples/servingruntime-granite32-8b.yaml new file mode 100644 index 00000000..aa54e4b8 --- /dev/null +++ b/deploy/kserve/inference-examples/servingruntime-granite32-8b.yaml @@ -0,0 +1,52 @@ +apiVersion: serving.kserve.io/v1alpha1 +kind: ServingRuntime +metadata: + annotations: + opendatahub.io/apiProtocol: REST + opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]' + opendatahub.io/template-display-name: vLLM ServingRuntime for KServe + opendatahub.io/template-name: vllm-runtime + openshift.io/display-name: granite32-8b + labels: + opendatahub.io/dashboard: "true" + name: granite32-8b +spec: + annotations: + prometheus.io/path: /metrics + prometheus.io/port: "8080" + containers: + - args: + - --port=8080 + - --model=/mnt/models + - --served-model-name={{.Name}} + - --enable-auto-tool-choice + - --tool-call-parser + - granite + - --chat-template + - /app/data/template/tool_chat_template_granite.jinja + - --max-model-len + - "120000" + command: + - python + - -m + - vllm.entrypoints.openai.api_server + env: + - name: HF_HOME + value: /tmp/hf_home + image: quay.io/modh/vllm@sha256:4f550996130e7d16cacb24ca9a2865e7cf51eddaab014ceaf31a1ea6ef86d4ec + name: kserve-container + ports: + - containerPort: 8080 + protocol: TCP + volumeMounts: + - mountPath: /dev/shm + name: shm + multiModel: false + supportedModelFormats: + - autoSelect: true + name: vLLM + volumes: + - emptyDir: + medium: Memory + sizeLimit: 2Gi + name: shm diff --git a/deploy/kserve/kustomization.yaml b/deploy/kserve/kustomization.yaml new file mode 100644 index 00000000..79897e56 --- /dev/null +++ b/deploy/kserve/kustomization.yaml @@ -0,0 +1,23 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +# Set your namespace here or use: oc apply -k . 
-n +# namespace: your-namespace + +resources: + - serviceaccount.yaml + - pvc.yaml + - configmap-router-config.yaml + - configmap-envoy-config.yaml + - peerauthentication.yaml + - deployment.yaml + - service.yaml + - route.yaml + +commonLabels: + app.kubernetes.io/name: semantic-router + app.kubernetes.io/component: gateway + app.kubernetes.io/part-of: vllm-semantic-router + +# Optional: Add namespace creation if needed +# - namespace.yaml diff --git a/deploy/kserve/peerauthentication.yaml b/deploy/kserve/peerauthentication.yaml new file mode 100644 index 00000000..209f21a0 --- /dev/null +++ b/deploy/kserve/peerauthentication.yaml @@ -0,0 +1,15 @@ +apiVersion: security.istio.io/v1beta1 +kind: PeerAuthentication +metadata: + name: semantic-router-kserve-permissive + namespace: {{NAMESPACE}} + labels: + app: semantic-router + component: gateway +spec: + selector: + matchLabels: + app: semantic-router + component: gateway + mtls: + mode: PERMISSIVE # Accept both mTLS and plain HTTP diff --git a/deploy/kserve/pvc.yaml b/deploy/kserve/pvc.yaml new file mode 100644 index 00000000..04a04612 --- /dev/null +++ b/deploy/kserve/pvc.yaml @@ -0,0 +1,33 @@ +--- +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: semantic-router-models + labels: + app: semantic-router + component: storage +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: {{MODELS_PVC_SIZE}} # Adjust based on model size requirements + # storageClassName: gp3-csi # Uncomment and set to your storage class if needed (or use --storage-class flag with deploy.sh) + volumeMode: Filesystem + +--- +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: semantic-router-cache + labels: + app: semantic-router + component: storage +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: {{CACHE_PVC_SIZE}} # Cache storage - adjust as needed + # storageClassName: gp3-csi # Uncomment and set to your storage class if needed (or use --storage-class flag with deploy.sh) + volumeMode: Filesystem diff --git a/deploy/kserve/route.yaml b/deploy/kserve/route.yaml new file mode 100644 index 00000000..4d3fd730 --- /dev/null +++ b/deploy/kserve/route.yaml @@ -0,0 +1,21 @@ +apiVersion: route.openshift.io/v1 +kind: Route +metadata: + name: semantic-router-kserve + labels: + app: semantic-router + component: gateway + annotations: + haproxy.router.openshift.io/timeout: "300s" + haproxy.router.openshift.io/balance: "roundrobin" +spec: + to: + kind: Service + name: semantic-router-kserve + weight: 100 + port: + targetPort: envoy-http + tls: + termination: edge + insecureEdgeTerminationPolicy: Redirect + wildcardPolicy: None diff --git a/deploy/kserve/service-predictor-stable.yaml b/deploy/kserve/service-predictor-stable.yaml new file mode 100644 index 00000000..76cda595 --- /dev/null +++ b/deploy/kserve/service-predictor-stable.yaml @@ -0,0 +1,23 @@ +# yamllint disable rule:line-length rule:syntax-check +# This is a template file with {{VARIABLE}} placeholders - processed by deploy.sh +apiVersion: v1 +kind: Service +metadata: + name: "{{INFERENCESERVICE_NAME}}-predictor-stable" + namespace: "{{NAMESPACE}}" + labels: + app: "{{INFERENCESERVICE_NAME}}" + component: predictor-stable + annotations: + description: "Stable ClusterIP service for semantic router to use (headless service doesn't provide ClusterIP)" +spec: + type: ClusterIP + selector: + serving.kserve.io/inferenceservice: "{{INFERENCESERVICE_NAME}}" + ports: + - name: http + port: 8080 + targetPort: 8080 + protocol: TCP + sessionAffinity: None + 
diff --git a/deploy/kserve/service.yaml b/deploy/kserve/service.yaml
new file mode 100644
index 00000000..6656f099
--- /dev/null
+++ b/deploy/kserve/service.yaml
@@ -0,0 +1,42 @@
+apiVersion: v1
+kind: Service
+metadata:
+  name: semantic-router-kserve
+  labels:
+    app: semantic-router
+    component: gateway
+  annotations:
+    prometheus.io/scrape: "true"
+    prometheus.io/port: "9190"
+    prometheus.io/path: "/metrics"
+spec:
+  type: ClusterIP
+  selector:
+    app: semantic-router
+    component: gateway
+  ports:
+    - name: envoy-http
+      port: 80
+      targetPort: 8801
+      protocol: TCP
+    - name: envoy-http-direct
+      port: 8801
+      targetPort: 8801
+      protocol: TCP
+    - name: grpc
+      port: 50051
+      targetPort: 50051
+      protocol: TCP
+    - name: metrics
+      port: 9190
+      targetPort: 9190
+      protocol: TCP
+    - name: classify-api
+      port: 8080
+      targetPort: 8080
+      protocol: TCP
+    - name: envoy-admin
+      port: 19000
+      targetPort: 19000
+      protocol: TCP
+  sessionAffinity: None
diff --git a/deploy/kserve/serviceaccount.yaml b/deploy/kserve/serviceaccount.yaml
new file mode 100644
index 00000000..10277c03
--- /dev/null
+++ b/deploy/kserve/serviceaccount.yaml
@@ -0,0 +1,6 @@
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: semantic-router
+  labels:
+    app: semantic-router
diff --git a/deploy/kserve/test-semantic-routing.sh b/deploy/kserve/test-semantic-routing.sh
new file mode 100755
index 00000000..fcd00543
--- /dev/null
+++ b/deploy/kserve/test-semantic-routing.sh
@@ -0,0 +1,254 @@
+#!/bin/bash
+# Simple test script to verify semantic routing is working
+# Tests different query categories and verifies routing decisions
+
+# Colors for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+BLUE='\033[0;34m'
+NC='\033[0m' # No Color
+
+# Detect kubectl vs oc
+if command -v oc &> /dev/null; then
+    CLI="oc"
+    DEFAULT_NAMESPACE=$(oc project -q 2>/dev/null || echo "default")
+elif command -v kubectl &> /dev/null; then
+    CLI="kubectl"
+    DEFAULT_NAMESPACE=$(kubectl config view --minify -o jsonpath='{.contexts[0].context.namespace}' 2>/dev/null || echo "default")
+else
+    echo -e "${RED}✗${NC} Neither kubectl nor oc found. Please install one of them."
+    exit 1
+fi
+
+# Configuration
+NAMESPACE="${NAMESPACE:-$DEFAULT_NAMESPACE}"
+ROUTE_NAME="semantic-router-kserve"
+# Model name to use for testing - get from configmap or override with MODEL_NAME env var
+MODEL_NAME="${MODEL_NAME:-$($CLI get configmap semantic-router-kserve-config -n "$NAMESPACE" -o jsonpath='{.data.config\.yaml}' 2>/dev/null | grep 'default_model:' | awk '{print $2}' || echo 'granite32-8b')}"
+
+# Get the route URL
+echo "Using CLI: $CLI"
+echo "Using namespace: $NAMESPACE"
+echo "Using model: $MODEL_NAME"
+echo "Getting semantic router URL..."
+
+if [ "$CLI" = "oc" ]; then
+    ROUTER_URL=$($CLI get route "$ROUTE_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.host}' 2>/dev/null)
+
+    if [ -z "$ROUTER_URL" ]; then
+        echo -e "${RED}✗${NC} Could not find route '$ROUTE_NAME' in namespace '$NAMESPACE'"
+        echo "Make sure the semantic router is deployed"
+        exit 1
+    fi
+
+    # Determine protocol
+    if $CLI get route "$ROUTE_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.tls.termination}' 2>/dev/null | grep -q .; then
+        ROUTER_URL="https://$ROUTER_URL"
+    else
+        ROUTER_URL="http://$ROUTER_URL"
+    fi
+else
+    # For kubectl, try to get the service
+    SVC_TYPE=$($CLI get svc "$ROUTE_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.type}' 2>/dev/null)
+
+    if [ "$SVC_TYPE" = "LoadBalancer" ]; then
+        ROUTER_URL=$($CLI get svc "$ROUTE_NAME" -n "$NAMESPACE" -o jsonpath='{.status.loadBalancer.ingress[0].hostname}' 2>/dev/null)
+        if [ -z "$ROUTER_URL" ]; then
+            ROUTER_URL=$($CLI get svc "$ROUTE_NAME" -n "$NAMESPACE" -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null)
+        fi
+        ROUTER_URL="http://$ROUTER_URL"
+    else
+        # Port-forward or ClusterIP - use localhost
+        echo -e "${YELLOW}Note:${NC} Service is ClusterIP type. You may need to port-forward:"
+        echo "  kubectl port-forward -n $NAMESPACE svc/$ROUTE_NAME 8801:8801"
+        ROUTER_URL="${ROUTER_URL:-http://localhost:8801}"
+    fi
+fi
+
+if [ -z "$ROUTER_URL" ] || [ "$ROUTER_URL" = "http://" ]; then
+    echo -e "${RED}✗${NC} Could not determine router URL"
+    echo "Set ROUTER_URL environment variable manually"
+    exit 1
+fi
+
+echo -e "${GREEN}✓${NC} Semantic router URL: $ROUTER_URL"
+echo ""
+
+# Function to test classification via API endpoint
+test_classification_api() {
+    local query="$1"
+    local expected_category="$2"
+
+    echo -e "${BLUE}Testing classification API:${NC} \"$query\""
+    echo -n "Expected category: $expected_category ... "
+
+    # Call classification endpoint (port 8080)
+    response=$(curl -s -k -X POST "$ROUTER_URL:8080/api/v1/classify" \
+        -H "Content-Type: application/json" \
+        -d "{\"text\": \"$query\"}" 2>/dev/null)
+
+    if [ -z "$response" ]; then
+        echo -e "${YELLOW}SKIP${NC} - Classification API not responding (may not be exposed)"
+        return 0
+    fi
+
+    # Extract category from response
+    category=$(echo "$response" | grep -o '"category":"[^"]*"' | cut -d'"' -f4)
+
+    if [ -z "$category" ]; then
+        echo -e "${YELLOW}SKIP${NC} - Classification API not available"
+        return 0
+    fi
+
+    if [ "$category" == "$expected_category" ]; then
+        echo -e "${GREEN}PASS${NC} - Category: $category"
+        return 0
+    else
+        echo -e "${YELLOW}PARTIAL${NC} - Got: $category (expected: $expected_category)"
+        return 0
+    fi
+}
+
+# Function to test chat completion
+test_chat_completion() {
+    local query="$1"
+    local model="${2:-$MODEL_NAME}"
+
+    echo -e "${BLUE}Testing chat completion:${NC} \"$query\""
+    echo -n "Sending request to model: $model ... "
+
+    response=$(curl -s -k -X POST "$ROUTER_URL/v1/chat/completions" \
+        -H "Content-Type: application/json" \
+        -d "{\"model\": \"$model\", \"messages\": [{\"role\": \"user\", \"content\": \"$query\"}], \"max_tokens\": 50}" 2>/dev/null)
+
+    if [ -z "$response" ]; then
+        echo -e "${RED}FAIL${NC} - No response"
+        return 1
+    fi
+
+    # Check for error in response
+    if echo "$response" | grep -q '"error"'; then
+        echo -e "${RED}FAIL${NC}"
+        echo "Error: $(echo "$response" | grep -o '"message":"[^"]*"' | cut -d'"' -f4)"
+        return 1
+    fi
+
+    # Check for completion
+    if echo "$response" | grep -q '"choices"'; then
+        echo -e "${GREEN}PASS${NC}"
+        # Extract first few words of response
+        content=$(echo "$response" | grep -o '"content":"[^"]*"' | head -1 | cut -d'"' -f4 | cut -c1-100)
+        echo "  Response preview: $content..."
+        return 0
+    else
+        echo -e "${RED}FAIL${NC} - Invalid response format"
+        return 1
+    fi
+}
+
+echo "=================================================="
+echo "Semantic Routing Validation Tests"
+echo "=================================================="
+echo ""
+
+# Test 1: Check /v1/models endpoint
+echo -e "${BLUE}Test 1:${NC} Checking /v1/models endpoint"
+models_response=$(curl -s -k "$ROUTER_URL/v1/models" 2>/dev/null)
+if echo "$models_response" | grep -q '"object":"list"'; then
+    echo -e "${GREEN}✓${NC} Models endpoint responding correctly"
+    echo "Available models: $(echo "$models_response" | grep -o '"id":"[^"]*"' | cut -d'"' -f4 | tr '\n' ', ' | sed 's/,$//')"
+else
+    echo -e "${RED}✗${NC} Models endpoint not responding correctly"
+    echo "Response: $models_response"
+fi
+echo ""
+
+# Test 2: Classification tests (optional - API may not be exposed)
+echo -e "${BLUE}Test 2:${NC} Testing category classification API (optional)"
+echo ""
+
+test_classification_api "What is the derivative of x squared?" "math"
+test_classification_api "Explain quantum entanglement in physics" "physics"
+test_classification_api "Write a function to reverse a string in Python" "computer science"
+
+echo ""
+
+# Test 3: End-to-end chat completion
+echo -e "${BLUE}Test 3:${NC} Testing end-to-end chat completion"
+echo ""
+
+test_chat_completion "What is 2+2? Answer briefly."
+test_chat_completion "Tell me a joke"
+
+echo ""
+
+# Test 4: PII detection (if enabled)
+echo -e "${BLUE}Test 4:${NC} Testing PII detection"
+echo ""
+
+echo -e "${BLUE}Testing:${NC} Query with PII (SSN)"
+response=$(curl -s -k -X POST "$ROUTER_URL/v1/chat/completions" \
+    -H "Content-Type: application/json" \
+    -d "{\"model\": \"$MODEL_NAME\", \"messages\": [{\"role\": \"user\", \"content\": \"My SSN is 123-45-6789\"}], \"max_tokens\": 50}" 2>/dev/null)
+
+if echo "$response" | grep -qi "pii\|blocked\|detected"; then
+    echo -e "${GREEN}✓${NC} PII detection working - request blocked or flagged"
+elif echo "$response" | grep -q '"error"'; then
+    echo -e "${GREEN}✓${NC} PII protection active - request rejected"
+    echo "  Message: $(echo "$response" | grep -o '"message":"[^"]*"' | cut -d'"' -f4)"
+else
+    echo -e "${YELLOW}⚠${NC} PII may have passed through (check if PII policy allows it)"
+fi
+
+echo ""
+
+# Test 5: Semantic caching
+echo -e "${BLUE}Test 5:${NC} Testing semantic caching"
+echo ""
+
+CACHE_QUERY="What is the capital of France?"
+
+echo "First request (cache miss expected)..."
+time1_start=$(date +%s%N)
+response1=$(curl -s -k -X POST "$ROUTER_URL/v1/chat/completions" \
+    -H "Content-Type: application/json" \
+    -d "{\"model\": \"$MODEL_NAME\", \"messages\": [{\"role\": \"user\", \"content\": \"$CACHE_QUERY\"}], \"max_tokens\": 20}" 2>/dev/null)
+time1_end=$(date +%s%N)
+time1=$(((time1_end - time1_start) / 1000000))
+
+sleep 1
+
+echo "Second request (cache hit expected)..."
+time2_start=$(date +%s%N)
+response2=$(curl -s -k -X POST "$ROUTER_URL/v1/chat/completions" \
+    -H "Content-Type: application/json" \
+    -d "{\"model\": \"$MODEL_NAME\", \"messages\": [{\"role\": \"user\", \"content\": \"$CACHE_QUERY\"}], \"max_tokens\": 20}" 2>/dev/null)
+time2_end=$(date +%s%N)
+time2=$(((time2_end - time2_start) / 1000000))
+
+echo "First request: ${time1}ms"
+echo "Second request: ${time2}ms"
+
+if [ "$time2" -lt "$time1" ]; then
+    speedup=$(((time1 - time2) * 100 / time1))
+    echo -e "${GREEN}✓${NC} Cache appears to be working (${speedup}% faster)"
+else
+    echo -e "${YELLOW}⚠${NC} Cache behavior unclear or not significant"
+fi
+
+echo ""
+echo "=================================================="
+echo "Validation Complete"
+echo "=================================================="
+echo ""
+echo "Semantic routing is operational!"
+echo ""
+echo "Next steps:"
+echo "  • Review the test results above"
+echo "  • Check logs: $CLI logs -n $NAMESPACE -l app=semantic-router -c semantic-router"
+echo "  • View metrics: $CLI port-forward -n $NAMESPACE svc/$ROUTE_NAME 9190:9190"
+echo "  • Test with your own queries: curl -k \"$ROUTER_URL/v1/chat/completions\" \\"
+echo "    -H 'Content-Type: application/json' \\"
+echo "    -d '{\"model\": \"$MODEL_NAME\", \"messages\": [{\"role\": \"user\", \"content\": \"Your query here\"}]}'"
+echo ""
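Test 2 in the script is optional because the classification API is usually not reachable through the edge Route, which only forwards the `envoy-http` port. A small sketch of exercising it through a port-forward instead, assuming the `classify-api` port (8080) defined in `service.yaml` and the same `/api/v1/classify` payload the script uses:

```bash
# Hedged example: reach the classification API directly via port-forward.
# Service name and port are taken from service.yaml; adjust the namespace to yours.
oc port-forward -n "$NAMESPACE" svc/semantic-router-kserve 8080:8080 &
PF_PID=$!
sleep 2

curl -s -X POST "http://localhost:8080/api/v1/classify" \
  -H "Content-Type: application/json" \
  -d '{"text": "What is the derivative of x squared?"}'

# Clean up the port-forward when finished.
kill $PF_PID
```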