# Integration with vLLM Semantic Router

- [Integration with vLLM Semantic Router](#integration-with-vllm-semantic-router)
  - [What is vLLM Semantic Router?](#what-is-vllm-semantic-router)
  - [What are the benefits of integration?](#what-are-the-benefits-of-integration)
  - [Prerequisites](#prerequisites)
  - [Step 1: Deploy the vLLM Production Stack using your Helm values](#step-1-deploy-the-vllm-production-stack-using-your-helm-values)
  - [Step 2: Deploy vLLM Semantic Router and point it at your vLLM router Service](#step-2-deploy-vllm-semantic-router-and-point-it-at-your-vllm-router-service)
  - [Step 3: Test the deployment](#step-3-test-the-deployment)
  - [Troubleshooting](#troubleshooting)

> This tutorial is adapted from the [vLLM production stack tutorials](https://github.com/vllm-project/production-stack/blob/main/tutorials/24-semantic-router-integration.md).

## What is vLLM Semantic Router?

The vLLM Semantic Router is an intelligent Mixture-of-Models (MoM) router that operates as an Envoy External Processor to semantically route OpenAI API–compatible requests to the most suitable backend model. Using BERT-based classification, it improves both quality and cost efficiency by matching requests (e.g., math, code, creative, general) to specialized models.

- **Auto-selection of models**: Routes math, creative writing, code, and general queries to the best-fit models.
- **Security & privacy**: PII detection, prompt guard, and safe routing for sensitive prompts.
- **Performance optimizations**: Semantic cache and better tool selection to cut latency and tokens.
- **Architecture**: Tight Envoy ExtProc integration; dual Go and Python implementations; production-ready and scalable.
- **Monitoring**: Grafana dashboards, Prometheus metrics, and tracing for full visibility.

Learn more: [vLLM Semantic Router](https://vllm-semantic-router.com/docs/intro)

## What are the benefits of integration?

The vLLM Production Stack provides several deployment options that spin up vLLM servers, direct traffic to different models, perform service discovery and fault tolerance through the Kubernetes API, and support round-robin, session-based, prefix-aware, KV-aware, and disaggregated-prefill routing with native LMCache support. The Semantic Router adds a system-intelligence layer that classifies each user request, selects the most suitable model from a pool, injects domain-specific system prompts, performs semantic caching, and enforces enterprise-grade security checks such as PII and jailbreak detection.

By combining the two systems, we obtain a unified inference stack: semantic routing ensures that each request is answered by the best-suited model, while Production Stack routing maximizes infrastructure and inference efficiency and exposes rich metrics.

---

This tutorial walks you through the following steps:

- Deploy a minimal vLLM Production Stack
- Deploy vLLM Semantic Router and point it to your vLLM router Service
- Test the endpoint via the Envoy AI Gateway

## Prerequisites

- kubectl
- Helm
- A Kubernetes cluster (kind, minikube, GKE, etc.)

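If you want to confirm the prerequisites are in place, the following quick checks can help; they only verify that the CLI tools are installed and can reach a cluster.

```bash
# Quick sanity checks for the prerequisites
kubectl version --client   # kubectl is installed
helm version                # Helm is installed
kubectl get nodes           # the cluster is reachable and has schedulable nodes
```
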
---

## Step 1: Deploy the vLLM Production Stack using your Helm values

Use your chart and the provided values file at `tutorials/assets/values-23-SR.yaml`.

```bash
helm repo add vllm-production-stack https://vllm-project.github.io/production-stack
helm install vllm-stack vllm-production-stack/vllm-stack -f ./tutorials/assets/values-23-SR.yaml
```

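The serving engine pods can take several minutes to pull the image and download the model. A simple way to watch progress (assuming the release above is installed into the default namespace) is:

```bash
# Watch the serving engine and router pods until they are Running and Ready
kubectl get pods -w

# Once a serving engine pod is running, you can follow the vLLM startup logs, e.g.:
# kubectl logs -f <serving-engine-pod-name>
```
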
For reference, the following is a sample values file:

```yaml
servingEngineSpec:
  runtimeClassName: ""
  strategy:
    type: Recreate
  modelSpec:
  - name: "qwen3"
    repository: "lmcache/vllm-openai"
    tag: "v0.3.7"
    modelURL: "Qwen/Qwen3-8B"
    pvcStorage: "50Gi"
    vllmConfig:
      # maxModelLen: 131072
      extraArgs: ["--served-model-name", "Qwen/Qwen3-8B", "qwen3"]

    replicaCount: 2

    requestCPU: 8
    requestMemory: "16Gi"
    requestGPU: 1

routerSpec:
  repository: lmcache/lmstack-router
  tag: "latest"
  resources:
    requests:
      cpu: "1"
      memory: "2G"
    limits:
      cpu: "1"
      memory: "2G"
  routingLogic: "roundrobin"
  sessionKey: "x-user-id"
```

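Note that the `--served-model-name` arguments register both `Qwen/Qwen3-8B` and `qwen3` as model names. Once the pods are ready, one hedged way to confirm what the stack serves is to port-forward the router Service (named and exposed as shown in the next command) and list the models:

```bash
# Optional: verify the served model names through the production stack router
# (assumes the Service is called vllm-router-service and listens on port 80)
kubectl port-forward svc/vllm-router-service 30080:80 &
sleep 2
curl -s http://localhost:30080/v1/models
kill %1   # stop the background port-forward
```
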
Identify the ClusterIP and port of your router Service created by the chart (name may vary):

```bash
kubectl get svc vllm-router-service
# Note the router service ClusterIP and port (e.g., 10.97.254.122:80)
```

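If you prefer to capture these values for the next step, a small helper like the following works (it assumes the Service name shown above):

```bash
# Record the router Service ClusterIP and port for use in Step 2
ROUTER_IP=$(kubectl get svc vllm-router-service -o jsonpath='{.spec.clusterIP}')
ROUTER_PORT=$(kubectl get svc vllm-router-service -o jsonpath='{.spec.ports[0].port}')
echo "vLLM router endpoint: ${ROUTER_IP}:${ROUTER_PORT}"
```
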
---

## Step 2: Deploy vLLM Semantic Router and point it at your vLLM router Service

Follow the official installation guide, [Install in Kubernetes](https://vllm-semantic-router.com/docs/installation/kubernetes), using **the updated config file described below**.

Remember to update the semantic router config so that your vLLM router Service is listed as an endpoint. Edit `deploy/kubernetes/config.yaml` and set `vllm_endpoints` as follows (replace the IP/port with your router Service ClusterIP/port from Step 1):

```yaml
vllm_endpoints:
  - name: "endpoint1"
    address: <YOUR ROUTER SERVICE CLUSTERIP>
    port: <YOUR ROUTER SERVICE PORT>
    weight: 1
```

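Before deploying the Semantic Router, it can be worth confirming that the endpoint you configured is reachable from inside the cluster. A minimal sketch, using a throwaway curl pod and the address/port noted in Step 1:

```bash
# Confirm the production stack router answers OpenAI-style requests in-cluster
# (replace the placeholders with your router Service ClusterIP and port from Step 1)
kubectl run curl-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://<YOUR ROUTER SERVICE CLUSTERIP>:<YOUR ROUTER SERVICE PORT>/v1/models
```
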
Minimal sequence (same as the guide):

```bash
# Deploy vLLM Semantic Router manifests
kubectl apply -k deploy/kubernetes/
kubectl wait --for=condition=Available deployment/semantic-router \
  -n vllm-semantic-router-system --timeout=600s

# Install Envoy Gateway
helm upgrade -i eg oci://docker.io/envoyproxy/gateway-helm \
  --version v0.0.0-latest \
  --namespace envoy-gateway-system \
  --create-namespace
kubectl wait --timeout=300s -n envoy-gateway-system \
  deployment/envoy-gateway --for=condition=Available

# Install Envoy AI Gateway
helm upgrade -i aieg oci://docker.io/envoyproxy/ai-gateway-helm \
  --version v0.0.0-latest \
  --namespace envoy-ai-gateway-system \
  --create-namespace
kubectl wait --timeout=300s -n envoy-ai-gateway-system \
  deployment/ai-gateway-controller --for=condition=Available

# Install Gateway API Inference Extension CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.1/manifests.yaml
kubectl get crd | grep inference
```

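At this point the Semantic Router and both gateway controllers should be up. A quick status check, using the namespaces from the commands above:

```bash
# Verify that the semantic router and gateway controllers are healthy
kubectl get pods -n vllm-semantic-router-system
kubectl get pods -n envoy-gateway-system
kubectl get pods -n envoy-ai-gateway-system

# Tail the semantic router logs if anything looks off
kubectl logs -n vllm-semantic-router-system deployment/semantic-router --tail=20
```
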
Apply AI Gateway configuration and create the inference pool per the guide:

```bash
# Apply AI Gateway configuration
kubectl apply -f deploy/kubernetes/ai-gateway/configuration

# Restart controllers to pick up new config
kubectl rollout restart -n envoy-gateway-system deployment/envoy-gateway
kubectl rollout restart -n envoy-ai-gateway-system deployment/ai-gateway-controller
kubectl wait --timeout=120s -n envoy-gateway-system deployment/envoy-gateway --for=condition=Available
kubectl wait --timeout=120s -n envoy-ai-gateway-system deployment/ai-gateway-controller --for=condition=Available

# Create inference pool
kubectl apply -f deploy/kubernetes/ai-gateway/inference-pool
sleep 30

# Verify inference pool
kubectl get inferencepool vllm-semantic-router -n vllm-semantic-router-system -o yaml
```

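Once the inference pool is created, the Gateway should be programmed and an Envoy proxy Service should appear. The resource names below follow the selectors used in Step 3, so treat this as a sketch and adjust if your names differ:

```bash
# Check that the Gateway was accepted and that Envoy created a proxy Service for it
kubectl get gateway vllm-semantic-router -n vllm-semantic-router-system
kubectl get svc -n envoy-gateway-system
```
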
---

## Step 3: Test the deployment

Port-forward to the Envoy service and send a test request, following the guide:

```bash
export GATEWAY_IP="localhost:8080"
export ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system \
  --selector=gateway.envoyproxy.io/owning-gateway-namespace=vllm-semantic-router-system,gateway.envoyproxy.io/owning-gateway-name=vllm-semantic-router \
  -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n envoy-gateway-system svc/$ENVOY_SERVICE 8080:80
```

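Note that `kubectl port-forward` blocks the terminal. Run it in a second terminal, or send it to the background if you want to keep working in the same shell:

```bash
# Run the port-forward in the background and stop it when you are done testing
kubectl port-forward -n envoy-gateway-system svc/$ENVOY_SERVICE 8080:80 &
PF_PID=$!
# ... send test requests ...
# kill $PF_PID
```
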
Send a chat completions request:

```bash
curl -i -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [
      {"role": "user", "content": "What is the derivative of f(x) = x^3 + 2x^2 - 5x + 7?"}
    ]
  }'
```

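The response follows the OpenAI chat completions schema, so you can also extract just the answer and the model that actually served the request. The example below assumes `jq` is installed:

```bash
# Send another request and show only the routed model and the answer
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [{"role": "user", "content": "Write a short haiku about Kubernetes."}]
  }' | jq '{model: .model, answer: .choices[0].message.content}'
```
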
---

## Troubleshooting

- If the gateway is not accessible, check the Gateway and the Envoy service per the guide.
- If the inference pool is not ready, `kubectl describe` the `InferencePool` and check the controller logs.
- If the semantic router is not responding, check its pod status and logs.
- If requests come back with error codes, check the Production Stack router logs (see the diagnostic commands below).
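
The following commands collect the most useful diagnostics for the checks above. Resource and namespace names match those used earlier in this tutorial; adjust them if yours differ:

```bash
# Gateway status and the Envoy proxy Service created for it
kubectl get gateway -n vllm-semantic-router-system
kubectl get svc -n envoy-gateway-system

# Inference pool status
kubectl describe inferencepool vllm-semantic-router -n vllm-semantic-router-system

# Controller, semantic router, and production stack router logs
kubectl logs -n envoy-ai-gateway-system deployment/ai-gateway-controller --tail=50
kubectl logs -n vllm-semantic-router-system deployment/semantic-router --tail=50
kubectl logs svc/vllm-router-service --tail=50   # production stack router (Service name from Step 1)
```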