# Integration with vLLM Semantic Router

- [Integration with vLLM Semantic Router](#integration-with-vllm-semantic-router)
  - [What is vLLM Semantic Router?](#what-is-vllm-semantic-router)
  - [What are the benefits of integration?](#what-are-the-benefits-of-integration)
  - [Prerequisites](#prerequisites)
  - [Step 1: Deploy the vLLM Production Stack using your Helm values](#step-1-deploy-the-vllm-production-stack-using-your-helm-values)
  - [Step 2: Deploy vLLM Semantic Router and point it at your vLLM router Service](#step-2-deploy-vllm-semantic-router-and-point-it-at-your-vllm-router-service)
  - [Step 3: Test the deployment](#step-3-test-the-deployment)
  - [Troubleshooting](#troubleshooting)

> This tutorial is adapted from the [vLLM Production Stack tutorials](https://github.com/vllm-project/production-stack/blob/main/tutorials/24-semantic-router-integration.md).

## What is vLLM Semantic Router?

The vLLM Semantic Router is an intelligent Mixture-of-Models (MoM) router that operates as an Envoy External Processor to semantically route OpenAI API–compatible requests to the most suitable backend model. Using BERT-based classification, it improves both quality and cost efficiency by matching requests (e.g., math, code, creative, general) to specialized models.

- **Auto-selection of models**: Routes math, creative writing, code, and general queries to the best-fit models.
- **Security & privacy**: PII detection, prompt guard, and safe routing for sensitive prompts.
- **Performance optimizations**: Semantic cache and better tool selection to cut latency and tokens.
- **Architecture**: Tight Envoy ExtProc integration; dual Go and Python implementations; production-ready and scalable.
- **Monitoring**: Grafana dashboards, Prometheus metrics, and tracing for full visibility.

Learn more: [vLLM Semantic Router](https://vllm-semantic-router.com/docs/intro)
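
To make the routing concrete, the semantic router's configuration maps request categories to candidate models. The sketch below is purely illustrative: the category names, scores, and the exact schema (`categories`, `model_scores`) are assumptions for this example, so consult the configuration reference in the docs linked above for the authoritative format.

```yaml
# Illustrative category-to-model mapping (schema and values are assumptions;
# consult the vLLM Semantic Router configuration reference for the real format)
categories:
  - name: math
    model_scores:
      - model: qwen3        # route math queries to a strong reasoning model
        score: 1.0
  - name: creative
    model_scores:
      - model: qwen3        # a single-model pool still works; scores rank candidates
        score: 0.7
```
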
## What are the benefits of integration?

The vLLM Production Stack provides several deployment options that spin up vLLM servers, direct traffic to different models, perform service discovery and fault tolerance through the Kubernetes API, and support round-robin, session-based, prefix-aware, KV-aware, and disaggregated-prefill routing with native LMCache support. The Semantic Router adds a system-intelligence layer that classifies each user request, selects the most suitable model from a pool, injects domain-specific system prompts, performs semantic caching, and enforces enterprise-grade security checks such as PII and jailbreak detection.

Combining the two systems yields a unified inference stack: semantic routing ensures that each request is answered by the best-suited model, while Production Stack routing maximizes infrastructure and inference efficiency and exposes rich metrics.

---

In this tutorial, you will:

- Deploy a minimal vLLM Production Stack
- Deploy vLLM Semantic Router and point it to your vLLM router Service
- Test the endpoint via the Envoy AI Gateway

## Prerequisites

- kubectl
- Helm
- A Kubernetes cluster (kind, minikube, GKE, etc.)

---
## Step 1: Deploy the vLLM Production Stack using your Helm values

Use your chart and the provided values file at `tutorials/assets/values-23-SR.yaml`.

```bash
helm repo add vllm-production-stack https://vllm-project.github.io/production-stack
helm install vllm-stack vllm-production-stack/vllm-stack -f ./tutorials/assets/values-23-SR.yaml
```
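
Before moving on, it helps to confirm that the serving engine and router pods come up. This check is a supplementary sketch, not part of the original sequence; pod names depend on your Helm release name.

```bash
# Watch until the serving-engine and router pods report Running/Ready
# (pod names vary with the Helm release name)
kubectl get pods
```
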

For reference, here is the sample values file:

```yaml
servingEngineSpec:
  runtimeClassName: ""
  strategy:
    type: Recreate
  modelSpec:
    - name: "qwen3"
      repository: "lmcache/vllm-openai"
      tag: "v0.3.7"
      modelURL: "Qwen/Qwen3-8B"
      pvcStorage: "50Gi"
      vllmConfig:
        # maxModelLen: 131072
        extraArgs: ["--served-model-name", "Qwen/Qwen3-8B", "qwen3"]

      replicaCount: 2

      requestCPU: 8
      requestMemory: "16Gi"
      requestGPU: 1

routerSpec:
  repository: lmcache/lmstack-router
  tag: "latest"
  resources:
    requests:
      cpu: "1"
      memory: "2G"
    limits:
      cpu: "1"
      memory: "2G"
  routingLogic: "roundrobin"
  sessionKey: "x-user-id"
```
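
Note that `extraArgs: ["--served-model-name", "Qwen/Qwen3-8B", "qwen3"]` registers both names as aliases for the same engine. Once the pods are ready, you can sanity-check what the router advertises; this is a supplementary sketch that assumes the `vllm-router-service` name used in the next command and an arbitrary local port.

```bash
# List the model names the router exposes (assumes the Service name below)
kubectl port-forward svc/vllm-router-service 30080:80 &
curl -s http://localhost:30080/v1/models
```
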

Identify the ClusterIP and port of your router Service created by the chart (name may vary):

```bash
kubectl get svc vllm-router-service
# Note the router service ClusterIP and port (e.g., 10.97.254.122:80)
```
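
If you prefer to capture these values for use in Step 2, a sketch like the following works with the Service name above:

```bash
# Capture the router Service ClusterIP and port for Step 2
export ROUTER_IP=$(kubectl get svc vllm-router-service -o jsonpath='{.spec.clusterIP}')
export ROUTER_PORT=$(kubectl get svc vllm-router-service -o jsonpath='{.spec.ports[0].port}')
echo "Router endpoint: $ROUTER_IP:$ROUTER_PORT"
```
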

---

## Step 2: Deploy vLLM Semantic Router and point it at your vLLM router Service

Follow the official [Install in Kubernetes](https://vllm-semantic-router.com/docs/installation/kubernetes) guide, **using the updated config file shown below**.

Remember to update the semantic router config to include your vLLM router Service as an endpoint. Edit `deploy/kubernetes/config.yaml` and set `vllm_endpoints` like this (replace the IP/port with your router Service ClusterIP/port from Step 1):

```yaml
vllm_endpoints:
  - name: "endpoint1"
    address: <YOUR ROUTER SERVICE CLUSTERIP>
    port: <YOUR ROUTER SERVICE PORT>
    weight: 1
```
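
For example, with the sample ClusterIP and port from Step 1, the entry would look like this:

```yaml
vllm_endpoints:
  - name: "endpoint1"
    address: "10.97.254.122"
    port: 80
    weight: 1
```
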

Minimal sequence (same as the guide):

```bash
# Deploy vLLM Semantic Router manifests
kubectl apply -k deploy/kubernetes/
kubectl wait --for=condition=Available deployment/semantic-router \
  -n vllm-semantic-router-system --timeout=600s

# Install Envoy Gateway
helm upgrade -i eg oci://docker.io/envoyproxy/gateway-helm \
  --version v0.0.0-latest \
  --namespace envoy-gateway-system \
  --create-namespace
kubectl wait --timeout=300s -n envoy-gateway-system \
  deployment/envoy-gateway --for=condition=Available

# Install Envoy AI Gateway
helm upgrade -i aieg oci://docker.io/envoyproxy/ai-gateway-helm \
  --version v0.0.0-latest \
  --namespace envoy-ai-gateway-system \
  --create-namespace
kubectl wait --timeout=300s -n envoy-ai-gateway-system \
  deployment/ai-gateway-controller --for=condition=Available

# Install Gateway API Inference Extension CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.1/manifests.yaml
kubectl get crd | grep inference
```
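
If any of the `kubectl wait` commands time out, a quick look at pod status in each namespace is a reasonable first check (supplementary to the guide):

```bash
# Inspect pod status in the namespaces created above
kubectl get pods -n vllm-semantic-router-system
kubectl get pods -n envoy-gateway-system
kubectl get pods -n envoy-ai-gateway-system
```
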

Apply AI Gateway configuration and create the inference pool per the guide:

```bash
# Apply AI Gateway configuration
kubectl apply -f deploy/kubernetes/ai-gateway/configuration

# Restart controllers to pick up new config
kubectl rollout restart -n envoy-gateway-system deployment/envoy-gateway
kubectl rollout restart -n envoy-ai-gateway-system deployment/ai-gateway-controller
kubectl wait --timeout=120s -n envoy-gateway-system deployment/envoy-gateway --for=condition=Available
kubectl wait --timeout=120s -n envoy-ai-gateway-system deployment/ai-gateway-controller --for=condition=Available

# Create inference pool
kubectl apply -f deploy/kubernetes/ai-gateway/inference-pool
sleep 30

# Verify inference pool
kubectl get inferencepool vllm-semantic-router -n vllm-semantic-router-system -o yaml
```
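
It is also worth confirming that the Gateway itself has been accepted and programmed; the Gateway name and namespace below are the ones referenced by the label selector in Step 3:

```bash
# Check that Envoy Gateway has programmed the Gateway resource
kubectl get gateway vllm-semantic-router -n vllm-semantic-router-system
```
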

---

## Step 3: Test the deployment

Port-forward to the Envoy service and send a test request, following the guide:

```bash
export GATEWAY_IP="localhost:8080"
export ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system \
  --selector=gateway.envoyproxy.io/owning-gateway-namespace=vllm-semantic-router-system,gateway.envoyproxy.io/owning-gateway-name=vllm-semantic-router \
  -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n envoy-gateway-system svc/$ENVOY_SERVICE 8080:80
```
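
`kubectl port-forward` blocks the current shell, so either run it in a separate terminal or background it, as in this variant:

```bash
# Background the port-forward so the same shell can send test requests
kubectl port-forward -n envoy-gateway-system svc/$ENVOY_SERVICE 8080:80 &
```
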

Send a chat completions request:

```bash
curl -i -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [
      {"role": "user", "content": "What is the derivative of f(x) = x^3 + 2x^2 - 5x + 7?"}
    ]
  }'
```
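
Because `"model": "MoM"` delegates model selection to the semantic router, the `model` field of the response should reveal which backend model actually served the request (assuming an OpenAI-compatible response body). With `jq` installed, you can extract it directly:

```bash
# Extract the backend model the router selected (requires jq)
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MoM", "messages": [{"role": "user", "content": "What is 2 + 2?"}]}' \
  | jq -r '.model'
```
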

---

## Troubleshooting

- If the gateway is not accessible, check the Gateway and Envoy service per the guide.
- If the inference pool is not ready, `kubectl describe` the `InferencePool` and check the controller logs.
- If the semantic router is not responding, check its pod status and logs.
- If requests return an error code, check the production stack router logs (see the diagnostic sketch below).
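
The following diagnostics map onto the items above; the deployment and Service names are taken from earlier steps, and the production stack router is addressed through its Service since its Deployment name depends on your release:

```bash
# Inference pool details plus controller and router logs
kubectl describe inferencepool vllm-semantic-router -n vllm-semantic-router-system
kubectl logs -n vllm-semantic-router-system deployment/semantic-router
kubectl logs -n envoy-ai-gateway-system deployment/ai-gateway-controller
kubectl logs svc/vllm-router-service   # production stack router logs
```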
