
Commit 1295a6a

Add Semantic Router support for Istio-Envoy ExtProc gateway
Signed-off-by: Sanjeev Rampal <[email protected]>
1 parent a0f0581 commit 1295a6a

File tree

12 files changed: +974 −11 lines changed


deploy/kubernetes/istio/README.md

Lines changed: 259 additions & 0 deletions
# vLLM Semantic Router as ExtProc server for Istio Gateway

This guide provides step-by-step instructions for deploying the vLLM Semantic Router (vsr) with Istio Gateway on Kubernetes. Istio Gateway uses Envoy under the covers, so vsr can be attached to it as an ExtProc server. Several topologies are possible when combining Istio Gateway with vsr; this document describes one of the common ones.

## Architecture Overview

The deployment consists of:

- **vLLM Semantic Router**: Provides intelligent request routing and classification
- **Istio Gateway**: Gateway data plane that uses an Envoy proxy under the covers
- **Gateway API Inference Extension**: Additional control and data plane for endpoint picking that can optionally attach to the same Istio gateway as vLLM Semantic Router
- **Two instances of vLLM, each serving one model**: Example backend LLMs for illustrating semantic routing in this topology

## Prerequisites

Before starting, ensure you have the following tools installed:

- [Docker](https://docs.docker.com/get-docker/) - Container runtime
- [minikube](https://minikube.sigs.k8s.io/docs/start/) - Local Kubernetes
- [kind](https://kind.sigs.k8s.io/docs/user/quick-start/#installation) - Kubernetes in Docker
- [kubectl](https://kubernetes.io/docs/tasks/tools/) - Kubernetes CLI

Either minikube or kind can be used to create the local Kubernetes cluster needed for this exercise, so you only need one of the two. We use minikube in the description below, but the same steps should work with a kind cluster once the cluster is created in Step 1.

We also deploy two different LLMs in this exercise to illustrate the semantic routing and model routing function more clearly, so ideally you should run this on a machine with GPU support and adequate memory and storage for the two models used here. Equivalent steps also work on a smaller, CPU-only server running smaller LLMs.

## Step 1: Create Minikube Cluster

Create a local Kubernetes cluster via minikube (or equivalently via kind).

```bash
# Create cluster
minikube start \
  --driver docker \
  --container-runtime docker \
  --gpus all \
  --memory no-limit \
  --cpus no-limit

# Verify cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s
```
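Optionally, you can confirm that the cluster actually advertises GPU resources before deploying the models. This is just a convenience check and assumes the NVIDIA device plugin was set up by `minikube start --gpus all`:

```bash
# Optional: confirm the node advertises NVIDIA GPUs
# (requires the NVIDIA device plugin, which `minikube start --gpus all` is expected to configure)
kubectl describe nodes | grep -i "nvidia.com/gpu"
```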
## Step 2: Deploy LLM models service

As noted earlier, in this exercise we deploy two LLMs, viz. a llama3-8b model (meta-llama/Llama-3.1-8B-Instruct) and a phi4-mini model (microsoft/Phi-4-mini-instruct). We chose to serve these models using two separate instances of the [vLLM inference server](https://docs.vllm.ai/en/latest/) running in the default namespace of the Kubernetes cluster. You may use any inference server to serve these models, but manifests to run them in vLLM containers are provided as a reference.

```bash
# Create vLLM service running llama3-8b
kubectl apply -f deploy/kubernetes/istio/vLlama3.yaml
```

The first time this is run it may take several (10+) minutes to download the model before the vLLM pod running it reaches the READY state. Similarly, deploy the second LLM (phi4-mini) and wait several minutes until its pod is READY; a `kubectl wait` example is shown after the next block.

```bash
# Create vLLM service running phi4-mini
kubectl apply -f deploy/kubernetes/istio/vPhi4.yaml
```
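If you prefer to block until both models are up rather than polling, you can wait on the deployments. The deployment names `llama-8b` and `phi4-mini` below are assumptions based on the reference manifests and the pod names shown in the next step:

```bash
# Wait for both model deployments to become Available
# (deployment names assume the reference manifests in this repo)
kubectl wait --for=condition=Available deployment/llama-8b deployment/phi4-mini --timeout=1800s
```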
At the end of this you should see that both vLLM pods are READY and serving these LLMs, using the commands below. You should also see Kubernetes services exposing the IP/port on which these models are served. In the example below, the llama3-8b model is served via a Kubernetes service with a service IP of 10.108.250.109 and port 80.

```bash
# Verify that vLLM pods running the two LLMs are READY and serving

kubectl get pods
NAME                        READY   STATUS    RESTARTS   AGE
llama-8b-57b95475bd-ph7s4   1/1     Running   0          9d
phi4-mini-887476b56-74twv   1/1     Running   0          9d

# View the IP/port of the Kubernetes services on which these models are being served

kubectl get service
NAME         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1        <none>        443/TCP   36d
llama-8b     ClusterIP   10.108.250.109   <none>        80/TCP    18d
phi4-mini    ClusterIP   10.97.252.33     <none>        80/TCP    9d
```

## Step 3: Update vsr config if needed

The file deploy/kubernetes/istio/config.yaml is used to configure vsr when it is installed in the next step. The example config file already provided in this repo should work if you use the same LLMs as in this exercise, but you can choose to play with this config to enable or disable individual vsr features. Ensure that the vllm_endpoints in the file match the IP/port of the LLM services you are running. It is usually good to start with basic vsr features such as prompt classification and model routing before experimenting with other features, as described elsewhere in the vsr documentation.
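One way to cross-check the endpoints before installing is to compare the config file against the services created in Step 2. This is just a convenience check; the exact layout of the endpoint entries may differ in your copy of the config:

```bash
# Show the endpoint entries in the vsr config
grep -A 4 "vllm_endpoints" deploy/kubernetes/istio/config.yaml

# Compare against the service IPs/ports serving the two models
kubectl get svc llama-8b phi4-mini
```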
## Step 4: Deploy vLLM Semantic Router

Deploy the semantic router service with all required components:

```bash
# Deploy semantic router using Kustomize
kubectl apply -k deploy/kubernetes/istio/

# Wait for deployment to be ready (this may take several minutes for model downloads)
kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s

# Verify deployment status
kubectl get pods -n vllm-semantic-router-system
```
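Before wiring the router into Istio, it can also be useful to confirm that its ExtProc gRPC service is exposed. The command below simply lists whatever services the kustomize manifests created; the gRPC port is commonly 50051, but check your manifests:

```bash
# The ExtProc gRPC service created by the kustomize manifests should appear here
kubectl get svc -n vllm-semantic-router-system
```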
## Step 5: Install Istio Gateway, Gateway API, Inference Extension

We use a recent build of Istio for this exercise so that we also have the option of using the v1.0.0 GA version of the Gateway API Inference Extension CRDs and EPP functionality.

Follow the procedures described in the Gateway API [Inference Extensions documentation](https://gateway-api-inference-extension.sigs.k8s.io/guides/) to deploy the 1.28 (or newer) version of Istio Gateway, the Kubernetes Gateway API CRDs and the Gateway API Inference Extension v1.0.0. However, do not install any of the HTTPRoute resources from that guide; just use it to deploy the Istio gateway and CRDs. If everything is installed correctly, you should see the CRDs for the Gateway API and the Inference Extension, as well as running pods for the Istio gateway and istiod, using the commands shown below.

```bash
kubectl get crds | grep gateway
```

```bash
kubectl get crds | grep inference
```

```bash
kubectl get pods | grep istio
```

```bash
kubectl get pods -n istio-system
```

## Step 6: Install additional Istio configuration

Install the DestinationRule and EnvoyFilter needed for the Istio gateway to use the ExtProc-based interface with vLLM Semantic Router.

```bash
kubectl apply -f deploy/kubernetes/istio/destinationrule.yaml
kubectl apply -f deploy/kubernetes/istio/envoyfilter.yaml
```
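You can confirm that both resources were created. The command below lists them across namespaces rather than assuming particular resource names:

```bash
# Verify the Istio resources that wire the gateway to the ExtProc server
kubectl get destinationrules,envoyfilters -A
```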
## Step 7: Install gateway routes

Install HTTPRoutes in the Istio gateway.

```bash
kubectl apply -f deploy/kubernetes/istio/httproute-llama3-8b.yaml
kubectl apply -f deploy/kubernetes/istio/httproute-phi4-mini.yaml
```
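A quick way to confirm the routes were created and accepted by the gateway:

```bash
# Both HTTPRoutes should be listed; their status should show acceptance by the parent gateway
kubectl get httproute
```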
## Testing the Deployment

To expose the IP on which the Istio gateway listens for client requests from outside the cluster, you can use any standard Kubernetes option for external load balancing. We tested this feature by [deploying and configuring MetalLB](https://metallb.universe.tf/installation/) in the cluster as the LoadBalancer provider; refer to the MetalLB documentation for installation procedures if needed. Finally, for the minikube case, we get the external URL as shown below.

```bash
minikube service inference-gateway-istio --url
http://192.168.49.2:30913
```

Now we can send LLM prompts via curl to http://192.168.49.2:30913 to reach the Istio gateway, which will then use information from vLLM Semantic Router to dynamically route each request to one of the two backend LLMs.
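Your gateway address will differ from the one shown above. To keep the examples copy-pasteable, you can capture the URL in a shell variable (the variable name is just a local convenience) and substitute it for the literal address in the curl commands below:

```bash
# Capture the gateway URL printed by minikube so the curl examples can reuse it
GATEWAY_URL=$(minikube service inference-gateway-istio --url)
echo "$GATEWAY_URL"
```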
### Send Test Requests

Try the following cases, with and without "auto" model selection, to confirm that Istio + vsr together route queries to the appropriate model. The query responses include information about which model served each request.

Example queries to try include the following:

```bash
# Model name llama3-8b provided explicitly, should route to this backend
curl http://192.168.49.2:30913/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llama3-8b",
  "messages": [
    {"role": "user", "content": "Linux is said to be an open source kernel because "}
  ],
  "max_tokens": 100,
  "temperature": 0
}'
```

```bash
# Model name set to "auto", should categorize to "computer science" & route to llama3-8b
curl http://192.168.49.2:30913/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "auto",
  "messages": [
    {"role": "user", "content": "Linux is said to be an open source kernel because "}
  ],
  "max_tokens": 100,
  "temperature": 0
}'
```

```bash
# Model name phi4-mini provided explicitly, should route to this backend
curl http://192.168.49.2:30913/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "phi4-mini",
  "messages": [
    {"role": "user", "content": "2+2 is "}
  ],
  "max_tokens": 100,
  "temperature": 0
}'
```

```bash
# Model name set to "auto", should categorize to "math" & route to phi4-mini
curl http://192.168.49.2:30913/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "auto",
  "messages": [
    {"role": "user", "content": "2+2 is "}
  ],
  "max_tokens": 100,
  "temperature": 0
}'
```
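To see which backend actually served a request without reading the full JSON response by eye, you can extract the model field from the response. This assumes `jq` is installed and that the backends return the standard OpenAI-style chat completion schema:

```bash
# Print only the model that served the request (requires jq)
curl -s http://192.168.49.2:30913/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "auto",
  "messages": [{"role": "user", "content": "2+2 is "}],
  "max_tokens": 100,
  "temperature": 0
}' | jq -r '.model'
```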
## Troubleshooting

### Common Issues

**Gateway/front end not working:**

```bash
# Check Istio gateway status
kubectl get gateway

# Check Istio gateway service status
kubectl get svc inference-gateway-istio

# Check Istio's Envoy logs
kubectl logs deploy/inference-gateway-istio -c istio-proxy
```

**Semantic router not responding:**

```bash
# Check semantic router pod
kubectl get pods -n vllm-semantic-router-system

# Check semantic router service
kubectl get svc -n vllm-semantic-router-system

# Check semantic router logs
kubectl logs -n vllm-semantic-router-system deployment/semantic-router
```
## Cleanup

```bash
# Remove semantic router
kubectl delete -k deploy/kubernetes/istio/

# Remove Istio
istioctl uninstall --purge

# Remove LLMs
kubectl delete -f deploy/kubernetes/istio/vLlama3.yaml
kubectl delete -f deploy/kubernetes/istio/vPhi4.yaml

# Stop minikube cluster
minikube stop

# Delete minikube cluster
minikube delete
```
## Next Steps

- Test and experiment with different features of vLLM Semantic Router
- Explore additional use cases and topologies with Istio Gateway (including with EPP and LLM-D)
- Set up monitoring and observability
- Implement authentication and authorization
- Scale the semantic router deployment for production workloads
