Commit 2a820a6

📝 docs(gaie): add Gateway API inference extension docs (#664)

Signed-off-by: samzong <[email protected]>
1 parent 6a4ebf4

2 files changed: +272 −0 lines changed
Lines changed: 14 additions & 0 deletions
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
  namespace: default
spec:
  gatewayClassName: istio
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: All

Lines changed: 258 additions & 0 deletions

# Install with Gateway API Inference Extension

This guide provides step-by-step instructions for integrating the vLLM Semantic Router (vSR) with Istio and the Kubernetes Gateway API Inference Extension (GIE). This combination lets you manage self-hosted, OpenAI-compatible models using Kubernetes-native APIs for advanced, load-aware routing.

## Architecture Overview

The deployment consists of three main components:

- **vLLM Semantic Router**: The brain that classifies incoming requests based on their content.
- **Istio & Gateway API**: The network mesh and the front door for all traffic entering the cluster.
- **Gateway API Inference Extension (GIE)**: A set of Kubernetes-native APIs (`InferencePool`, etc.) for managing and scaling self-hosted model backends.

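At a high level, a request travels through these components as follows; this is a simplified sketch of the data path described in the rest of this guide:

```text
Client request ("model": "auto")
   │
   ▼
Istio Gateway (Gateway API)
   │  ext_proc call
   ▼
vLLM Semantic Router ── classifies the prompt, sets the x-selected-model header
   │
   ▼
HTTPRoute (matches x-selected-model) ──► InferencePool (GIE picks a healthy, load-aware replica)
   │
   ▼
vLLM model server (OpenAI-compatible API)
```
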
## Benefits of Integration

Integrating vSR with Istio and GIE provides a robust, Kubernetes-native solution for serving LLMs with several key benefits:

### 1. **Kubernetes-Native LLM Management**

Manage your models, routing, and scaling policies directly through `kubectl` using familiar Custom Resource Definitions (CRDs).

### 2. **Intelligent Model and Replica Routing**

Combine vSR's prompt-based model routing with GIE's smart, load-aware replica selection. This ensures requests are sent not only to the right model but also to the healthiest replica, all in a single, efficient hop.

### 3. **Protect Your Models from Overload**

The built-in scheduler tracks GPU load and request queues, automatically shedding traffic to prevent your model servers from crashing under high demand.

### 4. **Deep Observability**

Gain insights from both high-level Gateway metrics and detailed vSR performance data (such as token usage and classification accuracy) to monitor and troubleshoot your entire AI stack.

### 5. **Secure Multi-Tenancy**

Isolate tenant workloads using standard Kubernetes namespaces and `HTTPRoutes`. Apply rate limits and other policies while sharing a common, secure gateway infrastructure.

## Supported Backend Models

This architecture is designed to work with any self-hosted model that exposes an **OpenAI-compatible API**. The demo models in this guide use `vLLM` to serve Llama3 and Phi-4, but you can easily replace them with your own model servers.

## Prerequisites

Before starting, ensure you have the following tools installed:

- [Docker](https://docs.docker.com/get-docker/) or another container runtime.
- [kind](https://kind.sigs.k8s.io/) v0.22+ or any Kubernetes 1.29+ cluster.
- [kubectl](https://kubernetes.io/docs/tasks/tools/) v1.30+.
- [Helm](https://helm.sh/) v3.14+.
- [istioctl](https://istio.io/latest/docs/ops/diagnostic-tools/istioctl/) v1.28+.
- A Hugging Face token stored in the `HF_TOKEN` environment variable, required for the sample vLLM deployments to download models.

You can validate your toolchain versions with the following commands:

```bash
kind version
kubectl version --client
helm version --short
istioctl version --remote=false
```

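If you have not already set the Hugging Face token mentioned above, export it now so the later steps can create the `hf-token` secret; the value below is a placeholder, not a real token:

```bash
# Placeholder -- substitute your own Hugging Face access token
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
```
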
## Step 1: Create a Kind Cluster (Optional)

If you don't have a Kubernetes cluster, create a local one for testing:

```bash
kind create cluster --name vsr-gie

# Verify the cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s
```

## Step 2: Install Istio

Install Istio with support for the Gateway API and external processing:

```bash
# Download and install Istio
export ISTIO_VERSION=1.29.0
curl -L https://istio.io/downloadIstio | ISTIO_VERSION=$ISTIO_VERSION sh -
export PATH="$PWD/istio-$ISTIO_VERSION/bin:$PATH"
istioctl install -y --set profile=minimal --set values.pilot.env.ENABLE_GATEWAY_API=true

# Verify Istio is ready
kubectl wait --for=condition=Available deployment/istiod -n istio-system --timeout=300s
```

## Step 3: Install Gateway API & GIE CRDs

Install the Custom Resource Definitions (CRDs) for the standard Gateway API and the Inference Extension:

```bash
# Install Gateway API CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/standard-install.yaml

# Install Gateway API Inference Extension CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.1.0/manifests.yaml

# Verify CRDs are installed
kubectl get crd | grep 'gateway.networking.k8s.io'
kubectl get crd | grep 'inference.networking.k8s.io'
```

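Once the Gateway API CRDs are available, Istio registers a `GatewayClass` named `istio`, which is what the `Gateway` applied in Step 6 references through `gatewayClassName: istio`. You can optionally confirm it is present before continuing:

```bash
# The istio GatewayClass should be listed and eventually report ACCEPTED=True
kubectl get gatewayclass istio
```
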
## Step 4: Deploy Demo LLM Servers

Deploy two `vLLM` instances (Llama3 and Phi-4) to act as our backends. The model weights will be automatically downloaded from Hugging Face.

```bash
# Create namespace and secret for the models
kubectl create namespace llm-backends --dry-run=client -o yaml | kubectl apply -f -
kubectl -n llm-backends create secret generic hf-token --from-literal=token=$HF_TOKEN

# Deploy the model servers
kubectl -n llm-backends apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/vLlama3.yaml
kubectl -n llm-backends apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/vPhi4.yaml

# Wait for models to be ready (this may take several minutes)
kubectl -n llm-backends wait --for=condition=Ready pods --all --timeout=10m
```

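The manifests referenced above (`vLlama3.yaml`, `vPhi4.yaml`) are the source of truth for the demo backends. If you plan to substitute your own OpenAI-compatible server, such a backend looks roughly like the following minimal sketch; the image, model name, and resource requests here are illustrative assumptions rather than the exact contents of those manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b-instruct
  namespace: llm-backends
  labels:
    app: vllm-llama3-8b-instruct
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3-8b-instruct
  template:
    metadata:
      labels:
        app: vllm-llama3-8b-instruct
    spec:
      containers:
        - name: vllm
          # Upstream vLLM image that serves an OpenAI-compatible API on port 8000
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Meta-Llama-3-8B-Instruct"
          ports:
            - containerPort: 8000
          env:
            # Token from the hf-token secret created above, used to download the model
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          resources:
            limits:
              nvidia.com/gpu: 1
```
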
## Step 5: Deploy vLLM Semantic Router

Deploy the vLLM Semantic Router using its official Helm chart. This component will run as an `ext_proc` server that Istio calls for routing decisions.

```bash
helm upgrade -i semantic-router oci://ghcr.io/vllm-project/charts/semantic-router \
  --version v0.0.0-latest \
  --namespace vllm-semantic-router-system \
  --create-namespace \
  -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/ai-gateway/semantic-router-values/values.yaml

# Wait for the router to be ready
kubectl -n vllm-semantic-router-system wait --for=condition=Available deploy/semantic-router --timeout=10m
```

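Optionally, list what the chart created in the router namespace; the exact pod and service names depend on the chart version:

```bash
kubectl -n vllm-semantic-router-system get pods,svc
```
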
## Step 6: Deploy Gateway and Routing Logic

Apply the final set of resources to create the public-facing Gateway and wire everything together. This includes the `Gateway`, `InferencePools` for GIE, `HTTPRoutes` for traffic matching, and Istio's `EnvoyFilter`.

```bash
# Apply all routing and gateway resources
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/gateway.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/inferencepool-llama.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/inferencepool-phi4.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/httproute-llama-pool.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/httproute-phi4-pool.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/destinationrule.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/envoyfilter.yaml

# Verify the Gateway is programmed by Istio
kubectl wait --for=condition=Programmed gateway/inference-gateway --timeout=120s
```

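To make the wiring concrete, here is an illustrative sketch of how one of these `HTTPRoute`s binds the header set by vSR to a GIE `InferencePool`. The route and pool names below are assumptions for illustration; the manifests applied above define the actual resources:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: phi4-pool-route            # hypothetical name for illustration
  namespace: default
spec:
  parentRefs:
    - name: inference-gateway       # the Gateway created in this guide
  rules:
    - matches:
        - headers:
            # Header injected by the vLLM Semantic Router when "model" is "auto"
            - name: x-selected-model
              value: phi4-mini
      backendRefs:
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: phi4-mini           # hypothetical InferencePool name
```
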
## Testing the Deployment

### Method 1: Port Forwarding

Set up port forwarding to access the gateway from your local machine.

```bash
# The Gateway service is named 'inference-gateway-istio' and lives in the default namespace
kubectl port-forward svc/inference-gateway-istio 8080:80
```

### Send Test Requests

Once port forwarding is active, you can send OpenAI-compatible requests to `localhost:8080`.

**Test 1: Explicitly request a model**

This request bypasses the semantic router's logic and goes directly to the specified model pool.

```bash
curl -sS http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3-8b",
    "messages": [{"role": "user", "content": "Summarize the Kubernetes Gateway API in three sentences."}]
  }'
```

**Test 2: Let the Semantic Router choose the model**

By setting `"model": "auto"`, you ask vSR to classify the prompt. It will identify this as a "math" query and add the `x-selected-model: phi4-mini` header, which `HTTPRoute` uses to route the request.

```bash
curl -sS http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is 2+2 * (5-1)?"}],
    "max_tokens": 64
  }'
```

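To confirm which backend actually served an `auto` request, you can inspect the `model` field that OpenAI-compatible servers return in the response body (assuming `jq` is available):

```bash
curl -sS http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is 2+2 * (5-1)?"}], "max_tokens": 64}' \
  | jq -r '.model'
```
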
## Troubleshooting

**Problem: CRDs are missing**

If you see errors like `no matches for kind "InferencePool"`, check that the CRDs are installed.

```bash
# Check for GIE CRDs
kubectl get crd | grep inference.networking.k8s.io
```

**Problem: Gateway is not ready**

If `kubectl port-forward` fails or requests time out, check the Gateway status.

```bash
# The "Programmed" condition should be "True"
kubectl get gateway inference-gateway -o yaml
```

**Problem: vSR is not being called**

If requests work but routing seems incorrect, check the gateway proxy logs for `ext_proc` errors.

```bash
# The gateway is auto-deployed by Istio as the 'inference-gateway-istio' deployment
kubectl logs deploy/inference-gateway-istio | grep ext_proc
```

**Problem: Requests are failing**

Check the logs for the vLLM Semantic Router and the backend models.

```bash
# Check vSR logs
kubectl logs deploy/semantic-router -n vllm-semantic-router-system

# Check Llama3 backend logs
kubectl logs -n llm-backends -l app=vllm-llama3-8b-instruct
```

## Cleanup

To remove all the resources created in this guide, run the following commands.

```bash
# 1. Delete all applied Kubernetes resources
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/gateway.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/inferencepool-llama.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/inferencepool-phi4.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/httproute-llama-pool.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/httproute-phi4-pool.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/destinationrule.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/envoyfilter.yaml
kubectl delete ns llm-backends

# 2. Uninstall the Helm release
helm uninstall semantic-router -n vllm-semantic-router-system

# 3. Uninstall Istio
istioctl uninstall -y --purge

# 4. Delete the kind cluster (Optional)
kind delete cluster --name vsr-gie
```

## Next Steps

- **Customize Routing**: Modify the `values.yaml` file for the `semantic-router` Helm chart to define your own routing categories and rules.
- **Add Your Own Models**: Replace the demo Llama3 and Phi-4 deployments with your own OpenAI-compatible model servers.
- **Explore Advanced GIE Features**: Look into using `InferenceObjective` for more advanced autoscaling and scheduling policies.
- **Monitor Performance**: Integrate your Gateway and vSR with Prometheus and Grafana to build monitoring dashboards.
