
Commit 1f5d890

📝 docs(gaie): add Gateway API inference extension docs (#664)
Signed-off-by: samzong <[email protected]>
1 parent 6a4ebf4 commit 1f5d890

2 files changed: +267 -0 lines changed

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
  namespace: default
spec:
  gatewayClassName: istio
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All
```

Lines changed: 253 additions & 0 deletions
@@ -0,0 +1,253 @@
# Install with Gateway API Inference Extension

This guide provides step-by-step instructions for integrating the vLLM Semantic Router (vSR) with Istio and the Kubernetes Gateway API Inference Extension (GIE). This combination lets you manage self-hosted, OpenAI-compatible models through Kubernetes-native APIs with advanced, load-aware routing.

## Architecture Overview

The deployment consists of three main components, wired together as sketched below:

- **vLLM Semantic Router**: The brain that classifies incoming requests based on their content.
- **Istio & Gateway API**: The network mesh and the front door for all traffic entering the cluster.
- **Gateway API Inference Extension (GIE)**: A set of Kubernetes-native APIs (`InferencePool`, etc.) for managing and scaling self-hosted model backends.

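To make the wiring concrete, the sketch below shows how the pieces fit together: vSR runs as an Envoy `ext_proc` server, classifies the prompt, and injects a routing header such as `x-selected-model`; an `HTTPRoute` attached to the Gateway matches that header and forwards the request to an `InferencePool`, which GIE's endpoint picker resolves to a healthy replica. This is an illustration only; the route and pool names are assumptions, and the real manifests are applied in Step 6.

```yaml
# Illustrative sketch - not one of the manifests applied in Step 6.
# "phi4-mini-route" and "phi4-mini-pool" are hypothetical names;
# x-selected-model is the header the Semantic Router sets.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: phi4-mini-route
  namespace: default
spec:
  parentRefs:
    - name: inference-gateway        # the Gateway created in this guide
  rules:
    - matches:
        - headers:
            - name: x-selected-model # injected by vSR via ext_proc
              value: phi4-mini
      backendRefs:
        - group: inference.networking.k8s.io
          kind: InferencePool        # GIE picks the healthiest replica in the pool
          name: phi4-mini-pool
```
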
## Benefits of Integration

Integrating vSR with Istio and GIE provides a robust, Kubernetes-native solution for serving LLMs with several key benefits:

### 1. **Kubernetes-Native LLM Management**
Manage your models, routing, and scaling policies directly through `kubectl` using familiar Custom Resource Definitions (CRDs).

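For example, once the CRDs from Step 3 are installed, you can discover and inspect the GIE resources with ordinary `kubectl` commands (a small sketch; the exact resources depend on the CRD version you install):

```bash
# List the API resources provided by the inference extension
kubectl api-resources --api-group=inference.networking.k8s.io

# Inspect the InferencePool spec schema served by the cluster
kubectl explain inferencepool.spec
```
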
### 2. **Intelligent Model and Replica Routing**
Combine vSR's prompt-based model routing with GIE's smart, load-aware replica selection. This ensures requests are sent not only to the right model but also to the healthiest replica, all in a single, efficient hop.

### 3. **Protect Your Models from Overload**
The built-in scheduler tracks GPU load and request queues, automatically shedding traffic to prevent your model servers from crashing under high demand.

### 4. **Deep Observability**
Gain insights from both high-level Gateway metrics and detailed vSR performance data (such as token usage and classification accuracy) to monitor and troubleshoot your entire AI stack.

### 5. **Secure Multi-Tenancy**
Isolate tenant workloads using standard Kubernetes namespaces and `HTTPRoutes`. Apply rate limits and other policies while sharing a common, secure gateway infrastructure.

## Supported Backend Models

This architecture works with any self-hosted model that exposes an **OpenAI-compatible API**. The demo models in this guide use `vLLM` to serve Llama3 and Phi-4, but you can replace them with your own model servers.

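For reference, any server that can answer `POST /v1/chat/completions` the way vLLM does will work behind the gateway. A minimal local example (the model name here is only an example):

```bash
# Start an OpenAI-compatible vLLM server on port 8000 for an example model
vllm serve microsoft/Phi-4-mini-instruct --port 8000
```
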
## Prerequisites

Before starting, ensure you have the following tools installed:

- [Docker](https://docs.docker.com/get-docker/) or another container runtime.
- [kind](https://kind.sigs.k8s.io/) v0.22+ or any Kubernetes 1.29+ cluster.
- [kubectl](https://kubernetes.io/docs/tasks/tools/) v1.30+.
- [Helm](https://helm.sh/) v3.14+.
- [istioctl](https://istio.io/latest/docs/ops/diagnostic-tools/istioctl/) v1.28+.
- A Hugging Face token stored in the `HF_TOKEN` environment variable, required for the sample vLLM deployments to download models.

You can validate your toolchain versions with the following commands:

```bash
kind version
kubectl version --client
helm version --short
istioctl version --remote=false
```

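It is also worth confirming the Hugging Face token is actually exported in your current shell, since Step 4 reads it when creating the `hf-token` secret (a minimal POSIX-shell check):

```bash
# Fail fast if the token is missing; Step 4 uses it via $HF_TOKEN
[ -n "$HF_TOKEN" ] || echo "HF_TOKEN is not set - export it before continuing"
```
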
## Step 1: Create a Kind Cluster (Optional)

If you don't have a Kubernetes cluster, create a local one for testing:

```bash
kind create cluster --name vsr-gie

# Verify the cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s
```

## Step 2: Install Istio

Install Istio with support for the Gateway API and external processing:

```bash
# Download and install Istio
export ISTIO_VERSION=1.29.0
curl -L https://istio.io/downloadIstio | ISTIO_VERSION=$ISTIO_VERSION sh -
export PATH="$PWD/istio-$ISTIO_VERSION/bin:$PATH"
istioctl install -y --set profile=minimal --set values.pilot.env.ENABLE_GATEWAY_API=true

# Verify Istio is ready
kubectl wait --for=condition=Available deployment/istiod -n istio-system --timeout=300s
```

## Step 3: Install Gateway API & GIE CRDs

Install the Custom Resource Definitions (CRDs) for the standard Gateway API and the Inference Extension:

```bash
# Install Gateway API CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/standard-install.yaml

# Install Gateway API Inference Extension CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.1.0/manifests.yaml

# Verify CRDs are installed
kubectl get crd | grep 'gateway.networking.k8s.io'
kubectl get crd | grep 'inference.networking.k8s.io'
```

## Step 4: Deploy Demo LLM Servers

Deploy two `vLLM` instances (Llama3 and Phi-4) to act as our backends. The models are downloaded automatically from Hugging Face.

```bash
# Create namespace and secret for the models
kubectl create namespace llm-backends --dry-run=client -o yaml | kubectl apply -f -
kubectl -n llm-backends create secret generic hf-token --from-literal=token=$HF_TOKEN

# Deploy the model servers
kubectl -n llm-backends apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/vLlama3.yaml
kubectl -n llm-backends apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/vPhi4.yaml

# Wait for models to be ready (this may take several minutes)
kubectl -n llm-backends wait --for=condition=Ready pods --all --timeout=10m
```

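If you later want to substitute your own backend for one of the demo servers, a Deployment along these lines is usually enough. This is a minimal sketch, not the actual repo manifest; the name, labels, model, and GPU request are assumptions to adapt:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model-server              # hypothetical name
  namespace: llm-backends
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-model-server
  template:
    metadata:
      labels:
        app: my-model-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # official vLLM OpenAI-compatible server image
          args: ["--model", "meta-llama/Meta-Llama-3-8B-Instruct", "--port", "8000"]
          env:
            - name: HUGGING_FACE_HUB_TOKEN # lets vLLM pull gated models
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"          # assumes a GPU node; adjust for your cluster
```
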
## Step 5: Deploy vLLM Semantic Router

Deploy the vLLM Semantic Router using its official Helm chart. This component will run as an `ext_proc` server that Istio calls for routing decisions.

```bash
helm upgrade -i semantic-router oci://ghcr.io/vllm-project/charts/semantic-router \
  --version v0.0.0-latest \
  --namespace vllm-semantic-router-system \
  --create-namespace \
  -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/ai-gateway/semantic-router-values/values.yaml

# Wait for the router to be ready
kubectl -n vllm-semantic-router-system wait --for=condition=Available deploy/semantic-router --timeout=10m
```

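Before moving on, you can list the Services the chart created; the `EnvoyFilter` applied in Step 6 points the gateway's `ext_proc` filter at the router's gRPC service, whose exact name and port come from the chart's values (treat this as orientation, not a required step):

```bash
# List the router's Services; one of them exposes the ext_proc gRPC port used by Envoy
kubectl -n vllm-semantic-router-system get svc -o wide
```
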
## Step 6: Deploy Gateway and Routing Logic

Apply the final set of resources to create the public-facing Gateway and wire everything together. This includes the `Gateway`, `InferencePools` for GIE, `HTTPRoutes` for traffic matching, and Istio's `EnvoyFilter`.

```bash
# Apply all routing and gateway resources
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/gateway.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/inferencepool-llama.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/inferencepool-phi4.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/httproute-llama-pool.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/httproute-phi4-pool.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/destinationrule.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/envoyfilter.yaml

# Verify the Gateway is programmed by Istio
kubectl wait --for=condition=Programmed gateway/inference-gateway --timeout=120s
```

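To confirm the routing resources were accepted, list them across namespaces (a quick check; the printed columns depend on the installed CRD versions):

```bash
# All HTTPRoutes and InferencePools created above should be listed without errors
kubectl get httproutes,inferencepools -A
```
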
## Testing the Deployment

### Port Forwarding

Set up port forwarding to access the gateway from your local machine.

```bash
# The Gateway service is named 'inference-gateway-istio' and lives in the default namespace
kubectl port-forward svc/inference-gateway-istio 8080:80
```

### Send Test Requests

Once port forwarding is active, you can send OpenAI-compatible requests to `localhost:8080`.

**Test 1: Explicitly request a model**
This request bypasses the semantic router's logic and goes directly to the specified model pool.

```bash
curl -sS http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3-8b",
    "messages": [{"role": "user", "content": "Summarize the Kubernetes Gateway API in three sentences."}]
  }'
```

**Test 2: Let the Semantic Router choose the model**
By setting `"model": "auto"`, you ask vSR to classify the prompt. It should identify this as a "math" query and add the `x-selected-model: phi4-mini` header, which the `HTTPRoute` rules use to route the request.

```bash
curl -sS http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is 2+2 * (5-1)?"}],
    "max_tokens": 64
  }'
```

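To confirm which backend actually served an `"auto"` request, read the `model` field that OpenAI-compatible servers echo back in the response body (assuming `jq` is installed locally):

```bash
# Print only the model name reported by the backend that handled the request
curl -sS http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is 12 * 7?"}], "max_tokens": 16}' \
  | jq -r '.model'
```
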
## Troubleshooting

**Problem: CRDs are missing**
If you see errors like `no matches for kind "InferencePool"`, check that the CRDs are installed.

```bash
# Check for GIE CRDs
kubectl get crd | grep inference.networking.k8s.io
```

**Problem: Gateway is not ready**
If `kubectl port-forward` fails or requests time out, check the Gateway status.

```bash
# The "Programmed" condition should be "True"
kubectl get gateway inference-gateway -o yaml
```

**Problem: vSR is not being called**
If requests work but routing seems incorrect, check the gateway's Istio proxy logs for `ext_proc` errors.

```bash
# Get the gateway pod that Istio created for the inference-gateway resource
export ISTIO_GW_POD=$(kubectl get pod -l gateway.networking.k8s.io/gateway-name=inference-gateway -o jsonpath='{.items[0].metadata.name}')

# Check its logs
kubectl logs $ISTIO_GW_POD -c istio-proxy | grep ext_proc
```

**Problem: Requests are failing**
Check the logs for the vLLM Semantic Router and the backend models.

```bash
# Check vSR logs
kubectl logs deploy/semantic-router -n vllm-semantic-router-system

# Check Llama3 backend logs
kubectl logs -n llm-backends -l app=vllm-llama3-8b-instruct
```

## Cleanup

To remove all the resources created in this guide, run the following commands.

```bash
# 1. Delete all applied Kubernetes resources
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/gateway.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/inferencepool-llama.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/inferencepool-phi4.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/httproute-llama-pool.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/httproute-phi4-pool.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/destinationrule.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/envoyfilter.yaml
kubectl delete ns llm-backends

# 2. Uninstall Helm releases
helm uninstall semantic-router -n vllm-semantic-router-system

# 3. Uninstall Istio
istioctl uninstall -y --purge

# 4. Delete the kind cluster (Optional)
kind delete cluster --name vsr-gie
```

## Next Steps

- **Customize Routing**: Modify the `values.yaml` file for the `semantic-router` Helm chart to define your own routing categories and rules.
- **Add Your Own Models**: Replace the demo Llama3 and Phi-4 deployments with your own OpenAI-compatible model servers.
- **Explore Advanced GIE Features**: Look into using `InferenceObjective` for more advanced autoscaling and scheduling policies.
- **Monitor Performance**: Integrate your Gateway and vSR with Prometheus and Grafana to build monitoring dashboards.
