# Install with Gateway API Inference Extension

This guide provides step-by-step instructions for integrating the vLLM Semantic Router (vSR) with Istio and the Kubernetes Gateway API Inference Extension (GIE). This combination lets you manage self-hosted, OpenAI-compatible models using Kubernetes-native APIs with advanced, load-aware routing.

## Architecture Overview

The deployment consists of three main components:

- **vLLM Semantic Router**: The brain that classifies incoming requests based on their content.
- **Istio & Gateway API**: The network mesh and the front door for all traffic entering the cluster.
- **Gateway API Inference Extension (GIE)**: A set of Kubernetes-native APIs (`InferencePool`, etc.) for managing and scaling self-hosted model backends.

In practice, a request enters through the Istio-managed Gateway, Envoy forwards it to vSR over `ext_proc` for classification, and GIE's endpoint selection picks the healthiest replica in the chosen model's `InferencePool`.

## Benefits of Integration

Integrating vSR with Istio and GIE provides a robust, Kubernetes-native solution for serving LLMs with several key benefits:

### 1. **Kubernetes-Native LLM Management**
Manage your models, routing, and scaling policies directly through `kubectl` using familiar Custom Resource Definitions (CRDs).

### 2. **Intelligent Model and Replica Routing**
Combine vSR's prompt-based model routing with GIE's load-aware replica selection. Requests are sent not only to the right model but also to the healthiest replica, all in a single, efficient hop.

### 3. **Protection from Overload**
The built-in scheduler tracks GPU load and request queues, automatically shedding traffic to prevent your model servers from crashing under high demand.

### 4. **Deep Observability**
Gain insights from both high-level Gateway metrics and detailed vSR performance data (such as token usage and classification accuracy) to monitor and troubleshoot your entire AI stack.

### 5. **Secure Multi-Tenancy**
Isolate tenant workloads using standard Kubernetes namespaces and `HTTPRoutes`. Apply rate limits and other policies while sharing a common, secure gateway infrastructure.

## Supported Backend Models

This architecture works with any self-hosted model that exposes an **OpenAI-compatible API**. The demo backends in this guide use vLLM to serve Llama3 and Phi-4, but you can replace them with your own model servers.

## Prerequisites

Before starting, ensure you have the following tools installed:

- [Docker](https://docs.docker.com/get-docker/) or another container runtime.
- [kind](https://kind.sigs.k8s.io/) v0.22+ or any Kubernetes 1.29+ cluster.
- [kubectl](https://kubernetes.io/docs/tasks/tools/) v1.30+.
- [Helm](https://helm.sh/) v3.14+.
- [istioctl](https://istio.io/latest/docs/ops/diagnostic-tools/istioctl/) v1.28+.
- A Hugging Face token stored in the `HF_TOKEN` environment variable, required for the sample vLLM deployments to download models.

You can validate your toolchain versions with the following commands:

```bash
kind version
kubectl version --client
helm version --short
istioctl version --remote=false
```
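
If `HF_TOKEN` is not already set, export it now; the value below is a placeholder for your own Hugging Face access token.

```bash
# Replace the placeholder with your own token
export HF_TOKEN="<your-hugging-face-token>"

# Fail fast if the variable is empty or unset
echo "${HF_TOKEN:?HF_TOKEN is not set}" > /dev/null
```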

## Step 1: Create a Kind Cluster (Optional)

If you don't have a Kubernetes cluster, create a local one for testing:

```bash
kind create cluster --name vsr-gie

# Verify the cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s
```
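
kind registers a kubeconfig context named `kind-<cluster-name>`, so you can confirm that `kubectl` is pointing at the new cluster:

```bash
# For a cluster named "vsr-gie", the context is "kind-vsr-gie"
kubectl cluster-info --context kind-vsr-gie
```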

## Step 2: Install Istio

Install Istio with support for the Gateway API and external processing:

```bash
# Download and install Istio
export ISTIO_VERSION=1.29.0
curl -L https://istio.io/downloadIstio | ISTIO_VERSION=$ISTIO_VERSION sh -
export PATH="$PWD/istio-$ISTIO_VERSION/bin:$PATH"
istioctl install -y --set profile=minimal --set values.pilot.env.ENABLE_GATEWAY_API=true

# Verify Istio is ready
kubectl wait --for=condition=Available deployment/istiod -n istio-system --timeout=300s
```
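
As a quick sanity check, you can confirm the flag was applied to the `istiod` deployment (a simple grep, assuming the default deployment name):

```bash
# Should print the ENABLE_GATEWAY_API environment variable entry
kubectl -n istio-system get deploy istiod -o yaml | grep -A1 ENABLE_GATEWAY_API
```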

## Step 3: Install Gateway API & GIE CRDs

Install the Custom Resource Definitions (CRDs) for the standard Gateway API and the Inference Extension:

```bash
# Install Gateway API CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/standard-install.yaml

# Install Gateway API Inference Extension CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.1.0/manifests.yaml

# Verify CRDs are installed
kubectl get crd | grep 'gateway.networking.k8s.io'
kubectl get crd | grep 'inference.networking.k8s.io'
```
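
The exact CRD list depends on the release, but you should see entries similar to the following:

```bash
# Example output (abbreviated):
#   gatewayclasses.gateway.networking.k8s.io
#   gateways.gateway.networking.k8s.io
#   httproutes.gateway.networking.k8s.io
#   inferenceobjectives.inference.networking.k8s.io
#   inferencepools.inference.networking.k8s.io
```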

## Step 4: Deploy Demo LLM Servers

Deploy two vLLM instances (Llama3 and Phi-4) to act as our backends. The models are downloaded automatically from Hugging Face.

```bash
# Create namespace and secret for the models
kubectl create namespace llm-backends --dry-run=client -o yaml | kubectl apply -f -
kubectl -n llm-backends create secret generic hf-token --from-literal=token="$HF_TOKEN"

# Deploy the model servers
kubectl -n llm-backends apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/vLlama3.yaml
kubectl -n llm-backends apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/vPhi4.yaml

# Wait for models to be ready (this may take several minutes)
kubectl -n llm-backends wait --for=condition=Ready pods --all --timeout=10m
```
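
If you later want to swap in your own backend, the essential shape of such a Deployment is sketched below (shown with `--dry-run=client` so nothing is actually deployed). This is a minimal illustration, not the contents of `vLlama3.yaml`; the name, image, model, and label are assumptions to adapt.

```bash
kubectl -n llm-backends apply --dry-run=client -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-llm                             # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-llm
  template:
    metadata:
      labels:
        app: my-llm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # assumption: upstream vLLM server image
          args: ["--model", "meta-llama/Meta-Llama-3-8B-Instruct"]
          ports:
            - containerPort: 8000          # vLLM's default OpenAI-compatible port
          env:
            - name: HF_TOKEN               # reuses the secret created above
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
EOF
```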

## Step 5: Deploy vLLM Semantic Router

Deploy the vLLM Semantic Router using its official Helm chart. This component will run as an `ext_proc` server that Istio calls for routing decisions.

```bash
helm upgrade -i semantic-router oci://ghcr.io/vllm-project/charts/semantic-router \
  --version v0.0.0-latest \
  --namespace vllm-semantic-router-system \
  --create-namespace \
  -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/ai-gateway/semantic-router-values/values.yaml

# Wait for the router to be ready
kubectl -n vllm-semantic-router-system wait --for=condition=Available deploy/semantic-router --timeout=10m
```
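
You can also list what the chart created; the chart's Service is the gRPC endpoint that Envoy's `ext_proc` filter will call:

```bash
kubectl -n vllm-semantic-router-system get pods,svc
```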

## Step 6: Deploy Gateway and Routing Logic

Apply the final set of resources to create the public-facing Gateway and wire everything together. This includes the `Gateway`, `InferencePools` for GIE, `HTTPRoutes` for traffic matching, and Istio's `EnvoyFilter`.

```bash
# Apply all routing and gateway resources
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/gateway.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/inferencepool-llama.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/inferencepool-phi4.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/httproute-llama-pool.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/httproute-phi4-pool.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/destinationrule.yaml
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/envoyfilter.yaml

# Verify the Gateway is programmed by Istio
kubectl wait --for=condition=Programmed gateway/inference-gateway --timeout=120s
```
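
To make the wiring concrete: vSR classifies each request and sets an `x-selected-model` header, and each `HTTPRoute` matches that header and forwards traffic to the corresponding `InferencePool`. The sketch below illustrates the pattern with `--dry-run=client` so nothing in the cluster changes; it is a simplified stand-in for `httproute-phi4-pool.yaml`, not its actual contents.

```bash
# A simplified sketch; the real manifests applied above are authoritative
kubectl apply --dry-run=client -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: phi4-route-sketch
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1
          headers:
            - name: x-selected-model
              value: phi4-mini
      backendRefs:
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: phi4-mini
EOF
```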

## Testing the Deployment

### Port Forwarding

Set up port forwarding to access the gateway from your local machine:

```bash
# The Gateway service is named 'inference-gateway-istio' and lives in the default namespace
kubectl port-forward svc/inference-gateway-istio 8080:80
```

### Send Test Requests

Once port forwarding is active, you can send OpenAI-compatible requests to `localhost:8080`.

**Test 1: Explicitly request a model**
This request bypasses the semantic router's model selection and goes directly to the specified model pool.

```bash
curl -sS http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3-8b",
    "messages": [{"role": "user", "content": "Summarize the Kubernetes Gateway API in three sentences."}]
  }'
```
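
If you have `jq` installed, you can extract just the assistant's reply:

```bash
curl -sS http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "llama3-8b", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}' \
  | jq -r '.choices[0].message.content'
```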

**Test 2: Let the Semantic Router choose the model**
By setting `"model": "auto"`, you ask vSR to classify the prompt. It will identify this as a "math" query and add the `x-selected-model: phi4-mini` header, which the `HTTPRoute` uses to route the request.

```bash
curl -sS http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is 2+2 * (5-1)?"}],
    "max_tokens": 64
  }'
```
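
The `model` field in the response reports which backend actually served the request, which is a convenient way to confirm vSR's choice:

```bash
curl -sS http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is 2+2 * (5-1)?"}], "max_tokens": 64}' \
  | jq -r '.model'
```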

## Troubleshooting

**Problem: CRDs are missing**
If you see errors like `no matches for kind "InferencePool"`, check that the CRDs are installed.

```bash
# Check for GIE CRDs
kubectl get crd | grep inference.networking.k8s.io
```

**Problem: Gateway is not ready**
If `kubectl port-forward` fails or requests time out, check the Gateway status.

```bash
# The "Programmed" condition should be "True"
kubectl get gateway inference-gateway -o yaml
```

**Problem: vSR is not being called**
If requests work but routing seems incorrect, check the Istio proxy logs for `ext_proc` errors.

```bash
# Get the auto-deployed gateway pod (it is labeled with the Gateway's name)
export ISTIO_GW_POD=$(kubectl get pod -l gateway.networking.k8s.io/gateway-name=inference-gateway -o jsonpath='{.items[0].metadata.name}')

# Check its logs
kubectl logs $ISTIO_GW_POD -c istio-proxy | grep ext_proc
```

**Problem: Requests are failing**
Check the logs for the vLLM Semantic Router and the backend models.

```bash
# Check vSR logs
kubectl logs deploy/semantic-router -n vllm-semantic-router-system

# Check Llama3 backend logs
kubectl logs -n llm-backends -l app=vllm-llama3-8b-instruct
```

## Cleanup

To remove all the resources created in this guide, run the following commands.

```bash
# 1. Delete all applied Kubernetes resources
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/gateway.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/inferencepool-llama.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/inferencepool-phi4.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/httproute-llama-pool.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/llmd-base/httproute-phi4-pool.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/destinationrule.yaml
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/kubernetes/istio/envoyfilter.yaml
kubectl delete ns llm-backends

# 2. Uninstall Helm releases
helm uninstall semantic-router -n vllm-semantic-router-system

# 3. Uninstall Istio
istioctl uninstall -y --purge

# 4. Delete the kind cluster (Optional)
kind delete cluster --name vsr-gie
```

## Next Steps

- **Customize Routing**: Modify the `values.yaml` file for the `semantic-router` Helm chart to define your own routing categories and rules (see the sketch after this list).
- **Add Your Own Models**: Replace the demo Llama3 and Phi-4 deployments with your own OpenAI-compatible model servers.
- **Explore Advanced GIE Features**: Look into using `InferenceObjective` for more advanced autoscaling and scheduling policies.
- **Monitor Performance**: Integrate your Gateway and vSR with Prometheus and Grafana to build monitoring dashboards.
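
As a starting point for customizing routing, a category-to-model mapping might look roughly like the sketch below. The keys shown (`config`, `categories`, `model_scores`) follow the project's configuration style but are assumptions; consult the chart's `values.yaml` for the authoritative schema.

```bash
# Illustrative only; verify key names against the chart's values.yaml
cat <<'EOF' > my-values.yaml
config:
  default_model: llama3-8b
  categories:
    - name: math
      model_scores:
        - model: phi4-mini
          score: 1.0
EOF

helm upgrade -i semantic-router oci://ghcr.io/vllm-project/charts/semantic-router \
  --namespace vllm-semantic-router-system \
  -f my-values.yaml
```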