## README.md (1 addition, 1 deletion)
@@ -39,7 +39,7 @@
#### Auto-Selection of Models and LoRA Adapters
-An**Mixture-of-Models** (MoM) router that intelligently directs OpenAI API requests to the most suitable models or LoRA adapters from a defined pool based on **Semantic Understanding** of the request's intent (Complexity, Task, Tools).
+A **Mixture-of-Models** (MoM) router that intelligently directs OpenAI API requests to the most suitable models or LoRA adapters from a defined pool based on **Semantic Understanding** of the request's intent (Complexity, Task, Tools).
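To make the routing behavior concrete, here is a hedged sketch of the kind of OpenAI-style request such a router would receive; the endpoint URL and the `"auto"` model alias are illustrative assumptions, not API details confirmed by this PR:

```bash
# Hypothetical endpoint and model alias, for illustration only.
curl -s http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "auto",
        "messages": [{"role": "user", "content": "What is the derivative of x^2?"}]
      }'
```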
## website/docs/installation/k8s/istio.md (23 additions, 22 deletions)
@@ -1,7 +1,7 @@
# Install with Istio Gateway

This guide provides step-by-step instructions for deploying the vLLM Semantic Router (vsr) with Istio Gateway on Kubernetes. Istio Gateway uses Envoy under the covers, so vsr can be used with it. However, Envoy-based gateways differ in how they process the ExtProc protocol, so the deployment described here differs from the vsr deployments alongside other Envoy-based gateways covered in the other guides in this repo. Several architectures can combine Istio Gateway with vsr; this document describes one of them.
## Architecture Overview
The deployment consists of:
@@ -16,20 +16,20 @@ The deployment consists of:
Before starting, ensure you have the following tools installed:
Either minikube or kind can be used to create the local Kubernetes cluster needed for this exercise, so you only need one of the two. We use minikube in the description below, but the same steps should work with a kind cluster once the cluster is created in Step 1.
We will also deploy two different LLMs in this exercise to illustrate the semantic routing and model routing functions more clearly, so ideally you should run this on a machine with GPU support and adequate memory and storage for the two models used here. Equivalent steps also work on a smaller, CPU-only server running smaller LLMs.
## Step 1: Create Minikube Cluster
Create a local Kubernetes cluster via minikube (or equivalently via kind).
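A minimal sketch of the cluster-creation command, assuming the docker driver; the resource sizes below are illustrative assumptions, not values from this guide, so tune them for your hardware:

```bash
# Illustrative resource sizes; adjust for your hardware and driver.
minikube start --driver=docker --cpus=8 --memory=32g

# Confirm the node is Ready before proceeding.
kubectl get nodes
```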
The first run may take several (10+) minutes to download the model before the vLLM pod running it reaches the READY state. Similarly, deploy the second LLM (phi4-mini) and wait several minutes until its pod is READY.
At the end of this, both vLLM pods should be READY and serving their LLMs, which you can verify with the command below. You should also see Kubernetes services exposing the IP/port on which these models are served. In the example below, the llama3-8b model is served via a Kubernetes service with service IP 10.108.250.109 and port 80.
```bash
# Verify that vLLM pods running the two LLMs are READY and serving
kubectl get pods
NAME                        READY   STATUS    RESTARTS   AGE
llama-8b-57b95475bd-ph7s4   1/1     Running   0          9d
phi4-mini-887476b56-74twv   1/1     Running   0          9d

# View the IP/port of the Kubernetes services on which these models are being served
kubectl get service
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   36d
```
@@ -104,7 +104,7 @@ kubectl get pods -n istio-system
## Step 4: Update vsr config
The file deploy/kubernetes/istio/config.yaml is used to configure vsr when it is installed in the next step. Ensure that the models in the config file match the models you are running, and that the vllm_endpoints in the file match the IP/port of the LLM Kubernetes services. It is usually good to start with basic vsr features such as prompt classification and model routing before experimenting with other features such as PromptGuard or ToolCalling.
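As a quick cross-check before editing config.yaml, you can list the service IP/port pairs that the vllm_endpoints entries must match; the service names llama3-8b and phi4-mini below are assumptions based on this guide's examples, not names confirmed by the diff:

```bash
# Hypothetical service names; substitute the ones shown by `kubectl get service`.
kubectl get svc llama3-8b phi4-mini \
  -o custom-columns=NAME:.metadata.name,IP:.spec.clusterIP,PORT:.spec.ports[0].port
```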
To expose the IP on which the Istio gateway listens for client requests from outside the cluster, you can choose any standard Kubernetes option for external load balancing. We tested this feature by [deploying and configuring MetalLB](https://metallb.universe.tf/installation/) into the cluster as the LoadBalancer provider; refer to the MetalLB documentation for installation procedures if needed. Finally, for the minikube case, we get the external URL as shown below.
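The actual command was elided from this diff view; a plausible sketch for the minikube + MetalLB case, assuming Istio's default istio-ingressgateway service in the istio-system namespace (both assumptions, not confirmed by the diff), would be:

```bash
# With MetalLB installed, the gateway service should receive an EXTERNAL-IP.
kubectl -n istio-system get svc istio-ingressgateway

# Hypothetical: derive the base URL from that EXTERNAL-IP (port 80 assumed).
export GATEWAY_URL="http://$(kubectl -n istio-system get svc istio-ingressgateway \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')"
echo "$GATEWAY_URL"
```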
@@ -156,7 +157,7 @@ Try the following cases with and without model "auto" selection to confirm that
Example queries to try include the following:
```bash
# Model name llama3-8b provided explicitly, should route to this backend
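# The request body itself was elided from this diff view; the following is a
# hypothetical example, assuming an OpenAI-compatible /v1/chat/completions
# endpoint at the $GATEWAY_URL captured above:
curl -s "$GATEWAY_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3-8b",
        "messages": [{"role": "user", "content": "Say hello."}]
      }'
```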