
Commit 1d4986a

feat(llm-d): integrate vsr with llm-d (#589)
The LLM-D related changes are in deploy/kubernetes/llm-d. Additionally, this adds some docs cleanup to the Istio guide committed previously.

Signed-off-by: Sanjeev Rampal <[email protected]>
1 parent 7486a05 commit 1d4986a

File tree: 8 files changed, +806 / -26 lines


deploy/kubernetes/istio/README.md

Lines changed: 26 additions & 26 deletions
@@ -41,12 +41,12 @@ $ minikube start \
 $ kubectl wait --for=condition=Ready nodes --all --timeout=300s
 ```

-## Step 2: Deploy LLM models service
+## Step 2: Deploy LLM models

-In this exercise we deploy two LLMs viz. a llama3-8b model (meta-llama/Llama-3.1-8B-Instruct) and a phi4-mini model (microsoft/Phi-4-mini-instruct). We serve these models using two separate instances of the [vLLM inference server](https://docs.vllm.ai/en/latest/) running in the default namespace of the kubernetes cluster. You may choose any other inference engines as long as they expose OpenAI API endpoints. First install a secret for your HuggingFace token previously stored in env variable HF_TOKEN and then deploy the models as shown below.
+In this exercise we deploy two LLMs viz. a llama3-8b model (meta-llama/Llama-3.1-8B-Instruct) and a phi4-mini model (microsoft/Phi-4-mini-instruct). We serve these models using two separate instances of the [vLLM inference server](https://docs.vllm.ai/en/latest/) running in the default namespace of the kubernetes cluster. You may choose any other inference engines as long as they expose OpenAI API endpoints. First install a secret for your HuggingFace token previously stored in env variable HF_TOKEN and then deploy the models as shown below. Note that the file path names used in the example kubectl clis in this guide are expected to be executed from the top folder of this repo.

 ```bash
-kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN
+kubectl create secret generic hf-token-secret --from-literal=token=$HF_TOKEN
 ```

 ```bash
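Note on the hunk above: the secret rename matters because the vLLM manifests reference it by name. A quick way to confirm the renamed secret is in place before the pods mount it (this check is illustrative, not part of the commit; it assumes the default namespace and the `token` key from the command above):

```bash
# Confirm hf-token-secret exists and holds a non-empty `token` key
# (prints only the first few bytes of the decoded value).
kubectl get secret hf-token-secret -o jsonpath='{.data.token}' | base64 -d | cut -c1-6
```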
@@ -61,7 +61,7 @@ This may take several (10+) minutes the first time this is run to download the m
 kubectl apply -f deploy/kubernetes/istio/vPhi4.yaml
 ```

-At the end of this you should be able to see both your vLLM pods are READY and serving these LLMs using the command below. You should also see Kubernetes services explosing the IP/ port on which these models are being served. In th example below the llama3-8b model is being served via a kubernetes service with service IP of 10.108.250.109 and port 80.
+At the end of this you should be able to see both your vLLM pods are READY and serving these LLMs using the command below. You should also see Kubernetes services exposing the IP/ port on which these models are being served. In the example below the llama3-8b model is being served via a kubernetes service with service IP of 10.108.250.109 and port 80.

 ```bash
 # Verify that vLLM pods running the two LLMs are READY and serving
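Beyond checking pod and service status, one can smoke-test a model service directly from inside the cluster. A sketch (the Service name `llama-8b` and port 80 come from the example output above; the throwaway curl pod is an assumption, not part of the guide):

```bash
# Hit the llama-8b Service's OpenAI-compatible endpoint from a temporary pod;
# a healthy vLLM instance lists its served model here.
kubectl run curl-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://llama-8b.default.svc.cluster.local:80/v1/models
```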
@@ -80,30 +80,11 @@ llama-8b ClusterIP 10.108.250.109 <none>
 phi4-mini ClusterIP 10.97.252.33 <none> 80/TCP 9d
 ```

-## Step 3: Update vsr config
-
-The file deploy/kubernetes/istio/config.yaml will get used to configure vsr when it is installed in the next step. Ensure that the models in the config file match the models you are using and that the vllm_endpoints in the file match the ip/ port of the llm kubernetes services you are running. It is usually good to start with basic features of vsr such as prompt classification and model routing before experimenting with other features such as PromptGuard or ToolCalling.
-
-## Step 4: Deploy vLLM Semantic Router
-
-Deploy the semantic router service with all required components:
-
-```bash
-# Deploy semantic router using Kustomize
-kubectl apply -k deploy/kubernetes/istio/
-
-# Wait for deployment to be ready (this may take several minutes for model downloads)
-kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s
-
-# Verify deployment status
-kubectl get pods -n vllm-semantic-router-system
-```
-
-## Step 5: Install Istio Gateway, Gateway API, Inference Extension
+## Step 3: Install Istio Gateway, Gateway API, Inference Extension CRDs

 We will use a recent build of Istio for this exercise so that we have the option of also using the v1.0.0 GA version of the Gateway API Inference Extension CRDs and EPP functionality.

-Follow the procedures described in the Gateway API [Inference Extensions documentation](https://gateway-api-inference-extension.sigs.k8s.io/guides/) to deploy the 1.28 (or newer) version of Istio control plane, Istio Gateway, the Kubernetes Gateway API CRDs and the Gateway API Inference Extension v1.0.0. Do not install any of the HTTPRoute resources from that guide however, just use it to deploy the Istio gateway and CRDs. If installed correctly you should see the api CRDs for gateway api and inference extension as well as pods running for the Istio gateway and Istiod using the commands shown below.
+Follow the procedures described in the Gateway API [Inference Extensions documentation](https://gateway-api-inference-extension.sigs.k8s.io/guides/) to deploy the 1.28 (or newer) version of Istio control plane, Istio Gateway, the Kubernetes Gateway API CRDs and the Gateway API Inference Extension v1.0.0. Do not install any of the HTTPRoute resources nor the EndPointPicker from that guide however, just use it to deploy the Istio gateway and CRDs. If installed correctly you should see the api CRDs for gateway api and inference extension as well as pods running for the Istio gateway and Istiod using the commands shown below.

 ```bash
 kubectl get crds | grep gateway
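The linked Inference Extension guide covers the actual install. Purely to illustrate the shape of that step (the release URL and version pin here are assumptions, not taken from this commit):

```bash
# Illustrative only -- follow the linked guide for the authoritative steps.
# Install the Kubernetes Gateway API CRDs (version pin is an assumption),
# then verify they registered, as the guide text above says.
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml
kubectl get crds | grep gateway
```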
@@ -121,6 +102,25 @@ kubectl get pods | grep istio
 kubectl get pods -n istio-system
 ```

+## Step 4: Update vsr config
+
+The file deploy/kubernetes/istio/config.yaml will get used to configure vsr when it is installed in the next step. Ensure that the models in the config file match the models you are using and that the vllm_endpoints in the file match the ip/ port of the llm kubernetes services you are running. It is usually good to start with basic features of vsr such as prompt classification and model routing before experimenting with other features such as PromptGuard or ToolCalling.
+
+## Step 5: Deploy vLLM Semantic Router
+
+Deploy the semantic router service with all required components:
+
+```bash
+# Deploy semantic router using Kustomize
+kubectl apply -k deploy/kubernetes/istio/
+
+# Wait for deployment to be ready (this may take several minutes for model downloads)
+kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s
+
+# Verify deployment status
+kubectl get pods -n vllm-semantic-router-system
+```
+
 ## Step 6: Install additional Istio configuration

 Install the destinationrule and envoy filter needed for Istio gateway to use ExtProc based interface with vLLM Semantic router
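For orientation on Step 4 in the hunk above, here is a hypothetical sketch of the two pieces it asks you to align, the models and the vllm_endpoints (field names and layout are assumptions inferred from the prose; the authoritative schema is the config.yaml in the repo). The endpoint addresses reuse the example Service IPs from Step 2:

```yaml
# Hypothetical excerpt of deploy/kubernetes/istio/config.yaml -- field names
# are assumed from the prose, not copied from the commit.
vllm_endpoints:
  - name: llama3-8b-endpoint
    address: 10.108.250.109   # ClusterIP of the llama-8b Service shown earlier
    port: 80
  - name: phi4-mini-endpoint
    address: 10.97.252.33     # ClusterIP of the phi4-mini Service
    port: 80
```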
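Step 6's DestinationRule and EnvoyFilter are applied from files under deploy/kubernetes/istio/. As a rough sketch of what the DestinationRule half typically looks like for a plaintext ExtProc gRPC backend (names, namespace, and fields here are assumptions, not the committed manifests):

```yaml
# Assumed sketch: lets the Istio gateway reach the semantic-router ExtProc
# service over plaintext gRPC. Not the committed file.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: semantic-router
  namespace: vllm-semantic-router-system
spec:
  host: semantic-router.vllm-semantic-router-system.svc.cluster.local
  trafficPolicy:
    tls:
      mode: DISABLE   # assumption: no TLS to the ExtProc service in this exercise
```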
@@ -139,7 +139,7 @@ kubectl apply -f deploy/kubernetes/istio/httproute-llama3-8b.yaml
 kubectl apply -f deploy/kubernetes/istio/httproute-phi4-mini.yaml
 ```

-## Testing the Deployment
+## Step 8: Testing the Deployment
 To expose the IP on which the Istio gateway listens to client requests from outside the cluster, you can choose any standard kubernetes option for external load balancing. We tested our feature by [deploying and configuring metallb](https://metallb.universe.tf/installation/) into the cluster to be the LoadBalancer provider. Please refer to metallb documentation for installation procedures if needed. Finally, for the minikube case, we get the external url as shown below.

 ```bash
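The commit view truncates here, inside the final code block. For completeness, one common way such a step ends (the gateway Service name is a placeholder you would substitute, and `"model": "auto"` is only illustrative of routing through vsr):

```bash
# Assumed completion of the truncated example: read the metallb-assigned
# external IP of the Istio gateway Service, then send a test request.
GATEWAY_URL=$(kubectl get svc <istio-gateway-service> \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -s "http://${GATEWAY_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "hello"}]}'
```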
