Commit 0895ffb

Update the doc wording, use the official llm-d container image (vllm-project#631)
Signed-off-by: Sanjeev Rampal <[email protected]>
1 parent aca5e40 commit 0895ffb

3 files changed: +26 −27 lines

deploy/kubernetes/llmd-base/README.md

Lines changed: 22 additions & 23 deletions
@@ -1,18 +1,18 @@
 # vLLM Semantic Router with LLM-D

-This guide provides step-by-step instructions for deploying the vLLM Semantic Router (vsr) in combination with [LLM-D](https://github.com/llm-d/llm-d). This will also illustrate a key design pattern namely use of the vsr as a model picker in combination with the use of LLM-D as endpoint picker.
+This guide provides step-by-step instructions for deploying the vLLM Semantic Router (vSR) in combination with [LLM-D](https://github.com/llm-d/llm-d) and a single Inference Gateway. It also illustrates a key design pattern, namely the use of vSR as an automatic model picker in combination with LLM-D as an endpoint picker.

-A model picker provides the ability to route an LLM query to one of multiple LLM models that are entirely different from each other, whereas an endpoint picker selects one of multiple endpoints that each serve an equivalent model (and most often the exact same base model). Hence this deployment shows how vLLM Semantic Router in its role as a model picker is perfectly complementary to endpoint picker solutions such as LLM-D.
+A model picker routes an LLM query to one of multiple LLM models that are entirely different from each other, whereas an endpoint picker selects one of multiple endpoints that each serve the same base model in a scale-out deployment to achieve higher performance. Hence this deployment shows how vSR (vLLM Semantic Router), in its role as a model picker driven by semantic prompt analysis, is perfectly complementary to endpoint picker solutions such as LLM-D. The combined solution enables optimized model serving with N separate base model types that have M endpoints each, while relieving the end user or LLM client of the burden of both model selection and endpoint selection.

-Since LLM-D has a number of deployment configurations some of which require a larger hardware setup we will demonstrate a baseline version of LLM-D working in combination with vsr to introduce the core concepts. These same core concepts will also apply when using vsr with more complex LLM-D configurations and production grade well-lit paths as described in the LLM-D repo at [this link](https://github.com/llm-d/llm-d/tree/main/guides).
+Since LLM-D has a number of deployment configurations, some of which require a larger hardware setup, we will demonstrate a baseline version of LLM-D working in combination with vSR to introduce the core concepts. These same core concepts also apply when using vSR with more complex LLM-D configurations and production-grade well-lit paths as described in the LLM-D repo at [this link](https://github.com/llm-d/llm-d/tree/main/guides).

-Also we will use LLM-D with Istio as the Inference Gateway in order to build on the steps and hardware setup from the [Istio deployment example](../istio/README.md) documented in this repo. Istio is also commonly used as the default gateway for LLM-D with or without vsr.
+Also, we will use LLM-D with Istio as the Inference Gateway in order to build on the steps and hardware setup from the [Istio deployment example](../istio/README.md) already documented in this repo. Istio is also commonly used as the default gateway for LLM-D, with or without vSR.

 ## Architecture Overview

 The deployment consists of:

-- **vLLM Semantic Router**: Provides intelligent request routing and processing decisions to Envoy based Gateways
+- **vLLM Semantic Router (vSR)**: Provides intelligent request routing and processing decisions to Envoy-based gateways
 - **LLM-D**: Distributed Inference platform used for scale-out LLM inferencing with SOTA performance.
 - **Istio Gateway**: Istio's implementation of Kubernetes Gateway API that uses an Envoy proxy under the covers
 - **Gateway API Inference Extension**: Additional APIs to extend the Gateway API for Inference via ExtProc servers
@@ -103,7 +103,7 @@ LLM-D (and Kubernetes IGW) use an API resource called InferencePool along with a

 Deploy the provided manifests in order to create the InferencePools and LLM-D inference schedulers corresponding to the 2 base models used in this exercise.

-In order to show a full combination of model picking and endpoint picking, one would normally need at least 2 inferencepools with at least 2 endpoints per pool. Since that would require 4 instances of vllm serving pods and 4 GPUs in our exercise, that would require a more complex hardware setup. This guide deploys 1 model endpoint per each of the two InferencePools in order to show the core design of vsr's model picking working with and complementing LLM-D scheduler's endpoint picking.
+In order to show a full combination of model picking and endpoint picking, one would normally need at least 2 InferencePools with at least 2 endpoints per pool. Since that would require 4 instances of vllm serving pods and 4 GPUs, our exercise would need a more complex hardware setup. This guide instead deploys 1 model endpoint in each of the two InferencePools in order to show the core design of vSR's model picking working with and complementing the LLM-D scheduler's endpoint picking, while requiring a simpler hardware setup.

 ```bash
 # Create the LLM-D scheduler and InferencePool for the Llama3-8b model
@@ -129,9 +129,9 @@ kubectl apply -f deploy/kubernetes/llmd-base/dest-rule-epp-llama.yaml
 kubectl apply -f deploy/kubernetes/llmd-base/dest-rule-epp-phi4.yaml
 ```

-## Step 6: Update vsr config
+## Step 6: Update vSR config

-Since this guide is based on using the same backend models as in the [Istio guide](../istio/README.md), we will reuse the same vsr config as from that guide and hence you do not need to update the file deploy/kubernetes/istio/config.yaml. If you were using different backend models as part of the LLM-D deployment, you would need to update this file.
+Since this guide uses the same backend models as the [Istio guide](../istio/README.md), we will reuse the same vSR config as that guide, and hence you do not need to update the file deploy/kubernetes/istio/config.yaml. If you were using different backend models as part of the LLM-D deployment, you would need to update this file.

 ## Step 7: Deploy vLLM Semantic Router

@@ -148,7 +148,7 @@ kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semant
 kubectl get pods -n vllm-semantic-router-system
 ```

-## Step 6: Additional Istio configuration for the VSR connection
+## Step 8: Additional Istio configuration for the vSR connection

 Install the DestinationRule and EnvoyFilter needed for the Istio gateway to use the ExtProc-based interface with the vLLM Semantic Router.

@@ -157,33 +157,33 @@ kubectl apply -f deploy/kubernetes/istio/destinationrule.yaml
 kubectl apply -f deploy/kubernetes/istio/envoyfilter.yaml
 ```

-## Step 7: Install gateway routes
+## Step 9: Install gateway routes

-Install HTTPRoutes in the Istio gateway. Note a difference here compared to the http routes used in the prior vsr + istio guide, here the backendRefs in the route matches based on point to the InferencePools which in turn point to the LLM-D schedulers for those pools instead of the backendRefs pointing to the vllm service endpoints of the models as was done in the [istio guide without llm-d](../istio/README.md).
+Install HTTPRoutes in the Istio gateway. Note a difference here compared to the HTTP routes used in the prior vSR + Istio guide: here the backendRefs in the routes point to the InferencePools, which in turn point to the LLM-D schedulers for those pools, instead of pointing to the vllm service endpoints of the models as was done in the [istio guide without llm-d](../istio/README.md).

 ```bash
 kubectl apply -f deploy/kubernetes/llmd-base/httproute-llama-pool.yaml
 kubectl apply -f deploy/kubernetes/llmd-base/httproute-phi4-pool.yaml
 ```
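For reference, the key difference relative to the plain Istio guide is the shape of the backendRef in these routes: it names an InferencePool rather than a Kubernetes Service. The snippet below is a minimal sketch of that shape only, assuming illustrative resource names (llama3-8b-route, inference-gateway, llama3-8b), a placeholder match rule, and the pre-GA API group; see the actual deploy/kubernetes/llmd-base/httproute-llama-pool.yaml in this repo for the authoritative manifest.

```yaml
# Illustrative sketch only; not taken from this commit.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama3-8b-route              # hypothetical name
  namespace: default
spec:
  parentRefs:
  - name: inference-gateway          # the Istio-managed Gateway from the Istio guide
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /                     # placeholder; the real route may match on model-specific criteria
    backendRefs:
    - group: inference.networking.x-k8s.io   # InferencePool API group (may differ by GIE version)
      kind: InferencePool
      name: llama3-8b                # hypothetical pool name
```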

-## Step 8: Testing the Deployment
+## Step 10: Testing the Deployment

 To expose the IP on which the Istio gateway listens to client requests from outside the cluster, you can choose any standard Kubernetes option for external load balancing. We tested this feature by [deploying and configuring metallb](https://metallb.universe.tf/installation/) in the cluster as the LoadBalancer provider. Please refer to the metallb documentation for installation procedures if needed. Finally, for the minikube case, we get the external URL as shown below.

 ```bash
 minikube service inference-gateway-istio --url
 http://192.168.49.2:32293
 ```

-Now we can send LLM prompts via curl to http://192.168.49.2:32293 to access the Istio gateway which will then use information from vLLM semantic router to dynamically route to one of the two LLMs we are using as backends in this case. Use the port number that you get as output from your "minikube service" command in the curl examples below.
+Now we can send LLM prompts via curl to http://192.168.49.2:32293 to access the Istio gateway, which will then use information from the vLLM Semantic Router to dynamically route to one of the two LLMs we are using as backends in this case. Use the address and port that you get as output from your "minikube service" command when you try the curl examples below.

 ### Send Test Requests

-Try the following cases with and without model "auto" selection to confirm that Istio + vsr together are able to route queries to the appropriate model. The query responses will include information about which model was used to serve that request.
+Try the following cases, with and without model "auto" selection, to confirm that Istio + vSR + llm-d together are able to route queries to the appropriate model by combining model picking and endpoint picking. The query responses will include information about which model was used to serve that request.

 Example queries to try include the following:

 ```bash
-# Model name llama3-8b provided explicitly, no model alteration, send to llama EPP for endpoint picking
+# Model name llama3-8b provided explicitly, no model alteration, routed via llama EPP for endpoint picking
 curl http://192.168.49.2:32293/v1/chat/completions -H "Content-Type: application/json" -d '{
   "model": "llama3-8b",
   "messages": [
@@ -207,7 +207,7 @@ curl http://192.168.49.2:32293/v1/chat/completions -H "Content-Type: applicati
 ```

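The guide also mentions exercising model "auto" selection, which this diff does not show. Assuming "auto" is passed as the model name (as referenced in the text above), a request of roughly the following shape lets vSR pick the model semantically before the matching LLM-D EPP picks an endpoint; the prompt is illustrative and the address/port should come from your own "minikube service" output.

```bash
# Model name "auto": vSR selects llama3-8b or phi4-mini based on the prompt content,
# then the LLM-D EPP for that pool performs endpoint picking.
# Illustrative example; replace the address/port with your own gateway URL.
curl http://192.168.49.2:32293/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "auto",
  "messages": [
    {"role": "user", "content": "Explain the difference between a stack and a queue."}
  ]
}'
```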
 ```bash
-# Model name phi4-mini provided explicitly, no model alteration, send to phi4-mini EPP for endpoint picking
+# Model name phi4-mini provided explicitly, no model alteration, routed via phi4-mini EPP for endpoint picking
 curl http://192.168.49.2:32293/v1/chat/completions -H "Content-Type: application/json" -d '{
   "model": "phi4-mini",
   "messages": [
@@ -237,13 +237,12 @@ curl http://192.168.49.2:32293/v1/chat/completions -H "Content-Type: applicati
 If you have followed the above steps, you should see pods similar to the ones below in READY state as a quick initial validation. These include the LLM model pods, the Istio gateway pod, the LLM-D/EPP scheduler pods, the vSR pod and the istiod controller pod. You should also see the InferencePool and HTTPRoute instances with status showing the routes in resolved state.

 ```bash
-$ kubectl get pods -n default
-NAME                                           READY   STATUS    RESTARTS   AGE
-inference-gateway-istio-6fc8864bfb-gbcz8       1/1     Running   0          14h
-llama-8b-6558848cc8-wkkxn                      1/1     Running   0          3h26m
-phi4-mini-7b94bc69db-rnpkj                     1/1     Running   0          17h
-vllm-llama3-8b-instruct-epp-7f7ff88677-j7lst   1/1     Running   0          134m
-vllm-phi4-mini-epp-6f5dd6bbb9-8pv27            1/1     Running   0          14h
+NAME                                                   READY   STATUS    RESTARTS   AGE
+inference-gateway-istio-6fc8864bfb-gbcz8               1/1     Running   0          30h
+llama-8b-6558848cc8-wkkxn                              1/1     Running   0          19h
+llm-d-inference-scheduler-llama3-8b-74854dcdf6-2kvfq   1/1     Running   0          16m
+llm-d-inference-scheduler-phi4-mini-65f7d4d4db-ql7qv   1/1     Running   0          16m
+phi4-mini-7b94bc69db-rnpkj                             1/1     Running   0          33h
 ```

 ```bash

deploy/kubernetes/llmd-base/inferencepool-llama.yaml

Lines changed: 2 additions & 2 deletions
@@ -38,7 +38,7 @@ metadata:
 apiVersion: apps/v1
 kind: Deployment
 metadata:
-  name: vllm-llama3-8b-instruct-epp
+  name: llm-d-inference-scheduler-llama3-8b
   namespace: default
   labels:
     app: vllm-llama3-8b-instruct-epp
@@ -56,7 +56,7 @@ spec:
       terminationGracePeriodSeconds: 130
       containers:
       - name: epp
-        image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:main
+        image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.3.2
         imagePullPolicy: Always
         args:
         - --pool-name
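This hunk only swaps the endpoint picker (EPP) Deployment's name and container image over to the official llm-d inference scheduler; the InferencePool resource defined in the same file is untouched by the commit and therefore not shown. For orientation, such a pool typically has roughly the shape sketched below. The field names follow the v1alpha2 Gateway API Inference Extension schema, and the name, selector label, port, and extensionRef values are assumptions based on the names used in this guide, so consult the full inferencepool-llama.yaml for the authoritative definition.

```yaml
# Illustrative sketch only; not part of this commit.
apiVersion: inference.networking.x-k8s.io/v1alpha2   # group/version may differ by GIE release
kind: InferencePool
metadata:
  name: llama3-8b                          # hypothetical pool name
  namespace: default
spec:
  targetPortNumber: 8000                   # port the vLLM serving pods listen on (assumed)
  selector:
    app: llama-8b                          # label selecting the llama-8b vLLM pods (assumed)
  extensionRef:
    name: llm-d-inference-scheduler-llama3-8b   # Service fronting the EPP Deployment above (assumed)
```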

deploy/kubernetes/llmd-base/inferencepool-phi4.yaml

Lines changed: 2 additions & 2 deletions
@@ -38,7 +38,7 @@ metadata:
 apiVersion: apps/v1
 kind: Deployment
 metadata:
-  name: vllm-phi4-mini-epp
+  name: llm-d-inference-scheduler-phi4-mini
   namespace: default
   labels:
     app: vllm-phi4-mini-epp
@@ -56,7 +56,7 @@ spec:
       terminationGracePeriodSeconds: 130
       containers:
       - name: epp
-        image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:main
+        image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.3.2
         imagePullPolicy: Always
         args:
         - --pool-name
