The LLM-D related changes are in deploy/kubernetes/llm-d.
Additionally, added some docs cleanup to the Istio guide committed previously.
Signed-off-by: Sanjeev Rampal <[email protected]>
In this exercise we deploy two LLMs: a llama3-8b model (meta-llama/Llama-3.1-8B-Instruct) and a phi4-mini model (microsoft/Phi-4-mini-instruct). We serve these models using two separate instances of the [vLLM inference server](https://docs.vllm.ai/en/latest/) running in the default namespace of the Kubernetes cluster. You may choose any other inference engine as long as it exposes OpenAI API endpoints. First install a secret for your HuggingFace token (previously stored in the env variable HF_TOKEN) and then deploy the models as shown below. Note that the file paths used in the example kubectl commands in this guide are expected to be executed from the top folder of this repo.
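As an illustration of the secret step, creating the token secret typically looks like the sketch below; the secret name and key used here are placeholders and should match whatever names the vLLM deployment manifests in this repo reference.

```bash
# Create a Kubernetes Secret holding the HuggingFace token from the HF_TOKEN env variable.
# "hf-token-secret" and the "token" key are placeholders; use the names expected by the vLLM manifests.
kubectl create secret generic hf-token-secret \
  --from-literal=token="${HF_TOKEN}" \
  -n default
```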
At the end of this step you should be able to see that both of your vLLM pods are READY and serving these LLMs using the commands below. You should also see Kubernetes Services exposing the IP/port on which these models are being served. In the example below the llama3-8b model is being served via a Kubernetes Service with a service IP of 10.108.250.109 and port 80.

```bash
# Verify that vLLM pods running the two LLMs are READY and serving
kubectl get pods

# Verify the Kubernetes Services exposing the model endpoints
kubectl get svc
```
We will use a recent build of Istio for this exercise so that we have the option of also using the v1.0.0 GA version of the Gateway API Inference Extension CRDs and EPP functionality.
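For illustration only, a recent Istio build can be installed with istioctl as sketched below; the exact version, profile, and flags should follow the Inference Extension guide referenced below, which is authoritative.

```bash
# Download a recent Istio release and install the default control plane profile.
# Versions and flags here are illustrative; defer to the Gateway API Inference Extension guide.
curl -L https://istio.io/downloadIstio | sh -
cd istio-*/
export PATH="$PWD/bin:$PATH"
istioctl install -y
```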
Follow the procedures described in the Gateway API [Inference Extensions documentation](https://gateway-api-inference-extension.sigs.k8s.io/guides/) to deploy the 1.28 (or newer) version of the Istio control plane, the Istio Gateway, the Kubernetes Gateway API CRDs, and the Gateway API Inference Extension v1.0.0. Do not install any of the HTTPRoute resources or the EndpointPicker (EPP) from that guide, however; just use it to deploy the Istio gateway and CRDs. If installed correctly, you should see the CRDs for the Gateway API and Inference Extension, as well as running pods for the Istio gateway and Istiod, using the commands shown below.

```bash
kubectl get crds | grep gateway
kubectl get pods | grep istio
kubectl get pods -n istio-system
```
## Step 4: Update vsr config
The file deploy/kubernetes/istio/config.yaml will be used to configure vsr when it is installed in the next step. Ensure that the models in the config file match the models you are using and that the vllm_endpoints in the file match the IP/port of the LLM Kubernetes Services you are running. It is usually good to start with basic features of vsr such as prompt classification and model routing before experimenting with other features such as PromptGuard or ToolCalling.
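One way to cross-check the config against the running Services is sketched below; it assumes the `vllm_endpoints` key mentioned above and the default namespace used earlier.

```bash
# List the model-serving Services to get their ClusterIP/port values.
kubectl get svc -n default

# Inspect the endpoint entries in the vsr config and compare them against the Service IPs/ports.
grep -n -A 5 "vllm_endpoints" deploy/kubernetes/istio/config.yaml
```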
## Step 5: Deploy vLLM Semantic Router
Deploy the semantic router service with all required components:
```bash
# Deploy semantic router using Kustomize
kubectl apply -k deploy/kubernetes/istio/

# Wait for deployment to be ready (this may take several minutes for model downloads)
```
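One way to wait for readiness is `kubectl wait`; the deployment name below is an assumption, so check `kubectl get deploy` for the name actually created by the kustomization.

```bash
# "semantic-router" is a placeholder deployment name; substitute the one created by the kustomize overlay.
kubectl wait --for=condition=Available deployment/semantic-router --timeout=15m
kubectl get pods
```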
To expose the IP on which the Istio gateway listens to client requests from outside the cluster, you can choose any standard Kubernetes option for external load balancing. We tested our feature by [deploying and configuring MetalLB](https://metallb.universe.tf/installation/) into the cluster to be the LoadBalancer provider. Please refer to the MetalLB documentation for installation procedures if needed. Finally, for the minikube case, we get the external URL as shown below.
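With MetalLB providing the LoadBalancer, the gateway's external address can typically be read from its Service; the Service name and namespace below are placeholders, so adjust them to match the gateway created earlier.

```bash
# Placeholder Service name/namespace for the Istio gateway; find the real ones with "kubectl get svc -A".
GATEWAY_SVC=inference-gateway-istio
GATEWAY_NS=default
GATEWAY_IP=$(kubectl get svc "${GATEWAY_SVC}" -n "${GATEWAY_NS}" \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Gateway URL: http://${GATEWAY_IP}:80"
```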