diff --git a/README.md b/README.md index 51aaf2829..19e801b82 100644 --- a/README.md +++ b/README.md @@ -29,7 +29,8 @@ The following specific terms to this project: performance, availability and capabilities to optimize routing. Includes things like [Prefix Cache] status or [LoRA Adapters] availability. - **Endpoint Picker(EPP)**: An implementation of an `Inference Scheduler` with additional Routing, Flow, and Request Control layers to allow for sophisticated routing strategies. Additional info on the architecture of the EPP [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal). - +- **Body Based Router(BBR)**: An optional additional [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter) server that parses the http body of the inference prompt message and extracts information (currently the model name for OpenAI API style messages) into a format which can then be used by the gateway for routing purposes. Additional info [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/bbr/README.md) and in the documentation [user guides](https://gateway-api-inference-extension.sigs.k8s.io/guides/). + The following are key industry terms that are important to understand for this project: diff --git a/config/manifests/bbr-example/httproute_bbr.yaml b/config/manifests/bbr-example/httproute_bbr.yaml new file mode 100644 index 000000000..8702546dc --- /dev/null +++ b/config/manifests/bbr-example/httproute_bbr.yaml @@ -0,0 +1,51 @@ +--- +apiVersion: gateway.networking.k8s.io/v1 +kind: HTTPRoute +metadata: + name: llm-llama-route +spec: + parentRefs: + - group: gateway.networking.k8s.io + kind: Gateway + name: inference-gateway + rules: + - backendRefs: + - group: inference.networking.k8s.io + kind: InferencePool + name: vllm-llama3-8b-instruct + matches: + - path: + type: PathPrefix + value: / + headers: + - type: Exact + name: X-Gateway-Model-Name + value: 'meta-llama/Llama-3.1-8B-Instruct' + timeouts: + request: 300s +--- +apiVersion: gateway.networking.k8s.io/v1 +kind: HTTPRoute +metadata: + name: llm-phi4-route +spec: + parentRefs: + - group: gateway.networking.k8s.io + kind: Gateway + name: inference-gateway + rules: + - backendRefs: + - group: inference.networking.k8s.io + kind: InferencePool + name: vllm-phi4-mini-instruct + matches: + - path: + type: PathPrefix + value: / + headers: + - type: Exact + name: X-Gateway-Model-Name + value: 'microsoft/Phi-4-mini-instruct' + timeouts: + request: 300s +--- diff --git a/config/manifests/bbr-example/vllm-phi4-mini.yaml b/config/manifests/bbr-example/vllm-phi4-mini.yaml new file mode 100644 index 000000000..7f7827cb9 --- /dev/null +++ b/config/manifests/bbr-example/vllm-phi4-mini.yaml @@ -0,0 +1,88 @@ +--- +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: phi4-mini + namespace: default +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 20Gi + volumeMode: Filesystem +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: phi4-mini + namespace: default + labels: + app: phi4-mini +spec: + replicas: 1 + selector: + matchLabels: + app: phi4-mini + template: + metadata: + labels: + app: phi4-mini + spec: + volumes: + - name: cache-volume + persistentVolumeClaim: + claimName: phi4-mini + containers: + - name: phi4-mini + image: vllm/vllm-openai:latest + command: ["/bin/sh", "-c"] + args: [ + "vllm serve microsoft/Phi-4-mini-instruct --trust-remote-code 
--enable-chunked-prefill" + ] + env: + - name: HUGGING_FACE_HUB_TOKEN + valueFrom: + secretKeyRef: + name: hf-token + key: token + ports: + - containerPort: 8000 + resources: + limits: + nvidia.com/gpu: "1" + requests: + nvidia.com/gpu: "1" + volumeMounts: + - mountPath: /root/.cache/huggingface + name: cache-volume + livenessProbe: + httpGet: + path: /health + port: 8000 + initialDelaySeconds: 600 + periodSeconds: 10 + readinessProbe: + httpGet: + path: /health + port: 8000 + initialDelaySeconds: 600 + periodSeconds: 5 +--- +apiVersion: v1 +kind: Service +metadata: + name: phi4-mini + namespace: default +spec: + ports: + - name: http-phi4-mini + port: 80 + protocol: TCP + targetPort: 8000 + # The label selector should match the deployment labels & it is useful for prefix caching feature + selector: + app: phi4-mini + sessionAffinity: None + type: ClusterIP + diff --git a/pkg/bbr/README.md b/pkg/bbr/README.md index b5b6f770d..80ab38354 100644 --- a/pkg/bbr/README.md +++ b/pkg/bbr/README.md @@ -8,7 +8,3 @@ body of the HTTP request. However, most implementations do not support routing based on the request body. This extension helps bridge that gap for clients. This extension works by parsing the request body. If it finds a `model` parameter in the request body, it will copy the value of that parameter into a request header. - -This extension is intended to be paired with an `ext_proc` capable Gateway. There is not -a standard way to represent this kind of extension in Gateway API yet, so we recommend -referring to implementation-specific documentation for how to deploy this extension. diff --git a/site-src/guides/index.md b/site-src/guides/index.md index 9fe5c7a8d..e7dd8611b 100644 --- a/site-src/guides/index.md +++ b/site-src/guides/index.md @@ -101,7 +101,7 @@ Tooling: === "Istio" ```bash - export GATEWAY_PROVIDER=none + export GATEWAY_PROVIDER=istio helm install vllm-llama3-8b-instruct \ --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \ --set provider.name=$GATEWAY_PROVIDER \ @@ -319,6 +319,10 @@ Tooling: kubectl get httproute llm-route -o yaml ``` +### Deploy the Body Based Router Extension (Optional) + + This guide shows how to get started with serving only 1 base model type per L7 URL path. If in addition, you wish to exercise model-aware routing such that more than 1 base model is served at the same L7 url path, that requires use of the (optional) Body Based Routing (BBR) extension which is described in a following section of the guide, namely the [`Serving Multiple GenAI Models`](serve-multiple-genai-models.md) section. + ### Deploy InferenceObjective (Optional) Deploy the sample InferenceObjective which allows you to specify priority of requests. diff --git a/site-src/guides/serve-multiple-genai-models.md b/site-src/guides/serve-multiple-genai-models.md index 1d90767d0..0beee86fb 100644 --- a/site-src/guides/serve-multiple-genai-models.md +++ b/site-src/guides/serve-multiple-genai-models.md @@ -1,10 +1,8 @@ # Serve multiple generative AI models A company wants to deploy multiple large language models (LLMs) to a cluster to serve different workloads. -For example, they might want to deploy a Gemma3 model for a chatbot interface and a DeepSeek model for a recommendation application. -The company needs to ensure optimal serving performance for these LLMs. -By using an Inference Gateway, you can deploy these LLMs on your cluster with your chosen accelerator configuration in an `InferencePool`. 
-You can then route requests based on the model name (such as `chatbot` and `recommender`). +For example, they might want to deploy a Gemma3 model for a chatbot interface and a DeepSeek model for a recommendation application (or, as in the example in this guide, a combination of a Llama3 model and a smaller Phi4 model). You may choose to serve these two models at two different L7 URL paths, following the steps described in the [`Getting started`](index.md) guide for each model. However, you may also need to serve multiple models at the same L7 URL path and rely on parsing information such as +the model name in the LLM prompt requests, as defined in the OpenAI API format used by most models. For such model-aware routing, you can use the Body-Based Routing feature as described in this guide. ## How @@ -13,73 +11,156 @@ The model name is extracted by [Body-Based routing](https://github.com/kubernete from the request body to the header. The header is then matched to dispatch requests to different `InferencePool` (and their EPPs) instances. -### Deploy Body-Based Routing +### Example Model-Aware Routing using Body-Based Routing (BBR) -To enable body-based routing, you need to deploy the Body-Based Routing ExtProc server using Helm. Depending on your Gateway provider, you can use one of the following commands: +This guide assumes you have already set up the cluster for basic model serving as described in the [`Getting started`](index.md) guide, and covers the additional steps needed from that point onwards to deploy and exercise an example of routing across multiple models. + + +### Deploy Body-Based Routing Extension + +To enable body-based routing, you need to deploy the Body-Based Routing ExtProc server using Helm. This is a separate ExtProc server from the Endpoint Picker; when installed, it is automatically inserted at the start of the gateway's ExtProc chain, ahead of other ExtProc servers such as the EPP. + +First install this server. Depending on your Gateway provider, you can use one of the following commands: === "GKE" ```bash helm install body-based-router \ - --set provider.name=gke \ - --version v0.5.1 \ - oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing + --set provider.name=gke \ + --version v1.0.0 \ + oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing ``` === "Istio" ```bash helm install body-based-router \ - --set provider.name=istio \ - --version v0.5.1 \ - oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing + --set provider.name=istio \ + --version v1.0.0 \ + oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing ``` === "Other" ```bash helm install body-based-router \ - --version v0.5.1 \ - oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing + --version v1.0.0 \ + oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing + ``` + +Once this is installed, verify that the BBR pod is running without errors using the command `kubectl get pods`.
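For reference, the sketch below illustrates the transformation BBR performs, together with an optional log check. The header mapping follows the BBR documentation; the Deployment name is an assumption based on the `body-based-router` Helm release used above, so adjust it if your release name differs.

```bash
# Illustration: for an OpenAI-style request body such as
#   {"model": "microsoft/Phi-4-mini-instruct", "messages": [...]}
# BBR copies the model name into a request header before HTTPRoute matching runs:
#   X-Gateway-Model-Name: microsoft/Phi-4-mini-instruct

# Optional: check the BBR logs for startup or parsing errors
# (assumes the Deployment is named after the "body-based-router" Helm release).
kubectl logs deployment/body-based-router --tail=20
```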
### Serving a Second Base Model +Next, deploy the second base model, which will be served from the same L7 path (`/`) as the `meta-llama/Llama-3.1-8B-Instruct` model already being served after following the steps from the [`Getting started`](index.md) guide. In this example the 2nd model is `microsoft/Phi-4-mini-instruct`, a relatively small model (about 3B parameters) from Hugging Face. Note that for this exercise, at least 2 GPUs need to be available on the system, one for each of the two models being served. Serve the second model via the following command. + +```bash +kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/refs/heads/main/config/manifests/bbr-example/vllm-phi4-mini.yaml +``` +Once this is applied, and after allowing for model download and startup time, which can take several minutes, verify that the pod for this 2nd LLM (phi4-mini) is running without errors using the command `kubectl get pods`. + +### Deploy the 2nd InferencePool and Endpoint Picker Extension +We also want an InferencePool and Endpoint Picker for this second model, in addition to the Body Based Router, so that we can schedule across multiple endpoints or LoRA adapters within each base model. Create these for the second model as follows. + +=== "GKE" + + ```bash + export GATEWAY_PROVIDER=gke + helm install vllm-phi4-mini-instruct \ + --set inferencePool.modelServers.matchLabels.app=phi4-mini \ + --set provider.name=$GATEWAY_PROVIDER \ + --version v1.0.0 \ + oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool + ``` + +=== "Istio" + + ```bash + export GATEWAY_PROVIDER=istio + helm install vllm-phi4-mini-instruct \ + --set inferencePool.modelServers.matchLabels.app=phi4-mini \ + --set provider.name=$GATEWAY_PROVIDER \ + --version v1.0.0 \ + oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool ``` +After executing this, verify that you see two InferencePools and two EPP pods, one per base model type, running without errors, using the commands `kubectl get inferencepools` and `kubectl get pods`.
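To follow the second model's startup, or to double-check the per-model serving stacks, the optional checks below are a sketch that relies only on names defined earlier in this guide (the `app=phi4-mini` label and `phi4-mini` Deployment from the manifest above, and the `app=vllm-llama3-8b-instruct` label assumed to be carried by the model server Deployment from the getting-started setup):

```bash
# Follow the vLLM startup logs for the Phi-4 deployment
# (model download and load can take several minutes).
kubectl logs deployment/phi4-mini -f

# Model server pods for each base model.
kubectl get pods -l app=phi4-mini
kubectl get pods -l app=vllm-llama3-8b-instruct

# One InferencePool per base model.
kubectl get inferencepools
```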
### Configure HTTPRoute -This example illustrates a conceptual example regarding how to use the `HTTPRoute` object to route based on model name like “chatbot” or “recommender” to `InferencePool`. +Before configuring the HTTPRoutes for the models, delete the prior HTTPRoute created for the vllm-llama3-8b-instruct model, because the routing will now also match on the model name, as carried in the `X-Gateway-Model-Name` HTTP header that the BBR extension inserts after parsing the model name from the body of the LLM request. + +```bash +kubectl delete httproute llm-route +``` + +Now configure the new HTTPRoutes, one per model served via BBR, using the following command, which configures both routes. Also examine the manifest to see how `X-Gateway-Model-Name` is used as a header match in the Gateway's rules to route requests to the correct backend based on the model name. For convenience, the manifest is also listed below. + +```bash +kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/bbr-example/httproute_bbr.yaml +``` ```yaml +--- apiVersion: gateway.networking.k8s.io/v1 kind: HTTPRoute metadata: - name: routes-to-llms + name: llm-llama-route spec: parentRefs: - - name: inference-gateway + - group: gateway.networking.k8s.io + kind: Gateway + name: inference-gateway rules: - - matches: - - headers: - - type: Exact - name: X-Gateway-Model-Name # (1)! - value: chatbot - path: + - backendRefs: + - group: inference.networking.k8s.io + kind: InferencePool + name: vllm-llama3-8b-instruct + matches: + - path: type: PathPrefix value: / - backendRefs: - - name: gemma3 - group: inference.networking.x-k8s.io + headers: + - type: Exact + name: X-Gateway-Model-Name # (1)! + value: 'meta-llama/Llama-3.1-8B-Instruct' + timeouts: + request: 300s +--- +apiVersion: gateway.networking.k8s.io/v1 +kind: HTTPRoute +metadata: + name: llm-phi4-route +spec: + parentRefs: + - group: gateway.networking.k8s.io + kind: Gateway + name: inference-gateway + rules: + - backendRefs: + - group: inference.networking.k8s.io kind: InferencePool - - matches: - - headers: - - type: Exact - name: X-Gateway-Model-Name # (2)! - value: recommender - path: + name: vllm-phi4-mini-instruct + matches: + - path: type: PathPrefix value: / - backendRefs: - - name: deepseek-r1 - group: inference.networking.x-k8s.io - kind: InferencePool + headers: + - type: Exact + name: X-Gateway-Model-Name # (2)! + value: 'microsoft/Phi-4-mini-instruct' + timeouts: + request: 300s +--- +``` + +Confirm that the HTTPRoute status conditions include `Accepted=True` and `ResolvedRefs=True` for both routes: + +```bash +kubectl get httproute llm-llama-route -o yaml +``` + +```bash +kubectl get httproute llm-phi4-route -o yaml ``` 1. [BBR](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header with key `X-Gateway-Model-Name`. The header can then be used in the `HTTPRoute` to route requests to different `InferencePool` instances.
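As a more compact alternative to scanning the full YAML from the two `kubectl get httproute ... -o yaml` commands above, the sketch below prints each route's condition types and statuses (including `Accepted` and `ResolvedRefs`); the field paths follow the standard Gateway API HTTPRoute status:

```bash
# Print the status conditions reported by the gateway for both routes.
for route in llm-llama-route llm-phi4-route; do
  echo -n "${route}: "
  kubectl get httproute "${route}" \
    -o jsonpath='{range .status.parents[0].conditions[*]}{.type}{"="}{.status}{" "}{end}'
  echo ""
done
```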
@@ -88,58 +169,59 @@ spec: ## Try it out 1. Get the gateway IP: -```bash -IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=80 -``` + ```bash + IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=80 + ``` === "Chat Completions API" - 1. Send a few requests to model `chatbot` as follows: - ```bash - curl -X POST -i ${IP}:${PORT}/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "chatbot", - "messages": [{"role": "user", "content": "What is the color of the sky?"}], - "max_tokens": 100, - "temperature": 0 - }' - ``` - - 2. Send a few requests to model `recommender` as follows: - ```bash - curl -X POST -i ${IP}:${PORT}/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "recommender", - "messages": [{"role": "user", "content": "Give me restaurant recommendations in Paris"}], - "max_tokens": 100, - "temperature": 0 - }' - ``` + 1. Send a few requests to the Llama model as follows: + ```bash + curl -X POST -i ${IP}:${PORT}/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "meta-llama/Llama-3.1-8B-Instruct", + "messages": [{"role": "user", "content": "Linux is said to be an open source kernel because "}], + "max_tokens": 100, + "temperature": 0 + }' + ``` + + 2. Send a few requests to the Phi4 model as follows: + ```bash + curl -X POST -i ${IP}:${PORT}/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "microsoft/Phi-4-mini-instruct", + "messages": [{"role": "user", "content": "2+2 is "}], + "max_tokens": 20, + "temperature": 0 + }' + ``` === "Completions API" - 1. Send a few requests to model `chatbot` as follows: - ```bash - curl -X POST -i ${IP}:${PORT}/v1/completions \ - -H 'Content-Type: application/json' \ - -d '{ - "model": "chatbot", - "prompt": "What is the color of the sky", - "max_tokens": 100, - "temperature": 0 - }' - ``` - - 2. Send a few requests to model `recommender` as follows: - ```bash - curl -X POST -i ${IP}:${PORT}/v1/completions \ - -H 'Content-Type: application/json' \ - -d '{ - "model": "recommender", - "prompt": "Give me restaurant recommendations in Paris", - "max_tokens": 100, - "temperature": 0 - }' - ``` + 1. Send a few requests to the Llama model as follows: + ```bash + curl -X POST -i ${IP}:${PORT}/v1/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "meta-llama/Llama-3.1-8B-Instruct", + "prompt": "Linux is said to be an open source kernel because ", + "max_tokens": 100, + "temperature": 0 + }' + ``` + + 2. Send a few requests to the Phi4 model as follows: + ```bash + curl -X POST -i ${IP}:${PORT}/v1/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "microsoft/Phi-4-mini-instruct", + "prompt": "2+2 is ", + "max_tokens": 20, + "temperature": 0 + }' + ``` + diff --git a/site-src/index.md b/site-src/index.md index cf1ddb32a..0fbb338f8 100644 --- a/site-src/index.md +++ b/site-src/index.md @@ -25,6 +25,7 @@ The following specific terms to this project: performance, availability and capabilities to optimize routing. Includes things like [Prefix Cache](https://docs.vllm.ai/en/stable/design/v1/prefix_caching.html) status or [LoRA Adapters](https://docs.vllm.ai/en/stable/features/lora.html) availability. - **Endpoint Picker(EPP)**: An implementation of an `Inference Scheduler` with additional Routing, Flow, and Request Control layers to allow for sophisticated routing strategies. Additional info on the architecture of the EPP [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal). +- **Body Based Router(BBR)**: An additional (and optional) extension that extracts information from the body of the inference request (currently the model name attribute of an OpenAI API request), which the gateway can then use for model-aware functions such as routing and scheduling. It may be used along with the EPP to combine model picking with endpoint picking. [Inference Gateway]:#concepts-and-definitions
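As an optional negative test of the routing configured in the guide above, you can send a request whose `model` value does not match either HTTPRoute header match. Since no rule matches, the gateway is expected to reject the request (typically with a 404, though the exact response can vary by implementation). The sketch below reuses the `IP` and `PORT` variables from the `Try it out` section, and the model name is deliberately fictitious.

```bash
# No HTTPRoute rule matches this model name, so the request should be rejected.
curl -X POST -i ${IP}:${PORT}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "some-model-that-is-not-served",
    "prompt": "this request should not match any route",
    "max_tokens": 10
  }'
```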