3 changes: 2 additions & 1 deletion README.md
@@ -29,7 +29,8 @@ The following specific terms to this project:
performance, availability and capabilities to optimize routing. Includes
things like [Prefix Cache] status or [LoRA Adapters] availability.
- **Endpoint Picker(EPP)**: An implementation of an `Inference Scheduler` with additional Routing, Flow, and Request Control layers to allow for sophisticated routing strategies. Additional info on the architecture of the EPP [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal).

- **Body Based Router(BBR)**: An optional [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter) server that parses the HTTP body of an inference request and extracts information (currently the model name for OpenAI API-style messages) into a form the gateway can use for routing. Additional info [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/bbr/README.md) and in the documentation [user guides](https://gateway-api-inference-extension.sigs.k8s.io/guides/).


The following are key industry terms that are important to understand for
this project:
51 changes: 51 additions & 0 deletions config/manifests/bbr-example/httproute_bbr.yaml
@@ -0,0 +1,51 @@
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-llama-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct
    matches:
    - path:
        type: PathPrefix
        value: /
      headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: 'meta-llama/Llama-3.1-8B-Instruct'
    timeouts:
      request: 300s
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-phi4-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-phi4-mini-instruct
    matches:
    - path:
        type: PathPrefix
        value: /
      headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: 'microsoft/Phi-4-mini-instruct'
    timeouts:
      request: 300s
---
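
With BBR deployed in front of the gateway, a plain OpenAI-style request is enough to exercise these routes: BBR copies the body's `model` value into the `X-Gateway-Model-Name` header that the matches above key on. A minimal sketch, assuming the gateway's address is available as `GW_IP` and using the `/v1/completions` path (both placeholders for illustration):

```bash
# Placeholder gateway address; substitute your gateway's external IP or hostname.
export GW_IP=<gateway-address>

# No special headers are needed from the client. BBR extracts "model" from the
# body into X-Gateway-Model-Name, so this request matches llm-phi4-route and is
# sent to the vllm-phi4-mini-instruct InferencePool.
curl -i http://${GW_IP}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "microsoft/Phi-4-mini-instruct",
        "prompt": "Say hello.",
        "max_tokens": 16
      }'
```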
88 changes: 88 additions & 0 deletions config/manifests/bbr-example/vllm-phi4-mini.yaml
@@ -0,0 +1,88 @@
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: phi4-mini
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  volumeMode: Filesystem
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: phi4-mini
  namespace: default
  labels:
    app: phi4-mini
spec:
  replicas: 1
  selector:
    matchLabels:
      app: phi4-mini
  template:
    metadata:
      labels:
        app: phi4-mini
    spec:
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: phi4-mini
      containers:
      - name: phi4-mini
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve microsoft/Phi-4-mini-instruct --trust-remote-code --enable-chunked-prefill"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: cache-volume
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 600
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 600
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: phi4-mini
  namespace: default
spec:
  ports:
  - name: http-phi4-mini
    port: 80
    protocol: TCP
    targetPort: 8000
  # The label selector should match the Deployment labels; it is also useful for the prefix caching feature
  selector:
    app: phi4-mini
  sessionAffinity: None
  type: ClusterIP
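
Before wiring this Deployment into a route, it can help to check it directly. A hedged sketch using the Service above (the local port is an arbitrary choice, and the endpoints are the ones vLLM's OpenAI-compatible server exposes):

```bash
# Forward a local port to the phi4-mini ClusterIP Service (port 80 -> container port 8000).
kubectl port-forward -n default svc/phi4-mini 8080:80 &

# Liveness/readiness endpoint used by the probes above.
curl -s http://localhost:8080/health

# List the served model; should include microsoft/Phi-4-mini-instruct once the
# model has finished downloading and loading.
curl -s http://localhost:8080/v1/models
```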

4 changes: 0 additions & 4 deletions pkg/bbr/README.md
@@ -8,7 +8,3 @@ body of the HTTP request. However, most implementations do not support routing
based on the request body. This extension helps bridge that gap for clients.
This extension works by parsing the request body. If it finds a `model` parameter in the
request body, it will copy the value of that parameter into a request header.

This extension is intended to be paired with an `ext_proc` capable Gateway. There is not
a standard way to represent this kind of extension in Gateway API yet, so we recommend
referring to implementation-specific documentation for how to deploy this extension.
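
As a concrete sketch of that `model`-to-header behavior (the header name matches the example HTTPRoutes elsewhere in this change; the request shape is illustrative):

```bash
# What the client sends: an ordinary OpenAI-style body, no extra headers.
#   POST /v1/completions
#   {"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "...", "max_tokens": 16}
#
# What the gateway's routing rules see after BBR runs: the same request plus a
# header copied from the body, which standard HTTPRoute header matching can use.
#   X-Gateway-Model-Name: meta-llama/Llama-3.1-8B-Instruct
```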
6 changes: 5 additions & 1 deletion site-src/guides/index.md
@@ -101,7 +101,7 @@ Tooling:
=== "Istio"

```bash
export GATEWAY_PROVIDER=none
export GATEWAY_PROVIDER=istio
helm install vllm-llama3-8b-instruct \
--set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
--set provider.name=$GATEWAY_PROVIDER \
@@ -319,6 +319,10 @@ Tooling:
kubectl get httproute llm-route -o yaml
```

### Deploy the Body Based Router Extension (Optional)

This guide shows how to get started with serving a single base model per L7 URL path. If you also want model-aware routing, where more than one base model is served at the same L7 URL path, you need the optional Body Based Routing (BBR) extension, which is covered in the [`Serving Multiple GenAI Models`](serve-multiple-genai-models.md) section of the guide.
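
A rough sketch of what that enables, using the two example models from this change (`GW_IP` is a placeholder for the gateway address, and the path is illustrative): two requests to the same URL path, differing only in the `model` field, are routed to different InferencePools once BBR and per-model HTTPRoutes are in place.

```bash
# Same path for both requests; BBR extracts "model" from the body and the
# per-model HTTPRoutes send each request to its own InferencePool.
curl -s http://${GW_IP}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hi", "max_tokens": 8}'

curl -s http://${GW_IP}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "microsoft/Phi-4-mini-instruct", "prompt": "Hi", "max_tokens": 8}'
```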

### Deploy InferenceObjective (Optional)

Deploy the sample InferenceObjective which allows you to specify priority of requests.
Expand Down