1 change: 1 addition & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ theme:
favicon: images/favicon-64.png
features:
- content.code.annotate
- content.code.copy
- search.highlight
- navigation.tabs
- navigation.top
Expand Down
11 changes: 5 additions & 6 deletions site-src/guides/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -319,19 +319,14 @@ Tooling:
kubectl get httproute llm-route -o yaml
```

### Deploy the Body Based Router Extension (Optional)

This guide shows how to get started with serving only 1 base model type per L7 URL path. If in addition, you wish to exercise model-aware routing such that more than 1 base model is served at the same L7 url path, that requires use of the (optional) Body Based Routing (BBR) extension which is described in a following section of the guide, namely the [`Serving Multiple GenAI Models`](serve-multiple-genai-models.md) section.

### Deploy InferenceObjective (Optional)

Deploy the sample InferenceObjective, which allows you to specify the priority of requests.

```bash
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/inferenceobjective.yaml
```


### Try it out

Wait until the gateway is ready.
Expand All @@ -348,6 +343,10 @@ Tooling:
}'
```

### Deploy the Body Based Router Extension (Optional)

This guide has shown how to get started with serving a single base model type per L7 URL path. If you wish to go further and exercise model-aware routing, in which more than one base model is served at the same L7 URL path, you will need the optional Body Based Routing (BBR) extension, described in the [`Serving Multiple GenAI Models`](serve-multiple-genai-models.md) section of the documentation. To try that out, retain the setup you have implemented so far in this guide and continue with the additional steps in [that guide](serve-multiple-genai-models.md); otherwise, move on to the following section to clean up your setup.
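Conceptually, the BBR extension inspects the request body and copies the model name into a header that routes can match on. The sketch below is an illustrative simulation of that behavior, not the extension's actual code; the function name is hypothetical.

```python
import json

def inject_model_header(body_bytes: bytes, headers: dict) -> dict:
    """Simulate what BBR does: copy the `model` field from an
    OpenAI-style request body into the X-Gateway-Model-Name header
    so an HTTPRoute can route on it."""
    body = json.loads(body_bytes)
    model = body.get("model")
    if model is not None:
        # Return a new dict rather than mutating the caller's headers.
        headers = {**headers, "X-Gateway-Model-Name": model}
    return headers

# Example request body as sent to the completions endpoint.
request = {"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello"}
headers = inject_model_header(json.dumps(request).encode(), {})
print(headers["X-Gateway-Model-Name"])  # meta-llama/Llama-3.1-8B-Instruct
```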

### Cleanup

The following instructions assume you would like to cleanup ALL resources that were created in this quickstart guide.
Expand Down
16 changes: 8 additions & 8 deletions site-src/guides/serve-multiple-genai-models.md
Original file line number Diff line number Diff line change
Expand Up @@ -83,7 +83,7 @@ We also want to use an InferencePool and EndPoint Picker for this second model i
oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
```

After executing this, very that you see two InferencePools and two EPP pods, one per base model type, running without errors, using the CLIs `kubectl get inferencepools` and `kubectl get pods`.
After executing this, verify that you see two InferencePools and two EPP pods, one per base model type, running without errors, using the CLIs `kubectl get inferencepools` and `kubectl get pods`.

### Configure HTTPRoute

Expand All @@ -100,7 +100,7 @@ kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extens
```

```yaml
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
Expand All @@ -121,11 +121,12 @@ spec:
value: /
headers:
- type: Exact
# Body-Based Routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is used to copy the model name from the request body to this header.
name: X-Gateway-Model-Name # (1)!
value: 'meta-llama/Llama-3.1-8B-Instruct'
timeouts:
request: 300s
---
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
Expand All @@ -146,14 +147,15 @@ spec:
value: /
headers:
- type: Exact
name: X-Gateway-Model-Name # (2)!
# Body-Based Routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is used to copy the model name from the request body to this header.
name: X-Gateway-Model-Name
value: 'microsoft/Phi-4-mini-instruct'
timeouts:
request: 300s
---
```
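The effect of the two routes above is an exact-match lookup on the `X-Gateway-Model-Name` header. A minimal sketch of that routing decision, with illustrative (not manifest-derived) pool names:

```python
def select_pool(headers: dict):
    """Mimic the Exact header matches in the two HTTPRoutes above:
    map a model name to an InferencePool, or None if no route matches."""
    routes = {
        # Pool names here are hypothetical placeholders.
        "meta-llama/Llama-3.1-8B-Instruct": "llama-pool",
        "microsoft/Phi-4-mini-instruct": "phi4-pool",
    }
    return routes.get(headers.get("X-Gateway-Model-Name"))

print(select_pool({"X-Gateway-Model-Name": "microsoft/Phi-4-mini-instruct"}))  # phi4-pool
```

A request whose model name matches neither route falls through, which is why an exact spelling of the model name in the request body matters.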

Confirm that the HTTPRoute status conditions include `Accepted=True` and `ResolvedRefs=True` for both routes:
Before testing the setup, confirm that the HTTPRoute status conditions include `Accepted=True` and `ResolvedRefs=True` for both routes using the following commands.

```bash
kubectl get httproute llm-llama-route -o yaml
Expand All @@ -163,8 +165,6 @@ kubectl get httproute llm-llama-route -o yaml
kubectl get httproute llm-phi4-route -o yaml
```
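If you want to script the readiness check instead of reading the YAML by eye, the status conditions live under `status.parents[].conditions[]` in the Gateway API. A small sketch, assuming the route object has been loaded as a dict (e.g. from `kubectl get httproute ... -o json`):

```python
def route_ready(route: dict) -> bool:
    """Return True when the route reports Accepted=True and
    ResolvedRefs=True across its parent status conditions."""
    conditions = {
        c["type"]: c["status"]
        for parent in route.get("status", {}).get("parents", [])
        for c in parent.get("conditions", [])
    }
    return (conditions.get("Accepted") == "True"
            and conditions.get("ResolvedRefs") == "True")

sample = {"status": {"parents": [{"conditions": [
    {"type": "Accepted", "status": "True"},
    {"type": "ResolvedRefs", "status": "True"},
]}]}}
print(route_ready(sample))  # True
```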

[BBR](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header with key `X-Gateway-Model-Name`. The header can then be used in the `HTTPRoute` to route requests to different `InferencePool` instances.

## Try it out

1. Get the gateway IP:
Expand Down