Merged
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -13,6 +13,7 @@ theme:
favicon: images/favicon-64.png
features:
- content.code.annotate
- content.code.copy
- search.highlight
- navigation.tabs
- navigation.top
11 changes: 5 additions & 6 deletions site-src/guides/index.md
@@ -319,19 +319,14 @@ Tooling:
kubectl get httproute llm-route -o yaml
```

### Deploy the Body Based Router Extension (Optional)

This guide shows how to get started with serving only one base model type per L7 URL path. If you also wish to exercise model-aware routing, where more than one base model is served at the same L7 URL path, you will need the optional Body Based Routing (BBR) extension, described in a following section of the guide, namely [`Serving Multiple GenAI Models`](serve-multiple-genai-models.md).

### Deploy InferenceObjective (Optional)

Deploy the sample InferenceObjective, which allows you to specify the priority of requests.

```bash
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/inferenceobjective.yaml
```
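For reference, the manifest applied above defines an `InferenceObjective` of roughly the following shape. The field names and values shown here are illustrative assumptions, not a copy of the manifest; consult the manifest itself for the authoritative definition:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2   # illustrative API version
kind: InferenceObjective
metadata:
  name: example-objective            # illustrative name
spec:
  priority: 10                       # higher values indicate higher request priority
  poolRef:
    name: vllm-llama3-8b-instruct    # illustrative InferencePool reference
```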


### Try it out

Wait until the gateway is ready.
@@ -348,6 +343,10 @@ Tooling:
}'
```

### Deploy the Body Based Router Extension (Optional)

This guide has shown how to get started with serving a single base model type per L7 URL path. If you wish to continue on to model-aware routing, where more than one base model is served at the same L7 URL path, you will need the optional Body Based Routing (BBR) extension, described in the [`Serving Multiple GenAI Models`](serve-multiple-genai-models.md) section of the documentation. To try it, retain the setup you have deployed so far and continue with the additional steps in that guide; otherwise, move on to the following section to clean up your setup.
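In essence, BBR inspects the JSON request body and copies the model name into a header that routes can match on. The following Python sketch is illustrative only, not the extension's actual code, and the function name is hypothetical:

```python
import json

def add_model_header(body: bytes, headers: dict) -> dict:
    """Copy the "model" field from a JSON request body into the
    X-Gateway-Model-Name header, mimicking what BBR does for real
    traffic. Illustrative sketch only."""
    payload = json.loads(body)
    model = payload.get("model")
    if model is not None:
        # Return a new dict rather than mutating the caller's headers.
        headers = {**headers, "X-Gateway-Model-Name": model}
    return headers

# An OpenAI-style completion request body:
body = b'{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "hi"}'
print(add_model_header(body, {}))
# → {'X-Gateway-Model-Name': 'meta-llama/Llama-3.1-8B-Instruct'}
```

The resulting header is what the `HTTPRoute` resources in the multi-model guide match on to pick an `InferencePool`.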

### Cleanup

The following instructions assume you would like to cleanup ALL resources that were created in this quickstart guide.
16 changes: 8 additions & 8 deletions site-src/guides/serve-multiple-genai-models.md
@@ -83,7 +83,7 @@ We also want to use an InferencePool and EndPoint Picker for this second model i
oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
```

After executing this, verify that you see two InferencePools and two EPP pods, one per base model type, running without errors, using `kubectl get inferencepools` and `kubectl get pods`.

### Configure HTTPRoute

Expand All @@ -100,7 +100,7 @@ kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extens
```

```yaml
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
@@ -121,11 +121,12 @@ spec:
value: /
headers:
- type: Exact
# Body-Based routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is used to copy the model name from the request body to the header.
name: X-Gateway-Model-Name # (1)!
value: 'meta-llama/Llama-3.1-8B-Instruct'
timeouts:
request: 300s
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
@@ -146,14 +147,15 @@ spec:
value: /
headers:
- type: Exact
# Body-Based routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is used to copy the model name from the request body to the header.
name: X-Gateway-Model-Name
value: 'microsoft/Phi-4-mini-instruct'
timeouts:
request: 300s
---
```

Before testing the setup, confirm that the HTTPRoute status conditions include `Accepted=True` and `ResolvedRefs=True` for both routes:

```bash
kubectl get httproute llm-llama-route -o yaml
@@ -163,8 +165,6 @@
kubectl get httproute llm-phi4-route -o yaml
```
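In that output, look under `status.parents[].conditions`: each parent should report both condition types with `status: "True"`. A small Python sketch of that check, run against a hypothetical status fragment rather than a live cluster (the function name is an assumption):

```python
def route_is_ready(status: dict) -> bool:
    """Return True if every parent in an HTTPRoute status reports
    Accepted=True and ResolvedRefs=True. The dict mirrors the shape of
    `kubectl get httproute <name> -o json` output (illustrative only)."""
    parents = status.get("parents", [])
    for parent in parents:
        conds = {c["type"]: c["status"] for c in parent.get("conditions", [])}
        if conds.get("Accepted") != "True" or conds.get("ResolvedRefs") != "True":
            return False
    # A route with no parents has not been attached to any Gateway yet.
    return bool(parents)

# Hypothetical fragment of a healthy HTTPRoute status:
status = {"parents": [{"conditions": [
    {"type": "Accepted", "status": "True"},
    {"type": "ResolvedRefs", "status": "True"},
]}]}
print(route_is_ready(status))
# → True
```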

[BBR](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header with key `X-Gateway-Model-Name`. The header can then be used in the `HTTPRoute` to route requests to different `InferencePool` instances.

## Try it out

1. Get the gateway IP: