---
title: Gateway API Inference Extension
weight: 800
toc: true
nd-type: how-to
nd-product: NGF
nd-docs: DOCS-0000
---

Learn how to use NGINX Gateway Fabric with the Gateway API Inference Extension to optimize traffic routing to self-hosted generative AI models on Kubernetes.

## Overview

The [Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io/) is an official Kubernetes project that provides optimized load balancing for self-hosted generative AI models on Kubernetes. The project's goal is to improve and standardize routing to inference workloads across the ecosystem.

Coupled with the provided Endpoint Picker service, NGINX Gateway Fabric becomes an [Inference Gateway](https://gateway-api-inference-extension.sigs.k8s.io/#concepts-and-definitions), with additional AI-specific traffic management features such as model-aware routing, serving priority for models, model rollouts, and more.

{{< call-out "warning" >}} The Gateway API Inference Extension is still in alpha status and should not be used in production yet. {{< /call-out >}}

## Setup

Install the Gateway API Inference Extension CRDs:

```shell
kubectl kustomize "https://github.com/nginx/nginx-gateway-fabric/config/crd/inference-extension/?ref=v{{< version-ngf >}}" | kubectl apply -f -
```
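
To confirm the CRDs were registered, you can list them and filter on the API group used later in this guide (a quick sanity check, not required):

```shell
# Lists any inference-related CRDs registered in the cluster
kubectl get crd | grep inference
```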

To enable the Gateway API Inference Extension, [install]({{< ref "/ngf/install/" >}}) NGINX Gateway Fabric with these modifications:

- Using Helm: set the `nginxGateway.gwAPIInferenceExtension.enable=true` Helm value.
- Using Kubernetes manifests: set the `--gateway-api-inference-extension` flag in the nginx-gateway container arguments, and update the ClusterRole RBAC to add the `inferencepools` resources:
```yaml
- apiGroups:
  - inference.networking.k8s.io
  resources:
  - inferencepools
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - inference.networking.k8s.io
  resources:
  - inferencepools/status
  verbs:
  - update
```
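
For the Helm path, the value can be passed at install time. As a sketch: the release name `ngf`, the namespace `nginx-gateway`, and the OCI chart reference below are assumptions chosen to match the cleanup steps at the end of this guide; adjust them for your environment:

```shell
# Sketch only: release name, namespace, and chart reference are assumptions
helm install ngf oci://ghcr.io/nginx/charts/nginx-gateway-fabric \
  --create-namespace -n nginx-gateway \
  --set nginxGateway.gwAPIInferenceExtension.enable=true
```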

See this [example manifest](https://raw.githubusercontent.com/nginx/nginx-gateway-fabric/main/deploy/inference/deploy.yaml) for a complete reference.

## Deploy a sample model server

The [vLLM simulator](https://github.com/llm-d/llm-d-inference-sim/tree/main) model server does not use GPUs and is ideal for test and development environments. This sample is configured to simulate the [meta-llama/LLama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model. To deploy the vLLM simulator, run the following command:

```shell
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/sim-deployment.yaml
```
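
You can confirm the simulator pods are running by selecting on the `app: vllm-llama3-8b-instruct` label (the label the InferencePool in the next step selects on):

```shell
# Shows the simulated model server pods and their readiness
kubectl get pods -l app=vllm-llama3-8b-instruct
```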

## Deploy the InferencePool and Endpoint Picker Extension

The InferencePool is a Gateway API Inference Extension resource that represents a set of inference-focused pods. With InferencePool, you can configure a routing extension as well as inference-specific routing optimizations. For more information on this resource, refer to the Gateway API Inference Extension [InferencePool documentation](https://gateway-api-inference-extension.sigs.k8s.io/api-types/inferencepool/).

Install an InferencePool named `vllm-llama3-8b-instruct` that selects endpoints with the label `app: vllm-llama3-8b-instruct` listening on port 8000. The Helm install command below installs both the Endpoint Picker Extension and the InferencePool.

NGINX queries the Endpoint Picker Extension to determine the appropriate pod endpoint to route traffic to. These pods are selected from the pool of ready pods designated by the InferencePool's selector field. For more information, see the [Endpoint Picker documentation](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/README.md).

{{< call-out "warning" >}} The Endpoint Picker Extension is a third-party application written and provided by the Gateway API Inference Extension project. The communication between NGINX and the Endpoint Picker Extension does not currently have TLS support, so it is an insecure connection. The Gateway API Inference Extension is still in alpha status and should not be used in production yet. NGINX Gateway Fabric is not responsible for any threats or risks associated with using this third-party Endpoint Picker Extension application. {{< /call-out >}}

```shell
export IGW_CHART_VERSION=v1.0.1
helm install vllm-llama3-8b-instruct \
  --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
  --version $IGW_CHART_VERSION \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
```

Confirm that the Endpoint Picker was deployed and is running:

```shell
kubectl describe deployment vllm-llama3-8b-instruct-epp
```
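
You can also block until the Deployment reports available replicas (the timeout value is an arbitrary choice):

```shell
# Waits up to two minutes for the Endpoint Picker Deployment to become Available
kubectl wait --for=condition=Available deployment/vllm-llama3-8b-instruct-epp --timeout=120s
```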

## Deploy an Inference Gateway

```yaml
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: nginx
  listeners:
  - name: http
    port: 80
    protocol: HTTP
EOF
```

Confirm that the Gateway was assigned an IP address and reports a `Programmed=True` status:

```shell
kubectl describe gateway inference-gateway
```

Save the public IP address and port of the NGINX Service into shell variables:

```text
GW_IP=XXX.YYY.ZZZ.III
GW_PORT=<port number>
```
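
If the NGINX Service is of type `LoadBalancer`, the variables can usually be populated from the Service status instead of by hand. A sketch: the Service name `inference-gateway-nginx` is an assumption based on a `<gateway-name>-nginx` naming convention, and your cloud provider may report a hostname rather than an IP:

```shell
# Sketch only: the Service name is an assumption; verify with `kubectl get services`
GW_IP=$(kubectl get service inference-gateway-nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
GW_PORT=$(kubectl get service inference-gateway-nginx -o jsonpath='{.spec.ports[0].port}')
```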

## Deploy an HTTPRoute

```yaml
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct
      port: 3000
    matches:
    - path:
        type: PathPrefix
        value: /
EOF
```

Confirm that the HTTPRoute status conditions include `Accepted=True` and `ResolvedRefs=True`:

```shell
kubectl describe httproute llm-route
```

## Try it out

Send traffic to the Gateway:

```shell
curl -i $GW_IP:$GW_PORT/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "food-review-1",
  "prompt": "Write as if you were a critic: San Francisco",
  "max_tokens": 100,
  "temperature": 0
}'
```

## Cleanup

Uninstall the InferencePool, InferenceObjective, and model server resources:

```shell
helm uninstall vllm-llama3-8b-instruct
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/inferenceobjective.yaml --ignore-not-found
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/cpu-deployment.yaml --ignore-not-found
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml --ignore-not-found
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/sim-deployment.yaml --ignore-not-found
```

Uninstall the Gateway API Inference Extension CRDs:

```shell
kubectl delete -k https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd --ignore-not-found
```

Uninstall the Inference Gateway and HTTPRoute:

```shell
kubectl delete gateway inference-gateway
kubectl delete httproute llm-route
```

Uninstall NGINX Gateway Fabric:

```shell
helm uninstall ngf -n nginx-gateway
```

If needed, replace `ngf` with your chosen release name.

Remove the namespace and NGINX Gateway Fabric CRDs:

```shell
kubectl delete ns nginx-gateway
kubectl delete -f https://raw.githubusercontent.com/nginx/nginx-gateway-fabric/v{{< version-ngf >}}/deploy/crds.yaml
```

Remove the Gateway API CRDs:

{{< include "/ngf/installation/uninstall-gateway-api-resources.md" >}}

## See also

- [Gateway API Inference Extension Introduction](https://gateway-api-inference-extension.sigs.k8s.io/): introductory details about the project.
- [Gateway API Inference Extension API Overview](https://gateway-api-inference-extension.sigs.k8s.io/concepts/api-overview/): an overview of the API.
- [Gateway API Inference Extension User Guides](https://gateway-api-inference-extension.sigs.k8s.io/guides/): additional use cases and guides.