# [WIP] GKE Inference Gateway example #2448
Open: SinaChavoshi wants to merge 5 commits into `terraform-google-modules:main` from `SinaChavoshi:inference_gateway`.
## Commits

- `71b2638` copy of gke standard cluster (SinaChavoshi)
- `8ebbc56` update to ga release (SinaChavoshi)
- `dc362ed` fix curl command in readme (SinaChavoshi)
- `50b791f` Merge branch 'main' into inference_gateway (SinaChavoshi)
- `895772c` Merge branch 'main' into inference_gateway (apeabody)
The pull request adds the following README for the new example:
# GKE Inference Gateway Example

This example provisions a GKE Standard cluster and a node pool with H100 GPUs, suitable for deploying and serving Large Language Models (LLMs) using the GKE Inference Gateway.

The cluster is configured with:

- GKE Gateway API enabled.
- Managed Prometheus for monitoring.
- DCGM for GPU monitoring.
- A dedicated node pool with NVIDIA H100 80GB GPUs.

This Terraform script automates the deployment of all necessary Kubernetes resources, including:

- Authorization for metrics scraping.
- A vLLM model server for a Llama3.1 model.
- GKE Inference Gateway CRDs.
- GKE Inference Gateway resources (`InferencePool`, `InferenceObjective`, `Gateway`, `HTTPRoute`).
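Once the apply in the Usage steps below has completed, the deployed objects can be inspected with `kubectl`. The plural resource names in this sketch are inferred from the kinds listed above and may vary with the installed CRD release, so treat it as a hedged example rather than part of the module:

```bash
# Sketch: list Inference Gateway CRDs and resources after `terraform apply`.
# Plural resource names are assumptions inferred from the kinds above.
kubectl get crds | grep -i inference
kubectl get gateways,httproutes
kubectl get inferencepools,inferenceobjectives
```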
## Usage

1. **Enable APIs**

    ```bash
    gcloud services enable container.googleapis.com
    ```
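    To confirm the API is active before proceeding, a quick optional check (the `config.name` filter field is an assumption about the `gcloud services list` output schema):

    ```bash
    # Optional check: prints the service if container.googleapis.com is enabled.
    gcloud services list --enabled --filter="config.name=container.googleapis.com"
    ```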
2. **Set up your environment**

    You will need to set the following environment variables. You may also need to create a `terraform.tfvars` file to provide values for the variables in `variables.tf`.

    ```bash
    export PROJECT_ID="your-project-id"
    export REGION="us-central1"
    export CLUSTER_NAME="inference-gateway-cluster"
    export HF_TOKEN="your-hugging-face-token"
    ```
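    From these, a `terraform.tfvars` can be generated. The variable names below (`project_id`, `region`, `cluster_name`, `hf_token`) are assumptions about what `variables.tf` declares, so adjust them to match the actual definitions:

    ```bash
    # Hedged sketch: write terraform.tfvars from the environment.
    # Variable names are assumed; confirm them against variables.tf.
    cat > terraform.tfvars <<EOF
    project_id   = "${PROJECT_ID}"
    region       = "${REGION}"
    cluster_name = "${CLUSTER_NAME}"
    hf_token     = "${HF_TOKEN}"
    EOF
    ```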
3. **Run Terraform**

    The `terraform apply` command will provision the GKE cluster and deploy all the necessary Kubernetes resources.

    ```bash
    terraform init
    terraform apply
    ```
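    If you prefer not to keep secrets in `terraform.tfvars`, Terraform also reads variables from `TF_VAR_`-prefixed environment variables. This sketch again assumes the root module declares variables named `project_id` and `hf_token`; confirm against `variables.tf`:

    ```bash
    # Alternative to terraform.tfvars: TF_VAR_<name> environment variables.
    # Variable names here are assumptions; check variables.tf.
    export TF_VAR_project_id="${PROJECT_ID}"
    export TF_VAR_hf_token="${HF_TOKEN}"
    terraform apply
    ```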
4. **Configure kubectl**

    After the apply is complete, configure `kubectl` to communicate with your new cluster.

    ```bash
    gcloud container clusters get-credentials $(terraform output -raw cluster_name) --region $(terraform output -raw location)
    ```
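    Before sending traffic, it can help to confirm the GPU node pool and the model server pods are up. These are generic checks rather than commands taken from this example:

    ```bash
    # Confirm nodes (including the GPU pool) are Ready and pods are running.
    kubectl get nodes
    kubectl get pods -o wide
    ```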
5. **Send an inference request**

    Get the Gateway IP address:

    ```bash
    IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
    PORT=80
    ```
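    The address may take a few minutes to be assigned after the apply finishes, so a small polling loop (a hedged convenience, not part of the example) avoids capturing an empty value:

    ```bash
    # Poll until the Gateway reports an address; jsonpath is empty until then.
    until IP=$(kubectl get gateway/inference-gateway \
      -o jsonpath='{.status.addresses[0].value}') && [ -n "${IP}" ]; do
      echo "Waiting for Gateway address..."
      sleep 10
    done
    echo "Gateway IP: ${IP}"
    ```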
    Send a request:

    ```bash
    curl -i -X POST http://${IP}:${PORT}/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "food-review",
        "prompt": "What is a good recipe for a chicken curry?",
        "max_tokens": 100,
        "temperature": "0.7"
      }'
    ```
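    If `jq` is available, the generated text can be extracted directly. This assumes the OpenAI-style response shape (`choices[0].text`) that vLLM returns for `/v1/completions`:

    ```bash
    # Extract only the completion text from the JSON response.
    curl -s -X POST http://${IP}:${PORT}/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "food-review", "prompt": "What is a good recipe for a chicken curry?", "max_tokens": 100}' \
      | jq -r '.choices[0].text'
    ```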
## Cleanup

Running `terraform destroy` will deprovision the GKE cluster and all associated Kubernetes resources.

```bash
terraform destroy
```
## Review comments

The `temperature` parameter in the JSON payload is specified as a string (`"0.7"`). While some servers might be lenient, the OpenAI API specification (which vLLM aims to be compatible with) expects this to be a number. It's better to provide it as a numeric value for correctness and broader compatibility.
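Per this suggestion, the request from the README would pass `temperature` as a JSON number:

```bash
# Same request as in the README, with temperature as a number per the review.
curl -i -X POST http://${IP}:${PORT}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "food-review",
    "prompt": "What is a good recipe for a chicken curry?",
    "max_tokens": 100,
    "temperature": 0.7
  }'
```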