
Commit 183c9c8

Add Gemma2 9B on Cloud Run example (#113)
* Rename `examples/cloud-run/tgi-deployment`: required since most of the use cases / examples revolve around TGI deployment, so that name was vague; instead, follow the same naming format as in `examples/vertex-ai/notebooks`
* Add `examples/cloud-run/deploy-gemma-2-on-cloud-run/*` (WIP)
* Update `deploy-llama-3-1-on-cloud-run/README.md`
* Rename `SERVICE_ACCOUNT_NAME` to `tgi-invoker` (co-authored by Wietse Venema <[email protected]>)
* Add `imgs` to the `deploy-gemma-2-on-cloud-run` example
* Update `deploy-gemma-2-on-cloud-run/README.md`
* Add notes on Cloud NAT and VPC network
* Update example listings (automatically updated via `python scripts/internal/update_example_tables.py`)
* Add `current_git_branch` fn in `auto-generate-examples.py`
* Update and fix the `current_git_branch` function
* Temporary patch for docs to be live with images
* Update `docs/scripts/auto-generate-examples.py`

Co-authored-by: Wietse Venema <[email protected]>
1 parent: 631b103 · commit: 183c9c8

File tree: 9 files changed (+401 −11 lines)

README.md

Lines changed: 2 additions & 1 deletion

@@ -67,7 +67,8 @@ The [`examples`](./examples) directory contains examples for using the container
 | GKE | [examples/gke/tgi-deployment](./examples/gke/tgi-deployment) | Deploy Meta Llama 3 8B with TGI DLC on GKE |
 | GKE | [examples/gke/tgi-from-gcs-deployment](./examples/gke/tgi-from-gcs-deployment) | Deploy Qwen2 7B with TGI DLC from GCS on GKE |
 | GKE | [examples/gke/tei-deployment](./examples/gke/tei-deployment) | Deploy Snowflake's Arctic Embed with TEI DLC on GKE |
-| Cloud Run | [examples/cloud-run/tgi-deployment](./examples/cloud-run/tgi-deployment) | Deploy Meta Llama 3.1 8B with TGI DLC on Cloud Run |
+| Cloud Run | [examples/cloud-run/deploy-gemma-2-on-cloud-run](./examples/cloud-run/deploy-gemma-2-on-cloud-run) | Deploy Gemma2 9B with TGI DLC on Cloud Run |
+| Cloud Run | [examples/cloud-run/deploy-llama-3-1-on-cloud-run](./examples/cloud-run/deploy-llama-3-1-on-cloud-run) | Deploy Llama 3.1 8B with TGI DLC on Cloud Run |
 
 ### Evaluation
 

docs/source/resources.mdx

Lines changed: 2 additions & 1 deletion

@@ -66,4 +66,5 @@ Learn how to use Hugging Face in Google Cloud by reading our blog posts, present
 
 - Inference
 
-  - [Deploy Meta Llama 3.1 8B with TGI DLC on Cloud Run](https://github.com/huggingface/Google-Cloud-Containers/tree/main/examples/cloud-run/tgi-deployment)
+  - [Deploy Gemma2 9B with TGI DLC on Cloud Run](https://github.com/huggingface/Google-Cloud-Containers/tree/main/examples/cloud-run/deploy-gemma-2-on-cloud-run)
+  - [Deploy Llama 3.1 8B with TGI DLC on Cloud Run](https://github.com/huggingface/Google-Cloud-Containers/tree/main/examples/cloud-run/deploy-llama-3-1-on-cloud-run)

examples/cloud-run/README.md

Lines changed: 9 additions & 3 deletions

@@ -7,6 +7,12 @@ This directory contains usage examples of the Hugging Face Deep Learning Contain
 
 ## Inference Examples
 
-| Example                            | Title                                           |
-| ---------------------------------- | ----------------------------------------------- |
-| [tgi-deployment](./tgi-deployment) | Deploy Meta Llama 3.1 with TGI DLC on Cloud Run |
+| Example                                                          | Title                                         |
+| ---------------------------------------------------------------- | --------------------------------------------- |
+| [deploy-gemma-2-on-cloud-run](./deploy-gemma-2-on-cloud-run)     | Deploy Gemma2 9B with TGI DLC on Cloud Run    |
+| [deploy-llama-3-1-on-cloud-run](./deploy-llama-3-1-on-cloud-run) | Deploy Llama 3.1 8B with TGI DLC on Cloud Run |
+
+## Training Examples
+
+Coming soon!

examples/cloud-run/deploy-gemma-2-on-cloud-run/README.md

Lines changed: 382 additions & 0 deletions (large diff not rendered by default)

examples/cloud-run/deploy-gemma-2-on-cloud-run/imgs: two new image files (545 KB and 1.07 MB)

examples/cloud-run/tgi-deployment/README.md renamed to examples/cloud-run/deploy-llama-3-1-on-cloud-run/README.md

Lines changed: 6 additions & 6 deletions

@@ -1,13 +1,13 @@
 ---
-title: Deploy Meta Llama 3.1 8B with TGI DLC on Cloud Run
+title: Deploy Llama 3.1 8B with TGI DLC on Cloud Run
 type: inference
 ---
 
-# Deploy Meta Llama 3.1 8B with TGI DLC on Cloud Run
+# Deploy Llama 3.1 8B with TGI DLC on Cloud Run
 
-Meta Llama 3.1 is the latest open LLM from Meta, released in July 2024. Meta Llama 3.1 comes in three sizes: 8B for efficient deployment and development on consumer-size GPU, 70B for large-scale AI native applications, and 405B for synthetic data, LLM as a Judge or distillation; among other use cases. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs, with high performance text generation. Google Cloud Run is a serverless container platform that allows developers to deploy and manage containerized applications without managing infrastructure, enabling automatic scaling and billing only for usage.
+Llama 3.1 is the latest open LLM from Meta, released in July 2024. Llama 3.1 comes in three sizes: 8B for efficient deployment and development on consumer-size GPUs, 70B for large-scale AI-native applications, and 405B for synthetic data, LLM-as-a-Judge, or distillation, among other use cases. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs with high-performance text generation. Google Cloud Run is a serverless container platform that lets developers deploy and manage containerized applications without managing infrastructure, enabling automatic scaling and billing only for usage.
 
-This example showcases how to deploy an LLM from the Hugging Face Hub, in this case Meta Llama 3.1 8B Instruct model quantized to INT4 using AWQ, with the Hugging Face DLC for TGI on Google Cloud Run with GPU support ([in preview](https://cloud.google.com/products#product-launch-stages)).
+This example showcases how to deploy an LLM from the Hugging Face Hub, in this case the Llama 3.1 8B Instruct model quantized to INT4 using AWQ, with the Hugging Face DLC for TGI on Google Cloud Run with GPU support ([in preview](https://cloud.google.com/products#product-launch-stages)).
 
 > [!NOTE]
 > GPU support on Cloud Run is only available as a waitlisted public preview. If you're interested in trying out the feature, [request a quota increase](https://cloud.google.com/run/quotas#increase) for `Total Nvidia L4 GPU allocation, per project per region`. At the time of writing this example, NVIDIA L4 GPUs (24GiB VRAM) are the only GPUs available on Cloud Run; they allow automatic scaling up to 7 instances by default (more available via quota), as well as scaling down to zero instances when there are no requests.
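
The deploy step itself sits further down in the README and is not part of this diff. As a rough sketch of what deploying the TGI DLC on Cloud Run with an attached L4 GPU looks like; the service name, image tag, region, and model ID below are placeholder assumptions, not the exact values from the example:

```bash
# Sketch only: service name, image tag, region, and model ID are assumptions.
gcloud beta run deploy tgi-cloud-run \
  --image=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310 \
  --args="--model-id=hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4,--quantize=awq" \
  --port=8080 \
  --cpu=8 \
  --memory=32Gi \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --no-cpu-throttling \
  --region=us-central1 \
  --no-allow-unauthenticated
```

GPU services on Cloud Run require CPU to be always allocated, which is what `--no-cpu-throttling` requests; `--no-allow-unauthenticated` keeps the service private, which is why the Service Account setup below is needed.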
@@ -216,7 +216,7 @@ The recommended approach is to use a Service Account (SA), as the access can be
 - Set the `SERVICE_ACCOUNT_NAME` environment variable for convenience:
 
   ```bash
-  export SERVICE_ACCOUNT_NAME=text-generation-inference-invoker
+  export SERVICE_ACCOUNT_NAME=tgi-invoker
   ```
 
 - Create the Service Account:
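
The creation command itself falls outside this hunk. A minimal sketch, assuming the standard `gcloud iam` flow plus a `roles/run.invoker` binding on the deployed service (`SERVICE_NAME`, `PROJECT_ID`, and `REGION` are assumed to be set already):

```bash
# Sketch: SERVICE_NAME, PROJECT_ID, and REGION are assumptions, not values
# taken from this diff.
gcloud iam service-accounts create $SERVICE_ACCOUNT_NAME

# Allow the Service Account to invoke the deployed Cloud Run service.
gcloud run services add-iam-policy-binding $SERVICE_NAME \
  --member="serviceAccount:$SERVICE_ACCOUNT_NAME@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/run.invoker" \
  --region=$REGION
```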
@@ -241,7 +241,7 @@ The recommended approach is to use a Service Account (SA), as the access can be
 ```
 
 > [!WARNING]
-> The access token is short-lived and will expire, by default after 1 hour. If you want to extend the token lifetime beyond the default, you must create and organization policy and use the `--lifetime` argument when createing the token. Refer to (Access token lifetime)[[https://cloud.google.com/resource-manager/docs/organization-policy/restricting-service-accounts#extend_oauth_ttl]] to learn more. Otherwise, you can also generate a new token by running the same command again.
+> The access token is short-lived and will expire, by default after 1 hour. If you want to extend the token lifetime beyond the default, you must create an organization policy and use the `--lifetime` argument when creating the token. Refer to [Access token lifetime](https://cloud.google.com/resource-manager/docs/organization-policy/restricting-service-accounts#extend_oauth_ttl) to learn more. Otherwise, you can also generate a new token by running the same command again.
 
 Now you can dive into the different alternatives for sending requests to the deployed Cloud Run Service using the `SERVICE_URL` and `ACCESS_TOKEN` as described above.
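
With both values in hand, a request can be sent with plain `curl`; the sketch below assumes `SERVICE_URL` and `ACCESS_TOKEN` are exported and that the container serves TGI's OpenAI-compatible Messages API at `/v1/chat/completions`:

```bash
# Sketch: assumes SERVICE_URL and ACCESS_TOKEN are already exported. The
# "model" field is required by the API schema but ignored by TGI.
curl "$SERVICE_URL/v1/chat/completions" \
  -X POST \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "What is Cloud Run?"}],
    "max_tokens": 128
  }'
```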
