* Renamed `examples/cloud-run/tgi-deployment`
  The rename is needed because most of the use cases / examples revolve around TGI deployments, which made the `tgi-deployment` name vague; instead, the naming now follows the same format as in `examples/vertex-ai/notebooks`.
* Add `examples/cloud-run/deploy-gemma-2-on-cloud-run/*` (WIP)
* Update `deploy-llama-3-1-on-cloud-run/README.md`
* Rename `SERVICE_ACCOUNT_NAME` to `tgi-invoker`
Co-authored-by: Wietse Venema <[email protected]>
* Add `imgs` to `deploy-gemma-2-on-cloud-run` example
* Update `deploy-gemma-2-on-cloud-run/README.md`
* Add notes on Cloud NAT and VPC network
* Update example listings
  Automatically updated via `python scripts/internal/update_example_tables.py`
* Add `current_git_branch` fn in `auto-generate-examples.py`
* Update and fix `current_git_branch` function
* Update and fix `current_git_branch` function
* Temporary patch for docs to be live with images
* Update `docs/scripts/auto-generate-examples.py`
---------
Co-authored-by: Wietse Venema <[email protected]>
README.md (+2 −1)
@@ -67,7 +67,8 @@ The [`examples`](./examples) directory contains examples for using the container
 | GKE |[examples/gke/tgi-deployment](./examples/gke/tgi-deployment)| Deploy Meta Llama 3 8B with TGI DLC on GKE |
 | GKE |[examples/gke/tgi-from-gcs-deployment](./examples/gke/tgi-from-gcs-deployment)| Deploy Qwen2 7B with TGI DLC from GCS on GKE |
 | GKE |[examples/gke/tei-deployment](./examples/gke/tei-deployment)| Deploy Snowflake's Arctic Embed with TEI DLC on GKE |
-| Cloud Run |[examples/cloud-run/tgi-deployment](./examples/cloud-run/tgi-deployment)| Deploy Meta Llama 3.1 8B with TGI DLC on Cloud Run |
+| Cloud Run |[examples/cloud-run/deploy-gemma-2-on-cloud-run](./examples/cloud-run/deploy-gemma-2-on-cloud-run)| Deploy Gemma2 9B with TGI DLC on Cloud Run |
+| Cloud Run |[examples/cloud-run/deploy-llama-3-1-on-cloud-run](./examples/cloud-run/deploy-llama-3-1-on-cloud-run)| Deploy Llama 3.1 8B with TGI DLC on Cloud Run |
docs/source/resources.mdx (+2 −1)
@@ -66,4 +66,5 @@ Learn how to use Hugging Face in Google Cloud by reading our blog posts, present
 
 - Inference
 
-- [Deploy Meta Llama 3.1 8B with TGI DLC on Cloud Run](https://github.com/huggingface/Google-Cloud-Containers/tree/main/examples/cloud-run/tgi-deployment)
+- [Deploy Gemma2 9B with TGI DLC on Cloud Run](https://github.com/huggingface/Google-Cloud-Containers/tree/main/examples/cloud-run/deploy-gemma-2-on-cloud-run)
+- [Deploy Llama 3.1 8B with TGI DLC on Cloud Run](https://github.com/huggingface/Google-Cloud-Containers/tree/main/examples/cloud-run/deploy-llama-3-1-on-cloud-run)
examples/cloud-run/deploy-llama-3-1-on-cloud-run/README.md (+6 −6)
@@ -1,13 +1,13 @@
 ---
-title: Deploy Meta Llama 3.1 8B with TGI DLC on Cloud Run
+title: Deploy Llama 3.1 8B with TGI DLC on Cloud Run
 type: inference
 ---
 
-# Deploy Meta Llama 3.1 8B with TGI DLC on Cloud Run
+# Deploy Llama 3.1 8B with TGI DLC on Cloud Run
 
-Meta Llama 3.1 is the latest open LLM from Meta, released in July 2024. Meta Llama 3.1 comes in three sizes: 8B for efficient deployment and development on consumer-size GPU, 70B for large-scale AI native applications, and 405B for synthetic data, LLM as a Judge or distillation; among other use cases. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs, with high performance text generation. Google Cloud Run is a serverless container platform that allows developers to deploy and manage containerized applications without managing infrastructure, enabling automatic scaling and billing only for usage.
+Llama 3.1 is the latest open LLM from Meta, released in July 2024. Llama 3.1 comes in three sizes: 8B for efficient deployment and development on consumer-size GPU, 70B for large-scale AI native applications, and 405B for synthetic data, LLM as a Judge or distillation; among other use cases. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs, with high performance text generation. Google Cloud Run is a serverless container platform that allows developers to deploy and manage containerized applications without managing infrastructure, enabling automatic scaling and billing only for usage.
 
-This example showcases how to deploy an LLM from the Hugging Face Hub, in this case Meta Llama 3.1 8B Instruct model quantized to INT4 using AWQ, with the Hugging Face DLC for TGI on Google Cloud Run with GPU support ([in preview](https://cloud.google.com/products#product-launch-stages)).
+This example showcases how to deploy an LLM from the Hugging Face Hub, in this case Llama 3.1 8B Instruct model quantized to INT4 using AWQ, with the Hugging Face DLC for TGI on Google Cloud Run with GPU support ([in preview](https://cloud.google.com/products#product-launch-stages)).
 
 > [!NOTE]
 > GPU support on Cloud Run is only available as a waitlisted public preview. If you're interested in trying out the feature, [request a quota increase](https://cloud.google.com/run/quotas#increase) for `Total Nvidia L4 GPU allocation, per project per region`. At the time of writing this example, NVIDIA L4 GPUs (24GiB VRAM) are the only available GPUs on Cloud Run; enabling automatic scaling up to 7 instances by default (more available via quota), as well as scaling down to zero instances when there are no requests.
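To make the deployment step the README builds toward concrete, below is a minimal, hedged sketch of deploying the TGI DLC on Cloud Run with an L4 GPU. The service name, region, container URI placeholder, model ID, and resource flags are illustrative assumptions rather than the exact values pinned in the example, and GPU support required the `gcloud beta` command surface while it was in preview.

```bash
# Hedged sketch only; consult the example's README for the exact values.
# CONTAINER_URI should point at the Hugging Face TGI DLC; the model ID below
# (an AWQ INT4 quantization of Llama 3.1 8B Instruct) is an assumption.
export SERVICE_NAME=text-generation-inference    # assumed service name
export LOCATION=us-central1                      # any region with L4 quota
export CONTAINER_URI="<huggingface-tgi-dlc-uri>" # placeholder, not a real URI

gcloud beta run deploy $SERVICE_NAME \
    --image=$CONTAINER_URI \
    --args="--model-id=hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4,--quantize=awq" \
    --port=8080 \
    --cpu=8 \
    --memory=32Gi \
    --no-cpu-throttling \
    --gpu=1 \
    --gpu-type=nvidia-l4 \
    --max-instances=3 \
    --region=$LOCATION \
    --no-allow-unauthenticated
```

`--no-allow-unauthenticated` keeps the service private, which is why the example then grants an invoker Service Account and mints an access token in the steps below.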
@@ -216,7 +216,7 @@ The recommended approach is to use a Service Account (SA), as the access can be
 - Set the `SERVICE_ACCOUNT_NAME` environment variable for convenience:
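Given the rename called out in this PR's changelog, that step presumably reduces to something like the following, where `tgi-invoker` is the value this PR standardizes on:

```bash
# Name of the Service Account that will be granted permission to invoke the
# Cloud Run service; `tgi-invoker` is taken from this PR's changelog.
export SERVICE_ACCOUNT_NAME=tgi-invoker
```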
@@ -241,7 +241,7 @@ The recommended approach is to use a Service Account (SA), as the access can be
 > [!WARNING]
-> The access token is short-lived and will expire, by default after 1 hour. If you want to extend the token lifetime beyond the default, you must create and organization policy and use the `--lifetime` argument when createing the token. Refer to (Access token lifetime)[[https://cloud.google.com/resource-manager/docs/organization-policy/restricting-service-accounts#extend_oauth_ttl]] to learn more. Otherwise, you can also generate a new token by running the same command again.
+> The access token is short-lived and will expire, by default after 1 hour. If you want to extend the token lifetime beyond the default, you must create an organization policy and use the `--lifetime` argument when creating the token. Refer to [Access token lifetime](https://cloud.google.com/resource-manager/docs/organization-policy/restricting-service-accounts#extend_oauth_ttl) to learn more. Otherwise, you can also generate a new token by running the same command again.
 
 Now you can already dive into the different alternatives for sending requests to the deployed Cloud Run Service using the `SERVICE_URL` and `ACCESS_TOKEN` as described above.
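As a rough illustration of one of those alternatives, here is a hedged sketch of calling the service directly with `curl`. It assumes `SERVICE_URL` already points at the deployed Cloud Run service, that the authenticated account (or the `tgi-invoker` Service Account) is allowed to invoke it, and that TGI's OpenAI-compatible chat completions route is exposed.

```bash
# Hedged sketch: query the deployed TGI service with a short-lived token.
# SERVICE_URL is assumed to be set already, as described in the README.
export ACCESS_TOKEN=$(gcloud auth print-access-token)

curl "$SERVICE_URL/v1/chat/completions" \
    -X POST \
    -H "Authorization: Bearer $ACCESS_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "tgi",
        "messages": [{"role": "user", "content": "What is Cloud Run?"}],
        "max_tokens": 128
    }'
```

If the token has expired, rerunning `gcloud auth print-access-token` issues a new one, as the warning above notes.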