
Commit 2483647

First draft
1 parent 241aae9 commit 2483647

1 file changed: +7 −6 lines


docs/source/en/tei_cloud_run.md

Lines changed: 7 additions & 6 deletions
@@ -25,10 +25,10 @@ On Google Cloud, there are 3 main options for deploying TEI (or any other Docker
 
 This guide explains how to deploy TEI on Cloud Run, a fully managed service by Google. Cloud Run is a so-called serverless offering: Google handles the server infrastructure, and you only need to provide a Docker container. The benefit is that you only pay for compute when there is demand for your application. Cloud Run automatically spins up servers when there is demand, and scales down to zero when there is none.
 
-We will showcase how to deploy the model with or without a GPU.
+We will showcase how to deploy any text embedding model with or without a GPU.
 
 > [!NOTE]
-> GPU support on Cloud Run was just made generally available. If you're interested in using it, [request a quota increase](https://cloud.google.com/run/quotas#increase) for `Total Nvidia L4 GPU allocation, per project per region`. At the time of writing this example, NVIDIA L4 GPUs (24GiB VRAM) are the only available GPUs on Cloud Run; enabling automatic scaling up to 7 instances by default (more available via quota), as well as scaling down to zero instances when there are no requests.
+> At the time of writing, GPU support on Cloud Run is generally available in 4 regions. If you're interested in using it, [request a quota increase](https://cloud.google.com/run/quotas#increase) for `Total Nvidia L4 GPU allocation, per project per region`. So far, NVIDIA L4 GPUs (24GiB VRAM) are the only GPUs available on Cloud Run, with automatic scaling up to 7 instances by default (more available via quota) and scaling down to zero instances when there are no requests.
 
 ## Setup / Configuration

@@ -92,11 +92,12 @@ The command needs you to specify the following parameters:
 Finally, you can run the `gcloud run deploy` command to deploy TEI on Cloud Run as:
 
 ```bash
-export SERVICE_NAME="text-embedding-server" # or choose another name for your service
+export SERVICE_NAME="text-embedding-server" # choose a name for your service
+export MODEL_ID="ibm-granite/granite-embedding-278m-multilingual" # choose any embedding model
 
 gcloud run deploy $SERVICE_NAME \
 --image=$CONTAINER_URI \
---args="--model-id=ibm-granite/granite-embedding-278m-multilingual" \
+--args="--model-id=$MODEL_ID,--max-concurrent-requests=64" \
 --set-env-vars=HF_HUB_ENABLE_HF_TRANSFER=1 \
 --port=8080 \
 --cpu=8 \
@@ -110,7 +111,7 @@ If you want to deploy with a GPU, run the following command:
 ```bash
 gcloud run deploy $SERVICE_NAME \
 --image=$CONTAINER_URI \
---args="--model-id=ibm-granite/granite-embedding-278m-multilingual,--max-concurrent-requests=64" \
+--args="--model-id=$MODEL_ID,--max-concurrent-requests=64" \
 --set-env-vars=HF_HUB_ENABLE_HF_TRANSFER=1 \
 --port=8080 \
 --cpu=8 \
@@ -129,7 +130,7 @@ Or as it follows if you created the Cloud NAT:
 ```bash
 gcloud beta run deploy $SERVICE_NAME \
 --image=$CONTAINER_URI \
---args="--model-id=ibm-granite/granite-embedding-278m-multilingual" \
+--args="--model-id=$MODEL_ID,--max-concurrent-requests=64" \
 --set-env-vars=HF_HUB_ENABLE_HF_TRANSFER=1 \
 --port=8080 \
 --cpu=8 \
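Once deployed, the service exposes TEI's `/embed` route, which accepts a JSON body with an `inputs` field and returns one embedding vector per input text. A minimal client sketch (an assumption for illustration, not part of the commit): the service URL is the one printed by `gcloud run deploy`, and the bearer token would come from `gcloud auth print-identity-token` when the service requires authentication.

```python
import json
import urllib.request

def build_embed_request(service_url, token, texts):
    """Build an authenticated POST request for TEI's /embed route."""
    return urllib.request.Request(
        f"{service_url}/embed",
        data=json.dumps({"inputs": texts}).encode("utf-8"),
        headers={
            # Token is a placeholder: obtain one via `gcloud auth print-identity-token`.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

def embed(service_url, token, texts):
    """Send the request; TEI returns one embedding vector per input text."""
    with urllib.request.urlopen(build_embed_request(service_url, token, texts)) as resp:
        return json.loads(resp.read())
```

The request/send split keeps the payload construction testable without a live service; swapping `urllib` for any HTTP client is straightforward.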
