README.md (22 additions, 23 deletions)
@@ -116,7 +116,7 @@ Below are some examples of the currently supported models:
model=Qwen/Qwen3-Embedding-0.6B
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

- docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
+ docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id $model
```

And then you can make requests like
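The request example itself lies outside this hunk. For reference, a minimal call against the container started above might look like the following sketch; the `/embed` route appears in the next hunk header, while the JSON payload shape (`{"inputs": ...}`) is assumed rather than taken from this diff.

```shell
# Sketch only: embed a single sentence with the server launched above.
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": "What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```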
@@ -130,7 +130,7 @@ curl 127.0.0.1:8080/embed \
**Note:** To use GPUs, you need to install
the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).
- NVIDIA drivers on your machine need to be compatible with CUDA version 12.6 or higher.
+ NVIDIA drivers on your machine need to be compatible with CUDA version 12.2 or higher.

To see all options to serve your models:
@@ -214,7 +214,7 @@ Options:
[default: 32]

--auto-truncate
- Automatically truncate inputs that are longer than the maximum supported size
+ Control automatic truncation of inputs that exceed the model's maximum supported size. Defaults to `true` (truncation enabled). Set to `false` to disable truncation; when disabled and the model's maximum input length exceeds `--max-batch-tokens`, the server will refuse to start with an error instead of silently truncating sequences.

Unused for gRPC servers
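For illustration, opting out of truncation at launch might look like the sketch below. This assumes the option now accepts an explicit boolean value (`--auto-truncate false`), which the updated help text implies but the diff does not show.

```shell
# Sketch only: disable automatic truncation (flag syntax assumed from the new help text).
# With truncation off, a model whose maximum input length exceeds --max-batch-tokens
# makes the server refuse to start instead of silently truncating sequences.
docker run --gpus all -p 8080:80 -v $volume:/data --pull always \
    ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 \
    --model-id $model --auto-truncate false
```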
@@ -335,17 +335,17 @@ Options:
Text Embeddings Inference ships with multiple Docker images that you can use to target a specific backend:
docs/source/en/cli_arguments.md (1 addition, 1 deletion)
@@ -98,7 +98,7 @@ Options:
[default: 32]

--auto-truncate
- Automatically truncate inputs that are longer than the maximum supported size
+ Control automatic truncation of inputs that exceed the model's maximum supported size. Defaults to `true` (truncation enabled). Set to `false` to disable truncation; when disabled and the model's maximum input length exceeds `--max-batch-tokens`, the server will refuse to start with an error instead of silently truncating sequences.
docs/source/en/quick_tour.md (5 additions, 5 deletions)
@@ -24,7 +24,7 @@ The easiest way to get started with TEI is to use one of the official Docker con
Hence one needs to install Docker following their [installation instructions](https://docs.docker.com/get-docker/).

TEI supports inference both on GPU and CPU. If you plan on using a GPU, make sure to check that your hardware is supported by checking [this table](https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#docker-images).
- Next, install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). NVIDIA drivers on your device need to be compatible with CUDA version 12.6 or higher.
+ Next, install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher.

## Deploy
@@ -34,7 +34,7 @@ Next it's time to deploy your model. Let's say you want to use [`Qwen/Qwen3-Embe
model=Qwen/Qwen3-Embedding-0.6B
volume=$PWD/data

- docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
+ docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id $model
```

<Tip>
@@ -120,7 +120,7 @@ Let's say you want to use [`BAAI/bge-reranker-large`](https://huggingface.co/BAA
model=BAAI/bge-reranker-large
volume=$PWD/data

- docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
+ docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id $model
```

Once you have deployed a model, you can use the `rerank` endpoint to rank the similarity between a query and a list of texts. With `cURL` this can be done like so:
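The cURL example is outside this hunk. A minimal sketch of a rerank request, assuming a `query`/`texts` JSON payload for the `rerank` route, could look like this:

```shell
# Sketch only: rank two candidate texts against a query (payload shape assumed, not shown in this diff).
curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query": "What is Deep Learning?", "texts": ["Deep learning is a subset of machine learning.", "Cheese is made from milk."]}' \
    -H 'Content-Type: application/json'
```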
@@ -140,7 +140,7 @@ You can also use classic Sequence Classification models like [`SamLowe/roberta-b
model=SamLowe/roberta-base-go_emotions
volume=$PWD/data

- docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
+ docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id $model
```

Once you have deployed the model you can use the `predict` endpoint to get the emotions most associated with an input:
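The example request is likewise outside this hunk. A minimal sketch, assuming an `inputs` JSON payload for the `predict` route, could look like this:

```shell
# Sketch only: classify a single input (payload shape assumed, not shown in this diff).
curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs": "I like you. I love you."}' \
    -H 'Content-Type: application/json'
```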