
Commit 5699247

Update version to 1.9.0 (#830)

1 parent c78895c

File tree

9 files changed: +52 −53 lines changed

Cargo.lock

Lines changed: 8 additions & 8 deletions
(Generated file; diff not rendered.)

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ default-members = [
 resolver = "2"

 [workspace.package]
-version = "1.8.3"
+version = "1.9.0"
 edition = "2021"
 authors = ["Olivier Dehaene", "Nicolas Patry", "Alvaro Bartolome"]
 homepage = "https://github.com/huggingface/text-embeddings-inference"

README.md

Lines changed: 22 additions & 23 deletions
@@ -116,7 +116,7 @@ Below are some examples of the currently supported models:
 model=Qwen/Qwen3-Embedding-0.6B
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id $model
 ```

 And then you can make requests like
@@ -130,7 +130,7 @@ curl 127.0.0.1:8080/embed \

 **Note:** To use GPUs, you need to install
 the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).
-NVIDIA drivers on your machine need to be compatible with CUDA version 12.6 or higher.
+NVIDIA drivers on your machine need to be compatible with CUDA version 12.2 or higher.

 To see all options to serve your models:
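The hunk header above shows only the first line of the `/embed` request; as a hedged reconstruction from TEI's documentation, the full call looks like:

```shell
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```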
@@ -214,7 +214,7 @@ Options:
           [default: 32]

       --auto-truncate
-          Automatically truncate inputs that are longer than the maximum supported size
+          Control automatic truncation of inputs that exceed the model's maximum supported size. Defaults to `true` (truncation enabled). Set to `false` to disable truncation; when disabled and the model's maximum input length exceeds `--max-batch-tokens`, the server will refuse to start with an error instead of silently truncating sequences.

           Unused for gRPC servers
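Given the clarified default, a deployment that prefers hard errors over silent truncation can pass the flag explicitly. A minimal sketch, assuming the boolean-valued syntax the new help text implies:

```shell
model=Qwen/Qwen3-Embedding-0.6B
volume=$PWD/data

# Over-length inputs now fail with an error instead of being truncated
docker run --gpus all -p 8080:80 -v $volume:/data --pull always \
  ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 \
  --model-id $model --auto-truncate false
```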
@@ -335,17 +335,17 @@ Options:

 Text Embeddings Inference ships with multiple Docker images that you can use to target a specific backend:

-| Architecture                            | Image                                                                          |
-|-----------------------------------------|--------------------------------------------------------------------------------|
-| CPU                                     | ghcr.io/huggingface/text-embeddings-inference:cpu-1.8                          |
-| Volta                                   | NOT SUPPORTED                                                                  |
-| Turing (T4, RTX 2000 series, ...)       | ghcr.io/huggingface/text-embeddings-inference:turing-1.8 (experimental)        |
-| Ampere 8.0 (A100, A30)                  | ghcr.io/huggingface/text-embeddings-inference:1.8                              |
-| Ampere 8.6 (A10, A40, ...)              | ghcr.io/huggingface/text-embeddings-inference:86-1.8                           |
-| Ada Lovelace (RTX 4000 series, ...)     | ghcr.io/huggingface/text-embeddings-inference:89-1.8                           |
-| Hopper (H100)                           | ghcr.io/huggingface/text-embeddings-inference:hopper-1.8                       |
-| Blackwell 10.0 (B200, GB200, ...)       | ghcr.io/huggingface/text-embeddings-inference:100-sha-ac69b50 (experimental)   |
-| Blackwell 12.0 (GeForce RTX 50X0, ...)  | ghcr.io/huggingface/text-embeddings-inference:120-sha-ac69b50 (experimental)   |
+| Architecture                            | Image                                                                     |
+|-----------------------------------------|---------------------------------------------------------------------------|
+| CPU                                     | ghcr.io/huggingface/text-embeddings-inference:cpu-1.9                     |
+| Volta                                   | NOT SUPPORTED                                                             |
+| Turing (T4, RTX 2000 series, ...)       | ghcr.io/huggingface/text-embeddings-inference:turing-1.9 (experimental)   |
+| Ampere 8.0 (A100, A30)                  | ghcr.io/huggingface/text-embeddings-inference:1.9                         |
+| Ampere 8.6 (A10, A40, ...)              | ghcr.io/huggingface/text-embeddings-inference:86-1.9                      |
+| Ada Lovelace (RTX 4000 series, ...)     | ghcr.io/huggingface/text-embeddings-inference:89-1.9                      |
+| Hopper (H100)                           | ghcr.io/huggingface/text-embeddings-inference:hopper-1.9                  |
+| Blackwell 10.0 (B200, GB200, ...)       | ghcr.io/huggingface/text-embeddings-inference:100-1.9 (experimental)      |
+| Blackwell 12.0 (GeForce RTX 50X0, ...)  | ghcr.io/huggingface/text-embeddings-inference:120-1.9 (experimental)      |

 **Warning**: Flash Attention is turned off by default for the Turing image as it suffers from precision issues.
 You can turn Flash Attention v1 ON by using the `USE_FLASH_ATTENTION=True` environment variable.
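To match a GPU to a row of this table before pulling, you can query its compute capability directly; the `compute_cap` field is assumed available in your driver's `nvidia-smi` (it is present in recent releases):

```shell
# Prints e.g. "NVIDIA A100-SXM4-80GB, 8.0" -> the Ampere 8.0 image
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
```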
@@ -374,7 +374,7 @@ model=<your private model>
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
 token=<your CLI READ token>

-docker run --gpus all -e HF_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
+docker run --gpus all -e HF_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id $model
 ```

 ### Air gapped deployment
@@ -397,7 +397,7 @@ git clone https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
 volume=$PWD

 # Mount the models directory inside the container with a volume and set the model ID
-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id /data/Qwen3-Embedding-0.6B
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id /data/Qwen3-Embedding-0.6B
 ```

 ### Using Re-rankers models
@@ -414,7 +414,7 @@ downstream performance.
 model=BAAI/bge-reranker-large
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id $model
 ```

 And then you can rank the similarity between a query and a list of texts with:
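The rerank request referenced by that context line falls outside the hunk; a hedged sketch of such a call, with the payload shape taken from TEI's documented `/rerank` endpoint:

```shell
curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query": "What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' \
    -H 'Content-Type: application/json'
```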
@@ -434,7 +434,7 @@ You can also use classic Sequence Classification models like `SamLowe/roberta-ba
 model=SamLowe/roberta-base-go_emotions
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id $model
 ```

 Once you have deployed the model you can use the `predict` endpoint to get the emotions most associated with an input:
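As a hedged illustration of that `predict` call (request shape assumed from TEI's documented API):

```shell
curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs":"I like you."}' \
    -H 'Content-Type: application/json'
```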
@@ -454,7 +454,7 @@ You can choose to activate SPLADE pooling for Bert and Distilbert MaskedLM archi
 model=naver/efficient-splade-VI-BT-large-query
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model --pooling splade
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id $model --pooling splade
 ```

 Once you have deployed the model you can use the `/embed_sparse` endpoint to get the sparse embedding:
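And a hedged sketch of the `/embed_sparse` request that follows in the README (same request shape as `/embed`, assumed unchanged in 1.9):

```shell
curl 127.0.0.1:8080/embed_sparse \
    -X POST \
    -d '{"inputs":"I like you."}' \
    -H 'Content-Type: application/json'
```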
@@ -483,7 +483,7 @@ You can use the gRPC API by adding the `-grpc` tag to any TEI Docker image. For
 model=Qwen/Qwen3-Embedding-0.6B
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8-grpc --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9-grpc --model-id $model
 ```

 ```shell
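The shell block that opens at the end of this hunk is truncated in the diff; in the README it demonstrates a gRPC request. A hedged sketch of such a call with grpcurl (service and method names assumed from TEI's proto definitions):

```shell
grpcurl -d '{"inputs": "What is Deep Learning"}' \
    -plaintext 0.0.0.0:8080 tei.v1.Embed/Embed
```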
@@ -532,7 +532,7 @@ sudo apt-get install libssl-dev gcc -y
 GPUs with CUDA compute capabilities < 7.5 are not supported (V100, Titan V, GTX 1000 series, ...).

 Make sure you have CUDA and the NVIDIA drivers installed. NVIDIA drivers on your device need to be compatible with CUDA
-version 12.6 or higher. You also need to add the NVIDIA binaries to your path:
+version 12.2 or higher. You also need to add the NVIDIA binaries to your path:

 ```shell
 export PATH=$PATH:/usr/local/cuda/bin
@@ -565,8 +565,7 @@ docker build -f Dockerfile .
 ```

 To build the CUDA containers, you need to know the compute cap of the GPU you will be using
-at runtime. If the compute capability is < 10.0 i.e., CUDA architecture is any of
-Turing, Ampere, Ada Lovelace, or Hopper; then run the following:
+at runtime, to build the image accordingly:

 ```shell
 # Get submodule dependencies
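The build block opened above is also truncated; in the repository the compute cap is passed as a Docker build argument. A minimal sketch, assuming the `Dockerfile-cuda` and `CUDA_COMPUTE_CAP` build arg from the TEI repository are unchanged in 1.9:

```shell
# Example: build for Ampere 8.0 (A100/A30); use 75, 86, 89, or 90 for other architectures
runtime_compute_cap=80
docker build . -f Dockerfile-cuda --build-arg CUDA_COMPUTE_CAP=$runtime_compute_cap
```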

docs/openapi.json

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@
       "name": "Apache 2.0",
       "url": "https://www.apache.org/licenses/LICENSE-2.0"
     },
-    "version": "1.8.3"
+    "version": "1.9.0"
   },
   "paths": {
     "/decode": {
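Once a 1.9 container is running, the bumped version should also be reported by the server itself; a quick hedged check against TEI's `/info` endpoint:

```shell
# Should agree with the "version" field in openapi.json above
curl -s 127.0.0.1:8080/info | grep '"version"'
```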

docs/source/en/cli_arguments.md

Lines changed: 1 addition & 1 deletion
@@ -98,7 +98,7 @@ Options:
           [default: 32]

       --auto-truncate
-          Automatically truncate inputs that are longer than the maximum supported size
+          Control automatic truncation of inputs that exceed the model's maximum supported size. Defaults to `true` (truncation enabled). Set to `false` to disable truncation; when disabled and the model's maximum input length exceeds `--max-batch-tokens`, the server will refuse to start with an error instead of silently truncating sequences.

           Unused for gRPC servers
docs/source/en/local_gpu.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ To make sure that your hardware is supported, check out the [Supported models an

 ## Step 1: CUDA and NVIDIA drivers

-Make sure you have CUDA and the NVIDIA drivers installed - NVIDIA drivers on your device need to be compatible with CUDA version 12.6 or higher.
+Make sure you have CUDA and the NVIDIA drivers installed - NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher.

 Add the NVIDIA binaries to your path:
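After the `PATH` update that the doc describes next (truncated in this diff), a quick sanity check that the toolkit is visible:

```shell
# Reports the installed CUDA toolkit release
nvcc --version
```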

docs/source/en/private_models.md

Lines changed: 1 addition & 1 deletion
@@ -37,5 +37,5 @@ model=<your private model>
 volume=$PWD/data
 token=<your cli Hugging Face Hub token>

-docker run --gpus all -e HF_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
+docker run --gpus all -e HF_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id $model
 ```

docs/source/en/quick_tour.md

Lines changed: 5 additions & 5 deletions
@@ -24,7 +24,7 @@ The easiest way to get started with TEI is to use one of the official Docker con
 Hence one needs to install Docker following their [installation instructions](https://docs.docker.com/get-docker/).

 TEI supports inference both on GPU and CPU. If you plan on using a GPU, make sure to check that your hardware is supported by checking [this table](https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#docker-images).
-Next, install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). NVIDIA drivers on your device need to be compatible with CUDA version 12.6 or higher.
+Next, install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher.

 ## Deploy

@@ -34,7 +34,7 @@ Next it's time to deploy your model. Let's say you want to use [`Qwen/Qwen3-Embe
 model=Qwen/Qwen3-Embedding-0.6B
 volume=$PWD/data

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id $model
 ```

 <Tip>
@@ -120,7 +120,7 @@ Let's say you want to use [`BAAI/bge-reranker-large`](https://huggingface.co/BAA
 model=BAAI/bge-reranker-large
 volume=$PWD/data

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id $model
 ```

 Once you have deployed a model, you can use the `rerank` endpoint to rank the similarity between a query and a list of texts. With `cURL` this can be done like so:
@@ -140,7 +140,7 @@ You can also use classic Sequence Classification models like [`SamLowe/roberta-b
 model=SamLowe/roberta-base-go_emotions
 volume=$PWD/data

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id $model
 ```

 Once you have deployed the model you can use the `predict` endpoint to get the emotions most associated with an input:
@@ -192,5 +192,5 @@ git clone https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5
 volume=$PWD

 # Mount the models directory inside the container with a volume and set the model ID
-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id /data/gte-base-en-v1.5
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id /data/gte-base-en-v1.5
 ```

docs/source/en/supported_models.md

Lines changed: 12 additions & 12 deletions
@@ -74,21 +74,21 @@ The library does **not** support CUDA compute capabilities < 7.5, which means V1

 To leverage your GPUs, make sure to install the
 [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html), and use
-NVIDIA drivers with CUDA version 12.6 or higher.
+NVIDIA drivers with CUDA version 12.2 or higher.

 Find the appropriate Docker image for your hardware in the following table:

-| Architecture                            | Image                                                                          |
-|-----------------------------------------|--------------------------------------------------------------------------------|
-| CPU                                     | ghcr.io/huggingface/text-embeddings-inference:cpu-1.8                          |
-| Volta                                   | NOT SUPPORTED                                                                  |
-| Turing (T4, RTX 2000 series, ...)       | ghcr.io/huggingface/text-embeddings-inference:turing-1.8 (experimental)        |
-| Ampere 8.0 (A100, A30)                  | ghcr.io/huggingface/text-embeddings-inference:1.8                              |
-| Ampere 8.6 (A10, A40, ...)              | ghcr.io/huggingface/text-embeddings-inference:86-1.8                           |
-| Ada Lovelace (RTX 4000 series, ...)     | ghcr.io/huggingface/text-embeddings-inference:89-1.8                           |
-| Hopper (H100)                           | ghcr.io/huggingface/text-embeddings-inference:hopper-1.8                       |
-| Blackwell 10.0 (B200, GB200, ...)       | ghcr.io/huggingface/text-embeddings-inference:100-sha-ac69b50 (experimental)   |
-| Blackwell 12.0 (GeForce RTX 50X0, ...)  | ghcr.io/huggingface/text-embeddings-inference:120-sha-ac69b50 (experimental)   |
+| Architecture                            | Image                                                                     |
+|-----------------------------------------|---------------------------------------------------------------------------|
+| CPU                                     | ghcr.io/huggingface/text-embeddings-inference:cpu-1.9                     |
+| Volta                                   | NOT SUPPORTED                                                             |
+| Turing (T4, RTX 2000 series, ...)       | ghcr.io/huggingface/text-embeddings-inference:turing-1.9 (experimental)   |
+| Ampere 8.0 (A100, A30)                  | ghcr.io/huggingface/text-embeddings-inference:1.9                         |
+| Ampere 8.6 (A10, A40, ...)              | ghcr.io/huggingface/text-embeddings-inference:86-1.9                      |
+| Ada Lovelace (RTX 4000 series, ...)     | ghcr.io/huggingface/text-embeddings-inference:89-1.9                      |
+| Hopper (H100)                           | ghcr.io/huggingface/text-embeddings-inference:hopper-1.9                  |
+| Blackwell 10.0 (B200, GB200, ...)       | ghcr.io/huggingface/text-embeddings-inference:100-1.9 (experimental)      |
+| Blackwell 12.0 (GeForce RTX 50X0, ...)  | ghcr.io/huggingface/text-embeddings-inference:120-1.9 (experimental)      |

 **Warning**: Flash Attention is turned off by default for the Turing image as it suffers from precision issues.
 You can turn Flash Attention v1 ON by using the `USE_FLASH_ATTENTION=True` environment variable.
