
Commit 2bff275: Update version to 1.8.0 (#686)

1 parent: 519ecac

6 files changed: +33 additions, -33 deletions

Cargo.lock

Lines changed: 8 additions & 8 deletions (generated file; diff not rendered)

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ default-members = [
 resolver = "2"

 [workspace.package]
-version = "1.7.4"
+version = "1.8.0"
 edition = "2021"
 authors = ["Olivier Dehaene", "Nicolas Patry", "Alvaro Bartolome"]
 homepage = "https://github.com/huggingface/text-embeddings-inference"
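To sanity-check the bump in a local checkout, something like the following should work (a sketch; it assumes `cargo` is installed and that the member crates inherit the version via `version.workspace = true`):

```shell
# Confirm the workspace-level version (local check, not part of the commit)
grep -n '^version' Cargo.toml   # expect: version = "1.8.0"
# List the versions cargo resolves for the workspace members
cargo metadata --no-deps --format-version 1 | grep -o '"version":"[^"]*"' | sort -u
```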

README.md

Lines changed: 13 additions & 13 deletions
@@ -113,7 +113,7 @@ Below are some examples of the currently supported models:
 model=Qwen/Qwen3-Embedding-0.6B
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
 ```

 And then you can make requests like
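The request itself falls outside this hunk's context; for reference, a typical call to the `/embed` route of the container started above looks like this (the input string is illustrative):

```shell
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```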
@@ -327,13 +327,13 @@ Text Embeddings Inference ships with multiple Docker images that you can use to

 | Architecture                        | Image                                                                     |
 |-------------------------------------|-------------------------------------------------------------------------|
-| CPU                                 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7                    |
+| CPU                                 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.8                    |
 | Volta                               | NOT SUPPORTED                                                            |
-| Turing (T4, RTX 2000 series, ...)   | ghcr.io/huggingface/text-embeddings-inference:turing-1.7 (experimental)  |
-| Ampere 80 (A100, A30)               | ghcr.io/huggingface/text-embeddings-inference:1.7                        |
-| Ampere 86 (A10, A40, ...)           | ghcr.io/huggingface/text-embeddings-inference:86-1.7                     |
-| Ada Lovelace (RTX 4000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:89-1.7                     |
-| Hopper (H100)                       | ghcr.io/huggingface/text-embeddings-inference:hopper-1.7 (experimental)  |
+| Turing (T4, RTX 2000 series, ...)   | ghcr.io/huggingface/text-embeddings-inference:turing-1.8 (experimental)  |
+| Ampere 80 (A100, A30)               | ghcr.io/huggingface/text-embeddings-inference:1.8                        |
+| Ampere 86 (A10, A40, ...)           | ghcr.io/huggingface/text-embeddings-inference:86-1.8                     |
+| Ada Lovelace (RTX 4000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:89-1.8                     |
+| Hopper (H100)                       | ghcr.io/huggingface/text-embeddings-inference:hopper-1.8 (experimental)  |

 **Warning**: Flash Attention is turned off by default for the Turing image as it suffers from precision issues.
 You can turn Flash Attention v1 ON by using the `USE_FLASH_ATTENTION=True` environment variable.
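As a concrete illustration of the warning above, enabling Flash Attention v1 on the Turing image is a matter of setting that environment variable on the same `docker run` invocation (a sketch reusing the `$volume` and `$model` variables defined earlier):

```shell
# Turing image with Flash Attention v1 explicitly enabled (off by default due to precision issues)
docker run --gpus all -e USE_FLASH_ATTENTION=True -p 8080:80 -v $volume:/data --pull always \
    ghcr.io/huggingface/text-embeddings-inference:turing-1.8 --model-id $model
```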
@@ -362,7 +362,7 @@ model=<your private model>
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
 token=<your cli READ token>

-docker run --gpus all -e HF_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model
+docker run --gpus all -e HF_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
 ```

 ### Air gapped deployment
@@ -385,7 +385,7 @@ git clone https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
 volume=$PWD

 # Mount the models directory inside the container with a volume and set the model ID
-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id /data/Qwen3-Embedding-0.6B
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id /data/Qwen3-Embedding-0.6B
 ```

 ### Using Re-rankers models
@@ -402,7 +402,7 @@ downstream performance.
 model=BAAI/bge-reranker-large
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
 ```

 And then you can rank the similarity between a query and a list of texts with:
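The rerank request is outside this hunk's context; for reference, a typical call to the `/rerank` route looks like the following (query and texts are illustrative):

```shell
curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query": "What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' \
    -H 'Content-Type: application/json'
```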
@@ -422,7 +422,7 @@ You can also use classic Sequence Classification models like `SamLowe/roberta-ba
 model=SamLowe/roberta-base-go_emotions
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
 ```

 Once you have deployed the model you can use the `predict` endpoint to get the emotions most associated with an input:
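For reference, a `/predict` call against the classification model deployed above would look like this (the input is illustrative):

```shell
curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs":"I like you."}' \
    -H 'Content-Type: application/json'
```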
@@ -442,7 +442,7 @@ You can choose to activate SPLADE pooling for Bert and Distilbert MaskedLM archi
 model=naver/efficient-splade-VI-BT-large-query
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model --pooling splade
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model --pooling splade
 ```

 Once you have deployed the model you can use the `/embed_sparse` endpoint to get the sparse embedding:
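For reference, a `/embed_sparse` call against the SPLADE deployment would look like this (the input is illustrative):

```shell
curl 127.0.0.1:8080/embed_sparse \
    -X POST \
    -d '{"inputs":"The capital of France is Paris."}' \
    -H 'Content-Type: application/json'
```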
@@ -471,7 +471,7 @@ You can use the gRPC API by adding the `-grpc` tag to any TEI Docker image. For
 model=Qwen/Qwen3-Embedding-0.6B
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7-grpc --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8-grpc --model-id $model
 ```

 ```shell
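The hunk cuts off at the opening of the client example; for reference, a gRPC request against the `-grpc` image can be made with `grpcurl` along these lines (service and method names assume TEI's published proto; adjust if they differ):

```shell
grpcurl -d '{"inputs": "What is Deep Learning"}' \
    -plaintext \
    0.0.0.0:8080 tei.v1.Embed/Embed
```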

docs/source/en/private_models.md

Lines changed: 1 addition & 1 deletion
@@ -37,5 +37,5 @@ model=<your private model>
 volume=$PWD/data
 token=<your cli Hugging Face Hub token>

-docker run --gpus all -e HF_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model
+docker run --gpus all -e HF_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
 ```
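Rather than pasting a token inline, you can source it from a local `huggingface-cli login` session (a sketch; the cache path below is the CLI default and may differ if `HF_HOME` is set):

```shell
# Read the cached token written by `huggingface-cli login` (default location)
token=$(cat ~/.cache/huggingface/token)
docker run --gpus all -e HF_TOKEN=$token -p 8080:80 -v $volume:/data --pull always \
    ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
```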

docs/source/en/quick_tour.md

Lines changed: 4 additions & 4 deletions
@@ -34,7 +34,7 @@ Next it's time to deploy your model. Let's say you want to use [`Qwen/Qwen3-Embe
 model=Qwen/Qwen3-Embedding-0.6B
 volume=$PWD/data

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
 ```

 <Tip>
@@ -110,7 +110,7 @@ Let's say you want to use [`BAAI/bge-reranker-large`](https://huggingface.co/BAA
 model=BAAI/bge-reranker-large
 volume=$PWD/data

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
 ```

 Once you have deployed a model, you can use the `rerank` endpoint to rank the similarity between a query and a list of texts. With `cURL` this can be done like so:
@@ -130,7 +130,7 @@ You can also use classic Sequence Classification models like [`SamLowe/roberta-b
 model=SamLowe/roberta-base-go_emotions
 volume=$PWD/data

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model
 ```

 Once you have deployed the model you can use the `predict` endpoint to get the emotions most associated with an input:
@@ -182,5 +182,5 @@ git clone https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5
 volume=$PWD

 # Mount the models directory inside the container with a volume and set the model ID
-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id /data/gte-base-en-v1.5
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id /data/gte-base-en-v1.5
 ```

docs/source/en/supported_models.md

Lines changed: 6 additions & 6 deletions
@@ -77,13 +77,13 @@ Find the appropriate Docker image for your hardware in the following table:

 | Architecture                        | Image                                                                      |
 |-------------------------------------|--------------------------------------------------------------------------|
-| CPU                                 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7                     |
+| CPU                                 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.8                     |
 | Volta                               | NOT SUPPORTED                                                             |
-| Turing (T4, RTX 2000 series, ...)   | ghcr.io/huggingface/text-embeddings-inference:turing-1.7 (experimental)   |
-| Ampere 80 (A100, A30)               | ghcr.io/huggingface/text-embeddings-inference:1.7                         |
-| Ampere 86 (A10, A40, ...)           | ghcr.io/huggingface/text-embeddings-inference:86-1.7                      |
-| Ada Lovelace (RTX 4000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:89-1.7                      |
-| Hopper (H100)                       | ghcr.io/huggingface/text-embeddings-inference:hopper-1.7 (experimental)   |
+| Turing (T4, RTX 2000 series, ...)   | ghcr.io/huggingface/text-embeddings-inference:turing-1.8 (experimental)   |
+| Ampere 80 (A100, A30)               | ghcr.io/huggingface/text-embeddings-inference:1.8                         |
+| Ampere 86 (A10, A40, ...)           | ghcr.io/huggingface/text-embeddings-inference:86-1.8                      |
+| Ada Lovelace (RTX 4000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:89-1.8                      |
+| Hopper (H100)                       | ghcr.io/huggingface/text-embeddings-inference:hopper-1.8 (experimental)   |

 **Warning**: Flash Attention is turned off by default for the Turing image as it suffers from precision issues.
 You can turn Flash Attention v1 ON by using the `USE_FLASH_ATTENTION=True` environment variable.
