Releases: huggingface/text-embeddings-inference
v1.8.0
Notable Changes
- Qwen3 support for the 0.6B, 4B, and 8B variants on CPU and MPS, plus FlashQwen3 on CUDA and Intel HPUs
- NomicBert MoE support
- JinaAI Re-Rankers V1 support
- Matryoshka Representation Learning (MRL)
- Dense layer module support (after pooling)
Note
Some of the aforementioned changes were already released in the patch versions on top of v1.7.0, while Matryoshka Representation Learning (MRL) and Dense layer module support are new in this release.
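As an illustration of the MRL support, here is a minimal client-side sketch, assuming a router serving an MRL-capable embedding model on `localhost:8080` and that the embed request exposes a `dimensions` field for truncating the returned embeddings:

```python
import requests

# Hypothetical local deployment; adjust the URL and payload to your setup.
response = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": "What is Deep Learning?", "dimensions": 256},
)
response.raise_for_status()
embedding = response.json()[0]
print(len(embedding))  # 256 instead of the model's full hidden size
```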
What's Changed
- [Docs] Update quick tour by @NielsRogge in #574
- Update `README.md` and `supported_models.md` by @alvarobartt in #572
- Back with linting. by @Narsil in #577
- [Docs] Add cloud run example by @NielsRogge in #573
- Fixup by @Narsil in #578
- Fixing the tokenization routes token (offsets are in bytes, not in chars) by @Narsil in #576
- Removing requirements file. by @Narsil in #585
- Removing candle-extensions to live on crates.io by @Narsil in #583
- Bump `sccache` to 0.10.0 and `sccache-action` to 0.0.9 by @alvarobartt in #586
- optimize the performance of FlashBert Path for HPU by @kaixuanliu in #575
- Revert "Removing requirements file. (#585)" by @Narsil in #588
- Get opentelemetry trace id from request headers by @kozistr in #425
- Add argument for configuring Prometheus port by @kozistr in #589
- Adding missing `head.` prefix in the weight name in `ModernBertClassificationHead` by @kozistr in #591
- Fixing the CI (grpc path). by @Narsil in #593
- fix xpu env issue that cannot find right libur_loader.so.0 by @kaixuanliu in #595
- enable flash mistral model for HPU device by @kaixuanliu in #594
- remove optimum-habana dependency by @kaixuanliu in #599
- Support NomicBert MoE by @kozistr in #596
- Remove duplicate short option '-p' to fix router executable by @cebtenzzre in #602
- Update `text-embeddings-router --help` output by @alvarobartt in #603
- Warmup padded models too. by @Narsil in #592
- Add support for JinaAI Re-Rankers V1 by @alvarobartt in #582
- Gte diffs by @Narsil in #604
- Fix the weight name in GTEClassificationHead by @kozistr in #606
- upgrade pytorch and ipex to 2.7 version by @kaixuanliu in #607
- upgrade HPU FW to 1.21; upgrade transformers to 4.51.3 by @kaixuanliu in #608
- Patch DistilBERT variants with different weight keys by @alvarobartt in #614
- add offline modeling for model `jinaai/jina-embeddings-v2-base-code` to avoid `auto_map` to other repository by @kaixuanliu in #612
- Add mean pooling strategy for Modernbert classifier by @kwnath in #616
- Using serde for pool validation. by @Narsil in #620
- Preparing the update to 1.7.1 by @Narsil in #623
- Adding suggestions to fixing missing ONNX files. by @Narsil in #624
- Add `Qwen3Model` by @alvarobartt in #627
- Add `HiddenAct::Silu` (remove `serde` alias) by @alvarobartt in #631
- Add CPU support for Qwen3-Embedding models by @randomm in #632
- refactor the code and add wrap_in_hpu_graph to corner case by @kaixuanliu in #625
- Support Qwen3 w/ fp32 on GPU by @kozistr in #634
- Preparing the release. by @Narsil in #639
- Default to Qwen3 in `README.md` and `docs/` examples by @alvarobartt in #641
- Fix Qwen3 by @kozistr in #646
- Add integration tests for Gaudi by @baptistecolle in #598
- Fix Qwen3-Embedding batch vs single inference inconsistency by @lance-miles in #648
- Fix FlashQwen3 by @kozistr in #650
- Make flake work on metal by @Narsil in #654
- Fixing metal backend. by @Narsil in #655
- Qwen3 hpu support by @kaixuanliu in #656
- change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in #659
- Update `version` to 1.7.3 by @alvarobartt in #666
- Add last token pooling support for ORT. by @tpendragon in #664
- Fix Qwen3 Embedding Float16 DType by @tpendragon in #663
- Fix `fmt` by re-running `pre-commit` by @alvarobartt in #671
- Update `version` to 1.7.4 by @alvarobartt in #677
- Support MRL (Matryoshka Representation Learning) by @kozistr in #676
- Add `Dense` layer for `2_Dense/` modules by @alvarobartt in #660
- Update `version` to 1.8.0 by @alvarobartt in #686
New Contributors
- @NielsRogge made their first contribution in #574
- @cebtenzzre made their first contribution in #602
- @kwnath made their first contribution in #616
- @randomm made their first contribution in #632
- @lance-miles made their first contribution in #648
- @tpendragon made their first contribution in #664
Full Changelog: v1.7.0...v1.8.0
v1.7.4
Notable Changes
Qwen3 was not working correctly on CPU / MPS when sending batched requests with FP16 precision, due to the FP32 minimum value being downcast (it is now manually set to the FP16 minimum value instead), which led to null values, as well as a missing `to_dtype` call on the `attention_bias` when working with batches.
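For illustration, a minimal PyTorch sketch of the failure mode (TEI implements this in Rust with candle, so this is an analogy, not the actual fix):

```python
import torch

# Downcasting the FP32 minimum to FP16 overflows to -inf:
fp32_min = torch.finfo(torch.float32).min        # ~ -3.4e38
print(torch.tensor(fp32_min).to(torch.float16))  # tensor(-inf, dtype=torch.float16)

# A fully masked attention-bias row then becomes all -inf, and its
# softmax is NaN (0 / 0), i.e. the null values seen in batched requests:
row = torch.full((4,), fp32_min).to(torch.float16)
print(torch.softmax(row, dim=0))                 # tensor([nan, nan, nan, nan])

# Using the FP16 minimum instead keeps the bias finite after the cast:
fp16_min = torch.finfo(torch.float16).min        # -65504.0
row = torch.full((4,), fp16_min, dtype=torch.float16)
print(torch.softmax(row, dim=0))                 # tensor([0.25, 0.25, 0.25, 0.25])
```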
What's Changed
- Fix Qwen3 Embedding Float16 DType by @tpendragon in #663
- Fix `fmt` by re-running `pre-commit` by @alvarobartt in #671
- Update `version` to 1.7.4 by @alvarobartt in #677
Full Changelog: v1.7.3...v1.7.4
v1.7.3
Notable Changes
Qwen3 support has been added for Intel HPU, and fixed on CPU / Metal / CUDA.
What's Changed
- Default to Qwen3 in `README.md` and `docs/` examples by @alvarobartt in #641
- Fix Qwen3 by @kozistr in #646
- Add integration tests for Gaudi by @baptistecolle in #598
- Fix Qwen3-Embedding batch vs single inference inconsistency by @lance-miles in #648
- Fix FlashQwen3 by @kozistr in #650
- Make flake work on metal by @Narsil in #654
- Fixing metal backend. by @Narsil in #655
- Qwen3 hpu support by @kaixuanliu in #656
- change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in #659
- Update `version` to 1.7.3 by @alvarobartt in #666
- Add last token pooling support for ORT. by @tpendragon in #664
New Contributors
- @lance-miles made their first contribution in #648
- @tpendragon made their first contribution in #664
Full Changelog: v1.7.2...v1.7.3
v1.7.2
Notable Changes
- Added support for Qwen3 embeddings
What's Changed
- Adding suggestions to fixing missing ONNX files. by @Narsil in #624
- Add `Qwen3Model` by @alvarobartt in #627
- Add `HiddenAct::Silu` (remove `serde` alias) by @alvarobartt in #631
- Add CPU support for Qwen3-Embedding models by @randomm in #632
- refactor the code and add wrap_in_hpu_graph to corner case by @kaixuanliu in #625
- Support Qwen3 w/ fp32 on GPU by @kozistr in #634
- Preparing the release. by @Narsil in #639
New Contributors
- @randomm made their first contribution in #632
Full Changelog: v1.7.1...v1.7.2
v1.7.1
What's Changed
- [Docs] Update quick tour by @NielsRogge in #574
- Update `README.md` and `supported_models.md` by @alvarobartt in #572
- Back with linting. by @Narsil in #577
- [Docs] Add cloud run example by @NielsRogge in #573
- Fixup by @Narsil in #578
- Fixing the tokenization routes token (offsets are in bytes, not in chars) by @Narsil in #576
- Removing requirements file. by @Narsil in #585
- Removing candle-extensions to live on crates.io by @Narsil in #583
- Bump `sccache` to 0.10.0 and `sccache-action` to 0.0.9 by @alvarobartt in #586
- optimize the performance of FlashBert Path for HPU by @kaixuanliu in #575
- Revert "Removing requirements file. (#585)" by @Narsil in #588
- Get opentelemetry trace id from request headers by @kozistr in #425
- Add argument for configuring Prometheus port by @kozistr in #589
- Adding missing `head.` prefix in the weight name in `ModernBertClassificationHead` by @kozistr in #591
- Fixing the CI (grpc path). by @Narsil in #593
- fix xpu env issue that cannot find right libur_loader.so.0 by @kaixuanliu in #595
- enable flash mistral model for HPU device by @kaixuanliu in #594
- remove optimum-habana dependency by @kaixuanliu in #599
- Support NomicBert MoE by @kozistr in #596
- Remove duplicate short option '-p' to fix router executable by @cebtenzzre in #602
- Update `text-embeddings-router --help` output by @alvarobartt in #603
- Warmup padded models too. by @Narsil in #592
- Add support for JinaAI Re-Rankers V1 by @alvarobartt in #582
- Gte diffs by @Narsil in #604
- Fix the weight name in GTEClassificationHead by @kozistr in #606
- upgrade pytorch and ipex to 2.7 version by @kaixuanliu in #607
- upgrade HPU FW to 1.21; upgrade transformers to 4.51.3 by @kaixuanliu in #608
- Patch DistilBERT variants with different weight keys by @alvarobartt in #614
- add offline modeling for model `jinaai/jina-embeddings-v2-base-code` to avoid `auto_map` to other repository by @kaixuanliu in #612
- Add mean pooling strategy for Modernbert classifier by @kwnath in #616
- Using serde for pool validation. by @Narsil in #620
- Preparing the update to 1.7.1 by @Narsil in #623
New Contributors
- @NielsRogge made their first contribution in #574
- @cebtenzzre made their first contribution in #602
- @kwnath made their first contribution in #616
Full Changelog: v1.7.0...v1.7.1
v1.7.0
Notable changes
- Major dependency upgrades (candle 0.5 -> 0.8 and related)
- Added ModernBert support by @kozistr!
What's Changed
- Moving cublaslt into TEI extension for easier upgrade of candle globally by @Narsil in #542
- Upgrade candle2 by @Narsil in #543
- Upgrade candle3 by @Narsil in #545
- Fixing the static-linking. by @Narsil in #547
- Fix linking bis by @Narsil in #549
- Make `sliding_window` for `Qwen2` optional by @alvarobartt in #546
- Optimize the performance of FlashBert on HPU by using fast mode softmax by @kaixuanliu in #555
- Fixing cudarc to the latest unified bindings. by @Narsil in #558
- Fix typos / formatting in CLI args in Markdown files by @alvarobartt in #552
- Use custom `serde` deserializer for JinaBERT models by @alvarobartt in #559
- Implement the `ModernBert` model by @kozistr in #459
- Fixing FlashAttention ModernBert. by @Narsil in #560
- Enable ModernBert on metal by @ivarflakstad in #562
- Fix `{Bert,DistilBert}SpladeHead` when loading from Safetensors by @alvarobartt in #564
- add related docs for intel cpu/xpu/hpu container by @kaixuanliu in #550
- Update the doc for submodule. by @Narsil in #567
- Update `docs/source/en/custom_container.md` by @alvarobartt in #568
- Preparing for release 1.7.0 (candle update + modernbert). by @Narsil in #570
New Contributors
- @ivarflakstad made their first contribution in #562
Full Changelog: v1.6.1...v1.7.0
v1.6.1
What's Changed
- Enable intel devices CPU/XPU/HPU for python backend by @yuanwu2017 in #245
- add reranker model support for python backend by @kaixuanliu in #386
- (FIX): CI Security Fix - branchname injection by @glegendre01 in #479
- Upgrade TEI. by @Narsil in #501
- Pin `cargo-chef` installation to 0.1.62 by @alvarobartt in #469
- add `TRUST_REMOTE_CODE` param to python backend. by @kaixuanliu in #485
- Enable splade embeddings for Python backend by @pi314ever in #493
- Hpu bucketing by @kaixuanliu in #489
- Optimize flash bert path for hpu device by @kaixuanliu in #509
- upgrade ipex to 2.6 version for cpu/xpu by @kaixuanliu in #510
- fix bug for `MaskedLanguageModel` class by @kaixuanliu in #513
- Fix double incrementing `te_request_count` metric by @kozistr in #486
- Add intel based images to the CI by @baptistecolle in #518
- Fix typo on intel docker image by @baptistecolle in #529
- chore: Upgrade to tokenizers 0.21.0 by @lightsofapollo in #512
- feat: add support for "model_type": "gte" by @anton-pt in #519
- Update `README.md` to include ONNX by @alvarobartt in #507
- Fusing both Gte Configs. by @Narsil in #530
- Add `HF_HUB_USER_AGENT_ORIGIN` by @alvarobartt in #534
- Use `--hf-token` instead of `--hf-api-token` by @alvarobartt in #535
- Fixing the tests. by @Narsil in #531
- Support classification head for DistilBERT by @kozistr in #487
- add CLI flag `disable-spans` to toggle span trace logging by @obloomfield in #481
- feat: support HF_ENDPOINT environment when downloading model by @StrayDragon in #505
- Small fixup. by @Narsil in #537
- Fix `VarBuilder` handling in GTE e.g. `gte-multilingual-reranker-base` by @Narsil in #538
- make a WA in case Bert model do not have `safetensor` file by @kaixuanliu in #515
- Add missing `match` on `onnx/model.onnx` download by @alvarobartt in #472
- Fixing the impure flake devShell to be able to run python code. by @Narsil in #539
- Prepare for release. by @Narsil in #540
New Contributors
- @yuanwu2017 made their first contribution in #245
- @kaixuanliu made their first contribution in #386
- @Narsil made their first contribution in #501
- @pi314ever made their first contribution in #493
- @baptistecolle made their first contribution in #518
- @lightsofapollo made their first contribution in #512
- @anton-pt made their first contribution in #519
- @obloomfield made their first contribution in #481
- @StrayDragon made their first contribution in #505
Full Changelog: v1.6.0...v1.6.1
v1.6.0
What's Changed
- feat: support multiple backends at the same time by @OlivierDehaene in #440
- feat: GTE classification head by @kozistr in #441
- feat: Implement GTE model to support the non-flash-attn version by @kozistr in #446
- feat: Implement MPNet model (#363) by @kozistr in #447
Full Changelog: v1.5.1...v1.6.0
v1.5.1
What's Changed
- Download `model.onnx_data` by @kozistr in #343
- Rename 'Sentence Transformers' to 'sentence-transformers' in docstrings by @Wauplin in #342
- fix: add serde default for truncation direction by @drbh in #399
- fix: metrics unbounded memory by @OlivierDehaene in #409
- Fix to allow health check w/o auth by @kozistr in #360
- Update `ort` crate version to `2.0.0-rc.4` to support onnx IR version 10 by @kozistr in #361
- adds curl to fix healthcheck by @WissamAntoun in #376
- fix: use num_cpus::get to check as get_physical does not check cgroups by @OlivierDehaene in #410
- fix: use status code 400 when batch is empty by @OlivierDehaene in #413
- fix: add cls pooling as default for BERT variants by @OlivierDehaene in #426
- feat: auto limit string if truncate is set by @OlivierDehaene in #428
New Contributors
- @Wauplin made their first contribution in #342
- @XciD made their first contribution in #345
- @WissamAntoun made their first contribution in #376
Full Changelog: v1.5.0...v1.5.1
v1.5.0
Notable Changes
- ONNX runtime for CPU deployments: greatly improves CPU deployment throughput
- Add `/similarity` route
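As an illustration, a minimal sketch of calling the new route, assuming a router running on `localhost:8080` and that the payload follows the sentence-similarity task shape:

```python
import requests

# Hypothetical local deployment; adjust the URL and sentences to your setup.
response = requests.post(
    "http://localhost:8080/similarity",
    json={
        "inputs": {
            "source_sentence": "What is Deep Learning?",
            "sentences": [
                "Deep learning is a subset of machine learning.",
                "The weather is nice today.",
            ],
        }
    },
)
response.raise_for_status()
print(response.json())  # one similarity score per candidate sentence
```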
What's Changed
- tokenizer max limit on input size by @ErikKaum in #324
- docs: air-gapped deployments by @OlivierDehaene in #326
- feat(onnx): add onnx runtime for better CPU perf by @OlivierDehaene in #328
- feat: add `/similarity` route by @OlivierDehaene in #331
- fix(ort): fix mean pooling by @OlivierDehaene in #332
- chore(candle): update flash attn by @OlivierDehaene in #335
- v1.5.0 by @OlivierDehaene in #336
New Contributors
- @ErikKaum made their first contribution in #324
Full Changelog: v1.4.0...v1.5.0