Releases: embeddings-benchmark/mteb

2.8.1

17 Feb 23:21

2.8.1 (2026-02-17)

Fix

  • fix: Remove duplicate citations and add test to prevent it going forward (#4032)

  • test: add test to detect duplicate citations

  • quality

  • move changes to task file

  • fix false positives

  • update citations

  • add models and benchmarks

  • search close titles

  • fix maeb
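
The notes above do not show how the duplicate-citation test works. As a rough illustration of the "search close titles" idea, here is a minimal sketch that flags pairs of BibTeX entries whose normalised titles are nearly identical. The function names, the 0.9 threshold, and the one-field-per-line assumption are all hypothetical and not taken from the mteb test suite.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations
from typing import Optional


def normalise_title(bibtex: str) -> Optional[str]:
    """Extract the title field from a BibTeX entry and normalise it.

    Assumes one field per line, e.g. `title = {Some Title},`.
    """
    match = re.search(r"\btitle\s*=\s*\{(.*)\}", bibtex, flags=re.IGNORECASE)
    if match is None:
        return None
    # Drop protective braces, lowercase, and collapse whitespace.
    title = match.group(1).replace("{", "").replace("}", "")
    return " ".join(title.lower().split())


def find_close_titles(citations: dict, threshold: float = 0.9) -> list:
    """Return sorted name pairs whose titles have similarity >= threshold."""
    titles = {}
    for name, bibtex in citations.items():
        title = normalise_title(bibtex)
        if title:
            titles[name] = title
    pairs = []
    for (name_a, title_a), (name_b, title_b) in combinations(sorted(titles.items()), 2):
        if SequenceMatcher(None, title_a, title_b).ratio() >= threshold:
            pairs.append((name_a, name_b))
    return pairs
```

A test built on this idea could assert that `find_close_titles(...)` returns an empty list, with an explicit allow-list for known false positives.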


Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> (377f1b4)

Unknown

  • benchmark: add 6 new VisRAG retrieval tasks and corresponding stats (#4059)

  • dataset: add 6 new VisRAG retrieval tasks and corresponding stats

  • Introduced VisRAGRetArxivQA, VisRAGRetChartQA, VisRAGRetInfoVQA, VisRAGRetMPDocVQA, VisRAGRetPlotQA, and VisRAGRetSlideVQA classes for various retrieval tasks.
  • Added JSON files containing descriptive statistics for each task, including sample counts, image dimensions, and query statistics.
  • Updated the retrieval module's __init__.py to include the new tasks in the module exports.
  • fix a linter error

  • dataset: introduce VisRAG Retrieval Benchmark

  • fix: metadata update of VisRag

  • Update benchmark metadata

  • Update VisRAG datasets metadata, including one-line description and the domains

  • Update slideVQA domain

  • Add Aliases for VisRAG

  • Fix bibtex format

  • Update dataset metadata to point to the mteb versions

  • Remove redundant data loading


Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com> (7a9e653)

  • fix: Remove MAEB+ and MAEB(extended) from leaderboard and add "beta" to all MAEB (#4103)

  • fix: Remove MAEB+ and MAEB(extended)

related to #3470

We could consider keeping the two benchmarks (they would still need to be beta, as the paper is under review, so they could change).

  • fix: Remove MAEB+ and MAEB(extended) from leaderboard and add "beta" to all MAEB

related to #3470

  • Remove MAEB+ and MAEB(extended) from leaderboard
  • Added the beta tag to denote that these might change

Currently this is implemented by keeping the two temporary benchmarks. We could consider removing them as well (I am unsure how much of a burden it is for us to maintain them), but I would probably not add them to the leaderboard.

All of these changes should be backward compatible

  • docs: Added whatsnew

  • implement fixes

  • update description to explain beta status (77ac52b)

2.7.30

12 Feb 16:15

2.7.30 (2026-02-12)

Fix

  • fix: correct reference link for MIRACLVisionRetrieval task (#4092) (d3e9b06)

2.7.29

12 Feb 14:55

2.7.29 (2026-02-12)

Documentation

  • docs: Improved adding a benchmark docs (#4087)

  • docs: Improved adding a benchmark docs

Expanded the explanation to provide more information about the process.

  • Update docs/contributing/adding_a_benchmark.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

  • Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

  • fix

  • fix

  • minor heading change

  • Apply suggestions from code review

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> (8149d4a)

Fix

  • fix: constrain the transformers library version for jina-clip (#4061)

  • fix: constrain the transformers library version for jina-clip to avoid compatibility issue

  • add required package

  • add to conflicts

  • try to run again

  • upd lock

  • tmp

  • try

  • fix: pylate dependency on outdated version of transformers


Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> (eee82cf)

Unknown

  • dataset: Add MTEB(spa) Spanish language benchmark (#4053)

  • dataset: Add MTEB(spa) Spanish language benchmark

Define MTEB(spa, v1) benchmark grouping 23 existing Spanish tasks
across 6 task types: Classification (8), Clustering (3),
PairClassification (2), Reranking (1), Retrieval (5), and STS (4).
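
As a quick arithmetic check of the initial definition, the per-type counts quoted above can be summed. This only restates the numbers in the commit message; the variable names are illustrative, and later commits in the same PR reduce the benchmark to 16 tasks.

```python
# Task counts per type, as listed in the commit message above.
task_counts = {
    "Classification": 8,
    "Clustering": 3,
    "PairClassification": 2,
    "Reranking": 1,
    "Retrieval": 5,
    "STS": 4,
}

total_tasks = sum(task_counts.values())
print(len(task_counts), total_tasks)  # 6 23
```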

  • fix: Replace MIRACLRetrieval with HardNegatives.v2 per review

  • fix: Remove tasks with known issues, add contact, reduce to 16 tasks

  • Apply suggestion from @KennethEnevoldsen


Co-authored-by: Clemente <clemente@Clementes-MacBook-Pro.local>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> (2507bec)

  • Move MIEB datasets to mteb HuggingFace org (#4070)

  • Move 15 MIEB datasets to mteb HuggingFace org

Update dataset paths and revisions for tasks that now use datasets
forked to the mteb org:

  • MMSoc_HatefulMemes, MMSoc_Memotion (Ahren09 -> mteb)
  • blink-it2i, blink-it2i-multi, blink-it2t, blink-it2t-multi (JamieSJS -> mteb)
  • gld-v2-i2t, imagecode, imagecode-multi (JamieSJS -> mteb)
  • imagenet-10, imagenet-dog-15, met (JamieSJS -> mteb)
  • r-oxford-easy-multi, r-oxford-medium-multi, r-oxford-hard-multi (JamieSJS -> mteb)

Part of #4049

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

  • transfer mrbench

  • Move 8 isaacchung MIEB datasets to mteb HuggingFace org

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

  • Move 12 MIEB datasets to mteb HuggingFace org

Datasets moved:

  • dpdl-benchmark/sun397
  • ethz/food101
  • floschne/xflickrco
  • floschne/xm3600
  • flwrlabs/ucf101
  • nyu-visionx/CV-Bench
  • tanganke/dtd
  • tanganke/stl10
  • timm/eurosat-rgb
  • timm/resisc45
  • uoft-cs/cifar10
  • uoft-cs/cifar100

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

  • Move 15 MIEB datasets to mteb HuggingFace org

Datasets moved:

  • JamieSJS/r-paris-easy-multi
  • JamieSJS/r-paris-medium-multi
  • JamieSJS/r-paris-hard-multi
  • JamieSJS/rp2k
  • JamieSJS/sketchy
  • JamieSJS/stanford-online-products
  • JamieSJS/vizwiz
  • JamieSJS/vqa-2
  • Pixel-Linguist/rendered-sts12
  • Pixel-Linguist/rendered-sts13
  • Pixel-Linguist/rendered-sts14
  • Pixel-Linguist/rendered-sts15
  • Pixel-Linguist/rendered-sts16
  • ylecun/mnist
  • zh-plus/tiny-imagenet

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

  • Move 7 clip-benchmark datasets to mteb HuggingFace org

Datasets moved:

  • clip-benchmark/wds_country211
  • clip-benchmark/wds_fer2013
  • clip-benchmark/wds_gtsrb
  • clip-benchmark/wds_renderedsst2
  • clip-benchmark/wds_vtab-clevr_closest_object_distance
  • clip-benchmark/wds_vtab-clevr_count_all
  • clip-benchmark/wds_vtab-pcam

Note: wds_imagenet1k failed due to storage limits.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

  • Move 14 MIEB datasets to mteb HuggingFace org

Datasets migrated:

  • clip-benchmark/wds_imagenet1k → mteb/wds_imagenet1k
  • m-a-p/SciMMIR → mteb/SciMMIR
  • yjkimstats/SUGARCREPE_fmt → mteb/SUGARCREPE_fmt
  • nelorth/oxford-flowers → mteb/oxford-flowers
  • vidore/arxivqa_test_subsampled_beir → mteb/arxivqa_test_subsampled_beir
  • vidore/docvqa_test_subsampled_beir → mteb/docvqa_test_subsampled_beir
  • vidore/infovqa_test_subsampled_beir → mteb/infovqa_test_subsampled_beir
  • vidore/shiftproject_test_beir → mteb/shiftproject_test_beir
  • vidore/syntheticDocQA_artificial_intelligence_test_beir → mteb/syntheticDocQA_artificial_intelligence_test_beir
  • vidore/syntheticDocQA_energy_test_beir → mteb/syntheticDocQA_energy_test_beir
  • vidore/syntheticDocQA_government_reports_test_beir → mteb/syntheticDocQA_government_reports_test_beir
  • vidore/syntheticDocQA_healthcare_industry_test_beir → mteb/syntheticDocQA_healthcare_industry_test_beir
  • vidore/tabfquad_test_subsampled_beir → mteb/tabfquad_test_subsampled_beir
  • vidore/tatdqa_test_beir → mteb/tatdqa_test_beir
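
The renames listed above all follow the same pattern: the dataset name is kept and only the owning organisation changes to mteb. A minimal sketch of that mapping follows; `to_mteb_path` is a hypothetical helper for illustration, not a function in mteb, and it does not cover the accompanying revision updates, which cannot be derived from the path.

```python
def to_mteb_path(dataset_path: str, target_org: str = "mteb") -> str:
    """Map 'owner/name' to 'target_org/name', keeping the dataset name unchanged."""
    owner, sep, name = dataset_path.partition("/")
    if not owner or not sep or not name:
        raise ValueError(f"expected 'owner/name', got {dataset_path!r}")
    return f"{target_org}/{name}"


print(to_mteb_path("vidore/tatdqa_test_beir"))  # mteb/tatdqa_test_beir
```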

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

  • update rest

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> (2ef04f8)

  • model: add voyage-4-nano (#4086)

  • model: add voyage-4-nano model implementation

  • Apply suggestion from @Samoed

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>


Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> (2ce07c4)

2.7.28

11 Feb 16:04

2.7.28 (2026-02-11)

Fix

  • fix: Remove task performance by type tab when there is only one type (#4067)

  • Remove task performance by type Tab when the Radar plot can't be generated

  • Apply suggestion (9f95b58)

Unknown

  • Correct Embedding Dimension for paraphrase-multilingual-MiniLM-L12-v2 (#4089) (334d690)

2.7.27

11 Feb 11:27

2.7.27 (2026-02-11)

Documentation

  • docs: Outline for adding a task documentation (#4082)

  • docs: Outline for adding a task documentation

This is a suggested structure, PR is just to get feedback before I finish it up.

fixes #4077

  • upd docs

  • install dependencies in ci

  • add example with retrieval

  • filled out the missing segments

  • lint and format

  • Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

  • fix numbering and indentation

  • add missing imports

  • fix links

  • add full example for retrieval dataset


Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> (42b8058)

  • docs: Improve docstring for some of the main abstasks (#4083)

  • docs: fix AbsTaskClassification docstring formatting and improve docstrings for some of the main tasks

  • format (50bd0fa)

Fix

  • fix: Add performance per language tab to more benchmarks (#4066)

Add Performance per language Tab to more benchmarks (4ca1922)

Unknown

  • dataset: add 'law-ir_ko' dataset for IR task (#4052)

  • law_ir_ko

  • Update mteb/tasks/retrieval/kor/law_ir_ko.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

  • law_ir_ko info revision

  • description

  • metadata-info rev

  • metadata-info rev

  • Update mteb/tasks/retrieval/kor/law_ir_ko.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

  • statistics(), reference

  • format citation

  • author & howpublished rev

  • make lint

  • description rev

  • Update mteb/tasks/retrieval/kor/law_ir_ko.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

  • Update mteb/tasks/retrieval/kor/law_ir_ko.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> (1cdc662)

  • model: Add ModelMeta for geoffsee/auto-g-embed-st (#4074)

Add ModelMeta for geoffsee/auto-g-embed-st (81540a2)

  • Add MetaCLIP 2 model integration (#4065)

  • Add MetaCLIP 2 model integration

Add support for facebook/metaclip-2-mt5-worldwide-b32, a multilingual
vision-language model using mT5 tokenizer for worldwide language support.

  • 254M parameters, 512 embedding dimension
  • Supports 99 languages (XLMR language set)
  • Handles MetaCLIP 2's BaseModelOutputWithPooling return format

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

  • Add n_embedding_parameters for MetaCLIP 2 model

Set n_embedding_parameters to 128,057,344 (mT5 vocab size 250,112 × embed_dim 512)
to fix test_n_embedding_parameters test failure.
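
The figure above is the size of the token embedding matrix, that is, vocabulary size times embedding dimension. The small check below reproduces it; the helper name is illustrative only.

```python
def embedding_matrix_parameters(vocab_size: int, embed_dim: int) -> int:
    """Number of weights in a vocab_size x embed_dim token embedding matrix."""
    return vocab_size * embed_dim


MT5_VOCAB_SIZE = 250_112  # value quoted in the commit message
EMBED_DIM = 512           # value quoted in the commit message

print(embedding_matrix_parameters(MT5_VOCAB_SIZE, EMBED_DIM))  # 128057344
```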

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

  • Add training metadata for MetaCLIP 2 model

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>


Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> (4280dd9)

2.7.26

07 Feb 22:36

2.7.26 (2026-02-07)

Fix

  • fix: filter corrupted image in Birdsnap (#4068)

  • fix: filter corrupted image in Birdsnap and drop unused splits in zero-shot tasks

  • Filter out corrupted/truncated image at index 3854 in Birdsnap train split
  • Add dataset_transform to AbsTaskZeroShotClassification to keep only eval splits
    (zero-shot tasks don't need train splits)
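
The two changes above can be sketched with plain Python containers standing in for the real dataset objects. This is an illustration under stated assumptions, not the mteb implementation: the helper names are hypothetical, and a dict of lists replaces the dataset with its splits.

```python
from typing import Optional

CORRUPTED_INDEX = 3854  # index quoted in the commit message (Birdsnap train split)


def drop_index(rows: list, bad_index: int) -> list:
    """Return the rows without the one at bad_index."""
    return [row for i, row in enumerate(rows) if i != bad_index]


def keep_eval_splits(dataset: Optional[dict], eval_splits: list) -> Optional[dict]:
    """Keep only the splits used for evaluation; pass None through unchanged."""
    if dataset is None:
        return None
    return {split: rows for split, rows in dataset.items() if split in eval_splits}


splits = {"train": ["a", "b"], "test": ["c"]}
print(keep_eval_splits(splits, ["test"]))  # {'test': ['c']}
```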

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

  • fix: handle BaseModelOutputWithPooling in CLIP model wrapper

In transformers 5.x, get_text_features and get_image_features return
BaseModelOutputWithPooling instead of a tensor directly. Extract the
pooler_output when needed.
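
One way to handle both return types is a small duck-typed helper, sketched below. It is an assumption about the approach rather than the code in the mteb CLIP wrapper, which may use an explicit type check instead; `as_embedding` is a hypothetical name.

```python
from types import SimpleNamespace


def as_embedding(output):
    """Return the embedding whether `output` is already a tensor or a
    model-output object that carries it in `pooler_output`."""
    pooled = getattr(output, "pooler_output", None)
    return output if pooled is None else pooled


# Stand-ins for the two return shapes (no transformers import needed here).
old_style = [0.1, 0.2]
new_style = SimpleNamespace(pooler_output=[0.1, 0.2])
print(as_embedding(old_style) == as_embedding(new_style))  # True
```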

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

  • fix: add None check for dataset in zeroshot classification transform

Fixes mypy type errors where self.dataset could be None when accessing
.keys() and deleting splits in dataset_transform method.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>


Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> (6c00506)

Unknown

  • Backfill missing metadata for historic datasets (#4063)

  • Backfill missing metadata for historic datasets

Fill in missing TaskMetadata fields for ~90 historic datasets as
described in issue #2502. This includes:

  • Classification tasks (Polish, Chinese)
  • Clustering tasks (German, French, Spanish, Swedish, Chinese, Multilingual)
  • Pair classification tasks (Polish, Chinese)
  • Reranking tasks (English, French, Chinese)
  • Retrieval tasks (German, English, Japanese, Korean, Polish, Spanish, Chinese, Multilingual)
  • STS tasks (German, English, French, Korean, Spanish, Chinese)

Fields filled include: date, domains, task_subtypes, license,
annotations_creators, dialect, sample_creation, and bibtex_citation.

The _HISTORIC_DATASETS list is reduced from ~90 entries to just 4
aggregate tasks whose metadata computation has a separate issue
(the compute* methods return None for single-valued fields).
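
A check of this kind can be sketched as follows, using plain dicts in place of TaskMetadata objects. The field names come from the list above; the function names and the dict representation are hypothetical and not the actual mteb test.

```python
REQUIRED_FIELDS = (
    "date", "domains", "task_subtypes", "license",
    "annotations_creators", "dialect", "sample_creation", "bibtex_citation",
)


def missing_fields(metadata: dict) -> list:
    """Names of required fields that are absent or None."""
    return [field for field in REQUIRED_FIELDS if metadata.get(field) is None]


def check_historic_list(all_metadata: dict, historic: set) -> tuple:
    """Return (incomplete tasks not on the exemption list,
    exempted tasks that are now complete and should be removed from it)."""
    unlisted_incomplete = sorted(
        name for name, md in all_metadata.items()
        if missing_fields(md) and name not in historic
    )
    stale_entries = sorted(
        name for name in historic
        if name in all_metadata and not missing_fields(all_metadata[name])
    )
    return unlisted_incomplete, stale_entries
```

Requiring both returned lists to be empty keeps the exemption list from growing stale: a task leaves the list as soon as its metadata is filled in.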

Closes #2502

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

  • Fix type annotation in _compute_license method

Add StrURL to the return type and set type annotation to match
the license field type (Licenses | StrURL | None).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>


Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> (b5fb471)

2.7.25

07 Feb 18:14

2.7.25 (2026-02-07)

Fix

Unknown

  • fix: Remove the hardcoded batch_size=1 when generating text and image embeddings for Nemotron-Colembed-v2 models (#4054)

remove hardcoded batch_size 1 (1682b2f)

  • Update nemotron v2 citation (#4051)

update nemotron v2 citation (0be1df3)

2.7.24

05 Feb 10:36

2.7.24 (2026-02-05)

Fix

  • fix: leaderboard errors (#3969)

  • fix leaderboard

  • fix leaderboard errors

  • simplify

  • upd description (fd37337)

2.7.23

04 Feb 11:16

2.7.23 (2026-02-04)

Fix

  • fix: Fill in embedding and total parameters in ModelMeta (#4031)

  • Filling Embedding/Total Parameters in ModelMeta

  • Add parameter for other models

  • Add parameters for more models

  • Added exact value for n_parameters

  • Fix tests

  • set n_embedding_parameters to None

  • Add results of some more models

  • Add tests

  • Add _HISTORIC_MODELS list in test

  • Update tests/test_models/test_model_meta.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

  • fix tests

  • correct tests

  • fix _HISTORIC_MODELS list


Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> (bc6e6cb)

Unknown

  • dataset: Add ERESS reranking task (#3991)

  • dataset: Add ERESS reranking task

  • Add ERESSReranking task for e-commerce product relevance reranking
  • Dataset: thebajajra/eress with ~72k query-product pairs
  • Supports graded relevance (0-100 integer scale)
  • Main metric: nDCG@5
  • Add E-commerce domain and Product Reranking subtypes to TaskMetadata
  • Include descriptive statistics
  • fix: align dataset_transform signature with base class
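
To make the main metric concrete, here is a minimal sketch of nDCG@5 over graded relevance labels. It assumes linear gain (the raw 0-100 label) and a log2 rank discount; the release notes do not state which gain function the evaluator uses, so treat this as an illustration rather than the mteb scoring code.

```python
import math


def dcg_at_k(relevances: list, k: int) -> float:
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: list, k: int = 5) -> float:
    """nDCG@k for relevance labels listed in the order the model ranked them."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal


# A ranking that is already sorted by relevance scores exactly 1.0.
print(ndcg_at_k([100, 80, 50, 20, 0, 0]))  # 1.0
```

With graded labels, swapping a highly relevant product below an irrelevant one lowers the score in proportion to the label gap, which a binary metric would not capture.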

  • fix: dataset reuploaded, custom transformation removed

  • fix: rev updated with title + text combination

  • description moved away from docstring

  • Update mteb/tasks/reranking/eng/ecommerce_product_relevance_reranking.py


Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> (fe67f8e)

2.7.22

03 Feb 12:59

2.7.22 (2026-02-03)

Documentation

  • docs: Added changelog (#3741)

  • docs: Added changelog

  • Clean up docs to prepare for adding the changelog, by adding missing links and removing references to documentation that does not exist
  • Added what's new section
  • Added changes from 2.0 upwards. I might be missing some

I think going forward we can just update this as we go.

  • minor fix

  • added autogenerated changelog

  • rename

  • add autogenerated workflows

  • updates

  • update

  • update (2082d3e)

Fix

  • fix: backfilling historic tasks (#4034)

  • fix: backfilling historic tasks

  • Backfilled task metadata
  • extended test to ensure that backfilled tasks are removed from the historic list

addresses #2502

  • backfill citation, date, and task subtypes where only those are missing

  • Update mteb/tasks/pair_classification/pol/polish_pc.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

  • add famteb citation

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> (e542519)