Releases: embeddings-benchmark/mteb
2.8.1
2.8.1 (2026-02-17)
Fix
-
fix: Remove duplicate citations and add test to prevent it going forward (#4032)
-
test: add test to detect duplicate citations
-
quality
-
move changes to task file
-
fix false positives
-
update citations
-
add models and benchmarks
-
search close titles
-
fix maeb
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> (377f1b4)
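As a rough illustration of the duplicate-citation check described above, the sketch below normalizes the title of each BibTeX entry and flags pairs whose titles are nearly identical. The function names, the 0.9 similarity threshold, and the sample entries are assumptions made for this example; this is not the test added in #4032.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize_title(bibtex: str) -> str:
    """Extract the title field of a BibTeX entry and reduce it to lowercase words."""
    match = re.search(r"\btitle\s*=\s*\{(.*?)\}\s*,", bibtex, re.I | re.S)
    title = match.group(1) if match else ""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def find_close_titles(citations: dict[str, str], threshold: float = 0.9) -> list[tuple[str, str]]:
    """Return pairs of names whose citation titles are identical or nearly so."""
    titles = {name: normalize_title(bib) for name, bib in citations.items()}
    pairs = []
    for (a, title_a), (b, title_b) in combinations(sorted(titles.items()), 2):
        if title_a and title_b and SequenceMatcher(None, title_a, title_b).ratio() >= threshold:
            pairs.append((a, b))
    return pairs


# Hypothetical entries: TaskA and TaskB cite the same work with different formatting.
citations = {
    "TaskA": "@article{x2020,\n  title = {A Benchmark for Text Embeddings},\n  year = {2020},\n}",
    "TaskB": "@article{x2020b,\n  title = {A benchmark for text embeddings.},\n  year = {2020},\n}",
    "TaskC": "@inproceedings{y2021,\n  title = {Image Retrieval at Scale},\n  booktitle = {Proc.},\n}",
}
print(find_close_titles(citations))
```

Comparing normalized titles rather than raw strings is what lets the check catch entries that differ only in casing or punctuation.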
Unknown
-
benchmark: add 6 new VisRAG retrieval tasks and corresponding stats (#4059)
-
dataset: add 6 new VisRAG retrieval tasks and corresponding stats
- Introduced VisRAGRetArxivQA, VisRAGRetChartQA, VisRAGRetInfoVQA, VisRAGRetMPDocVQA, VisRAGRetPlotQA, and VisRAGRetSlideVQA classes for various retrieval tasks.
- Added JSON files containing descriptive statistics for each task, including sample counts, image dimensions, and query statistics.
- Updated the retrieval module's __init__.py to include the new tasks in the module exports.
-
fix a linter error
-
dataset: introduce VisRAG Retrieval Benchmark
-
fix: metadata update of VisRag
-
Update benchmark metadata
-
Update VisRAG datasets metadata, including one-line description and the domains
-
Update slideVQA domain
-
Add Aliases for VisRAG
-
Fix bibtex format
-
Update dataset metadata to point to the mteb versions
-
Remove redundant data loading
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com> (7a9e653)
-
fix: Remove MAEB+ and MAEB(extended) from leaderboard and add "beta" to all MAEB (#4103)
-
fix: Remove MAEB+ and MAEB(extended)
related to #3470
We could consider keeping the two benchmarks (they would still need to be beta, as the paper is under review) so they could change.
- fix: Remove MAEB+ and MAEB(extended) from leaderboard and add "beta" to all MAEB
related to #3470
- Remove MAEB+ and MAEB(extended) from leaderboard
- Added the beta tag to denote that these might change
Currently this is implemented by keeping the two temporary benchmarks. We could consider removing them as well (I am unsure how much of a burden it is for us to maintain them), but I would probably not add them to the leaderboard.
All of these changes should be backward compatible
-
docs: Added whatsnew
-
implement fixes
-
update description to explain beta status (77ac52b)
2.7.30
2.7.29
2.7.29 (2026-02-12)
Documentation
-
docs: Improved adding a benchmark docs (#4087)
-
docs: Improved adding a benchmark docs
Expanded the explanation to provide more information about the process.
- Update docs/contributing/adding_a_benchmark.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
-
fix
-
fix
-
minor heading change
-
Apply suggestions from code review
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> (8149d4a)
Fix
-
fix: constrain the transformers library version for jina-clip (#4061)
-
fix: constrain the transformers library version for jina-clip to avoid compatibility issue
-
add required package
-
add to conflicts
-
try to run again
-
upd lock
-
tmp
-
try
-
fix: pylate dependency on outdated version of transformers
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> (eee82cf)
Unknown
-
dataset: Add MTEB(spa) Spanish language benchmark (#4053)
-
dataset: Add MTEB(spa) Spanish language benchmark
Define MTEB(spa, v1) benchmark grouping 23 existing Spanish tasks
across 6 task types: Classification (8), Clustering (3),
PairClassification (2), Reranking (1), Retrieval (5), and STS (4).
-
fix: Replace MIRACLRetrieval with HardNegatives.v2 per review
-
fix: Remove tasks with known issues, add contact, reduce to 16 tasks
-
Apply suggestion from @KennethEnevoldsen
Co-authored-by: Clemente <clemente@Clementes-MacBook-Pro.local>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> (2507bec)
-
Move MIEB datasets to mteb HuggingFace org (#4070)
-
Move 15 MIEB datasets to mteb HuggingFace org
Update dataset paths and revisions for tasks that now use datasets
forked to the mteb org:
- MMSoc_HatefulMemes, MMSoc_Memotion (Ahren09 -> mteb)
- blink-it2i, blink-it2i-multi, blink-it2t, blink-it2t-multi (JamieSJS -> mteb)
- gld-v2-i2t, imagecode, imagecode-multi (JamieSJS -> mteb)
- imagenet-10, imagenet-dog-15, met (JamieSJS -> mteb)
- r-oxford-easy-multi, r-oxford-medium-multi, r-oxford-hard-multi (JamieSJS -> mteb)
Part of #4049
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
-
transfer mrbench
-
Move 8 isaacchung MIEB datasets to mteb HuggingFace org
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move 12 MIEB datasets to mteb HuggingFace org
Datasets moved:
- dpdl-benchmark/sun397
- ethz/food101
- floschne/xflickrco
- floschne/xm3600
- flwrlabs/ucf101
- nyu-visionx/CV-Bench
- tanganke/dtd
- tanganke/stl10
- timm/eurosat-rgb
- timm/resisc45
- uoft-cs/cifar10
- uoft-cs/cifar100
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move 15 MIEB datasets to mteb HuggingFace org
Datasets moved:
- JamieSJS/r-paris-easy-multi
- JamieSJS/r-paris-medium-multi
- JamieSJS/r-paris-hard-multi
- JamieSJS/rp2k
- JamieSJS/sketchy
- JamieSJS/stanford-online-products
- JamieSJS/vizwiz
- JamieSJS/vqa-2
- Pixel-Linguist/rendered-sts12
- Pixel-Linguist/rendered-sts13
- Pixel-Linguist/rendered-sts14
- Pixel-Linguist/rendered-sts15
- Pixel-Linguist/rendered-sts16
- ylecun/mnist
- zh-plus/tiny-imagenet
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move 7 clip-benchmark datasets to mteb HuggingFace org
Datasets moved:
- clip-benchmark/wds_country211
- clip-benchmark/wds_fer2013
- clip-benchmark/wds_gtsrb
- clip-benchmark/wds_renderedsst2
- clip-benchmark/wds_vtab-clevr_closest_object_distance
- clip-benchmark/wds_vtab-clevr_count_all
- clip-benchmark/wds_vtab-pcam
Note: wds_imagenet1k failed due to storage limits.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move 14 MIEB datasets to mteb HuggingFace org
Datasets migrated:
- clip-benchmark/wds_imagenet1k → mteb/wds_imagenet1k
- m-a-p/SciMMIR → mteb/SciMMIR
- yjkimstats/SUGARCREPE_fmt → mteb/SUGARCREPE_fmt
- nelorth/oxford-flowers → mteb/oxford-flowers
- vidore/arxivqa_test_subsampled_beir → mteb/arxivqa_test_subsampled_beir
- vidore/docvqa_test_subsampled_beir → mteb/docvqa_test_subsampled_beir
- vidore/infovqa_test_subsampled_beir → mteb/infovqa_test_subsampled_beir
- vidore/shiftproject_test_beir → mteb/shiftproject_test_beir
- vidore/syntheticDocQA_artificial_intelligence_test_beir → mteb/syntheticDocQA_artificial_intelligence_test_beir
- vidore/syntheticDocQA_energy_test_beir → mteb/syntheticDocQA_energy_test_beir
- vidore/syntheticDocQA_government_reports_test_beir → mteb/syntheticDocQA_government_reports_test_beir
- vidore/syntheticDocQA_healthcare_industry_test_beir → mteb/syntheticDocQA_healthcare_industry_test_beir
- vidore/tabfquad_test_subsampled_beir → mteb/tabfquad_test_subsampled_beir
- vidore/tatdqa_test_beir → mteb/tatdqa_test_beir
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- update rest
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> (2ef04f8)
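As far as the lists above show, each migration keeps the repository name and changes only the owning organization. A minimal sketch of that mapping follows; `to_mteb_org` is a hypothetical helper written for this example and is not part of mteb, and each task's pinned revision would still need to be updated separately.

```python
def to_mteb_org(repo_id: str, org: str = "mteb") -> str:
    """Map 'owner/name' to '<org>/name', keeping the repository name unchanged."""
    owner, sep, name = repo_id.partition("/")
    if not sep or not name:
        raise ValueError(f"expected 'owner/name', got {repo_id!r}")
    return f"{org}/{name}"


# Two of the repositories named in the migration list above.
migrated = [to_mteb_org(r) for r in ("m-a-p/SciMMIR", "vidore/tatdqa_test_beir")]
print(migrated)
```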
-
model: add voyage-4-nano (#4086)
-
model: add voyage-4-nano model implementation
-
Apply suggestion from @Samoed
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> (2ce07c4)
2.7.28
2.7.27
2.7.27 (2026-02-11)
Documentation
-
docs: Outline for adding a task documentation (#4082)
-
docs: Outline for adding a task documentation
This is a suggested structure; the PR is just to get feedback before I finish it up.
fixes #4077
-
upd docs
-
install dependencies in ci
-
add example with retrieval
-
filled out the missing segments
-
lint and format
-
Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
-
fix numbering and indentation
-
add missing imports
-
fix links
-
add full example for retrieval dataset
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> (42b8058)
-
docs: Improve docstring for some of the main abstasks (#4083)
-
docs: fix AbsTaskClassification docstring formatting and improve docstrings for some of the main tasks
-
format (50bd0fa)
Fix
- fix: Add performance per language tab to more benchmarks (#4066)
Add Performance per language Tab to more benchmarks (4ca1922)
Unknown
-
dataset: add 'law-ir_ko' dataset for IR task (#4052)
-
law_ir_ko
-
Update mteb/tasks/retrieval/kor/law_ir_ko.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
-
law_ir_ko info revision
-
description
-
metadata-info rev
-
metadata-info rev
-
Update mteb/tasks/retrieval/kor/law_ir_ko.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
-
statistics(), reference
-
format citation
-
author & howpublished rev
-
make lint
-
description rev
-
Update mteb/tasks/retrieval/kor/law_ir_ko.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
- Update mteb/tasks/retrieval/kor/law_ir_ko.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> (1cdc662)
- model: Add ModelMeta for geoffsee/auto-g-embed-st (#4074)
Add ModelMeta for geoffsee/auto-g-embed-st (81540a2)
-
Add MetaCLIP 2 model integration (#4065)
-
Add MetaCLIP 2 model integration
Add support for facebook/metaclip-2-mt5-worldwide-b32, a multilingual
vision-language model using mT5 tokenizer for worldwide language support.
- 254M parameters, 512 embedding dimension
- Supports 99 languages (XLMR language set)
- Handles MetaCLIP 2's BaseModelOutputWithPooling return format
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add n_embedding_parameters for MetaCLIP 2 model
Set n_embedding_parameters to 128,057,344 (mT5 vocab size 250,112 × embed_dim 512)
to fix test_n_embedding_parameters test failure.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add training metadata for MetaCLIP 2 model
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Apply suggestion from @Samoed
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> (4280dd9)
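The n_embedding_parameters value quoted above is the size of the token embedding matrix: vocabulary size times embedding dimension. A quick check of that arithmetic, using only the two numbers stated in the commit message:

```python
# Both values are taken from the commit message above.
vocab_size = 250_112  # mT5 vocabulary size
embed_dim = 512       # embedding dimension

n_embedding_parameters = vocab_size * embed_dim
print(f"{n_embedding_parameters:,}")  # prints 128,057,344
```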
2.7.26
2.7.26 (2026-02-07)
Fix
-
fix: filter corrupted image in Birdsnap (#4068)
-
fix: filter corrupted image in Birdsnap and drop unused splits in zero-shot tasks
- Filter out corrupted/truncated image at index 3854 in Birdsnap train split
- Add dataset_transform to AbsTaskZeroShotClassification to keep only eval splits
(zero-shot tasks don't need train splits)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- fix: handle BaseModelOutputWithPooling in CLIP model wrapper
In transformers 5.x, get_text_features and get_image_features return
BaseModelOutputWithPooling instead of a tensor directly. Extract the
pooler_output when needed.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- fix: add None check for dataset in zeroshot classification transform
Fixes mypy type errors where self.dataset could be None when accessing
.keys() and deleting splits in dataset_transform method.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> (6c00506)
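The CLIP wrapper fix above amounts to accepting either return shape. A minimal sketch of that pattern follows. `FakePoolingOutput` is a stand-in defined here so the example runs without transformers installed, and `as_embeddings` is a hypothetical helper name, not the function used in mteb.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class FakePoolingOutput:
    """Stand-in for an output object that exposes a `pooler_output` attribute."""

    pooler_output: Any
    last_hidden_state: Any = None


def as_embeddings(features: Any) -> Any:
    """Return pooled embeddings whether `features` is a raw tensor or an output object."""
    pooled = getattr(features, "pooler_output", None)
    # Compare against None explicitly: truth-testing a multi-element tensor would raise.
    return features if pooled is None else pooled


tensor_like = [[0.1, 0.2], [0.3, 0.4]]                  # embeddings returned directly
wrapped = FakePoolingOutput(pooler_output=tensor_like)  # embeddings wrapped in an output object

print(as_embeddings(tensor_like) is tensor_like)
print(as_embeddings(wrapped) is tensor_like)
```

Checking for the attribute rather than the library version keeps the same code path working before and after the return type changed.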
Unknown
-
Backfill missing metadata for historic datasets (#4063)
-
Backfill missing metadata for historic datasets
Fill in missing TaskMetadata fields for ~90 historic datasets as
described in issue #2502. This includes:
- Classification tasks (Polish, Chinese)
- Clustering tasks (German, French, Spanish, Swedish, Chinese, Multilingual)
- Pair classification tasks (Polish, Chinese)
- Reranking tasks (English, French, Chinese)
- Retrieval tasks (German, English, Japanese, Korean, Polish, Spanish, Chinese, Multilingual)
- STS tasks (German, English, French, Korean, Spanish, Chinese)
Fields filled include: date, domains, task_subtypes, license,
annotations_creators, dialect, sample_creation, and bibtex_citation.
The _HISTORIC_DATASETS list is reduced from ~90 entries to just 4
aggregate tasks whose metadata computation has a separate issue
(the compute* methods return None for single-valued fields).
Closes #2502
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix type annotation in _compute_license method
Add StrURL to the return type and set type annotation to match
the license field type (Licenses | StrURL | None).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> (b5fb471)
2.7.25
2.7.24
2.7.23
2.7.23 (2026-02-04)
Fix
-
fix: Fill in embedding and total parameters in ModelMeta (#4031)
-
Filling Embedding/Total Parameters in ModelMeta
-
Add parameter for other models
-
Add parameters for more models
-
Added exact value for n_parameters
-
Fix tests
-
set n_embedding_parameters to None
-
Add results of some more models
-
Add tests
-
Add _HISTORIC_MODELS list in test
-
Update tests/test_models/test_model_meta.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
-
fix tests
-
correct tests
-
fix _HISTORIC_MODELS list
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> (bc6e6cb)
Unknown
-
dataset: Add ERESS reranking task (#3991)
-
dataset: Add ERESS reranking task
- Add ERESSReranking task for e-commerce product relevance reranking
- Dataset: thebajajra/eress with ~72k query-product pairs
- Supports graded relevance (0-100 integer scale)
- Main metric: nDCG@5
- Add E-commerce domain and Product Reranking subtypes to TaskMetadata
- Include descriptive statistics
-
fix: align dataset_transform signature with base class
-
fix: dataset reuploaded, custom transformation removed
-
fix: rev updated with title + text combination
-
description moved away from docstring
-
Update mteb/tasks/reranking/eng/ecommerce_product_relevance_reranking.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> (fe67f8e)
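For readers unfamiliar with the main metric, here is a small sketch of nDCG@5 on graded relevance labels in the 0-100 range. It assumes linear gain and a log2 position discount; the evaluator used by mteb may differ in details such as tie handling, and the example labels are made up.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k items, using linear gain."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: list[float], k: int = 5) -> float:
    """nDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels (0-100) of products, in the order a model ranked them.
model_ranking = [60, 100, 0, 30, 80, 10]
print(round(ndcg_at_k(model_ranking, k=5), 4))
```

With graded labels, a ranking is rewarded for placing the most relevant products first, not merely for retrieving them, which is why swapping the first two items in the example lowers the score.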
2.7.22
2.7.22 (2026-02-03)
Documentation
-
docs: Added changelog (#3741)
-
docs: Added changelog
- Clean up docs to prepare for adding the changelog by adding missing links and removing references to documentation that does not exist
- Added a what's new section
- Added changes from 2.0 upwards. I might be missing some.
I think going forward we can just update this as we go.
-
minor fix
-
added autogenerated changelog
-
rename
-
add autogenerated workflows
-
updates
-
update
-
update (2082d3e)
Fix
-
fix: backfilling historic tasks (#4034)
-
fix: backfilling historic tasks
- Backfilled task metadata
- extended test to ensure that backfilled tasks are removed from the historic list
addresses #2502
-
backfill citation, date, and task subtypes where only those are missing
-
Update mteb/tasks/pair_classification/pol/polish_pc.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
- add famteb citation
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> (e542519)