
DAGE-127: [M2] Add MTEB top model comparison metrics#1

Merged
nseidan merged 3 commits into main from dage-127_model_comparison
Dec 3, 2025

Conversation


@nseidan nseidan commented Nov 27, 2025

  • Added MTEB Leaderboard comparison metrics, e.g.:
....
  "evaluation_time": 555.2600812911987,
  "kg_co2_emissions": null,
  "mteb_leaderboard_model_comparison": {
    "top_model": "Qwen/Qwen3-Embedding-8B",
    "top_model_avg_main_score": 69.4402,
    "user_model": "sentence-transformers/all-MiniLM-L12-v2",
    "user_model_mteb_avg_main_score": 45.455999999999996,
    "user_model_custom_task_main_score": 50.624,
    "model_main_score_diff": "Your model's main_score on custom task main_score=50.62 vs. MTEB leaderboard shown avg_main_score=45.46",
    "warning": "Your model=sentence-transformers/all-MiniLM-L12-v2 is 27.10% worse than the top model=Qwen/Qwen3-Embedding-8B."
  }

4. Return user model (model_name), top model avg_main_scores.
"""
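For clarity, here is a hedged sketch of the comparison math behind the "27.10% worse" warning in the example payload above: the figure matches (top_model_avg_main_score − user_model_custom_task_main_score) / top_model_avg_main_score × 100. The function name is illustrative, not the PR's actual implementation.

```python
# Illustrative sketch (not the PR's code) of the percentage shown in the
# "warning" field of the example payload above.

def percent_worse_than_top(top_avg_main_score: float,
                           user_custom_task_score: float) -> float:
    """Percentage by which the user model trails the top leaderboard model."""
    gap = top_avg_main_score - user_custom_task_score
    return round(gap / top_avg_main_score * 100, 2)

# Values taken from the example payload:
pct = percent_worse_than_top(69.4402, 50.624)  # ≈ 27.1
```

Note that the warning compares the user model's custom-task score (50.624), not its MTEB average (45.456), against the top model's leaderboard average.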

benchmark = mteb.get_benchmark("MTEB(eng, v2)")
Collaborator

Just a comment for future work: right now we are looking at the English leaderboard. What happens if the dataset is not in English? We should address that comparison in some way.

Collaborator Author
Our tool focuses only on English as of now; handling other languages would require different approaches, starting with LLMs.
To your question: there is a separate multilingual MTEB leaderboard covering 200+ languages.

benchmark = mteb.get_benchmark("MTEB(eng, v2)")
tasks = [t for t in benchmark.tasks if t.metadata.type == task_type]
# keep only "mostly complete" models, i.e. exclude models that were run on just a few of the benchmark's tasks
num_tasks = len(tasks) * 0.7
Collaborator
Why do we filter on the number of tasks? Are there many models with few tasks executed?

Collaborator Author
See the inline comment: we keep only "mostly complete" models, excluding models that were run on just a few of the benchmark's tasks.

Collaborator Author

@nseidan nseidan Dec 1, 2025
FYI: mteb version 2.x added models even when they were run on only a few tasks, and the MTEB leaderboard shows v2.x mteb results.
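To make the 70% coverage cutoff discussed above concrete, here is a minimal sketch with a hypothetical helper and toy data; it is not the leaderboard-loading code from this PR, and the names are illustrative.

```python
# Hypothetical sketch of the "mostly complete" filter: keep only models whose
# leaderboard results cover at least 70% of the benchmark's tasks.

def filter_mostly_complete(model_task_scores: dict, num_benchmark_tasks: int,
                           coverage: float = 0.7) -> dict:
    """Drop models evaluated on fewer than `coverage` of the benchmark tasks."""
    min_tasks = num_benchmark_tasks * coverage
    return {model: scores for model, scores in model_task_scores.items()
            if len(scores) >= min_tasks}

# Toy data: a 10-task benchmark; "partial-model" ran on only 3 tasks.
scores = {
    "Qwen/Qwen3-Embedding-8B": {f"task{i}": 70.0 for i in range(10)},
    "partial-model": {f"task{i}": 60.0 for i in range(3)},
}
kept = filter_mostly_complete(scores, num_benchmark_tasks=10)
```

With the default 0.7 cutoff, a model needs results on at least 7 of the 10 tasks to be kept, so only the fully-covered model survives the filter here.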

@nseidan nseidan merged commit 156ac60 into main Dec 3, 2025
3 checks passed
@nseidan nseidan deleted the dage-127_model_comparison branch December 3, 2025 05:02
