
DAGE-127: [M2] Add MTEB top model comparison metrics#1

Merged
nseidan merged 3 commits into main from dage-127_model_comparison
Dec 3, 2025

Conversation


@nseidan nseidan commented Nov 27, 2025

  • Added MTEB Leaderboard comparison metrics, e.g.:
....
  "evaluation_time": 555.2600812911987,
  "kg_co2_emissions": null,
  "mteb_leaderboard_model_comparison": {
    "top_model": "Qwen/Qwen3-Embedding-8B",
    "top_model_avg_main_score": 69.4402,
    "user_model": "sentence-transformers/all-MiniLM-L12-v2",
    "user_model_mteb_avg_main_score": 45.455999999999996,
    "user_model_custom_task_main_score": 50.624,
    "model_main_score_diff": "Your model's main_score on custom task main_score=50.62 vs. MTEB leaderboard shown avg_main_score=45.46",
    "warning": "Your model=sentence-transformers/all-MiniLM-L12-v2 is 27.10% worse than the top model=Qwen/Qwen3-Embedding-8B."
  }

4. Return user model (model_name), top model avg_main_scores.
"""
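For clarity, here is a hedged sketch of the comparison math behind the "27.10% worse" warning in the example payload above: the figure matches (top_model_avg_main_score − user_model_custom_task_main_score) / top_model_avg_main_score × 100. The function name is illustrative, not the PR's actual implementation.

```python
# Illustrative sketch (not the PR's code) of the percentage shown in the
# "warning" field of the example payload above.

def percent_worse_than_top(top_avg_main_score: float,
                           user_custom_task_score: float) -> float:
    """Percentage by which the user model trails the top leaderboard model."""
    gap = top_avg_main_score - user_custom_task_score
    return round(gap / top_avg_main_score * 100, 2)

# Values taken from the example payload:
pct = percent_worse_than_top(69.4402, 50.624)  # ≈ 27.1
```

Note that the warning compares the user model's custom-task score (50.624), not its MTEB average (45.456), against the top model's leaderboard average.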

benchmark = mteb.get_benchmark("MTEB(eng, v2)")
Collaborator

Just a comment for future work: right now we are looking at the English leaderboard. What happens if the dataset is not in English? We should address that comparison in some way.

Collaborator Author
Our tool focuses only on English as of now; handling other languages would require different approaches, starting with LLMs.
To your question: there is a separate multilingual MTEB leaderboard covering 200+ languages.

benchmark = mteb.get_benchmark("MTEB(eng, v2)")
tasks = [t for t in benchmark.tasks if t.metadata.type == task_type]
# keep only "mostly complete" models, i.e. exclude models that were run on just a few of the benchmark's tasks
num_tasks = len(tasks) * 0.7
Collaborator
Why do we filter on the number of tasks? Are there many models with few tasks executed?

Collaborator Author
See the inline comment: we keep only "mostly complete" models, excluding models that were run on just a few of the benchmark's tasks.

Collaborator Author

@nseidan nseidan Dec 1, 2025
FYI: mteb version 2.x added models even when they were run on only a few tasks, and the MTEB leaderboard shows v2.x mteb results.
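To make the 70% coverage cutoff discussed above concrete, here is a minimal sketch with a hypothetical helper and toy data; it is not the leaderboard-loading code from this PR, and the names are illustrative.

```python
# Hypothetical sketch of the "mostly complete" filter: keep only models whose
# leaderboard results cover at least 70% of the benchmark's tasks.

def filter_mostly_complete(model_task_scores: dict, num_benchmark_tasks: int,
                           coverage: float = 0.7) -> dict:
    """Drop models evaluated on fewer than `coverage` of the benchmark tasks."""
    min_tasks = num_benchmark_tasks * coverage
    return {model: scores for model, scores in model_task_scores.items()
            if len(scores) >= min_tasks}

# Toy data: a 10-task benchmark; "partial-model" ran on only 3 tasks.
scores = {
    "Qwen/Qwen3-Embedding-8B": {f"task{i}": 70.0 for i in range(10)},
    "partial-model": {f"task{i}": 60.0 for i in range(3)},
}
kept = filter_mostly_complete(scores, num_benchmark_tasks=10)
```

With the default 0.7 cutoff, a model needs results on at least 7 of the 10 tasks to be kept, so only the fully-covered model survives the filter here.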

@nseidan nseidan merged commit 156ac60 into main Dec 3, 2025
3 checks passed
@nseidan nseidan deleted the dage-127_model_comparison branch December 3, 2025 05:02
