DAGE-127: [M2] Add MTEB top model comparison metrics #1
Conversation
…ion into dage-127_model_comparison
    4. Return user model (model_name), top model avg_main_scores.
    """

benchmark = mteb.get_benchmark("MTEB(eng, v2)")
Just a comment for future work: right now we are looking at the English leaderboard. What happens if the dataset is not in English? We should address this comparison in some way.
Our tool focuses only on English as of now; supporting other languages will need different approaches, starting from the LLMs. To your question: there is a separate multilingual MTEB leaderboard covering 200+ languages.
benchmark = mteb.get_benchmark("MTEB(eng, v2)")
tasks = [t for t in benchmark.tasks if t.metadata.type == task_type]
# keep only "mostly complete" models, i.e. exclude models that were run on only a few tasks
num_tasks = len(tasks) * 0.7
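The 70% coverage threshold in the snippet above can be sketched with dummy data; this is a minimal illustration with hypothetical model names and task sets, not the PR's actual implementation, and it has no `mteb` dependency:

```python
# Sketch of the "mostly complete" coverage filter: each model maps to the set
# of benchmark tasks it has results for. All data below is made up.
def filter_mostly_complete(model_results: dict[str, set[str]],
                           tasks: list[str],
                           coverage: float = 0.7) -> list[str]:
    """Keep models with results for at least `coverage` of the tasks."""
    num_tasks = len(tasks) * coverage
    return [model for model, done in model_results.items()
            if len(done & set(tasks)) >= num_tasks]

tasks = ["STS12", "STS13", "STS14", "STS15"]          # hypothetical task list
model_results = {
    "model-a": {"STS12", "STS13", "STS14", "STS15"},  # 4/4 -> kept
    "model-b": {"STS12", "STS13", "STS14"},           # 3/4 = 75% -> kept
    "model-c": {"STS12"},                             # 1/4 = 25% -> dropped
}
print(filter_mostly_complete(model_results, tasks))   # -> ['model-a', 'model-b']
```

With `len(tasks) = 4` the threshold is `2.8`, so a model needs results on at least 3 of the 4 tasks to be compared.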
Why do we filter on the number of tasks? Are there many models with few tasks executed?
See the comment in the code: models that were run on only a few tasks are excluded, so we compare against "mostly complete" models only.
FYI: MTEB version 2.x added models even when they were run on only a few tasks, and the MTEB leaderboard shows v2.x results.
E.g.