Research Project: MTEB Gym #3068
-
The idea sounds great to me; it would also be a simple way to allow evaluations on unlabelled datasets (at least for retrieval).
-
I chatted with @Muennighoff offline and really do think this is an exciting project! I think some folks at Hopkins can help with the implementation (tagging @robro612 @rekriz11). I was thinking this eval setup would be especially useful for tasks that are hard to annotate: complex material in science/law, or even various multimodal aspects, especially since VLMs are so good these days. It could help alleviate some of MTEB's gaps in the any-to-any retrieval setting. Some thoughts:
What are your thoughts @Muennighoff @KennethEnevoldsen?
-
How will this differ from standard retrieval/STS tasks? We could create a new benchmark with automatic evaluation, but I'm not sure how LLMs would be used in this scenario. We could also add more LLM-as-a-judge evaluators to make them easier to use in the current setup, if needed. A rough sketch of what such an evaluator could look like is below.
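For example, a pointwise LLM-as-a-judge relevance metric for retrieval could be quite small. This is a minimal, hypothetical sketch, not MTEB's actual evaluator API: `judge_fn` stands in for any user-supplied callable that sends a prompt to a judge model and returns its text, and the metric is simply judged precision@k.

```python
from typing import Callable

# Hypothetical sketch of an LLM-as-a-judge relevance metric for retrieval.
# `judge_fn` is an assumed user-supplied callable wrapping any judge model;
# none of these names come from MTEB's codebase.
PROMPT = (
    "Query: {query}\n\nDocument: {doc}\n\n"
    "Is this document relevant to the query? Answer only 'yes' or 'no'."
)

def judged_precision_at_k(
    query: str,
    retrieved_docs: list[str],
    judge_fn: Callable[[str], str],
    k: int = 10,
) -> float:
    """Fraction of the top-k retrieved documents the judge deems relevant."""
    top_k = retrieved_docs[:k]
    if not top_k:
        return 0.0
    hits = sum(
        judge_fn(PROMPT.format(query=query, doc=doc)).strip().lower().startswith("yes")
        for doc in top_k
    )
    return hits / len(top_k)
```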
-
The MTEB arena is fun but it takes time to get rankings. When evaluating generative LLMs, one solution is to use judge models, ideally pairwise for best results. We can do the same for the arena to allow running it quickly offline - a "gym" 🙂
The leaderboard could either be added to the existing one as a side tab or be kept separate.
Curious to hear people's thoughts & if anybody is interested in leading this!
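To make the pairwise-judge idea concrete, here is a rough sketch of how one offline "gym" pass could work; all names are illustrative, and `judge_fn` is an assumed callable wrapping whatever judge model is used. For each query, the judge compares the top result from two embedding models, and the leaderboard is just the pairwise win rate. In practice you would also want to swap the A/B presentation order to reduce position bias.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable

# Illustrative sketch only: rank embedding models offline from pairwise
# LLM-judge verdicts over their top retrieved results.
PAIRWISE_PROMPT = (
    "Query: {query}\n\nResult A: {a}\n\nResult B: {b}\n\n"
    "Which result answers the query better? Reply with exactly 'A' or 'B'."
)

def offline_arena(
    queries: list[str],
    top_results: dict[str, list[str]],  # model name -> top-1 result per query
    judge_fn: Callable[[str], str],
) -> dict[str, float]:
    """Return each model's pairwise win rate under the judge."""
    wins, games = defaultdict(int), defaultdict(int)
    for model_a, model_b in combinations(top_results, 2):
        for i, query in enumerate(queries):
            verdict = judge_fn(
                PAIRWISE_PROMPT.format(
                    query=query,
                    a=top_results[model_a][i],
                    b=top_results[model_b][i],
                )
            ).strip().upper()
            winner = model_a if verdict.startswith("A") else model_b
            wins[winner] += 1
            games[model_a] += 1
            games[model_b] += 1
    return {m: wins[m] / max(games[m], 1) for m in top_results}
```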