Research Project: MTEB Gym #3068
-
The idea sounds great to me; it would also be a simple way to allow evaluations on unlabelled datasets (at least for retrieval).
-
I chatted with @Muennighoff offline and really do think this is an exciting project! I think some folks at Hopkins can help with the implementation (tagging @robro612 @rekriz11). I was thinking this eval setup would be especially useful for tasks that are hard to annotate: complex material in science/law, or even various multimodal aspects, especially since VLMs are so good these days. It could help alleviate some of MTEB's gaps in the any-to-any retrieval setting. Some thoughts:
What are your thoughts @Muennighoff @KennethEnevoldsen?
-
How will this differ from standard retrieval/STS tasks? We could create a new benchmark with automatic evaluation, but I'm not sure how LLMs would be used in this scenario. We could also add more LLM-as-a-judge evaluators to make them easier to use in the current setup, if needed. A rough sketch of what such an evaluator could look like is below.
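For example, a pointwise LLM-as-a-judge relevance metric for retrieval could be quite small. This is a minimal, hypothetical sketch, not MTEB's actual evaluator API: `judge_fn` stands in for any user-supplied callable that sends a prompt to a judge model and returns its text, and the metric is simply judged precision@k.

```python
from typing import Callable

# Hypothetical sketch of an LLM-as-a-judge relevance metric for retrieval.
# `judge_fn` is an assumed user-supplied callable wrapping any judge model;
# none of these names come from MTEB's codebase.
PROMPT = (
    "Query: {query}\n\nDocument: {doc}\n\n"
    "Is this document relevant to the query? Answer only 'yes' or 'no'."
)

def judged_precision_at_k(
    query: str,
    retrieved_docs: list[str],
    judge_fn: Callable[[str], str],
    k: int = 10,
) -> float:
    """Fraction of the top-k retrieved documents the judge deems relevant."""
    top_k = retrieved_docs[:k]
    if not top_k:
        return 0.0
    hits = sum(
        judge_fn(PROMPT.format(query=query, doc=doc)).strip().lower().startswith("yes")
        for doc in top_k
    )
    return hits / len(top_k)
```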
-
The MTEB arena is fun but it takes time to get rankings. When evaluating generative LLMs, one solution is to use judge models, ideally pairwise for best results. We can do the same for the arena to allow running it quickly offline - a "gym" 🙂
The leaderboard could either be added to the existing one as a side tab or be kept separate.
Curious to hear people's thoughts & if anybody is interested in leading this!
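To make the pairwise-judge idea concrete, here is a rough sketch of how one offline "gym" pass could work; all names are illustrative, and `judge_fn` is an assumed callable wrapping whatever judge model is used. For each query, the judge compares the top result from two embedding models, and the leaderboard is just the pairwise win rate. In practice you would also want to swap the A/B presentation order to reduce position bias.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable

# Illustrative sketch only: rank embedding models offline from pairwise
# LLM-judge verdicts over their top retrieved results.
PAIRWISE_PROMPT = (
    "Query: {query}\n\nResult A: {a}\n\nResult B: {b}\n\n"
    "Which result answers the query better? Reply with exactly 'A' or 'B'."
)

def offline_arena(
    queries: list[str],
    top_results: dict[str, list[str]],  # model name -> top-1 result per query
    judge_fn: Callable[[str], str],
) -> dict[str, float]:
    """Return each model's pairwise win rate under the judge."""
    wins, games = defaultdict(int), defaultdict(int)
    for model_a, model_b in combinations(top_results, 2):
        for i, query in enumerate(queries):
            verdict = judge_fn(
                PAIRWISE_PROMPT.format(
                    query=query,
                    a=top_results[model_a][i],
                    b=top_results[model_b][i],
                )
            ).strip().upper()
            winner = model_a if verdict.startswith("A") else model_b
            wins[winner] += 1
            games[model_a] += 1
            games[model_b] += 1
    return {m: wins[m] / max(games[m], 1) for m in top_results}
```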