[Discussion] Add dedicated display for RTEB benchmark results #3096

q275343119 · 2025-08-29T03:46:24Z

q275343119
Aug 29, 2025
Collaborator

Description of the feature

We need to display RTEB results in the interface.
To support this, we are opening this issue to discuss the related changes before finalizing the implementation.

A demo of the current implementation can be found here:
https://huggingface.co/spaces/SmileXing/leaderboard

There are currently three points that need discussion:

1. Data filtering
   - When the benchmark is RTEB, only RTEB model results are displayed.
   - Question: How should we determine which models belong to RTEB and should be shown?
2. Field adjustments
   - When the benchmark is RTEB, the zero-shot field is hidden (including the zero-shot option in advanced model filters).
   - Question: What do you think about this adjustment?
3. Model ranking
   - When the benchmark is RTEB, the ranking logic is changed to use the RTEB-specific algorithm.
   - Question: What do you think about this change?

Samoed · 2025-08-29T06:53:00Z

Samoed
Aug 29, 2025
Maintainer

> Question: How should we determine which models belong to RTEB and should be shown?

How do you select models for RTEB?

> Question: What do you think about this adjustment?

I don't think that you can hide it in the advanced model filters.

> When the benchmark is RTEB, the ranking logic is changed to use the RTEB-specific algorithm.

How different is your logic?

2 replies

q275343119 Aug 30, 2025
Collaborator Author

> Question: How should we determine which models belong to RTEB and should be shown?
>
> How do you select models for RTEB?

The models listed in https://github.com/embedding-benchmark/rteb/blob/main/results/models.json.
Maybe we can build the list from this JSON.

Samoed Aug 30, 2025
Maintainer

Why do you want to select only a limited list of models? For the MTEB leaderboard, we select models automatically if they have results.
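To make the "select automatically if they have results" idea concrete, here is a minimal sketch. The `results` layout, the task names, and the `models_with_results` helper are hypothetical illustrations, not the mteb API or the leaderboard's actual selection code.

```python
# Hypothetical data: model name -> {task name: score}. Not the real mteb data model.
results = {
    "model-a": {"TaskX": 0.71, "TaskY": 0.64},
    "model-b": {"TaskX": 0.69},
    "model-c": {},
}
benchmark_tasks = {"TaskX", "TaskY"}


def models_with_results(results, tasks, require_all=False):
    """Return the models that have results on the benchmark's tasks.

    With require_all=False a single covered task is enough;
    with require_all=True every task must be covered.
    """
    selected = []
    for model, scores in results.items():
        covered = tasks.intersection(scores)  # tasks this model was run on
        if covered == tasks or (covered and not require_all):
            selected.append(model)
    return sorted(selected)


shown = models_with_results(results, benchmark_tasks)
fully_evaluated = models_with_results(results, benchmark_tasks, require_all=True)
```

Under this assumption no curated model list is needed: a model appears as soon as it has results, and `require_all` only controls how strict the coverage check is.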

KennethEnevoldsen · 2025-08-29T07:28:30Z

KennethEnevoldsen
Aug 29, 2025
Maintainer

Question: How should we determine which models belong to RTEB and should be shown?

Why not just show all models that are evaluated on the benchmark?

zero-shot filter

So, the argument would be that closed datasets become a better proxy for generalization? That way, zero-shot is not a reasonable annotation.

I would be interested to see when this is the case and when it is not. In general, though, I think it is a reasonable change given the closed datasets.

ranking

Mean has some issues, as noted in the MMTEB paper; I would suspect that Borda would generally provide a better ranking.

What is the reason for changing it?

5 replies

fzliu Aug 29, 2025

For zero-shot:

1. We tried to select open datasets which have not been trained on by any model, but going through all papers to get an accurate, numerical zero-shot value will take a significant amount of time and is fairly error-prone.
2. The difference in performance between open and closed datasets should be a reasonable proxy as to whether or not a model overfits to the open datasets.

2 is perhaps the stronger argument.
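The open-versus-closed proxy described above could be sketched as follows. The scores and the `open_closed_gap` helper are made up for illustration; this is an assumption about how such a gap might be computed, not how RTEB computes it.

```python
from statistics import mean


def open_closed_gap(open_scores, closed_scores):
    """Mean score on open datasets minus mean score on closed datasets.

    A large positive gap hints that a model may be overfitting to the open data.
    """
    return mean(open_scores) - mean(closed_scores)


# Hypothetical per-dataset scores for two models.
scores = {
    "model-a": {"open": [0.70, 0.66], "closed": [0.69, 0.65]},
    "model-b": {"open": [0.78, 0.74], "closed": [0.62, 0.60]},
}

gaps = {m: open_closed_gap(s["open"], s["closed"]) for m, s in scores.items()}
# model-b looks stronger on open data alone, but its gap is much larger.
most_suspicious = max(gaps, key=gaps.get)
```

In this toy setup, model-b leads on the open datasets yet drops sharply on the closed ones, which is the kind of signal the gap is meant to surface without needing a per-paper zero-shot annotation.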

fzliu Aug 29, 2025

For ranking:

1. Borda rank significantly overweights easy and hard datasets. A 0.1% improvement on easy tasks such as DBpedia and Hagrid is given a much higher weight, since it's easy to shoot up many places on those tasks.
2. Borda rank is a bit harder to understand for newer users.

1 is perhaps the stronger argument.
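A small sketch of the first argument, using a plain Borda count (one point per model outscored on each task). The task names and scores are invented, and this is not necessarily the exact Borda variant the leaderboard uses.

```python
def borda_points(task_scores):
    """task_scores: {task: {model: score}} -> {model: total Borda points}.

    On each task a model earns one point for every model it strictly outscores.
    """
    totals = {}
    for scores in task_scores.values():
        for model, score in scores.items():
            beaten = sum(1 for other in scores.values() if other < score)
            totals[model] = totals.get(model, 0) + beaten
    return totals


# Hypothetical scores: one tightly clustered task, one widely spread task.
task_scores = {
    "clustered": {"a": 0.900, "b": 0.901, "c": 0.902, "d": 0.903},
    "spread": {"a": 0.60, "b": 0.50, "c": 0.40, "d": 0.30},
}

before = borda_points(task_scores)

# Model "a" improves by only 0.004 on the clustered task ...
task_scores["clustered"]["a"] = 0.904
after = borda_points(task_scores)
# ... yet jumps from last to first place there and gains three Borda points.
```

Here all four models start tied on Borda points, and a 0.004 change on the clustered task is enough to put model "a" clearly ahead, which is the sensitivity the comment describes.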

Samoed Aug 30, 2025
Maintainer

I don't think that we need to remove the zero-shot filter, as it makes it easier for users to understand that the models were not trained on the data.

fzliu Aug 31, 2025

Keeping zero-shot for now seems okay, but at a certain point not too far in the future, I think that it will simply not be possible to accurately calculate zero-shot values.

KennethEnevoldsen Aug 31, 2025
Maintainer

I think it is reasonable to make the summary table custom to the benchmarks - different benchmarks often call for different aggregations - examples where the current setup doesn't work that well are long-context retrieval and BRIGHT.
In the case of RTEB, given that it has an alternative (closed datasets), it is reasonable not to have it initially, but note that things might change depending on the beta feedback.

Ranking:

1. Hmm, I am not sure I buy this one - Borda provides a ranking, not a continuous estimate. Converting the mean to a rank gives the exact same problem, but before the conversion, the means significantly overweight tasks with high variance (and underweight those with low variance). This might favor low-quality tasks with more label noise, which seems to me like a horrible side effect.
2. This argument I can buy, but given that we use Borda rank currently, it would be more confusing to use multiple ranking schemes.
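The variance point can be illustrated with a toy example. The scores below are invented, and `mean_ranking` is a hypothetical helper rather than the leaderboard's aggregation code.

```python
from statistics import mean


def mean_ranking(task_scores):
    """Rank models by their mean score across tasks, best first."""
    models = next(iter(task_scores.values())).keys()
    avg = {m: mean(scores[m] for scores in task_scores.values()) for m in models}
    return sorted(avg, key=avg.get, reverse=True)


# Hypothetical scores: the two tasks order the models in opposite directions.
task_scores = {
    "low_variance": {"a": 0.80, "b": 0.81, "c": 0.82},
    "high_variance": {"a": 0.70, "b": 0.50, "c": 0.30},
}

ranking = mean_ranking(task_scores)

# Score range per task, to show which task dominates the mean.
spread = {
    task: max(scores.values()) - min(scores.values())
    for task, scores in task_scores.items()
}
```

The two tasks disagree completely about the model order, but the mean-based ranking simply reproduces the order of the high-variance task, because its score range is roughly twenty times larger.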

Samoed · 2025-08-29T07:54:17Z

Samoed
Aug 29, 2025
Maintainer

As this is a discussion, I think it would be better placed in Discussions.

0 replies

q275343119 · 2025-09-02T03:46:47Z

q275343119
Sep 2, 2025
Collaborator Author

Hi everyone, I’d like to summarize the current outcomes of our discussion:

Data filtering: No special filtering for RTEB models.
Field adjustments: Keep zero-shot visible, no special handling.
Model ranking: Continue using Borda rank.

Please confirm if this matches our consensus. I will proceed to update the PR accordingly based on these decisions in the coming days.
@KennethEnevoldsen @fzliu @Samoed

3 replies

Samoed Sep 2, 2025
Maintainer

I think that's correct

KennethEnevoldsen Sep 2, 2025
Maintainer

This would work for me

q275343119 Sep 3, 2025
Collaborator Author

Hi @KennethEnevoldsen @Samoed, I’ve pushed the updates reflecting our discussion (no special filtering, keeping zero-shot, and using Borda rank). The PR (https://github.com/embeddings-benchmark/mteb/pull/3089) is now ready for review.

Demo: https://huggingface.co/spaces/SmileXing/leaderboard

Do you have any further comments or suggestions? If everything looks good, can we proceed with merging it into the main branch?
