Replies: 3 comments 4 replies
-
I see your points. We can't change
-
Hi @pocman, great that you raised this. Since this is more of a discussion, I will move it over.
-
Thanks for the great overview @pocman. I wasn't actually aware that OpenAI uses RandomForest in their cookbook. Just so that it is mentioned, it is possible to change the classifier (e.g., to test the effect of this):

```python
# using mteb v2 (currently not released)
task = mteb.get_task(any_clf_task)
task.classifier = ...  # specify your own clf using any sklearn-compatible model
# evaluate
```

However, this will of course not produce official MTEB results. For that, we would need to implement a new task with the classifier. I would probably have to see some evidence that these two approaches produce different model rankings, which I don't believe they do at the moment. If there are models out there that are punished by the linear approach, we would love to know.
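To make the snippet above concrete, here is a rough sketch of plugging in a different sklearn classifier. It assumes the unreleased v2 API behaves as described above; the task name and hyperparameters are only illustrative placeholders, not a recommendation.

```python
# Rough sketch, assuming the unreleased mteb v2 API exposes `task.classifier`
# as described above; task name and hyperparameters are illustrative only.
import mteb
from sklearn.ensemble import RandomForestClassifier

task = mteb.get_task("Banking77Classification")
task.classifier = RandomForestClassifier(n_estimators=100, random_state=42)
# ...then run the evaluation as usual; note these would not be official MTEB scores.
```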
-
Description of the feature
Hello MTEB Maintainers and Community,
First, thank you for creating and maintaining this invaluable benchmark.
This issue is intended to open a constructive discussion about the choice of LogisticRegression as the sole classifier for the AbsTaskAnyClassification tasks in the benchmark. While LogisticRegression offers clear advantages in terms of speed, simplicity, and interpretability, its fundamental nature as a linear classifier may have important implications for how we evaluate and compare modern text embedding models.
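For readers less familiar with the setup: these tasks effectively perform a linear probe over frozen embeddings, i.e., a LogisticRegression classifier is fit on the train-split embeddings and scored on the test split. The following is only a minimal sketch of that idea, not MTEB's actual implementation, and it omits details such as how training examples are subsampled.

```python
# Minimal sketch of a linear-probe evaluation over frozen embeddings.
# Illustrative only -- not MTEB's actual implementation.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def linear_probe_score(train_emb, train_labels, test_emb, test_labels):
    """Fit a logistic regression on frozen embeddings, then score the held-out split."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_emb, train_labels)
    pred = clf.predict(test_emb)
    return {
        "accuracy": accuracy_score(test_labels, pred),
        "f1": f1_score(test_labels, pred, average="macro"),
    }
```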
The Core of the Discussion: Linear vs. Non-Linear Evaluation
The primary concern is that a linear classifier can only assess an embedding's quality through the lens of linear separability. It measures how well the embedding space can be partitioned by a hyperplane. However, it is plausible that state-of-the-art embedding models capture complex, non-linear semantic relationships that are not well-approximated by a linear decision boundary.
This raises a critical question: Could the benchmark be inadvertently penalizing powerful embedding models that produce well-structured but non-linearly separable class representations? A model's score on MTEB's classification tasks is currently a function of both its semantic representation quality and the geometric simplicity of that representation. A more powerful, non-linear classifier might reveal different strengths and weaknesses, potentially leading to different model rankings.
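As a purely synthetic illustration of this point (not drawn from any MTEB task), consider class structure that is well separated but not linearly separable: a linear probe scores near chance, while a simple non-linear probe recovers the classes almost perfectly.

```python
# Toy illustration: well-structured but non-linearly separable "embeddings"
# (concentric circles). A linear probe scores near chance; a non-linear probe
# scores near-perfectly. Purely synthetic, not an MTEB result.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_circles(n_samples=1000, noise=0.05, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = LogisticRegression().fit(X_train, y_train)
nonlinear = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0).fit(X_train, y_train)

print("linear probe accuracy:    ", linear.score(X_test, y_test))     # ~0.5 (chance)
print("non-linear probe accuracy:", nonlinear.score(X_test, y_test))  # ~1.0
```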
Conflicting Industry Guidance
This topic is particularly relevant given the lack of a clear industry consensus. For instance, OpenAI's embeddings cookbook demonstrates downstream classification with a RandomForestClassifier, while MTEB standardizes on LogisticRegression.
This divergence highlights that the choice is non-trivial and warrants a deliberate discussion within the context of a standardized benchmark.
A Comparison of Trade-offs
Proposals for Discussion
This issue is not a demand for an immediate change but rather a proposal to explore the topic as a community. Here are a few potential paths forward we could discuss:
We believe that a community discussion on this topic could lead to valuable improvements and increase the robustness and transparency of the MTEB benchmark. We look forward to hearing your thoughts and the perspectives of other community members.