Replies: 3 comments 4 replies
-
I see your points. We can't change
-
Hi @pocman, great that you raised this. Since this is more of a discussion, I will move it over.
-
Thanks for the great overview @pocman. I wasn't actually aware that OpenAI uses RandomForest in their cookbook. Just so that it is mentioned, it is possible to change the classifier (e.g., to test the effect of this):

```python
# using mteb v2 (currently not released)
task = mteb.get_task(any_clf_task)
task.classifier = ...  # specify your own clf using any sklearn-compatible model
# evaluate
```

However, this will of course not produce official MTEB results. For that, we would need to implement a new task with the classifier. I would probably have to see some evidence that these two approaches produce different model rankings, which I don't believe they do at the moment. If there are models out there that are punished by the linear approach, we would love to know.
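To make the snippet above concrete, here is a rough sketch of plugging in a different sklearn classifier. It assumes the unreleased v2 API behaves as described above; the task name and hyperparameters are only illustrative placeholders, not a recommendation.

```python
# Rough sketch, assuming the unreleased mteb v2 API exposes `task.classifier`
# as described above; task name and hyperparameters are illustrative only.
import mteb
from sklearn.ensemble import RandomForestClassifier

task = mteb.get_task("Banking77Classification")
task.classifier = RandomForestClassifier(n_estimators=100, random_state=42)
# ...then run the evaluation as usual; note these would not be official MTEB scores.
```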
-
Description of the feature
Hello MTEB Maintainers and Community,
First, thank you for creating and maintaining this invaluable benchmark.
This issue is intended to open a constructive discussion about the choice of LogisticRegression as the sole classifier for the AbsTaskAnyClassification tasks in the benchmark. While LogisticRegression offers clear advantages in terms of speed, simplicity, and interpretability, its fundamental nature as a linear classifier may have important implications for how we evaluate and compare modern text embedding models.
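For readers less familiar with the setup: these tasks effectively perform a linear probe over frozen embeddings, i.e., a LogisticRegression classifier is fit on the train-split embeddings and scored on the test split. The following is only a minimal sketch of that idea, not MTEB's actual implementation, and it omits details such as how training examples are subsampled.

```python
# Minimal sketch of a linear-probe evaluation over frozen embeddings.
# Illustrative only -- not MTEB's actual implementation.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def linear_probe_score(train_emb, train_labels, test_emb, test_labels):
    """Fit a logistic regression on frozen embeddings, then score the held-out split."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_emb, train_labels)
    pred = clf.predict(test_emb)
    return {
        "accuracy": accuracy_score(test_labels, pred),
        "f1": f1_score(test_labels, pred, average="macro"),
    }
```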
The Core of the Discussion: Linear vs. Non-Linear Evaluation
The primary concern is that a linear classifier can only assess an embedding's quality through the lens of linear separability. It measures how well the embedding space can be partitioned by a hyperplane. However, it is plausible that state-of-the-art embedding models capture complex, non-linear semantic relationships that are not well-approximated by a linear decision boundary.
This raises a critical question: Could the benchmark be inadvertently penalizing powerful embedding models that produce well-structured but non-linearly separable class representations? A model's score on MTEB's classification tasks is currently a function of both its semantic representation quality and the geometric simplicity of that representation. A more powerful, non-linear classifier might reveal different strengths and weaknesses, potentially leading to different model rankings.
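As a purely synthetic illustration of this point (not drawn from any MTEB task), consider class structure that is well separated but not linearly separable: a linear probe scores near chance, while a simple non-linear probe recovers the classes almost perfectly.

```python
# Toy illustration: well-structured but non-linearly separable "embeddings"
# (concentric circles). A linear probe scores near chance; a non-linear probe
# scores near-perfectly. Purely synthetic, not an MTEB result.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_circles(n_samples=1000, noise=0.05, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = LogisticRegression().fit(X_train, y_train)
nonlinear = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0).fit(X_train, y_train)

print("linear probe accuracy:    ", linear.score(X_test, y_test))     # ~0.5 (chance)
print("non-linear probe accuracy:", nonlinear.score(X_test, y_test))  # ~1.0
```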
Conflicting Industry Guidance
This topic is particularly relevant given the lack of a clear industry consensus. For instance, OpenAI's embeddings cookbook demonstrates downstream classification with a RandomForestClassifier, while MTEB standardizes on LogisticRegression.
This divergence highlights that the choice is non-trivial and warrants a deliberate discussion within the context of a standardized benchmark.
A Comparison of Trade-offs
Proposals for Discussion
This issue is not a demand for an immediate change but rather a proposal to explore the topic as a community. Here are a few potential paths forward we could discuss:
We believe that a community discussion on this topic could lead to valuable improvements and increase the robustness and transparency of the MTEB benchmark. We look forward to hearing your thoughts and the perspectives of other community members.