Replies: 6 comments
-
I think it's better to convert them to reranking tasks instead, if they're QA. We don't have a specific dataset type for this, but we do have some QA datasets in both retrieval and reranking.
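For reference, here is a rough, purely illustrative sketch of what the two task formats look like data-wise, assuming the classic MTEB/BEIR conventions (corpus/queries/qrels for retrieval; query/positive/negative triples for reranking). The example strings are invented, and the exact field names should be checked against the current `mteb` codebase:

```python
# Hypothetical illustration only; the strings are invented and the field names
# follow classic MTEB/BEIR conventions, which may differ in the current mteb version.

# Retrieval task: each query is scored against the *entire* corpus.
retrieval_style = {
    "queries": {"q1": "How long does the battery last?"},
    "corpus": {"d1": {"title": "", "text": "The battery lasts about ten hours on a full charge."}},
    "relevant_docs": {"q1": {"d1": 1}},
}

# Reranking task: the model only reorders a small, fixed candidate list per query.
reranking_style = {
    "query": "How long does the battery last?",
    "positive": ["The battery lasts about ten hours on a full charge."],
    "negative": ["The device ships with a USB-C charging cable."],
}
```

Converting a QA dataset from the first shape to the second would mean sampling hard negatives per query instead of searching the full corpus.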
-
Yes, I think these datasets should not be used for evaluating retrieval. As you suggested, it would be better to move these datasets to reranking tasks. I'd like to hear other maintainers' thoughts too.
-
Sorry for tagging again; I would like to ask other contributors' opinions. @KennethEnevoldsen @Muennighoff
-
From my understanding, a common (recent) use case of a retrieval model is to make an FAQ more accessible using retrieval (potentially full-on RAG). I think the questions used in the FAQ are intended to mimic user queries, and the answer is, of course, the most direct way to answer the question, so it should be an ideal positive. There are, of course, plenty of cases that are not handled (no answer, partial answers, etc.), so yes, it is a proxy dataset. I think you can create better retrieval (and reranking) datasets, but, e.g., WebFaqRetrieval gives us quite broad coverage. We also have summary article retrieval; I think that is also problematic as a classic retrieval task, but such datasets do a good job of assessing the semantic space. An evaluation task does not necessarily need to be a 1-1 mapping with the real use case (e.g., this blog post examines word order).

Just an additional note: neither WebFaqRetrieval nor XPQARetrieval is part of MTEB(Multilingual); XPQARetrieval is part of MTEB(fra) (@imenelydiaker might have some additional considerations).
-
(Since this is a discussion, I will just move it over.)
-
@yjoonjang I understand your point and agree with you: a QA dataset should not be a retrieval dataset. In a RAG system, the answer is generated by the "LLM generator" component given a list of relevant documents, and since we're only evaluating the retrieval part, QA datasets may not be a good fit. XPQA was not part of the initial MTEB-French evaluation; I think we added it at the last minute, so I will remove it. It should not affect the benchmark much, and it's not part of MTEB-multilingual. I'm upgrading French-MTEB to a v2 with better datasets.
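To make the split being described concrete, here is a minimal, purely illustrative sketch of a RAG pipeline with hypothetical `retriever`/`generator` components (none of this is an MTEB API), just to show which stage retrieval tasks actually score:

```python
# Purely illustrative; `retriever` and `generator` are hypothetical objects, not MTEB APIs.

def rag_answer(query: str, retriever, generator, k: int = 5) -> str:
    # Retrieval step: the only part covered by retrieval tasks and metrics (e.g. nDCG@10).
    docs = retriever.search(query, top_k=k)
    # Generation step: an LLM composes the final answer from the retrieved documents.
    # QA-style "gold answers" belong to this stage, not to the retrieval evaluation.
    return generator.generate(query=query, context=docs)
```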
-
Hi all,
I have a question about using QA datasets for evaluating retrieval tasks.
Some QA datasets in MMTEB (e.g., XPQARetrieval, WebFaqRetrieval) do not seem to evaluate retrieval well.
Here are examples from the two datasets:
I think these datasets are query-positive sets for evaluating QA tasks, not retrieval tasks.
What are your thoughts?
I kindly tag @KennethEnevoldsen, @Samoed.
Thank you.