Replies: 6 comments
-
I think it's better to convert them to reranking tasks instead, if they're QA. We don't have a specific dataset type for this, but we do have some QA datasets in both retrieval and reranking.
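For reference, here is a rough, purely illustrative sketch of what the two task formats look like data-wise, assuming the classic MTEB/BEIR conventions (corpus/queries/qrels for retrieval; query/positive/negative triples for reranking). The example strings are invented, and the exact field names should be checked against the current `mteb` codebase:

```python
# Hypothetical illustration only; the strings are invented and the field names
# follow classic MTEB/BEIR conventions, which may differ in the current mteb version.

# Retrieval task: each query is scored against the *entire* corpus.
retrieval_style = {
    "queries": {"q1": "How long does the battery last?"},
    "corpus": {"d1": {"title": "", "text": "The battery lasts about ten hours on a full charge."}},
    "relevant_docs": {"q1": {"d1": 1}},
}

# Reranking task: the model only reorders a small, fixed candidate list per query.
reranking_style = {
    "query": "How long does the battery last?",
    "positive": ["The battery lasts about ten hours on a full charge."],
    "negative": ["The device ships with a USB-C charging cable."],
}
```

Converting a QA dataset from the first shape to the second would mean sampling hard negatives per query instead of searching the full corpus.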
-
Yes, I think these datasets should not be used for evaluating retrieval. As you suggested, it would be better to move these datasets to reranking tasks. I'd like to hear other maintainers' thoughts too.
-
Sorry for tagging again; I would like to ask other contributors' opinions. @KennethEnevoldsen @Muennighoff
-
From my understanding, a common (recent) use case of a retrieval model is to make an FAQ more accessible using retrieval (potentially full-on RAG). I think the questions used in the FAQ are intended to mimic user queries, and the answer is, of course, the most direct way to answer the question, so it should be an ideal positive. There are, of course, plenty of cases that are not handled (no answer, partial answers, etc.), so yes, it is a proxy dataset. I think you can create better retrieval (and reranking) datasets, but, e.g., WebFaqRetrieval gives us quite broad coverage. We also have summary article retrieval; I think that is also problematic as a classic retrieval task, but such datasets do a good job of assessing the semantic space. An evaluation task does not necessarily need to be a 1-1 mapping with the real use case (e.g., this blog post examines word order).

Just an additional note: neither WebFaqRetrieval nor XPQARetrieval is part of MTEB(Multilingual); XPQARetrieval is part of MTEB(fra) (@imenelydiaker might have some additional considerations).
-
(Since this is a discussion, I will just move it over.)
-
@yjoonjang I understand your point and agree with you: a QA dataset should not be a retrieval dataset. In a RAG system, the answer is generated by the "LLM generator" component given a list of relevant documents, and since we're only evaluating the retrieval part, QA datasets may not be a good fit. XPQA was not part of the initial MTEB-French evaluation; I think we added it at the last minute, so I will remove it. It should not affect the benchmark much, and it's not part of MTEB-multilingual. I'm upgrading French-MTEB to a v2 with better datasets.
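To make the split being described concrete, here is a minimal, purely illustrative sketch of a RAG pipeline with hypothetical `retriever`/`generator` components (none of this is an MTEB API), just to show which stage retrieval tasks actually score:

```python
# Purely illustrative; `retriever` and `generator` are hypothetical objects, not MTEB APIs.

def rag_answer(query: str, retriever, generator, k: int = 5) -> str:
    # Retrieval step: the only part covered by retrieval tasks and metrics (e.g. nDCG@10).
    docs = retriever.search(query, top_k=k)
    # Generation step: an LLM composes the final answer from the retrieved documents.
    # QA-style "gold answers" belong to this stage, not to the retrieval evaluation.
    return generator.generate(query=query, context=docs)
```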
-
Hi all,
I have a question about using QA datasets for evaluating retrieval tasks.
Some QA datasets in MMTEB (e.g., XPQARetrieval, WebFaqRetrieval) do not seem to evaluate retrieval well.
Here are examples from the two datasets:
I think these datasets are query-positive sets for evaluating QA tasks, not retrieval tasks.
What are your thoughts?
I kindly tag @KennethEnevoldsen, @Samoed.
Thank you.