dataset: Add REAL_MM_RAG benchmark #3224

roipony · 2025-09-29T19:55:17Z

I have outlined why this dataset is filling an existing gap in mteb.
I have tested that the dataset runs with the mteb package.
I have run the following models on the task (adding the results to the pr). These can be run using the mteb run -m {model_name} -t {task_name} command.
- ibm-granite/granite-vision-3.3-2b-embedding
- jinaai/jina-embeddings-v4
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks).
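A minimal sketch of how the `mteb run -m {model_name} -t {task_name}` invocations from the checklist could be generated for every model/task pair. It only assembles command strings and does not run anything; the two task names are the ones that appear elsewhere in this PR, and the helper name `build_commands` is hypothetical.

```python
import shlex

# Models listed in the PR checklist.
MODELS = [
    "ibm-granite/granite-vision-3.3-2b-embedding",
    "jinaai/jina-embeddings-v4",
]

# Task names mentioned in this PR (assumed here to be the ones to evaluate).
TASKS = [
    "RealMMRagFinReportRetrieval",
    "RealMMRagTechSlidesRetrieval",
]


def build_commands(models, tasks):
    """Return one `mteb run` command string per (model, task) pair."""
    return [
        shlex.join(["mteb", "run", "-m", model, "-t", task])
        for model in models
        for task in tasks
    ]


if __name__ == "__main__":
    for command in build_commands(MODELS, TASKS):
        print(command)
```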

mteb/tasks/Image/Any2AnyRetrieval/eng/RealMMRagBenchRetrieval.py

Co-authored-by: Roman Solomatin <[email protected]>

mteb/benchmarks/benchmarks/benchmarks.py

add revisions

Samoed · 2025-09-29T21:37:35Z

can you run make format-citations?

KennethEnevoldsen

Hi, great to see a PR, and congratulations on the paper release!

I think the main thing missing at the moment is documentation. I have put a few pointers below.

Note: will be influenced by #3222 (if merged we can move it down to the retrieval section)

KennethEnevoldsen · 2025-09-30T08:32:11Z

mteb/benchmarks/benchmarks/benchmarks.py

+            "RealMMRagTechSlidesRetrieval",
+        ],
+    ),
+    description="Realistic and multi-modal document retrieval benchmark.",


This description is too short. Why should I prefer this over another VDR benchmark?

KennethEnevoldsen · 2025-09-30T08:33:34Z

mteb/tasks/Image/Any2AnyRetrieval/eng/RealMMRagBenchRetrieval.py

+class RealMMRagFinReportRetrieval(AbsTaskAny2AnyRetrieval):
+    metadata = TaskMetadata(
+        name="RealMMRagFinReportRetrieval",
+        description="Retrieve associated pages according to questions.",


This description is too vague. It should be clear from the description what the queries and the corpus contain, as well as the retrieval goal. Please fix this for all tasks.

KennethEnevoldsen · 2025-09-30T08:34:20Z

mteb/tasks/Image/Any2AnyRetrieval/eng/RealMMRagBenchRetrieval.py

@Samoed should we snake-case the filename (easier to merge with v2)?

Agree, that will be better
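For illustration, a small sketch of the CamelCase to snake_case conversion being discussed. The resulting filename is only what this particular regex-based rule produces; the name the maintainers actually want is not stated in the thread, and both helper names are hypothetical.

```python
import re
from pathlib import PurePosixPath


def to_snake_case(name: str) -> str:
    """Convert a CamelCase identifier (including acronyms such as 'MM') to snake_case."""
    # Split before a capitalised word: 'MRag' -> 'M_Rag', 'hRetrieval' -> 'h_Retrieval'.
    step1 = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    # Split the remaining lower-to-upper boundaries: 'lM' -> 'l_M', 'gB' -> 'g_B'.
    step2 = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", step1)
    return step2.lower()


def snake_case_filename(path: str) -> str:
    """Snake-case only the file stem, keeping the directories and extension unchanged."""
    p = PurePosixPath(path)
    return str(p.with_name(to_snake_case(p.stem) + p.suffix))


if __name__ == "__main__":
    print(snake_case_filename("RealMMRagBenchRetrieval.py"))
```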

KennethEnevoldsen · 2025-09-30T08:36:55Z

mteb/leaderboard/benchmark_selector.py

                        "MIEB(Img)",
                        "VisualDocumentRetrieval",
                        "JinaVDR",
+                        "REAL_MM_RAG"


Let us not add it to a benchmark yet due to #3222 (this means we can merge this without caring about the other PR, and once both are merged we can add both).

add better descriptions on benchmark subsets

roipony · 2025-11-02T09:05:09Z

@KennethEnevoldsen @Samoed
Let me know if there’s anything else I should do to help move the pull request forward.

Samoed · 2025-11-02T17:08:13Z

mteb/tasks/Image/Any2AnyRetrieval/eng/RealMMRagBenchRetrieval.py

Since the v2 release, you need to move your tasks into the retrieval/eng/ folder.

Samoed · 2025-11-02T17:09:07Z

mteb/tasks/Image/Any2AnyRetrieval/eng/RealMMRagBenchRetrieval.py

+from __future__ import annotations
+
+from datasets import load_dataset
+
+from mteb.abstasks.Image.AbsTaskAny2AnyRetrieval import AbsTaskAny2AnyRetrieval
+from mteb.abstasks.TaskMetadata import TaskMetadata


The imports would also change:

Suggested change

-from __future__ import annotations
-
-from datasets import load_dataset
-
-from mteb.abstasks.Image.AbsTaskAny2AnyRetrieval import AbsTaskAny2AnyRetrieval
-from mteb.abstasks.TaskMetadata import TaskMetadata
+from datasets import load_dataset
+
+from mteb.abstasks.retrieval import AbsTaskRetrieval
+from mteb.abstasks.task_metadata import TaskMetadata

Samoed · 2025-11-02T17:09:38Z

mteb/tasks/Image/Any2AnyRetrieval/eng/RealMMRagBenchRetrieval.py

+                "image": None,
+                "modality": "text",


You shouldn't add columns that only contain None, and you don't need the modality column.
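A rough sketch of what this cleanup could look like when building rows as plain dictionaries. The field names `id` and `text` and the helper `clean_row` are hypothetical; only `image` and `modality` come from the diff above.

```python
def clean_row(row, drop_keys=("modality",)):
    """Return a copy of `row` without None-valued fields and without the keys in `drop_keys`."""
    return {
        key: value
        for key, value in row.items()
        if value is not None and key not in drop_keys
    }


if __name__ == "__main__":
    # Example query row shaped like the one in the diff (extra fields are illustrative).
    raw = {"id": "q1", "text": "What was the revenue?", "image": None, "modality": "text"}
    print(clean_row(raw))
```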

Samoed · 2025-11-02T17:10:14Z

mteb/tasks/Image/Any2AnyRetrieval/eng/RealMMRagBenchRetrieval.py

+        prompt={"query": "Find a screenshot that relevant to the user's question."},
+        descriptive_stats={
+            "n_samples": None,
+            "avg_character_length": {
+                "test": {
+                    "average_document_length": 141.5,
+                    "num_documents": 19,
+                    "num_queries": 853,
+                    "average_relevant_docs_per_query": 1.0,
+                }
+            },
+        },


We don't have descriptive_stats in task metadata. You need to use task.calculate_descriptive_statistics()

Suggested change

-        prompt={"query": "Find a screenshot that relevant to the user's question."},
-        descriptive_stats={
-            "n_samples": None,
-            "avg_character_length": {
-                "test": {
-                    "average_document_length": 141.5,
-                    "num_documents": 19,
-                    "num_queries": 853,
-                    "average_relevant_docs_per_query": 1.0,
-                }
-            },
-        },
+        prompt={"query": "Find a screenshot that relevant to the user's question."},
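To make the point concrete, here is a toy sketch of how figures like those in the removed block can be derived from the data rather than hard-coded. This is not mteb's implementation; the function name and the input shapes (dictionaries keyed by ID, text-only documents) are assumptions made for illustration.

```python
def toy_descriptive_stats(corpus, queries, qrels):
    """Compute simple retrieval statistics from in-memory data.

    corpus:  dict mapping document ID -> document text
    queries: dict mapping query ID -> query text
    qrels:   dict mapping query ID -> set of relevant document IDs
    """
    num_documents = len(corpus)
    num_queries = len(queries)
    total_doc_chars = sum(len(text) for text in corpus.values())
    total_relevant = sum(len(doc_ids) for doc_ids in qrels.values())
    return {
        "num_documents": num_documents,
        "num_queries": num_queries,
        # Guard against empty inputs to avoid division by zero.
        "average_document_length": total_doc_chars / num_documents if num_documents else 0.0,
        "average_relevant_docs_per_query": total_relevant / num_queries if num_queries else 0.0,
    }


if __name__ == "__main__":
    corpus = {"d1": "abcd", "d2": "ab"}
    queries = {"q1": "first question", "q2": "second question"}
    qrels = {"q1": {"d1"}, "q2": {"d2"}}
    print(toy_descriptive_stats(corpus, queries, qrels))
```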

Samoed · 2025-11-02T17:12:24Z

mteb/tasks/Image/Any2AnyRetrieval/eng/RealMMRagBenchRetrieval.py

+from mteb.abstasks.TaskMetadata import TaskMetadata
+
+
+def _load_data(


You can reupload your tasks using task.push_dataset_to_hub() to use our format

Samoed · 2025-11-03T11:57:25Z

Can you merge main to resolve conflicts?

roipony · 2025-11-03T12:26:24Z

> Can you merge main to resolve conflicts?

I'm working on it. I'll ping you once it's finished.

roipony added 3 commits September 28, 2025 13:29

add real_mm_rag benchmark

e234f2c

add real_mm_rag benchmark

0bb9d1c

add real_mm_rag benchmark

6251545

Samoed reviewed Sep 29, 2025

Samoed requested a review from isaac-chung September 29, 2025 20:05

Samoed changed the title ~~Add REAL_MM_RAG benchmark~~ dataset: Add REAL_MM_RAG benchmark Sep 29, 2025

Samoed added the new benchmark Issues related to adding a new benchmark label Sep 29, 2025

Samoed reviewed Sep 29, 2025

roipony and others added 2 commits September 29, 2025 23:08

add REAL_MM_RAG benchmark

2c70756

Update mteb/tasks/Image/Any2AnyRetrieval/eng/RealMMRagBenchRetrieval.py

ff7b861

Co-authored-by: Roman Solomatin <[email protected]>

Samoed reviewed Sep 29, 2025

roipony added 2 commits September 29, 2025 23:18

Update RealMMRagBenchRetrieval.py

a430b68

add revisions

Update benchmark_selector.py

0ba2a1a

add REAL_MM_RAG benchmark

6cd75a4

KennethEnevoldsen requested changes Sep 30, 2025

roi.pony and others added 7 commits October 8, 2025 10:11

Merge remote-tracking branch 'mteb_pony/main' into main_pony

75e47da

add better descriptions on benchmark subsets

bfd5cec

Merge pull request #1 from roipony/main_pony

28469a7

add better descriptions on benchmark subsets

Merge branch 'main' into main

b33167d

Update benchmarks.py

ed8898c

Update __init__.py

03e9633

fixed automatic test errors

57ad227

Samoed added the image The image extension of MTEB label Oct 19, 2025

Samoed reviewed Nov 2, 2025

Fix code according to review

ccd43d8
