Refactor queries and document dataloader to allow multiple modalities by ayush1298 · Pull Request #4232 · embeddings-benchmark/mteb

ayush1298 · 2026-03-12T14:36:57Z

Both _create_queries_dataloader and _create_document_dataloader are refactored.
Before the refactor, these functions used only one of the modalities from [image, text, audio, video] to create the dataloader. This caused issues for tasks like EncyclopediaVQAIT2ITRetrieval, whose dataset has both ['image', 'text'] modalities.

The refactor adds a helper function, _prepare_multimodal_dataset, which applies all dataset-specific transformations together. The conversion to a dataloader then happens in the respective function, _create_queries_dataloader or _create_document_dataloader. As a result, a dataset with more than one modality is now handled.
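
To make the idea concrete, here is a minimal sketch of the "prepare once, then wrap" flow, using plain Python dicts instead of a datasets.Dataset. The names prepare_rows and TRANSFORMS, and the per-modality transforms themselves, are hypothetical stand-ins for illustration; this is not the actual _prepare_multimodal_dataset implementation.

```python
from typing import Any, Callable

Row = dict[str, Any]


def _prepare_text(row: Row) -> Row:
    # Hypothetical text transform: expose the raw text under a "query" key.
    return {**row, "query": row["text"]}


def _prepare_image(row: Row) -> Row:
    # Hypothetical image transform: a no-op here; real code might convert formats.
    return row


# Assumed mapping from modality name to its transform (illustration only).
TRANSFORMS: dict[str, Callable[[Row], Row]] = {
    "text": _prepare_text,
    "image": _prepare_image,
}


def prepare_rows(rows: list[Row], modalities: list[str]) -> list[Row]:
    """Apply the transform of every requested modality, keeping all columns."""
    prepared = []
    for row in rows:
        for modality in modalities:
            row = TRANSFORMS[modality](row)
        prepared.append(row)
    return prepared


rows = [{"id": "q1", "text": "What is this?", "image": b"<raw bytes>"}]
prepared = prepare_rows(rows, ["image", "text"])
print(sorted(prepared[0]))  # ['id', 'image', 'query', 'text']
```

The key point is that every modality listed for the task contributes to the same prepared rows, instead of one modality being picked and the rest dropped.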

Tested with EncyclopediaVQAIT2ITRetrieval using the dummy script below:

from datasets import Dataset
from datasets import Image as HFImage
from PIL import Image as PILImage

from mteb._create_dataloaders import _create_queries_dataloader
from mteb.tasks.retrieval.eng import EncyclopediaVQAIT2ITRetrieval
from mteb.types import PromptType

# Get the real task metadata
task = EncyclopediaVQAIT2ITRetrieval()
task_metadata = task.metadata

# Create a small synthetic dataset mimicking the real one:
# features: ['id', 'modality', 'text', 'image'], where image is HF Image type
dummy_images = [PILImage.new("RGB", (32, 32), color="red") for _ in range(4)]
dataset = Dataset.from_dict(
    {
        "id": ["q1", "q2", "q3", "q4"],
        "modality": ["image_text"] * 4,
        "text": ["What is this?", "Describe this.", "Name this.", "Identify this."],
        "image": dummy_images,
    }
).cast_column("image", HFImage())

# Verify: category is "it2it", so query modalities = ["image", "text"]
print("Query modalities:", task_metadata.get_modalities(PromptType.query))
# Expected: ['image', 'text']

# Now test the refactored function (PromptType is already imported above)
dl = _create_queries_dataloader(
    dataset,
    task_metadata,
    batch_size=2,
    input_column=None,
    num_proc=None,
)

batch = next(iter(dl))
print("Batch keys:", batch.keys())
# Before refactor: dict_keys(['image'])
# After refactor:  dict_keys(['id', 'modality', 'text', 'query', 'image'])

# Verify text was processed (query key exists)
print("Has 'text':", "text" in batch)
print("Has 'query':", "query" in batch)
print("Has 'image':", "image" in batch)

Output of the above script:

Query modalities: ['image', 'text']
Processing queries for dataloading: 100%|█| 4/4 [00:00<00:00, 551.86 ex
Batch keys: dict_keys(['id', 'modality', 'text', 'image', 'query'])
Has 'text': True
Has 'query': True
Has 'image': True

So, it's working as expected now.

I want to propose 2 more extensions to this refactor to make it cleaner and easier to use (removing the clutter of many near-identical functions):

We can deprecate/remove the following functions, since their work is already absorbed into other functions by the refactor and they are now just duplicates:

_create_dataloader_for_retrieval_corpus, _create_text_dataloader_for_queries, _create_dataloader_for_queries_conversation

Also, I think we can update bm25 and bb25, which use _create_text_queries_dataloader, to call _combine_queries_with_instruction_text instead, and then remove the _create_text_queries_dataloader function as well.
In both of these files, we first create a dataloader and then immediately flatten it back:
https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/model_implementations/bm25.py#L89-L90
So, instead, we can simply use _combine_queries_with_instruction_text directly.
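
The round trip described above can be illustrated with a small sketch. It uses plain lists in place of a real DataLoader, and combine_query_with_instruction is a hypothetical stand-in, not the actual _combine_queries_with_instruction_text. The point is only that batching and then flattening yields the same list as transforming the rows directly.

```python
from typing import Iterator


def combine_query_with_instruction(row: dict) -> str:
    # Hypothetical stand-in: append the instruction to the query text, if any.
    instruction = row.get("instruction")
    return f"{row['text']} {instruction}" if instruction else row["text"]


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    # Minimal stand-in for a dataloader: yield consecutive fixed-size batches.
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


queries = [
    {"text": "capital of France", "instruction": "Answer briefly."},
    {"text": "largest planet", "instruction": None},
    {"text": "speed of light"},
]

# Round trip: build batches, then immediately flatten them back into one list.
texts = [combine_query_with_instruction(q) for q in queries]
via_batches = [text for batch in batched(texts, batch_size=2) for text in batch]

# Direct path: transform the rows and use the resulting list as-is.
direct = [combine_query_with_instruction(q) for q in queries]

print(via_batches == direct)  # True
```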

Also, functions like _create_image_dataloader, _create_audio_dataloader, and _create_video_dataloader are called only in create_dataloader, as part of the fallback when prompt_type=None. These functions only rename a column and then wrap the dataset in a DataLoader, so that can be done directly in create_dataloader.

Would love to have your opinion on both of these refactor extensions @KennethEnevoldsen @Samoed

Copilot

Pull request overview

This PR refactors retrieval query/document dataloader creation to support datasets with multiple modalities (e.g., ["image", "text"] for it2it tasks), addressing issue #4182 where only one modality was previously included in the dataloader batches.

Changes:

Introduces _prepare_multimodal_dataset() to centralize modality-specific dataset transformations before wrapping in a DataLoader.
Updates _create_queries_dataloader() and _create_document_dataloader() to use the shared preparation function and to select a custom collate function when needed.
Enables multimodal retrieval tasks to produce batches containing all expected modality keys (e.g., both image and text).
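
As an illustration of why a custom collate function may be needed for multimodal batches, the sketch below groups a list of row dicts into a dict of per-column lists, so values that cannot be stacked (such as images of different sizes) are simply kept as lists. collate_rows is a hypothetical name and this shows the general technique only, not the collate function used in the PR.

```python
from typing import Any


def collate_rows(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Turn a list of row dicts into a dict of per-column lists."""
    if not rows:
        return {}
    keys = rows[0].keys()
    # Every modality column survives; nothing is stacked into a tensor.
    return {key: [row[key] for row in rows] for key in keys}


# The (width, height) tuples are placeholders standing in for image objects.
batch = collate_rows(
    [
        {"id": "q1", "text": "What is this?", "image": (32, 32)},
        {"id": "q2", "text": "Describe this.", "image": (64, 48)},
    ]
)
print(list(batch))  # ['id', 'text', 'image']
print(batch["image"])  # [(32, 32), (64, 48)]
```

A function of this shape could be passed as the collate_fn argument of torch.utils.data.DataLoader.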

Copilot · 2026-03-12T14:43:58Z

mteb/_create_dataloaders.py

+    prepared = _prepare_multimodal_dataset(
+        dataset,
+        task_metadata,
+        prompt_type=PromptType.document,
+        input_column=input_column,
+        num_proc=num_proc,
+    )


_create_document_dataloader/_prepare_multimodal_dataset don't actually implement the behavior described in the surrounding docstring (selecting the first column matching the modality when input_column is None). Right now the code assumes canonical column names (e.g., "image") unless input_column is explicitly provided. Consider either adding the advertised column inference or adjusting the docstring so callers aren't misled.
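
For reference, the column inference described in the docstring could look roughly like the sketch below. It assumes a simplified schema given as a mapping from column name to a modality label; infer_input_column and that schema format are hypothetical and are not part of mteb or datasets.

```python
from typing import Optional


def infer_input_column(
    column_modalities: dict[str, str],
    modality: str,
    input_column: Optional[str] = None,
) -> str:
    """Return the explicit column, else the first column matching the modality."""
    if input_column is not None:
        if input_column not in column_modalities:
            raise KeyError(f"Column {input_column!r} not found")
        return input_column
    for name, column_modality in column_modalities.items():
        if column_modality == modality:
            # Dicts preserve insertion order, so this is the first match.
            return name
    raise ValueError(f"No column found for modality {modality!r}")


# Hypothetical schema where the image column does not use the canonical name.
schema = {"id": "id", "caption": "text", "picture": "image"}
print(infer_input_column(schema, "image"))  # picture
print(infer_input_column(schema, "text"))  # caption
```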

Copilot · 2026-03-12T14:43:58Z

mteb/_create_dataloaders.py

+        if prompt_type == PromptType.document:
+            new_ds = new_ds.map(
+                _corpus_to_dict,
+                desc="Standardizing text corpus format",
+                num_proc=num_proc,
+            )
+        elif prompt_type == PromptType.query:
+            if isinstance(new_ds["text"][0], list):
+                new_ds = new_ds.map(
+                    _convert_conv_history_to_query,
+                    desc="Converting conversations to queries",
+                    num_proc=num_proc,
+                )
+            else:
+                new_ds = new_ds.map(
+                    _combine_queries_with_instruction_text,


In _prepare_multimodal_dataset, the .map(...) calls for text processing run over the full multimodal rows. For datasets that include heavy columns (e.g., image/audio/video), this can force decoding/serialization of those columns during mapping even though the mapper only needs text fields, which can significantly slow preprocessing and increase memory use (especially with num_proc). Consider using Dataset.map(..., input_columns=[...]) (or an equivalent approach) so the mapper only receives the columns it actually needs (e.g., ['id','text','title'] for _corpus_to_dict, ['text','instruction'] for _combine_queries_with_instruction_text).

Suggested change

-        if prompt_type == PromptType.document:
-            new_ds = new_ds.map(
-                _corpus_to_dict,
-                desc="Standardizing text corpus format",
-                num_proc=num_proc,
-            )
-        elif prompt_type == PromptType.query:
-            if isinstance(new_ds["text"][0], list):
-                new_ds = new_ds.map(
-                    _convert_conv_history_to_query,
-                    desc="Converting conversations to queries",
-                    num_proc=num_proc,
-                )
-            else:
-                new_ds = new_ds.map(
-                    _combine_queries_with_instruction_text,
+        if prompt_type == PromptType.document:
+            corpus_input_columns = [
+                col
+                for col in ("id", "text", "title")
+                if col in new_ds.column_names
+            ]
+            new_ds = new_ds.map(
+                _corpus_to_dict,
+                input_columns=corpus_input_columns,
+                desc="Standardizing text corpus format",
+                num_proc=num_proc,
+            )
+        elif prompt_type == PromptType.query:
+            if isinstance(new_ds["text"][0], list):
+                conv_input_columns = [
+                    col
+                    for col in ("text",)
+                    if col in new_ds.column_names
+                ]
+                new_ds = new_ds.map(
+                    _convert_conv_history_to_query,
+                    input_columns=conv_input_columns,
+                    desc="Converting conversations to queries",
+                    num_proc=num_proc,
+                )
+            else:
+                query_input_columns = [
+                    col
+                    for col in ("text", "instruction")
+                    if col in new_ds.column_names
+                ]
+                new_ds = new_ds.map(
+                    _combine_queries_with_instruction_text,
+                    input_columns=query_input_columns,
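
One caveat worth checking before applying a suggestion like this: in Hugging Face datasets, Dataset.map(..., input_columns=[...]) passes the selected columns to the function as positional arguments instead of a single row dict, so a mapper written for row dicts would need a matching signature. The sketch below mimics that calling convention with plain Python; map_rows and combine_text_and_instruction are hypothetical helpers for illustration, not mteb or datasets functions.

```python
from typing import Any, Callable

Row = dict[str, Any]


def map_rows(
    rows: list[Row], fn: Callable[..., Row], input_columns: list[str]
) -> list[Row]:
    """Mimic map(..., input_columns=...): call fn with the selected column
    values positionally and merge the returned dict into the original row."""
    return [{**row, **fn(*(row[col] for col in input_columns))} for row in rows]


def combine_text_and_instruction(text: str, instruction: str) -> Row:
    # Takes column values, not a row dict, to match the positional convention.
    return {"query": text, "text": f"{text} {instruction}"}


rows = [{"id": "q1", "text": "Name this.", "instruction": "Be brief.", "image": b"..."}]
out = map_rows(rows, combine_text_and_instruction, ["text", "instruction"])
print(out[0]["text"])  # Name this. Be brief.
print("image" in out[0])  # True: untouched columns are kept
```

Because the suggested column lists are filtered by column_names, the number of positional arguments could differ between datasets, which the mapper would presumably also have to handle.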

mteb/_create_dataloaders.py

Samoed · 2026-03-12T14:48:21Z

> We can deprecate/remove the following functions, since their work is already absorbed into other functions by the refactor and they are now just duplicates

Agree

> so that thing can be done in the create_dataloader function only.

Agree

ayush1298 · 2026-03-12T17:15:49Z

Simplified as much as possible.
Now there is a single entry point, the create_dataloader function, plus the _prepare_multimodal_dataset function, which applies the dataset-specific transformations for all modalities (handling more than one modality) and uses the corresponding helper function. Removed all the modality-specific dataloader functions, as they were of no use. Kept _create_dataloader_from_texts, as it is used by multiple task evaluators.

mteb/_create_dataloaders.py

Refactor queries and document dataloader to allow multiple modalities

55f3187

Copilot AI review requested due to automatic review settings March 12, 2026 14:36

Copilot started reviewing on behalf of ayush1298 March 12, 2026 14:37

Copilot AI reviewed Mar 12, 2026

Samoed requested changes Mar 12, 2026

ayush1298 added 8 commits March 12, 2026 20:31

update passing collate

4991d1c

update bm25 and bb25 models based on refactor

4ca2fe2

remmove _create_text_queries_dataloader function after updating bm25 …

0bac9da

…and bb25

remove _create_dataloader_for_retrieval_corpus, _create_text_dataload…

5045c50

…er_for_queries, _create_dataloader_for_queries_conversation functions

refactor create_dataloader function

3c86626

remove _create_image_dataloader, _create_audio_dataloader, and _creat…

1db7cd9

…e_video_dataloader under refactoring

Remove _create_queries_dataloader and _create_document_dataloader

aa53d05

Remove _create_dataloader_from_texts

3264a22

ayush1298 added 3 commits March 12, 2026 22:49

Added _create_dataloader_from_texts again

d2881e7

small refactor

c1ed504

fix check for failing tests

fef466f

ayush1298 requested review from KennethEnevoldsen and Samoed March 12, 2026 18:17

Samoed approved these changes Mar 12, 2026

apply changes from review

324496e

ayush1298 wants to merge 13 commits into embeddings-benchmark:main from ayush1298:refactor_dataloader

3 participants
