
Refactor queries and document dataloader to allow multiple modalities#4232

Open
ayush1298 wants to merge 13 commits into embeddings-benchmark:main from ayush1298:refactor_dataloader

Conversation

Collaborator

@ayush1298 ayush1298 commented Mar 12, 2026

closes #4182

Both _create_queries_dataloader and _create_document_dataloader are refactored.
Before the refactor, these functions picked just one modality from [image, text, audio, video] and used only that one to build the dataloader. This caused issues for tasks like EncyclopediaVQAIT2ITRetrieval, whose dataset has both ['image', 'text'] modalities.

After the refactor, a helper function _prepare_multimodal_dataset performs all dataset-specific transformations together, and the dataloader conversion then happens in the respective functions _create_queries_dataloader and _create_document_dataloader. So datasets with more than one modality are now handled.
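The "prepare all modalities once, then wrap in a dataloader" pattern described above can be sketched in plain Python. This is an illustrative sketch, not mteb's actual implementation; the function name prepare_multimodal_dataset and the transform-dict API are assumptions made for the example.

```python
# Hypothetical sketch of the "prepare once, load after" pattern: instead of
# picking a single modality and dropping the rest, apply every modality's
# transform whose column is present in the row. Names are illustrative.
from typing import Any, Callable

MODALITIES = ("text", "image", "audio", "video")

def prepare_multimodal_dataset(
    rows: list[dict[str, Any]],
    transforms: dict[str, Callable[[Any], Any]],
) -> list[dict[str, Any]]:
    prepared = []
    for row in rows:
        new_row = dict(row)
        for modality in MODALITIES:
            if modality in new_row and modality in transforms:
                new_row[modality] = transforms[modality](new_row[modality])
        prepared.append(new_row)
    return prepared

rows = [{"id": "q1", "text": " What is this? ", "image": "<pil-image>"}]
out = prepare_multimodal_dataset(rows, {"text": str.strip, "image": lambda im: im})
print(sorted(out[0]))  # ['id', 'image', 'text'] -- both modalities survive
```

The key point is that the per-row loop touches every modality column it finds, so nothing is discarded before the dataloader is built.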

Tested with EncyclopediaVQAIT2ITRetrieval using the dummy script below:

from datasets import Dataset
from datasets import Image as HFImage
from PIL import Image as PILImage

from mteb._create_dataloaders import _create_queries_dataloader
from mteb.tasks.retrieval.eng import EncyclopediaVQAIT2ITRetrieval
from mteb.types import PromptType

# Get the real task metadata
task = EncyclopediaVQAIT2ITRetrieval()
task_metadata = task.metadata

# Create a small synthetic dataset mimicking the real one:
# features: ['id', 'modality', 'text', 'image'], where image is HF Image type
dummy_images = [PILImage.new("RGB", (32, 32), color="red") for _ in range(4)]
dataset = Dataset.from_dict(
    {
        "id": ["q1", "q2", "q3", "q4"],
        "modality": ["image_text"] * 4,
        "text": ["What is this?", "Describe this.", "Name this.", "Identify this."],
        "image": dummy_images,
    }
).cast_column("image", HFImage())

# Verify: category is "it2it", so query modalities = ["image", "text"]
print("Query modalities:", task_metadata.get_modalities(PromptType.query))
# Expected: ['image', 'text']

# Now test the refactored function

dl = _create_queries_dataloader(
    dataset,
    task_metadata,
    batch_size=2,
    input_column=None,
    num_proc=None,
)

batch = next(iter(dl))
print("Batch keys:", batch.keys())
# Before refactor: dict_keys(['image'])
# After refactor:  dict_keys(['id', 'modality', 'text', 'image', 'query'])

# Verify text was processed (query key exists)
print("Has 'text':", "text" in batch)
print("Has 'query':", "query" in batch)
print("Has 'image':", "image" in batch)

Output of above script:

Query modalities: ['image', 'text']
Processing queries for dataloading: 100%|█| 4/4 [00:00<00:00, 551.86 ex
Batch keys: dict_keys(['id', 'modality', 'text', 'image', 'query'])
Has 'text': True
Has 'query': True
Has 'image': True

So, it's working as expected now.

I'd like to propose two more extensions to this refactor to make it cleaner and easier to use (removing the clutter of many near-duplicate functions):

We can deprecate/remove the following functions, as their work is already absorbed into other functions during the refactor and they are now duplicates:

_create_dataloader_for_retrieval_corpus, _create_text_dataloader_for_queries, _create_dataloader_for_queries_conversation

Also, I think we can update bm25 and bb25, which use _create_text_queries_dataloader, to call _combine_queries_with_instruction_text instead, and then remove the _create_text_queries_dataloader function as well.
In both of these files, we first create a dataloader and then immediately flatten it back:
https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/model_implementations/bm25.py#L89-L90
So instead, we can simply call _combine_queries_with_instruction_text directly.

Also, the functions _create_image_dataloader, _create_audio_dataloader, and _create_video_dataloader are called only from create_dataloader, as part of the fallback when prompt_type=None. All these functions do is rename a column and wrap the result in a DataLoader, so that work can be done directly in create_dataloader.
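Since the per-modality functions reduce to a column rename before the DataLoader wrap, that step can be a small inline helper. A minimal sketch, assuming hypothetical source column names; this is not mteb's actual rename table:

```python
# Sketch of folding the per-modality dataloaders into create_dataloader:
# their only real work was renaming a column before wrapping in a DataLoader,
# which one inline rename step can cover. Column names here are illustrative.
RENAMES = {"img": "image", "wav": "audio", "clip": "video"}

def rename_columns(row: dict) -> dict:
    """Map any known source column name to its canonical modality name."""
    return {RENAMES.get(k, k): v for k, v in row.items()}

row = {"id": "d1", "img": "<pil-image>"}
print(rename_columns(row))  # {'id': 'd1', 'image': '<pil-image>'}
```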

Would love to hear your opinion on both of these refactor extensions @KennethEnevoldsen @Samoed

Copilot AI review requested due to automatic review settings March 12, 2026 14:36
Contributor

Copilot AI left a comment


Pull request overview

This PR refactors retrieval query/document dataloader creation to support datasets with multiple modalities (e.g., ["image", "text"] for it2it tasks), addressing issue #4182 where only one modality was previously included in the dataloader batches.

Changes:

  • Introduces _prepare_multimodal_dataset() to centralize modality-specific dataset transformations before wrapping in a DataLoader.
  • Updates _create_queries_dataloader() and _create_document_dataloader() to use the shared preparation function and to select a custom collate function when needed.
  • Enables multimodal retrieval tasks to produce batches containing all expected modality keys (e.g., both image and text).


Comment on lines +492 to +498
prepared = _prepare_multimodal_dataset(
    dataset,
    task_metadata,
    prompt_type=PromptType.document,
    input_column=input_column,
    num_proc=num_proc,
)

Copilot AI Mar 12, 2026


_create_document_dataloader/_prepare_multimodal_dataset don't actually implement the behavior described in the surrounding docstring (selecting the first column matching the modality when input_column is None). Right now the code assumes canonical column names (e.g., "image") unless input_column is explicitly provided. Consider either adding the advertised column inference or adjusting the docstring so callers aren't misled.

Comment on lines +395 to +410
if prompt_type == PromptType.document:
    new_ds = new_ds.map(
        _corpus_to_dict,
        desc="Standardizing text corpus format",
        num_proc=num_proc,
    )
elif prompt_type == PromptType.query:
    if isinstance(new_ds["text"][0], list):
        new_ds = new_ds.map(
            _convert_conv_history_to_query,
            desc="Converting conversations to queries",
            num_proc=num_proc,
        )
    else:
        new_ds = new_ds.map(
            _combine_queries_with_instruction_text,

Copilot AI Mar 12, 2026


In _prepare_multimodal_dataset, the .map(...) calls for text processing run over the full multimodal rows. For datasets that include heavy columns (e.g., image/audio/video), this can force decoding/serialization of those columns during mapping even though the mapper only needs text fields, which can significantly slow preprocessing and increase memory use (especially with num_proc). Consider using Dataset.map(..., input_columns=[...]) (or an equivalent approach) so the mapper only receives the columns it actually needs (e.g., ['id','text','title'] for _corpus_to_dict, ['text','instruction'] for _combine_queries_with_instruction_text).

Suggested change (original lines, then replacement):

Original:

if prompt_type == PromptType.document:
    new_ds = new_ds.map(
        _corpus_to_dict,
        desc="Standardizing text corpus format",
        num_proc=num_proc,
    )
elif prompt_type == PromptType.query:
    if isinstance(new_ds["text"][0], list):
        new_ds = new_ds.map(
            _convert_conv_history_to_query,
            desc="Converting conversations to queries",
            num_proc=num_proc,
        )
    else:
        new_ds = new_ds.map(
            _combine_queries_with_instruction_text,

Suggested:

if prompt_type == PromptType.document:
    corpus_input_columns = [
        col
        for col in ("id", "text", "title")
        if col in new_ds.column_names
    ]
    new_ds = new_ds.map(
        _corpus_to_dict,
        input_columns=corpus_input_columns,
        desc="Standardizing text corpus format",
        num_proc=num_proc,
    )
elif prompt_type == PromptType.query:
    if isinstance(new_ds["text"][0], list):
        conv_input_columns = [
            col
            for col in ("text",)
            if col in new_ds.column_names
        ]
        new_ds = new_ds.map(
            _convert_conv_history_to_query,
            input_columns=conv_input_columns,
            desc="Converting conversations to queries",
            num_proc=num_proc,
        )
    else:
        query_input_columns = [
            col
            for col in ("text", "instruction")
            if col in new_ds.column_names
        ]
        new_ds = new_ds.map(
            _combine_queries_with_instruction_text,
            input_columns=query_input_columns,
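The `input_columns` behavior this review relies on is a real `datasets.Dataset.map` parameter: when it is set, the mapper receives only the listed columns as positional arguments, so heavy columns (decoded images, audio) are never materialized for text-only mapping. A plain-Python emulation of that contract, with no dependency on `datasets`:

```python
# Toy emulation of Dataset.map(input_columns=...): with input_columns set,
# the mapper sees only the selected columns (passed positionally), and its
# returned dict is merged back into the full row.
def map_rows(rows, fn, input_columns=None):
    out = []
    for row in rows:
        if input_columns is None:
            update = fn(row)  # mapper sees the full row
        else:
            args = [row[c] for c in input_columns]  # selected columns only
            update = fn(*args)
        out.append({**row, **update})
    return out

rows = [{"text": "hello", "instruction": "greet", "image": "<heavy-image-bytes>"}]
mapped = map_rows(
    rows,
    lambda text, instruction: {"query": f"{instruction} {text}"},
    input_columns=["text", "instruction"],
)
print(mapped[0]["query"])  # greet hello -- the image column was never read
```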
Member

Samoed commented Mar 12, 2026

We can deprecate/remove the following functions as their work is already absorbed during the refactor into other functions and they are just duplicates

Agree

so that thing can be done in the create_dataloader function only.

Agree

Collaborator Author

ayush1298 commented Mar 12, 2026

Simplified as much as possible.
Now there is just a create_dataloader function as the single entry point, plus the _prepare_multimodal_dataset function, which performs the dataset-specific transformations for all modalities (handling more than one modality) and then calls the corresponding helper function. I removed all the modality-specific dataloaders, as they were no longer needed, and kept _create_dataloader_from_texts since it is used by multiple task evaluators.



Development

Successfully merging this pull request may close these issues.

"text" is not included in it2it tasks
