
[MVEB] PE-AV Model, Kinetics400 Dataset, RavdessAV Dataset#4199

Open
AdnanElAssadi56 wants to merge 54 commits into embeddings-benchmark:main from AdnanElAssadi56:mveb-video-integration

Conversation

Contributor

@AdnanElAssadi56 commented Mar 5, 2026

(From closed PR)

Adds the following:
mteb/kinetics-400
mteb/RAVDESS_AV
PE-AV (Facebook). Closes #3797

Also includes some remaining components from the parallel video integration work we accidentally did.

@Samoed added the new model, new dataset, and video labels Mar 5, 2026
Comment on lines +198 to +202
modality_to_column = {
"video": "video",
"audio": "audio",
"image": "image",
}
Member

Can we just extend input_column_name to a list?

Contributor Author

This will also cause changes in the dataloader; we can do this separately.

Member

What changes? I think it's easier to use a list for processing rather than processing like this.

@AdnanElAssadi56
Contributor Author

I think tests are failing because of n_embedding_parameters. It can't be calculated by the method in ModelMeta:
Could not calculate embedding parameters for facebook/pe-av-base-16-frame as config.json could not be loaded

@Samoed
Member

Samoed commented Mar 7, 2026

Strange. I get it without problems:

import mteb

meta = mteb.models.ModelMeta.from_hub("facebook/pe-av-base-16-frame")
meta.n_embedding_parameters
# 51576832

num_proc=num_proc,
)
if "video" in task_metadata.modalities:
return _create_video_dataloader(
Contributor

Should we consider doing a more principled refactor here, as discussed in #4182?

That issue also showed how the current approach can lead to some odd interactions between modalities.

Member

Yes, I'll do it a bit later

@AdnanElAssadi56
Contributor Author

AdnanElAssadi56 commented Mar 7, 2026

@Samoed Tests resolved here

Collaborator

@isaac-chung left a comment

Got a few non-blocking questions. Can be addressed in a separate PR.

"VideoPairClassification",
"VideoZeroshotClassification",
"VideoCentricQA",
"Any2AnyRetrieval",
Collaborator

Duplicated?

Suggested change
"Any2AnyRetrieval",


is_cross_validation: bool = False

def load_data(self, **kwargs) -> None:
Collaborator

This is a lot of code to be repeated for every task that needs this. Can we move the combine-modalities code into the AbsTask instead? And in the task instance, can we set which modalities to combine, perhaps via the task category (e.g. va2t) or an explicit param like input_modalities_combine=["video","audio"]?
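A rough sketch of that suggestion, assuming a mixin on the base class driven by an explicit attribute. Class and method names here are illustrative, not mteb's real AbsTask API:

```python
# Hypothetical sketch of moving the combine-modalities logic into the base
# class, configured per task via input_modalities_combine. Names are
# illustrative assumptions, not mteb's actual API.
from typing import ClassVar


class CombineModalitiesMixin:
    # Subclasses list the modality columns to merge, e.g. ["video", "audio"].
    input_modalities_combine: ClassVar[list[str]] = []

    def combine_modalities(self, example: dict) -> dict:
        """Nest the listed columns into one {"frames": ..., "audio": ...} dict."""
        if not self.input_modalities_combine:
            return example
        combined = {
            ("frames" if modality == "video" else modality): example.pop(modality)
            for modality in self.input_modalities_combine
        }
        example["video"] = combined
        return example
```

Each task then only declares which modalities to combine, and the repeated per-task load_data code collapses into one base-class call.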

@AdnanElAssadi56
Contributor Author

@Samoed @isaac-chung Changed input_column to a list.

@AdnanElAssadi56
Contributor Author

Lint is giving an error because the list default is mutable.

@isaac-chung
Collaborator

It's looking for something like this I think:

from typing import ClassVar

input_column_name: ClassVar[list[str]] = ["video", "audio"]


class Kinetics400Classification(AbsTaskClassification):
metadata = TaskMetadata(
name="Kinetics400",
Member

Does this task originally have audio as a separate modality, or is it just part of the video? We probably need to add some annotation for whether a video has audio or not.

Contributor Author

For all our current tasks, the audio column is referring to the audio coming from the video itself.

If we later add tasks that do otherwise, we can probably add some metadata differentiating the two.

Member

Can we keep video as one column input then?

Contributor Author

Some videos don't have audio, so this wouldn't differentiate them.

Contributor Author

We would also no longer need to combine them like you did in msr-vtt. It would be clear that video always refers to frames and audio refers to the audio from the video.

Member

Now video has an incorrect format, because it should be a dict with frames.

Contributor Author

No, _create_dataloaders.py's VideoCollator still wraps the frames in a dictionary before passing them to the model.
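For reference, the wrapping described here can be sketched as follows. This is a simplification for the discussion, not the actual VideoCollator code:

```python
# Simplified sketch of the collator behavior described above: raw frame and
# audio columns are nested into the agreed {"frames": ..., "audio": ...}
# dict before the batch reaches the model. Not mteb's actual VideoCollator.
def collate_video_batch(batch: list[dict]) -> list[dict]:
    return [
        {"video": {"frames": item["video"], "audio": item.get("audio")}}
        for item in batch
    ]
```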

Member

@Samoed Mar 10, 2026

I can't find where this happens. When I tried to run MSRVTTV2T with mteb/baseline-random-encoder, I got an error:

TypeError: Unsupported key type: <class 'str'>. Supported types are int and slice

And I don't think the collator should change the input format. Videos from new tasks will be passed in the wrong format.

from mteb.abstasks.task_metadata import TaskMetadata


class RAVDESSAVClustering(AbsTaskClustering):
Member

Does this task originally have audio as a separate modality, or is it just part of the video?

Comment on lines +622 to +630
if isinstance(input_column, str):
text_data = dataset[input_column]
elif "text" in input_column and "text" in dataset.column_names:
text_data = dataset["text"]
else:
raise ValueError(
"Cannot determine which column to use for text evaluation. "
"Please include 'text' in input_column_name or use a single string."
)
Member

Why is this needed?

Contributor Author

When the text evaluator needs to pull text data, this ensures it selects the "text" column from the input column list rather than crashing or trying to embed an audio column as text.

Comment on lines -55 to -66
query = query.map(
_combine_modalities,
features=Features(
{
"id": query_features["id"],
"video": {
"frames": query_features["video"],
"audio": query_features["audio"],
},
}
),
)
Member

Why do you change the input format? We agreed to process videos as
{"video": {"frames": ..., "audio": ...}}. We need to pass data in the correct format after load_data; we shouldn't fix it in collators.

Contributor Author

Do you want to do this on the task side for all tasks?

Contributor Author

The main issue I had with your approach was distinguishing between normal videos and silent videos. I think I’ll stick with input_column being just "video" rather than a list, and specify the modalities as video and audio, noting that the video contains audio. The only downside is that this makes VA2VA vs. V2A ambiguous.

Member

distinguishing between normal videos and silent videos

We can make audio in video optional.
