
[MVEB] PE-AV Model, Kinetics400 Dataset, RavdessAV Dataset#4199

Open
AdnanElAssadi56 wants to merge 54 commits into embeddings-benchmark:main from AdnanElAssadi56:mveb-video-integration

Conversation

Contributor

@AdnanElAssadi56 commented Mar 5, 2026

(From closed PR)

Adds the following:
mteb/kinetics-400
mteb/RAVDESS_AV
PE-AV (Facebook). Closes #3797

Also includes some remaining components from the parallel video integration work we accidentally did.

@Samoed added the new model, new dataset, and video labels Mar 5, 2026
Comment on lines +198 to +202
modality_to_column = {
"video": "video",
"audio": "audio",
"image": "image",
}
Member

Can we just extend input_column_name to a list?

Contributor Author

This will also cause changes in the dataloader; we can do this separately.

Member

What changes? I think it's easier to use a list for processing rather than processing like this.

@AdnanElAssadi56
Contributor Author

I think tests are failing because of n_embedding_parameters. It can't be calculated by the method in ModelMeta:
Could not calculate embedding parameters for facebook/pe-av-base-16-frame as config.json could not be loaded

@Samoed
Member

Samoed commented Mar 7, 2026

Strange. I get it without problems:

import mteb

meta = mteb.models.ModelMeta.from_hub("facebook/pe-av-base-16-frame")
meta.n_embedding_parameters
# 51576832

num_proc=num_proc,
)
if "video" in task_metadata.modalities:
return _create_video_dataloader(
Contributor

Should we consider doing a more principled refactor here, as discussed in #4182?

That issue also showed how the current approach can lead to some odd interactions between modalities.

Member

Yes, I'll do it a bit later

@AdnanElAssadi56
Contributor Author

AdnanElAssadi56 commented Mar 7, 2026

@Samoed Tests resolved here

Collaborator

@isaac-chung left a comment

Got a few non-blocking questions. Can be addressed in a separate PR.

"VideoPairClassification",
"VideoZeroshotClassification",
"VideoCentricQA",
"Any2AnyRetrieval",
Collaborator

Duplicated?

Suggested change
"Any2AnyRetrieval",


is_cross_validation: bool = False

def load_data(self, **kwargs) -> None:
Collaborator

This is a lot of code to be repeated for every task that needs this. Can we move the combine-modalities code into the AbsTask instead? And in the task instance, can we set which modalities to combine, perhaps via the task category (e.g. va2t) or an explicit param like input_modalities_combine=["video","audio"]?
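A rough sketch of that suggestion, assuming a mixin on the base class driven by an explicit attribute. Class and method names here are illustrative, not mteb's real AbsTask API:

```python
# Hypothetical sketch of moving the combine-modalities logic into the base
# class, configured per task via input_modalities_combine. Names are
# illustrative assumptions, not mteb's actual API.
from typing import ClassVar


class CombineModalitiesMixin:
    # Subclasses list the modality columns to merge, e.g. ["video", "audio"].
    input_modalities_combine: ClassVar[list[str]] = []

    def combine_modalities(self, example: dict) -> dict:
        """Nest the listed columns into one {"frames": ..., "audio": ...} dict."""
        if not self.input_modalities_combine:
            return example
        combined = {
            ("frames" if modality == "video" else modality): example.pop(modality)
            for modality in self.input_modalities_combine
        }
        example["video"] = combined
        return example
```

Each task then only declares which modalities to combine, and the repeated per-task load_data code collapses into one base-class call.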

@AdnanElAssadi56
Contributor Author

@Samoed @isaac-chung Changed input_column to a list.

@AdnanElAssadi56
Contributor Author

Lint is giving an error because the list default is mutable.

@isaac-chung
Collaborator

It's looking for something like this I think:

from typing import ClassVar

input_column_name: ClassVar[list[str]] = ["video", "audio"]


class Kinetics400Classification(AbsTaskClassification):
metadata = TaskMetadata(
name="Kinetics400",
Member

Does this task originally have audio as a separate modality, or is it just part of the video? We probably need to add some annotation for whether a video has audio or not.

Contributor Author

For all our current tasks, the audio column is referring to the audio coming from the video itself.

If we later add tasks that do otherwise, we can probably add some metadata differentiating the two.

Member

Can we keep video as one column input then?

Contributor Author

Some videos don't have audio, so this wouldn't differentiate them.

Contributor Author

We would also no longer need to combine them like you did in msr-vtt. It would be clear that video always refers to frames and audio refers to the audio from the video.

Member

Now video has an incorrect format, because it should be a dict with frames.

Contributor Author

No, _create_dataloaders.py's VideoCollator still wraps the frames in a dictionary before passing them to the model.
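For reference, the wrapping described here can be sketched as follows. This is a simplification for the discussion, not the actual VideoCollator code:

```python
# Simplified sketch of the collator behavior described above: raw frame and
# audio columns are nested into the agreed {"frames": ..., "audio": ...}
# dict before the batch reaches the model. Not mteb's actual VideoCollator.
def collate_video_batch(batch: list[dict]) -> list[dict]:
    return [
        {"video": {"frames": item["video"], "audio": item.get("audio")}}
        for item in batch
    ]
```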

Member

@Samoed Mar 10, 2026

I can't find where this happens. When I tried to run MSRVTTV2T with mteb/baseline-random-encoder, I got an error:

TypeError: Unsupported key type: <class 'str'>. Supported types are int and slice

And I don't think the collator should change the input format. Videos from new tasks will be passed in the wrong format.

from mteb.abstasks.task_metadata import TaskMetadata


class RAVDESSAVClustering(AbsTaskClustering):
Member

Does this task originally have audio as a separate modality, or is it just part of the video?

Comment on lines +622 to +630
if isinstance(input_column, str):
text_data = dataset[input_column]
elif "text" in input_column and "text" in dataset.column_names:
text_data = dataset["text"]
else:
raise ValueError(
"Cannot determine which column to use for text evaluation. "
"Please include 'text' in input_column_name or use a single string."
)
Member

Why is this needed?

Contributor Author

When the text evaluator needs to pull text data, this ensures it selects the "text" column from the input column list rather than crashing or trying to embed an audio column as text.

Comment on lines -55 to -66
query = query.map(
_combine_modalities,
features=Features(
{
"id": query_features["id"],
"video": {
"frames": query_features["video"],
"audio": query_features["audio"],
},
}
),
)
Member

Why do you change the input format? We agreed to process videos as
{"video": {"frames": ..., "audio": ...}}. We need to pass data in the correct format after load_data; we shouldn't fix it in collators.

Contributor Author

Do you want to do this on the task side for all tasks?

Contributor Author

The main issue I had with your approach was distinguishing between normal videos and silent videos. I think I’ll stick with input_column being just "video" rather than a list, and specify the modalities as video and audio, noting that the video contains audio. The only downside is that this makes VA2VA vs. V2A ambiguous.

Member

distinguishing between normal videos and silent videos

We can make audio in video optional.
