[MVEB] PE-AV Model, Kinetics400 Dataset, RavdessAV Dataset #4199
AdnanElAssadi56 wants to merge 54 commits into embeddings-benchmark:main from
Conversation
mteb/abstasks/classification.py
Outdated
```python
modality_to_column = {
    "video": "video",
    "audio": "audio",
    "image": "image",
}
```
Can we just extend input_column_name to list?
This will also cause changes in the dataloader; we can do this separately.
What changes? I think it's easier to use a list for processing rather than handling it like this.
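The list suggestion could look roughly like this; the helper name and shape are illustrative assumptions, not actual mteb code:

```python
def normalize_input_columns(input_column_name):
    """Accept either a single column name or a list of names and
    always return a list, so downstream processing has one code path."""
    if isinstance(input_column_name, str):
        return [input_column_name]
    return list(input_column_name)
```

With this, `"video"` and `["video", "audio"]` both flow through the same loop instead of per-modality dict lookups.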
I think tests are failing because of
Strange. I get it without problems:

```python
import mteb

meta = mteb.models.ModelMeta.from_hub("facebook/pe-av-base-16-frame")
meta.n_embedding_parameters
# 51576832
```
```python
    num_proc=num_proc,
)
if "video" in task_metadata.modalities:
    return _create_video_dataloader(
```
Should we consider doing a more principled refactor here, as discussed in #4182?
That issue also showed how the current approach can lead to some odd interactions between modalities.
@Samoed Tests resolved here
isaac-chung left a comment
Got a few non-blocking questions. Can be addressed in a separate PR.
| "VideoPairClassification", | ||
| "VideoZeroshotClassification", | ||
| "VideoCentricQA", | ||
| "Any2AnyRetrieval", |
Duplicated?
| "Any2AnyRetrieval", |
```python
is_cross_validation: bool = False


def load_data(self, **kwargs) -> None:
```
This is a lot of code to be repeated for every task that needs this. Can we move the combine-modalities code into the AbsTask instead? And in the task instance, can we set which modalities are to be combined, perhaps via the task category, e.g. va2t, or an explicit param like input_modalities_combine=["video", "audio"]?
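A rough sketch of what that suggestion might look like; `AbsTaskSketch`, `input_modalities_combine`, and `combine_modalities` are hypothetical names, not existing mteb API:

```python
from typing import ClassVar


class AbsTaskSketch:
    """Hypothetical base class: tasks declare which modality columns
    should be merged into a single nested video input."""

    input_modalities_combine: ClassVar[list[str]] = []

    def combine_modalities(self, example: dict) -> dict:
        # No-op for tasks that don't opt in.
        if not self.input_modalities_combine:
            return example
        combined = {m: example.pop(m) for m in self.input_modalities_combine}
        example["video"] = {
            "frames": combined.get("video"),
            "audio": combined.get("audio"),
        }
        return example


class ExampleTask(AbsTaskSketch):
    input_modalities_combine: ClassVar[list[str]] = ["video", "audio"]
```

Each task would then only set the class attribute, and the repeated load_data boilerplate would live in one place.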
@Samoed @isaac-chung Changed input_column to list.
Lint is giving an error because the list is mutable.
It's looking for something like this, I think:

```python
from typing import ClassVar

input_column_name: ClassVar[list[str]] = ["video", "audio"]
```
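For context on that lint (likely Ruff's RUF012 rule): a mutable class-level default is shared by every instance, which is why the checker wants it explicitly marked as a `ClassVar`. A minimal demonstration:

```python
from typing import ClassVar


class Shared:
    # Without ClassVar the linter flags this: the list lives on the
    # class object, so all instances see (and can mutate) the same list.
    input_column_name: ClassVar[list[str]] = ["video", "audio"]


a, b = Shared(), Shared()
a.input_column_name.append("text")
# b observes the mutation too, since the attribute is class-level.
```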
```python
class Kinetics400Classification(AbsTaskClassification):
    metadata = TaskMetadata(
        name="Kinetics400",
```
Does this task originally have audio as a separate modality, or is it just part of the video? We probably need to add an annotation for whether a video has audio or not.
For all our current tasks, the audio column is referring to the audio coming from the video itself.
If we later add tasks that do otherwise, we can probably add some metadata differentiating the two.
Can we keep video as one column input then?
Some videos don't have audio, so this wouldn't differentiate them.
We would also no longer need to combine them like you did in msr-vtt. It would be clear that video always refers to frames and audio refers to the audio from the video.
Now the video has an incorrect format, because it should be a dict with frames.
No, the VideoCollator in _create_dataloaders.py still wraps the frames in a dictionary before passing them to the model.
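A rough sketch of the wrapping being described; this mirrors the idea only and is not the actual VideoCollator implementation:

```python
def video_collate(batch: list[dict]) -> list[dict]:
    """Wrap raw frame/audio columns into the nested video dict
    before the batch reaches the model."""
    return [
        {"video": {"frames": ex["video"], "audio": ex.get("audio")}}
        for ex in batch
    ]
```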
I can't find where this happens. When I tried to run MSRVTTV2T with mteb/baseline-random-encoder I got an error:

```
TypeError: Unsupported key type: <class 'str'>. Supported types are int and slice
```

And I don't think the collator should change the input format. Videos from new tasks will be passed in the wrong format.
```python
from mteb.abstasks.task_metadata import TaskMetadata


class RAVDESSAVClustering(AbsTaskClustering):
```
Does this task originally have audio as a separate modality, or is it just part of the video?
```python
if isinstance(input_column, str):
    text_data = dataset[input_column]
elif "text" in input_column and "text" in dataset.column_names:
    text_data = dataset["text"]
else:
    raise ValueError(
        "Cannot determine which column to use for text evaluation. "
        "Please include 'text' in input_column_name or use a single string."
    )
```
When the text-evaluator needs to pull text data, this ensures it selects the "text" column from that input column list rather than crashing or trying to embed an audio column as text.
```python
query = query.map(
    _combine_modalities,
    features=Features(
        {
            "id": query_features["id"],
            "video": {
                "frames": query_features["video"],
                "audio": query_features["audio"],
            },
        }
    ),
)
```
Why do you change the input format? We agreed to process videos as
`{"video": {"frames": ..., "audio": ...}}`. We need to pass the data in the correct format after `load_data`; we shouldn't fix it in the collators.
Do you want to do this on the task side for all tasks?
The main issue I had with your approach was distinguishing between normal videos and silent videos. I think I’ll stick with input_column being just "video" rather than a list, and specify the modalities as video and audio, noting that the video contains audio. The only downside is that this makes VA2VA vs. V2A ambiguous.
> distinguishing between normal videos and silent videos

We can make the audio in a video optional.
(From closed PR)
Adds the following:

- mteb/kinetics-400
- mteb/RAVDESS_AV
- PE-AV (Facebook)

Close #3797

Also includes some remaining components from the parallel video integration work we accidentally did.