Conversation

@merveenoyan
Contributor

More and more datasets are showing up for multimodal tasks, and some authors are picking the wrong task tags because hideInDataset is true for those tasks, so this PR removes those flags.

Member

@Vaibhavs10 Vaibhavs10 left a comment

Do we have any intuition of how many such cases there are?

@pcuenca
Member

pcuenca commented May 26, 2025

Yes, a few examples would be great for better understanding. I see, for example, this one that could possibly be assigned an image-text-to-text tag, but I wonder if other VQA datasets, such as the Cauldron, should have the same.

Member

@pcuenca pcuenca left a comment

Spoke offline with Merve and had another look at things.

I'd be supportive of merging, given that:

But please, let's wait for vb to come back and see if he has additional insight!

Member

@julien-c julien-c left a comment

no objection!

Member

@Vaibhavs10 Vaibhavs10 left a comment

Thanks for pulling the numbers! My only recommendation/suggestion would be to tag a few more datasets for the following:

any-to-any (13), visual-document-retrieval (8)

at least so we have one full page of datasets.

@pcuenca
Member

pcuenca commented Jun 20, 2025

Can we maybe merge this PR? We can always iterate later.

@merveenoyan
Contributor Author

merveenoyan commented Jun 20, 2025

Sorry, with the releases going on I couldn't work on this. @pcuenca, I'm currently opening automatic PRs to a lot of models; I think it's OK to merge this.

@merveenoyan
Contributor Author

I have opened more than 100 PRs, merging this, thanks a ton!

@merveenoyan merveenoyan merged commit 0e2b369 into main Jun 20, 2025
4 of 5 checks passed
@merveenoyan merveenoyan deleted the mm-datasets branch June 20, 2025 13:50
5 participants