https://arxiv.org/abs/2204.14198
43M web pages, yielding 183M images and 182 GB of text.
At most 5 images per page (the authors cap it at that).
Each sample is a sequence of text and images that are broadly in the same context.
I think we would need some filtering beyond "dump the whole page".
It is quite likely that some images are unrelated to the rest of the text, which isn't useful.
I guess we could use CLIP to tell us what to keep.
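
A rough sketch of what that CLIP-based filter could look like, assuming the page text and each image have already been embedded by some CLIP model (the embedding step is left out here). The function name `select_images`, the `threshold` value of 0.2, and the choice to keep the first 5 passing images in page order are all assumptions for illustration, not something taken from the paper.

```python
import numpy as np


def select_images(text_emb, image_embs, threshold=0.2, max_images=5):
    """Return indices of the images to keep for one page, in page order.

    text_emb:   1-D embedding of the page text (assumed to come from CLIP).
    image_embs: 2-D array, one row per image embedding (same dimension).
    threshold:  hypothetical cosine-similarity cutoff; would need tuning.
    max_images: cap on images per page (5, matching the limit noted above).
    """
    text_emb = np.asarray(text_emb, dtype=np.float64)
    image_embs = np.asarray(image_embs, dtype=np.float64)
    if image_embs.size == 0:
        return []
    image_embs = image_embs.reshape(-1, text_emb.shape[0])

    # Normalise so the dot product below is a cosine similarity.
    text_unit = text_emb / np.linalg.norm(text_emb)
    image_units = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = image_units @ text_unit

    # Drop images that look unrelated to the text, then apply the per-page cap.
    keep = [i for i, s in enumerate(sims) if s >= threshold]
    return keep[:max_images]
```

The cutoff would have to be picked by looking at real pages; scoring each image against only its neighbouring paragraph instead of the whole page is another option.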
mehdidc and TheodoreGalanos
