Open
Description
Several datasets have repeated text across examples:
- crawled newspapers tend to end with links to other articles, and those links are nearly identical from one example to the next
- datasets like the wiki datasets tend to start with templates, which are also the same in every example
The difficulty is that some datasets contain legitimate repetitions, such as parliamentary proceedings (for example, lm_en_the_pile_europarl).
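One possible way to detect this kind of boilerplate is to count how many examples each line appears in and flag lines shared by a large fraction of them. The sketch below is only an illustration, not an agreed approach: the function names `find_repeated_lines` and `strip_repeated_lines`, and the parameters `min_doc_fraction` and `min_length`, are hypothetical and would need tuning per dataset.

```python
from collections import Counter


def find_repeated_lines(examples, min_doc_fraction=0.5, min_length=20):
    """Return lines that occur in at least `min_doc_fraction` of the examples.

    Both thresholds are assumptions chosen for illustration only.
    """
    doc_counts = Counter()
    for text in examples:
        # Count each distinct line once per example, so a line repeated
        # inside a single example does not inflate its score.
        lines = {line.strip() for line in text.splitlines()}
        # Ignore short lines, which are likely to repeat by chance.
        doc_counts.update(line for line in lines if len(line) >= min_length)
    threshold = min_doc_fraction * len(examples)
    return {line for line, n in doc_counts.items() if n >= threshold}


def strip_repeated_lines(text, repeated):
    """Drop every line of `text` that was flagged as repeated."""
    kept = [line for line in text.splitlines() if line.strip() not in repeated]
    return "\n".join(kept)
```

For datasets with legitimate repetitions, such as lm_en_the_pile_europarl, a filter like this would probably have to be skipped or run with a much higher `min_doc_fraction`, since it cannot tell boilerplate from meaningful repeated phrasing.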