Closed
Labels: enhancement, wontfix
Background
According to https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/text/c4.py, c4/webtextlike uses the following Common Crawl versions:
OPENWEBTEXT_CC_VERSIONS = ( # August 2018 - July 2019
"2019-18", # Original default for single-crawl dataset (April 2019).
"2019-30",
"2019-26",
"2019-22",
"2019-13",
"2019-09",
"2019-04",
"2018-51",
"2018-47",
"2018-43",
"2018-39",
"2018-34")

However, almost all OpenWebText URLs are older than the CC indices above; the only exceptions are 2018-34, 2018-39, and 2018-43.
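
To make the date comparison concrete, here is a rough sketch that maps each crawl label to an approximate calendar date. It assumes the `YYYY-WW` suffix is an ISO week number, so the dates are only week-level approximations of when each crawl ran. `crawl_week_start` and `HYPOTHETICAL_CUTOFF` are names made up for this illustration; the cutoff date is a placeholder, not a measured property of OpenWebText.

```python
from datetime import date

# Copied from the c4.py snippet quoted above.
OPENWEBTEXT_CC_VERSIONS = (
    "2019-18", "2019-30", "2019-26", "2019-22", "2019-13", "2019-09",
    "2019-04", "2018-51", "2018-47", "2018-43", "2018-39", "2018-34",
)


def crawl_week_start(version: str) -> date:
    """Return the Monday of the ISO week named by a 'YYYY-WW' label.

    Assumption: the label encodes an ISO year and week number.
    """
    year, week = (int(part) for part in version.split("-"))
    return date.fromisocalendar(year, week, 1)  # requires Python 3.8+


# Order the crawls chronologically and show the span they cover.
by_date = sorted(OPENWEBTEXT_CC_VERSIONS, key=crawl_week_start)
print("earliest:", by_date[0], crawl_week_start(by_date[0]))
print("latest:  ", by_date[-1], crawl_week_start(by_date[-1]))

# Hypothetical cutoff, chosen only to illustrate the comparison.
HYPOTHETICAL_CUTOFF = date(2018, 11, 1)
overlapping = [v for v in by_date if crawl_week_start(v) <= HYPOTHETICAL_CUTOFF]
print("crawls not after the cutoff:", overlapping)
```

With a real "newest URL" date for OpenWebText in place of the placeholder, the same comparison would show which crawls could plausibly contain those pages.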
Since C4 downloads a much larger set of CC WET texts and then filters them by a different set of URLs, we probably won't gain much throughput by reusing the C4 configurations.
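
The throughput concern can be illustrated with a toy sketch. It is not the C4 pipeline: `Record` and `filter_by_urls` are hypothetical names, and the records are in-memory tuples standing in for WET entries. The point is only that a URL allow-list shrinks the output, while every downloaded record still has to be scanned.

```python
from typing import Iterable, List, Set, Tuple

# (url, extracted text); a stand-in for one WET record.
Record = Tuple[str, str]


def filter_by_urls(
    records: Iterable[Record], allowed: Set[str]
) -> Tuple[List[Record], int]:
    """Keep records whose URL is in `allowed`; also count records scanned."""
    kept: List[Record] = []
    scanned = 0
    for url, text in records:
        scanned += 1  # every record is read, whether or not it is kept
        if url in allowed:
            kept.append((url, text))
    return kept, scanned


records = [
    ("http://a.example/1", "alpha"),
    ("http://b.example/2", "beta"),
    ("http://c.example/3", "gamma"),
]
kept, scanned = filter_by_urls(records, {"http://b.example/2"})
print(f"kept {len(kept)} of {scanned} scanned records")
```

However small the allow-list is, `scanned` stays equal to the number of input records, which is why filtering after download does little for throughput.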
Another intriguing point is that the AllenNLP team tried to replicate c4/webtextlike but stopped; see https://huggingface.co/datasets/allenai/c4/blame/f888b0f407c37dd4a0e52d0c3bf56b8a7088f58b/README.md. I wonder what happened...