perf: separate C4 or CommonCrawl URLs from OpenWebText URLs #59

@tianjianjiang

Background

According to https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/text/c4.py, c4/webtextlike uses the following Common Crawl versions:

OPENWEBTEXT_CC_VERSIONS = (  # August 2018 - July 2019
    "2019-18",  # Original default for single-crawl dataset (April 2019).
    "2019-30",
    "2019-26",
    "2019-22",
    "2019-13",
    "2019-09",
    "2019-04",
    "2018-51",
    "2018-47",
    "2018-43",
    "2018-39",
    "2018-34")

However, almost all OpenWebText URLs predate the CC indices above, the exceptions being 2018-34, 2018-39, and 2018-43.
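To make the date comparison concrete, here is a minimal sketch that turns each crawl label into an approximate calendar date. It assumes the `YYYY-WW` suffix is a year plus ISO week number, which matches the "August 2018 - July 2019" comment in the tuple but is not something c4.py states; `crawl_week_start` is a hypothetical helper, not part of tensorflow_datasets.

```python
from datetime import date

# Copied from the tuple quoted above.
OPENWEBTEXT_CC_VERSIONS = (
    "2019-18", "2019-30", "2019-26", "2019-22", "2019-13", "2019-09",
    "2019-04", "2018-51", "2018-47", "2018-43", "2018-39", "2018-34",
)


def crawl_week_start(version: str) -> date:
    """Return the Monday of the week named by a 'YYYY-WW' crawl label.

    Assumption: WW is an ISO week number.
    """
    year, week = (int(part) for part in version.split("-"))
    return date.fromisocalendar(year, week, 1)


# Zero-padded labels sort chronologically as plain strings.
for version in sorted(OPENWEBTEXT_CC_VERSIONS):
    print(version, crawl_week_start(version).isoformat())
```

Under that assumption the earliest crawl, 2018-34, starts on 2018-08-20, so any URL collected before then can only be matched if the page was still live and re-crawled later.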

Since C4 downloads a much larger set of CC WET texts and then filters them by different sets of URLs, we probably won't gain much throughput by building on the C4 configurations.
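A rough illustration of why the download-then-filter pattern costs throughput: every record has to be read before the URL check can discard it. This is only a sketch with made-up records and a hypothetical `keep_allowed` helper; it is not how C4 implements the filter, and it skips any URL normalization a real pipeline would need.

```python
from typing import Iterable, Iterator, NamedTuple, Set


class Record(NamedTuple):
    url: str
    text: str


def keep_allowed(records: Iterable[Record], allowed_urls: Set[str]) -> Iterator[Record]:
    """Yield only the records whose URL is in the allow-list."""
    for record in records:
        if record.url in allowed_urls:
            yield record


# Made-up stand-ins for WET records and for an OpenWebText-style URL list.
records = [
    Record("https://example.com/a", "kept page"),
    Record("https://example.com/b", "dropped page"),
    Record("https://example.org/c", "dropped page"),
]
allowed = {"https://example.com/a", "https://example.net/never-crawled"}

kept = list(keep_allowed(records, allowed))
print(f"scanned {len(records)} records, kept {len(kept)}")
```

The cost scales with the number of records scanned, not with the size of the URL list, which is why a small URL set does not make the crawl pass any cheaper.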

Another intriguing point: the AllenNLP people tried to replicate c4/webtextlike but stopped, cf. https://huggingface.co/datasets/allenai/c4/blame/f888b0f407c37dd4a0e52d0c3bf56b8a7088f58b/README.md. I wonder what happened...

Labels: enhancement, wontfix