perf: separate C4 or CommonCrawl URLs from OpenWebText URLs

# Background
According to https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/text/c4.py, `c4/webtextlike` uses the following CommonCrawl versions:
```python
OPENWEBTEXT_CC_VERSIONS = (  # August 2018 - July 2019
    "2019-18",  # Original default for single-crawl dataset (April 2019).
    "2019-30",
    "2019-26",
    "2019-22",
    "2019-13",
    "2019-09",
    "2019-04",
    "2018-51",
    "2018-47",
    "2018-43",
    "2018-39",
    "2018-34")
```
However, OpenWebText URLs are almost all older than the above CC indices, except for `2018-34`, `2018-39`, and `2018-43`.
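To make the overlap concrete, here is a minimal sketch that maps each crawl label to an approximate start date and keeps the crawls that begin on or before a cutoff. It assumes the `YYYY-WW` label denotes an ISO year and week, and `HYPOTHETICAL_CUTOFF` is a made-up value picked only so that the result matches the three crawls named above; it is not a measured OpenWebText date. The helper names are illustrative and do not come from `c4.py`.

```python
from datetime import date

OPENWEBTEXT_CC_VERSIONS = (
    "2019-18", "2019-30", "2019-26", "2019-22", "2019-13", "2019-09",
    "2019-04", "2018-51", "2018-47", "2018-43", "2018-39", "2018-34",
)


def crawl_week_start(version: str) -> date:
    """Return the Monday of the week named by a 'YYYY-WW' crawl label.

    Assumption: the label is an ISO year-week pair.
    """
    year, week = (int(part) for part in version.split("-"))
    return date.fromisocalendar(year, week, 1)  # Python 3.8+


def crawls_on_or_before(versions, cutoff: date):
    """Return the crawl labels whose week starts on or before `cutoff`."""
    return sorted(v for v in versions if crawl_week_start(v) <= cutoff)


# Hypothetical cutoff, for illustration only.
HYPOTHETICAL_CUTOFF = date(2018, 10, 31)

overlapping = crawls_on_or_before(OPENWEBTEXT_CC_VERSIONS, HYPOTHETICAL_CUTOFF)
print(overlapping)
```

With a real per-URL date distribution for OpenWebText, the same comparison could be run against each crawl instead of a single cutoff.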

Since C4 downloads a much larger set of CC WET texts and then filters those texts by different sets of URLs, we probably won't gain much throughput from the C4 configurations.
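The download-then-filter pattern can be sketched roughly as below. This is not the C4 implementation: `filter_by_urls` and the `(url, text)` record shape are hypothetical stand-ins for parsed WET records, used only to show that every record is still read even when the URL set is small.

```python
from typing import Iterable, Iterator, Set, Tuple

Record = Tuple[str, str]  # (url, extracted text); hypothetical record shape


def filter_by_urls(records: Iterable[Record], allowed_urls: Set[str]) -> Iterator[Record]:
    """Yield only the records whose URL is in `allowed_urls`.

    Every record is still iterated, so the cost of downloading and parsing
    is paid regardless of how few URLs are kept.
    """
    for url, text in records:
        if url in allowed_urls:
            yield url, text


# Toy data standing in for parsed WET records.
records = [
    ("https://example.com/a", "kept"),
    ("https://example.com/b", "dropped"),
]
allowed = {"https://example.com/a"}

kept = list(filter_by_urls(records, allowed))
print(kept)
```

Separating the URL lists only changes which records survive the membership check; it does not reduce how many WET files have to be fetched.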

Another intriguing point is that AllenNLP people tried to replicate `c4/webtextlike` but stopped, cf. https://huggingface.co/datasets/allenai/c4/blame/f888b0f407c37dd4a0e52d0c3bf56b8a7088f58b/README.md. I wonder what happened...
