feat(commoncrawl-CC-MAIN-2025-43-draft/metadata.json): add remaining RecordSets #1006

Open
handecelikkanat wants to merge 3 commits into main from hande/add-commoncrawl-ds-recordsets

Conversation

@handecelikkanat
Contributor

This PR adds the remaining 6 RecordSets:
- wat-records
- wet-records
- robotstxt-records
- non200responses-records
- cc-index-records
- cc-index-table-records

@github-actions

github-actions bot commented Jan 30, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@handecelikkanat
Contributor Author

handecelikkanat commented Jan 30, 2026

@mqzhou-dev I added the remaining 6 RecordSets. Do they look alright to you?


EDIT: Resolved - we don't have this feature yet.

Btw - did your previous PR (#1001) allow adding a prefix to the URLs read in by read_lines? For example, prepending the fixed prefix "http://data.commoncrawl.org/" to each line that is read? I am not sure whether this is left for the future or has already been added :)
I am asking so that I can add it to our metadata if it can already be represented.
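
To make the question concrete, here is a minimal sketch of the transformation I mean, written in plain Python. It is not an mlcroissant API and not something PR #1001 provides; the helper name `prefix_lines` and the sample paths are hypothetical and only illustrate prepending the fixed prefix to each line read from a manifest.

```python
# Hypothetical helper: NOT part of mlcroissant or of PR #1001.
# It only illustrates the "prefix every line that was read" idea.
PREFIX = "http://data.commoncrawl.org/"


def prefix_lines(lines, prefix=PREFIX):
    """Yield prefix + line for every non-empty line of a manifest."""
    for line in lines:
        path = line.strip()  # drop the trailing newline and stray spaces
        if path:  # skip blank lines
            # Avoid a double slash if the line already starts with "/".
            yield prefix + path.lstrip("/")


# Made-up sample lines standing in for the content of a paths manifest.
sample = [
    "crawl-data/example/warc.paths.gz\n",
    "\n",
    "/crawl-data/example/wat.paths.gz\n",
]
urls = list(prefix_lines(sample))
```

If something equivalent can be expressed in the metadata, each line of the manifest would turn into a full download URL instead of a relative path.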

}
},
"encodingFormat": "application/warc",
"includes": "*.warc.gz",
Contributor Author

@mqzhou-dev @benjelloun

Should this "includes" be URL-like, rather than "*.warc.gz"?

Because in the associated record set (here), the type is URL:

  "recordSet": [
    {
      "@id": "warc-records",
      "@type": "cr:RecordSet",
      "field": [
        {
          "@id": "warc-records/url",
          ...
          "dataType": "sc:URL",
...

So the manifest FileSet includes some URLs, not the actual files.
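
For reference, a small sketch of how a glob such as "*.warc.gz" behaves against bare file names, relative paths, and full URLs. It assumes Python's `fnmatch` semantics, which may not be how the Croissant loader evaluates "includes"; the candidate strings are made up for illustration.

```python
from fnmatch import fnmatchcase

PATTERN = "*.warc.gz"

# Made-up candidates: a bare file name, a relative path, a full URL,
# and a file with a different double extension.
candidates = [
    "example-00000.warc.gz",
    "crawl-data/EXAMPLE/warc/example-00000.warc.gz",
    "http://data.commoncrawl.org/crawl-data/EXAMPLE/warc/example-00000.warc.gz",
    "example-00000.warc.wat.gz",
]

# fnmatchcase is case-sensitive on every platform, and its "*" also
# matches "/" characters, so paths and URLs match just like file names.
matches = {name: fnmatchcase(name, PATTERN) for name in candidates}

for name, ok in matches.items():
    print(f"{ok!s:5}  {name}")
```

Under these assumed semantics the glob matches the URL form as well, so the pattern alone does not say whether the FileSet holds files or URLs.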

Did I get it correct this time? 😆
