feat(commoncrawl-CC-MAIN-2025-43-draft/metadata.json): add remaining RecordSets #1006
handecelikkanat wants to merge 3 commits into main
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
@mqzhou-dev I added the remaining 6 RecordSets. Do they look alright to you? EDIT: Resolved - we don't have this feature yet.
…tputs for the 6 new RecordSets
    }
  },
  "encodingFormat": "application/warc",
  "includes": "*.warc.gz",
Should this "includes" be URL-like, rather than *.warc.gz?
Because in the associated RecordSet (here), the field's dataType is a URL:
"recordSet": [
{
"@id": "warc-records",
"@type": "cr:RecordSet",
"field": [
{
"@id": "warc-records/url",
...
"dataType": "sc:URL",
...
So the manifest FileSet would include some URLs - not the actual files.
Did I get it correct this time? 😆
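
To make the question above concrete, here is a minimal sketch of how a glob such as *.warc.gz behaves when it is applied to plain file names versus URL-like strings. It uses only Python's standard fnmatch module, so it shows generic glob semantics and is not necessarily how Croissant tooling resolves "includes". All candidate names below are hypothetical and made up for illustration.

```python
from fnmatch import fnmatchcase

# The glob from the diff under review.
PATTERN = "*.warc.gz"

# Hypothetical candidates (assumptions, not real Common Crawl paths).
candidates = [
    "example-00000.warc.gz",                               # plain file name
    "https://example.org/segments/example-00000.warc.gz",  # URL-like string
    "example-00000.warc.wat.gz",                           # different suffix
    "warc.paths.gz",                                       # listing-style name
]


def matching(pattern, names):
    """Return the names that match the glob, compared case-sensitively."""
    return [name for name in names if fnmatchcase(name, pattern)]


matches = matching(PATTERN, candidates)
for name in matches:
    print(name)
```

In this sketch the glob only constrains how a string ends, so it accepts the file name and the URL-like string alike. The pattern alone does not say whether the FileSet holds files or URLs; that depends on what the FileSet actually enumerates.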
This PR adds the remaining 6 RecordSets:
- wat-records
- wet-records
- robotstxt-records
- non200responses-records
- cc-index-records
- cc-index-table-records