-
Notifications
You must be signed in to change notification settings - Fork 97
Open
Description
The Croissant file exposed by HuggingFace seems to correspond to the parquet branch of the dataset, even when the dataset is native parquet:
IIUC, the parquet branch is not complete for datasets >5GB (not exactly like that since the 5GB are per split), but overall the branch can be often incomplete for large datasets. There are exceptions though, in this dataset the Parquet branch seems complete:
Instead, there should be a way of retrieving a Croissant referring to the main native-parquet branch. Maybe for backward compatibility it would be better to expose both Croissant files (parquet branch and main branch) although exposing only the "complete" one could also be an option.
Metadata
Metadata
Assignees
Labels
No labels