
Croissant refers to incomplete parquet branch in native parquet datasets #3101

@fylux

The Croissant file exposed by HuggingFace seems to correspond to the parquet branch of the dataset, even when the dataset is native parquet:

IIUC, the parquet branch is not complete for datasets >5GB (not exactly, since the 5GB limit is per split), so the branch is often incomplete for large datasets. There are exceptions, though. In this dataset the Parquet branch seems complete:

Instead, there should be a way to retrieve a Croissant file that refers to the main native-parquet branch. For backward compatibility it might be better to expose both Croissant files (parquet branch and main branch), although exposing only the "complete" one could also be an option.
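
For anyone who wants to check which branch a given Croissant file points to, here is a minimal sketch. It assumes the Croissant JSON-LD lists its sources under `distribution` with a `contentUrl` per entry, and that the auto-converted branch is named `refs/convert/parquet` (possibly URL-encoded in the link). The helper name `parquet_branch_urls` and the sample document are hypothetical, for illustration only.

```python
from urllib.parse import unquote

# Assumed name of the auto-converted parquet branch.
PARQUET_REF = "refs/convert/parquet"


def parquet_branch_urls(croissant: dict) -> list[str]:
    """Return the contentUrl values that point at the parquet branch."""
    urls = []
    for item in croissant.get("distribution", []):
        url = item.get("contentUrl", "")
        # Decode first, since the branch name may appear as refs%2Fconvert%2Fparquet.
        if isinstance(url, str) and PARQUET_REF in unquote(url):
            urls.append(url)
    return urls


# Hypothetical Croissant fragment; "user/dataset" is a placeholder repo id.
sample = {
    "distribution": [
        {
            "@type": "cr:FileObject",
            "contentUrl": "https://huggingface.co/datasets/user/dataset/tree/refs%2Fconvert%2Fparquet",
        },
        {"@type": "cr:FileSet", "includes": "default/*/*.parquet"},
    ]
}

print(parquet_branch_urls(sample))
```

A non-empty result would mean the Croissant file refers to the parquet branch rather than to the main branch.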
