Mismatch of the Available Data Quantity on Huggingface #37

@cll-mtk

I recently tried to download the English part of ROOTS.
According to the paper, it contains 484,953,009,124 bytes of English data.
However, after downloading every ROOTS-related dataset I could find on Hugging Face by filtering, I ended up with only about 43.8 GB of data.
How can this difference be explained?
Are the Hugging Face datasets only a subset of ROOTS?
Or have they been processed in a way that shrinks the English data from roughly 485 GB to 43.8 GB?
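
For reference, here is a quick sketch of the comparison. It assumes that the 43.8 GB figure is in decimal gigabytes (10^9 bytes); the variable and function names are only illustrative. Under that assumption, the downloaded data comes to roughly 9% of the size reported in the paper.

```python
# Size of the English data as reported in the paper, in bytes.
PAPER_BYTES = 484_953_009_124
# Size observed after downloading; assumed here to be decimal GB (10^9 bytes).
DOWNLOADED_GB = 43.8


def bytes_to_gb(n_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10^9 bytes)."""
    return n_bytes / 1e9


def bytes_to_gib(n_bytes: int) -> float:
    """Convert bytes to binary gibibytes (1 GiB = 2^30 bytes)."""
    return n_bytes / 2**30


paper_gb = bytes_to_gb(PAPER_BYTES)
paper_gib = bytes_to_gib(PAPER_BYTES)
fraction = DOWNLOADED_GB / paper_gb

print(f"Paper: {paper_gb:.1f} GB ({paper_gib:.1f} GiB)")
print(f"Downloaded share of the reported size: {fraction:.1%}")
```

If the 43.8 figure is actually in GiB, the share is slightly higher, but still close to one tenth of the reported size.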
