Mismatch of the Available Data Quantity on Huggingface #37

I recently tried to download the English part of ROOTS.
According to the paper, it contains 484,953,009,124 bytes of English data.
However, after downloading all ROOTS-related datasets on [huggingface](https://huggingface.co/datasets?language=language:en&sort=downloads&search=bigscience-data%2F) (filtered by English language and the `bigscience-data/` search term), I found only about 43.8 GB of data.
How can this difference be explained?
Are the huggingface datasets only a subset of ROOTS?
Or are they a processed version of ROOTS, so that the size shrinks from about 480 GB to 43.8 GB?
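
To make the size of the gap concrete, here is a quick sanity check of the two numbers above. It is only a sketch: the byte count is the figure quoted from the paper, the 43.8 figure is what I measured after downloading, and since I am not sure whether my tool reported decimal GB or binary GiB, the script computes the ratio under both assumptions. The helper names are my own.

```python
# Figure quoted from the paper for the English part of ROOTS.
PAPER_BYTES = 484_953_009_124

# Size I measured after downloading; unit is ambiguous (GB vs GiB),
# so both interpretations are checked below. This is an assumption.
DOWNLOADED_SIZE = 43.8


def bytes_to_gb(n_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (10**9 bytes)."""
    return n_bytes / 10**9


def bytes_to_gib(n_bytes: int) -> float:
    """Convert bytes to binary gibibytes (2**30 bytes)."""
    return n_bytes / 2**30


paper_gb = bytes_to_gb(PAPER_BYTES)
paper_gib = bytes_to_gib(PAPER_BYTES)

# Fraction of the paper's figure that was actually downloaded,
# under each interpretation of the 43.8 measurement.
fraction_if_gb = DOWNLOADED_SIZE * 10**9 / PAPER_BYTES
fraction_if_gib = DOWNLOADED_SIZE * 2**30 / PAPER_BYTES

print(f"Paper figure: {paper_gb:.1f} GB ({paper_gib:.1f} GiB)")
print(f"Downloaded share if 43.8 is GB:  {fraction_if_gb:.1%}")
print(f"Downloaded share if 43.8 is GiB: {fraction_if_gib:.1%}")
```

Under either interpretation the downloaded data is less than a tenth of the figure in the paper, so the gap cannot be a unit mix-up alone.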

