I recently tried to download the English part of ROOTS.
According to the paper, there are 484,953,009,124 bytes of English data.
However, after downloading all the ROOTS-related datasets I could find on Hugging Face by filtering, I ended up with only about 43.8 GB of data.
How can this difference be explained?
Are the Hugging Face datasets only a subset of ROOTS?
Or have they been processed in a way that shrinks the data from about 485 GB to 43.8 GB?
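
For reference, here is a quick sanity check on the numbers above. It is only a minimal sketch that converts the byte count quoted from the paper into GB and GiB and compares it with what I downloaded. The variable names are my own, and I am assuming the 43.8 GB figure is in decimal gigabytes.

```python
# Byte count for English as quoted from the paper.
PAPER_BYTES = 484_953_009_124
# Size of what I downloaded; assumed to be decimal GB (10**9 bytes).
DOWNLOADED_GB = 43.8

paper_gb = PAPER_BYTES / 10**9   # decimal gigabytes
paper_gib = PAPER_BYTES / 2**30  # binary gibibytes
ratio = DOWNLOADED_GB / paper_gb # fraction of the paper's figure that I have

print(f"Paper: {paper_gb:.1f} GB ({paper_gib:.1f} GiB)")
print(f"Downloaded: {DOWNLOADED_GB} GB, i.e. {ratio:.1%} of the paper's figure")
```

So the gap is roughly a factor of eleven, which is too large to be a GB versus GiB mix-up.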